
Quantitative Data Analysis for Language Assessment Volume II

Quantitative Data Analysis for Language Assessment Volume II: Advanced Methods demonstrates advanced quantitative techniques for language assessment. The volume takes an interdisciplinary approach and taps into expertise from language assessment, data mining, and psychometrics. The techniques covered include Structural Equation Modeling, Data Mining, Multidimensional Psychometrics and Multilevel Data Analysis. Volume II is distinct among available books in language assessment, as it engages the readers in both theory and application of the methods and introduces relevant techniques for theory construction and validation. This book is highly recommended to graduate students and researchers who are searching for innovative and rigorous approaches and methods to achieve excellence in their dissertations and research. It is also a valuable source for academics who teach quantitative approaches in language assessment and data analysis courses.

Vahid Aryadoust is assistant professor of language assessment literacy at the National Institute of Education of Nanyang Technological University, Singapore. He has led a number of language assessment research projects funded by, for example, the Ministry of Education (Singapore), Michigan Language Assessment (USA), Pearson Education (UK), and Paragon Testing Enterprises (Canada) and has published his research in Language Testing, Language Assessment Quarterly, Assessing Writing, Educational Assessment, Educational Psychology, and Computer Assisted Language Learning. He has also (co)authored a number of book chapters and books that have been published by Routledge, Cambridge University Press, Springer, Cambridge Scholar Publishing, Wiley Blackwell, and so on. He is a member of the advisory board of multiple international journals including Language Testing (Sage), Language Assessment Quarterly (Taylor & Francis), Educational Assessment (Taylor & Francis), Educational Psychology (Taylor & Francis), and Asia Pacific Journal of Education (Taylor & Francis). In addition, he has been awarded the Intercontinental Academia Fellowship (2018–2019), which is an advanced research program launched by the University-Based Institutes for Advanced Studies. Vahid’s areas of interest include theory-building and quantitative data analysis in language assessment, neuroimaging in language comprehension, and eye tracking research.

Michelle Raquel is a senior lecturer at the Centre of Applied English Studies, University of Hong Kong, where she teaches language testing and assessment to postgraduate students. She has worked in several tertiary institutions in Hong Kong as an assessment developer and has either led or been part of a group that designed and administered large-scale diagnostic and language proficiency assessments such as Hong Kong Institute of Education’s Tertiary English Language Test (TELT), Hong Kong University of Science and Technology’s English Language Proficiency Assessment (ELPA), and Diagnostic English Language Tracking Assessment (DELTA), a government-funded inter-institutional project tasked to develop a computer-based academic English diagnostic test. She specializes in data analysis, specifically Rasch measurement, and has published several articles in international journals on this topic as well as on academic English diagnostic assessment, English as a second language (ESL) testing of reading and writing, dynamic assessment of second language dramatic skills, and English for specific purposes (ESP) testing.

Quantitative Data Analysis for Language Assessment Volume II Advanced Methods

Edited by Vahid Aryadoust and Michelle Raquel

First published 2020 by Routledge, 2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge, 52 Vanderbilt Avenue, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2020 selection and editorial matter, Vahid Aryadoust and Michelle Raquel; individual chapters, the contributors

The right of Vahid Aryadoust and Michelle Raquel to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging in Publication Data
A catalog record for this book has been requested

ISBN: 978-1-138-73314-5 (hbk)
ISBN: 978-1-315-18780-8 (ebk)

Typeset in Galliard by Cenveo® Publisher Services

Visit the eResources: https://www.routledge.com/9781138733145

Table of contents

List of figures
List of tables
List of volume II contributors
Preface
Introduction
VAHID ARYADOUST AND MICHELLE RAQUEL

SECTION I
Advanced item response theory (IRT) models in language assessment

1 Applying the mixed Rasch model in assessing reading comprehension
PURYA BAGHAEI, CHRISTOPH J. KEMPER, MONIQUE REICHERT, AND SAMUEL GREIFF

2 Multidimensional Rasch models in first language listening tests
CHRISTIAN SPODEN AND JENS FLEISCHER

3 The log-linear cognitive diagnosis modeling (LCDM) in second language listening assessment
TUĞBA ELIF TOPRAK, VAHID ARYADOUST, AND CHRISTINE GOH

4 Application of a hierarchical diagnostic classification model in assessing reading comprehension
HAMDOLLAH RAVAND

SECTION II
Advanced statistical methods in language assessment

5 Structural equation modeling to predict performance in English proficiency tests
XUELIAN ZHU, MICHELLE RAQUEL, AND VAHID ARYADOUST

6 Student growth percentiles in the formative assessment of English language proficiency
HUSEIN TAHERBHAI AND DAERYONG SEO

7 Multilevel modeling to examine sources of variability in second language test scores
YO IN’NAMI AND KHALED BARKAOUI

8 Longitudinal multilevel modeling to examine changes in second language test scores
KHALED BARKAOUI AND YO IN’NAMI

SECTION III
Nature-inspired data-mining methods in language assessment

9 Classification and regression trees in predicting listening item difficulty
VAHID ARYADOUST AND CHRISTINE GOH

10 Evolutionary algorithm-based symbolic regression to determine the relationship of reading and lexicogrammatical knowledge
VAHID ARYADOUST

Index

Figures

1.1 Item parameter profiles of latent classes (Class 1 = 57.1%, Class 2 = 42.9%).
2.1 Item characteristic curves of two Rasch model items with lower (Item 1) and higher (Item 2) item difficulty, and individuals with lower (θ1) and higher (θ2) ability (figure adapted from Fleischer, Wirth, & Leutner, 2007).
2.2 Between- and within-item dimensionality models for two measurement dimensions.
3.1 ICBC for Item 34.
3.2 ICBC for Item 14.
3.3 ICBC for Item 17.
4.1 Some possible attribute hierarchies.
4.2 Hypothesized hierarchies.
5.1 A hypothesized SEM model including two latent variables and multiple observed variables.
5.2 A hypothesized model of factors predicting IELTS scores.
5.3 DELTA track.
5.4 A CFA model with the IELTS latent factor causing the variance in the observable variables (IELTSL: IELTS listening score, IELTSR: IELTS reading score, IELTSW: IELTS writing score, and IELTSS: IELTS speaking score) (n = 257).
5.5 An SEM model with observable DELTA data predicting the latent variable, IELTS (n = 257).
5.6 An SEM model with faculty predicting the latent variable, IELTS, and the observable DELTA variable (n = 257).
6.1 Student’s score estimates at predetermined percentiles conditioned on a prior score. (This figure is adapted from Seo, McGrane, & Taherbhai, 2015. While this student is functioning in the 30th percentile [dashed line], he/she needs to perform in the 80th percentile to meet proficiency [dash-dot-dot line].)
6.2 An example of smoothing the data set.
7.1 a–e Examples of nested data structures.
7.2 a–d Fixed and random coefficients with intercepts and slopes at Level 2.
7.3 Schematic graph of the nested structure of the current data.
9.1 A hypothetical classification model for decision-making in a biased language assessment.
9.2 Train-test cross-validation (top) versus k-fold cross-validation (bottom).
9.3 CART model for the data (based on Gini method of tree-growing).
10.1 A two-plate model depicting the relationship between reading and lexicogrammatical knowledge.

Tables

1.1 Model fit statistics for the estimated models
1.2 Item parameters and fit indices in the two latent classes
1.3 Class size and classification accuracy for the latent classes
1.4 Mean ability estimates and reliability
1.5 Class comparison
2.1 Summary of skill sets, languages, and models investigated in Analysis 1 and Analysis 2
2.2 Parameter estimates and global fit statistics for three Rasch measurement models applied to L1 listening and reading comprehension test data
2.3 Parameter estimates and global fit statistics for three Rasch measurement models applied to L1 and L2 listening comprehension test data
3.1 Model modifications and fit indices for the MET listening test
3.2 The MET listening test Q-matrix and LCDM item parameter estimates
3.3 Probabilities of a correct response for nonmasters and masters on all the MET items
3.4 Tetrachoric correlations among attributes
3.5 Attribute mastery classes and their respective attribute mastery profiles
4.1 DCM categorization
4.2 DCM studies of language assessment
4.3 The final Q-matrix
4.4 Absolute fit indices
4.5 Relative fit indices
4.6 Attribute profile prevalence
5.1 Language assessment SEM research reporting guidelines (based on Ockey and Choi, 2015, and In’nami and Kiozumi, 2011a)
5.2 Fit indices of the hypothesized models and the data
5.3 Standardized and unstandardized estimates of SEM model 3 (see Figure 5.6)
6.1 Summary statistics of five years’ ELP assessment: Total and modality score
6.2 Regression coefficient estimates of SGP model at the 50th percentile
6.3 Examples of students’ predicted scale scores across the specified percentiles
6.4 Comparing each student’s predicted growth percentiles across total and modality scores
7.1 Sample items from the vocabulary size and depth tests
7.2 Descriptive statistics
7.3 Multilevel model results
8.1 Descriptive statistics for number of times test was taken (N = 1,000)
8.2 Descriptive statistics for PTE total scores by test occasion
8.3 Models 1–4 for PTE total scores
8.4 Models 5–7 for PTE total scores
9.1 Demographic information of the listening tests and test takers
9.2 Independent variables in the CART model
9.3 Distribution of data in test-train cross-validation CART analysis
9.4 Classification accuracy, specificity, sensitivity, and ROC
9.5 CART-estimated variable importance indices
9.6 Sample IF-THEN rules generated by the CART model
10.1 Competing models with their R, R2, MSE, and MAE indices
10.2 Sensitivity analysis of the variables in the models

Volume II contributors

Purya Baghaei is an associate professor in the English Department, Islamic Azad University, Mashhad Branch, Mashhad, Iran. He holds a PhD in applied linguistics from Alpen-Adria Universität, Klagenfurt, Austria. He is a scholar of the Alexander von Humboldt Foundation in Germany and has conducted post-doctoral research at universities in Vienna, Berlin, Jena, and Bamberg. His major research interest is in foreign language proficiency testing with a focus on the applications of item response theory models in test validation and scaling. He has published numerous articles on language testing and cognitive components of second language acquisition in international journals.

Khaled Barkaoui is an associate professor at the Faculty of Education, York University, Canada. His current research focuses on second-language (L2) assessment, L2 writing, L2 program evaluation, longitudinal and mixed-methods research, and English for academic purposes (EAP). His publications have appeared in Annual Review of Applied Linguistics, Applied Linguistics, Assessing Writing, Language Testing, Language Assessment Quarterly, System, and TESOL Quarterly.

Jens Fleischer is a post-doc researcher at the Department of Instructional Psychology of the University of Duisburg-Essen (Germany). His research interests include the assessment of competencies in secondary and higher education and research on cognitive and motivational predictors of academic learning. Most recently, he has worked on the assessment and modeling of cross-curricular problem-solving competence and was a member of the coordination team of the interdisciplinary priority research program “Models of Competence.” Currently, he is working in a research group investigating factors that influence academic learning and academic success in the entry phase of science and technology study programs.

Christine Goh is professor of linguistics and language education at the National Institute of Education, Nanyang Technological University, Singapore. Her areas of interest and expertise are cognitive and metacognitive processes in L2 listening and speaking, teaching and assessment of L2 listening and speaking, and teacher cognition. She publishes extensively in these areas in books, book chapters, and journal articles, and her work has been widely cited internationally.

Samuel Greiff is research group leader, principal investigator, and ATTRACT fellow at the University of Luxembourg. He holds a PhD in cognitive and experimental psychology from the University of Heidelberg, Germany. He has been awarded national and international research funds by diverse organizations, is currently a fellow in the Luxembourg research programme of excellency, and has published articles in national and international scientific journals and books. He has an extensive record of conference contributions and invited talks and serves as editor for several journals, such as the European Journal of Psychological Assessment, the Journal of Educational Psychology, and Thinking Skills and Creativity.

Yo In’nami is a professor of English at Chuo University, Japan. He is also a PhD candidate’s adviser and an external PhD examiner at Temple University Japan Campus. He has taught various undergraduate- and graduate-level courses in language testing, second language acquisition, and statistics for language research. He currently is interested in meta-analytic inquiry into the variability of effects and the longitudinal measurement of change in language proficiency. His publications have appeared in International Journal of Testing, Language Assessment Quarterly, Language Learning, Language Testing, System, and TESOL Quarterly. His website is https://sites.google.com/site/yoinnami

Christoph J. Kemper worked at different universities and research institutes in Germany and abroad. He currently holds a position as Professor for Differential Psychology and Assessment at the HSD University of Applied Sciences in Cologne, Germany. He was recently appointed head of the Center for Psychological Assessment of HSD University. He teaches psychological assessment, differential psychology, and research methods in bachelor’s and master’s programs (psychology). His research on individual differences and their assessment such as personality, motives, and anxiety, as well as different aspects of assessment/survey methodology, is widely published in more than 50 research papers, including 25 peer-reviewed publications. He is also first and coauthor of many assessment instruments.

Hamdollah Ravand received a PhD in English language teaching from the University of Isfahan, Iran, in 2013. He joined the English Department of Vali-e-Asr University of Rafsanjan where he is teaching master’s courses on research methods in Teaching English as a Foreign Language (TEFL), language testing, statistics and computers, and advanced writing. Hamdollah has been a visiting researcher to Jena University and the Institute for Educational Quality Improvement (IQB), Germany in 2012 and 2016, respectively. His major research interests are applications of diagnostic classification models, structural equation models, item response theory models, and multilevel models to second language data.

Monique Reichert is a psychologist specializing in language assessment, cognitive science issues, and empirical methods. Since 2004, she has worked at the Luxembourg Centre for Educational Testing (LUCET; formerly: EMACS) of the University of Luxembourg. As a head of the domain of language test development, her main areas of work lie in the development and implementation of quality assurance procedures in language test development. She took part in different European projects relating to language evaluation, such as in the EBAFLS, or the CEF-ESTIM project (http://cefestim.ecml.at/), and she continuously develops and conducts training in language test development. She received her PhD for her work on the validity of C-tests.

Daeryong Seo is a senior research scientist at Pearson. He has led various state assessments and brings international psychometric experience through his work with the Australian NAPLAN and Global Scale of English. He has published several studies in international journals and presented numerous psychometric issues at international conferences, such as for the American Educational Research Association (AERA). He also served as a program chair of the Rasch special interest group, AERA. In 2013, he and Dr. Husein Taherbhai received an outstanding paper award from the California Educational Research Association. Their paper is titled, “What Makes High School Asian English Learners Tick?”

Christian Spoden is research methods consultant at the German Institute for Adult Education—Leibniz Centre for Lifelong Learning, Bonn (Germany). His research focuses on item response theory modeling, the assessment of competencies in secondary and higher education, and large-scale assessment methods for educational sciences practice. Previously, he was engaged in the implementation of statewide standardized scholastic achievement tests for German and English language assessment and mathematical competencies assessment. Dr. Spoden also worked on the topic of quality of physics instruction and, recently, on psychometric methods for computerized adaptive testing.

Husein Taherbhai is a retired principal research scientist who led large-scale assessments in the United States, such as in Arizona, Washington, New York, Maryland, Virginia, and Tennessee, and for the National Physical Therapists’ Association’s licensure examination. Internationally, Dr. Taherbhai led the Educational Quality and Assessment Office in Ontario, Canada, and worked for the Central Board of Secondary Education’s Assessment in India. He has published in various scientific journals and has reviewed and presented at the NCME, AERA, and Florida State conferences, with papers relating to language learners, rater effects, and students’ equity and growth in education.

Tuğba Elif Toprak is an assistant professor of applied linguistics/ELT at Izmir Bakircay University, Izmir, Turkey. Her primary research interests are implementing cognitive diagnostic assessment by using contemporary item response theory models (named cognitive diagnostic psychometric models) and blending cognition with language assessment. Dr. Toprak has been collaborating with international researchers on several projects that are largely situated in the fields of language assessment, psycholinguistics, and the learning sciences. Her future plan of study includes focusing on intelligent real-time assessment systems by combining the techniques from several areas such as the learning sciences, cognitive science, and psychometrics.

Xuelian Zhu holds a master of arts in applied linguistics from Nanyang Technological University, Singapore, with special focus on language assessment and psychometric analysis, such as Rasch measurement and structural equation modeling. She is a lecturer in English-Chinese translation and interpretation at Sichuan International Studies University in China, where she oversees the designing of the listening component of the in-house placement test for English majors. She is also a member of the panel of test designers for a national placement test for the Foreign Language Teaching and Research Press (FLTRP). Xuelian’s current research focuses on eye tracking and on the design and validation of language assessments for academic purposes.

Preface Vahid Aryadoust and Michelle Raquel

The two-volume book, Quantitative Data Analysis for Language Assessment (Fundamental Techniques and Advanced Methods), together with the Companion website, was motivated by the growing need for a comprehensive sourcebook of quantitative data analysis for the community of language assessment. As the focus on developing valid and useful assessments continues to intensify in different parts of the world, having a robust and sound knowledge of quantitative methods has become an increasingly essential requirement. This is particularly important given that one of the community’s responsibilities is to develop language assessments that have evidence of validity, fairness, and reliability. We believe this would be achieved primarily by leveraging quantitative data analysis in test development and validation efforts. It has been the contributors’ intention to write the chapters with an eye toward what professors, graduate students, and test development companies would need. The chapters progress gradually from fundamental concepts to advanced topics, making the volumes suitable reference books for professors who teach quantitative methods. If the content of the volumes is too heavy for teaching in one course, we would suggest professors consider using them across two semesters, or alternatively, choose any chapters that fit the focus and scope of their courses. For graduate students who have just embarked on their studies or are writing dissertations or theses, the two volumes would serve as a cogent and accessible introduction to the methods that are often used in assessment development and validation research. For organizations in the test development business, the volumes provide unique topic coverage and examples of applications of the methods in small- and large-scale language tests that such organizations often deal with. We would like to thank all of the authors who contributed their expertise in language assessment and quantitative methods. This collaboration has allowed us to emphasize the growing interdisciplinarity in language assessment that draws knowledge and information from many different fields. We wish to acknowledge that in addition to editorial reviews, each chapter has been subjected to rigorous

double-blind peer review. We extend a special note of thanks to a number of colleagues who helped us during the review process:

Beth Ann O’Brien, National Institute of Education, Singapore
Christian Spoden, The German Institute for Adult Education, Leibniz Centre for Lifelong Learning, Germany
Guangwei Hu, Hong Kong Polytechnic University, Hong Kong
Hamdollah Ravand, Vali-e-Asr University of Rafsanjan, Iran
Ikkyu Choi, Educational Testing Service, USA
Kirby Grabowski, Teachers College, Columbia University, USA
Mehdi Riazi, Macquarie University, Australia
Moritz Heene, Ludwig-Maximilians-Universität München, Germany
Purya Baghaei, Islamic Azad University of Mashad, Iran
Shane Phillipson, Monash University, Australia
Shangchao Min, Zhejiang University, China
Thomas Eckes, Gesellschaft für Akademische Studienvorbereitung und Testentwicklung e. V., c/o TestDaF-Institut, Ruhr-Universität Bochum, Germany
Trevor Bond, James Cook University, Australia
Tuğba Elif Toprak, Izmir Bakircay University, Turkey
Wenshu Luo, National Institute of Education, Singapore
Yan Zi, The Education University of Hong Kong, Hong Kong
Yasuyo Sawaki, Waseda University, Japan
Yo In’nami, Chuo University, Japan
Zhang Jie, Shanghai University of Finance and Economics, China

We hope that the readers will find the volumes useful in their research and pedagogy.

Vahid Aryadoust and Michelle Raquel
Editors
July 2019

Introduction Vahid Aryadoust and Michelle Raquel

Quantitative techniques are mainstream components in most of the published literature in language assessment as they are essential in test development and validation research (Chapelle, Enright, & Jamieson, 2008). There are three families of quantitative methods adopted in language assessment research: measurement models, statistical methods, and data mining (although admittedly, setting a definite boundary in this classification of methods would not be feasible). Borsboom (2005) proposes that measurement models, the first family of quantitative methods in language assessment, fall in the paradigm of classical test theory (CTT), Rasch measurement, or item response theory (IRT). The common feature of the three measurement techniques is that they are intended to predict outcomes of cognitive, educational, and psychological testing. However, they do have significant differences in their underlying assumptions and applications. CTT is founded on true scores that can be estimated by using the error of measurement and observed scores. Internal consistency reliability and generalizability theory are also formulated based on CTT premises. Rasch measurement and IRT, on the other hand, are probabilistic models that are used for the measurement of latent variables—attributes that are not directly observed. There are a number of unidimensional Rasch and IRT models, which assume the attribute underlying test performance is comprised of only one measurable feature. There are also multidimensional models, which postulate that latent variables measured by tests are many and multidivisible. Determining whether a test is unidimensional or multidimensional requires theoretical grounding, the application of sophisticated quantitative methods, and an evaluation of the test context. For example, multidimensional tests can be used to provide fine-grained diagnostic information to stakeholders; thus a multidimensional IRT model can be used to derive useful diagnostic information from test scores. In these two volumes, CTT and unidimensional Rasch models are discussed in Volume I and multidimensional techniques are covered in Volume II. The second group of methods is statistical and consists of the commonly used methods in language assessment such as t-tests, analysis of variance (ANOVA), analysis of covariance (ANCOVA), multivariate analysis of covariance (MANCOVA), regression models, and factor analysis, which are covered in Volume I. In addition, multilevel modeling and structural equation modeling (SEM)

are presented in Volume II. The research questions that these techniques aim to address range from comparing average performances of test takers to prediction and data reduction. The third group of models falls under the umbrella of data-mining techniques, which we believe are relatively under-researched and underutilized in language assessment. Volume II presents two data-mining methods: classification and regression trees (CART) and evolutionary algorithm-based symbolic regression, both of which are used for prediction and classification. These methods detect the relationship between dependent and independent variables in the form of mathematical functions and confirm the relationships across separate data sets. This feature of the two data-mining techniques, discussed in Volume II, improves the precision and generalizability of the detected relationships. We provide an overview of the two volumes in the next sections.
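As a brief illustration of the distinction drawn above between CTT and the Rasch/IRT family, the two approaches can be written in standard notation (not the notation of any particular chapter) as

X = T + E

P(X_{vi} = 1 \mid \theta_v, b_i) = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)}

where X is an observed score, T the true score, and E the error of measurement, and where \theta_v is the latent ability of test taker v and b_i the difficulty of item i. Multidimensional models replace the scalar \theta_v with a vector of abilities, one per postulated dimension.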

Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques

This volume comprises 11 chapters contributed by experts in the field of language assessment and quantitative data analysis techniques. The aim of the volume is to revisit the fundamental quantitative topics that have been used in the language assessment literature and shed light on their rationales and assumptions. This is achieved through delineating the technique covered in each chapter, providing a (brief) review of its application in previous language assessment research, and giving a theory-driven example of the application of the technique. The chapters in Volume I are grouped into three main sections, which are discussed as follows.

Section I. Test development, reliability, and generalizability

Chapter 1: Item analysis in language assessment (Rita Green)

This chapter deals with a fundamental but, as Rita Green notes, often-delayed step in language test development. Item analysis is a quantitative method that allows test developers to examine the quality of test items, i.e., which test items are working well (constructed to assess the construct they are meant to assess) and which items should be revised or dropped to improve overall test reliability. Unfortunately, as the author notes, this step is commonly done after a test has been administered and not when items have just been developed. The chapter starts with an explanation of the importance of this method at the test development stage. Then, several language testing studies, which have utilized this method to investigate test validity and reliability, to improve standard setting sessions, and to investigate the impact of test format and different testing conditions on test taker performance, are reviewed. The author further emphasizes the need for language testing professionals to

learn this method and its link to language assessment research by suggesting five research questions in item analysis. The use of this method is demonstrated by an analysis of a multiple-choice grammar and vocabulary test. The author concludes the chapter by demonstrating how the analysis can answer the five research questions proposed, as well as offering suggestions on how to improve the test.
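As a minimal, hypothetical illustration of the kind of item statistics discussed in this chapter (not Green's own data or procedure), the following Python sketch computes item facility and a corrected item-total discrimination index for a simulated set of dichotomously scored responses:

import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
# Simulated responses: 200 test takers by 20 dichotomously scored items
responses = pd.DataFrame(rng.integers(0, 2, size=(200, 20)),
                         columns=[f"item{i + 1}" for i in range(20)])

facility = responses.mean()          # proportion of correct responses per item
total = responses.sum(axis=1)        # each test taker's total score
# Corrected point-biserial: correlate each item with the total excluding that item
discrimination = {c: np.corrcoef(responses[c], total - responses[c])[0, 1]
                  for c in responses.columns}

print(facility.round(2))
print(pd.Series(discrimination).round(2))

With real data, very easy or very hard items (extreme facility) and items with near-zero or negative discrimination would be the candidates for revision that the chapter describes.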

Chapter 2: Univariate generalizability theory in language assessment (Yasuyo Sawaki and Xiaoming Xi)

In addition to item analysis, investigating reliability and generalizability is a fundamental consideration of test development. Chapter 2 presents and extends the framework to investigate reliability within the paradigm of classical test theory (CTT). Generalizability theory (G theory) is a powerful method of investigating the extent to which scores are reliable as it is able to account for different sources of variability and their interactions in one analysis. The chapter provides an overview of the key concepts in this method, outlines the steps in the analyses, and presents an important caveat in the application of this method, i.e., conceptualization of an appropriate rating design that fits the context. A sample study demonstrating the use of this method is presented to investigate the dependability of ratings given on an English as a foreign language (EFL) summary writing task. The authors compared the results of two G theory analyses, the rating method and the block method, to demonstrate to readers the impact of rating design on the results of the analysis. The chapter concludes with a discussion of the strengths of the analysis compared to other CTT-based reliability indices, the value of this method in investigating rater behavior, and suggested references should readers wish to extend their knowledge of this technique.
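For orientation, the key coefficients for a simple one-facet person-by-rater (p × r) design can be sketched in standard G theory notation (the chapter itself works with a more elaborate rating design):

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n'_r}
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + (\sigma^2_r + \sigma^2_{pr,e})/n'_r}

where \sigma^2_p is the person variance component, \sigma^2_r the rater component, \sigma^2_{pr,e} the person-by-rater interaction confounded with error, and n'_r the number of raters assumed in the decision study; E\rho^2 applies to relative decisions and \Phi, the index of dependability, to absolute decisions.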

Chapter 3: Multivariate generalizability theory in language assessment (Kirby C. Grabowski and Rongchan Lin)

In performance assessments, multiple factors contribute to generate a test taker’s overall score, such as task type, the rating scale structure, and the rater, meaning that scores are influenced by multiple sources of variance. Although univariate G theory analysis is able to determine the reliability of scores, it is limited in that it does not consider the impact of these sources of variance simultaneously. Multivariate G theory analysis is a powerful statistical technique because, in addition to results generated by univariate G theory analysis, it is able to generate a reliability index accounting for all these factors in one analysis. The analysis is also able to consider the impact of subscales of a rating scale. The authors begin the chapter with an overview of the basic concepts of multivariate G theory. Next, they illustrate an application of this method through an analysis of a listening-speaking test where they make clear links between research questions and the results of the analysis. The chapter concludes with caveats in the

use of this method and suggested references for readers who wish to complement their MG theory analyses with other methods.

Section II. Unidimensional Rasch measurement

Chapter 4: Applying Rasch measurement in language assessment: Unidimensionality and local independence (Jason Fan and Trevor Bond)

This chapter discusses the two fundamental concepts required in the application of Rasch measurement in language assessment research, i.e., unidimensionality and local independence. It provides an accessible discussion of these concepts in the context of language assessment. The authors first explain how the two concepts should be perceived from a measurement perspective. This is followed by a brief explanation of the Rasch model, a description of how these two measurement properties are investigated through Rasch residuals, and a review of Rasch-based studies in language assessment that report the existence of these properties to strengthen test validity claims. The authors demonstrate the investigation of these properties through the analysis of items in a listening test using the Partial Credit Rasch model. The results of the study revealed that the listening test is unidimensional and that the principal component analysis of residuals provides evidence of local independence of items. The chapter concludes with a discussion of the practical considerations and suggestions on steps to take should test developers encounter situations where these properties of measurement are violated.

Chapter 5: The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research (Michelle Raquel)

This chapter continues the discussion of test measurement properties. Differential item functioning (DIF) is the statistical term used to describe items that inadvertently have different item estimates for different subgroups because they are affected by characteristics of the test takers such as gender, age group, or ethnicity. The author first explains the concept of DIF and then provides a brief overview of different DIF detection methods used in language assessment research. A review of DIF studies in language testing follows that includes a summary of current DIF studies, the DIF method(s) used, and whether or not the studies investigated the causes of DIF. The chapter then illustrates one of the most commonly used DIF detection methods, the Rasch-based DIF analysis method. The sample study investigates the presence of DIF in a diagnostic English listening test where students were classified according to the English language curriculum they have taken, Hong Kong versus Macau. The results of the study revealed that although there were a significant number of items flagged for DIF, overall test results did not seem to be affected.
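To give a concrete sense of the Rasch-based approach, a common generic formulation (not specific to this chapter's software or data) estimates the difficulty of item i separately in the two groups and tests the contrast:

\text{DIF}_i = \hat{b}_{i,\text{focal}} - \hat{b}_{i,\text{reference}}
\qquad
t_i = \frac{\text{DIF}_i}{\sqrt{SE^2_{i,\text{focal}} + SE^2_{i,\text{reference}}}}

A frequently cited rule of thumb flags items whose absolute contrast exceeds roughly 0.5 logits with a statistically significant t_i, although cut-offs vary across studies and software.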


Chapter 6: Application of the rating scale model and the partial credit model in language assessment research (Ikkyu Choi)

This chapter introduces two Rasch models that are used to analyze polytomous data usually generated by performance assessments (speaking or writing tests) and questionnaires used in language assessment studies. First, Ikkyu Choi explains the relationship of the rating scale model (RSM) and the partial credit model (PCM) through a gentle review of their algebraic representations. This is followed by a discussion of the differences of these models and a review of studies that have utilized this method. The author notes in his review that researchers rarely provide a rationale for the choice of model and neither do they compare models. In the sample study investigating the scale of a motivation questionnaire, the author provides a thorough and graphic comparison and evaluation of the RSM and the PCM and their impact on the scale structure of the questionnaire. The chapter concludes by providing justification as to why the PCM was more appropriate for the context, the limitations of the parameter estimation method used by the sample study, and a list of suggested topics to extend the reader’s knowledge of the topic.
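In standard notation (not reproduced from the chapter), the PCM gives the probability that test taker v scores x on item i with m_i + 1 ordered categories as

P(X_{vi} = x \mid \theta_v) = \frac{\exp\left[\sum_{j=0}^{x}(\theta_v - \delta_{ij})\right]}{\sum_{k=0}^{m_i}\exp\left[\sum_{j=0}^{k}(\theta_v - \delta_{ij})\right]}, \qquad \sum_{j=0}^{0}(\theta_v - \delta_{ij}) \equiv 0

where the \delta_{ij} are item-specific step (threshold) parameters. The RSM is the constrained case \delta_{ij} = \delta_i + \tau_j, in which all items share one set of thresholds \tau_j, so choosing between the two models amounts to deciding whether that constraint is tenable for the data.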

Chapter 7: Many-facet Rasch measurement: Implications for rater-mediated language assessment (Thomas Eckes)

This chapter discusses one of the most popular item-response theory (IRT)-based methods to analyze rater-mediated assessments. A common problem in speaking and writing tests is that the marks or grades are dependent on human raters who most likely have their own conceptions of how to mark despite their training, and this impacts test reliability. Many-facet Rasch measurement (MFRM) provides a solution to this problem in that the analysis simultaneously includes multiple facets such as raters, assessment criteria, test format, or the time when a test is taken. The author first provides an overview of rater-mediated assessments and MFRM concepts. The application of this method is illustrated through an analysis of a writing assessment where the author demonstrates how to determine rater severity and consistency of ratings, and how to generate test scores after adjusting for differences in ratings. The chapter concludes with a discussion on advances in MFRM research and controversial issues related to this method.
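In its commonly cited generic form (standard notation, not the chapter's worked example), a three-facet model for a rating in category k of a rating scale can be written as

\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the ability of examinee n, \delta_i the difficulty of task or criterion i, \alpha_j the severity of rater j, and \tau_k the threshold between categories k-1 and k; further facets (for example, test form or occasion) enter the model as additional additive terms.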

Section III. Univariate and multivariate statistical analysis

Chapter 8: Analysis of differences between groups: The t-test and the analysis of variance (ANOVA) in language assessment (Tuğba Elif Toprak)

The third section of this volume starts with a discussion of two of the most fundamental and commonly used statistical techniques for comparing test score results and determining whether differences between the groups are due to

chance. For example, language testers often find themselves trying to compare two or multiple groups of test takers or to compare pre-test and post-test scores. The chapter starts with an overview of t-tests and the analysis of variance (ANOVA) and the assumptions that must be met before embarking on these analyses. The literature review provides summary tables of recent studies that have employed each method. The application of the t-test is demonstrated through a sample study that investigated the impact of English songs on students’ pronunciation development where the author divided the students into two groups (experimental versus control group) and then compared the groups’ results on a pronunciation test. The second study utilized ANOVA to determine if students’ academic reading proficiency differed across college years (freshmen, sophomores, juniors, and seniors) and which group was significantly different from the others.
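A minimal Python sketch of the two analyses, using simulated scores rather than the chapter's data (the group sizes, values, and variable names are invented for illustration):

from scipy import stats

# Hypothetical pronunciation scores for a control and an experimental group
control = [61, 64, 58, 70, 66, 63, 59, 68]
experimental = [72, 75, 69, 80, 77, 74, 71, 78]
t, p = stats.ttest_ind(experimental, control)   # independent-samples t-test
print(f"t = {t:.2f}, p = {p:.3f}")

# Hypothetical academic reading scores for four college years (one-way ANOVA)
freshmen = [52, 55, 49, 58, 54]
sophomores = [57, 60, 55, 62, 59]
juniors = [63, 61, 66, 64, 60]
seniors = [68, 70, 65, 72, 69]
F, p = stats.f_oneway(freshmen, sophomores, juniors, seniors)
print(f"F = {F:.2f}, p = {p:.3f}")

In practice, post hoc comparisons (for example, Tukey's HSD) would follow a significant ANOVA to identify which group differs, as the chapter's second sample study does.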

Chapter 9: Application of ANCOVA and MANCOVA in language assessment research (Zhi Li and Michelle Y. Chen)

This chapter extends the discussion of methods used to compare test results. Instead of using one variable to classify groups that are compared, analysis of covariance (ANCOVA) and multivariate analysis of covariance (MANCOVA) consider multiple variables of multiple groups to determine whether or not differences in group scores are statistically significant. ANCOVA is used when there is only one dependent variable, while MANCOVA is used when two or more dependent variables are included in the comparison. Both techniques control for the effect of one or more variables that covary with the dependent variables. The chapter begins with a brief discussion of these two methods, the situations in which they should be used, the assumptions that must be fulfilled before analysis can begin, and a brief discussion of how results should be reported. The authors present the results of their meta-analyses of studies that have utilized these methods and outline the issues related to results reported in these studies. The application of these methods is demonstrated in the analyses of the Programme for International Student Assessment (PISA 2009) reading test results of Canadian children.
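The following sketch, with simulated data and invented variable names, shows one common way to run an ANCOVA and a MANCOVA in Python with statsmodels; it illustrates the general workflow only, not the authors' analysis of the PISA data:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({"group": np.repeat(["A", "B"], n // 2),
                   "pretest": rng.normal(50, 10, n)})
df["reading"] = 0.6 * df["pretest"] + (df["group"] == "B") * 5 + rng.normal(0, 5, n)
df["listening"] = 0.5 * df["pretest"] + (df["group"] == "B") * 3 + rng.normal(0, 5, n)

# ANCOVA: one dependent variable, with the pretest as the covariate
ancova = smf.ols("reading ~ pretest + C(group)", data=df).fit()
print(anova_lm(ancova, typ=2))

# MANCOVA: two dependent variables analyzed jointly with the same covariate
print(MANOVA.from_formula("reading + listening ~ pretest + C(group)", data=df).mv_test())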

Chapter 10: Application of linear regression in language assessment (Daeryong Seo and Husein Taherbhai)

There are cases when language testers need to determine the impact of one variable on another variable, such as whether someone’s first language has an impact on their learning of a second language. Linear regression is the appropriate statistical technique to use when one aims to determine the extent to which one or more independent variables linearly impact a dependent variable. This chapter opens with a brief discussion of the differences between simple and multiple linear regression and a full discussion on the assumptions that must be fulfilled before commencing analysis. Next, the authors present a brief literature review

of factors that affect English language proficiency as these determine what variables should be included in the statistical model. The sample study illustrates the application of linear regression by predicting students’ results on an English language arts examination based on their performance in English proficiency tests of reading, listening, speaking, and writing. The chapter concludes with a checklist of concepts to consider before doing regression analysis.
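A minimal multiple linear regression along the lines described above might look as follows in Python; the variable names and simulated scores are placeholders, not the chapter's data set:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
scores = pd.DataFrame({m: rng.normal(500, 50, n)
                       for m in ["reading", "listening", "speaking", "writing"]})
# Simulated English language arts (ELA) scores driven mostly by reading and writing
scores["ela"] = (0.4 * scores["reading"] + 0.3 * scores["writing"]
                 + 0.1 * scores["listening"] + 0.1 * scores["speaking"]
                 + rng.normal(0, 20, n))

model = smf.ols("ela ~ reading + listening + speaking + writing", data=scores).fit()
print(model.summary())   # coefficients, R-squared, and diagnostics for assumption checks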

Chapter 11: Application of exploratory factor analysis in language assessment (Limei Zhang and Wenshu Luo)

A standard procedure in test and survey development is to check whether a test or questionnaire measures one underlying construct or dimension. Ideally, test and questionnaire items are constructed to measure a latent construct (e.g., 20 items to measure listening comprehension), but each item is designed to measure different aspects of the construct (e.g., items that measure the ability to listen for details, ability to listen for main ideas, etc.). Exploratory factor analysis (EFA) is a statistical technique that examines how items are grouped together into themes and how they ultimately measure the latent trait. The chapter commences with an overview of EFA, the different methods to extract the themes (factors) from the data, and an outline of steps in conducting an EFA. This is followed by a literature review that highlights the different ways the method has been applied in language testing research, with specific focus on studies that confirm the factor structure of tests and questionnaires. The sample study demonstrates how EFA can do this by analyzing the factor structure of the Reading Test Strategy Use Questionnaire used to determine the types of reading strategies that Chinese students utilize as they complete reading comprehension tests.
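The sketch below illustrates the basic EFA workflow on simulated questionnaire responses (12 items generated from two latent factors). It uses scikit-learn's FactorAnalysis for brevity, whereas dedicated EFA software additionally offers rotation methods and factor-retention diagnostics:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n_respondents, n_items = 300, 12
latent = rng.normal(size=(n_respondents, 2))                # two latent strategy factors
pattern = np.repeat(np.array([[1, 0], [0, 1]]), 6, axis=1)  # items 1-6 load on factor 1, 7-12 on factor 2
loadings = rng.uniform(0.4, 0.9, size=(2, n_items)) * pattern
X = latent @ loadings + rng.normal(scale=0.5, size=(n_respondents, n_items))

efa = FactorAnalysis(n_components=2).fit(X)
print(np.round(efa.components_.T, 2))   # estimated loading of each item on the two factors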

Quantitative Data Analysis for Language Assessment Volume II: Advanced Methods

Volume II comprises three major categories of quantitative methods in language testing research: advanced item response theory (IRT), advanced statistical methods, and nature-inspired data-mining methods. We provide an overview of the sections and chapters as follows.

Section I. Advanced item response theory (IRT) models in language assessment

Chapter 1: Applying the mixed Rasch model in assessing reading comprehension (Purya Baghaei, Christoph J. Kemper, Monique Reichert, and Samuel Greiff)

In this chapter, the authors discuss the application of the mixed Rasch model (MRM) in assessing reading comprehension. MRM is an advanced

psychometric approach for detecting latent class differential item functioning (DIF) that conflates the Rasch model and latent class analysis. MRM relaxes some of the requirements of conventional Rasch measurement while preserving most of the fundamental features of the method. MRM further combines the Rasch model with latent class modeling that classifies test takers into exclusive classes with qualitatively different features. Baghaei et al. apply the model to a high-stakes reading comprehension test in English as a foreign language and detect two latent classes of test takers for whom the difficulty level of the test items differs. They discuss the differentiating feature of the classes and conclude that MRM can be applied to identify sources of multidimensionality.

Chapter 2: Multidimensional Rasch models in first language listening tests (Christian Spoden and Jens Fleischer)

Since the introduction of Rasch measurement to language assessment, a group of scholars have contended that language is not a unidimensional phenomenon and, accordingly, unidimensional modeling of language assessment data (e.g., through the unidimensional Rasch model) would conceal the role of many linguistic features that are integral to language performance. The multidimensional Rasch model could be viewed as a response to these concerns. In this chapter, the authors provide a didactic presentation of the multidimensional Rasch model and apply it to a listening assessment. They discuss the advantages of adopting the model in language assessment research, specifically the improvement in the estimation of reliability as a result of the incorporation of dimension correlations, and explain how model comparison can be carried out, while elaborating on multidimensionality in listening comprehension assessments. They conclude the chapter with a brief summary of other multidimensional Rasch models and their value in language assessment research.
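In generic notation (a sketch, not the chapter's own parameterization), a between-item multidimensional Rasch model assigns each item to exactly one dimension d(i):

P(X_{vi} = 1) = \frac{\exp(\theta_{v,d(i)} - b_i)}{1 + \exp(\theta_{v,d(i)} - b_i)}

whereas in a within-item model an item may draw on several dimensions at once, so the exponent becomes \sum_{d} q_{id}\theta_{vd} - b_i, with q_{id} indicating which dimensions item i taps. The correlations among the \theta_{vd} are estimated along with the item parameters, which is what allows the gain in reliability mentioned above.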

Chapter 3: The log-linear cognitive diagnosis modeling (LCDM) in second language listening assessment (Tuğba Elif Toprak, Vahid Aryadoust, and Christine Goh)

Another group of multidimensional models, called cognitive diagnostic models (CDMs), combines psychometrics and psychology. One of the differences between CDMs and the multidimensional Rasch models is that the former family estimates subskill mastery of test takers, whereas the latter group provides a general estimation of ability for each subskill. In this chapter, the authors introduce the log-linear cognitive diagnosis modeling (LCDM), which is a flexible CDM technique for modeling assessment data. They apply the model to a high-stakes norm-referenced listening test (a practice that is known as retrofitting) to determine whether they can derive diagnostic information

concerning test takers’ weaknesses and strengths. Toprak et al. argue that although norm-referenced assessments do not usually provide such diagnostic information about the language abilities of test takers, providing it is practical because it helps language learners identify what to work on to improve their language skills. They provide guidelines on the estimation and fitting of the LCDM, which are also applicable to other CDM techniques.
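For orientation, the LCDM for an item measuring two binary attributes \alpha_1 and \alpha_2 can be written in generic notation as

\operatorname{logit}\left[P(X_i = 1 \mid \alpha_1, \alpha_2)\right] = \lambda_{i,0} + \lambda_{i,1(1)}\alpha_1 + \lambda_{i,1(2)}\alpha_2 + \lambda_{i,2(1,2)}\alpha_1\alpha_2

where \lambda_{i,0} is the intercept (the log-odds of success for a test taker who has mastered neither attribute), the \lambda_{i,1(\cdot)} terms are attribute main effects, and \lambda_{i,2(1,2)} is their interaction; constraining these parameters in different ways yields many familiar CDMs as special cases.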

Chapter 4: Application of a hierarchical diagnostic classification model in assessing reading comprehension (Hamdollah Ravand)

In this chapter, the author presents another group of CDM techniques including the deterministic input, noisy "and" gate (DINA) model and the generalized deterministic input, noisy "and" gate (G-DINA) model, which are attracting increasing attention in language assessment research. Ravand begins the chapter by providing step-by-step guidelines to model selection, development, and evaluation, elaborating on fit statistics and other relevant concepts in CDM analysis. Like Toprak et al., who presented the LCDM in Chapter 3, Ravand argues for retrofitting CDMs to norm-referenced language assessments and provides an illustrative example of the application of CDMs to a nondiagnostic high-stakes test of reading. He further explains how to use and interpret fit statistics (i.e., relative and absolute fit indices) to select the optimal model among the available CDMs.
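As a point of reference, the DINA model (again in generic notation rather than the chapter's) expresses the probability of a correct response as

P(X_{ij} = 1 \mid \boldsymbol{\alpha}_j) = (1 - s_i)^{\eta_{ij}}\, g_i^{\,1 - \eta_{ij}}, \qquad \eta_{ij} = \prod_{k} \alpha_{jk}^{\,q_{ik}}

where \eta_{ij} indicates whether test taker j has mastered all attributes that the Q-matrix entries q_{ik} require for item i, s_i is the slip parameter, and g_i the guessing parameter; the G-DINA model relaxes this all-or-nothing assumption by allowing separate effects for each required attribute and their interactions.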

Section II. Advanced statistical methods in language assessment

Chapter 5: Structural equation modeling to predict performance in English proficiency tests (Xuelian Zhu, Michelle Raquel, and Vahid Aryadoust)

This chapter discusses one of the most commonly used techniques in the field whose application in assessment research goes back to at least the 1990s. Instead of modeling a linear relationship of variables, structural equation modeling (SEM) is used to concurrently model direct and indirect relationships between variables. The authors first provide a review of SEM in language assessment research and propose a framework for model development, specification, and validation. They discuss the requirements of sample size, fit, and model re-specification and apply SEM to confirm the use of a diagnostic test in predicting the proficiency level of test takers as well as the possible mediating role for some demographic factors in the model tested. While SEM can be applied to both dichotomous and polytomous data, Zhu and Raquel focus on the latter group of data, while stressing that the principles and guidelines spelled out are directly applicable to dichotomous data. They further mention other applications of SEM such as multigroup modeling and SEM of dichotomous data.
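To show what a model of this general shape can look like in code, the sketch below uses the third-party semopy package (an assumption about tooling; the chapter does not prescribe software) with simulated scores and hypothetical variable names (DELTA for the diagnostic score and IELTSL/R/W/S for the four IELTS subscores):

import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(3)
n = 500
ability = rng.normal(size=n)
df = pd.DataFrame({
    "DELTA": ability + rng.normal(scale=0.7, size=n),
    "IELTSL": ability + rng.normal(scale=0.8, size=n),
    "IELTSR": ability + rng.normal(scale=0.8, size=n),
    "IELTSW": ability + rng.normal(scale=0.9, size=n),
    "IELTSS": ability + rng.normal(scale=0.9, size=n),
})

# First line: measurement model for the latent IELTS factor;
# second line: structural path from the diagnostic (DELTA) score to the latent factor.
desc = """
IELTS =~ IELTSL + IELTSR + IELTSW + IELTSS
IELTS ~ DELTA
"""
model = semopy.Model(desc)
model.fit(df)
print(model.inspect())           # parameter estimates
print(semopy.calc_stats(model))  # fit indices such as CFI and RMSEA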


Chapter 6: Student growth percentiles in the formative assessment of English language proficiency (Husein Taherbhai and Daeryong Seo)

This chapter presents a method for modeling growth in longitudinal data, the student growth percentile (SGP), which is estimated using the quantile regression method. A distinctive feature of SGP is that it compares test takers with those who had the same history of test performance and achievement. This means that even when the current test scores are the same for two test takers with different assessment histories, their actual SGP scores on the current test can be different. Another feature of SGP that differentiates it from similar techniques such as multilevel modeling (MLM) and latent growth curve models is that SGP does not require test equating, which, in itself, could be a time-consuming process. Oftentimes, researchers and language teachers wish to determine whether a particular test taker has a chance to achieve a predetermined cut score, but a quick glance at the available literature shows that the quantitative tools available do not provide such information. Seo and Taherbhai show that through the quantile regression method, one can estimate the propensity of test takers to achieve an SGP score required to reach the cut score. The technique lends itself to investigation of change in the four language modalities, i.e., reading, writing, listening, and speaking.
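The quantile-regression idea behind SGPs can be sketched as follows, with simulated prior and current scores and invented variable names (operational SGP models condition on several prior years and typically use spline terms):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1000
prior = rng.normal(500, 50, n)
current = 0.8 * prior + 120 + rng.normal(0, 30, n)
df = pd.DataFrame({"prior": prior, "current": current})

# Conditional percentiles of the current score, given the prior score
fits = {q: smf.quantreg("current ~ prior", df).fit(q=q) for q in (0.30, 0.50, 0.80)}
student = pd.DataFrame({"prior": [480.0]})
for q, fit in fits.items():
    predicted = float(fit.predict(student)[0])
    print(f"{int(q * 100)}th growth percentile: predicted current score = {predicted:.0f}")

Reading off where a student's actual current score falls among these conditional percentiles gives the student growth percentile; projecting the fitted curves forward is what supports statements about the growth needed to reach a cut score.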

Chapter 7: Multilevel modeling to examine sources of variability in second language test scores (Yo In’nami and Khaled Barkaoui)

Multilevel modeling (MLM) is based on the premise that test takers’ performance is a function of students’ measured abilities as well as another level of variation such as the classrooms, schools, or cities that the test takers come from. According to the authors, MLM is particularly useful when test takers are from prespecified homogeneous subgroups such as classrooms, which have different characteristics from test takers placed in other subgroups. The between-subgroup heterogeneity combined with the within-subgroup homogeneity yields a source of variance in data that, if ignored, can inflate chances of a Type I error (i.e., rejection of a true null hypothesis). The authors provide guidelines and advice on using MLM and showcase the application of the technique to a second language vocabulary test.
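A minimal two-level illustration (simulated test takers nested in classrooms; the variable names are invented) using statsmodels' MixedLM:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_classes, n_per_class = 20, 15
classroom = np.repeat(np.arange(n_classes), n_per_class)
class_effect = rng.normal(0, 5, n_classes)[classroom]     # between-classroom variation
vocab = rng.normal(50, 10, n_classes * n_per_class)
score = 20 + 0.6 * vocab + class_effect + rng.normal(0, 8, n_classes * n_per_class)
df = pd.DataFrame({"classroom": classroom, "vocab": vocab, "score": score})

# Random-intercept model: vocabulary effect at Level 1, classroom variance at Level 2
m = smf.mixedlm("score ~ vocab", df, groups=df["classroom"]).fit()
print(m.summary())

The variance of the classroom random intercept relative to the residual variance gives the intraclass correlation, the quantity that signals whether ignoring the nesting would risk the Type I error inflation described above.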

Chapter 8: Longitudinal multilevel modeling to examine changes in second language test scores (Khaled Barkaoui and Yo In’nami)

In this chapter, the authors propose that the flexibility of MLM renders it well suited for modeling growth and investigating the sensitivity of test scores to change over time. The authors argue that MLM is a hierarchical method that is an alternative to linear methods such as analysis of variance (ANOVA) and linear regression. They present an example of second language longitudinal data. They encourage MLM users to consider and control for the variability of test forms, which can confound assessments over time, to ensure test equity before using

test scores for MLM analysis and to maximize the validity of the uses and interpretations of the test scores.
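Extending the previous sketch to repeated measures, a simple linear growth model (test occasions nested within test takers, with random intercepts and slopes for time; all names and values simulated) could be specified as:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_persons, n_occasions = 200, 4
person = np.repeat(np.arange(n_persons), n_occasions)
time = np.tile(np.arange(n_occasions), n_persons)
intercepts = rng.normal(50, 8, n_persons)[person]
slopes = rng.normal(2.0, 0.8, n_persons)[person]          # individual growth rates
score = intercepts + slopes * time + rng.normal(0, 3, n_persons * n_occasions)
df = pd.DataFrame({"person": person, "time": time, "score": score})

growth = smf.mixedlm("score ~ time", df, groups=df["person"], re_formula="~time").fit()
print(growth.summary())   # average change per occasion and its between-person variability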

Section III. Nature-inspired data-mining methods in language assessment

Chapter 9: Classification and regression trees in predicting listening item difficulty (Vahid Aryadoust and Christine Goh)

The first data-mining method Aryadoust and Goh present in this section is a classification and regression tree (CART). CART is used in a way that is similar to how linear regression or classification techniques are used in prediction and classification research. CART, however, relaxes the normality and other assumptions that are necessary for parametric models such as regression analysis. Aryadoust and Goh review the literature on the application of CART in language assessment and propose a multistage framework for CART modeling that starts with establishing theoretical frameworks and ends with cross-validation. The authors apply CART to 321 listening test items and generate a number of IF-THEN rules that link item difficulty to the linguistic features of the items in a nonlinear way. The chapter also stresses the role of cross-validation in CART modeling and the features of two cross-validation methods (n-fold cross-validation and train-test cross-validation).
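A compact scikit-learn sketch of the workflow, with simulated item features standing in for the linguistic variables (the chapter's actual predictors and settings differ):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(7)
n_items = 321
X = np.column_stack([
    rng.normal(150, 30, n_items),   # e.g., number of words in the stimulus
    rng.normal(7, 2, n_items),      # e.g., mean sentence length
    rng.uniform(0, 1, n_items),     # e.g., proportion of low-frequency words
])
difficulty = 0.5 * X[:, 2] - 0.002 * X[:, 0] + rng.normal(0, 0.1, n_items)

X_train, X_test, y_train, y_test = train_test_split(X, difficulty, random_state=0)
tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
print(tree.score(X_test, y_test))                   # train-test cross-validation
print(cross_val_score(tree, X, difficulty, cv=10))  # 10-fold (n-fold) cross-validation

The fitted tree's splits read directly as IF-THEN rules of the kind reported in the chapter (for example, IF the proportion of low-frequency words exceeds a threshold THEN predicted difficulty rises).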

Chapter 10: Evolutionary algorithm-based symbolic regression to determine the relationship of reading and lexicogrammatical knowledge (Vahid Aryadoust)

Aryadoust introduces the evolutionary algorithm-based (EA-based) symbolic regression method and showcases its application in reading assessment. Like CART, EA-based symbolic regression is a nonlinear data analysis method that comprises a training and a cross-validation stage. The technique is inspired by the principles of Darwinian evolution. Accordingly, concepts such as survival of the fittest, offspring, breeding, chromosomes, and cross-over are incorporated into the mathematical modeling procedures. The nonparametric nature and cross-validation capabilities of EA-based symbolic regression render it a powerful classification and prediction model in language assessment. Aryadoust presents a prediction study where he adopts lexicogrammatical abilities as independent variables to predict the reading ability of English learners. He compares the predictive power of the method with that of a linear regression model and shows that the technique renders more precise solutions.
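For readers who want to experiment, the third-party gplearn package implements this style of evolutionary symbolic regression in Python (an assumption about tooling; the chapter does not mandate any particular software). A toy example with simulated lexical and grammatical scores predicting reading:

import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(8)
n = 500
lexis = rng.uniform(0, 10, n)
grammar = rng.uniform(0, 10, n)
X = np.column_stack([lexis, grammar])
reading = 2.0 * lexis + 0.3 * lexis * grammar + rng.normal(0, 1, n)

sr = SymbolicRegressor(population_size=1000, generations=20,
                       function_set=("add", "sub", "mul", "div"),
                       random_state=0)
sr.fit(X, reading)
print(sr._program)   # the evolved mathematical expression relating the predictors to reading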

Conclusion

In sum, Volumes I and II present 27 fundamental and advanced quantitative methods and their applications in language testing research covered across 21 chapters. Important factors to consider in choosing these fundamental and advanced methods are the role of theory and the nature of research questions. Although some may be drawn to use advanced methods as they might provide stronger evidence to support validity and reliability claims, in some cases, using less complex methods might cater to the needs of researchers. Nevertheless, oversimplifying research problems could result in overlooking significant sources of variation in data and possibly drawing wrong or naïve inferences. The authors of the chapters have, therefore, emphasized that the first step in choosing the methods is the postulation of theoretical frameworks to specify the nature of relationships among variables, processes, and mechanisms of the attributes under investigation. Only after establishing the theoretical framework should one proceed to select quantitative methods to test the hypotheses of the study. To this end, the chapters in the volumes provide step-by-step guidelines to achieve accuracy and precision in choosing and conducting the relevant quantitative techniques. We are confident that the joint efforts of the authors have emphasized the research rigor required in the field and highlighted strengths and weaknesses of the data analysis techniques.


Section I

Advanced item response theory (IRT) models in language assessment

1

Applying the mixed Rasch model in assessing reading comprehension
Purya Baghaei, Christoph J. Kemper, Monique Reichert, and Samuel Greiff

Introduction

In language and educational assessment, differences in learners’ progress or proficiency usually are measured by comparing their test performances on a continuous ability scale. Most item response theory (IRT) models (Birnbaum, 1968), as well as the Rasch model (Rasch, 1960/1980), arguably the most important model among IRT models, which are commonly used for scaling test takers, assume that differences among (language) learners are quantitative differences, or differences in degree. That is, test takers differ from one another in language ability by some degree, which implies that all the test takers use different amounts of the same skills and strategies to answer test items. This assumption neglects the fact that the differences among test takers could also be qualitative differences, or differences in kind. That is, test takers could be employing abilities and strategies in different ways to answer the items, and test scores could reflect differences in the kinds of processes and strategies that they use and not only quantitative differences in one single latent ability. If this were indeed the case, one would be comparing test takers on different constructs with the same test. The distinction between quantitative and qualitative differences in second language (L2) test performance is of paramount importance. As Bachman and Palmer (1996) pointed out, personal characteristics can considerably impact the outcome of language assessments. The authors discuss age, native language(s), sex, level of general education, type and amount of experience with a given test, nationality, and immigrant status as individual attributes that may affect performance on a given language assessment over and above the test taker’s language ability. Thus, in such a case, performance on the measures would depend not only on the test taker’s language ability (i.e., the primary target construct) but also on individual characteristics. Considering the potential effects of the aforementioned attributes on the outcome of the test, mixed Rasch model (MRM) analysis not only helps in developing theories of L2 acquisition but is also relevant to understanding learning difficulties and designing remedial courses and interventions. In addition, it can play an important role in test validation (see Baghaei & Carstensen, 2013; Reichert, Brunner, & Martin, 2014).

Putting construct validation (see Messick, 1989, for an extensive discussion of different facets of test validity) into the focus of test development also means that higher demands are put on research methodologies. This indicates that test developers are required to state as clearly as possible which processes or strategies are employed when test takers respond to the test items. This, in turn, necessitates that different methodologies should, ideally, be applied in order to shed light on the construct measured by the test. In the current chapter, we will focus on the MRM as one specific approach to test validation that, to our knowledge, has not been used much among language test developers. From a psychometric point of view, one approach that helps specify the test construct, and, in particular, establishes test takers’ qualitative and quantitative differences, is directly linked to the analysis of differential item functioning (DIF). DIF occurs when subpopulations with the same overall ability have different probabilities of correctly answering an item (see Baghaei, 2009; Ferne & Rupp, 2007; Smith, 2004). For instance, immigrants and natives, as two distinct groups with equal overall reading comprehension ability, might have different probabilities of answering certain reading comprehension items. This could be due to applying different subskills for answering those items or to an inherent problem in the items that makes them biased against one group. If the reason for DIF is that test takers apply different strategies for answering the items (which, in turn, might have various causes, such as different test experiences, differences in educational or language background, or self-concept; Rost, 1990), the observed DIF is an indication of qualitative differences between the groups. When DIF occurs, the unidimensionality principle is violated and the validity of the uses and interpretations of test scores is questioned.1 This could indicate that the construct underlying the test changes for the groups and, as a result, comparing test takers on the same scale is not justified. Therefore, DIF detection has become a routine procedure in test validation research (Holland & Thayer, 1988; Standards for Educational and Psychological Testing, 2014). Analysis of DIF has basically two stages: (1) statistical detection of DIF and (2) examination of the sources of DIF and the reasons why it has occurred. While the first stage is essential for validity, the second stage can be very informative from a substantive viewpoint (Van Nijlen & Janssen, 2011). Identifying sources of DIF can cast light on the item response generating mechanisms and furthers our understanding of the test construct, which is an important aspect of construct validation (Baghaei & Tabatabaee-Yazdi, 2016; Borsboom, Mellenbergh, & van Heerden, 2004). To identify DIF within a Rasch model framework, a joint analysis with all persons from both person classifications (e.g., Class A and Class B) is run to establish the baseline parameter estimates. Analysis of Class A examinees with person abilities anchored at values estimated from the joint run yields Class A item difficulties. Analysis of Class B examinees with person abilities anchored at values estimated from the joint run produces Class B item difficulties. Anchoring brings the item parameters from the two analyses onto a common scale, hence making them comparable. Anchoring persons also adjusts the item parameters (estimated in the two classifications) to match the persons’ ability levels. The relative overall performance of each person classification on each item determines the amount of differential item functioning between the classifications. Then the pairwise differences in item parameters from the two separate analyses are tested for statistical significance with t-tests (Linacre, 2017). When DIF occurs, it means that responses to items are not just a function of the intended ability but are a function of the ability and the grouping to which test takers belong. This is an instance of multidimensionality, which is a threat to the validity argument of the instrument. Put more technically, for a test to be unidimensional, that is, to measure the same ability for all test takers, we expect the same difficulty order for the items across different classes of the population (i.e., that no DIF will occur). DIF, on the other hand, implies that the test scores (i.e., difficulty estimates of items) are not invariant across some predefined subpopulations (e.g., males and females). This is a violation of IRT principles and an indication that test takers across the groups do not employ the same skills and strategies to answer the items. That is, the test activates different skills and strategies for different subpopulations or classes of respondents, which means that the construct underlying the test varies as a function of group membership. The correlation of the test with external criteria may also drastically change for the classes of the population for whom DIF exists (Embretson, 2007). This is further evidence that the construct underlying the test is not constant for all the test takers and changes with group membership. In conclusion, when DIF exists, the mean scores of the two groups are not comparable, and the differences between them are qualitative or differences in kind (Andrich, 1988).
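As a rough illustration of the final step described above, the sketch below computes a DIF contrast for each item from two anchored calibrations and tests it for significance. The difficulty estimates and standard errors are placeholders rather than output from any particular program, and a large-sample normal approximation stands in for the t-test mentioned in the cited procedure.

```python
# Sketch of pairwise DIF testing on item difficulties from two anchored calibrations.
import numpy as np
from scipy import stats

# Placeholder item difficulties (logits) and standard errors from the two runs
diff_a = np.array([0.40, -1.10, 0.85])   # Class A calibration
se_a   = np.array([0.12,  0.15, 0.13])
diff_b = np.array([0.10, -0.95, 1.60])   # Class B calibration
se_b   = np.array([0.11,  0.14, 0.12])

contrast = diff_a - diff_b                        # DIF contrast per item
z = contrast / np.sqrt(se_a**2 + se_b**2)         # standardized contrast
p = 2 * stats.norm.sf(np.abs(z))                  # two-sided p-values

for i, (c, zi, pi) in enumerate(zip(contrast, z, p), start=1):
    print(f"Item {i}: DIF contrast = {c:+.2f} logits, z = {zi:.2f}, p = {pi:.3f}")
```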

The mixed Rasch model

The mixed Rasch model (MRM; Rost, 1990) is an extension of the Rasch model that identifies DIF across latent classes, i.e., groups who are unknown prior to testing. Unlike the standard DIF detection methods, where we need to have a manifest variable such as sex or nationality a priori, MRM identifies DIF across latent subpopulations or latent classes in the data without a previously defined grouping variable. That is, to fit the MRM and examine DIF, one does not need a manifest grouping variable such as sex. Instead, MRM detects subpopulations within the population for whom DIF exists. It is then the researcher’s job to determine the qualitative differences among the detected latent classes and link the observed DIF to some class characteristics. By comparing the item response patterns across the latent classes or examining the relationship between the classes and other manifest variables, such as motivation, test taking strategies, cognitive style, first language, etc., new hypotheses can be generated concerning respondents’ differences in answering the items.
The Rasch model requires constant item parameters across different subpopulations of respondents, i.e., invariance. When the Rasch model does not fit for the entire population, the MRM uncovers latent classes within the population for whom the Rasch model fits separately. This implies that the difficulty order of the items changes for the detected classes, or invariance does not hold across the classes; hence, the Rasch model does not fit for the whole sample, but it fits within each class. In this case, an unwanted nuisance dimension has crept into the measurement, and response probabilities do not depend on a single latent trait but on a continuous latent variable plus a categorical variable, i.e., examinees’ class membership. This is a violation of the unidimensionality principle of the Rasch model. When MRM fits, examinees who belong to different latent classes are not comparable since “…the cognitive structure is different for people from both classes. In that case, it can be argued that the questionnaire measures different traits in different populations and, hence, trait values cannot be compared between the populations” (Rost, Carstensen, & von Davier, 1997, p. 331). Rost (1990) writes that MRM is “heuristic,” i.e., it helps us understand what unprecedented factors interplay with the intended construct to bring about item responses. This implies that the MRM is not a model for person measurement and item calibration. Scheiblechner (2009), along the same lines, argues that:

The mixed Rasch model is not a true Rasch model, but useful for model controls and heuristic purposes (p. 181). … The mixed Rasch model of Rost (1996) is not a Rasch model in the present sense because the existence of several subpopulations (or classes) of subjects with distinct item parameters is in diametric opposition to specific objectivity (p. 187).

When the difficulty parameters of items change across classes, one can conclude that members of the classes employ different strategies to answer the items or they have had different learning experiences, backgrounds, or curricula. Lack of invariance also implies that test takers’ scores are not comparable, as the test triggers different skills and cognitive processes for the members of the classes. Hence, it does not make much sense to compare them. To solve this problem, Maij-de Meij, Kelderman, and van der Flier (2008) suggest that we can select a number of items that function in the same way across the latent classes as anchor items and bring the person parameters in the two latent classes onto a common scale by imposing equality constraints.
In order to understand the nature of the differences among the latent classes, usually the class-specific item profiles are examined. That is, the content and the patterns of variation in item parameters are carefully inspected in the classes. These inspections could inform researchers of the qualitative differences among the classes. For instance, Rost, Häussler, and Hoffmann (1989) applied MRM to a physics test composed of 10 questions. MRM detected two latent classes. Inspection of the patterns of item difficulties or class-specific item profiles revealed that the first five items were easy for Class 1 test takers, whereas the second five items were easy for Class 2 respondents. Careful investigation of the contents of the items revealed that the first five items were about theoretical knowledge and the second half were practical questions. Therefore, the two classes were identified as those with theoretical knowledge and those with applied knowledge in physics. Other researchers have also used some manifest variables such as sex and academic background to understand the nature of the latent classes and the reason for the differences in the classes (Hong & Min, 2007). That is, after detecting the latent classes, the researcher tries to link the classes to some manifest variable such as sex or nationality. This approach can be employed when examination of class-specific item profiles does not yield interpretable results.
The purpose of MRM is to identify DIF across latent classes that are otherwise unknown. However, a study with a priori hypotheses regarding the sources of DIF using MRM is not a reasonable strategy. That is, MRM cannot be very informative if one runs MRM first and then tries to link the latent classes to known variables that have been selected beforehand according to some theory. If researchers have manifest variables that are hypothesized to be the sources of DIF, they can easily check DIF for these variables using conventional DIF detection methods for known groups. Basically, there are three major approaches to detecting DIF across manifest variables, namely, the Mantel-Haenszel approach, logistic regression, and IRT-based methods (Zumbo, 2007). In the Mantel-Haenszel procedure, two-by-two contingency tables are created for all the items where the rows are groups and the columns are counts of correct and incorrect responses of the groups to the items. An odds ratio is then calculated to examine the strength of the relationship between the groups and the counts of correct/incorrect responses. In the logistic regression method, it is statistically examined whether group membership can predict a correct response to an item. Here, response to a binary item is the dependent variable, and group membership, the total score for each examinee, and a group by total score interaction are independent variables (Zumbo, 1999). In IRT, usually item characteristic curves (ICCs) for items in the reference and focal groups are graphically examined. Then statistical procedures are implemented to test the difference in item parameters across groups. ICCs depict the relationship between location on the latent trait continuum and probability of a correct response to an item. With increasing ability, the probability of a correct response is expected to increase. DIF exists when the ICCs for an item change across the groups, indicating that the item behaves differently for the subsets of the sample.
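The following is a minimal sketch, with simulated data, of the logistic regression approach just described: the binary item response is regressed on the total score, group membership, and their interaction, so that a significant group effect suggests uniform DIF and a significant interaction suggests nonuniform DIF. The variable names and values are invented, and statsmodels is assumed for estimation.

```python
# Logistic-regression DIF sketch with simulated data (Zumbo, 1999 framework).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
group = rng.integers(0, 2, n)          # 0 = reference, 1 = focal
total = rng.normal(10, 3, n)           # total score as the matching variable
# Simulate an item with uniform DIF against the focal group
logit = -3.0 + 0.3 * total - 0.8 * group
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([total, group, total * group]))
fit = sm.Logit(y, X).fit(disp=False)

names = ["const", "total", "group", "group_x_total"]
for name, b, p in zip(names, fit.params, fit.pvalues):
    print(f"{name}: b = {b:+.2f}, p = {p:.3f}")
```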

Previous applications of MRM in language testing research

MRM (Rost, 1990) is a relatively new technique and has not been utilized much in L2 testing research. A review of the literature revealed only a few applications in second and foreign language assessment (Aryadoust, 2015; Aryadoust & Zhang, 2015; Baghaei & Carstensen, 2013; Pishghadam, Baghaei, & Seyednozadi, 2017). Baghaei and Carstensen (2013) employed MRM to analyze a 20-item multiple-choice reading comprehension test. MRM detected two latent classes with almost the same mixing proportion (the proportion of test takers who fall in each latent class) that exhibited noninvariant item parameter estimates. Examination of item difficulty profiles in the two classes revealed that test takers in Class 1 had found the items based on short texts (20–30) to be easier, whereas test takers in Class 2 had found the items based on longer texts (400–500) to be easier. They concluded that the two latent classes differed in their processing of long and short passages and that long and short passage items do not form one single dimension.
Aryadoust and Zhang (2015) fitted MRM to a reading comprehension test given to Chinese EFL (English as a Foreign Language) learners. Along with the reading comprehension test, a lexicogrammatical knowledge test and a set of cognitive and metacognitive strategies tests were given to the participants. MRM identified two latent classes with proportions of 60% and 40%. The classes were then linked to the manifest variables in the data. Findings revealed that the students in the two classes differed in their strategy use and in their lexicogrammatical knowledge. The authors concluded that EFL reading instruction can benefit from these findings by providing differentiated instructions according to class membership.
Aryadoust (2015) replicated Aryadoust and Zhang’s (2015) study with an EFL listening comprehension test. MRM again identified two latent classes with proportions of 63% and 37% of the test takers. In the first step, inspection of the item profiles in the two classes revealed that test takers in Class 1 performed better than test takers in Class 2 on the set of items that required choosing the correct options from a list of options and writing the letters next to them on the answer sheet. Aryadoust concluded that these items require cognitive-motor abilities that are irrelevant to listening comprehension. That is, the identified classes are different in their construct-irrelevant cognitive abilities, their memory, and their multitasking abilities. Next, he used artificial neural network analysis, a nonlinear approach for prediction, to predict class memberships with the manifest variables in the analysis. This analysis showed that all the independent variables (i.e., lexicogrammatical knowledge and metacognitive strategy use) are important predictors of class membership. Postulating a tentative model for listening test performance based on the findings, Aryadoust stated that lexicogrammatical knowledge and metacognitive strategy use awareness play important roles in listening comprehension test performance.
Pishghadam et al. (2017) analyzed an EFL vocabulary test with the MRM to test the role of “emotioncy” in test bias. Emotioncy, which refers to the emotions generated by the senses (Pishghadam et al., 2017), was measured for each word with a questionnaire administered before the vocabulary test. Test takers specified on a five-point scale their level of emotional involvement with the Persian equivalents of the vocabularies that were presented in the test. Once again, MRM identified two latent classes. Nineteen items were easier for Class 1, and 20 items were easier for Class 2. Likewise, Class 2 test takers had indicated higher emotional involvement with the set of words that they had found easier. Pishghadam et al. (2017) argued that emotioncy or “feeling for the words,” which results from individuals’ personal and cultural experiences, can act as a source of test bias.
The studies presented so far illustrate that MRM helps to identify groups of test takers who approach language test items differently, depending on their particular background characteristics, attitudes, abilities, and strategies. The following section will provide an illustration of the different steps that can be taken in the context of the validation of a particular language test by using the MRM approach.

Sample study: Application of MRM to a reading comprehension test

Method

Instrument and participants

Participants in the current, illustrative study are a subsample of the candidates (n = 1,024; 68% females) who took the National University Entrance Examination (NUEE) in Iran in 2011 to apply for the undergraduate English programs in state universities. NUEE is a high-stakes test that screens the applicants into English Studies programs at state-run universities in Iran. The test measures general English proficiency at an intermediate level and consists of five conceptually derived sections:

i A grammar section containing 15 items.
ii A vocabulary section containing 15 items.
iii A language function section with 10 items, where test takers have to read short independent conversations (two to four lines) and fill in a few gaps.
iv A cloze test with 10 items, where, in a short passage (178 words), 10 words were missing and participants had to replace the missing words.
v A reading comprehension section with 20 items. This section had three passages with word lengths of 428, 503, and 466, followed by 7, 7, and 6 questions, respectively.

The passages in the reading comprehension section were academic texts and the questions were four-option multiple-choice items. Time for completing the whole test with 70 items was 105 minutes. For the present illustrative study, only the reading comprehension section and the cloze passage were used.

Data analysis

MRM analysis can be applied to language test data in two complementary ways (Embretson, 2007). The first way examines the meaning of the construct under consideration (e.g., reading ability) in terms of the strategies, processes, and the knowledge structures involved in test performance (the construct representation aspect of score validity). The second way focuses on the significance of the measured construct. It includes the association of trait (ability) estimates and external criteria such as demographics, other constructs, and criteria (the nomothetic span aspect of score validity). Accordingly, our data analysis was conducted in two steps.
First, we applied the MRM to the language test data in order to find out whether we could identify latent classes or groups of students who qualitatively differ in their responses to test items and, thus, in the mechanisms underlying their test performance. Such qualitative differences are present in the data if the MRM identifies more than one latent class. In this case, the ordering of item difficulty parameters differs across classes, indicating qualitative differences among the classes. To estimate the MRM, we used the WINMIRA software package (von Davier, 2001). Models with one to four latent classes were fitted to the sample data and compared in terms of fit statistics for model comparison. Generally, three criteria can be used to compare models: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Consistent Akaike Information Criterion (CAIC). Among these indices, the AIC does not penalize for sample size and is, thus, less accurate than the BIC and CAIC when large samples are analyzed (Baghaei & Carstensen, 2013). Smaller values for these information criteria indicate better fit of the model. Accordingly, the model with the smallest values is chosen, and model parameters are estimated (i.e., item difficulty and person ability parameters, error of measurement, and item fit statistics for each of the latent groups identified). Item fit values indicate validity at the item level by showing the extent to which the item is related to the measured latent trait. Then, item difficulty and person parameter estimates (expressed in log-odds units or logits) were used to examine the nature of the latent classes. These parameters indicate the location of the items/persons on the reading ability latent variable, respectively. Within-class item difficulty parameters were compared among groups, graphically as well as statistically (correlation of item difficulties), to determine the items that cause the groups to differ. According to Rost et al. (1997), parallel profiles would indicate that the same latent trait is measured by a set of items in each class.
Next, we also examined the relationship between the qualitative person parameters (class memberships) and external criteria. In this analysis, a broad array of background variables may be used (for instance, demographics and other constructs or criteria) to further elucidate the differences among the latent classes identified in the MRM analysis. We compared classes in terms of gender and their performance on the cloze test, an indicator of overall language ability (Hughes, 2003; Oller, 1979).
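As a hedged illustration of the model-comparison step described in this section, the sketch below computes AIC, BIC, and CAIC from a model’s maximized log-likelihood, its number of estimated parameters, and the sample size, using the standard definitions (CAIC adds 1 to the log-n penalty). The log-likelihoods and parameter counts are placeholders, not values from this study or from WINMIRA output.

```python
# Information-criterion comparison for candidate mixture models (placeholder values).
import math

def information_criteria(log_lik, n_params, n_obs):
    aic = -2 * log_lik + 2 * n_params
    bic = -2 * log_lik + n_params * math.log(n_obs)
    caic = -2 * log_lik + n_params * (math.log(n_obs) + 1)
    return aic, bic, caic

candidates = {                    # hypothetical log-likelihoods and parameter counts
    "1-class": (-6550.0, 21),
    "2-class": (-6380.0, 43),
    "3-class": (-6330.0, 65),
}
n_obs = 1024
for label, (ll, k) in candidates.items():
    aic, bic, caic = information_criteria(ll, k, n_obs)
    print(f"{label}: AIC = {aic:.1f}  BIC = {bic:.1f}  CAIC = {caic:.1f}")
# The model with the smallest value on the preferred criterion is retained.
```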

Results

In the first step of our analysis, models with one to four classes were fitted to the sample data to identify latent classes based on student response patterns in the reading comprehension test. Fit statistics are presented in Table 1.1.


Table 1.1  Model fit statistics for the estimated models

Model            AIC          BIC          CAIC
1-class model    13,209.18    13,312.74    13,333.74
2-class model    13,031.23    13,243.28    13,286.28
3-class model    13,002.65    13,323.19    13,388.19
4-class model    12,947.13    13,403.17    13,490.17

Note: The smallest values for information criteria (better model fit) are set in bold

While the three information criteria do not yield consistent results, a model with two latent classes is suggested by the BIC and the CAIC. Nevertheless, a model comprising four latent classes is suggested by the AIC. As the latter is considered a less accurate criterion in large samples, we chose the model with two latent classes for subsequent analyses (Baghaei & Carstensen, 2013). Item parameters, their standard errors, and their fit indices are depicted in Table 1.2. The Q index (Rost & von Davier, 1993) shows the relationship between the item and the latent variable. The Q index varies between 0 and 1, where 0 indicates perfect fit and 1 indicates perfect misfit or negative discrimination. A Q index of .50 indicates no relation between the item and the trait or random response behavior. The standardized form of the Q index, ZQ, with zero mean and variance of one, is distributed normally and the cut-off values of ±2

Table 1.2  Item parameters and fit indices in the two latent classes

        Class 1                              Class 2
Item    Estimate   Error   Q Index (ZQ)      Estimate   Error   Q Index (ZQ)
1         2.36      .86    .50 (−.93)          −.39      .11    .26 (.94)
2          .65      .40    .08 (−1.82)         −.65      .11    .26 (1.29)
3        −1.46      .19    .24 (1.62)          −.48      .11    .27 (.48)
4         1.89      .69    .50 (−1.81)          .21      .12    .29 (1.39)
5          .21      .34    .25 (−1.56)         −.15      .12    .29 (1.77)
6         6.99      .32    .50 (−.09)           .19      .12    .22 (−.51)
7          .22      .34    .08 (−1.60)         −.18      .12    .22 (−.07)
8         −.57      .25    .15 (−1.76)         −.36      .11    .25 (.86)
9         −.01      .31    .22 (.14)           1.35      .17    .31 (.82)
10        2.55      .93    .50 (−.84)           .64      .14    .25 (−.08)
11       −2.70      .15    .13 (.22)           −.86      .11    .20 (−.73)
12        1.08      .48    .12 (−1.19)          .23      .12    .20 (−.84)
13       −2.90      .15    .11 (.09)           −.26      .11    .29 (.93)
14        1.13      .49    .27 (−.67)           .85      .14    .30 (1.18)
15       −1.69      .18    .12 (−.16)          −.17      .12    .18 (−1.73)
16       −1.96      .17    .18 (.87)            .04      .12    .24 (−.40)
17       −3.71      .14    .07 (.56)          −1.26      .11    .12 (−2.73)
18        −.75      .24    .20 (.52)            .27      .13    .24 (−.30)
19        −.11      .30    .18 (.29)            .43      .13    .19 (−1.58)
20       −1.22      .20    .20 (.11)            .53      .13    .27 (.25)


Table 1.3  Class size and classification accuracy for the latent classes

           Observed Size   Mean Probability Class 1   Mean Probability Class 2
Class 1        57.1%                 .68                        .32
Class 2        42.9%                 .07                        .93

can be applied to them (Baghaei & Carstensen, 2013). As Table 1.2 shows, all the items fit in the two latent classes according to ZQ. However, a couple of items with Q-indices equal to .50 are apparently misfitting in Class 1, which suggests that Class 1 members are involved in random responding. In Table 1.3, key statistics concerning the chosen two-class model are provided that include class size as well as the mean probability of assignment to the classes. About 57.1% of students were assigned to Class 1 and about 42.9% were assigned to Class 2 by MRM analysis. The mean probability indicates the average hypothetical probability for the classification of students in one of the classes. For example, students in Class 2 had a low probability of being classified in Class 1 (.07) and a very high probability of assignment to Class 2 (.93). As the diagonal elements on the probability matrix are much higher than the off-diagonal elements, classification accuracy of the model is considered sufficient (Hong & Min, 2007). To examine the nature of the two latent classes, they were compared with respect to their psychometric properties, for instance, ability estimates and score reliability as well as item difficulty profiles. Figure 1.1 illustrates profiles of item difficulty parameters across the latent classes. Class 1 is depicted with a dotted line and Class 2 with a solid line. Units on the vertical axis indicate item parameters (difficulty) and the horizontal axis indicates the test items. As can be seen,

Figure 1.1  Item parameter profiles of latent classes (Class 1 = 57.1%, Class 2 = 42.9%).

Table 1.4  Mean ability estimates and reliability

                                       Class 1         Class 2
Mean ability score (MRM)              −4.45 (1.24)    −1.32 (1.30)
Mean ability raw score                 1.04 (1.79)     5.32 (4.01)
Reliability of ability score (MRM)     .47             .76

Note: Standard deviation in parentheses

the profiles differ substantially. This interpretation is also supported by a rather moderate rank-order correlation of r = .41 between the item parameters from the two latent classes. That is, the item parameter estimates from the two latent classes do not agree much. Within-class difficulty parameters for Class 1 span a wide range, from 6.99 logits to −3.71 logits, whereas item parameters for Class 2 range only from 1.35 to −1.26. For Class 1, Items 1, 4, 6, and 10 are the most difficult in the set and, by contrast, for Class 2 these items show moderate difficulty. The most difficult item for Class 2 is Item 9; but for members of Class 1, this item is of moderate difficulty. Items that are relatively easy for Class 1 members are 17, 13, and 11; but for Class 2, Items 17 and 11 are relatively easy, with Item 13 having a moderate item difficulty in this group. In sum, as stated previously, Items 1, 4, 6, 9, 10, and 13 contribute to qualitative differences between the two groups. The content of these items will be analyzed in more detail (see as follows). To further inspect group differences, classes were compared in terms of their test performance and the psychometric properties of the test score, that is, reliability. Table 1.4 shows the mean ability estimates and standard deviations from the MRM as well as the mean raw scores and reliabilities for the two classes. Both classes differ substantially in their ability score obtained from the MRM analysis, t(926.2) = −51.2, p < .001. This means that members of Class 1 showed considerably weaker test performance compared to Class 2 members. Further differences are observable in the reliability of the item set; in Class 2, reliability is .76, whereas in Class 1, it is rather low at .47. The Iranian university entrance exam is a high-stakes test with which the future of around 1,000,000 Iranian youths is determined annually. The test is multiple choice, and there is a penalty for guessing. Therefore, candidates for the test are advised against guessing by teachers and test-taking coaches. The extremely low mean of the Class 1 examinees in a multiple-choice test where we expect a mean of 5 by random guessing indicates that Class 1 test takers are low-ability examinees with high levels of risk aversion who respond only when they are sure of an answer. Class 2 examinees, on the other hand, demonstrate higher language ability levels with less risk aversion. Descriptive statistics for Class 1 and Class 2 are presented in Table 1.5. As can be seen, gender distributions are significantly different between the two classes: Class 1 is male dominant and comprises about twice as many males compared to Class 2: χ2 = 7.59, p = .006. Comparing the two latent classes on the cloze

Table 1.5  Class comparison

                                        Class 1        Class 2
Gender distribution (% males)            70.3%          29.7%
Mean score overall language ability      1.98 (2.05)    4.60 (2.57)

Note: Standard deviation in parentheses

test scores, as a measure of overall language ability (Hughes, 2003; Oller, 1979), Class 1 members scored significantly lower compared to Class 2 members: t(648.5) = −16.8, p < .001. In sum, these results suggest that Class 1 members, mainly male students, are considerably weaker in their overall English language proficiency, which is in line with previous research on gender differences and language proficiency in Iranian EFL students (Farashaiyan & Tan, 2012) and confirms our previously described results.
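For readers who wish to reproduce this kind of class comparison, the sketch below runs a chi-square test on a gender-by-class table and a Welch t-test on cloze scores using scipy. The gender counts are hypothetical, and the cloze scores are merely simulated from the means and standard deviations reported in Table 1.5, so the output will not reproduce the statistics reported above.

```python
# Sketch of the class comparisons (gender and cloze scores) with simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n1, n2 = 585, 439                     # approximate class sizes (57.1% / 42.9% of 1,024)
males_1, males_2 = 230, 97            # hypothetical gender counts per class

# Chi-square test on the gender-by-class contingency table
table = np.array([[males_1, n1 - males_1],
                  [males_2, n2 - males_2]])
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"Gender by class: chi2({dof}) = {chi2:.2f}, p = {p:.3f}")

# Welch t-test on cloze scores (unequal variances)
cloze_1 = rng.normal(1.98, 2.05, n1)  # simulated from the reported means/SDs
cloze_2 = rng.normal(4.60, 2.57, n2)
t, p = stats.ttest_ind(cloze_1, cloze_2, equal_var=False)
print(f"Cloze scores: t = {t:.1f}, p = {p:.4f}")
```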

Discussion

This chapter illustrated the application of the mixed Rasch model in language testing research and, more specifically, in the analysis of a reading ability test. In the following section, an attempt is made to uncover the qualitative differences between the two latent classes in terms of the subskills tested by the items. Furthermore, some speculations regarding the individual differences of the two classes are suggested. In the end, we summarize the applications and limitations of MRM in second language research.

Interpretation of the latent classes

Two latent classes emerged from the MRM analysis. Class 1 (57%) comprised poor readers and risk averters, and Class 2 (43%) consisted of intermediate-level readers. To identify the qualitative differences between the two latent classes, we used the scores that were available for all test takers in a cloze test, and we were able to confirm that the members of the two latent classes were significantly different from each other with regard to their general language proficiency. This result seems to underline the fact that individuals with different levels of language proficiency do approach the items in different ways. In such a case, and in order to further explain the latent DIF found within a data set, it is interesting to have a clearer idea about the differences in the cognitive processes that might have been used by those differently proficient test takers. Indeed, the literature shows that using cognitive variables to explain latent class membership can be successful. Two approaches can be taken here: Either the focus is on a concrete measure and analysis of cognitive processes characterizing the members of the different latent classes (in our case, this has been their general language proficiency level measured with the cloze test), or an item-centered approach can be taken, that is, a thorough examination of the cognitive processes underlying the items. To illustrate this approach, in the following section, we are going to examine the traits or subskills that are needed to answer each item. To this end, we consulted Hemmati (2016), who analyzed the same reading comprehension test items using cognitive diagnostic modeling (CDM; de la Torre, 2009). For the CDM analysis, Hemmati examined the content of the items against existing L2 reading comprehension models and identified the attributes each item measured. The acceptable fit of the data to the CDM model was evidence of the correct identification and assignment of the attributes to the items (Chen, de la Torre, & Zhang, 2013).
Items 3, 9, 11, 13, 15, 16, 17, 18, and 20 were easier for Class 1 test takers (poor readers), and Items 1, 2, 4, 6, and 10 were easier for the Class 2 test takers (intermediate-level readers). The items that are easy for Class 1 test takers, according to Hemmati (2016), measure the attributes “identifying word meaning from context,” “using background knowledge,” and “identifying pronominal references.” These three skills refer to cognitive processes that do not entail understanding longer chunks of language and that do not require reading beyond a sentence. This suggests that poor readers try to score by using any skill at their disposal, such as using background knowledge or processing within the limits of a sentence to find the answers. Their poor reading abilities do not allow them to get involved in text comprehension. On the other hand, items that are easier for Class 2 test takers have a stronger focus on integrating meaning from various parts of the text to be answered (inference making, understanding intentions, and connecting information). That is, test takers need to process beyond individual sentences to answer the questions.
The findings suggest an interesting pattern in the information processing capacity of the two latent classes. It seems that Class 1 learners try to answer the questions by using any piece of information available in the small context in the proximity of a word or within a sentence. If this is true, they do not seem to be able to go beyond the boundary of a sentence and integrate information across the text to extract global meanings and structures. On the other hand, Class 2 readers are able to rely on their higher-order text processing abilities as they seem to be able to make inferences and connect information from different sections of the text. These findings would be in line with the construction-integration model of text comprehension (Kintsch, 1998). According to this model, comprehension takes place in two phases. The first phase is a bottom-up process in which readers construct a crude representation of the text using the linguistic input in the text and their own world knowledge. At this stage, text-based and world knowledge propositions are constructed using the knowledge stored in memory. As new sentences are read, new explicit and inference propositions are activated and are added to the ones activated by the previous sentences kept in working memory. In the integration phase, the pool of propositions is refined and a coherent interpretable picture is made. At this stage, the plausibility of propositions is assessed by checking whether they contradict or confirm one another. In an iterative process, the propositions that seem to be less probable are suppressed and those that are supported are strengthened. This process continues until a final set of stable propositions is settled.
The output of our MRM analysis seems to suggest that Class 1 readers are those who are at the construction stage of comprehension. That is, they form propositions based on the linguistic input and their own world knowledge. However, they are not capable of comparing and contrasting the propositions against each other, judging the credibility of the propositions, and integrating them into a set of valid propositions to make a coherent picture of the whole text. The reason for their failure in combining propositions could be poor working memory capacity, which, in turn, might result from their poorer overall language proficiency (cf. Nouwens, Groen, & Verhoeven, 2017, for a study on the relationship between reading ability and working memory). For successful integration to occur, the propositions from previous sentences should be kept active in the working memory and contrasted against the propositions formed from the new sentences.

Conclusion

The application of MRM to the reading comprehension test used in this chapter helped us understand the qualitative and quantitative differences between the test takers. The methodology revealed an acquisition pattern that can be of interest to reading researchers and educationalists. This example illustrated a technique for language testers to analyze whether there are any qualitative differences in foreign language readers. Readers in our study seemed to differ in the acquisition and mastery of the type of skills that they need for successful reading comprehension. Findings revealed that Class 1 learners might not yet have mastered beyond-sentence processing skills, while Class 2 learners have. This information has diagnostic value that could not have been detected using conventional DIF detection methods. MRM analysis shows to which latent class test takers belong, with a given probability. This information can be used to identify their specific strengths and weaknesses. For instance, in this study, remediation and intervention programs for Class 1 test takers should focus on the enhancement of language proficiency in general and on text-level processing skills in particular. Reading comprehension activities for Class 1 test takers should encourage and activate learners’ higher-order text-level processing strategies and activities that go beyond reading and processing individual sentences and require integrating information from various parts of the text. The findings also suggest that the test measures different constructs for the members of the two classes. For Class 1 examinees, the test measures small-context, sentence-level processes, while for Class 2 examinees, it is a test of beyond-sentence text-level processes. Therefore, comparing the scores of examinees who belong to either of the two classes is not justified with this test.
MRM is a valuable technique that helps researchers understand group compositions by identifying subpopulations that are qualitatively different. Lack of fit to the unidimensional Rasch model or DIF usually is considered a psychometric problem that is resolved by deleting items and persons. However, examining DIF with MRM can reveal fascinating phenomena, beyond psychometrics, in test taking and learning processes, which cannot be detected by other methods. Understanding these processes furthers our understanding of the developmental patterns of learning (Glück & Spiel, 2007). A side application of the model can be seen in test validation and checking unidimensionality. If a one-class model does not fit the data, it is evidence that responses to the items depend on a grouping variable as well as on the ability of interest. This is an indication of multidimensionality and the contamination of scores by a nuisance variable.
It is worth mentioning that MRM has also been used to model response styles in surveys. In survey research, respondents may use the rating scale in various ways. One common response style is socially desirable responding or faking, where respondents usually endorse categories that are more socially acceptable. Other response styles include a tendency toward the middle category or a tendency toward extreme categories. These peculiar response styles lead to measurement error and, therefore, should be discerned and remedied. MRM has been employed to identify classes of respondents who fake (Eid & Zickar, 2007; Mneimneh, Tourangeau, Heeringa, & Elliott, 2014) or who use the middle category or the extreme categories (Carter, Dalal, Lake, Lin, & Zickar, 2011; Maij-de Meij et al., 2008).
Perhaps one drawback of MRM is that it requires large sample sizes. The size of the sample increases with the number of latent classes. The minimum sample size required to construct a standard one-class Rasch model should be multiplied by the number of latent classes to have accurate parameter estimates (von Davier & Rost, 1995). For example, if a few hundred respondents suffice for a standard one-class calibration, a two-class solution would call for roughly twice that number. Finally, the interpretation of latent DIF also strongly depends on the availability of further information, e.g., with regard to the test takers’ cognitive skills or with regard to the cognitive processes activated by the items.

Note

1. Rasch measurement is a unidimensional model, i.e., all the items in a test should measure one attribute or skill. What follows from this definition is that all the test takers should employ the same cognitive processes to answer the items. Unidimensionality is an important assumption of the Rasch model and is an empirical question that should be addressed before one can proceed to construct objective measures from raw counts of observations (Wright & Linacre, 1989). For more information, see Chapter 4, Volume I of Quantitative Data Analysis for Language Assessment, on Rasch measurement and unidimensionality.

References

American Educational Research Association [AERA], American Psychological Association, and National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Aryadoust, V. (2015). Fitting a mixture Rasch model to English as a foreign language listening tests: The role of cognitive and background variables in explaining latent differential item functioning. International Journal of Testing, 15, 216–238. doi: 10.1080/15305058.2015.1004409
Aryadoust, V. & Zhang, L. (2016). Fitting the mixed Rasch model to a reading comprehension test: Exploring individual difference profiles in L2 reading. Language Testing, 33, 529–553. doi: 10.1177/0265532215594640
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford, UK: Oxford University Press.
Baghaei, P. (2009). Understanding the Rasch model. Mashhad, Iran: Mashhad Islamic Azad University Press.
Baghaei, P. & Carstensen, C. H. (2013). Fitting the mixed Rasch model to a reading comprehension test: Identifying reader types. Practical Assessment, Research and Evaluation, 18, 1–13.
Baghaei, P. & Tabatabaee-Yazdi, M. (2016). The logic of latent variable analysis as validity evidence in psychological measurement. The Open Psychology Journal, 9, 168–175.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord and M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.
Borsboom, D., Mellenbergh, G. J., and van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. doi: 10.1037/0033-295X.111.4.1061
Carter, N. T., Dalal, D. K., Lake, C. J., Lin, B. C., and Zickar, M. J. (2011). Using mixed-model item response theory to analyze organizational survey responses: An illustration using the Job Descriptive Index. Organizational Research Methods, 14, 116–146.
Chen, J., de la Torre, J., and Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50, 123–140. doi: 10.1111/j.1745-3984.2012.00185.x
de la Torre, J. (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34, 115–130.
Eid, M. & Zickar, M. J. (2007). Detecting response styles and faking in personality and organizational assessment by mixed Rasch models. In M. von Davier and C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 255–270). New York: Springer Verlag.
Embretson, S. E. (2007). Mixed Rasch models for measurement in cognitive psychology. In M. von Davier and C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 235–253). New York: Springer Verlag.
Farashaiyan, A. & Tan, K. H. (2012). On the relationship between pragmatic knowledge and language proficiency among Iranian male and female undergraduate EFL learners. 3L: Language, Linguistics, Literature. The Southeast Asian Journal of English Language Studies, 18, 33–46.
Ferne, T. & Rupp, A. (2007). A synthesis of 15 years of research on DIF in language testing: Methodological advances, challenges, and recommendations. Language Assessment Quarterly, 4, 113–148.
Glück, J. & Spiel, C. (2007). Studying development via Item Response Model: A wide range of potential uses. In M. von Davier and C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 281–292). New York: Springer Verlag.
Hemmati, S. J. (2016). Cognitive diagnostic modeling of L2 reading comprehension ability: Providing feedback on the reading performance of Iranian candidates for university entrance examination. Unpublished master’s thesis, Islamic Azad University, Mashhad, Iran.
Holland, P. W. & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer and H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hong, S. & Min, S. Y. (2007). Mixed Rasch modeling of the self-rating depression scale: Incorporating latent class and Rasch rating scale models. Educational and Psychological Measurement, 67, 280–299.
Hughes, A. (2003). Testing for language teachers (2nd ed.). New York: Cambridge University Press.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge, UK: Cambridge University Press.
Linacre, J. M. (2017). Winsteps® Rasch measurement computer program user’s guide. Beaverton, OR: Winsteps.com.
Maij-de Meij, A. M., Kelderman, H., & van der Flier, H. (2008). Fitting a mixture item response theory model to personality questionnaire data: Characterizing latent classes and investigating possibilities for improving prediction. Applied Psychological Measurement, 32, 611–631.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement. New York: Macmillan.
Mneimneh, Z. N., Tourangeau, R., Heeringa, S. G., & Elliott, M. R. (2014). Bridging psychometrics and survey methodology: Can mixed Rasch models identify socially desirable reporting behavior? Journal of Survey Statistics and Methodology, 2, 257–282.
Nouwens, S., Groen, M., & Verhoeven, L. (2017). How working memory relates to children’s reading comprehension: The importance of domain-specificity in storage and processing. Reading and Writing, 30, 105–120. doi: 10.1007/s11145-016-9665-5
Oller, J. W., Jr. (1979). Language tests at school. London: Longman.
Pishghadam, R., Baghaei, P., & Seyednozadi, Z. (2017). Introducing emotioncy as a potential source of test bias: A mixed Rasch modeling study. International Journal of Testing, 17, 127–140. doi: 10.1080/15305058.2016.1183208
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960. (Expanded edition, Chicago: The University of Chicago Press, 1980).
Reichert, M., Brunner, M., & Martin, R. (2014). Do test takers with different language backgrounds take the same C-test? The effect of native language on the validity of C-tests. In R. Grotjahn (Ed.), Der C-Test: Aktuelle Tendenzen / The C-Test: Current trends (pp. 111–138). Frankfurt: M. Lang.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271–282.
Rost, J., Carstensen, C., & von Davier, M. (1997). Applying the mixed Rasch model to personality questionnaires. In J. Rost and R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 324–332). Munster, Germany: Waxmann.
Rost, J., Häussler, P., & Hoffmann, L. (1989). Long-term effects of physics education in the Federal Republic of Germany. International Journal of Science Education, 11, 213–226.
Rost, J. & von Davier, M. (1993). Measuring different traits in different populations with the same items. In R. Steyer, K. F. Wender, and K. F. Widaman (Eds.), Psychometric methodology. Proceedings of the 7th European meeting of the Psychometric Society in Trier. Stuttgart: Gustav Fischer Verlag.
Scheiblechner, H. (2009). Rasch and pseudo-Rasch models: Suitableness for practical test applications. Psychology Science Quarterly, 51, 181–194.
Smith, R. M. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5, 420–449.
Van Nijlen, D. & Janssen, R. (2011). Measuring mastery across grades: An application to spelling ability. Applied Measurement in Education, 24, 367–387.
von Davier, M. (2001). WINMIRA [computer software]. Groningen, The Netherlands: ASC-Assessment Systems Corporation USA and Science Plus Group.
von Davier, M. & Rost, J. (1995). Polytomous mixed Rasch models. In G. H. Fischer and I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 371–379). New York: Springer Verlag.
Wright, B. D. & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857–860.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved from http://www.faculty.educ.ubc.ca/zumbo/DIF/handbook.pdf
Zumbo, B. D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223–233.

2

Multidimensional Rasch models in first language listening tests
Christian Spoden and Jens Fleischer

Introduction Language assessment researchers investigate the skill level and the processes of language acquisition, and practitioners in this field plan instructional activities to foster language acquisition. For both groups, a general ability measure of language comprehension or production provides very limited insight compared to detailed profiles of specific student abilities. In fact, many language testing instruments aim to measure multiple domains and the interplay among specific components of language. From the psychometric perspective, a flexible measurement approach to validate and appropriately model such ability profiles is necessary. One influential measurement approach that fulfills this requirement is the multidimensional Rasch model (Briggs & Wilson, 2003; Kelderman, 1996; Wang, Wilson, & Adams, 1997). The multidimensional Rasch model is applicable to different forms of item response scoring (e.g., dichotomous and polytomous data) and is robust to complex assessment designs. The multidimensional Rasch model is an extension of the simple Rasch model proposed by and named after the Danish mathematician Georg Rasch (Rasch, 1960). The usefulness of the multidimensional Rasch model for language assessment lies in a combination of benefits of the Rasch model with benefits from the multidimensional measurement. For example, the Rasch model offers a comprehensible criterion-referenced interpretation of test scores (Hartig, 2008). In a criterion-referenced test, student performance is verbally described against a fixed set of learning standards (e.g., the levels of the Common European Framework of Reference, CEFR; Council of Europe, 2001) and commonly contains a description of what students know and which tasks they are able to fulfill at a specific level of their education. Utilizing the Rasch model for language assessment allows describing of minimum, average, and maximum standards (e.g., the common reference levels A, B, and C of the CEFR) by referring to the common properties of a set of items of similar difficulty. While the exact probability of a correct item response (derived from the Rasch model equation; see Equation 2.1) depends on the ability level of a student, the rank order of item difficulties is the same for all students (e.g., Sijtsma & Hemker, 2000), which means that properties from the same item sets can be used to define common

standards validly for all of these students. Moreover, application of the Rasch model as a measurement approach gives robust, reasonable estimates in medium sample sizes of about a few hundred respondents (e.g., De Ayala, 2008) as well as in complex booklet designs (Robitzsch, 2008), such as are often used in large-scale assessments, while other types of measurement models often require a four-digit number of respondents to ensure stable estimates.

Additionally, the multidimensional measurement approach is a valuable alternative to unidimensional measurement itself and offers language researchers several advantages (e.g., Baghaei, 2012). Multidimensional measurement models for language assessment came to the fore in the early 1980s when language research began to focus on more complex dimensional structures instead of a general psychometric measure. Modern theoretical models of language comprehension contain a differentiated set of language subcomponents and abilities (e.g., Bachman, 1990; Chalhoub-Deville, 1997; Jude & Klieme, 2007; Sawaki, Stricker, & Oranje, 2009). For example, language processes generally may be differentiated into productive and receptive forms of language comprehension (Sawaki, Stricker, & Oranje, 2009); receptive forms of language comprehension may further be differentiated into listening and reading comprehension (e.g., Bae & Bachman, 1998; Hagtvet, 2003; Marx & Jungmann, 2000; Rost & Hartmann, 1992; Song, 2008); and listening/reading comprehension may be differentiated into subskills like understanding unexpected or explicit information, making inferences, and drawing conclusions from spoken language (Goh & Aryadoust, 2015, 2016); thus, a hierarchy of skills and components emerges. The multidimensional measurement of language comprehension gives a more detailed diagnostic profile of respondents compared to a unitary language dimension. Obviously, the diagnostic profile of respondents may also be derived from several unidimensionally calibrated abilities, but the multidimensional approach offers advantages in terms of precision and reliability. By simultaneously calibrating multiple dimensions, the ability estimates take measurement error into account, and the correlations among the dimensions are estimated free of attenuation related to that error (latent correlations). Furthermore, collateral information from the correlations among dimensions is also utilized to increase the precision and reliability of each single dimension (e.g., Baghaei, 2012; Wang, Chen, & Cheng, 2004; Yao & Boughton, 2007). Finally, multidimensional measurement models have proven to be beneficial tools for validation, as the test scoring needs to reflect the dimensionality inherent in testing instrument items. Baghaei (2016) has recently given an illustrative example of how to validate the dimensionality of an L2 test in Iran by comparing a unidimensional model with multidimensional models in terms of model fit and other statistical information.

The multidimensional Rasch model

The multidimensional Rasch model is most often applied to the analysis of dichotomous data (e.g., "correct answer" and "incorrect answer" or "agree" and "disagree") obtained from educational, language, psychological, or health-related

tests or surveys, where each individual is characterized by a profile of d (d = 1, …, D) correlated abilities (e.g., Allen & Wilson, 2006; Baghaei, 2012, 2016; Baghaei & Grotjahn, 2014a, 2014b; Briggs & Wilson, 2003; Kelderman, 1996). For example, in language assessments, responses to the test item i are usually scored as xi = 0 for an incorrect response and xi = 1 for a correct response. In formal terms, in the multidimensional Rasch model, the probability of a correct response P(xi = 1) is given by the difference of two continuous parameters, the item difficulty δi and the ability θud of individual u in dimension d:

$$P(x_i = 1 \mid \theta_{u1}, \ldots, \theta_{uD}) = \frac{\exp\left(\sum_{d=1}^{D} q_{id}\,\theta_{ud} - \delta_i\right)}{1 + \exp\left(\sum_{d=1}^{D} q_{id}\,\theta_{ud} - \delta_i\right)} \qquad (2.1)$$

where exp is an exponential function. Equation 2.1 also includes parameter qid defining the loading structure of item i on dimension d (i.e., item i depends on ability dimension d). These parameters are not estimated but need to be specified a priori with qid = 1 if item i depends on ability dimension d, and qid = 0 if answering this item does not require dimension d. Given Equation 2.1, the relation between the latent ability level in a specific dimension d and the probability of a correct response might be expressed in a graphical way, which is known as an item characteristic curve (ICC). Figure 2.1 shows an example of two ICCs, one for an item with lower item difficulty (Item 1) and one for higher item difficulty (Item 2). The figure further

Figure 2.1  Item characteristic curves of two Rasch model items with lower (Item 1) and higher (Item 2) item difficulty, and individuals with lower (θ1) and higher (θ2) ability (figure adapted from Fleischer, Wirth, & Leutner, 2007).

provides the probability of a correct response for a group of individuals with a lower ability level (θ1) and with a higher ability level (θ2), respectively, on this dimension. In this figure, Item 1 is the easier item compared to Item 2, and it displays a higher probability of a correct response across the complete ability dimension. Note that the slopes of the ICCs are the same for all items in the Rasch model. The item difficulty parameter is determined as the numerical value of the latent dimension where P(xi = 1) = .50 (meaning that Item 1 has an item difficulty of δ1 = −1 and Item 2 has an item difficulty of δ2 = 1). Although dichotomous item responses are a common case in language assessments, an extension of the multidimensional Rasch model (Kelderman, 1996) exists for polytomously scored item responses with more than two outcomes (e.g., incorrect response, xi1 = 0; partially correct response, xi2 = 1; and completely correct response, xi3 = 2).
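To make Equation 2.1 and the ICC example concrete, the following minimal Python sketch (illustrative only, not part of the chapter's analyses and not the ConQuest implementation) evaluates the model probability for the two items of Figure 2.1, assuming both items load only on the first of two dimensions; the item difficulties δ1 = −1 and δ2 = 1 are taken from the figure.

```python
import numpy as np

def p_correct(theta, q_i, delta_i):
    """Probability of a correct response under the multidimensional Rasch
    model (Equation 2.1): theta is one person's ability vector, q_i the 0/1
    loading vector of item i, and delta_i the item difficulty."""
    z = np.dot(q_i, theta) - delta_i
    return np.exp(z) / (1.0 + np.exp(z))

q = np.array([1, 0])                      # both items load on dimension 1 only
delta_item1, delta_item2 = -1.0, 1.0      # easier (Item 1) and harder (Item 2)

for ability in (-1.0, 1.0):               # the lower and higher ability levels
    theta = np.array([ability, 0.0])
    print(ability,
          round(p_correct(theta, q, delta_item1), 3),
          round(p_correct(theta, q, delta_item2), 3))
```

At θ = −1 the probability of solving Item 1 is exactly .50, which illustrates the statement that the item difficulty is the point on the latent dimension where P(xi = 1) = .50.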

Multidimensional Rasch models with between-item and within-item dimensionality

Defining the loading structure in qid is crucial for estimating different ability profiles and interpreting the ability measures. Several loading structures of varying degrees of complexity can be defined when specifying a multidimensional measurement model. A multidimensional Rasch model comprising items depending on one single (subtest) dimension exclusively is referred to as a model with simple loading structure, or between-item dimensionality. A multidimensional Rasch model comprising items with loadings on multiple dimensions is referred to as a model with complex loading structure, or within-item dimensionality (e.g., Hartig & Höhler, 2008; Wang, Wilson, & Adams, 1997). Figure 2.2 shows a typical, simplified representation of three multidimensional Rasch models (one between-item dimensionality model and two within-item dimensionality models). In this figure, the six test items (expressed as i1 to i6) are displayed as rectangles indicating manifest variables, while the abilities (θ1, θ2, and in the third model also θg) are displayed as ovals indicating latent variables. Arrows pointing from ovals to rectangles indicate that the particular item depends on this specific ability, whereas arrows between two ovals indicate correlations among these abilities. The between-item dimensionality model (Model 1) comprises two correlated unidimensional abilities (or factors) and is therefore also referred to as a correlated factors model. To give these abilities a meaning relevant to language assessment, it is assumed here that they represent two receptive language skills: listening and reading comprehension, which, according to previous research in language assessment, are correlated (e.g., Bae & Bachman, 1998; Hagtvet, 2003; Hartig & Höhler, 2008; Levine & Revers, 1988; Marx & Jungmann, 2000; Rost & Hartmann, 1992; Wolfgramm, Suter, & Göksel, 2016). In the case of dichotomously scored items, the correlated factors model has nine parameters: six item difficulty parameters, the two variance parameters of listening and reading comprehension abilities, and one covariance between these ability measures. The loading structure in this model is simple because each item from a subtest

Figure 2.2   Between- and within-item dimensionality models for two measurement dimensions.

depends only on one latent dimension (either listening or reading comprehension). Note, however, that the two dimensions are correlated, so that listening and reading comprehension abilities may not be independent of each other. The common variance of the two dimensions is expressed by the magnitude of the correlation between the subtest abilities notated as θ1 and θ2; the higher the correlation between the listening and reading abilities (θ1 and θ2), the higher the overlap between them. The correlated factors model offers researchers a clear interpretation of the abilities. Hartig and Höhler (2008, p. 92) argued that "[w]hen the between-item dimensionality model is applied, scores from both latent variables will provide two straightforward measures of performance in both areas," and that "[t]he between-item model may, thus, be preferable for researchers interested in descriptive measures of performance in certain content areas."

The first within-item dimensionality model presented in Figure 2.2 (Model 2) is a nested factor model. In this model, only items from the first subtest load on dimension θ1, but items from both subtests depend on dimension θ2. Again, referring to receptive language skills, dimension θ1 might now be interpreted as an auditory processing dimension, while dimension θ2 might be termed a more general text comprehension dimension (that includes common skills from listening and reading comprehension; see Hartig & Höhler, 2008). This model also includes nine parameters: six item difficulty parameters, two variances, and one covariance. The covariance between the abilities θ1 and θ2 is drawn in Figure 2.2, but a clear interpretation of the two dimensions is more easily ensured when this covariance is not estimated from the data but a priori constrained to zero in the estimation software. Interestingly, the correlated factors and the nested factor models can be equal in terms of the number of parameters as well as absolute and relative model fit statistics, although the interpretation of both dimensions is different to some extent (as is implicit in the way these dimensions are denoted earlier). In fact, dimension θ1 in the between-item dimensionality model is equal to the sum of dimensions θ1 and θ2 in the nested factor model, while dimension θ2 in the nested factor model is equal to dimension θ2 in the correlated factors model (Hartig & Höhler, 2008). As a consequence of this more complex loading pattern, compared to the correlated factors model, the correlations of both dimensions to external variables may differ substantially. For example, Hartig and Höhler (2008) found dissimilar results concerning school track differences and gender differences when they applied either the correlated factors or the nested factor model to data from the assessment of receptive competencies (reading and listening) in a foreign language. In a correlated factors model, gender and school track differences were of about the same effect size, while in a nested factor model with a common text comprehension dimension and an additional auditory processing dimension, school track differences were attenuated and gender differences were even reversed. According to Hartig and Höhler (2008), in comparison to the correlated factors model, the nested factor model may be appealing for researchers interested in the contribution of the subtest abilities to the overall skill set.
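As an illustration of how the two loading structures differ, the following Python sketch writes down the loading (Q) matrices qid for the six items of Figure 2.2. The assignment of items i1–i3 to the first subtest and i4–i6 to the second is an assumption made here for illustration, not something spelled out in the chapter.

```python
import numpy as np

items = ["i1", "i2", "i3", "i4", "i5", "i6"]

# Model 1 (between-item dimensionality): each item loads on exactly one
# dimension, e.g., theta_1 = listening, theta_2 = reading comprehension.
Q_between = np.array([
    [1, 0], [1, 0], [1, 0],   # first subtest  -> theta_1
    [0, 1], [0, 1], [0, 1],   # second subtest -> theta_2
])

# Model 2 (within-item dimensionality, nested factor): only the first
# subtest loads on theta_1, but every item loads on theta_2.
Q_nested = np.array([
    [1, 1], [1, 1], [1, 1],
    [0, 1], [0, 1], [0, 1],
])

# Items with more than one nonzero loading are the within-item dimensional
# ones; in the between-item model there are none.
for name, Q in [("between-item", Q_between), ("nested factor", Q_nested)]:
    multi = [items[i] for i in np.where(Q.sum(axis=1) > 1)[0]]
    print(name, "->", multi)
```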

The third model displayed in Figure 2.2 (Model 3) is another variant of a within-item dimensionality or nested factor model (confirmatory Rasch bifactor model;1 see Baghaei, 2016). It is composed of three orthogonal dimensions: one general dimension and two subtest dimensions.2 In contrast to the second model, the subtest dimensions (θ1 and θ2) might again be denoted as listening and reading comprehension, while the general dimension might be denoted the text comprehension dimension (θg). Note, however, that the listening and reading comprehension dimensions are not identical to those in the correlated factors model but represent residual variance components from the two tests (see following). In general, the subtest dimensions in this model reflect common variance among item clusters with similar content (e.g., common topics) or similar item format, additional to the general dimension. Again, this model contains nine parameters: six item difficulty parameters and three variances. The assumption of orthogonality is crucial to the bifactor model. By assuming orthogonal dimensions, the item response variance is split into a common variance in the general dimension and a subtest-specific variance in the subtest dimensions. Thus, the general dimension in the bifactor model might also be interpreted as a "pure" measure of what had been intended to be measured by a unidimensional model, not attenuated by subtest-specific variance "artifacts." The subtest dimensions might then be interpreted as method effects. A valuable feature of the bifactor model, and a significant reason for applying this model, is that it includes information on the degree of multidimensionality and the relevance of the subtests as a matter of subtest variance and reliability, after controlling for the general dimension variance (Reise, 2012). The explained common variance is computed as the ratio of the variance explained by the general dimension divided by the variance of the general and the subtest dimensions. Given that the common variance is high (dominant general dimension) and the variances of a subtest dimension are low, the reliabilities of the subtest dimensions will also be low, indicating that the subtest dimensions may not be viable. In this case, application of a unidimensional measurement model to what appeared to be multidimensional data is reasonable. Thus, the bifactor model is a statistical model for the validation of test scoring in a multidimensional test (Reise, 2012).
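Continuing the illustrative sketch above, the bifactor structure of Model 3 adds a general dimension on which every item loads, while the covariances among the three dimensions are fixed to zero rather than estimated. Again, the 3/3 split of items over the two subtests and the variance values are assumptions made here purely for illustration.

```python
import numpy as np

# Model 3 (bifactor): columns are [theta_g, theta_1, theta_2]; every item
# loads on the general dimension and on exactly one subtest dimension.
Q_bifactor = np.array([
    [1, 1, 0], [1, 1, 0], [1, 1, 0],   # first subtest
    [1, 0, 1], [1, 0, 1], [1, 0, 1],   # second subtest
])

# Orthogonality constraint: only the three variances are estimated; all
# covariances among the dimensions are fixed to zero, so the latent
# covariance matrix is diagonal.
example_variances = np.array([0.80, 0.15, 0.10])   # purely hypothetical values
latent_cov = np.diag(example_variances)

print(Q_bifactor)
print(latent_cov)
```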

Compensatory and non-compensatory models

There are two basic ways to integrate different abilities in within-item dimensionality models: namely, compensatory and non-compensatory functions (this terminology is further used in cognitive diagnostic assessment, which is discussed in Chapters 3 and 4, Volume II). In a compensatory model, low abilities in one dimension can be compensated by high abilities in another dimension, while in a non-compensatory model, high abilities are required in all dimensions related to the items to approach a high probability of correctly responding to the item (e.g., Hartig & Höhler, 2009). These principles are illustrated in a few

words by assuming an item of medium item difficulty (δi = 0) loading on two dimensions, θ1 and θ2. In a compensatory model, such as a dichotomous two-dimensional Rasch model with within-item dimensionality, the probability of a correct item response given by Equation 2.1 boils down to:

$$P(x_i = 1 \mid \theta_{u1}, \theta_{u2}) = \frac{\exp(\theta_{u1} + \theta_{u2})}{1 + \exp(\theta_{u1} + \theta_{u2})} \qquad (2.2)$$

This implies that the same items can be answered on the basis of different abilities or different test-taking strategies related to the different abilities. In a non-compensatory model, the additive combination of both abilities is replaced by a multiplicative function, in the current example:

$$P(x_i = 1 \mid \theta_{u1}, \theta_{u2}) = \frac{\exp(\theta_{u1})}{1 + \exp(\theta_{u1})} \times \frac{\exp(\theta_{u2})}{1 + \exp(\theta_{u2})} \qquad (2.3)$$

This shows that a high ability level in one dimension does not help to answer an item if the student is low on the other dimension. Although non-compensatory models provide interesting applications, the rest of the chapter refers only to compensatory models with within-item dimensionality (see Chapters 3 and 4, Volume II for more information concerning (non-) compensatory multidimensional models).
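A small Python sketch (illustrative only, not taken from the chapter) makes the contrast concrete for the medium-difficulty item above: under Equation 2.2 a strength on one dimension can offset a weakness on the other, whereas under Equation 2.3 it cannot.

```python
import numpy as np

def p_compensatory(theta1, theta2):
    """Equation 2.2: the abilities are added before the logistic transformation."""
    z = theta1 + theta2
    return np.exp(z) / (1 + np.exp(z))

def p_noncompensatory(theta1, theta2):
    """Equation 2.3: dimension-specific probabilities are multiplied, so a
    deficit on either dimension keeps the overall probability low."""
    p1 = np.exp(theta1) / (1 + np.exp(theta1))
    p2 = np.exp(theta2) / (1 + np.exp(theta2))
    return p1 * p2

# A student who is strong on dimension 1 but weak on dimension 2:
print(round(p_compensatory(2.0, -2.0), 3))     # 0.5   -> the strength compensates
print(round(p_noncompensatory(2.0, -2.0), 3))  # 0.105 -> the weakness still hurts
```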

Model fit

Making comparisons between a unidimensional and a multidimensional Rasch model, and among different variants of multidimensional Rasch models, is possible on the basis of model fit statistics, including the deviance statistic and information criteria. The deviance is based on the likelihood of the model, a measure of the discrepancy between the fitted model and the true model. The deviance equals −2 times the log-likelihood, with smaller values indicating better fit. Given that two models are nested, as is the case, for example, with a unidimensional and a correlated factors model, these models can be compared by their deviance. To test differences between two models in terms of significance, the deviance differences are approximately χ2-distributed, with the degrees of freedom equal to the difference in the number of estimated parameters between the two models (Briggs & Wilson, 2003). Non-nested models, such as the correlated factors and the bifactor Rasch model (see Baghaei, 2016), can be compared on the basis of information criteria such as the Akaike information criterion (AIC; Akaike, 1987), which is defined as −2 log-likelihood + 2p, where p is the number of parameters of the model. Smaller values of the information criteria indicate a better balance of fit and parsimony. Additionally, the (co-)variances and latent correlations, as well as the reliabilities, should be investigated. Low variances and reliabilities of the abilities, and strong correlations among them, may indicate that the specified dimensional structure defined by the item loadings does not fit the assessment data.
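The two comparisons described here can be written down in a few lines of Python; this is an illustrative sketch with made-up deviances and parameter counts, not output from the chapter's analyses.

```python
from scipy.stats import chi2

def deviance_test(deviance_restricted, deviance_general, df_diff):
    """Likelihood-ratio test for two nested models: the deviance difference is
    approximately chi-square distributed, with df equal to the difference in
    the number of estimated parameters."""
    stat = deviance_restricted - deviance_general
    return stat, chi2.sf(stat, df_diff)

def aic(deviance, n_parameters):
    """AIC = -2 log-likelihood + 2p = deviance + 2p; smaller is better."""
    return deviance + 2 * n_parameters

# Hypothetical values: a unidimensional model (86 parameters) against a
# two-dimensional correlated factors model (88 parameters).
stat, p = deviance_test(20950.0, 20890.0, df_diff=2)
print(round(stat, 1), "p < .001" if p < .001 else f"p = {p:.3f}")
print(aic(20950.0, 86), aic(20890.0, 88))
```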


Software package for multidimensional Rasch modeling

An elaborate and widely applied software package for Rasch modeling is ConQuest (Adams, Wu, & Wilson, 2015), developed by the Australian Council for Educational Research (ACER). ConQuest applies a very general measurement model, referred to as the multidimensional random coefficients multinomial logit model (MRCMLM), which was formally described in Adams, Wilson, and Wang (1997). Due to its flexibility, the MRCMLM facilitates the estimation of several variants of Rasch models, including the partial credit model (Masters, 1982), the linear logistic test model (Fischer, 1973), the many-facets Rasch model (Linacre, 1989), and others, alongside the dichotomous and polytomous uni- and multidimensional Rasch models. The ConQuest software produces marginal maximum likelihood estimates, making use of adaptations of the quadrature method of Bock and Aitken (1981) and the Gauss-Hermite quadrature and Monte Carlo methods described by Volodin and Adams (1995) for the estimation of the MRCMLM and its submodels (Wu, Adams, & Wilson, 2007). Marginal maximum likelihood estimates incorporate information on the distribution of latent ability levels in the population of interest. If ability estimates are requested, the MML estimation of item parameters will be completed by a subsequent unconditional step of estimating the ability parameters θud. In the multidimensional Rasch model, ability parameter estimation relies on Bayesian methods, such as expected a posteriori estimates (EAP; Bock & Mislevy, 1982) or plausible value estimates (PV; Wu, 2006). The ConQuest software also provides statistical information on (co-)variances, reliabilities, likelihood, deviance, and AIC, as well as item fit statistics (not considered in this chapter) to evaluate the quality of the estimated parameters.
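To illustrate what an EAP ability estimate is, the following Python sketch approximates the posterior mean of a single ability on a fixed quadrature grid for a unidimensional Rasch model with known item difficulties and a standard normal prior. This is a conceptual sketch only; it is not ConQuest's multidimensional MML/EAP implementation, and the item difficulties and response pattern are invented.

```python
import numpy as np

def eap_ability(responses, deltas, grid=np.linspace(-4, 4, 61),
                prior_mean=0.0, prior_sd=1.0):
    """EAP estimate for a unidimensional Rasch model: posterior mean of the
    ability, approximated on a fixed grid of quadrature points."""
    # Rasch probability of a correct response at every grid point, per item
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - deltas[None, :])))
    # Likelihood of the observed response pattern at every grid point
    likelihood = np.prod(np.where(responses[None, :] == 1, p, 1 - p), axis=1)
    # Normal prior (up to a constant) on the same grid
    prior = np.exp(-0.5 * ((grid - prior_mean) / prior_sd) ** 2)
    posterior = likelihood * prior
    posterior /= posterior.sum()
    return float(np.sum(grid * posterior))

deltas = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # invented item difficulties
responses = np.array([1, 1, 1, 0, 0])            # invented response pattern
print(round(eap_ability(responses, deltas), 2))
```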

Modeling receptive L1 and L2 competencies by multidimensional Rasch models

A common distinction made in language assessment differentiates productive (conversation and writing skills) and receptive forms of language comprehension in auditory and written language (e.g., Sawaki, Stricker, & Oranje, 2009). Receptive skills in language comprehension separate listening comprehension (spoken language) from reading comprehension (printed language). Listening comprehension has been given less attention compared to reading comprehension, which has been a part of the assessment frameworks of international large-scale assessments like the Programme for International Student Assessment (PISA; OECD, 2016) or the Progress in International Reading Literacy Study (PIRLS; Mullis & Martin, 2015). However, this by no means indicates that listening comprehension is of less importance than reading comprehension. Listening comprehension is not only one of the earliest developed skills in both first and second language learning (e.g., Tincoff & Jusczyk, 1999; Werker, 1989), but it is also a major competence in most acts of communication (Feyten, 1991).

The importance of listening is reflected, for example, in the German educational standards, which refer to the CEFR (Council of Europe, 2001) and define listening comprehension, reading comprehension, writing, and general linguistic usage as four strands of communication skills in first language (L1) and second language (L2) learning (KMK, 2003, 2004). Regardless of the differences in the type of stimuli being processed, listening and reading comprehension rely on similar (if not the same) processes of decoding (acoustic or printed input), recoding or conversion to linguistic information, and reconstruction of the meaning against the background of existing knowledge (e.g., Lund, 1991). Previous research on these receptive skills in the German language (Marx & Jungmann, 2000; Rost & Hartmann, 1992; Wolfgramm et al., 2016) and other languages (e.g., Bae & Bachman, 1998; Hagtvet, 2003; He & Min, 2017; Levine & Revers, 1988; Song, 2008) has indicated partial overlap between listening and reading comprehension skills, at substantial levels of correlation. A large part of this overlap is apparently explained by vocabulary (Wolfgramm et al., 2016). Note that, although listening comprehension is referred to as a unitary ability dimension in the following sections, relevant subskills and attributes underlying the performance in listening comprehension tests were previously identified by some language assessment researchers (e.g., Buck & Tatsuoka, 1998; Goh & Aryadoust, 2015, 2016; Levine & Revers, 1988; Song, 2008).

Another common differentiation is made between L1 and L2 comprehension. Several authors emphasize the similarity of listening and reading comprehension in L1 and L2 learning (e.g., Koda, 2004). Empirical research also suggests strong correlations in the same receptive skills across L1 and L2, as well as similar ability profiles (e.g., students who show substantial differences in oral and written L1 comprehension show a similar pattern in L2; Sparks et al., 1998). However, text comprehension skills are usually first acquired in L1 and afterward transferred to L2, meaning that language-related knowledge structures have not been fully acquired in L2 learning (Birch, 2007; Meschyan & Hernandez, 2002). Therefore, basic L2 processing is less effective, especially in listening comprehension at the outset of L2 learning (Buck, 2001).

Sample study: Investigating the psychometric structure of a listening test

As previously mentioned, listening comprehension is an essential part of the German educational standards of L1 and L2 learning in primary, middle, and high schools. The German educational standards were operationalized as large-scale, statewide standardized assessments of learning (Lernstandserhebungen), with diagnostic profile feedback reported to the teachers. From an empirical perspective, however, previous research on L1 listening comprehension has given indications of different structural models in relation to L1 reading comprehension or L2 listening comprehension. Thus, the major goal of the present analysis is to investigate the most appropriate psychometric structure for L1 text

comprehension skills and listening comprehension across L1 and L2 learning in these assessments, and therefore validate the diagnostic profiles reported to the teachers. Focusing on L1 listening comprehension in particular, the study is based on two stages of analysis. The goal of the first analysis is to investigate the underlying dimensional structure of L1 receptive skills, including listening and reading comprehension items. The goal of the second analysis is to investigate the dimensional structure of L1 and L2 listening comprehension. In both analyses, Rasch models with between-item and within-item multidimensionality were estimated using the ConQuest software package. The models were compared in terms of fit as well as the variance and reliability of the ability estimates and were interpreted in consideration of the objectives of the assessment.

Method

Sample

Test data from a statewide standardized assessment of learning for eighth grade students in the German federal state of North Rhine-Westphalia (Lernstandserhebungen; Leutner, Fleischer, Spoden, & Wirth, 2007) were analyzed. Although the assessment is low-stakes, participation is mandatory for each student in this state and grade. Thus, the overall number of participants was large, with N ≈ 185,000 students from about 2,000 schools belonging to five different school tracks. For the following analyses, a 2% random sample of medium- and high-ability students (Realschule and Gymnasium, respectively; n = 2,335 students) was drawn from the two most highly represented school tracks to ensure feasible statistical analyses.

Instruments

The students took three standardized tests assessing German reading and listening comprehension and English listening comprehension. These three instruments were developed as separate, unidimensional diagnostic measures to provide teachers with competence profiles of their study group. The competence profiles were to be used to initiate instructional activities and to (self-)evaluate teaching, by means of criterion-referenced feedback referring to the German educational standards and a norm-referenced comparison with results from study groups that are comparable in terms of relevant student characteristics. The program operationalized the German educational standards and assessed general competencies in the subjects of German language comprehension, English language comprehension, and mathematics. The program was implemented as a part of an educational reform moving the educational system toward evidence-based policy and practice, where evidence implies the empirical assessment of competencies in terms of "learnable, contextualized, cognitive dispositions" (Klieme, Hartig, & Rauch, 2008, p. 8; see also Leutner, Fleischer, Grünkorn, & Klieme, 2017), as a central building block for educational standards.

The German language listening comprehension test was based on an audio recording of a radio broadcast as stimulus, and it contained 38 items: 8 in the true/false format and 30 items comprising completion exercises or short answer items (two items having three polytomously scored response categories). The time limit for this test was 25 minutes. The German language reading comprehension test was based on the stimuli of three short written texts with differing contents. The first and second texts were fictional short stories; the third text was a short nonfictional newspaper report including a large graphic (noncontinuous text). The test comprised 41 items overall (15 related to the first text and 13 related to each of the second and third texts). The item formats included true/false, multiple choice, completion, and short answer formats; four items were polytomously scored, with three ordinal response categories. The English language listening comprehension test was based on three short auditory stimuli. The first stimulus was an audio recording of British people in everyday conversations; the second stimulus was an audio recording of English pupils talking about their experiences in English and German schools; and the third stimulus was an audio recording of an English radio program. The test comprised 27 items overall (8 related to the first stimulus, 9 related to the second stimulus, and 10 related to the third stimulus), with the majority of the items being administered as multiple-choice items (24 items); the remaining 3 items were completion tasks. The time limit for the entire test was 40 minutes.

Given that this was a sufficient time limit in which to complete each of the three testing instruments, omitted responses to all items were recoded as incorrect responses. Considering the low-stakes character of this (very) large-scale assessment, and the main objective of self-evaluation, the item responses were coded by the teachers themselves; a secondary coding by external raters was additionally carried out for a sample of item responses, and the agreement between these ratings was computed using the coefficient γ (Gamma) for ordinal measures (Woods, 2007), which is a statistical coefficient for a weakly monotonic association between ordinal variables (here: item responses). The results indicated sufficiently objective coding, with γ ≥ .90 for all items, indicating that teachers and external raters agreed on the item responses to a fair extent.
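As an aside, the γ coefficient used for the rater-agreement check can be computed from concordant and discordant pairs of ratings. The short Python sketch below is a generic illustration with invented ratings; it is not the chapter's actual coding data, nor necessarily the exact procedure described by Woods (2007).

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where C and D are the numbers of concordant
    and discordant pairs of cases; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Invented teacher codes vs. external-rater codes for one polytomous item:
teacher  = [0, 1, 2, 2, 1, 0, 2, 1]
external = [0, 2, 1, 2, 1, 0, 2, 1]
print(round(goodman_kruskal_gamma(teacher, external), 2))   # about 0.88
```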

Statistical approach and Rasch models applied

Two rounds of Rasch measurement were conducted (see Table 2.1 for a summary of the analyzed skill sets, languages, and models). In the first round of analysis, a unidimensional Rasch model, a correlated factors Rasch model, and a bifactor Rasch model were fitted to the data. In the unidimensional Rasch model, all items from the L1 listening and reading comprehension tests loaded on one general dimension, possibly best termed "text comprehension." In the two-dimensional model, items from the listening comprehension test loaded on one dimension and items from the reading comprehension test loaded on a second dimension. As intended by the test administrators, this model indicates separate dimensions of both receptive skills, representing different strands of

language acquisition in the German educational standards. Additionally, a bifactor model similar to the two-dimensional correlated factors model, but with an additional general dimension and orthogonal dimensions, was estimated. The bifactor model makes it possible to estimate the variance components of listening and reading comprehension, controlled for common text comprehension skills. It represents different assumptions on potential nested factor models for receptive skills. For example, Hartig and Höhler (2008) applied a nested factor model with a text comprehension and an additional auditory processing factor to assessment data from L2 listening and reading comprehension tests, while previous research indicates that reading comprehension requires additional reading-specific skills compared to listening comprehension, which is predominant in earlier stages of development (e.g., Rost & Hartmann, 1992).

In the second round of analysis, investigating the dimensional structure of L1 and L2 listening comprehension, a unidimensional Rasch model, a correlated factors Rasch model, and a nested factor Rasch model were fitted to the data. In the unidimensional model, all items from the L1 and the L2 listening comprehension tests loaded on a general dimension, denoted as a general listening comprehension dimension. In the two-dimensional correlated factors model, items from the L1 listening comprehension test loaded on one dimension and items from the L2 listening comprehension test loaded on a second dimension; this model is in line with the German educational standards in differentiating native and foreign language skills. The nested factor model was specified in such a way that all items loaded on the first listening comprehension factor and only items from the L2 listening comprehension test loaded on a second orthogonal dimension, representing the additional skills relevant to L2 listening comprehension. This model reflects assumptions from language research that there is substantial overlap between the abilities, but L2 listening comprehension requires that students build additional knowledge structures in foreign language education.

Table 2.1  Summary of skill sets, languages, and models investigated in Analysis 1 and Analysis 2

              Analysis 1                          Analysis 2
Skill Sets    Listening Comprehension,            Listening Comprehension
              Reading Comprehension
Languages     First Language (German)             First Language (German),
                                                  Second Language (English)
Models        1. Unidimensional Model             1. Unidimensional Model
              2. Correlated Factors Model         2. Correlated Factors Model
              3. Bifactor Model                   3. Nested Factor Model

Estimation of the multidimensional Rasch models

The ConQuest software package (Adams, Wu, & Wilson, 2015) was utilized to estimate these models, making use of the Gauss-Hermite quadrature, which is

the suggested method for multidimensional models with up to three dimensions (Wu, Adams, & Wilson, 2007, p. 189). Model fit was compared across the three models on the basis of the deviance statistic and the AIC information criterion computed by the software. Note that a comparison of multidimensional Rasch models with between-item and within-item dimensionality solely on the basis of model fit indices is difficult, especially as the orthogonality assumption of the within-item dimensionality models worsens the fit of these models to some extent (Baghaei, 2016). Thus, variances, reliabilities, and latent correlations were investigated in addition to the model fit to identify the optimal model representing the data.

Results

Analysis 1 results

Table 2.2  Parameter estimates and global fit statistics for three Rasch measurement models applied to L1 listening and reading comprehension test data

Estimates          Unidimensional   Correlated Factors   Bifactor
Varg               0.387            --                   0.341
Var1               --               0.490                0.148
Var2               --               0.435                0.097
Relg               0.843            --                   0.714
Rel1               --               0.796                0.341
Rel2               --               0.788                0.240
r1,2               --               0.737                --
No. Items (Dim.)   79               38 / 41              79 / 38 / 41
No. PC-Items       6                6                    6
No. Par.           86               88                   88
Deviance           190803.159       190290.476           190290.603
AIC                190975.159       190466.476           190466.603

Notes: Var = variance; Rel = EAP/PV reliability; r = correlation; No. Items = number of items; No. PC-Items = number of items with partial-credit scoring; No. Par. = number of parameters; AIC = Akaike information criterion

Table 2.2 shows the results of parameter estimates and global fit statistics from the unidimensional, correlated factors, and bifactor Rasch models in the first round of analysis. The correlated factors and the bifactor model have noticeably smaller deviance (and AIC) than the unidimensional model. In addition, the difference between the unidimensional and the correlated factors model, and that between the unidimensional and the bifactor model, is significant (χ2 = 512.683, df = 2, p < .001; χ2 = 512.556, df = 2, p < .001). Comparison of the correlated factors and the bifactor Rasch models by means of the AIC criterion reveals the correlated factors model to be slightly more parsimonious (AIC = 190466.476 for the correlated factors model versus AIC = 190466.603 for the bifactor model), although the difference is very small. Additional results of the correlated factors model reveal that the variances and reliabilities of both dimensions are satisfactory, especially compared to a reliability of Relg = 0.843 in the unidimensional model estimated from roughly twice the item number (38 and 41 items versus 79 items). Both receptive abilities are highly correlated (r1,2 = 0.737). In accordance with this high level of correlation, results from the bifactor model indicate large and reliable (Relg = 0.714) proportions of common variance in the general dimension, while the listening (Var1 = 0.148; Rel1 = 0.341) and (in particular) the reading comprehension dimensions (Var2 = 0.097; Rel2 = 0.240) show limited variance and reliability. In summary, these results give some indication that measures of L1 listening and reading comprehension entail a high level of overlap and a large proportion of common variance, as well as a nonignorable degree of tolerably reliable unique variance. The latter quantifies specific cognitive demands of the items from each of the two subtests (i.e., demands arising from cognitive processes specific to listening or specific to reading). Despite the marginal advantage of the correlated factors model in terms of model fit, a bifactor model with an additional text comprehension dimension might therefore be an adequate representation of the receptive skill sets in the assessment.
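The explained common variance defined earlier (Reise, 2012) is not reported in Table 2.2, but it can be derived from the reported bifactor variance estimates. The short Python sketch below does so; it is an illustrative post-hoc calculation, not part of the chapter's reported analyses.

```python
# Variance estimates of the bifactor model in Table 2.2
var_general   = 0.341   # common text comprehension dimension
var_listening = 0.148   # subtest-specific listening variance
var_reading   = 0.097   # subtest-specific reading variance

# Explained common variance: general variance over the summed variances
ecv = var_general / (var_general + var_listening + var_reading)
print(round(ecv, 2))    # about 0.58 -> a dominant, but not exhaustive, general dimension
```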

Analysis 2 results

Table 2.3  Parameter estimates and global fit statistics for three Rasch measurement models applied to L1 and L2 listening comprehension test data

Estimates        Unidimensional   Correlated Factors   Nested Factor
Varg             0.429            --                   0.432
Var1             --               0.490                --
Var2             --               0.706                0.446
Relg             0.829            --                   0.777
Rel1             --               0.783                --
Rel2             --               0.772                0.557
r1,2             --               0.588                --
No. Items        65               38 / 27              65 / 27
No. PC-Items     2                2                    2
No. Par.         68               70                   69
Deviance         157852.074       156790.779           156872.057
AIC              157988.074       156930.779           157010.057

Notes: Var = variance; Rel = EAP/PV reliability; r = correlation; No. Items = number of items; No. PC-Items = number of items with partial-credit scoring; No. Par. = number of parameters; AIC = Akaike information criterion

Table 2.3 shows the results of parameter estimates and global fit statistics for the unidimensional, correlated factors, and nested factor Rasch models in the second round of analysis. Again, the correlated factors and the nested factor Rasch models display a clearly smaller deviance than the unidimensional model; the differences between the unidimensional and the correlated factors model, and between the unidimensional and the nested factor model, are significant (χ2 = 1061.295, df = 2, p < .001; χ2 = 980.017, df = 1, p < .001). Comparison of the correlated factors and the nested factor Rasch models by means of the AIC criterion reveals the correlated factors model to be the slightly more parsimonious model (AIC = 156930.779 for the correlated factors model versus AIC = 157010.057 for the nested factor model). The reliabilities of both dimensions in the correlated factors model are satisfactory (Rel > 0.770), while the two listening comprehension dimensions are moderately correlated (r1,2 = 0.588). Additional results from the nested factor model indicated that the variance of the L2 listening comprehension dimension was larger than the variance of the general listening comprehension dimension (Var2 = 0.446, Varg = 0.432), and that the reliability of the L2 listening comprehension dimension was substantial after controlling for the common variance proportion (Rel2 = 0.557). In summary, these results generally support the use of separate ability estimates obtained from the correlated factors model as descriptive measures of L1 and L2 listening comprehension, since the L2 listening comprehension dimension had a large amount of unique variance. But given that the difference between the correlated factors and nested factor models was not large in this analysis, the nested factor model might offer an attractive alternative specification of the listening comprehension tests across L1 and L2 for researchers interested in specific characteristics of L2 learning.

Discussion and conclusion

This chapter offers an introduction to multidimensional Rasch models and emphasizes the usefulness of this data analysis approach in language assessment research. Multidimensional Rasch models offer advantages such as a comprehensible criterion-referenced interpretation of test scores, robust parameter estimates under suboptimal testing conditions (medium sample size, complex booklet design), a confirmatory statistical approach allowing for the rejection of nonfitting dimensional structures, a direct and attenuation-corrected estimation of correlations among measurement dimensions, and the usage of collateral information from these correlations to increase the precision and reliability of each single dimension. While the first advantage from this list stems from the Rasch model itself, the others are due to the incorporation of multidimensionality into ability estimation (e.g., Wang, Chen, & Cheng, 2004; Yao & Boughton, 2007; for a recent example from adaptive testing, see also Frey, Kröhne, Seitz, & Born, 2017). Even though multidimensional Rasch models with correlated factors are more common in language testing, and Rasch models incorporating a within-item dimensionality structure are rarely applied in practice (Baghaei, 2012), the latter offer useful methods for the validation of subtest scoring of multidimensional testing instruments. In the empirical application described in the

present chapter, the nested factor model and the bifactor Rasch model were used alongside the unidimensional and the correlated factors Rasch models to analyze the psychometric structure of first language listening comprehension in relation to L1 reading comprehension (as a second receptive skill set) and L2 listening comprehension in a large-scale assessment data set operationalizing the German educational standards. The results from these analyses offer promise for deriving the intended descriptive competence profiles but also illustrate the advantages of multidimensional models with more complex loading structures in evaluating the viability of (sub)test scoring, as results from these models provide language assessment researchers with the relevant information on subtest variance and reliability. Note that different models may be used to test different theoretical assumptions regarding the overlap and the empirical separability of different language skills. For example, while results from the first analysis essentially confirmed assumptions about unique variance components of L1 listening and reading comprehension (despite strong overlap between the skill sets), results from the second analysis revealed a very large amount of unique variance of L2 listening comprehension compared to a common L1 and L2 listening comprehension ability dimension and, thus, indicated a clearly separate, additional skill set.

In more general terms, multidimensional models with complex loading structures, such as the bifactor Rasch model, can control for unexplained variance (or simply for method effects) in bundles of items with common characteristics, while simultaneously calibrating the general dimension that was originally meant to be measured. Unexplained variance components in language assessment data may arise for different reasons, like common contexts or topics covered by some items. Other examples of unexplained variance components in language testing are testlets, clusters of items sharing a common stimulus. If the dependencies among the testlet items are ignored, this yields biased item difficulties and reliability estimates (Wang & Wilson, 2005). Wang and Wilson (2005) have proposed a Rasch testlet model, formally equivalent to the bifactor Rasch model, to account for testlet variance. Harsch and Hartig (2010) have given an example of the application of this model to data from the C-test, a cloze test for English as a foreign language consisting of six different texts. The authors found substantial text-specific variances of up to 44% of the variance of the general dimension: this underlines the necessity of controlling for testlet effects. Among others, DeMars (2006), Eckes (2014), and Min and He (2014) have given additional examples of the application of bifactor and testlet models to account for testlet structures in language assessment. Baghaei and Aryadoust (2015) argued that the testlet model is also useful for modeling local item dependencies due to common item formats and applied the Rasch testlet model to an English language comprehension test. They found that in the presence of an estimated variance of the target listening comprehension dimension of 1.73, the item format introduced specific variance between 0.08 (multiple-choice format) and 1.36 (map labeling format), possibly as a function

of item format familiarity, thereby confounding the test scores. Although item difficulty and ability measures from the standard Rasch model and the Rasch testlet model were highly correlated, estimates shrank to the mean and the reliability was overestimated when item dependencies due to common item format were ignored in the standard Rasch model.

Up to this point, the examples of multidimensional Rasch model applications presented here have focused on adequate models for the assessment of individual language skills. It should be noted, however, that researchers interested in the quality of instruction, and in educational activities to foster instruction, might meet with differing results on the structure of language comprehension at the aggregated classroom level. For example, Jude et al. (2008) found that a differentiated structure of L1 and L2 language comprehension at the individual level collapsed into a few broadly defined dimensions at the aggregated classroom level in a sample of students with German as their native language. Höhler, Hartig, and Goldhammer (2010) have given a concrete example of how to model the structure of students' L2 comprehension within and between instructional groups by extending the multidimensional measurement approach with methods derived from multilevel analysis (e.g., Hox, 2010; Snijders & Bosker, 2011). They also found a more differentiated correlation structure within these groups compared to the between-group level. Despite the high level of flexibility of the MRCMLM framework implemented in the ConQuest software, these results point to the necessity of even more elaborate statistical frameworks than the multidimensional Rasch model in some fields of language research.

Notes

1. A bifactor model is a model where "…each item is allowed to load on a general factor, and only one group factor" (Reise, 2012, p. 677). Even though Model 2 and Model 3 include nested factors, only Model 3 is a bifactor model according to the definition given by Reise (2012).
2. It should be noted that Reise (2012, p. 692) advocated more than two subtest dimensions in a bifactor model, pointing out that "[for] measures that were not originally developed with a clear blueprint to include at least three items from at least three content domains, bifactor modeling is likely to be a challenge."

References

Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficient multinomial logit model. Applied Psychological Measurement, 21(1), 1–23. doi: 10.1177/0146621697211001
Adams, R. J., Wu, M. L., & Wilson, M. R. (2015). ACER ConQuest: Generalised Item Response Modelling Software [Computer software]. Version 4. Camberwell, Victoria: Australian Council for Educational Research.
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52(3), 317–332. doi: 10.1007/BF02294359

Multidimensional Rasch models 51 Allen, D. D., & Wilson, M. (2006). Introducing multidimensional item response modeling in health behavior and health education research. Health Education Research, 21(Supplement 1), 73–84. doi: 10.1093 ⁄ her ⁄cyl086 Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press. Bae, J., & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testing factorial invariance across two groups of children in the Korean ⁄ English two-way immersion program. Language Testing, 15(3), 380–414. doi: 10.1177⁄ 026553229801500304 Baghaei, P. (2012). The application of multidimensional Rasch models in large scale assessment and validation: An empirical example. Electronic Journal of Research in Educational Psychology, 10(1), 233–252. Baghaei, P. (2016). Modeling dimensionality in foreign language comprehension: An Iranian example. In V. Aryadoust and J. Fox (Eds.), Trends in language assessment research and practice: The view from the Middle East and the Pacific Rim (pp. 47–66). Newcastle, UK: Cambridge Scholars Publishing. Baghaei, P., & Aryadoust, V. (2015). Modeling local item independence due to common test format with a multidimensional Rasch model. International Journal of Testing, 15, 71–87. doi: 10.1080 ⁄ 15305058.2014.941108 Baghaei, P., & Grotjahn, R. (2014a). Establishing the construct validity of conversational C-tests using a multidimensional Rasch model. Psychological Test and Assessment Modeling, 56(1), 60–82. Baghaei, P., & Grotjahn, R. (2014b). The validity of C-tests as measures of academic and everyday language proficiency: A multidimensional item response modeling study. In R. Grotjahn (Ed.), Der C-test: aktuelle tendenzen ⁄  The C-test: current trends. Frankfurt am Main, Germany: Peter Lang. Birch, B. M. (2007). English L2 reading: Getting to the bottom. Mahwah, NJ: Lawrence Erlbaum Associates. Bock, R. D., & Aitken, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. doi: 10.1007⁄ BF02293801 Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431–444. doi: 10.1177⁄ 014662168200600405 Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4, 87–100. Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press. Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: examining attributes of a free response listening test. Language Testing, 15(2), 119–157. doi: 10.1177⁄ 026553229801500201 Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test construction. Language Testing, 14(1), 3–22. doi: 10.1177⁄ 026553229701400102 Council of Europe (2001). The common European framework of reference for languages: learning, teaching, assessment. Cambridge, UK: Cambridge University Press. De Ayala, R. J. (2008). The theory and practice of item response theory. New York, NY: Guilford Press. DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145–168. doi: 10.1111⁄ j.1745-3984.2006.00010.x

Eckes, T. (2014). Examining testlet effects in the TestDaF listening section: A testlet response theory modeling approach. Language Testing, 31(1), 39–61. doi: 10.1177/0265532213492969
Feyten, C. (1991). The power of listening ability: An overlooked dimension in language acquisition. Modern Language Journal, 75(2), 173–180. doi: 10.2307/328825
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37(6), 359–374. doi: 10.1016/0001-6918(73)90003-6
Fleischer, J., Wirth, J., & Leutner, D. (2007). Testmethodische Grundlagen der Lernstandserhebungen NRW: Erfassung von Schülerkompetenzen für Vergleiche mit kriterialen und sozialen Bezugsnormen [Basics of test theory for statewide standardized assessments of learning in North Rhine-Westphalia: Assessing student competencies with regard to criterion and social references]. In Landesinstitut für Schule/Qualitätsagentur (Ed.), Lernstandserhebungen Mathematik in Nordrhein-Westfalen – Impulse zum Umgang mit zentralen Tests (pp. 91–113). Stuttgart, Germany: Klett.
Frey, A., Kröhne, U., Seitz, N. N., & Born, S. (2017). Multidimensional adaptive measurement of competences. In D. Leutner, J. Fleischer, J. Grünkorn, & E. Klieme (Eds.), Competence assessment in education: Research, models, and instruments (pp. 369–387). Berlin, Germany: Springer.
Goh, C. C. M., & Aryadoust, V. (2015). Examining the notion of listening subskill divisibility and its implications for second language listening. International Journal of Listening, 29(3), 109–133. doi: 10.1080/10904018.2014.936119
Goh, C. C. M., & Aryadoust, V. (2016). Learner listening: New insights and directions from empirical studies. International Journal of Listening, 30(1–2), 1–7. doi: 10.1080/10904018.2016.1138689
Hagtvet, B. E. (2003). Listening comprehension and reading comprehension in poor decoders: Evidence for the importance of syntactic and semantic skills as well as phonological skills. Reading and Writing: An Interdisciplinary Journal, 16(6), 505–539. doi: 10.1023/A:1025521722900
Harsch, C., & Hartig, J. (2010). Empirische und inhaltliche Analyse lokaler Abhängigkeiten im C-Test [Empirical and content-related analysis of local dependencies in the C-test]. In R. Grotjahn (Ed.), Der C-Test: Beiträge aus der aktuellen Forschung / The C-test: Contributions from current research (pp. 193–204). Frankfurt am Main, Germany: Peter Lang.
Hartig, J. (2008). Psychometric models for the assessment of competencies. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts: State of the art and future prospects (pp. 69–90). Göttingen, Germany: Hogrefe and Huber.
Hartig, J., & Höhler, J. (2008). Representation of competencies in multidimensional IRT models with within-item and between-item multidimensionality. Journal of Psychology, 216(2), 89–101. doi: 10.1027/0044-3409.216.2.89
Hartig, J., & Höhler, J. (2009). Multidimensional IRT models for the assessment of competencies. Studies in Educational Evaluation, 35(2–3), 57–63. doi: 10.1016/j.stueduc.2009.10.002
He, L., & Min, S. (2017). Development and validation of a computer adaptive EFL test. Language Assessment Quarterly, 14(2), 160–176. doi: 10.1080/15434303.2016.1162793
Höhler, J., Hartig, J., & Goldhammer, F. (2010). Modeling the multidimensional structure of students' foreign language competence within and between classrooms. Psychological Test and Assessment Modeling, 52(3), 323–340.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Jude, N., & Klieme, E. (2007). Sprachliche Kompetenz aus Sicht der pädagogisch-psychologischen Diagnostik [Language competence from the perspective of educational-psychological diagnostics]. In E. Klieme & B. Beck (Eds.), Sprachliche Kompetenzen. Konzepte und Messung. DESI-Studie (Deutsch Englisch Schülerleistungen International) (pp. 9–22). Weinheim, Germany: Beltz.
Jude, N., Klieme, E., Eichler, W., Lehmann, R. H., Nold, G., Schröder, K., Thomé, G., … Willenberg, H. (2008). Strukturen sprachlicher Kompetenzen [Structures of language competencies]. In DESI-Konsortium (Eds.), Unterricht und Kompetenzerwerb in Deutsch und Englisch: Ergebnisse der DESI-Studie Band 2 (pp. 191–201). Weinheim, Germany: Beltz.
Kelderman, H. (1996). Multidimensional Rasch models for partial-credit scoring. Applied Psychological Measurement, 20(2), 155–168. doi: 10.1177/014662169602000205
Klieme, E., Hartig, J., & Rauch, D. (2008). The concept of competence in educational contexts. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts (pp. 3–22). Göttingen, Germany: Hogrefe.
KMK. (2003). Bildungsstandards für die erste Fremdsprache (Englisch/Französisch) für den Mittleren Schulabschluss [Educational standards in the first foreign language (English, French) for middle school graduation]. München, Germany: Luchterhand.
KMK. (2004). Bildungsstandards im Fach Deutsch für den Mittleren Schulabschluss [Educational standards for middle school graduation in the German language subject]. München, Germany: Luchterhand.
Koda, K. (2004). Insights into second language reading: A cross-linguistic approach. Cambridge, UK: Cambridge University Press.
Leutner, D., Fleischer, J., Grünkorn, J., & Klieme, E. (Eds.). (2017). Competence assessment in education: Research, models and instruments. Berlin, Germany: Springer.
Leutner, D., Fleischer, J., Spoden, C., & Wirth, J. (2007). Landesweite Lernstandserhebungen zwischen Bildungsmonitoring und Individualdiagnostik [Statewide standardized assessments of learning between educational monitoring and individual diagnostics]. Zeitschrift für Erziehungswissenschaft, Sonderheft 10, 149–167. doi: 10.1007/978-3-531-90865-6_9
Levine, A., & Revers, T. (1988). The FL receptive skills: Same or different? System, 16(3), 326–336. doi: 10.1016/0346-251X(88)90075-9
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Lund, R. J. (1991). A comparison of second language listening and reading comprehension. Modern Language Journal, 75(2), 196–204. doi: 10.2307/328827
Marx, H., & Jungmann, T. (2000). Abhängigkeit der Entwicklung des Leseverstehens von Hörverstehen und grundlegenden Lesefertigkeiten im Grundschulalter: Eine Prüfung des Simple View of Reading-Ansatzes [Dependency of reading comprehension development on listening and basic reading skills: An examination of the Simple View of Reading]. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 32(2), 81–93. doi: 10.1026//0049-8637.32.2.81
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. doi: 10.1007/BF02296272
Meschyan, G., & Hernandez, A. (2002). Is native-language decoding skill related to second-language learning? Journal of Educational Psychology, 94(1), 14–22. doi: 10.1037/0022-0663.94.1.14
Min, S., & He, L. (2014). Applying unidimensional and multidimensional item response theory models in testlet-based reading assessment. Language Testing, 31(4), 453–477. doi: 10.1177/0265532214527277
Mullis, I. V. S., & Martin, M. O. (Eds.). (2015). PIRLS 2016 assessment framework (2nd ed.). Chestnut Hill, MA: Boston College.
OECD. (2016). PISA 2015 assessment and analytical framework: Science, reading, mathematic and financial literacy. Paris, France: OECD Publishing. doi: 10.1787/9789264255425-en
Rasch, G. (1960). Probabilistic models for some intelligence or attainment tests (2nd ed.). Chicago, IL: The University of Chicago Press.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. doi: 10.1080/00273171.2012.715555
Robitzsch, A. (2008). Methodische Herausforderungen bei der Kalibrierung von Leistungstests [Challenges in achievement test calibration]. In D. Granzer, O. Köller, A. Bremerich-Voss, A. van den Heuvel-Panhuizen, K. Reiss, & G. Walther (Eds.), Bildungsstandards Deutsch und Mathematik (pp. 42–106). Weinheim, Germany: Beltz.
Rost, D. H., & Hartmann, A. (1992). Lesen, Hören, Verstehen [Reading, listening, comprehension]. Zeitschrift für Psychologie, 200(4), 345–361.
Sawaki, Y., Stricker, L., & Oranje, A. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5–30. doi: 10.1177/0265532208097335
Sijtsma, K., & Hemker, B. T. (2000). A taxonomy of IRT models for ordering persons and items using simple sum scores. Journal of Educational and Behavioral Statistics, 25(4), 391–415. doi: 10.3102/10769986025004391
Snijders, T. A. B., & Bosker, R. J. (2011). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). London, UK: Sage.
Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? A structural equation modeling approach. Language Testing, 25(4), 435–464. doi: 10.1177/0265532208094272
Sparks, R. L., Artzer, M., Ganschow, L., Siebenhar, D., Plageman, M., & Patton, J. (1998). Differences in native-language skills, foreign-language aptitude, and foreign-language grades among high-, average-, and low-proficiency foreign-language learners: Two studies. Language Testing, 15(2), 181–216. doi: 10.1177/026553229801500203
Tincoff, R., & Jusczyk, P. W. (1999). Some beginnings of word comprehension in 6-month-olds. Psychological Science, 10(2), 172–175. doi: 10.1111/1467-9280.00127
Volodin, N. A., & Adams, R. J. (1995, April). Identifying and estimating a D-dimensional item response model. Paper presented at the International Objective Measurement Workshop, Berkeley, CA.
Wang, W. C., Chen, P. H., & Cheng, Y. Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116–136. doi: 10.1037/1082-989X.9.1.116
Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126–149. doi: 10.1177/0146621604271053
Wang, W. C., Wilson, M., & Adams, R. J. (1997). Rasch models for multidimensionality between items and within items. In G. Engelhard & M. Wilson (Eds.), Objective measurement (Vol. 4, pp. 139–155). Greenwich, CT: Ablex.
Werker, J. (1989). Becoming a native listener. American Scientist, 77, 54–59.
Wolfgramm, C., Suter, N., & Göksel, E. (2016). Examining the role of concentration, vocabulary and self-concept in listening and reading comprehension. International Journal of Listening, 30(1–2), 25–46. doi: 10.1080/10904018.2015.1065746
Woods, C. M. (2007). Confidence intervals for gamma-family measures of ordinal association. Psychological Methods, 12(2), 185–204. doi: 10.1037/1082-989X.12.2.185
Wu, M. (2006). The role of plausible values in large scale surveys. Studies in Educational Evaluation, 31(2–3), 114–128. doi: 10.1016/j.stueduc.2005.05.005
Wu, M., Adams, R. J., & Wilson, M. (2007). ACER ConQuest version 2.0: Generalised item response modelling software [software manual]. Camberwell, Victoria: ACER Press.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31(2), 83–105. doi: 10.1177/0146621606291559

3

The log-linear cognitive diagnosis modeling (LCDM) in second language listening assessment

Tuğba Elif Toprak, Vahid Aryadoust, and Christine Goh

Introduction

This chapter demonstrates a didactically oriented application of the log-linear cognitive diagnosis modeling (LCDM) (Henson, Templin, & Willse, 2009) in the context of cognitive diagnostic assessment (CDA). CDA is a cognitively grounded alternative approach to educational assessment, motivated by recent developments in psychometrics (Jang, 2008). The aim of CDA is to furnish fine-grained, diagnostic feedback concerning individual examinees' mastery of language skills in a specific domain by classifying examinees into distinct skill mastery classes. CDA merges theories of cognition and/or learning in the area of interest with statistically rigorous measurement models called diagnostic classification models (DCMs) and determines examinees' cognitive strengths, weaknesses, and misconceptions in prespecified (language) attributes (Rupp & Templin, 2008). The rich and pedagogically meaningful feedback yielded by CDA models can inform pedagogical practices, which can consequently foster individuals' learning. CDA also helps evaluate the diagnostic capacity of test items and the effectiveness of the estimation process, thereby enabling researchers to obtain empirical evidence to validate their theory-based conjectures about the construct of interest. Specifically, in proficiency testing, where language performance is reduced to a total score, the attributes measured by the test remain masked. Recent CDA or similar techniques, however, have been able to reliably identify some of the subskills measured by various tests (e.g., Lee & Sawaki, 2009). This approach, also known as retrofitting CDA, helps test developers determine what listening mechanisms underlie the test and affect test takers' performance (Aryadoust & Goh, 2014). To fulfill the aforementioned functions, the CDA approach utilizes DCMs, which are probabilistic, confirmatory, and multidimensional measurement models. DCMs are called probabilistic as they predict the probability of an individual examinee falling into a specific latent class. DCMs are confirmatory since they employ a Q-matrix (Tatsuoka, 1983), an incidence matrix demonstrating the relationships between the attributes of interest and the test items measuring them; the Q-matrix specifies the loading structure of DCMs. DCMs are multidimensional, making it possible to break down a general ability into several

dimensions or components. The primary function of DCMs is to define and estimate an examinee's ability in a given domain or language skill based on the attributes that he or she has mastered and to generate examinees' attribute mastery profiles. The estimated attribute mastery profile allows for determining the probability of a correct response for each item across different attribute mastery classes and predicting the probability of each examinee becoming a member of a specific attribute mastery class (Henson, Templin, & Willse, 2009). DCMs are rooted in other measurement models including classical test theory (CTT; see Chapter 1, Volume I), item response theory (IRT), latent class analysis (LCA), and factor analysis (FA; see Chapter 11, Volume I). The major distinction between DCMs and these other measurement models is that DCMs are well suited to modeling categorical rather than continuous latent variables (Rupp & Templin, 2008). The categorical latent variables, which are referred to as attributes (or subskills), are usually binary or dichotomous. It is assumed that attributes drive examinee responses to test items (Jurich & Bradshaw, 2014); in other words, possessing or lacking an attribute affects examinees' performance on a test. Unlike unidimensional IRT models, which rank-order examinees along a continuum of a single overall ability, DCMs divide the underlying ability into a set of attributes and place the examinees into attribute mastery classes based on their test performance. Accordingly, some examinees would be grouped as masters of prespecified subskills, whereas others would be nonmasters. Attribute mastery classifications are made based on the Q-matrix (Tatsuoka, 1983) introduced earlier. For example, while one item might tap into attributes A and B, another might engage attributes B, D, and E. There has been a growing interest in the application of DCMs in language assessment (e.g., Aryadoust, 2012; Jang, 2009; Kim, 2015; Lee & Sawaki, 2009; Ravand, 2016). Although these applications have produced promising results, they have remained fairly limited to, for example, the reduced reparameterized unified model (rRUM; Hartz, 2002), the rule-space methodology (Tatsuoka, 1983), and different versions of the deterministic input, noisy "and" gate (DINA) model (Junker & Sijtsma, 2001). As an effective alternative, the LCDM provides a general framework subsuming other core DCMs, without requiring CDA researchers to opt for either a non-compensatory or a compensatory model from the outset (see the next section for a definition of these terms). The LCDM allows for modeling most core DCMs in a flexible way and examines item-attribute relationships and item parameters for theoretical and statistical significance. This flexibility proves useful in language assessment, where researchers and practitioners measure language skills that are highly interactive and likely compensatory in nature (Bernhardt, 2005). While the term interactive means that lower- and higher-level language skills work together as part of language processing, compensatory means that deficiencies in any one skill can be overcome by relying on other skills. Despite this advantage, the application of the LCDM has been limited to only a few studies in

language assessment: Templin and Hoffman's (2013) study, in which the LCDM was retrofitted to the grammar section of the Examination for the Certificate of Proficiency in English (ECPE), and Toprak's (2015) study, in which the LCDM was used to develop a cognitive diagnostic L2 reading comprehension (CDSLRC) test in higher education. Thus, this chapter introduces the LCDM framework and applies it to the listening section of the Michigan English Test (MET) to investigate test takers' mastery of the attributes. We first provide basic information concerning the theoretical and statistical foundations of the LCDM and present the rationale for using it. Next, we describe L2 listening comprehension, elaborating on the cognitive processes and skills underlying task performance. We then describe the MET data and the Q-matrix construction process alongside the LCDM estimation. After presenting and interpreting the findings, we discuss the results in the light of L2 listening comprehension theories. Throughout the chapter, we will refer to the test takers as examinees, the test questions as items, the latent variables as attributes, and the measurement model used as a diagnostic classification model (DCM).

Literature review

The log-linear cognitive diagnosis modeling (LCDM)

DCMs have traditionally been divided into two groups: non-compensatory and compensatory models (Henson, Templin, & Willse, 2009). In a non-compensatory model, the conditional relationship between the attributes and the item responses relies on the mastery of all required attributes; models such as the (reduced) reparameterized unified model (rRUM, formerly known as the fusion model; Hartz, 2002) soften this requirement by penalizing, rather than ruling out, a correct response when a required attribute has not been mastered. In compensatory models, by contrast, the mastery of one skill can compensate for the lack of other skills. The LCDM is a general DCM family and an overarching methodology that enjoys several advantages over other core DCMs. First, the LCDM allows for modeling most compensatory and non-compensatory DCMs by putting constraints on item parameters (Madison & Bradshaw, 2014). Second, the LCDM does not impose any constraints from the outset of model development but allows researchers to examine item-attribute alignments at the item level. The LCDM parameters and interactions can be modeled and tested empirically at the item level, and parameters can be removed if their values do not reflect statistical or theoretical significance (Jurich & Bradshaw, 2014). The LCDM uses a generalized linear model framework to link examinee responses to the latent attributes of interest. Thus, it functions very similarly to an analysis of variance (ANOVA) model for binary data (it contains ANOVA-style main effects for each item; see Chapter 8, Volume I). Items can have a simple or a complex structure: simple structure items tap only one attribute, whereas complex structure items measure more than one attribute. The LCDM also estimates additional interaction parameters capturing the relationships

among the attributes. The interaction parameters display the degree to which having a combination of attributes would increase the probability of a correct response. The LCDM defines the probability of a correct response (X_ei = 1) for examinee e on item i measuring attribute a as (Equation 3.1):

$$P(X_{ei} = 1 \mid \boldsymbol{\alpha}_{e}) = \frac{e^{\lambda_{i,0} + \lambda_{i,1(a)}\,\alpha_{ea}}}{1 + e^{\lambda_{i,0} + \lambda_{i,1(a)}\,\alpha_{ea}}} \qquad (3.1)$$

where the correct response to item i depends on αe, the mastery profile for examinee e, and the item parameters, or λs. The mastery profile (αe) is defined as a vector of length K (where K is the total number of attributes) and shows the attributes mastered by the examinee. The attribute status notated as αea becomes αea = 1 if attribute a is mastered by examinee e, and αea = 0 if it is not mastered. This attribute mastery profile (0 and 1) is used together with the item parameters to estimate the probability of an examinee responding to an item correctly. For simple structure items measuring only one attribute, there is only one intercept parameter, notated as λi,0, and one main effect parameter, notated as λi,1(a). The intercept refers to the predicted log-odds of a correct response (which can be converted to a probability) for examinees who have not mastered the attribute, and the main effect parameter represents the increase in the log-odds of a correct response associated with mastering it. For complex items, additional parameters are included. For instance, if an item taps into a second attribute, the model includes an additional main effect parameter showing the log-odds increase for mastering the second attribute and an interaction parameter representing the additional change in the log-odds, over and above the intercept and main effects, when the examinee possesses both attributes. In a fictitious listening scenario, if Item 1 measures two listening attributes, e.g., paraphrasing (P, αe1) and inferencing (I, αe2), the LCDM models the probability of a correct response as follows:

$$P(X_{e1} = 1 \mid \boldsymbol{\alpha}_{e}) = \frac{e^{\lambda_{1,0} + \lambda_{1,1(1)}\alpha_{e1} + \lambda_{1,1(2)}\alpha_{e2} + \lambda_{1,2(1*2)}\alpha_{e1}\alpha_{e2}}}{1 + e^{\lambda_{1,0} + \lambda_{1,1(1)}\alpha_{e1} + \lambda_{1,1(2)}\alpha_{e2} + \lambda_{1,2(1*2)}\alpha_{e1}\alpha_{e2}}} \qquad (3.2)$$

The intercept (λ1,0) in Equation 3.2 is the predicted log-odds of a correct response for examinees who possess neither paraphrasing nor inferencing (i.e., nonmasters of the two attributes). This item has two main effect parameters: the first, notated as λ1,1(1), is for paraphrasing and represents the increase in log-odds of a correct response for examinees mastering paraphrasing but not inferencing. The second main effect parameter, notated as λ1,1(2), refers to the increase in log-odds of a correct response for examinees mastering inferencing but not paraphrasing. Finally, there is an interaction term, λ1,2(1*2), representing the change in log-odds when an examinee masters both paraphrasing and inferencing. As noted earlier, DCMs are rooted in latent class analysis; they can be viewed as special cases of constrained latent class models in which the class-specific item difficulty parameter (πic) is replaced with a

class-specific item threshold parameter (τic). As in IRT models, the threshold (τic) is linked to the probability (πic) through an inverse log-odds, or logit, function (Templin & Hoffman, 2013; see Chapter 6, Volume I). Thus, a threshold (τic) of 1, for example, corresponds to a probability (πic) of about .73, that is, roughly a 73% chance of a correct response on an item. As each item and latent class requires a separate threshold (τic), this results in a large number of possible thresholds. To further illustrate how an item response is predicted using the LCDM, let us examine an item in the MET L2 listening comprehension test measuring inferencing (Attribute 2) and connecting details (Attribute 4). For this item, there are four threshold values (τic), as predicted by the LCDM parameters and the attribute mastery status of an examinee:

1. The first threshold is τic = λi,0, for examinees mastering neither inferencing nor connecting details (αe2 = 0 and αe4 = 0).
2. The second threshold is τic = λi,0 + λi,1,(2), for examinees who have mastered only inferencing (αe2 = 1 and αe4 = 0).
3. The third threshold is τic = λi,0 + λi,1,(4), for examinees who have mastered only connecting details (αe2 = 0 and αe4 = 1).
4. The fourth threshold is τic = λi,0 + λi,1,(2) + λi,1,(4) + λi,2,(2,4), for examinees who have mastered both inferencing and connecting details (αe2 = 1 and αe4 = 1).

The four item parameter values estimated for this particular item on the MET were 0.242 for the intercept (λi,0), 1.825 for the first main effect, for inferencing (λi,1,(2)), 1.164 for the second main effect, for connecting details (λi,1,(4)), and −0.798 for the interaction effect for mastering both attributes (λi,2,(2,4)). When we convert the log-odds of a correct response at these four threshold levels into probabilities, the probability of responding to this item correctly is .56 for nonmasters of both skills, .88 for masters of inferencing, .80 for masters of connecting details, and .91 for masters of both attributes. Along with item parameters and class-specific item thresholds, the LCDM includes a set of structural parameters (υc) that describe the proportion of examinees belonging to each attribute mastery class. The structural component of the LCDM describes how the attributes are related to each other. The base-rate probabilities of mastery are used to define the relationships and interactions among the attributes, expressed in the form of tetrachoric correlations among pairs of attributes. Once the item threshold (τic) and structural parameters (υc) are estimated, examinees can be classified into their most likely attribute mastery classes. Finally, the LCDM allows for examining interaction parameters. As a case in point, if the main effects for inferencing and paraphrasing are zero for an item and the interaction is positive, this would indicate that the attributes are fully non-compensatory. That is, a positive interaction effect on such an item suggests that possessing one attribute does not fully make up for the lack of the other attribute measured by that item. We could then label this item a fully non-compensatory DINA item.
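To make the threshold-to-probability conversion concrete, the following R sketch implements the LCDM response function for a two-attribute item (Equation 3.2) and reproduces the four probabilities for the MET item just discussed. It is a minimal illustration written for this worked example, not code from the chapter's original analysis; small differences from the reported values reflect rounding of the published parameter estimates.

```r
# Minimal sketch: LCDM response probabilities for an item measuring two attributes,
# using the parameter values reported above (intercept, two main effects, interaction).
lcdm_prob <- function(a2, a4, intercept, me2, me4, int24) {
  threshold <- intercept + me2 * a2 + me4 * a4 + int24 * a2 * a4  # log-odds (tau)
  exp(threshold) / (1 + exp(threshold))                           # convert to probability
}

# Parameters reported for the MET item measuring inferencing (A2) and connecting details (A4)
intercept <- 0.242
me_inf    <- 1.825    # main effect, inferencing
me_cd     <- 1.164    # main effect, connecting details
int_both  <- -0.798   # interaction, both attributes mastered

profiles <- expand.grid(inferencing = 0:1, connecting_details = 0:1)
profiles$p_correct <- with(profiles,
  lcdm_prob(inferencing, connecting_details, intercept, me_inf, me_cd, int_both))
round(profiles, 2)
# p_correct of roughly .56, .89, .80, .92 -- matching the .56, .88, .80, and .91
# reported in the text up to rounding.
```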


Second language (L2) listening comprehension

Listening is a crucial part of human communication. While listening may appear effortless to many, it is in fact a complex cognitive and social process, and many things can go wrong if the listener is unable to engage in these processes effectively. The listener has to engage in a number of interactive low- and high-level cognitive processes, namely perceptual processing, parsing, and utilization (Anderson, 1995). Perceptual processing comprises bottom-up processes that involve decoding the sounds of words and segmenting speech streams into recognizable lexical items such as words and phrases. These processes occur while the listener is attending to the spoken input, and little more than a trace of it may remain once it has been delivered. Listeners also need to engage in high-level top-down processes that draw on their stored knowledge of the context of the interaction, the topic(s), and previous experiences to construct an interpretation of the meaning of the spoken text. Bottom-up and top-down processes are recursive, and they often occur concurrently. Depending on the listener's needs and circumstances, one direction of processing may take precedence over the other at different points of the communication. These processes are supported by the listener's implicit grammatical knowledge, which is frequently needed to parse the utterances for literal meanings before elaborations that are linked to prior knowledge can be constructed. These cognitive processes occur in both first language (L1) and second language (L2) listening, but they are considerably more challenging for L2 listeners because of their inadequate mastery of the L2 phonological system and vocabulary, which can hinder the low-level bottom-up processes of decoding and lexical segmentation. A lack of grammatical proficiency can also hinder many L2 listeners in parsing utterances. In L1 listening, processes such as perception and parsing are highly automatized; that is to say, little or no effort is needed to "hear" what is said. The listener can focus on utilizing the processed information to construct a reasonably complete interpretation of specific utterances or the overall discourse. That said, miscomprehension can still occur in L1 listening if the wrong prior knowledge of context and topic is applied, giving rise to miscommunication and misunderstanding. For L2 listeners, problems occur at every level of cognitive processing, but particularly during perception and parsing (Goh, 2000). Cognitive processing of listening input depends greatly on L2 listeners' control over various sources of knowledge. These primarily include vocabulary knowledge, phonological knowledge, grammatical knowledge, discourse knowledge (i.e., knowledge of the organization of various discourse structures and discourse cues), pragmatic knowledge (i.e., knowledge of language use appropriate to contexts and culture), and topical or background knowledge (Goh, 2000). When L2 listeners lack these kinds of knowledge, which support bottom-up and top-down processing, they have to exert a deliberate effort to focus on as much of the input as possible to process it (Aryadoust, 2017). Such controlled efforts make listening less efficient, because effective listening usually occurs when the listener

is able to understand the message without spending an undue amount of time processing low-level linguistic cues such as words and phrases. Under such circumstances, L2 learners may focus on understanding single words and become distracted or even lost, as they are not able to keep up with the unfolding input (Goh, 2000). The aforementioned survey of the literature shows that listening is a complex language communication skill that consists of a number of smaller enabling subskills or attributes. The L2 listening literature has provided many lists and taxonomies of these attributes, tied to the various sources of knowledge listed previously and the purpose for which listening is done. Field (2008) identified a number of decoding, lexical segmentation, and meaning-building skills that work together to enable learners to achieve comprehension, while Buck (2001) emphasized broad sets of skills that are needed for understanding literal and explicit information and implied meanings. Vandergrift and Goh (2012) identified six core listening skills from the extant literature that learners need to develop for effective and purposeful listening comprehension: listen for details, listen selectively, listen for global understanding, listen for main ideas, listen and infer, and listen and predict. These six skills are supported by the cognitive processes previously explained. In addition, research on listening assessment using advanced quantitative methods such as CDA (Buck & Tatsuoka, 1998; Lee & Sawaki, 2009) and factor analysis (Goh & Aryadoust, 2015) has identified a number of subskills assessed in several major listening tests:

i. Vocabulary knowledge
ii. Paraphrasing the spoken passage
iii. Using prior knowledge of specific facts to comprehend the auditory message
iv. Knowing "between-the-lines" or unspoken information to connect different pieces of explicit information
v. Knowing the speaker's intention, context, and perception
vi. Acquiring and internalizing new knowledge and information from the auditory message

An assessment that tests these subskills would require test takers to exercise both top-down and bottom-up cognitive processes to arrive at the answers. In reality, however, listening tests might not engage all known attributes of listening, depending on the theoretical underpinnings of the tests and their specifications. The aforementioned listening attributes reflect some of the main sources of knowledge for comprehension and skills that the L2 listening literature has identified. Some of these attributes can be further broken down into more fine-grained attributes. For example, knowing "between-the-lines" and knowing speakers' intentions are skills for inferring implicit meaning from explicit information and could rely on several granular attributes such as listening for details, main ideas, and global understanding. The skill of paraphrasing information

from listening texts is not a listening-specific attribute per se; rather, it is essentially a technique for eliciting the extent to which an examinee has understood information given in the prompts. Nevertheless, in order to do this, one would expect the L2 examinee to engage in one or more of the other listening "subskills" stated. In sum, the aforementioned characterization of the complexity of listening comprehension provides a list of the attributes that one needs to develop to be an effective L2 listener. Specifically, L2 listeners would need to understand what lies underneath the spoken words or literal information (i.e., the function of the utterance rather than just its form and literal meaning) in order to grasp the intention of the speaker, which is a real-life social skill that is crucial beyond the classroom walls. This ability demonstrates that the important high-level process of utilization is taking place. It also represents one of the most critical attributes of listening, which is the ability to infer meaning from the utterances that have been processed literally. L2 listeners would benefit from understanding what kinds of cognitive processes may be hindering their comprehension. There are at least two ways in which they can find out about this. One is through immediate self-reports conducted after they have completed a listening task, which can yield introspective data that are contextualized to specific listening tasks, as the study by Goh (2000) has shown. The findings can assist teachers and learners in identifying areas of weakness that require further instruction and support. Another way, as this chapter will demonstrate, is through a quantitative analysis of their listening test performance. This information, particularly when it is elicited from a diagnostic test, will inform teachers and learners of specific listening strengths and weaknesses. The next section illustrates the use of the LCDM in a sample study on second language listening comprehension in which the LCDM was retrofitted to the results of a proficiency exam.

Sample study: Application of the LCDM to a listening test

Background

This study applied the LCDM framework to the results of the Cambridge Michigan Language Assessment (CaMLA) listening test. The LCDM was used to determine examinee profiles on each of the second language listening comprehension attributes that formed the basis of the CaMLA listening test and to classify 963 students into their second language listening mastery profiles.

Data source

The data and the test materials required for this study were provided by CaMLA and include the item-level data of 963 students' performance on seven independent Michigan English Test (MET) listening papers. The test consists of

46 multiple-choice test items. CaMLA also provided the test materials, including the test items and the relevant audio materials. The test comprises three sections:

i. Seventeen short conversations between a man and a woman
ii. Four extended conversations between a man and a woman
iii. Three mini-talks, each followed by three to four test items with four options

Software

The LCDM was estimated using the software package Mplus (Muthén & Muthén, 1998–2017), which can handle restricted latent class analysis. The syntax needed for estimating the LCDM was created using the SAS computer program (version 9.4) and the macro provided by Templin and Hoffman (2013).
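For readers without access to Mplus or SAS, the same model can in principle be fitted in R. The sketch below is a hypothetical illustration using the CDM package's gdina() function with a logit link (the saturated G-DINA model in the logit metric is a reparameterization of the LCDM, as noted in the Limitations section). The object names responses and Q are placeholders for the item-score matrix and Q-matrix, and the argument names should be checked against the documentation of the installed package version.

```r
# Hypothetical sketch: estimating an LCDM-equivalent model in R with the CDM package.
# `responses` = 963 x 46 matrix of scored (0/1) item responses; `Q` = 46 x 4 Q-matrix.
library(CDM)

fit <- gdina(data     = responses,
             q.matrix = Q,
             rule     = "GDINA",   # saturated model: all main effects and interactions
             linkfct  = "logit")   # logit link, i.e., the LCDM parameterization

summary(fit)   # item intercepts, main effects, and interactions
# Examinee attribute-mastery classifications can then be extracted from the fitted object.
```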

Q-matrix construction

A Q-matrix is an item-to-attribute indicator that specifies which item measures which attribute. If an item taps into one attribute, it is called a simple structure item; if it taps into more than one attribute, it is referred to as a complex structure item. In the present study, two of the authors, both experienced in applying CDA to language assessment, worked independently on the test items to identify the attributes underlying the MET L2 listening comprehension section and constructed the Q-matrix. Initially, a list of potential listening attributes was prepared by reviewing the L2 listening comprehension literature, and the coding of items was conducted in an iterative manner with reference to this list. The authors agreed on the codes for the majority of items, and five discrepancies regarding the coding were resolved through e-mail discussions. After this iterative coding process, four attributes that formed the Q-matrix were identified: paraphrasing, inferencing, understanding important information, and connecting details.
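For illustration, such a Q-matrix can be written as a simple binary item-by-attribute matrix. The R sketch below encodes the attribute assignments of a handful of MET items as reported later in this chapter (Table 3.3 and the worked examples); it is a small excerpt for demonstration, not the full 46-item Q-matrix.

```r
# Illustrative excerpt of the Q-matrix (1 = the item taps the attribute).
# A1 = paraphrasing, A2 = inferencing, A3 = understanding important information,
# A4 = connecting details.
Q_excerpt <- rbind(
  item1  = c(1, 0, 0, 0),  # simple structure: paraphrasing only
  item2  = c(0, 1, 0, 0),  # inferencing only
  item3  = c(0, 0, 1, 0),  # understanding important information only
  item14 = c(1, 1, 0, 0),  # complex structure: paraphrasing + inferencing
  item17 = c(0, 1, 0, 1)   # complex structure: inferencing + connecting details
)
colnames(Q_excerpt) <- c("A1_paraphrasing", "A2_inferencing",
                         "A3_understanding_important_info", "A4_connecting_details")
Q_excerpt
```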

Analysis and results

A conjecture-based approach (Bradshaw et al., 2014) was used to estimate the LCDM. In this approach, the LCDM was used as a tool for collecting statistical evidence to validate the Q-matrix. First, the LCDM was defined based on the initial Q-matrix specifications. Then, the general LCDM parameterization was utilized to identify all possible main effects and interactions within both the measurement and structural models. While the measurement model depicts the relationships between the items and the attributes they measure, the structural model features the relationships among the attributes of interest. Since specifications featuring three- and four-way interaction terms did not converge, the final structural model specification was set to be a two-way structural model, in which the highest order of interaction among the attributes was two. The

two-way structural model constrained three- and four-way interactions among the attributes to zero. Finally, the significance of the parameters was evaluated empirically, and nonsignificant parameters (p > .05) were removed to obtain a parsimonious model with a better model-data fit.

Model specifications

The LCDM is also used to empirically verify researchers' hypotheses regarding attribute-item relationships, a process that is similar to confirmatory factor analysis (CFA). After the first application of the LCDM to the MET data, in which the initial Q-matrix (Model 1) was used, we made several changes to the initial Q-matrix. The changes made in the second analysis (Model 2) were as follows: the main effects for Attribute 3 (understanding important information) on Item 5, Attribute 1 (paraphrasing) on Item 12, and Attribute 1 (paraphrasing) on Item 17 were removed from the model. These items produced small main effect parameters, indicating that nonmasters of these attributes also had a high probability of a correct response. Moreover, Attribute 2 on Item 7 was replaced with Attribute 1 after this item was reexamined in the light of the substantive theory and the statistical evidence obtained from the LCDM results. In the third analysis (Model 3), further modifications were applied to fine-tune the Q-matrix: Attribute 4 (connecting details) on Item 6, with a main effect of 0, and Item 7, which had a poor fit index, were removed from the model. Item 7 had been hypothesized to measure Attribute 1 (paraphrasing); however, its intercept and main effect parameters for Attribute 1 were 0.144 and 0.557, respectively, meaning that Item 7 did not actually discriminate between masters and nonmasters of paraphrasing. In the fourth analysis (Model 4), Item 36 was also removed from the analysis since it produced an intercept value of 1.075, which meant that nonmasters of understanding important information had a .75 probability of a correct response. The final analysis was conducted using 44 items. In the final model specification, Attribute 1 (paraphrasing), Attribute 2 (inferencing), Attribute 3 (understanding important/specific information), and Attribute 4 (connecting details) were measured by 18, 16, 7, and 7 items, respectively. There were four complex items and 40 simple items, and the highest degree of interaction between the attributes was two (as noted earlier, since the structural models featuring more than two-way interactions did not converge, a two-way log-linear structural model was estimated).
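The probability quoted for the removed item follows directly from its intercept: applying the logistic (inverse logit) function to a log-odds of 1.075 gives the chance that nonmasters answer the item correctly, as this one-line R check shows.

```r
# Probability implied by an intercept (log-odds) of 1.075 for nonmasters of the attribute
plogis(1.075)   # = exp(1.075) / (1 + exp(1.075)) = 0.745, i.e., about .75
```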

Model fit

To identify the optimal model, model comparisons were made using three information criteria: the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), and the sample-size-adjusted BIC (ssBIC; Sclove, 1987). These global information-based fit indices are commonly used to assess model fit and to select the model that best represents the

data. When comparing several models using these information criteria, smaller values are desirable. The three criteria suggested that Model 4 was the best-fitting model, as indicated by the lower values of the indices reported in Table 3.1, which summarizes the model modifications and fit indices for the MET listening test.
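The entries of Table 3.1 can be recomputed from each model's log-likelihood, number of parameters, and the sample size (N = 963). The short R sketch below reproduces the Model 4 row using the standard formulas, where the sample-size-adjusted BIC replaces N with (N + 2)/24 in the penalty term.

```r
# Reproducing the Model 4 information criteria in Table 3.1.
logLik <- -23308.021   # Model 4 log-likelihood
k      <- 106          # number of estimated parameters
N      <- 963          # number of examinees

AIC   <- -2 * logLik + 2 * k                  # 46828.042
BIC   <- -2 * logLik + k * log(N)             # about 47344.27
ssBIC <- -2 * logLik + k * log((N + 2) / 24)  # about 47007.61 (sample-size-adjusted BIC)

round(c(AIC = AIC, BIC = BIC, ssBIC = ssBIC), 3)
```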

Measurement model: Item-attribute relationships

The measurement component of the LCDM features the relationships between the items and the attributes they measure. Items with relatively large intercepts indicate that examinees lacking the attribute of interest have a high probability of getting the item right, which could suggest a Q-matrix misspecification (Templin & Hoffman, 2013). In such a case, researchers have three options: (i) they could change the Q-matrix entry for that item; (ii) they could define a new attribute based on their substantive theory; or (iii) they could remove the item from the test. In our case, except for Item 2, all the items had large main effects but small intercept values, indicating high discrimination power and no Q-matrix misspecification. Table 3.2 presents the intercept, main effect, and interaction parameter estimates of all items, with their standard errors, based on Model 4.

Table 3.1  Model modifications and fit indices for the MET listening test

Model     Specifications                                   Log-likelihood   Number of parameters   AIC         BIC         ssBIC
Model 1   46 items, two-way log-linear structural model    −24262.340       123                    48770.681   49369.698   48979.052
Model 2   46 items, two-way log-linear structural model    −24283.339       112                    48790.679   49336.125   48980.415
Model 3   45 items, two-way log-linear structural model    −23643.952       108                    47503.904   48029.869   47686.864
Model 4   44 items, two-way log-linear structural model    −23308.021       106                    46828.042   47344.268   47007.614

Note: AIC = Akaike information criterion; BIC = Bayesian information criterion; ssBIC = sample size adjusted BIC.

Table 3.2  The MET listening test Q-matrix and LCDM item parameter estimates Main effects

Item

Intercept

Paraphrasing

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

−0.037 (0.091) 0.688 (0.102) 0.090 (0.099) −0.258 (0.094) 0.216 (0.094) −0.329 (0.097) −0.537 (0.094) −0.304 (0.109) −0.809 (0.102) −0.718 (0.096) −0.825 (0.108) −0.691 (0.096) −1.832 (0.152) −0.898 (0.107) −1.448 (0.127) −1.608 (0.124) 0.242 (0.101) 0.222 (0.114) −0.129 (0.089) −0.192 (0.091) −1.494 (0.144) −1.450 (0.116)

2.385 (0.197) 2.703 (0.205) 1.754 (0.150) 3.421 (0.221) 2.456 (0.171) 2.707 (0.181) 1.320 (0.666)

1.107 (0.143) 3.105 (0.258) 2.515 (0.162)

Inferencing 2.531 (0.287) 1.545 (0.170)

Understanding important information

Connecting details

Interaction

1.447 (0.171) 1.612 (0.171)

3.579 (0.293) 2.518 (0.172) 1.305 (0.149) 1.732 (0.654) 1.123 (0.642) 2.526 (0.167) 2.222 (0.171) 1.825 (0.665)

−1.732 (0.654) −0.361 (0.964)

2.580 (0.253)

1.164 (1.052)

−0.798 (1.264)

2.322 (0.179) (Continued)

Table 3.2  The MET listening test Q-matrix and LCDM item parameter estimates (Continued) Main effects

Item

Intercept

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

−0.445 (0.099) −0.873 (0.099) 0.264 (0.103) −1.284 (0.110) −0.231 (0.095) 0.246 (0.099) 0.009 (0.092) −0.861 (0.101) −0.500 (0.094) −1.118 (0.112) −0.976 (0.108) −1.329 (0.121) −1.701 (0.127) −1.074 (0.107) −0.541 (0.100) 0.098 (0.100) −0.345 (0.095) −0.566 (0.095) −0.710 (0.098) −0.443 (0.098) −0.401 (0.098) −0.681 (0.096)

Paraphrasing 2.172 (0.157) 2.826 (0.213) 3.595 (0.338) 2.224 (0.157)

Inferencing

2.621 (0.191) 2.362 (0.170)

1.444 (0.142)

Connecting details

Interaction

2.382 (0.184) 2.754 (0.323) 0.886 (0.424)

2.219 (0.163) 3.672 (0.234)

Understanding important information

2.933 (0.297)

1.822 (0.169)

4.115 (0.443)

2.401 (0.209)

Note: Standard errors for parameters are given in parentheses

2.575 (0.210) 3.164 (0.197)

2.224 (0.171)

2.619 (0.200)

1.760 (0.158)

1.072 (0.148)

1.758 (0.158)

−0.886 (0.424)

Since log-odds can be difficult to interpret, their corresponding probabilities are given in Table 3.3. The larger an item's main effect relative to its intercept, the more sharply the item separates masters from nonmasters, and the less likely nonmasters are to answer it correctly by guessing. Overall, the MET items had relatively small intercepts and large main effects, except Items 2, 17, 25, and 28. These items had larger intercepts, indicating that nonmasters had a high probability of answering them correctly. For the majority of items, the probability of a correct response was substantially lower for nonmasters of the attribute of interest than for masters. To illustrate, for Item 25, which measures understanding important information, nonmasters of the attribute had a probability of .56 of answering the item correctly, whereas the probability of a correct response on that item was .95 for masters. When the items with larger intercepts were examined, it was seen that these items were relatively easier than the others. This was not surprising, since the LCDM was retrofitted to a test that was developed for norm-referenced testing, and the test contained items that conformed to the psychometric convention of producing a bell-shaped score distribution by including items at a range of difficulty levels. The majority of the items produced acceptable to fair parameter estimates. Ideally, an item should have a small intercept and a large main effect to be considered a good item. For instance, the intercept and main effect parameters for Item 36 were −1.074 and 3.672, respectively. These values yield a probability of a correct response on Item 36 of .25 for nonmasters and .93 for masters of the paraphrasing skill. Hence, Item 36 could be considered a good item, as it helps distinguish between masters and nonmasters of paraphrasing. These probabilities can be calculated using the probability function in Equation 3.1, which converts the log-odds of a correct response to the probability of a correct response for each item. Table 3.3 presents the probabilities of a correct response across all the MET items. Since the impact of attributes on item responses may not be observed at the aggregate level, we designed item characteristic bar charts (ICBCs), which display the response probabilities on the vertical axis as a function of attribute mastery on the horizontal axis. Figure 3.1 shows the ICBC for Item 34, a simple structure item, which measured only connecting details. The estimated probability of answering Item 34 correctly was .86 for masters of connecting details and .20 for nonmasters. Figure 3.2 presents the ICBC for a complex item, Item 14, which measured two attributes, paraphrasing and inferencing. The estimated probability of answering Item 14 correctly was .28, .60, .55, and .76 for examinees who mastered neither attribute, only paraphrasing, only inferencing, and both attributes, respectively. Examining how attributes interact may help us understand the nature of the construct of interest and the cognitive processes involved in responding to an item. When examining item parameters, the second issue of interest is the examination of the interaction terms. In our case, out of 44 items, four items had a complex structure. Items 13, 14, 17, and 28 measured two attributes, and all of these items behaved like DINO (deterministic inputs, noisy "or" gate; Templin & Henson, 2006) items, which produced negative interaction effects.
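An ICBC of this kind takes only a few lines of R to draw. The sketch below re-creates the gist of Figure 3.2 from the four probabilities reported in the text for Item 14; it is an illustrative re-plot, not the original figure code.

```r
# Sketch of an item characteristic bar chart (ICBC) for Item 14,
# using the response probabilities reported in the text.
p_item14 <- c("neither"            = .28,
              "paraphrasing only"  = .60,
              "inferencing only"   = .55,
              "both attributes"    = .76)

barplot(p_item14,
        ylim = c(0, 1), las = 1,
        xlab = "Attribute mastery profile",
        ylab = "Probability of a correct response",
        main = "ICBC for Item 14")
```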

Table 3.3  Probabilities of a correct response for nonmasters and masters on all the MET items

Item   Attribute(s) measured                   Probability for nonmasters   Probability for masters
1      Paraphrasing                            .49                          .91
2      Inferencing                             .66                          .96
3      Understanding important information     .52                          .82
4      Paraphrasing                            .43                          .92
5      Inferencing                             .55                          .85
6      Understanding important information     .41                          .78
7      Paraphrasing                            .36                          .77
8      Inferencing                             .42                          .96
9      Paraphrasing                            .30                          .93
10     Paraphrasing                            .32                          .85
11     Inferencing                             .30                          .84
12     Inferencing                             .33                          .64
13     Paraphrasing, Inferencing               .13                          .70 / .47 / .70*
14     Paraphrasing, Inferencing               .28                          .60 / .55 / .76*
15     Inferencing                             .19                          .74
16     Inferencing                             .16                          .64
17     Inferencing, Connecting details         .56                          .88 / .80 / .91*
18     Understanding important information     .55                          .94
19     Paraphrasing                            .46                          .72
20     Paraphrasing                            .45                          .94
21     Understanding important information     .18                          .69
22     Paraphrasing                            .19                          .74
23     Inferencing                             .39                          .87
24     Paraphrasing                            .29                          .78
25     Understanding important information     .56                          .95
26     Connecting details                      .21                          .61
27     Paraphrasing                            .44                          .93
28     Inferencing, Connecting details         .56                          .75 / .96 / .96*
29     Paraphrasing                            .50                          .97
30     Paraphrasing                            .29                          .79
31     Connecting details                      .37                          .88
32     Understanding important information     .42                          .66
33     Inferencing                             .27                          .77
34     Connecting details                      .20                          .86
35     Inferencing                             .15                          .62
36     Paraphrasing                            .25                          .93
37     Connecting details                      .36                          .97
38     Inferencing                             .52                          .92
39     Paraphrasing                            .41                          .90
40     Paraphrasing                            .36                          .85
41     Connecting details                      .32                          .74
42     Understanding important information     .39                          .65
43     Inferencing                             .40                          .90
44     Paraphrasing                            .33                          .68

Note: For complex items measuring two attributes, the three probabilities for masters are listed in the order first attribute only, second attribute only, and both attributes; asterisks indicate probabilities that involve interaction effect parameters.

Figure 3.1  ICBC for Item 34.

Figure 3.2  ICBC for Item 14.


Figure 3.3  ICBC for Item 17.

For instance, Item 17 measured two attributes, inferencing and connecting details (see Figure 3.3). The log-odds of a correct response for examinees possessing neither inferencing nor connecting details was 0.242 (the intercept), corresponding to a probability of .56. On Item 17, masters of inferencing had a .88 probability of answering the item correctly, while this probability was .80 for masters of connecting details. Mastering both attributes increased the chance of responding to the item correctly to a probability of .91. Item 17 functioned like a DINO item, since under DINO, once a subset of the required attributes has been mastered, the mastery of additional attributes does not increase the probability of getting the item right. However, if the main effects for inferencing and connecting details on Item 17 were both zero and the interaction was positive, or if the item produced nonsignificant main effects, this would suggest that the attributes measured by this item were fully non-compensatory. We could then deduce that the item is DINA-like, in which case the probability of a correct response is high only when all the attributes required for the item are possessed. Under DINA, positive interaction effects mean that possessing one attribute does not fully make up for the lack of the other attribute(s) measured by that item.

Structural model: Attributes' relationships

The final model, with 44 items measuring four attributes, was reestimated after removing all nonsignificant parameters. Using the base-rate probabilities of attribute mastery, the tetrachoric correlations among the attributes were estimated. These correlations were strong, ranging from .962 to .990, suggesting that masters of one attribute are very likely to be masters of the other attributes (Table 3.4). Therefore, if an examinee has not yet mastered paraphrasing, s/he is likely not to have mastered the other three attributes either. This further indicates the close interconnection of the listening attributes measured by the MET test.

Table 3.4  Tetrachoric correlations among attributes

                                      Paraphrasing   Inferencing   Understanding important information   Connecting details
Paraphrasing                          –
Inferencing                           0.98           –
Understanding important information   0.96           0.97          –
Connecting details                    0.98           0.97          0.99                                   –



Attribute mastery classifications

If a test measures a total of N attributes, the number of latent mastery classes is 2^N. In our example, the MET L2 listening comprehension section measured four attributes, producing 2^4 = 16 distinct attribute mastery profiles. The LCDM placed the 963 examinees into these 16 distinct attribute mastery classes and estimated the probability that each examinee is a member of each latent class (Table 3.5). For instance, while the first latent class (0000) indicates that its members have not mastered any of the four attributes measured, the sixth mastery profile (0101) indicates that its members have mastered the second and fourth attributes but not the first and third. The attribute mastery classes with the highest frequencies are the flat attribute mastery profiles (i.e., latent class 1 [0000] and latent class 16 [1111]). Examinees who were classified into these classes either mastered none of the attributes or all of them. This situation may be linked to the fact that the LCDM was applied to

Table 3.5  Attribute mastery classes and their respective attribute mastery profiles

Latent class   Skill mastery profile   Number of examinees
1              0000                    468
2              0001                    1
3              0010                    17
4              0011                    5
5              0100                    9
6              0101                    0
7              0110                    15
8              0111                    2
9              1000                    9
10             1001                    3
11             1010                    0
12             1011                    6
13             1100                    27
14             1101                    5
15             1110                    10
16             1111                    386

data obtained from a nondiagnostic assessment. Consequently, while the majority of examinees were classified into the flat mastery profiles, most of the remaining classes were sparsely populated. However, the LCDM also classified examinees into a wide range of attribute mastery classes. For instance, 27 examinees were placed in latent class 13 [1100], having mastered paraphrasing and inferencing but not understanding important information and connecting details. Fifteen examinees were classified into latent class 7 [0110], whose members possessed inferencing and understanding important information but lacked paraphrasing and connecting details. Finally, the LCDM application to the MET data showed that 46% of the examinees mastered paraphrasing, 47% mastered inferencing, 46% mastered understanding important information, and 42% mastered connecting details.
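These marginal mastery rates follow directly from the class counts in Table 3.5: summing the examinees in every latent class whose profile shows a 1 for a given attribute and dividing by the total of 963 reproduces the percentages reported above, as the short R sketch below confirms.

```r
# Reproducing the attribute mastery percentages from Table 3.5.
# Profile digits (left to right): paraphrasing, inferencing,
# understanding important information, connecting details.
profiles <- c("0000","0001","0010","0011","0100","0101","0110","0111",
              "1000","1001","1010","1011","1100","1101","1110","1111")
counts   <- c(468, 1, 17, 5, 9, 0, 15, 2, 9, 3, 0, 6, 27, 5, 10, 386)

alpha <- t(sapply(strsplit(profiles, ""), as.integer))   # 16 x 4 indicator matrix
colnames(alpha) <- c("paraphrasing", "inferencing",
                     "understanding_important_info", "connecting_details")

round(colSums(alpha * counts) / sum(counts), 2)
# paraphrasing .46, inferencing .47, understanding_important_info .46, connecting_details .42
```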

Discussion

This chapter investigated the merits of the LCDM and its practical implications for language assessment. The LCDM provides language testers with more opportunities and increased feasibility compared with other core DCMs. First, the LCDM can be used as a tool for evaluating the diagnostic and discriminatory capacity of individual test items. Notably, examining the intercept and main effect parameters produced by the test items yields detailed information about the diagnostic capacity of language tests. This information can be even more fine-grained at the item level, since the LCDM allows for examining how items behave when two or more subskills interact. Second, one of the significant advantages of the LCDM over other core DCMs is that it offers the flexibility to examine each item's behavior without putting any a priori constraints on test items. This flexibility may prove extremely beneficial in language testing and assessment, where language testers need to deal with constructs that are highly interactive and compensatory. Therefore, language testers do not need to opt for either a non-compensatory or a compensatory model from the outset. Since it is possible to express other core DCMs using LCDM parameterizations, one may end up with some items that are DINO-like and some items that are rRUM-like. In our particular case, four items tapping into two attributes were DINO-like items, in which, once a subset of the required attributes was mastered, the mastery of additional attributes yielded only a modest additional increase in the probability of a correct response. Moreover, relaxing the constraints on the item parameters may lead to an improved model-to-data fit (Jurich & Bradshaw, 2014). Finally, the LCDM may function as a tool for validating language testers' theory-based conjectures that are expressed in the form of a Q-matrix. If language testers' conjectures are not compatible with the statistical evidence produced by the LCDM, they would need to revisit, refine, or modify these conjectures. In the present study, in line with our conjectures about L2 listening comprehension, we made several modifications to the Q-matrix, which resulted in an improved model-data fit.

If language testers wish to take an inductive approach to CDA and develop a diagnostic language test from the ground up, the LCDM could prove useful for test construction purposes. One relevant example is the construction of a cognitive diagnostic second language reading comprehension (CDSLRC) test by Toprak (2015), in which the LCDM was used to develop a diagnostic test of L2 reading comprehension. The results revealed that both the CDSLRC test and the LCDM estimation were substantially efficient in making reliable diagnostic classifications. Turning to the assessment implications of the study, the analysis of the MET listening test items reported here revealed four attributes that the test developers focused on most heavily. These did not come as a surprise, because they were similar to some of the key attributes presented in the earlier review of the literature. The results of the study therefore suggest that major listening tests focus on more or less the same kinds of listening attributes reported in the L2 listening literature. Further, it is interesting to note that paraphrasing, which this chapter has argued is an assessment technique, has a strong presence in this set of tests as well. Paraphrasing, as explained earlier, is not so much a listening subskill as it is a complex language skill for demonstrating understanding. It is an assessment technique for finding out how well examinees have understood what they heard. The L2 listener would have had to use one or more of the listening attributes to first arrive at a reasonable interpretation of the listening text before expressing their understanding in another way. In other words, examinees can exercise the skill of listening for details, listening for global meaning, or listening to get the main idea, depending on what the question requires them to paraphrase. In this sense, paraphrasing items do not clearly explicate the individual attributes that are required to answer the question well. To do this, it is inevitable that some qualitative manual coding has to be done. Inferencing is one of the most important listening subskills, as it requires listeners to go beyond literal or surface information to understand intended meanings (Goh, 2000). This contrasts with the third attribute, understanding important information, which encompasses the ability to recognize important details while listening and does not rule out the need to listen for important main ideas or simply to grasp the theme of what is said by listening for global meaning. The fourth attribute, connecting details, indicates that examinees are assessed on their discourse knowledge of spoken texts, recognizing discourse cues and other lexical items that create cohesion and coherence. L2 listeners must recognize key pieces of information in different parts of a spoken text and draw the relationships among them in order to create a reasonable understanding of what is needed. L2 learners can benefit from information about their performance to help them continually practice their listening (Buck, 2001) and engage in metacognitive processes to direct their listening development (Vandergrift & Goh, 2012). If such results were to be used as feedback to teachers and learners who sat for the tests, it would be fair to say that all four attributes still require a good deal more work, because fewer than half of the examinees have mastered them.
It is important to state, as a caveat, that the test used in this study was a proficiency test. It is used in this chapter to demonstrate how the LCDM can be applied to listening assessment data, thereby offering an attractive way of providing more fine-grained and potentially more useful feedback to L2 learners. In practice, however, it would be important to create a diagnostic test that addresses a greater variety of listening subskills, covering bottom-up and top-down processes comprehensively. The results would also benefit L2 learners more directly and meaningfully if the test were set at a level appropriate for the learners' stage of listening development.

Limitations and conclusion

This study has several limitations. First, the model-to-data fit was not checked using absolute fit indices. This was a limitation of the Mplus program used for the DCM analysis. Furthermore, although Mplus can easily estimate the LCDM with up to five attributes, it is not suitable for testing Q-matrices that include more attributes. A similar limitation applies to the number of test items that can be estimated using Mplus; the package has difficulty estimating models that include more than 80 items (Templin & Hoffman, 2013). We should note, however, that the LCDM can also be estimated with the CDM and GDINA packages in R, since the G-DINA model and the LCDM are reparameterizations of each other (see Chapter 4, Volume II). Second, the Q-matrix construction is one of the mainstays of DCM assessments. In the present study, the Q-matrix construction was mainly informed by statistical evidence and substantive theory, yet there are several other Q-matrix construction and validation methods in the relevant literature, such as think-aloud protocols, retrospective interviews, expert ratings, and more empirical methods of Q-matrix validation such as the discrimination index proposed by de la Torre and Chiu (2016). While acknowledging these limitations, we should also note that the present study not only adds to the growing body of research on CDA and L2 listening assessment but also showcases how the LCDM can be applied to real language assessment data. Hence, it is hoped that the present study will contribute to increasing familiarity with and access to the LCDM among language testers.

Acknowledgement

The authors would like to thank Michigan Language Assessment for providing the data and test materials for the study.

References Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332. Anderson, J. R. (1995). Cognitive psychology and its implications (4th ed.). New York: Freeman. Aryadoust, V. (2012). Using cognitive diagnostic assessment to model the structure of the lecture comprehension section of the IELTS listening test: A sub-skill-based approach. Asian EFL Journal, 13(4), 81–106.

Log-linear cognitive diagnosis modeling 77 Aryadoust, V. (2017). Communicative testing of listening. In J. I. Liontas & M. DelliCarpini (Eds.), The TESOL encyclopedia of English language teaching. New York, NY: John Wiley & Sons Inc. Aryadoust, V., & Goh, C. C. M. (2014). Exploring the relative merits of cognitive diagnostic models and confirmatory factor analysis for assessing listening comprehension. In E. D. Galaczi & C. J. Weir (Eds.), Studies in language testing volume of proceedings from the ALTE Krakow Conference, 2011. Cambridge, UK: Cambridge University Press. Bernhardt, E. B. (2005). Progress and procrastination in second language reading. Annual Review of Applied Linguistics, 25, 133–150. Bradshaw, L., Izsak, A., Templin, J., & Jacobson, E. (2014). Diagnosing teachers’ understandings of rational numbers: Building a multidimensional test within the diagnostic classification framework. Educational Measurement: Issues and Practice, 33, 2–14. Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press. Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157. de la Torre, J., & Chiu, C. (2016). General method of empirical Q-matrix validation. Psychometrica, 81(2), 253–273. Field, J. C. (2008). Listening in the language classroom. New York: Cambridge University Press. Goh, C. C. M. (2000). A cognitive perspective on language learners’ listening comprehension problems. System, 28, 55–75. Goh, C. C. M., & Aryadoust, V. (2015). Examining the notion of listening subskill divisibility and its implications for second language listening. International Journal of Listening, 29(3), 109–133. Hartz, S. M. (2002). A Bayesian guide for the unified model for assessing cognitive abilities: Blending theory with practicality (unpublished doctoral dissertation). University of Illinois, Department of Statistics, Urbana-Champaign, IL. Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191. Jang, E. E. (2008). A framework for cognitive diagnostic assessment. In C. A. Chapelle, Y. R. Chung, & J. Xu (Eds.), Towards adaptive CALL: Natural language processing for diagnostic language assessment (pp. 117–131). Ames, IA: Iowa State University. Jang, E. E. (2009). Demystifying a Q-matrix for making diagnostic inferences about L2 reading skills. Language Assessment Quarterly, 6(3), 210–238. Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272. Jurich, D. P., & Bradshaw, L. P. (2014). An illustration of diagnostic classification modeling in student learning outcomes assessment. International Journal of Testing, 14(1), 49–72. Kim, A. Y. (2015). Exploring ways to provide diagnostic feedback with an ESL placement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing, 32(2), 227–258. Lee, Y. W., & Sawaki, Y. (2009). Application of three cognitive diagnosis models to ESL reading and listening assessments. Language Assessment Quarterly, 6(3), 239–263. Madison, M., & Bradshaw, L. (2015). The effects of Q-matrix design on classification accuracy in the log-linear cognitive diagnosis model. Educational and Psychological Measurement, 75(3), 1–21.

78  Toprak, Aryadoust, and Goh Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus User’s Guide (8th edition). Los Angeles, CA: Muthén & Muthén. Ravand, H. (2016). Application of a cognitive diagnostic model to a high-stakes reading comprehension test. Journal of Psychoeducational Assessment, 34(8), 782–799. Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement, 6(4), 219–262. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52(3), 333–343. Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354. Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305. Templin, J., & Hoffman, L. (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement: Issues and Practice, 32(2), 37–50. Toprak, T. E. (2015). Cognitive diagnostic assessment of second language reading comprehension: Application of the log-linear cognitive diagnosis modeling to language testing (unpublished doctoral dissertation). Gazi University, Ankara, Turkey. Vandergrift, L., & Goh, C. C. M. (2012). Teaching and learning second language listening. New York: Routledge.

4

Application of a hierarchical diagnostic classification model in assessing reading comprehension Hamdollah Ravand

Introduction

Traditional assessment frameworks such as classical test theory (CTT) and item response theory (IRT) usually estimate a person’s location along a unidimensional latent trait, which may be useful to inform decisions in selecting students who are most likely to succeed in a given course of instruction. Low scores on traditional assessments may point to the need for remedial instruction, but these assessments fail to provide information as to exactly where that instruction is needed. Diagnostic classification models (DCMs) can compensate for this limitation by providing fine-grained formative feedback on examinees’ strengths and weaknesses. Thus, DCMs and traditional assessments can be distinguished in terms of assessment purpose: IRT and CTT models scale test takers, while DCMs classify them into latent classes, the number of which is determined by the number of binary attributes involved in solving the items of any given test. This may promote more individualized instruction by tailoring instruction to the needs of the learners rather than providing uniform instruction to all learners with varying strengths and weaknesses. Despite their utility, the application of DCMs to promote learning has not been widespread, mainly for two reasons (Ravand & Robitzsch, 2015): (i) DCMs are relatively new, and their theoretical underpinnings have not been explicated in a way that is accessible to those interested in their applications; and (ii) there are many controversial issues surrounding their practice, including sample size requirements, the optimal number of attributes, choice of the DCM, issues of model fit, and difficulty in interpreting the results. Furthermore, the extant applications of DCMs (e.g., Kim, 2015; Lee & Sawaki, 2009) are limited to models that deal only with conjunctive/disjunctive relationships among the attributes required to perform successfully in any given assessment. This chapter deals with the aforementioned limitations by reviewing important theoretical considerations in DCMs and illustrating the application of a relatively recent DCM, the hierarchical diagnostic classification model (HDCM; Templin & Bradshaw, 2014). Besides conjunctive/disjunctive relationships, HDCMs can test hypotheses as to hierarchical relationships among the attributes.

The chapter focuses on two main issues in DCMs. It provides a brief introduction to theoretical underpinnings of the models and then illustrates their application. The chapter is laid out as follows. First, an introduction to DCMs is presented. Then, a comprehensive categorization of DCMs is introduced. Later, some critical considerations such as model selection, specificity of the attributes, number of items, model fit, and sample size are discussed. Finally, the application of a hierarchical DCM is illustrated, and the interpretations of the results are discussed.

Literature review

Types of diagnostic classification models (DCMs)

As in other latent trait models, in DCMs a set of discrete latent variables, which could be examinee attributes, language subskills, or processes, predicts performance on each item of any given test. DCMs make varying assumptions as to how the predictor latent attributes condense (i.e., combine) to generate a response to the item. Traditionally, DCMs have been classified into the dichotomies of conjunctive versus disjunctive or compensatory versus non-compensatory models. In the compensatory or disjunctive models, mastery of one or a subset of the required attributes can compensate for nonmastery of the other attributes. In these models, mastery of more attributes does not increase the probability of giving a right answer to the item; that is, under these models, mastery of any subset of attributes is equivalent to mastery of all the required attributes. Compensatory models are appropriate when alternative strategies can each lead to a correct answer to a test item. In non-compensatory DCMs, on the other hand, the predictor attributes combine in an all-or-nothing fashion wherein only the presence of all the required attributes results in a high probability of a correct answer. More recently, additive DCMs have been presented as another category of DCMs. Unlike compensatory DCMs, which do not credit test takers for the number of attributes mastered, in additive DCMs the presence of any one of the attributes affects the probability of a correct response independent of the presence or absence of the other attributes. In the present study, the two dichotomies (conjunctive/disjunctive and compensatory/non-compensatory) are distinguished, and the conjunctive/disjunctive dichotomy is preferred, since the compensatory/non-compensatory dichotomy is more appropriate for continuous latent variables in multidimensional IRT, where scores on each of the latent variables can go to infinity and low scores on one dimension can be offset by high scores on the other(s). In DCMs, the presence of one attribute can never fully compensate for the absence of another required attribute unless the nonmastered attribute is actually not required. Lately, a new categorization of DCMs has been proposed: specific versus general. Specific DCMs are models in which only one type of relationship is possible within any assessment: disjunctive, conjunctive, or additive. In contrast, in

general DCMs (GDCMs) such as the generalized deterministic, noisy, and gate (G-DINA) model (de la Torre, 2011), multiple DCMs are possible within the same assessment. The GDCMs do not assume any prespecified relationships among the attributes underlying any assessment. Rather, each item can select its own model a posteriori. De la Torre (2011) showed that many of the specific DCMs, regardless of whether they are conjunctive, disjunctive, or additive, can be derived from the general models by introducing constraints in the parameterization of the GDCMs. A more recent extension of DCMs, the hierarchical log-linear cognitive diagnosis model (HLCDM; Templin & Bradshaw, 2014), has led to a new category of DCMs: hierarchical versus nonhierarchical. In the hierarchical DCMs (HDCMs), structural relationships among the required attributes are modeled. In instructional syllabi, some teaching materials are presented prior to others. The sequential presentation of skills may be reflected in test takers’ responses to items that require those skills. HDCMs are able to capture the effect of sequential teaching of materials where learning of new skills builds upon prerequisite skills. They also have the potential to explore the inherent hierarchical relationships among constructs. With the previous discussion in mind, the following categorization of the DCMs is suggested. At a macro level, DCMs are divided into general and specific, and at a micro level, specific DCMs are divided into disjunctive, conjunctive, and additive. Furthermore, HDCMs form a new category in both the general and specific DCMs. A further point that needs to be made before wrapping up the discussion of how DCMs can be categorized is that many DCMs are reparameterizations of each other. Changing the link function in a DCM would result in the parameterization of another one. For instance, the additive cognitive diagnostic model (ACDM; de la Torre, 2011) with the identity link function turns into the linear logistic model (LLM; Maris, 1999) and into the compensatory reparameterized unified model (C-RUM; Hartz, 2002) when the link function is changed into logit and log, respectively. It is noteworthy that the C-RUM traditionally has been categorized among the disjunctive models. As another example, the G-DINA turns into the LCDM by the change of the identity link function to logit. Therefore, it seems that the traditional distinctions among the DCMs are getting blurred; however, for ease of reference and continuity with the DCM literature, the categorization in Table 4.1 is suggested.
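To make the condensation rules discussed above concrete, the short R sketch below computes the probability of a correct response to a hypothetical two-attribute item under a conjunctive (DINA-type), a disjunctive (DINO-type), and an additive (identity-link) rule. The guessing, slip, and delta values are invented purely for illustration and are not taken from any model reported in this chapter.

```r
# Condensation rules for a hypothetical item requiring attributes A1 and A2.
# alpha: vector c(A1, A2) of 0/1 mastery indicators.
# Guessing (g), slip (s), and additive deltas below are made-up illustrative values.

p_dina <- function(alpha, g = .2, s = .1) {
  eta <- prod(alpha)                    # 1 only if ALL required attributes are mastered
  (1 - s)^eta * g^(1 - eta)
}

p_dino <- function(alpha, g = .2, s = .1) {
  omega <- 1 - prod(1 - alpha)          # 1 if ANY required attribute is mastered
  (1 - s)^omega * g^(1 - omega)
}

p_additive <- function(alpha, delta0 = .2, delta = c(.3, .3)) {
  delta0 + sum(delta * alpha)           # each mastered attribute adds its own increment
}

profiles <- list(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
sapply(profiles, p_dina)      # .20 .20 .20 .90 -> all-or-nothing (conjunctive)
sapply(profiles, p_dino)      # .20 .90 .90 .90 -> any attribute suffices (disjunctive)
sapply(profiles, p_additive)  # .20 .50 .50 .80 -> credit for each attribute (additive)
```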

DCM application in language assessment

DCM research and application originated in educational measurement, specifically mathematics assessment (e.g., de la Torre & Douglas, 2004; Henson, Templin, & Willse, 2009; Tatsuoka, 1990). As displayed in Table 4.2, most of the DCM applications in language assessment have involved the reading comprehension skill (e.g., Jang, 2005; Kim, 2015; Li, 2011; Ravand, 2016) and, more sparsely, listening (e.g., Buck & Tatsuoka, 1998).

Table 4.1  DCM categorization

Specific DCMs
  Disjunctive
    1) Deterministic input, noisy, or gate model (DINO): Templin and Henson (2006)
    2) Noisy input, deterministic, or gate (NIDO) model: Templin (2006)
  Conjunctive
    1) Deterministic input, noisy, and gate model (DINA): Junker and Sijtsma (2001)
    2) Noisy input, deterministic, and gate (NIDA) model: DiBello et al. (1995); Hartz (2002)
  Additive
    1) Additive CDM (ACDM): de la Torre (2011)
    2) Compensatory reparameterized unified model (C-RUM): DiBello et al. (1995); Hartz (2002)
    3) Non-compensatory reparameterized unified model (NC-RUM): Hartz (2002)
    4) Linear logistic model (LLM): Maris (1999)
  Hierarchical
    1) Hierarchical DINA (HO-DINA) model: de la Torre (2008)

General DCMs
  Disjunctive, Conjunctive, and Additive
    5) General diagnostic model (GDM): von Davier (2005)
    6) Log-linear CDM (LCDM): Henson, Templin, and Willse (2009)
    7) Generalized DINA (G-DINA): de la Torre (2011)
  Hierarchical
    1) Hierarchical diagnostic classification model (HDCM): Templin and Bradshaw (2014)

Note: Originally, the NC-RUM was parameterized as a non-compensatory model. However, Ma, Iaconangelo, and de la Torre (2016) showed that it is an additive DCM with a log link function.

However, few studies have applied DCMs to the writing skill (e.g., Kim, 2011; Xie, 2017). To the best of the author's knowledge, there is no application of DCMs to the speaking skill. As Table 4.2 shows, DINA and NC-RUM are the two models most frequently used in language assessment.

Key considerations in DCM application

Model selection

Choice of the right model is of critical importance in DCMs because model selection affects the classification of test takers (Lee & Sawaki, 2009), which is the primary purpose of DCMs. However, model selection has often been taken for granted, either because there is no explicit theory as to how response processes combine to generate observed responses or because of the availability of software programs and researchers' familiarity with them (Ravand & Robitzsch, 2015). In principle, model selection should be informed by the degree of match between the assumptions of the models and how the attributes underlying the test are assumed to interact.

Table 4.2  DCM studies of language assessment

Study                                    Model                                      Skill
Buck et al. (1997)                       RSM                                        Reading
Buck and Tatsuoka (1998)                 RSM                                        Listening
Buck et al. (2004)                       RSM                                        Reading
von Davier (2005)                        GDM                                        Reading
Jang (2009)                              NC-RUM                                     Reading
Lee and Sawaki (2009)                    NC-RUM, GDM                                Reading
Aryadoust (2011)                         RRUM                                       Listening
Li (2011)                                NC-RUM                                     Reading
Svetina, Gorin, and Tatsuoka (2011)      RSM                                        Reading
Wang and Gierl (2011)                    AHM                                        Reading
Y. H. Kim (2011)                         NC-RUM                                     Writing
Ravand, Barati, and Widhiarso (2012)     DINA                                       Reading
Jang et al. (2013)                       NC-RUM                                     Reading
Zhang (2013)                             NC-RUM                                     Reading
A.-Y. Kim (2015)                         NC-RUM                                     Reading
Ravand (2016)                            G-DINA                                     Reading
Liu et al. (2017)                        G-DINA, A-CDM, DINA, DINO, HO-DINA         Listening
Xie (2017)                               NC-RUM                                     Writing
Yi (2017)                                DINA, DINO, ACDM, LCDM, NC-RUM             Grammar
Ravand and Robitzsch (2018)              DINA, DINO, ACDM, C-RUM, NC-RUM, G-DINA    Reading

Note: RSM = Rule Space Method (Tatsuoka, 1983); AHM = Attribute Hierarchy Method (Gierl, Leighton, and Hunka, 2007)

There have been few studies (e.g., Lee & Sawaki, 2009; Li, Hunter, & Lei, 2016; Ravand & Robitzsch, 2018; Yi, 2017) investigating which DCM could optimally model the relationships among the attributes underlying language tests. All these studies, explicitly or implicitly, have advised against imposing a single DCM on all the items in any language assessment. As a way around the challenge of model selection, one can apply a GDCM first and then let each item select its own model. This suggestion works well in situations where there is no explicit theory of response processes underlying performance on a given test and the sample size is large enough.
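One way to let each item choose its own condensation rule is to fit a saturated general model and then run item-level model comparisons. The sketch below illustrates this idea with the GDINA package; modelcomp() is the item-level comparison routine described in that package's documentation, and resp and Q are placeholder objects rather than the data analyzed in this chapter, so treat it as an assumption-laden outline rather than a recipe.

```r
# Sketch: fit the saturated G-DINA first, then compare reduced models item by item.
library(GDINA)

fit_sat <- GDINA(dat = resp, Q = Q, model = "GDINA")  # saturated general model
mc <- modelcomp(fit_sat)                              # item-level tests of reduced models (e.g., DINA, DINO, ACDM)
mc                                                    # inspect which reduced model is retained for each item

# Items for which no reduced model is supported keep the saturated G-DINA rule;
# the selected rules can then be passed back via the 'model' argument when refitting.
```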

Attributes, items, and sample size

Another important issue that researchers need to consider in DCM application is the optimal number of attributes¹ to be measured, which is also known as the granularity of the subskills. A review of the available literature on the granularity of the attributes measured by a language assessment shows that as the number of attributes increases, the number of items and the sample size should increase as well (DiBello, Roussos, & Stout, 2007). As to the specificity or grain size of

the attributes, Jang (2009) proposed three criteria for determining the number of attributes that are measured by a language assessment: (1) theoretical aspects of the attributes or construct-representativeness, (2) technical aspects or the number of test items associated with each attribute, and (3) practical aspects or the usefulness of the attributes for diagnostic feedback. Choice of the DCM can also impact the number of attributes and the sample size. When the selected DCM is a general model, more attributes for each item means more item parameters, which in turn requires larger sample sizes and more items. In this situation, we would encounter both computational and interpretation challenges. For example, for an item that requires four attributes, 15 item parameters (in addition to the item intercept) could be estimated in the G-DINA: four main effects (of the attributes) and six two-way, four three-way, and one four-way interaction effects (among the attributes). Most DCM applications have used a sample size above 2,000 and have included about five attributes per assessment and at most three attributes per item. Hartz (2002) recommended at least three items for any given attribute.
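The parameter count follows directly from the number of attribute combinations; the short R check below reproduces it and shows how quickly general models grow with the number of attributes an item measures. It is a generic illustration, not tied to any particular dataset.

```r
# Number of G-DINA item parameters for an item measuring K attributes:
# one intercept plus all main-effect and interaction terms, i.e., 2^K in total.
K <- 4
sum(choose(K, 1:K))           # 15 main and interaction effects
2^K                           # 16 parameters once the intercept is included
sapply(1:5, function(k) 2^k)  # 2, 4, 8, 16, 32 -> growth with attributes per item
```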

Model fit

Model fit refers to the degree of correspondence between the selected DCM’s assumptions regarding how attributes combine to generate observed responses and the assumptions of the theory underlying test performance. It should be noted that model fit and the appropriateness of the Q-matrix in DCMs are intertwined. Therefore, model fit can equally refer to how accurately the item-by-attribute relationships have been specified. There are two groups of fit indices in DCMs. Relative fit indices compare the fit of different DCMs and are appropriate for model selection purposes, whereas absolute fit indices are used to evaluate the fit of any given model to the data. To compare DCMs, the log-likelihood, Akaike information criterion (AIC; Akaike, 1974), and Bayesian information criterion (BIC; Stone, 1979) values of the models are compared. Both AIC and BIC introduce a penalty for model complexity; however, sample size is included in the penalty for BIC, so the penalty is larger in BIC. Models with lower AIC and BIC values are preferred. The likelihood ratio (LR) test can be used to compare the log-likelihood values of nested² models against those of a model with more parameters, whereas AIC and BIC can be used to compare both nested and nonnested models. Absolute fit indices are based on the residuals obtained from the difference between the observed and model-predicted values. Unfortunately, for the majority of the absolute fit indices in DCMs, cutoffs or significance tests do not exist. There are two absolute fit indices, MX2 and abs(fcor), for which significance tests are reported. MX2 is the averaged difference between the model-predicted and observed response frequencies, and abs(fcor) is the difference between the observed and predicted Fisher-transformed correlations between pairs of items. For a more detailed account of the absolute fit indices in DCMs, see Ravand and Robitzsch (2018).
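In R, both kinds of fit evidence can be obtained from fitted CDM objects. The sketch below is a generic illustration with placeholder objects (fit_a and fit_b stand for two competing DCMs fitted to the same data); the function names (logLik, AIC, BIC, IRT.compareModels, IRT.modelfit) are taken from the CDM package documentation we consulted and may differ slightly across versions.

```r
# Relative fit: information criteria and model comparison for two fitted DCMs.
library(CDM)

logLik(fit_a); logLik(fit_b)        # log-likelihoods
AIC(fit_a);    AIC(fit_b)           # lower values indicate better relative fit
BIC(fit_a);    BIC(fit_b)           # BIC adds a sample-size-dependent complexity penalty
IRT.compareModels(fit_a, fit_b)     # side-by-side comparison, including LR tests for nested models

# Absolute fit: residual-based statistics (e.g., MX2, abs(fcor)) with significance tests.
IRT.modelfit(fit_a)
```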


Illustrating the application of the HDCM

The second aim of this chapter was to demonstrate how DCMs can be applied to language assessment data. For the purpose of the illustration, the HDCM was applied. Besides having all the capabilities of conventional DCMs, the HDCM also provides a statistical means for testing hypotheses as to the hierarchical relationships among constructs. Since the HDCM is an extension of the G-DINA³, the G-DINA was also estimated. The choice of the DCMs in the present study was driven by theoretical and empirical considerations. General DCMs, such as the G-DINA, are flexible enough to allow more than one DCM within the same assessment (Ravand, 2016). Some studies (e.g., Lee & Sawaki, 2009; Li, Hunter, & Lei, 2016; Ravand & Robitzsch, 2018) have shown that attribute relationships in reading comprehension are a combination of conjunctive, disjunctive, and additive. Therefore, a general DCM (i.e., the G-DINA), which allows all three kinds of relationships within the items of a language assessment, was chosen. As to the selection of the HDCM, a brief explanation is warranted. It is a commonly held belief among language teachers and testers that subskills of reading comprehension (RC) are arranged hierarchically (Alderson, 1990a, 1990b). It seems intuitively plausible to consider that, for example, understanding vocabulary and syntax precedes understanding sentences, which in turn precedes understanding paragraph or text meaning. However, empirical studies have come up with mixed results as to the existence of skill hierarchies. Alderson and Lukmani (1989) found evidence that reading skills could not be separated into hierarchies. Weir, Hughes, and Porter (1990) found flaws in Alderson and Lukmani’s (1989) study and argued that skill hierarchies can be viewed from either an acquisition-order view or a use-involvement view. The former view assumes that lower-order skills should be acquired before the higher-order ones, which Alderson and Lukmani (1989) criticized. The latter view, held by Weir, Hughes, and Porter (1990), assumes that the application of higher-level skills entails some lower-level skills, regardless of the order of acquisition. Overall, it can be said that language teachers and testers distinguish among different levels of difficulty associated with the comprehension of reading passages, a belief that plays out in their teaching and testing practices. However, research results are inconclusive, and further research on the topic is required. To this end, some expert judges were asked to formulate hypotheses as to the attribute dependencies, and then the HDCM was used to test these hypotheses.

Formal representation of the models

G-DINA

The approach adopted by the G-DINA model is analogous to that of analysis of variance (ANOVA): the model is built from a set of main and interaction effects.

The probability in a G-DINA model that student $i$ gets item $j$ correct, where the item requires two attributes, $\alpha_1$ and $\alpha_2$, is defined in Equation 4.1 as follows:

\[
P(X_{ij} = 1 \mid \alpha_1, \ldots, \alpha_K) = \delta_{j0} + \delta_{j1}\alpha_1 + \delta_{j2}\alpha_2 + \delta_{j12}\alpha_1\alpha_2 \tag{4.1}
\]

The parameter $\delta_{j0}$ is denoted as the item intercept, which is the probability of a correct answer to the test item when none of the required attributes for the item has been mastered. For two attributes, there are two main effects, $\delta_{j1}$ and $\delta_{j2}$, and one interaction effect, $\delta_{j12}$. The parameters $\delta_{j1}$ and $\delta_{j2}$ are the changes in the probability of a correct response when only Attribute 1 or only Attribute 2 has been mastered, respectively; and the parameter $\delta_{j12}$ represents the change in the probability of a correct response due to the mastery of both Attributes 1 and 2, over and above their main effects.
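As a numerical illustration of Equation 4.1, the sketch below evaluates the item response function for all four attribute profiles using invented parameter values; these deltas are purely hypothetical and are not estimates from the UEE data analyzed later in this chapter.

```r
# Worked example of Equation 4.1 with hypothetical G-DINA parameters for a two-attribute item.
delta_0  <- 0.20   # intercept: P(correct) when neither attribute is mastered
delta_1  <- 0.30   # main effect of Attribute 1
delta_2  <- 0.25   # main effect of Attribute 2
delta_12 <- 0.20   # interaction: additional boost for mastering both attributes

p_correct <- function(a1, a2) delta_0 + delta_1 * a1 + delta_2 * a2 + delta_12 * a1 * a2

p_correct(0, 0)  # 0.20 -> neither attribute mastered
p_correct(1, 0)  # 0.50 -> only Attribute 1 mastered
p_correct(0, 1)  # 0.45 -> only Attribute 2 mastered
p_correct(1, 1)  # 0.95 -> both mastered: 0.20 + 0.30 + 0.25 + 0.20
```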

HDCM

Conventional DCMs make assumptions about how attributes interact to generate responses to test items, but they take potential dependencies among the attributes for granted. Teaching materials usually are presented in a sequence, so that learning of new materials is dependent on the previously presented instructional points. The hierarchical nature of the syllabi may lead to dependencies among the knowledge components formed in the mind of the examinees. The HDCM (Templin & Bradshaw, 2014) can model attribute dependencies, referred to as attribute hierarchies. The HDCM provides inferential statistics to falsify hypotheses about attribute hierarchies and does so within the flexible framework of the G-DINA. Compared to conventional DCMs, the HDCM is more parsimonious in two ways: (i) the number of attribute profiles⁴ is much smaller, and (ii) fewer parameters are estimated for each item. Due to the dependencies among the required attributes, some attribute profiles or latent classes may not exist in the population of examinees. Take a three-attribute test wherein, according to conventional DCMs, the test takers should be assigned to 2³ = 8 latent classes or attribute profiles. However, when there are linear dependencies among the attributes, so that, for example, mastery of Attribute 3 requires mastery of Attribute 2, which in turn requires mastery of Attribute 1, the number of attribute profiles reduces to A + 1 (i.e., 3 + 1 = 4), since the following profiles do not exist: [0,1,1], [0,1,0], [1,0,1], [0,0,1]. The parsimony advantage becomes more pronounced when there are many attributes underlying a given test. For a 10-attribute test, for example, examinees are classified into 1,024 latent classes in conventional DCMs, whereas in the HDCM, when the hierarchy is linear, the test takers are assigned to only 10 + 1 = 11 latent classes. In terms of item parameters, some main effects that are not supported by the hierarchical structure of the attributes are set to zero. For a two-attribute scenario where mastery of Attribute 2 ($\alpha_2$) is dependent on mastery of Attribute 1 ($\alpha_1$), the G-DINA can be rewritten as

\[
P(X_{ij} = 1 \mid \alpha_1, \ldots, \alpha_K) = \delta_{j0} + \delta_{j1}\alpha_1 + \delta_{j2(1)}\alpha_1\alpha_2 \tag{4.2}
\]

where $\delta_{j2(1)}\alpha_1\alpha_2$ represents an interaction for Attribute 2 nested within Attribute 1. As one might note, there is no main effect for Attribute 2 in Equation 4.2 because the attribute structure does not allow such a main effect. The main effect for each attribute refers to the increase in the probability of getting any given item correct when the other attributes have not been mastered. If mastery of Attribute 2 is dependent on mastery of Attribute 1, the main effect of Attribute 2 (i.e., the increase in the probability of getting the item correct when Attribute 1 has not been mastered) is set to zero.
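The reduction of the latent-class space described above can be verified with a few lines of R. The sketch below enumerates all profiles for a three-attribute test and keeps only those consistent with a linear hierarchy (Attribute 1 before Attribute 2 before Attribute 3); it is a generic check, independent of any particular dataset.

```r
# Permissible attribute profiles under a linear hierarchy A1 -> A2 -> A3.
A <- 3
profiles <- as.matrix(expand.grid(rep(list(0:1), A)))  # all 2^A = 8 profiles
colnames(profiles) <- paste0("A", 1:A)

# A profile is permissible only if no attribute is mastered before its prerequisite.
permissible <- profiles[profiles[, "A2"] <= profiles[, "A1"] &
                        profiles[, "A3"] <= profiles[, "A2"], , drop = FALSE]
permissible
# Only A + 1 = 4 profiles survive: [0,0,0], [1,0,0], [1,1,0], [1,1,1];
# [0,1,1], [0,1,0], [1,0,1], and [0,0,1] are ruled out by the hierarchy.
```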

Sample study: Application of the HDCM to reading comprehension

DCMs have mostly been employed to analyze tests already developed and calibrated through a nondiagnostic framework such as IRT or CTT, a practice that has been referred to as retrofitting (e.g., Aryadoust, 2011; Chen & Chen, 2016; Kim, 2015; Li, 2011; Ravand, 2016; Ravand, Barati, & Widhiarso, 2012; Templin & Bradshaw, 2014). Almost all the DCM studies on language assessment have ignored the hierarchical relationships among the attributes (Ravand & Robitzsch, 2018); “relationship” has come to mean conjunctive/disjunctive, yet taking hierarchical relationships into account in DCMs is just as important (Ravand & Robitzsch, 2018). One possible reason for the neglect of attribute dependencies in DCM studies may have been the unavailability of proper models to test hypotheses as to attribute hierarchies. Using the HDCM, researchers can discover hierarchies empirically and test hypotheses as to the hierarchical structure of the attributes. Thus, the present chapter intends to walk the readers through the procedures of HDCM application. To this end, the RC data of 3,000 Iranian test takers who took the University Entrance Examination (UEE) in 2009 to seek admission into Master's in English programs were used. The UEE was composed of two main sections: the general English (GE) section and the content knowledge section. The GE section contained 60 multiple-choice items of grammar, vocabulary, and reading comprehension. This study focuses on the 20 RC items of the GE section. The 20 RC items were referenced to three passages, and the test takers had about 20 minutes to answer the items.

Research questions

The present study intends to answer the following research questions:

i. What are the attributes underlying the reading comprehension section of the University Entrance Examination?
ii. Can an HDCM model the relationships among the attributes that test takers engage when responding to the reading comprehension section of the University Entrance Examination?


Q-matrix

The core component of every DCM is a Q-matrix, which represents which attributes are measured by each item. A Q-matrix (see Table 4.3) is a matrix where rows represent items and columns represent the attributes required to perform successfully on the items. There are either 0s or 1s at the intersection of each row and each column: a 0 indicates that the item does not require the attribute, and a 1 indicates that it does. Reviewing the reading comprehension literature, a list of 10 attributes was drawn up by the author and presented to five expert judges. The judges were Iranian university instructors who held PhDs in applied linguistics with at least 10 years of teaching and researching reading comprehension. Prior to coding the test items, they were trained in a 30-minute session on how to code the items for attributes. Then, the 20 reading comprehension items and the list of 10 attributes were given to each coder. They were asked to read the passages, answer the items, and code each item independently for each attribute. They were asked to rate how sure they were that each attribute was necessary for each item on a scale of one to five. Attributes that were rated at least four by at least three of the raters were included in the initial Q-matrix. In the next step, the Q-matrix was subjected to statistical analysis through the procedure proposed by de la Torre and Chiu (2016). The procedure is based on a discrimination index that measures the degree to which an item discriminates among different reduced q-vectors⁵ and can be used in conjunction with the G-DINA and all the constrained models subsumed under it.

Table 4.3  The final Q-matrix

Item   Lexical Meaning   Cohesive Meaning   Paragraph Meaning   Summarizing   Inferencing
 1            0                 1                  1                 1             0
 2            1                 1                  1                 0             0
 3            1                 0                  0                 0             1
 4            1                 0                  0                 1             0
 5            1                 0                  0                 0             0
 6            0                 0                  1                 1             0
 7            1                 1                  1                 0             0
 8            1                 1                  0                 0             1
 9            0                 1                  1                 0             0
10            0                 1                  1                 0             1
11            0                 0                  1                 0             1
12            0                 1                  0                 0             1
13            0                 0                  1                 1             0
14            1                 1                  0                 0             0
15            1                 0                  1                 0             1
16            1                 1                  0                 0             1
17            1                 1                  1                 0             0
18            1                 0                  0                 0             1
19            0                 1                  1                 0             1
20            1                 0                  0                 0             1

This procedure by de la Torre and Chiu (2016) identifies potential misspecifications and provides suggestions for modification of the Q-matrix. The suggested modifications involve either turning 0 entries into 1s or vice versa. Overall, 12 revisions were suggested. In eight cases, the suggestion was to turn 0s into 1s (i.e., adding the attribute to the q-vector of the respective item), and in four cases, it was suggested to turn 1s into 0s (i.e., deleting the attribute from the q-vector of the respective item). If the suggested changes were theoretically supported, the Q-matrix was modified. The five expert judges met to discuss the suggested changes and concluded that all the deletion suggestions were unnecessary. Therefore, it was decided to delete none of the attributes from the q-vector of any of the items. Addition of attributes to the Q-matrix was considered theoretically sensible if at least one expert judge had rated the necessity of the respective attribute at least four. All the suggested cases passed the criterion for inclusion. The final Q-matrix is displayed in Table 4.3. There are five attributes in the Q-matrix: lexical knowledge, cohesive knowledge, paragraph knowledge, summarizing, and inferencing. Lexical knowledge refers to the semantic meaning of words, as opposed to their implied meaning. Cohesive knowledge refers to knowledge of various cohesive devices such as conjunction, reference, substitution, and ellipsis. Paragraph knowledge refers to knowledge of the overall paragraph meaning. Summarizing refers to understanding the overall meaning of a text, and inferencing refers to understanding text meaning derived from inferring from the text or prior knowledge. In a nutshell, one item measured one attribute, 10 items measured two attributes, and nine items measured three attributes. Consequently, the average number of attributes per item was 48/20 = 2.4.
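For readers who wish to reproduce this kind of empirical Q-matrix check, the sketch below uses the GDINA package, whose Qval() function implements discrimination-index-based validation in the spirit of de la Torre and Chiu (2016). The objects resp and Q_initial are placeholders, and the exact structure of the returned output may vary across package versions, so treat the snippet as an outline under those assumptions.

```r
# Sketch: empirical Q-matrix validation after fitting the saturated G-DINA model.
library(GDINA)

fit_init <- GDINA(dat = resp, Q = Q_initial, model = "GDINA")
qval_out <- Qval(fit_init)   # flags q-entries whose modification is statistically supported
qval_out                     # print suggested changes (e.g., 0 -> 1 or 1 -> 0 per item/attribute)

# As in the chapter, statistical suggestions should be accepted only when
# content experts judge them to be theoretically defensible.
```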

Specifying attribute dependencies

In an HDCM, besides identifying the attributes underlying successful performance on any given test, one needs to specify how the required attributes depend on each other. The five expert judges were asked to think of the possible dependency patterns among the attributes. Leighton, Gierl, and Hunka (2004) identified four different types of attribute hierarchies (Figure 4.1). In a linear hierarchy, the attributes are ordered in just one branch. In a convergent hierarchy, an attribute in a branch has multiple prerequisites, whereas in a divergent hierarchy, several branches are derived from the same attribute. In an unstructured hierarchy, many different branches are linked to the same attributes but are not linked to each other. Two of the judges identified a divergent pattern such that the attributes were sequenced as in the HDCM in Figure 4.2. Two other judges suggested a linear hierarchy (HDCM 1). Finally, one of the judges identified another divergent hierarchy (HDCM 2), as shown in Figure 4.2.

Figure 4.1  Some possible attribute hierarchies.

Figure 4.2  Hypothesized hierarchies.

Analysis procedures

To analyze the data, the GDINA package (de la Torre & Chiu, 2016) and the CDM package (Robitzsch, Kiefer, George, & Uenlue, 2017) in R were used. Q-matrix validation was carried out using the GDINA package; for the rest of the analyses, the CDM package was employed. Both packages employ marginalized maximum likelihood estimation (MMLE) with the expectation-maximization (EM) algorithm to estimate model parameters. To answer the second research question, the G-DINA and three different HDCMs (Figure 4.2) were compared. First, the absolute fit indices were examined and, for the models that fit the data, the relative fit indices were also investigated. To test whether the observed differences in the relative fit indices were statistically significant, the log-likelihoods of the well-fitting models were compared using the likelihood ratio (LR) test. The prevalence of the attribute profiles was also examined as evidence as to the hierarchical nature of the attributes. Comparing the latent classes with near-zero membership with those that are relatively highly populated provides evidence as to the patterns of attribute dependencies.
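The sketch below illustrates, in hedged form, how such a comparison might be set up with the CDM package: the saturated G-DINA is fitted first, and a hierarchy-constrained model is then approximated by restricting the structural model to the skill classes permitted by the hypothesized hierarchy (lexical knowledge before cohesive, paragraph, and summarizing knowledge in one branch, and before inferencing in the other). The full HDCM additionally constrains item-level main effects, so this is only an approximation; dat and Q are placeholders, and the skillclasses argument name is taken from the CDM documentation we consulted.

```r
library(CDM)
# dat: 3,000 x 20 response matrix; Q: 20 x 5 Q-matrix (Table 4.3) - placeholders here.

# 1) Saturated G-DINA model
fit_gdina <- gdina(data = dat, q.matrix = Q, rule = "GDINA")

# 2) Skill space implied by the hypothesized hierarchy in Figure 4.2:
#    lexical -> cohesive -> paragraph -> summarizing, and lexical -> inferencing
classes <- as.matrix(expand.grid(rep(list(0:1), 5)))
colnames(classes) <- c("lex", "coh", "par", "sum", "inf")
keep <- classes[, "coh"] <= classes[, "lex"] &
        classes[, "par"] <= classes[, "coh"] &
        classes[, "sum"] <= classes[, "par"] &
        classes[, "inf"] <= classes[, "lex"]
hier_classes <- classes[keep, ]   # 9 permissible profiles, matching Table 4.6

# 3) Hierarchy-constrained model (an HDCM-style approximation)
fit_hier <- gdina(data = dat, q.matrix = Q, rule = "GDINA",
                  skillclasses = hier_classes)

# 4) Compare the two models
IRT.modelfit(fit_gdina); IRT.modelfit(fit_hier)   # absolute fit
IRT.compareModels(fit_gdina, fit_hier)            # AIC, BIC, and LR test
```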

Table 4.4  Absolute fit indices

Model      MX2 (p)          abs(fcor) (p)
G-DINA     11.54 (.13)      0.06 (.06)
DINA       65.14 (< .001)   0.15 (< .001)
HDCM       8.37 (.15)       0.07 (.07)
HDCM 1     65.24 (< .001)   0.15 (< .001)
HDCM 2     67.55 (< .001)   0.15 (< .001)

Note: p = p-value for statistical significance

Results

The first step in interpreting DCM results is to check the fit of the models. The nonsignificant MX2 and abs(fcor) values (Table 4.4) suggest that the G-DINA and the HDCM fit the data. Thus, for the rest of the analyses, the results of HDCM 1 and HDCM 2 are not considered. Since the HDCM is nested in the G-DINA, the likelihood ratio (LR) test can be used to compare its log-likelihood against that of the G-DINA. The HDCM is a constrained case of the G-DINA model, and thus it is expected to have a lower log-likelihood value. A nonsignificant difference in the log-likelihood values indicates that the nested model, which is more parsimonious, does not result in a significant loss of fit. As Table 4.5 shows, the HDCM fits the data almost as well as the G-DINA. Also, the AIC and BIC values for the HDCM and the G-DINA are almost the same. The results indicate that the HDCM fits as well as the G-DINA; however, of two equally well-fitting models, the more parsimonious one is preferred. The prevalence of the attribute profiles can also be suggestive of dependencies among the attributes. Table 4.6 shows the prevalence of the attribute profiles according to the G-DINA. The first column of the table shows the 2⁵ = 32 attribute profiles; the second column represents the probability of each attribute profile. When most of the attribute profiles are possessed by an insignificant number of respondents, the implication is that some of the attribute patterns are not possible for the sample of the study; hence, attribute dependencies might be inferred (Templin & Bradshaw, 2014). Attribute profiles with probabilities < .05, i.e., possessed by less than 5% of the respondents, may be considered not sizably populated. Consequently, nine of the latent classes generated by the G-DINA contained a sizable number of examinees. It should be noted that the order of the attributes in the attribute profiles is the same as their order in the Q-matrix in Table 4.3 (i.e., lexical meaning, cohesive meaning, paragraph meaning, summarizing, and inferencing).

Table 4.5  Relative fit indices

Model     LL        Npars    AIC      BIC      χ2    Df    P
G-DINA    −37292    131      74846    75043    −2    1     .81
HDCM      −37293    130      74846    75041    —     —     —

Note: LL = log-likelihood value; Npars = number of parameters

Table 4.6  Attribute profile prevalence

Profile    Probability
00000*     .07
10000*     .11
01000      0
11000*     .05
00100      0
10100      0
01100      .01
11100*     .11
00010      0
10010      .01
01010      .01
11010      0
00110      .01
10110      .01
01110      0
11110*     .15
00001      0
10001*     .05
01001      0
11001*     .10
00101      0
10101      .01
01101      0
11101*     .05
00011      0
10011      0
01011      .02
11011      0
00111      0
10111      .01
01111      .01
11111*     .21

Note: Skill profiles where all the prerequisite attributes have been mastered are marked with an asterisk.

As Figure 4.2 shows, in the HDCM diagram there are two branches in the hierarchy of attributes, both diverging from lexical knowledge. One branch ends in inferencing, while the other leads to cohesive knowledge, then to paragraph knowledge, and finally to the summarizing attribute. Thus, attribute profiles in which lexical knowledge has not been mastered (e.g., [01111], [00111], [01000], [00100], [01011]) should not exist in the sample of examinees, because lexical knowledge is the prerequisite of all the attributes in both branches. As Table 4.6 shows, almost all the latent classes where lexical knowledge has not been mastered (where the first attribute in the attribute profile is represented by a 0) are meagerly populated. Furthermore, since cohesive knowledge is a prerequisite for paragraph knowledge, which is in turn a prerequisite for the summarizing attribute, attribute profiles such as [10101], [10111], [11010], and [00010] are not possible in the sample of the respondents. The nonexistence of these four attribute profiles suggests, for example, that cohesive knowledge (the second attribute) should be mastered before paragraph knowledge (the third attribute). Finally, the nonexistence of the attribute profiles [11010] and [00010] suggests that paragraph knowledge (the third attribute) should be mastered before summarizing (the fourth attribute). The attribute profiles where all the prerequisite attributes have been mastered are marked with an asterisk in Table 4.6.
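If readers wish to generate a table like Table 4.6 from their own analysis, the estimated skill-class distribution can be read off the fitted object. The brief sketch below assumes the fit_gdina object from the earlier sketch; the attribute.patt component name follows the CDM package documentation we consulted and may differ across versions, so it should be checked before use.

```r
# Sketch: inspecting estimated attribute-profile (skill class) prevalence.
summary(fit_gdina)                 # prints the estimated skill class probabilities

# The class probabilities are also stored in the fitted object
# (component name assumed from the CDM documentation; check your version):
prev <- fit_gdina$attribute.patt
head(prev[order(-prev[, 1]), , drop = FALSE], 10)   # most prevalent profiles first

# Profiles with probabilities below .05 can be flagged as negligibly populated,
# mirroring the criterion used in this chapter.
```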

Discussion

The present chapter aimed at providing an easy-to-follow introduction to DCMs and at illustrating the application of a recently developed, rarely applied DCM, the hierarchical diagnostic classification model (HDCM). The chapter started with some important theoretical issues and key considerations in the application of DCMs. The readers were then walked through both the qualitative and the empirical Q-matrix development and validation. The present chapter is among the first attempts to apply the HDCM; therefore, it served to illustrate how to formulate different hypotheses as to attribute dependencies using expert judgment and how to test the hypotheses using the HDCM. To this end, the G-DINA and three different HDCMs, each of the latter representing a different hypothesis as to attribute dependencies, were fitted to the RC data. Absolute fit indices showed that both the G-DINA and the HDCM fit the data well, whereas HDCM 1 and HDCM 2 did not fit. In terms of the relative fit indices, the LR test showed no significant difference between the G-DINA and the HDCM, and the AIC and BIC values for the HDCM were almost the same as those of the G-DINA, indicating comparable fit of the two models. The model fit results were corroborated by the prevalence of the attribute profiles. The results showed that under the G-DINA, only nine of the attribute mastery profiles had sizable memberships. The fact that few examinees were estimated to have, for example, the attribute profiles [01111], [00111], [00011], and [00001] reinforces the hypothesis that lexical knowledge is a prerequisite to all the attributes. Moreover, very low membership for the attribute profiles [10101], [10111], [11010], and [00010] suggests that, for example, cohesive knowledge (the second attribute) should be mastered before paragraph knowledge (the third attribute). Finally, the nonexistence of the attribute profiles [11010] and [00010] suggests that paragraph knowledge (the third attribute) should be mastered before summarizing (the fourth attribute).

Overall, the results showed that there are dependencies among the attributes of reading comprehension. It can be argued that it may not be possible to separate reading comprehension subskills into higher-order and lower-order skills, as Alderson and Lukmani (1989) found, but it may be possible to think of attributes as being dependent upon each other. This dependency may not reflect the order of acquisition but rather the sequence of presentation of, or the amount of emphasis put on, the reading subskills in school and university syllabi. Most probably, vocabulary is one of the first subskills of reading presented to students. Understanding cohesive relationships among sentences is emphasized and practiced prior to understanding paragraph meaning, and understanding explicitly stated information is most probably taught before inferencing. Perfetti and Stafura (2014) argued that word knowledge is central to reading comprehension. There is a common belief among reading comprehension researchers (e.g., Perfetti, Yang, & Schmalhofer, 2008; Yang, Perfetti, & Schmalhofer, 2005, 2007) that lexical knowledge is an elemental prerequisite in reading comprehension, which distinguishes the more skilled reading comprehenders from the less skilled ones. According to Grabe (2009), inferencing is dependent on prior knowledge and vocabulary knowledge, among other things. The results of the present study also showed that summarizing depended on paragraph meaning, which in turn depended on cohesive meaning, which finally depended on lexical meaning. Summarizing requires understanding the overall text and extracting its gist. Understanding the gist of a passage requires many knowledge sources, such as vocabulary, grammar, and discourse structure (Pressley, 2002). Concurring with claims made by Grabe and Stoller (2002) and Lumley (1993), Kim (2015) also found a hierarchy of reading comprehension attributes; in this hierarchy, inferencing and summarizing stood higher than paragraph knowledge, cohesive meaning, and lexical meaning. The results of the present study support the use-involvement view (Weir, Hughes, & Porter, 1990) regarding attribute hierarchy in RC. However, researchers are cautioned against inferring hierarchical relationships among attributes from a one-shot study such as the present one; to provide stronger evidence for the presence of a hierarchy of relationships among attributes, longitudinal studies should be carried out. One might argue that, because both the absolute and the relative fit indices showed that the G-DINA and the HDCM fit almost equally well and should yield very similar estimates, there is no need to test hierarchical relationships at all. However, it can be argued that the HDCM should be preferred for the following reasons: (1) the prevalence of the attribute profiles pointed to a hierarchy, and (2) even in the case of comparable fit and performance of any two models, the more parsimonious one should be chosen.

Conclusion

The present chapter discussed some theoretical issues regarding DCMs and illustrated the application of an HDCM. Hopefully, pedagogical studies such as the present one make DCMs more accessible to researchers and practitioners alike. The present study showed that DCMs can be used not only to discover conjunctive/disjunctive relationships among the attributes underlying performance on language tests but also to discover how these attributes might depend upon each other. Substantively speaking, the inspection of both attribute profiles and model fit results showed that subskills of reading comprehension were dependent upon each other. Some attributes are mastered only when the prerequisite subskills have been mastered.

Notes

1. Attributes, skills, and subskills are strategies taken by test takers to answer test items. In the present chapter, the terms are used interchangeably.
2. When the parameters of Model A, for example, are a subset of the parameters in Model B, Model A is nested in Model B.
3. Originally, Templin and Bradshaw (2014) advanced the HLCDM as an extension of the LCDM. However, for the following reasons, it was decided to do the analysis with an extension of the G-DINA: (1) the LCDM and G-DINA are reparameterizations of each other (i.e., they differ only in their link functions), and in the present study the application of the HDCM, which is the hierarchical extension of the G-DINA, is illustrated; (2) estimating the LCDM entails using two different software programs (i.e., Mplus and SAS), which are neither user friendly nor time efficient (Ravand & Robitzsch, 2018), whereas the HDCM can be carried out with the user-friendly CDM package in R.
4. An attribute profile is composed of a set of 0s and 1s, representing nonmastery and mastery of the required attributes, respectively, for each respondent. For example, the attribute profile [110] for a respondent on a test requiring three attributes indicates that the respondent has mastered the first two attributes but not the third one.
5. Each row of a Q-matrix is a q-vector.

References Akaike, H. (1974). A new look at the statistical identification model. IEEE Transactions on Automated Control, 19, 716–723. doi:10.1109 ⁄  TAC.1974.1100705 Alderson, J. C. (1990a). Testing reading comprehension skills (Part Two). Reading in a Foreign Language, 7(1), 465–503. Alderson, J. C. (1990b). Testing reading comprehension skills (Part One). Reading in a Foreign Language. 6(2), 425–438. Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as embodied in test questions. Reading in a Foreign Language, 5(2), 253–270. Aryadoust, V. (2011). Application of the fusion model to while-listening performance tests. SHIKEN: JALT Testing and Evaluation SIG Newsletter, 15(2), 2–9. Retrieved from http: ⁄  ⁄  jalt.org ⁄ test ⁄ ary_2.htm Buck, G., Tatsuoka, K., & Kostin, I. (1997). The subskills of reading: Rule-space analysis of a multiplechoice test of second language reading comprehension. Language Learning, 47(3), 423–466. Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15, 119−157. doi: 10.1177 ⁄ 026553229801500201 Buck, G., VanEssen, T., Tatsuoka, K., Kostin, I., Lutz, D., & Phelps, M. (2004). Development, selection, and validation of a set of cognitive and linguistic attributes for the SAT I verbal: Critical reading section. Princeton, NJ: Educational Testing Services. Chen, H., & Chen, J. (2016). Retrofitting non-cognitive-diagnostic reading assessment under the generalized DINA model framework. Language Assessment Quarterly, 13(3), 218–230.

96  Ravand von Davier, M. (2005). A general diagnostic model applied to language testing data (RR-05-16). Princeton, NJ: Educational Testing Service. de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69(3), 333–353. doi: 10.1007 ⁄  bf02295640 de la Torre, J. (2008). An empirically based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement, 45(4), 343–362. de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76(2), 179–199. doi: 10.1007 ⁄ s11336-011-9207-7 de la Torre, J., & Chiu, C.-Y. (2016). A general method of empirical Q-matrix validation. Psychometrika, 81(2), 253–273. DiBello, L. V., Roussos, L. A., & Stout, W. F. (2007). Review of cognitively diagnostic assessment and a summary of psychometric models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, vol. 26: psychometrics (pp. 979–1030). Amsterdam, The Netherlands: Elsevier. DiBello, L. V., Stout, W. F., & Roussos, L. (1995). Unified cognitive psychometric assessment likelihoodbased classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–390). Hillsdale, NJ: Lawrence Erlbaum. Grabe, W. (2009). Reading in a second language: Moving from theory to practice. New York, NY: Cambridge University Press. Grabe, W., & Stoller, F. L. (2002). Teaching and researching reading. Harlow, UK: Longman. Gierl, M. J., Leighton, J. P., & Hunka, S. M. (2007). Using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 242–274). New York, NY: Cambridge University Press. Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign, Illinois. Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210. Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching and learning in the context of NG TOEFL. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign, Illinois. Jang, E. E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for fusion model application to LanguEdge assessment. Language Testing, 26(1), 031–073. Jang, E. E., Dunlop, M., Wagner, M., Kim, Y. H., & Gu, Z. (2013). Elementary school ELLs’ reading skill profiles using cognitive diagnosis modeling: Roles of length of residence and home language environment. Language Learning, 63(3), 400–436. Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272. Kim, A.-Y. (2015). Exploring ways to provide diagnostic feedback with an ESL placement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing, 32(2), 227–258.

Hierarchical diagnostic classification 97 Kim, Y. H. (2011). Diagnosing EAP writing ability using the reduced reparameterized unified model. Language Testing, 28(4), 509–541. doi:10.1177 ⁄ 0265532211400860 Lee, Y.-W., & Sawaki, Y. (2009). Application of three cognitive diagnosis models to ESL reading and listening assessments. Language Assessment Quarterly, 6(3), 239–263. doi: 10.1080 ⁄ 15434300903079562 Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41(3), 205–237. Li, H. (2011). Evaluating language group differences in the subskills of reading using a cognitive diagnostic modeling and differential skill functioning approach. Unpublished doctoral dissertation, Penn State University, State College, PA. Li, H., Hunter, C. V., & Lei, P.-W. (2016). The selection of cognitive diagnostic models for a reading comprehension test. Language Testing, 33(3), 391–409. Liu, R., Huggins-Manley, A. C., & Bulut, O. (2018). Retrofitting diagnostic classification models to responses from IRT-based assessment forms. Educational and Psychological Measurement, 78(3), 357–383. doi: 10.1177 ⁄ 0013164416685599 Lumley, T. (1993). The notion of subskills in reading comprehension test: An EAP example. Language Testing, 10(3), 211–234. Ma, W., Iaconangelo, C., & de la Torre, J. (2016). Model similarity, model selection, and attribute classification. Applied Psychological Measurement, 40(3), 200–217. Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212. Perfetti, C., & Stafura, J. (2014). Word knowledge in a theory of reading comprehension. Scientific Studies of Reading, 18(1), 22–37. Perfetti, C. A., Yang, C-L., & Schmalhofer, F. (2008). Comprehension skill and word-­ to-text processes. Applied Cognitive Psychology, 22(3), 303–318. Pressley, M. (2002). Comprehension strategy instruction: A turn-of-the-century status report. In C. Block & M. Pressley (Eds.), Comprehension instruction: Research-based best practices (pp. 11–27). New York: Guilford Press. Ravand, H. (2016). Application of a cognitive diagnostic model to a high-stakes reading comprehension test. Journal of Psychoeducational Assessment, 34(8), 782–799. Ravand, H., Barati, H., & Widhiarso, W. (2012). Exploring diagnostic capacity of a highstakes reading comprehension test: A pedagogical demonstration. Iranian Journal of Language Testing, 3(1), 12–37. Ravand, H., & Robitzsch, A. (2015). Cognitive diagnostic modeling using R. Practical Assessment, Research and Evaluation, 20(11), 1–12. Ravand, H., & Robitzsch, A. (2018). Cognitive diagnostic model of best choice: A study of reading comprehension. Educational Psychology, 38(10), 1255–1277. doi: 10.1080 ⁄  01443410.2018.1489524. Robitzsch, A., Kiefer, T., George, A., & Uenlue, A. (2017). CDM: Cognitive diagnosis modeling. R package version 3.1-14: retrieved from the Comprehensive R Archive Network [CRAN] at http: ⁄ ⁄ CRAN. R-project. org ⁄ package=CDM. Stone, M. (1979). Comments on Model Selection Criteria of Akaike and Schwarz. Journal of the Royal Statistical Society: Series B (Methodological), 41(2), 276–278. doi: 10.1111 ⁄  j.2517-6161.1979.tb01084.x Svetina, D., Gorin, J. S., & Tatsuoka, K. K. (2011). Defining and comparing the reading comprehension construct: A cognitive-psychometric modeling approach. International Journal of Testing, 11(1), 1–23.

98  Ravand Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354. Templin, J. (2006). CDM user’s guide. Unpublished manuscript. Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79, 317–339. Tatsuoka, K. K. (1990). Toward an integration of item-response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ, US: Lawrence Erlbaum Associates, Inc. Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11(3), 287–305. Wang, C. & Gierl, M. J. (2011). Using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills in critical reading. Journal of Educational Measurement, 48(2), 165–187. Weir, C., Hughes, A., & Porter, D. (1990). Reading skills: Hierarchies, implicational relationships and identifiability. Reading in a Foreign Language, 7(1), 505. Xie, Q. (2017). Diagnosing university students’ academic writing in English: Is cognitive diagnostic modelling the way forward? Educational Psychology, 37(1), 26–47. doi: 10.1080 ⁄ 01443410.2016.1202900 Yang, C-L., Perfetti, C. A., & Schmalhofer, F. (2005). Less skilled comprehenders’ ERPs show sluggish word-to-text integration processes. Written Language & Literacy, 8(2), 233–257. Yang, C-L., Perfetti, C. A., & Schmalhofer, F. (2007). Event-related potential indicators of text integration across sentence boundaries. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(1), 55–89. Yi, Y.-S. (2017). In search of optimal cognitive diagnostic model(s) for ESL grammar test data. Applied Measurement in Education, 30(2), 82–101. doi: 10.1080 ⁄ 08957347.2017.1283314 Zhang, J. (2013). Relationships between missing responses and skill mastery profiles of cognitive diagnostic assessment. (Unpublished doctoral dissertation). University of Toronto, Toronto, ON. Canada.

Section II

Advanced statistical methods in language assessment

5

Structural equation modeling to predict performance in English proficiency tests Xuelian Zhu, Michelle Raquel, and Vahid Aryadoust

Introduction Structural equation modeling (SEM) is an advanced statistical technique to examine the relationships among groups of observed and latent variables by estimating covariances and means in experimental or nonexperimental designs (Kline, 2015; Ockey & Choi, 2015). SEM, also called analysis of covariance structures, covariance structure modeling, or covariance structure analysis, has been applied in a wide range of areas such as psychology (e.g., Breckler, 1990; MacCallum & Austin, 2000), management (e.g., Shah & Goldstein, 2006), marketing (e.g., Williams, Edwards, & Vandenberg, 2003), and language assessment (Purpura, 1998; Wilson, Roscoe, & Ahmed, 2017). SEM offers an advantage over other statistical methods such as regression analysis (see Chapter 10, Volume I), as it analyzes both latent and observed variables as well as provides measurement error estimates. Latent variables are factors that cannot be directly observed but can be inferred from other variables that can be directly observed or measured. In SEM, latent variables can represent psychological and cognitive traits, such as intelligence, language ability, and happiness, to name a few. On the other hand, observed variables, also known as manifest variables, are those variables that can be directly observed or measured. For example, happiness could be a latent variable that cannot be measured directly, but it can be inferred by the times a person smiles or the socialization intention, and so on. In SEM, the observed variables can be categorical or continuous, while latent variables are always continuous (Schumacker & Lomax, 2010). Among the advantages SEM has over other statistical methods are the following: i SEM evaluates the statistical validity of multiple complex theoretical ⁄­conceptual models, ensuring the accuracy and reliability of the conclusions. ii SEM estimates and incorporates measurement error for endogenous variables, whereas in, for example, regression analysis, only common variance is used and measurement error is estimated as part of the observed variance.

102  Zhu, Raquel, and Aryadoust iii SEM analyzes direct and indirect relationships, as well as assesses claimed “causal” relationships among variables in one analysis (Byrne, 2001; In’nami & Koizumi, 2011b). iv SEM provides graphical illustrations representing the relationships among hypothesized factors using software packages such as LISREL (Jöreskog & Sörbom, 2006), AMOS (Arbuckle, 2003), EQS (Bentler, 2008), or Mplus (Muthén & Muthén, 1998–2012). The purpose of SEM is to build up models to test whether the empirical data are in accordance with a hypothetical model. The models usually are specified based on a substantive theory, and SEM parameters are estimated under some statistical assumptions. A fundamental step in SEM is the specification of exogenous variables (independent variables) and endogenous variables (dependent variables), although exogenous variables can change into endogenous variables. Exogenous variables are not caused by or attributed to other variables in the model, whereas endogenous variables are caused by or attributed to one or more variables (Kline, 2015). The following sections provide an overview of SEM analysis and spell out guidelines to generate and evaluate SEM models in language assessment. We will then review the literature on SEM in language assessment research.
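The distinction between latent and observed variables described above can also be illustrated with a short simulation: a latent trait is generated and then measured imperfectly by several indicators, each of which carries its own measurement error. The following sketch is written in Python and is purely illustrative rather than part of the chapter's AMOS-based analysis; the sample size, loadings, and error variances are arbitrary values chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 500  # hypothetical sample size

# The latent variable (e.g., language ability) is never observed directly.
latent = rng.normal(loc=0.0, scale=1.0, size=n)

# Each observed indicator = loading * latent variable + measurement error.
loadings = [0.8, 0.7, 0.6]   # arbitrary illustrative loadings
error_sd = [0.6, 0.7, 0.8]   # arbitrary error standard deviations
indicators = np.column_stack([
    lam * latent + rng.normal(scale=sd, size=n)
    for lam, sd in zip(loadings, error_sd)
])

# The indicators correlate with one another only because they share the latent cause.
print(np.corrcoef(indicators, rowvar=False).round(2))
```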

An overview of SEM We start this section with a description of model postulation and estimation alongside a brief overview of basic terminologies in SEM analysis. Figure 5.1 presents an SEM model with two latent variables (oval shapes), each of which is measured by three observable or observed variables or indicators (squares). For example, in language assessments, the latent variable would be the language ability under assessment and the indicators would be the relevant test items or tasks. Each observed variable has an error of measurement, represented by small circles (e1,…, e6) that measure the amount of variance in the observed variables that is not attributed to the latent variable. Latent Variable 1, which is an endogenous variable, also has an error term (also called disturbance). This is because Latent Variable 1 is an endogenous variable that is “caused” or predicted by Latent Variable 2, which is an exogenous variable (Schumacker & Lomax, 2010). In some models, a latent variable causes both exogenous and endogenous variables but the latent variable is not included in the mode; this is referred to as spuriousness. (Spuriousness can lead to model identification problems; see Kenny, Kashy, & Bolger, 1998, for a discussion.) The model presented in Figure 5.1 has two main components: two discrete measurement models (also called confirmatory factor analysis [CFA] models) that comprise separate latent variables and their observable models and one structural model that consists of the two latent variables and their relationship, which is illustrated by a one-headed arrow, also known as a path (Kline, 2015). This arrow indicates that the variance in Latent Variable 1 is predicted or caused by the variance in Latent Variable 2.


Figure 5.1   A hypothesized SEM model including two latent variables and multiple observed variables.

Other components of the model in Figure 5.1 include Predictor Variables 1 and 2, which are observed and exogenous. Predictor variables make Latent Variable 1 a formative variable because it is caused or predicted by observed variables, as opposed to Latent Variable 2, which is a reflective variable, causing (variance in) a set of observed variables (Byrne, 2001). The correlations between the predictors usually are freed and estimated. The model in Figure 5.1 can also be called a multiple indicators multiple causes (MIMIC) model wherein the latent variables are predicted by one or more observed variables.

Five stages in SEM analysis Our survey of the literature has identified several stages in SEM analysis in language assessment and other fields. We propose a useful framework for language assessment research that comprises five steps: (1) model specification, (2) model identification, (3) data preparation, (4) parameter estimation, and (5) estimating fit and interpretation. The five steps have been derived from previous applications of SEM such as Hoyle and Isherwood (2013), In’nami and Koizumi (2011a), Kline (2015), McDonald and Ho (2002), and Schumacker and Lomax (2010). We adapt and extend these guidelines and provide a step-by-step guide

104  Zhu, Raquel, and Aryadoust that is accompanied by a tutorial (see the Companion website) to illustrate basic and advanced techniques in model generation, estimation, and validation. To facilitate the description of the five steps, we use the IBM software AMOS, Version 25 (Analysis of Moment Structures) (Arbuckle, 2003). We showcase the application of SEM by presenting a study of the ability of Hong Kong students’ scores on the Diagnostic English Language Tracking Assessment (DELTA) to predict their language proficiency scores on the International English Language Testing System (IELTS). In addition, we investigate whether the relationship between the students’ DELTA and IELTS scores is influenced by their academic background.

Stage 1: Model specification Model specification refers to the identification of the theoretical relationships among the variables as well as the specification of the relationships that do not exist (Kline, 2015). Model misspecification occurs when we fail to specify an otherwise important relationship between two or more variables, or when we specify a link between two or more variables that are not related (McDonald & Ho, 2002). There are three types of relations among variables in an SEM analysis: links that are fixed to zero and therefore not estimated (e.g., there is no link specified between Latent Variable 1 and Observable Variable 4); links that are fixed to a nonzero number and not estimated (e.g., the link between e1 and Observable Variable 1 in Figure 5.1); and links that are not fixed and therefore are estimated (e.g., the links between Latent Variable 1 and Observable Variables 2 and 3; McDonald & Ho, 2002).
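Before a model is drawn in a program such as AMOS, it can be helpful to write the specification out explicitly, listing every path together with its status (freely estimated, fixed to a nonzero value, or fixed to zero by omission). The sketch below encodes the three types of relations just described for the hypothetical model in Figure 5.1 as a plain Python data structure; the variable names mirror the figure, and the single fixed path follows the e1 example given above.

```python
# Path status: "free" (estimated), ("fixed", value) (not estimated),
# or omitted entirely (fixed to zero and therefore not estimated).
specification = {
    ("LatentVariable1", "ObservableVariable1"): "free",
    ("LatentVariable1", "ObservableVariable2"): "free",
    ("LatentVariable1", "ObservableVariable3"): "free",
    ("LatentVariable2", "ObservableVariable4"): "free",
    ("LatentVariable2", "ObservableVariable5"): "free",
    ("LatentVariable2", "ObservableVariable6"): "free",
    ("LatentVariable2", "LatentVariable1"): "free",  # structural (cause-effect) path
    ("e1", "ObservableVariable1"): ("fixed", 1.0),   # fixed to a nonzero value
    # ("LatentVariable1", "ObservableVariable4") is omitted: fixed to zero.
}

n_free = sum(status == "free" for status in specification.values())
print(f"{n_free} free paths; all omitted paths are fixed to zero.")
```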

Stage 2: Model identification Model identification refers to the process of evaluating whether the parameters in a specified model can be uniquely estimated and actual results can be produced (Crockett, 2012). Traditional model identification requires mathematical knowledge and rigorous computation. Nevertheless, thanks to the advances in computational and statistical software, now a model can be identified by current SEM computer packages. Investigating whether the model would be identified is performed for both measurement and structural models (Kline, 2015). To achieve model identification in measurement models, there must be at least two observed variables, and preferably three or more with uncorrelated errors, per latent variable. In addition, there must be a justified relationship between each pair of latent variables. This relationship can be correlational (indicated by bidirectional arrows) or cause-effect (indicated by unidirectional arrows) (Schumacker & Lomax, 2010). Next, we must fix one of the factor loadings to one (i.e., free the parameter), thereby creating a marker variable (Kenny, Kashy, & Bolger, 1998). In models with formative latent variables, we must fix one of the parameters leading to the latent variable to one and its disturbance to zero. Models can be over-, just-, or under-identified. To determine whether a model meets the conditions of any of these three identification types, we must calculate

the degrees of freedom by subtracting the number of unknown parameters (those that we need to estimate) from the number of nonredundant known elements or correlations in the correlation matrix. The number of correlations can be calculated by using the formula shown in Equation 5.1 (Weston & Gore, 2006): Number of correlations = (Number of observed variables × [Number of observed variables + 1]) ⁄ 2 (5.1). For example, in the lower part of Figure 5.1, there are three observed variables (4, 5, and 6), thus making the numerator of the formula 3(3 + 1) = 12, which is then divided by the denominator 2. Thus, the final number of correlations is 6. If we consider the full model, note that there are eight observed variables and 36 known elements: (8[8 + 1]) ⁄ 2. Furthermore, we need to estimate six error variances, one disturbance, two correlations, and nine loading coefficients, for a total of 18 unknown parameters. Since 36 − 18 = 18 (degrees of freedom = 18), the number of known parameters is greater than the number of unknown parameters. Thus, the model is over-identified; in other words, the model can be estimated and the unknown parameters can be computed. However, if the number of known parameters is equal to the number of unknown parameters, the model is just-identified (no degrees of freedom) and will have a perfect fit to the data. If the number of known parameters is less than the number of unknown parameters, the model is under-identified and not estimable (Kline, 2015).
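The counting rule in Equation 5.1 and the worked example above can be reproduced in a few lines of code. The helper below is a generic sketch rather than AMOS output; the counts passed to it in the example are the ones stated in the text for the full model in Figure 5.1 (eight observed variables and 18 free parameters).

```python
def identification_status(n_observed: int, n_free_parameters: int) -> str:
    """Classify a model as over-, just-, or under-identified from the counting rule."""
    known = n_observed * (n_observed + 1) // 2  # nonredundant elements (Equation 5.1)
    df = known - n_free_parameters
    if df > 0:
        label = "over-identified"
    elif df == 0:
        label = "just-identified"
    else:
        label = "under-identified"
    return f"{known} known elements, {n_free_parameters} free parameters, df = {df}: {label}"

# Full model in Figure 5.1: six error variances + one disturbance + two correlations
# + nine loading coefficients = 18 free parameters.
print(identification_status(8, 18))
```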

Stage 3: Data preparation Data preparation includes adequate sample size, checking the univariate and multivariate normality, and multicollinearity (Weston & Gore, 2006). While having a large enough sample size is an important requirement of SEM analysis, there is no universally accepted rule. Despite this, there are some rough guidelines in the literature. Kline (2015) recommended a participant-to-indicator ratio of 10 or 20, whereas others have suggested a ratio of 5 persons to 1 indicator (Bentler & Chou, 1987) or 20 to 1 (Tanaka, 1987) for stable parameter estimation (see Muthén & Muthén, 2002). Another approach to determine the sample size is determination according to the statistical power required. According to MacCallum, Browne, and Sugawara (1996), larger sample sizes return more precise parameter estimates. One recent development is Soper’s (2018) calculator, which can be used to estimate a priori sample sizes in SEM. (The calculator is accessible from https: ⁄ ⁄www.danielsoper.com ⁄statcalc ⁄calculator.aspx?id=89.) The calculator provides an estimation of the required sample based on the desired size effect, statistical power level, the number of latent and observed variables, and the desired p value (see Cohen, 1988; Westland, 2010). Finally, some researchers (e.g., Bond & Fox, 2015; Wang & Chyi-In, 2004) have proposed that raw data should be transformed to interval-level data to meet the requirements of parametric analysis and to make the analysis more robust. Byrne (2010) also claims that interval level data are preferred when building empirical models with SEM. One method of converting ordinal data into interval level data is through the Rasch model (see Chapter 4, Volume I).
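The participant-to-indicator ratios cited above translate directly into a quick screening check. The sketch below simply applies the rules of thumb mentioned in this paragraph (5:1, 10:1, and 20:1); it is not a substitute for an a priori power analysis of the kind implemented in Soper's (2018) calculator, and the numbers in the example are illustrative only.

```python
def sample_size_check(n_participants: int, n_indicators: int) -> dict:
    """Compare the participant-to-indicator ratio against common rules of thumb."""
    ratio = n_participants / n_indicators
    return {
        "ratio": round(ratio, 1),
        "meets 5:1 (Bentler & Chou, 1987)": ratio >= 5,
        "meets 10:1 (Kline, 2015)": ratio >= 10,
        "meets 20:1 (Tanaka, 1987)": ratio >= 20,
    }

# Illustration: 257 participants and six observed variables
print(sample_size_check(257, 6))
```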

106  Zhu, Raquel, and Aryadoust The second data preparation requirement is checking the univariate and multivariate normality. This is particularly important in the Maximum Likelihood (ML) method of parameter estimation, which assumes data normality. To evaluate univariate normality, researchers usually check the skewness (the degree of symmetry of the distribution) and kurtosis (the sharpness or peakedness of the distribution). Values falling between −3 and +3 indicate little or no significant deviation from normal distributions (Kline, 2015). Multivariate normality is evaluated using the normalized Mardia’s (1970) coefficient. In normal distributions, Mardia’s coefficient is < 3 (Arbuckle & Wothke, 1999). To treat non-normal data, Weston and Gore (2006, p. 737) proposed identifying outliers and removing them or transforming data into square roots “when data are moderately positively skewed,” inversing “for severe positive skew, such as an L-shaped distribution,” and logarithm “for more than moderate positive skew.” Another method is to estimate the Bollen-Stine p value, which is based on bootstrapping instead of the conventional p value estimated by the ML method (see Mooney & Duval, 1993, for a review of bootstrapping). For example, researchers can choose to generate 2,000 bootstrap samples drawn from the main sample and estimate the parameters. If the model does not fit the bootstrap samples to reach stable parameter estimations (e.g., when singular covariance matrices are created, affecting model fitting), some software packages like AMOS select a replacement sample to make sure that the final parameter and fit estimates are made based on the useable samples that the analyst initially requested. The third essential requirement for data preparation is to test for multicollinearity. This is a situation where independent variables are highly correlated with each other (correlation > .90), which could lead to the inflation of the standard errors of the coefficients. To determine whether multicollinearity is present in the data, indices such as the variance inflation factor (VIF) can be checked. If VIF is near or above five, multicollinearity may cause problems in parameter estimation (Gujarati & Porter, 2003). According to Kline (2015), when the correlation of the observed variables measuring one latent variable is > 0.90, one of them should be removed from the analysis. However, in language assessments, most of the tasks and items are expected to have high correlations and removing items might not be feasible. A solution to the multicollinearity problem was proposed by Goh and Aryadoust (2014). They suggested constructing parcel items or aggregate-level indicators that include the summation of test takers’ performance on locally dependent related test items (items that have significantly high correlations). The caveat of item aggregation is that while some researchers advocate the application of the technique (Sterba & Rights, 2016), others such as Marsh, Ludtke, Nagengast, Morin, and Von Davier (2013) have cautioned that parceling is “ill-advised” specifically when the model does not fit the data. Nevertheless, using parcel items is a common approach. For example, Plummer (2000) and Williams and O’Boyle (2008) showed that 50% of the SEM studies that they surveyed had used the technique for a variety of reasons including fit optimization.
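The univariate and multivariate screening described in this section can also be scripted outside of AMOS. The sketch below assumes the scores are held in a NumPy array with one column per observed variable; the Mardia statistic is computed under the usual normalized multivariate-kurtosis formulation, and each VIF is obtained from the R2 of regressing one variable on the remaining ones. The thresholds mentioned in the comments are the values cited above (±3 for skewness and kurtosis, VIF near 5), and the simulated data are for illustration only.

```python
import numpy as np
from scipy import stats

def univariate_screen(x: np.ndarray):
    """Skewness and kurtosis per column; values within +/-3 are treated as acceptable here."""
    return stats.skew(x, axis=0), stats.kurtosis(x, axis=0)

def mardia_normalized_kurtosis(x: np.ndarray) -> float:
    """Normalized Mardia multivariate kurtosis (assumption: the standard b2,p formulation)."""
    n, p = x.shape
    centered = x - x.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(x, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, s_inv, centered)  # squared Mahalanobis distances
    b2p = np.mean(d2 ** 2)
    return (b2p - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)

def vif(x: np.ndarray) -> np.ndarray:
    """Variance inflation factor for each column, from R^2 of regressing it on the others."""
    out = []
    for j in range(x.shape[1]):
        y, others = x[:, j], np.delete(x, j, axis=1)
        design = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Illustration with simulated data (five correlated "scores")
rng = np.random.default_rng(0)
cov = 0.5 * np.ones((5, 5)) + 0.5 * np.eye(5)
scores = rng.multivariate_normal(np.zeros(5), cov, size=300)
print(univariate_screen(scores))
print(mardia_normalized_kurtosis(scores))
print(vif(scores).round(2))
```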


Stage 4: Parameter estimation Two SEM approaches to parameter estimation have emerged from the literature: the covariance-based (CB) SEM or maximum likelihood (ML) (e.g., used in the AMOS) and partial least square (PLS) SEM (e.g., used in the PLS software) (Hair, Sarstedt, Ringle, & Gudergan, 2018; Lohmöller, 1989; Wold, 1982). While there is a great deal of published research on comparing these two methods and presenting them as “rivals” (Bentler & Huang, 2014; Dijkstra, 2014; Dijkstra & Henseler, 2015), Jöreskog and Wold (1982) stressed that these methods are simply complementary and are used for different purposes. We find Hair, Ringle, and Sarstedt’s (2011, p. 144) guidelines on using these methods useful: If the goal is predicting key target constructs or identifying key “driver” constructs, select PLS-SEM. If the goal is theory testing, theory confirmation, or comparison of alternative theories, select CB-SEM. If the research is exploratory or an extension of an existing structural theory, select PLS-SEM. The next step is to determine the type of parameters. There are three types of parameters in SEM analysis: “directional effects, variances, and covariances” (Weston & Gore, 2006, p. 730). Directional effects capture the relationship between latent variables and observed variables as well as the relationships among latent variables; for example, in Figure 5.1, directional effects include those represented as one-headed arrows running from Latent Variable 1 to Observed Variables 1, 2, and 3 or running from Latent Variable 2 to Latent Variable 1. The magnitude of effect that is captured by directional effects is called factor loading or regression weights  ⁄coefficients. For parameter estimation, it is necessary to set one of the directional effects in each measurement model to one (e.g., the factor loading between Latent Variable 1 and Observed Variable 1). This way, we scale the latent variable (Brown, 2006). Another way of scaling the latent variable is to set the mean of the latent variable to zero or to set its variance to one (Brown, 2006; Kline, 2015). Since SEM accommodates measurement error (unlike regression, which does not), it is possible to estimate variance or error by fixing the error terms (small circles) to one and then estimating the variance per each measurement error, as shown in Figure 5.1 (Weston & Gore, 2006). Finally, covariances that are represented by bidirectional arrows (e.g., between e3 and e4 in Figure 5.1) indicate that two latent variables or error terms are associates and therefore covary—that is, any change in one is associated with change in the other one.

Stage 5: Model fit and interpretation The final stage for SEM analysis is to assess the accuracy of the model through checking the model fit indices. If the model fit indices are within a reasonable range, the model specification is proven to be appropriate for the data. Otherwise, the model needs to be respecified. There are multiple fit indices that can be categorized into four types: absolute fit, incremental or comparative fit, residual-based fit, and predictive fit (Schumacker & Lomax, 2010).

108  Zhu, Raquel, and Aryadoust Absolute fit indices estimate the amount of variance that can be explained by the proposed model in the data matrix. The commonly seen absolute fit indices are the χ2 (chi-square), the goodness-of-fit index (GFI), and the adjusted goodness-­of-fit index (AGFI) that is adjusted for parsimony. The interpretation of χ2 should be done with caution, mainly because it is sensitive to the sample size, and it is better to be reported with the degrees of freedom (df ) and p value. As recommended by Kline (1994), the ratio of χ2:df less than 3:1 indicates that the proposed model is a good fit for the covariance matrix. The GFI indices, proposed by Jöreskog and Sörbom (1986), measure the ratio of the sum of the squared differences to the observed variance. AGFI is the adjusted GFI that takes degrees of freedom into account. Both GFI and AGFI values range from 0 to 1, indicating a good fit to the data when values are higher than .90 (Jöreskog & Sörbom, 1986; Mulaik et al., 1989). However, the mean value of GFI is substantially affected by sample size, which means that it decreases as the sample size decreases and vice versa (Sharma, Mukherjee, Kumar, & Dillon, 2005), so the use of this index is suggested to be used together with other indices to avoid any biased prediction. Incremental or comparative fit indices, which refer to R 2 value, compare the hypothesized model with a baseline model. The normed fit index (NFI), nonnormed fit index (NNFI, or the Tucker-Lewis index or TLI), the comparative fit indices (CFI), and the relative fit index (RFI) are commonly used incremental or comparative fit indices. Traditionally, values greater than .90 or .95 are viewed as indicators of a good model-data fit. Among these fit indices, the NFI might underestimate the model fit in small samples (Bollen, 1989), so the NFFI or the CFI, which takes the sample size into account, has been used as a more reliable measure of fit (Bentler, 1990). The desired values for NNFI or CFI should preferably be greater than 0.90 (Byrne, 2001). The RFI is similar to CFI, ranging from 0 to 1, with values greater than .90 or .95 indicating a good fit. Residual-based fit indices are based on the residuals that are the differences between the observed and predicted covariances. Two widely used residual-based fit indices are the standardized root mean square residual (SRMR; Bentler, 1995) and the root mean square error of approximation (RMSEA; Browne & Cudeck, 1992). There is no agreement whether SRMR should be reported in the literature (In’nami & Koizumi, 2011), but we suggest SRMR can provide a fresh perspective for the model fit since it, unlike χ2, reflects the standardized difference between the observed correlation and the predicted correlation. RMSEA is currently a popular measure of fit in SEM papers, but it is suggested that RMSEA is not appropriate in models with low df because it may be inflated in such cases (Kenny, Kaniskan, & McCoach, 2015). For both SRMR and RMSEA, values less than 0.05 are desired for a well-fitting model, while values less than .08 still suggest a reasonable model fit (Brown, 2006). Predictive fit indices are used to select the best model from competing models that are developed for the same data (Kline, 2015). These include the Akaike information criterion (AIC), the constant AIC (CAIC), the Bayesian information criterion (BIC), the expected cross-validation index (ECVI), etc.

Structural equation modeling 109 These indices are used when the models are not nested. AIC is founded on the information theory, which helps to choose the optimal model that minimizes the information loss, so AIC does not test whether the observed model fits the null hypothesis (Akaike, 1974). Instead, it only provides a measure for the quality of the model compared to other (candidate) models. CAIC is AIC after adjusting for the sample size. BIC, although similar to the formula of AIC, is not an optimal choice for selecting the model when we apply the least mean squared error method because a “true model” (the process that generated the data) is not from the set of candidate models (Yang, 2005). Similarly, ECVI is used to predict how reliably a model would predict future sample covariances (Browne & Cudeck, 1993). With the predictive fit indices, lower values indicate better fit. It should be noted that recent research shows that the cutoff criteria for the fit statistics in SEM are not strictly applicable to all types of models (Fan & Sivo, 2007), so the introduction to the fit indices here serves as the starting reference to model fit assessment. Researchers should take into account the sample size, factor loading structures, or the number of latent and manifest variables when it comes to a specific model (see Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011). Finally, as will be discussed in the next section, further guidelines in the reporting of SEM studies in language assessment have been provided by In’nami and Koizumi (2011a). They identified the best practices used by doing a meta-analysis of language testing studies in top language assessment journals up until 2008. Their study revealed that the ML estimation method was commonly used for parameter estimation and that model fit was determined by the following indices: chi-squares (with p values and degrees of freedom), CFI, RMSEA, and TLI. However, they also indicated that the sample size of some studies was not adequate, but those that were adequate were sufficient if they included the guidelines used to determine appropriate sample size. These guidelines were further refined by Ockey and Choi (2015). Table 5.1 is a summary of the guidelines suggested by both articles, which have also been reflected earlier in Bollen and Long (1993). Note that while we presented model fit and interpretation as a single stage, In’nami and Koizumi (2011a) separated model fit from model interpretation.
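Several of the indices reviewed above are simple functions of the model chi-square, the baseline (independence) model chi-square, the degrees of freedom, and the sample size. The sketch below implements commonly cited formulas for the χ2 ⁄ df ratio, CFI, TLI (NNFI), RMSEA, and an AIC of the chi-square plus 2q form; it is a rough cross-check rather than a replacement for the values printed by AMOS, LISREL, EQS, or Mplus, and the numbers in the example are invented for illustration.

```python
import math

def fit_indices(chi2_m, df_m, chi2_b, df_b, n, n_free_params):
    """Approximate CFI, TLI/NNFI, RMSEA, and AIC from model (m) and baseline (b) chi-squares."""
    cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 1e-12)
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)  # can exceed 1
    rmsea = math.sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))
    aic = chi2_m + 2 * n_free_params  # assumes the chi-square + 2q formulation
    return {"chi2/df": chi2_m / df_m, "CFI": cfi, "TLI": tli, "RMSEA": rmsea, "AIC": aic}

# Invented values for illustration only
print(fit_indices(chi2_m=12.4, df_m=8, chi2_b=310.0, df_b=15, n=250, n_free_params=12))
```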

SEM in language assessment research SEM has been a popular method in language testing. In’nami and Koizumi (2011a) conducted a review of studies that used SEM in the field of language testing and learning up until 2008. The aim of the study was to determine how SEM had been used in the field of language testing and learning and, as mentioned in the previous section, to suggest guidelines as to the proper use and reporting of SEM. Their study identified 50 articles published in 20 language testing and learning international online journals. Their research showed that the top research areas were investigation of the relationship of test taker characteristics

110  Zhu, Raquel, and Aryadoust Table 5.1  Language assessment SEM research reporting guidelines (based on Ockey and Choi, 2015, and In’nami and Kiozumi, 2011a) Steps in SEM application

Information to be included in SEM research reports

Model specification

• Path models of proposed hypothesized models that clearly indicate the hypothesized relationships of latent and observed variables • Description of theory to support hypothesized model

Model identification

• Path models that clearly indicate the number of unique data points, fixed parameters, and free parameters to verify the model degrees of freedom

Data preparation

• Description of test taker characteristics (e.g., gender, language background) • Observed variables’ sample size, descriptive statistics, reliability estimates, correlation, and covariance matrix • Presence of multicollinearity and how it was dealt with (if any) • Methods used to check for normality of data (e.g., Q-Q plots or histograms, skewness, and kurtosis values to determine univariate or multivariate data) • Methods used to account for missing data (if any) (e.g., list-wise deletion, pair-wise deletion) • Guidelines used to determine appropriateness of sample size (e.g., Kline [2015] guidelines, Raykov and Marcoulides [2006] guidelines)

Parameter estimation

• Parameter estimation method used (e.g., maximum likelihood method, generalized least squares method) • Software packages used (AMOS, LISREL, EQS, Mplus)

Model fit

Absolute fit indices • Chi-squares with p values and degrees of freedom • Goodness-of-fit index (GFI) • Standardized root mean square residual (SRMR) Adjusted for parsimony indices with confidence intervals • Root mean square error of approximation (RMSEA) • Adjusted goodness-of-fit index (AGFI) Relative ⁄ incremental ⁄ comparative fit indices • Comparative fit index (CFI) • Tucker-Lewis index (TLI) • Incremental fit index (IFI) • Normed-fit index (NFI) Predictive fit indices • Akaike information criterion (AIC) • Consistent Akaike information criterion (CAIC)

Model interpretation

• Standard errors of estimates • Standardized and unstandardized estimates in the path diagram • Residual variances of endogenous variables (appendix or supplementary material) • Factor loadings and structural coefficients • Alternative models (if any) • Theory and ⁄ or rationale for model respecification (if any)

Structural equation modeling 111 and test performance and test trait ⁄structure. Their study also showed the significant growth of SEM studies since 1998; and indeed, it has been used to address a wider range of research queries such as the investigation of the predictors of reading ability (Phakiti, 2008a), the multidivisibility of listening subskills (Aryadoust & Goh, 2013; Eom, 2008; Goh & Aryadoust, 2014), the multidivisibility of writing ability (Aryadoust & Liu, 2015; Wilson, Roscoe, & Ahmed, 2017), and effects of strategy use in language tests (Purpura, 1998; Yang, 2014). In addition to the meta-analysis conducted by In’nami and Koizumi (2011a), listed here are other SEM studies we found in the top language assessment journals that investigated the following: i Relationship of test taker characteristics ⁄attributes and overall or specific skill test performance (Alavi & Ghaemi, 2011; Cai, 2012; Gu, 2013; Phakiti, 2008a, 2008b, 2016; Song, 2011; Yang, 2014; Zhang, Goh, & Kunnan, 2014) ii Relationship of individual test tasks and overall or specific test performance (e.g., Harsch & Hartig, 2016; Leong, Ho, Chang, & Hau, 2013; Trace, Brown, Janssen, & Kozhevnikova, 2017) iii Relationship of test design and use on test preparation (e.g., Xie & Andrews, 2012) iv Relationship of test preparation and test scores (e.g., Xie, 2013) v Trait ⁄test structure (e.g., Bae, Bentler, & Lee, 2016; Cai & Kunnan, 2018; Sawaki, Stricker, & Oranje, 2009; Song, 2008; van Steensel, Oostdam, & van Gelderen, 2013) vi Test construct validity (e.g., Alavi, Kaivanpanah, & Masjedlou, 2018; Farnsworth, 2013; Liao, 2009) vii Relationship between test results and another measure (e.g., Aryadoust & Liu, 2015) viii Validity ⁄reliability of test-score interpretations through multisample ⁄group analysis (e.g., Bae & Bachman, 1998, 2010; In’nami & Koizumi, 2011a, 2011b; Shin, 2005; Trace et al., 2017) The next section illustrates how SEM can be applied in language assessment research. We present a study of how a hypothesized model of the relationship between a diagnostic language test and a language proficiency test is analyzed through SEM.

Sample study: Identifying the predictive ability of a diagnostic test toward a proficiency test Context A university in Hong Kong requires students to take the Diagnostic English Language Tracking Assessment (DELTA) at least three times during their academic degree program to support their English language development throughout their studies. Students are also asked to take it so that they can self-monitor

112  Zhu, Raquel, and Aryadoust their English language proficiency level in preparation for the official exit language test of the university, the International English Language Testing System (IELTS) assessment. Although both are tests of language proficiency, the DELTA is a diagnostic and tracking assessment that assesses receptive skills of listening, reading, grammar, and vocabulary in discrete-item (multiple-choice) format, while IELTS assesses listening, reading, speaking, and writing ability through multiple integrative item types. Thus, to provide evidence of the usefulness of the DELTA to stakeholders, it is imperative to determine to what extent DELTA scores can predict IELTS scores. Furthermore, since studies have determined that students’ English proficiency and ⁄or academic performance differ across academic disciplines (Aina, Ogundele, & Olanipekun, 2013; Celestine & Ming, 1999; Pae, 2004), there is also a question as to whether the academic discipline of the student impacts the relationship of DELTA and IELTS scores and contributes to construct-irrelevant variance. Predictive validity is an aspect of criterion-related validity where a test can claim that its test results can consistently and accurately determine a test taker’s future performance (Hughes, 2002). This usually is determined by comparing test scores against an established external measure such as test results of another test taken at another point in time or grade point average. In fact, predictive validity research is quite common in high-stakes language proficiency tests such as IELTS and the Test of English as a Foreign Language (TOEFL) as these tests usually are utilized as university entrance requirements, which thus require evidence that the test can predict academic performance. For example, several studies were conducted to determine if IELTS is an indicator of academic success (e.g., Breeze & Miller, 2011; Morton & Moore, 2005; Woodrow, 2006) and similarly for TOEFL (e.g., Ginther & Yan, 2017; Sawaki & Nissan, 2009; Takagi, 2011). At the time of the writing of this chapter, however, there have been no predictive validity studies known do the reverse, i.e., another test to predict performance on either the IELTS or the TOEFL test. This study thus attempts to test a relationship between DELTA and IELTS and to determine if this relationship differs depending on the academic discipline of the student. The hypothesized conceptual framework of this study is shown in Figure 5.2. The study aims to answer the following research questions: i Do the DELTA scores explain (predict) IELTS scores? ii Does faculty (academic discipline) influence the predictive relationships of the DELTA and IELTS scores?

Methodology Instruments The DELTA is a computer-based post-entry diagnostic language assessment developed for tertiary institutions in Hong Kong (Urmston, Raquel, & Tsang, 2013). As mentioned, it has four components: listening,


Figure 5.2  A hypothesized model of factors predicting IELTS scores.

reading, vocabulary, and grammar. The listening, reading, and grammar components are text-based tests where candidates are given a text to read or listen to and are asked to answer questions related to the text, while the vocabulary component is a discrete-item test. Each reading and listening question is tagged to address one subskill, while grammar items are tagged for specific grammatical items, and vocabulary items are tagged for the appropriate academic vocabulary that completes each sentence (the construct of each test component is discussed in detail in Urmston et al., 2013). Each student completes 100 items selected from the item bank: 20–30 items across four listening tests, 20–30 items across four reading tests, 25–30 items across two grammar tests, and 20–25 vocabulary items. The DELTA diagnostic report provides students with a holistic score of performance across all subcomponents (DELTA, 2018), a graph that describes their relative performance across test components, and a graph that tracks their scores every time they take the DELTA (see Figure 5.3). The DELTA scale is a 200-point interval scale transformed from Rasch logit measures (11 DELTA units = 1 logit) and represents the score of all test components (Urmston et al., 2013). IELTS is a language proficiency test of listening, reading, speaking, and writing. It has two formats, general and academic, with the latter the type preferred by universities. The academic reading and listening tests are text-based, similar to the DELTA, but utilize multiple item types (multiple-choice, gap-fill, matching, etc.). The speaking and writing tests are performance-based tests; the speaking test requires candidates to go through a 15-minute three-stage interview with an examiner, and the writing test requires a candidate to complete two writing tasks—interpretation of a graphic and an essay. IELTS scores are generated for each test component and presented as band scores out of a nine-point interval scale (IELTS, 2018).1 The listening and reading band scores are direct conversions from raw scores, while writing and speaking scores are awarded based on grades on the assessment criteria. An overall score is also provided, which is the average of the component scores (IDP Education, 2018). Evidence of the validity of each test component scale structure is provided in studies such as


Figure 5.3  DELTA track.

Brown (2006a, 2006b) for the speaking test, Shaw and Falvey (2006) and Shaw (2007) for the writing test, Aryadoust (2013) for the listening test, and Taylor and Weir (2012) for the reading and listening tests.

Sample The data were taken from a cohort of business and humanities students who took the DELTA and IELTS from 2012 to 2017. In academic year 2016–2017, students were asked to volunteer to submit their IELTS scores for the purpose

of the project. A total of 257 students who took the DELTA between 2012 and 2016 and the IELTS test between 2014 and 2017 agreed to participate in the study. Of these, 156 were humanities students and 101 were business students. There were no missing data in this sample.

Data analysis In this section, we review the five SEM stages that were discussed earlier. In model specification (stage 1), we specified the relationships among the variables in three models: a measurement ⁄CFA model and two SEM models. For the CFA model, we developed a model comprising a latent variable representing academic English proficiency measured by IELTS that was indicated (measured) by the participants’ IELTS listening, speaking, reading, and writing scores. This relationship is presented in a graphical model with one-headed arrows running from IELTS to the four indicators (see Figure 5.4). Then, the first SEM model included the CFA model regressed on the DELTA overall scores linearized by using the Rasch model, as discussed earlier (see Figure 5.5). The regression (cause-effect) relationship was specified by a one-headed arrow running from the DELTA observed variable to the IELTS latent variable, which has a disturbance term. DELTA is the exogenous variable, whereas IELTS is the endogenous or predicted variable. The second SEM model is nested in the first SEM but also included faculty (representing academic disciplines of business and humanities) as an exogenous variable predicting or causing the DELTA observed variable and the IELTS latent variable (see Figure 5.6). In this model, both the DELTA and the IELTS variables are endogenous variables.

Figure 5.4  A CFA model with the IELTS latent factor causing the variance in the observable variables (IELTSL: IELTS listening score, IELTSR: IELTS reading score, IELTSW: IELTS writing score, and IELTSS: IELTS speaking score) (n = 257).


Figure 5.5  An SEM model with observable DELTA data predicting the latent variable, IELTS (n = 257).

As mentioned in the previous section, software packages such as AMOS facilitate the evaluation of model identification (stage 2), and AMOS was used for this purpose in the present analysis. As noted earlier, DELTA and IELTS scores are interval level data. These were examined for adherence to univariate normality by examining their kurtosis and skewness coefficients. Multivariate normality was determined by Mardia's coefficient (stage 3).

Figure 5.6  An SEM model with faculty predicting the latent variable, IELTS, and the observable DELTA variable (n = 257).

Next, in parameter estimation (stage 4), we used the ML method, as the data were normally distributed. Finally, in model fit and interpretation (stage 5), we estimated multiple fit statistics to evaluate the fit of the model to the data. The next section reports the results of the five stages of SEM analysis.

Results: CFA model After checking the univariate and multivariate normality of the data in the way discussed in the section on data preparation, we used the ML method of parameter estimation to test the fit of the CFA model. The model fit indices determine how well the model fits the data. As demonstrated in Figure 5.4, the standardized regression coefficients for the indicators are located on top of the arrows running from the latent variable to the indicators. The coefficients are around medium size (p < .05), which indicates that the observed variance in the test scores is attributed moderately to the latent variable IELTS. In addition, we investigated the fit statistics of the model (using the most commonly used fit indices mentioned in Table 5.1) that showed an excellent model-to-data fit (χ2 = .857, CFI = 1.000, NFI = .967, TLI = 1.178, RMSEA = .000). It should be noted that TLI and CFI values can be greater than 1.
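The observation that TLI (and CFI) can exceed 1 follows directly from the TLI formula: whenever the model chi-square is smaller than its degrees of freedom, the index rises above 1. The baseline chi-square used below is an assumed value chosen only to demonstrate the arithmetic; it is not taken from the AMOS output of this study.

```python
# TLI = (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1)
chi2_m, df_m = 0.857, 2   # CFA model values reported above
chi2_b, df_b = 150.0, 6   # hypothetical baseline (independence) model
tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)
print(round(tli, 3))      # greater than 1 because chi2_m / df_m < 1
```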

Results: SEM full models To address the two research questions, we regressed the IELTS latent variable on the DELTA variable (Figure 5.5), and then regressed the DELTA overall score and IELTS on the faculty (academic discipline) variable (Figure 5.6). The fit of the hypothesized models to the data is shown in Table 5.2. The chi-square statistic for the model in Figure 5.5 was nonsignificant (χ2 = 5.261, df = 5, p > .05), indicating that there is no statistically significant difference between the model-implied and the observed covariance matrix. Moreover, the χ2:df value was 1.052, which is less than 3 and further supports good model fit, and the comparative fit statistics also confirm this (CFI = .987, NFI = .939, TLI = .993, RMSEA = .014). Similarly, the model in Figure 5.6 had a good fit to the data (e.g., CFI = .988, NFI = .903, TLI = .969, RMSEA = .02). In the first SEM model (Figure 5.5), the amount of variance in IELTS scores explained by DELTA scores is 54.76%. In the second SEM model (Figure 5.6), the amount of variance in IELTS scores explained by DELTA scores, taking faculty into account, is 55%.

Table 5.2  Fit indices of the hypothesized models and the data

Model                            χ2      df   χ2 ⁄ df   p      CFI     NFI    TLI     AIC      RMSEA
SEM Model 1 ⁄ CFA (Figure 5.4)   .851    2    .425      .654   1.000   .967   1.178   24.851   .000
SEM Model 2 (Figure 5.5)         5.261   5    1.052     .385   .997    .939   .993    35.261   .014
SEM Model 3 (Figure 5.6)         8.818   8    1.102     .358   .988    .903   .969    46.818   .020
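As a rough check on Table 5.2, the χ2 ⁄ df ratios and RMSEA values can be recovered from the reported chi-square statistics, their degrees of freedom, and the sample size of 257. The short sketch below uses the standard RMSEA formula; the remaining indices in the table would additionally require the baseline chi-square, which is not reported in this chapter.

```python
import math

def chi2_ratio_and_rmsea(chi2, df, n=257):
    return round(chi2 / df, 3), round(math.sqrt(max(chi2 - df, 0) / (df * (n - 1))), 3)

for label, chi2, df in [("Model 1/CFA", 0.851, 2), ("Model 2", 5.261, 5), ("Model 3", 8.818, 8)]:
    print(label, chi2_ratio_and_rmsea(chi2, df))
# The resulting values agree with Table 5.2 up to rounding.
```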

The reason for this slight difference in the amount of variance explained by the first and second SEM models is that faculty, which is added to the second model, had a negligible impact on both IELTS and DELTA scores (standardized regression weights = 0.08 and 0.07, respectively). Standardized estimates are comparatively easier to interpret because they place all parameter estimates on a common scale; however, because their standard errors can be problematic, unstandardized estimates are preferred for making meaningful statistical inferences (Ockey & Choi, 2015). Standardized estimates are expressed in standard deviation units, meaning the variables are all converted to z-scores before running the analysis, whereas unstandardized estimates are based on the "raw" data and represent how much the dependent variable changes for a one-unit change in the independent variable, holding the other predictors constant. Standardized estimates already appear in the path model shown in Figure 5.5. To help further examine the direct and indirect path estimates of DELTA and IELTS, Table 5.3 presents both the standardized and unstandardized estimates. The prediction paths from faculty to DELTA Overall1 (p > .05) and from faculty to IELTS (p > .05) are not statistically significant, indicating that the students' academic discipline influences neither their overall DELTA performance nor their IELTS performance. However, the path from DELTA Overall1 to IELTS is statistically significant (p < .05), indicating that the overall DELTA score, as the diagnostic test score, positively predicts test takers' language ability as measured by IELTS.
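The link between the unstandardized and standardized coefficients in Table 5.3 is the usual rescaling by the standard deviations of the predictor and the outcome. The standard deviations of the DELTA and latent IELTS variables are not reported in this chapter, so the values below are placeholders chosen only to illustrate the conversion, not a reproduction of the table.

```python
def standardized(b_unstd: float, sd_x: float, sd_y: float) -> float:
    """Standardized coefficient = unstandardized coefficient * SD(predictor) / SD(outcome)."""
    return b_unstd * sd_x / sd_y

# Placeholder standard deviations (assumed, not taken from the study)
print(round(standardized(0.020, sd_x=20.0, sd_y=0.55), 2))
```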

Table 5.3  Standardized and unstandardized estimates of SEM model 3 (see Figure 5.6)

Direction                    Standardized Estimate   Unstandardized Estimate   S.E.   C.R.    p
DELTA Overall1 ← Faculty     .066                    .701                      .663   1.058   .290
IELTS ← DELTA Overall1       .734                    .020                      .006   3.173   .002
IELTS ← Faculty              .075                    .021                      .026   0.807   .420
IELTSL ← IELTS               .389                    1.832                     .662   2.769   .006
IELTSS ← IELTS               .249                    1.000                     .000
IELTSR ← IELTS               .446                    2.219                     .772   2.873   .004
IELTSW ← IELTS               .319                    .927                      .358   2.586   .100

Discussion This chapter presented a study of the relationship between the DELTA and IELTS tests. The results of the study confirm the hypothesized relationship between the DELTA and the IELTS test scores, with no effect of academic discipline as a mediating factor. This seems to confirm the homogeneous character of the population, in which students within and across the universities have similar levels of English language proficiency. It also suggests that in this testing context the test scores were not affected by the background of the test takers, which contradicts previous studies (e.g., Aina et al., 2013; Celestine & Ming, 1999; Pae, 2004) that have claimed that students from different academic disciplines had different levels of language proficiency.

This finding thus provides evidence of the fairness of the DELTA and the IELTS tests. As Roever and McNamara (2006) argued, for a test to be fair, there must be strong evidence that the test is not affected by sources of construct-irrelevant variance in a way that unfairly advantages some subgroups. We have, therefore, shown that SEM is able to investigate potential sources of fairness violation in language assessment. Another SEM-based method to investigate fairness is multigroup SEM analysis, which has been discussed by Yoo and Manna (2015). Another approach to fairness is differential item functioning (DIF), which has been discussed in Chapters 4 and 5, Volume I. The SEM analysis also showed that, despite being only a test of receptive skills, the DELTA is able to strongly predict students' scores on the IELTS exam. Identifying the strength of the relationship between the DELTA and IELTS test scores supports the predictive validity claim of the DELTA, especially since IELTS is considered a strong external benchmark against which to compare test scores in this testing context. A limitation of the study is that the test scores were not taken at the same point in time (i.e., there was variation in the interval between the students' two test administrations), and this could be one reason why only 54.76% of the variance was explained. Despite this limitation, this study provides evidence that the DELTA is an appropriate test for students to self-monitor their English language proficiency level in preparation for the exit language test of the university. Thus, this study contributes to validity studies on diagnostic tests and shows that aggregate test scores can be used to predict success in other proficiency tests. The study also extends predictive validity research by showing the potential predictive relationship between diagnostic language tests and proficiency tests.

Conclusion This chapter introduced the application of SEM in the field of language assessment with an example of its application in identifying the predictive relationship between a diagnostic and a proficiency test. It is important to note that constructing a strong hypothesized model is of paramount importance to have substantive theoretical grounding underlying the models (Mueller, 1997). The chapter also proposes a five-stage analysis procedure (model specification, model identification, data preparation, parameter estimation, estimating fit and interpretation) to serve as a guideline for researchers to conduct their research and analysis. Language testing and assessment research benefits from the use of SEM since SEM is able to draw a clear map of the relationships among commonly appearing latent and observable variables in language testing, such as the test performance, test taker characteristics, skills or techniques involved in test, etc. SEM also enjoys the advantage of identifying both direct and indirect relationships among variables, thus providing opportunities to make predictions in language testing such as the sample case in this chapter. Meanwhile, the consideration of error of measurement in the model makes the results more accurate. SEM also has its own limitations. A possibility is that the model may produce possible unexplained results (i.e., model interpretation might not be possible)

120  Zhu, Raquel, and Aryadoust due to a variety of reasons (for example, small sample size). Kline (2015) proposed a sample size smaller than 100 as small, between 100 and 200 as medium, and exceeding 200 as large, but Raykov and Marcoulides (2006) stated that the sample size should be considered with the complexity of the model, arguing that the desirable sample size would be 10 times the number of free model parameters. However, researchers never came to a definite agreed “global fit” sample size. Another possible reason for unexplained results could be the missing data. Missing data result from different reasons, like hardware failure, nonresponses from items, test takers skipping test items, or other nonanticipated reasons, especially in longitudinal studies. The default ML estimation method in many SEM computer packages is unable to handle raw data files with missing data, so it is necessary to deal with missing data in SEM (Allison, 2003; Peters & Enders, 2002). One way is to delete or impute (replace missing data with, e.g., mean) the incomplete data, and the other way, which is used in AMOS, is to estimate mean and intercept as additional parameters. ML estimates with missing data can be carried out only if mean and intercept are also estimated. In summary, this chapter has presented SEM analysis and provided guidelines to carry out analysis. With increasingly more demands for the development and validation of new language assessments, SEM becomes one of the most useful instruments available to researchers for a variety of purposes such as confirming the underlying structure of tests and exploring the predictive power of language assessments.
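Returning to the missing-data options mentioned in the paragraph above, the two simplest strategies (list-wise deletion and mean imputation) can be illustrated with a toy data set. The sketch below uses pandas and invented scores; neither strategy is being recommended here over the full-information approach implemented in AMOS.

```python
import numpy as np
import pandas as pd

# Toy data with one missing IELTS reading score (invented values)
df = pd.DataFrame({
    "DELTA": [112.0, 98.0, 125.0, 104.0],
    "IELTS_Reading": [6.5, np.nan, 7.0, 6.0],
})

listwise = df.dropna()               # list-wise deletion: drop incomplete cases
mean_imputed = df.fillna(df.mean())  # simple mean imputation

print(len(df), "cases before deletion;", len(listwise), "after")
print(mean_imputed)
```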

Note 1. For more details on the band descriptors, please see https://www.ielts.org/ielts-for-organisations/ielts-scoring-in-detail

References Aina, J. K., Ogundele, A. G., & Olanipekun, S. S. (2013). Students’ proficiency in English language relationship with academic performance in science and technical education. American Journal of Educational Research, 1(9), 355–358. doi:10.12691 ⁄education-1-9-2 Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. Alavi, M. & Ghaemi, H. (2011). Application of structural equation modeling in EFL testing: A report of two Iranian studies. Language Testing in Asia, 1(3), 22. doi:10.1186 ⁄  2229-0443-1-3-22 Alavi, S. M., Kaivanpanah, S., & Masjedlou, A. P. (2018). Validity of the listening module of international English language testing system: multiple sources of evidence. Language Testing in Asia, 8(1), 8. doi:10.1186 ⁄s40468-018-0057-4 Allison, P. D. (2003). Missing data techniques for structural equation modeling. Journal of Abnormal Psychology, 112, 545–557. Arbuckle, J. L. (2003). AMOS 5.0.1. Chicago, IL: Smallwaters Corp. Arbuckle, J. L., & Wothke, W. (1999). AMOS 4.0 User’s Guide. Chicago, IL: Smallwaters Corp.

Structural equation modeling 121 Aryadoust, V. (2013). Building a validity argument for a listening test of academic proficiency. Newcastle upon Tyne, UK: Cambridge Scholars Publishing. Aryadoust, V., & Goh, C. (2013). Exploring the relative merits of cognitive diagnostic models and confirmatory factor analysis for assessing listening comprehension. In E. D. Galaczi & C. J. Weir (Eds.), Studies in Language Testing Volume of Proceedings from the ALTE Krakow Conference, 2011 (pp. 405–426). Cambridge: University of Cambridge ESOL Examinations and Cambridge University Press. Aryadoust, V., & Liu, S. (2015). Predicting EFL writing ability from levels of mental representation measured by Coh-Metrix: A structural equation modeling study. Assessing Writing, 24, 35–58. doi: https: ⁄ ⁄doi.org ⁄ 10.1016 ⁄  j.asw.2015.03.001 Bae, J., & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testing factorial invariance across two groups of children in the Korean ⁄ English two-way immersion program. Language Testing, 15(3), 380–414. doi:10.1177 ⁄ 026553229801500304 Bae, J., & Bachman, L. F. (2010). An investigation of four writing traits and two tasks across two languages. Language Testing, 27(2), 213–234. doi:10.1177 ⁄ 0265532209349470 Bae, J., Bentler, P. M., & Lee, Y.-S. (2016). On the role of content in writing assessment. Language Assessment Quarterly, 13(4), 302–328. doi:10.1080 ⁄ 15434303.2016.1246552 Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246. Bentler, P. M. (2008). EQS program manual. Encino, CA: Multivariate Software, Inc. Bentler, P. M., & Chou, C. P. (1987). Practical issues in structural modeling. Sociological Methods and Research, 16, 78–117. Bentler, P. M., & Huang, W. (2014). On components, latent variables, PLS and simple methods: Reactions to Rigdon’s rethinking of PLS. Long Range Planning, 47(3), 138–145. Bollen, K. A. (1989). A new incremental fiIt index for general structural equation models. Sociological Methods and Research, 17(3), 303–316. Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). Mahwah, NJ: L. Erlbaum. Breckler, S. J. (1990). Applications of covariance structure modeling in psychology: Cause for concern? Psychological Bulletin, 107, 260–273. Breeze, R., & Miller, P. (2011). Predictive validity of the IELTS listening test as an indicator of student coping ability in Spain. In J. Osborne (Ed.), IELTS Research Reports (Vol. 12, pp. 1–34). Melbourne, Australia: IDP: IELTS Australia and British Council. Brown, A. (2006a). Candidate discourse in the revised IELTS speaking test. In P. McGovern & S. Walsh (Eds.), IELTS Research Report (Vol. 6, pp. 71–90). Canberra, Australia: IELTS Australia, Pty Ltd and British Council. Brown, A. (2006b). An examination of the rating process in the revised IELTS speaking test. In P. McGovern & S. Walsh (Eds.), IELTS Research Report (Vol. 6, pp. 41–70). Canberra, Australia: IELTS Australia, Pty Ltd and British Council. Brown, T. A. (2006). Confimatory factor analysis for applied research. New York: Guilford Press. Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods and Research, 21(2), 230–258. Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Thousand Oaks, CA: Sage. Bollen, K. A. (1989). Structural equations with latent variables. 

6 Student growth percentiles in the formative assessment of English language proficiency

Husein Taherbhai and Daeryong Seo

Introduction

In recent years, a paradigm shift in language education has emphasized the importance of assessment that provides information for formative purposes, where relevant feedback can be used to minimize the gap between actual and desired levels of performance (Perie, Marion, & Gong, 2009; Poehner & Lantolf, 2005; Seo & Taherbhai, 2012). While simple achievement scores do provide a diagnostic aspect to the differential performance of student achievement, they do not provide information that can be integrated into effective learning and teaching strategies (Roberts & Gierl, 2010).
In order to understand how students perform from one year to another, their scores measuring the same latent trait (e.g., math) have to be placed on the same scale across years. When such a scale is extended to span a series of grades, it is known as a vertical or a developmental scale. This type of scale is common in assessing growth for students (Young & Tong, 2016), including English language learners (ELLs), as shown by authors such as LaFlair, Isbell, May, Gutierrez Arvizu, and Jamieson (2017) and Saida (2017). Observing the difference between two scores of a student on a vertical scale provides an evaluation of "how much" the student has grown in the underlying construct.
It should be noted that scores on a vertical scale are "status" scores (i.e., achievement at a point in time). These scores provide a summative understanding within a criterion-referenced assessment system that flags students who have not been able to achieve a dichotomously separated "acceptable" score (Seo & Taherbhai, 2012). For example, if adequate growth is defined as 60 points for all students, or for students within a certain range of scores (e.g., between 350 and 360 score points) on a vertical scale, then growth below 60 points is flagged as "inadequate." More importantly, under these criteria, each student is compared with students who have somewhat similar current scores but different learning curves and different pre-test scores. Ironically, it is on this expected level of progress that lesson plans sometimes are developed for all students within the score range, irrespective of their differential propensity to learn. However, to meaningfully interpret students' growth, Lissitz and Doran (2009) point out that additional information is required.
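As a simple illustration of the status-score growth criterion described above (a fixed gain of 60 vertical-scale points), the following is a minimal sketch; the data set and variable names are hypothetical and the 60-point rule is only the example used in the text.

   data growth_flags;
      set scores;                        /* hypothetical data set: one row per student        */
      gain = score_2009 - score_2008;    /* difference of two status scores on the vertical scale */
      if gain < 60 then growth_flag = "inadequate";
      else growth_flag = "adequate";
   run;

Such a flag treats all students within the score range identically, which is precisely the limitation the SGP approach is designed to address.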

In the past, the primary effort in the analysis of growth has been to use prior student achievement to disentangle effectiveness (e.g., of teachers, of schools) from aggregate levels of achievement (Betebenner, 2007; Braun, 2005). The use of one or more prior scores depicts a trend line that indicates students' growth based on empirical evidence substantiated by the relationships among students' history of scores. Generally speaking, two performance tasks (including one prior score) are necessary, although a larger number of prior scores can provide more accurate results (Sanders, 2006). The reason for using prior scores as a proxy for all variables that affect students' learning is that these variables can be seen as nested within each student's performance.
While some could argue that comparing students with the same current scores would confirm the argument made previously, such a comparison does not account for what would happen in the future, because a single score point in time does not produce a trend line for future expectations. And while pre- and post-tests do individualize students' progress based on their past performance, they do not allow for a comparison of student scores with individuals who have the same learning curve. It is in this context that student growth percentiles (SGPs) are very attractive (Betebenner, 2007). SGPs can be obtained using statistical methods such as quantile regression (QR) or ordinary least squares (OLS) approaches. In this chapter, following Betebenner (2007), we have used the QR method in our analysis because QR methods directly model conditional percentiles of the target variable.
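To make the QR step concrete, the following is a minimal sketch of how conditional percentiles of a current-year score could be estimated from prior-year scores with PROC QUANTREG. This is not the code in Appendix A, and the data set and variable names (scores, score_2005 through score_2009) are hypothetical.

   /* One quantile-regression fit per prespecified percentile (0.10 to 0.90).  */
   /* Each fit yields a predicted "entry" score at that percentile,            */
   /* conditional on the student's history of prior scores.                    */
   proc quantreg data=scores;
      model score_2009 = score_2005 score_2006 score_2007 score_2008
            / quantile=0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9;
      output out=entry_scores predicted=entry_;   /* one predicted column per requested quantile */
   run;

A student's SGP is then read off by locating his/her observed 2009 score among these predicted entry scores for his/her own conditioning history.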

Literature review

Student growth percentiles

The conceptualization of SGPs rests on evaluating growth within the structure of others who have identical pre-test scores, referred to as "peers" by Betebenner (2007). Peers, in this chapter, will also follow Betebenner's (2007) definition: those students who have a history of identical scores, that is, those who have the same learning curves at the particular time of measurement. With the use of SGPs, students are evaluated based on their propensity to achieve, which would be different for most students except for the students' peers. Thus, the use of SGPs can help teachers individualize each student's requirements based on students who are identical in their history of performance rather than on an approximation developed by comparing "similarly" performing students at a point in time who have different learning curves. SGPs thus provide a truer picture of each student's propensity for progress. In other words, students are not given tasks that may be out of their reach or too easily attainable, which might be detrimental to their learning experiences.

For example, a student with a score of 350 who has already achieved the target score of, say, 325, but who is performing only in the 10th percentile when measured against his/her peers, would in all likelihood fall under the radar for intervention because he/she has achieved the target score for passing. However, this student is underperforming based on his/her propensity to achieve, as evaluated through a comparison with his/her peers. The important thing to note here is that this student has a history of high achievement but, because of a lack of motivation or for some other reason, is not performing to his/her full capacity. Merely making him/her work "harder" might do the trick in the classroom, but motivating the student through counseling may also be necessary to keep the student progressing on the trajectory created by his/her prior scores. Similarly, if the same student were already performing at a very high level with respect to his/her peers (say, in the 80th percentile) but still had not achieved proficiency, this student would likely not need the same classroom intervention techniques, because the student is doing his/her best as defined by his/her history. What likely should be done is to change the future trajectory of this student through some serious intervention that may not be entirely in the hands of the classroom teacher. Perhaps the student is not happy with his/her home life, and that has an impact on his/her performance. Such questions are best left to the school psychologist rather than addressed through common classroom intervention techniques.
This type of assessment, as discussed in this chapter, is best served with the use of SGPs because it allows an appropriate lesson plan to be drawn for each student based on how he/she is doing relative to his/her peers, instead of depending on value-added scores that generally compare "similar" students rather than "identical" performing students. However, finding a reasonable number of "peers" to match each student's history of achievement scores for norm-referenced analysis is next to impossible in a classroom, a district, or even a state. In this context, SGPs can be very useful because SGP analysis estimates student percentile scores based on a statistically formulated number of peers for each student.
To understand SGPs, we first need to understand what a percentile means. The percentile rank of a score is the percentage of scores in an assessment that are equal to or lower than it. For example, if a student scored 350 and is in the 75th percentile, it means that 75% of the students who took the test scored 350 or below, and 25% of the students scored above 350. Thus, percentiles can be calculated for any test at a particular point in time. However, it is the conditional aspect based on prior tests that indicates the probability of growth in percentiles for the model. If prior scores are removed from the discussion, the SGP would simply be the percentile rank for students based on the current administration. In this context, because SGPs measure students' achievement against others who have an identical history of achievement, SGPs are likely to differ even for students with identical current scores (Taherbhai, Seo, & O'Malley, 2014). It should be noted that students' prior scores are used solely for conditioning purposes in the calculation of their growth percentiles.

This provides a major advantage in measuring change because these scores do not require ranking on a vertical scale. From a practical point of view, SGPs have a shorter turnaround period (no vertical equating required) and need only a longitudinal scale, which is not necessarily a vertical scale. Furthermore, as will be discussed later, each student's individualized SGP can also be used to determine the student's propensity for achieving the target score. This information is especially useful for formative assessment.
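The unconditional percentile rank described above can be illustrated with a short sketch; the data set and variable names are hypothetical. With TIES=HIGH, the percentage rank of a score is the percentage of scores equal to or lower than it, which matches the definition used here.

   /* Unconditional percentile ranks for the current administration only.       */
   proc rank data=scores out=ranked ties=high percent;
      var score_2009;
      ranks pr_2009;    /* pr_2009 = percentage of students scoring at or below score_2009 */
   run;

The SGP differs from this unconditional rank precisely because it conditions on each student's prior scores.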

Formative assessment

Tognolini and Stanley (2011) state that all aspects of collecting information, whether unstructured (e.g., conversing with the student at a chance meeting), slightly structured (e.g., observation of a student in the classroom), more structured (e.g., classroom tests), or "most structured" (e.g., standardized tests), can be viewed as an assessment. It should be noted that the difference between formative and summative information lies in the inferences made from the information obtained (Seo, McGrane, & Taherbhai, 2015). In the summative phase, the information is used to describe the status of the student at a particular point in time. In the formative phase, the result from the test is used as an indicator for future action in allocating resources to remedy or further students' progress across different levels of achievement (Black & Wiliam, 2009). The scores obtained from the current test can be seen as an outcome of formative actions undertaken using the prior year's summative assessment (Gronlund, 1998). In both phases, however, student behavior on the assessment can be used as evidence with regard to the progress in achievement from the prior phase (Nichols, Meyers, & Burling, 2009). In other words, summative assessment can be used for formative purposes (Black & Wiliam, 2009).
While the summative aspect of an assessment is important in assigning scores to students and for the purpose of accountability, the formative use of an assessment is needed for the evaluation of progress in students' learning (Black & Wiliam, 2009; Dodge, 2009; Seo et al., 2015). All types of assessments used for formative purposes (formal and informal) are mainly based on an evaluation of what is taught and what is understood. It is a process where future learning strategies are formed based on presently available information (Black & Wiliam, 2009). It should be noted, however, that such formative uses of the assessment system rest on an ongoing process in which the prediction of progress in students' learning can constantly be assessed, revised, and effectively used for current instructional activities. In current practice, however, assessment is often used for formative purposes merely as an evaluation of the distance between what has been achieved and a predetermined target/cut score. Because each student comes with a unique history of achievement, simply assigning similar or the same learning strategies to students with the same current score is hard to accept.

It is in this context that the depth of information provided by SGPs is very useful, particularly in the formative evaluation of subjects such as language achievement.

English language proficiency

Assessing students' language skills is very important in academia because it could affect students' learning in other language-laden subjects such as social studies and language arts. Proficiency in language acquisition needs to be evaluated not only in terms of overall English learning but also in each modality of language acquisition (i.e., listening, speaking, reading, and writing), all of which are indispensable in evaluating each student's individualized learning program. According to Abedi (2008), however, most states in the United States use a compensatory model for assessing students' language acquisition. Unlike conjunctive models, where students must achieve "targets" in each of the modalities of English language proficiency (ELP) to be considered proficient, English language learners (ELLs) assessed with the compensatory model may not be proficient in an aspect of language acquisition when they are mainstreamed into a non-ELL classroom. For example, a student may not be proficient in reading and yet be considered proficient in overall language proficiency (Abedi, 2008; Seo & Taherbhai, 2012; Van Roekel, 2008).
Furthermore, in the past, ELLs' language progress has been based upon the difference between a student's current score and his/her prior score measuring the same construct (Wang, Chen, & Welch, 2012; WIDA Consortium, 2013). In this context, ELLs who show improvement as measured on a vertical scale may be vastly underperforming based on their propensity to achieve. For example, a student, based on his/her current performance, could be seen as low achieving in, for example, listening, when in fact the student is performing above average in the subject when compared with his/her peers. The same student, however, could be performing far below average in, for example, reading, when compared with his/her peers. Recognizing that students achieve differently, and knowing what these differences mean for learning and teaching in classrooms, may instill realistic goals for ELLs' language proficiency.
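The distinction between the two scoring models can be made concrete with a small sketch. The cut scores shown are those used later in this chapter's example; the data set and variable names are hypothetical.

   /* Compensatory rule: only the total score must clear its cut.               */
   /* Conjunctive rule: every modality must clear its own cut.                  */
   data prof_class;
      set scores;
      compensatory = (total    >= 390);
      conjunctive  = (reading  >= 391) and (writing   >= 411) and
                     (speaking >= 382) and (listening >= 402);
   run;

A student can satisfy the compensatory rule while failing the conjunctive rule, which is exactly the situation described above for reading.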

Student growth percentile (SGP) for formative assessment

Betebenner's (2007) SGP model, which uses QR (Koenker, 2005), can be used as a formative assessment tool to compare students with identical prior histories (i.e., their peers). The outcomes of an SGP analysis for formative assessment can be highlighted as follows:
i   It provides a norm-referenced understanding of students' growth based on those students with an identical history of prior scores (i.e., their peers).
ii  A student's current score can be used as a check to indicate the percentile in which that score lies.

iii Based on the estimated scores required to attain different percentile rankings, it is possible to observe what score a student would need to attain a criterion-referenced, predetermined passing score (i.e., the student's propensity to achieve), or to examine the student's performance relative to the predetermined percentiles in comparison with the performance of his/her peers.

Figure 6.1 provides a hypothetical visual depiction of the SGPs for a student. To understand the workings of the SGP without getting into the complex mathematical formulation, refer to the figure, which is based on the following assumptions:
i   The student's current score is 363.
ii  The target score for proficiency is set at 390, which implies that the student has not reached proficiency.
iii For ease of depiction, only one prior score is shown.
To start an SGP analysis, one first needs to specify which percentiles are of interest to the educator. In most such cases, we have found that the 10th to the 90th percentiles, in increments of 10 percentile points, work well in assessing where a student's percentile ranking is and what it would need to be to achieve other percentile rankings based on the identical scores of his/her peers.

Figure 6.1  Student's score estimates at predetermined percentiles conditioned on a prior score. (This figure is adapted from Seo, McGrane, & Taherbhai, 2015. While this student is functioning in the 30th percentile [dashed line], he/she needs to perform in the 80th percentile to meet proficiency [dash-dot-dot line].)

In Figure 6.1, the right side depicts the estimated scores needed for membership in the predetermined percentiles (i.e., the 10th to the 90th percentile). The history of scores for all students is used to obtain regression coefficients for estimating the scores that fall within each of the predetermined percentiles. As an example, the coefficients at the 50th percentile are presented later in this chapter. Using these coefficients for each of the prior scores, each student's trajectory for attaining a predetermined percentile is calculated. For example, in Figure 6.1, let us assume that, based on the performance of all students, the coefficient for attaining the 60th percentile for the one pre-test shown in the figure is 1.0496 (rounded), and the derived intercept is 20. For the particular student depicted in Figure 6.1, the entry-level score at the 60th percentile would be Y = 20 (intercept) + 343 × 1.0496 (slope), which is approximately 380, the student's estimated entry score for the 60th percentile (see Figure 6.1). It should be noted that, generally speaking, a greater number of prior scores (a minimum of two is recommended) provides better estimates of the entry scores at different percentiles.
The student's current score (i.e., 363) is matched to the scores needed for membership in the different percentiles. Since the student's current score falls at or above 361 (the entry score for the 30th percentile) but below 367 (the entry score for the 40th percentile), the student is said to perform in the 30th percentile with respect to his/her peers. This is the student's growth percentile in comparison with his/her peers. Furthermore, because estimates for the other percentiles are also provided, one can identify the score a student would need in order to be in any of the other predetermined percentiles. For example, to achieve the target score for proficiency set at 390, this student would have to perform in the 80th percentile in comparison with his/her peers.
The SGP model makes no distributional assumptions. Furthermore, the model is robust in handling extreme values and outliers of the target variable. It is also insensitive (equivariant) to any monotonic transformation of the independent variable (Betebenner, 2009). Fit, using the QR approach, can be examined with the goodness-of-fit (GOF) statistic together with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). However, for low-stakes uses of the data, such as formative assessment, fit indices are not as important as they would be for, say, accountability or research purposes. Because this chapter is specifically targeted at the use of SGPs in formative assessment, we have not provided information on these indices or included them in the example provided in the Method section. However, for those who would like to explore fit indices used for the QR model in producing SGPs, please refer to Fu and Wu (2017) for the Statistical Analysis System (SAS) codes to obtain these indices. While the mathematical model for SGPs is complex, the SAS codes provided in Appendix A are quite simple to implement (see Choi, Suh, & Ankenmann [2017] for further information on the SAS codes).

As stated earlier, SGPs are obtained by first determining which percentile categories are of interest. For example, some researchers may only want to know what score a student needs to obtain to be in the 50th percentile when compared with his/her peers; in such a case, only the 50th percentile needs to be specified in the SAS code (see the sketch below). The predetermined criterion for setting adequate progress, however, does not answer an important question: how is "adequate" performance defined? In most large-scale criterion-referenced tests, such classifications of students' achievement are based on standard-setting efforts. Cizek and Bunch (2006) detail various standard-setting types and processes. In the same manner as for achievement tests, the classification of SGPs as "adequate," "good," or "enough" is similar to a standard-setting procedure (Betebenner, 2007) that could differ from one assessment to another. In the United States, in Colorado, a student's growth is evaluated by comparing his/her growth with that of the student's peers using SGPs (McGrane, 2009). For a school or district in Colorado, SGPs are summarized using their median to create a median growth percentile for the school or district (McGrane, 2009). In a typical, low-stakes, formative ELL assessment, teachers' knowledge of their students could be relied on to classify adequate performance for the district, keeping in mind that expensive standard-setting procedures are also burdened with a certain amount of subjectivity.
It should be noted that while the SGP technique has been used in other industries and research fields, such as health care and financial economics, it has only recently been used in the educational field. For example, the Colorado Department of Education started using SGPs in 2014 as a teacher and school accountability model and only very recently has used them as a student growth model. In the next section, we provide an example of how to carry out an SGP analysis using actual data from an ELP examination.
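For instance, a minimal sketch of restricting the analysis to the 50th percentile might look as follows; the data set and variable names are hypothetical, and this does not reproduce the Appendix A code.

   /* Only the median (50th percentile) entry score is requested here.          */
   proc quantreg data=scores;
      model score_2009 = score_2005 score_2006 score_2007 score_2008 / quantile=0.5;
      output out=median_entry predicted=entry50;
   run;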

Sample study: SGP in the formative assessment of English language proficiency

Data collection and analysis requirements

The first step in carrying out an SGP analysis in a language assessment context is to collect total ELP and modality scores over a predetermined number of years. It should be noted that the larger the number of data collection points, the more accurate the results. Once the data are collected, the bands of quantiles need to be specified. These quantile levels, as stated earlier, are used to assess what entry-point scores are required for students to achieve a certain percentile rank. In this example, total and modality ELP score estimations are set from the 10th percentile all the way to the 90th percentile in increments of 10 percentile points, that is, 0.10, 0.20, 0.30, and so on. These percentiles would likely provide enough discrimination in analyzing each student's unique percentile trajectory for formative purposes (Taherbhai et al., 2014).

In order to facilitate our discussion, we present an example using the SAS 9.4 software program (see Appendix A). The purpose of the example is to evaluate ELLs' language progress based on:
i   Norm-referenced inferences—the examination of each student's SGP for the total and modality scores of the ELP examination to see how well the student grew linguistically vis-à-vis his/her peers.
ii  Criterion-referenced inferences—the evaluation of the SGPs needed for nonproficient ELLs to achieve higher proficiency and for proficient students to maintain their SGP ranking for the total and modality ELP scores.

Participants

The sample analyzed for the study consists of 1,000 ELLs who had been administered a large-scale ELP assessment in the United States since 2005. For our analyses, we selected students who were first graders in 2005 and who had been in the ELP program for the five years since then. These students' fifth administration, in 2009, was treated as their most current administration, while the four administrations from 2005 to 2008 served as the prior tests that condition their current-year score.

Instrument

The ELP test used in the study consisted of four modalities: speaking, listening, reading, and writing. The tests were created to measure the English language progress of students whose primary home language was not English. Five ELP test scores (i.e., one total and four modality scores) for each student for five consecutive years (i.e., 2005 to 2009) were used in this study. While scores on a vertical scale are not a requirement for SGP analyses, the scores of the ELP test in this example were on a vertical scale.
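For readers who want to set up the input file, the following is a minimal sketch of the expected data layout: one record per student, with the total and modality scores for each of the five years. The file path and all variable names are hypothetical and are not those used in Appendix A.

   data elp_scores;
      infile "elp_scores.csv" dsd firstobs=2;   /* hypothetical comma-delimited file */
      input student_id
            read_2005-read_2009
            writ_2005-writ_2009
            spk_2005-spk_2009
            list_2005-list_2009
            total_2005-total_2009;
   run;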

Setting the proficiency cut scores

Generally speaking, proficiency cut scores are set as criterion-based targets for classifying student achievement on the underlying construct being measured. We used the proficiency cut score for the total test as the score at which ELL students would no longer be required to attend ELL classes, because they have gained sufficient language ability to function effectively in academic contexts. For the four language-learning modalities, each target cut score was used to indicate proficiency in that particular modality. The proficiency scale-score cuts for the ELP examination used in the present study are 390 for the total, 391 for reading, 411 for writing, 382 for speaking, and 402 for listening. It is important to note that although defining adequate growth is also a standard-setting activity, SGP = 0.50 will be treated as a subjective indication of adequacy among peers in this chapter because the 50th percentile rank is midway between the ranks above and below it.
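Where these cuts are reused across several steps, they can be held in macro variables, as in the minimal sketch below; the macro-variable names are hypothetical.

   /* Proficiency scale-score cuts used in this example.                        */
   %let cut_total     = 390;
   %let cut_reading   = 391;
   %let cut_writing   = 411;
   %let cut_speaking  = 382;
   %let cut_listening = 402;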


Figure 6.2  An example of smoothing the data set.

Smoothing the quantile function

When there are large fluctuations in students' scores on any of the five tests included in the model, the accuracy of the SGP analysis can suffer. To overcome such an uneven distribution of scores, a B-spline parameterization for smoothing the trend line, according to Harrell (2001), provides excellent fit and seldom leads to "estimation problems" (p. 20). This smoothing technique is incorporated within the SAS program outlined in Appendix A. Figure 6.2 shows an extreme case of fluctuations in which a nonlinear smoothed line is fitted to the actual scores of eight students taking a hypothetical test.
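As an illustration of how such smoothing can be requested, the sketch below applies a B-spline basis to a prior score through the EFFECT statement, assuming a SAS/STAT release in which PROC QUANTREG supports spline effects. This is not the Appendix A code, and the variable names are hypothetical.

   /* B-spline smoothing of a conditioning variable within the quantile regression. */
   proc quantreg data=elp_scores;
      effect sm2008 = spline(total_2008 / basis=bspline degree=3);
      model total_2009 = sm2008 / quantile=0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9;
      output out=smoothed_entry predicted=entry_;
   run;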

Student growth percentile (SGP) analyses

As stated earlier, bands of quantiles were prespecified for the total ELP score and each modality (reading, listening, writing, and speaking) from the 10th percentile all the way to the 90th percentile in increments of 10 percentile points, that is, 0.10, 0.20, 0.30, and so on. These bands were incorporated in the SAS program. Summary statistics, including the mean, median, and standard deviation, are provided in Table 6.1. Students' current-year (i.e., Year 2009—see the "Participants" section) scores were examined vis-à-vis the predicted total ELP and modality scores needed for entry into each percentile. The percentile band that harbored each fifth-year modality score and the total ELP score became the student's growth percentile rank for that modality and for the total score, respectively. The established proficiency cuts for the total ELP examination and each of the four modalities were then examined for their location, that is, the percentile in which they resided, to determine the SGP required for students to maintain or to achieve proficiency under the assumption of the same growth pattern over the years (i.e., holding prior performance constant). A sketch of this matching step is given below.
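The following is a minimal sketch, under the assumptions above, of how a current score can be matched against the nine predicted entry scores to assign an SGP band and to locate the band needed to reach a proficiency cut. The variable names (entry_1 through entry_9 from an earlier PROC QUANTREG output, score_2009, and the cut value of 390) are hypothetical.

   data sgp_bands;
      set entry_scores;                          /* output of the quantile regressions             */
      array entry{9} entry_1-entry_9;            /* predicted entry scores for the 10th-90th pctl. */
      sgp_band = 0;                              /* 0 = below the 10th percentile                  */
      do k = 1 to 9;
         if score_2009 >= entry{k} then sgp_band = k * 10;   /* highest band whose entry score is met */
      end;
      /* lowest band whose entry score reaches the proficiency cut (here, total = 390) */
      band_for_prof = .;
      do k = 9 to 1 by -1;
         if entry{k} >= 390 then band_for_prof = k * 10;
      end;
      drop k;
   run;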

This kind of analysis could be a very useful tool for teachers in the context of formative assessment because it indicates what growth is expected of each student to maintain or attain proficiency, which in turn helps them formulate teaching strategies in terms of how much differential effort each of these students would require.

Results

Summary statistics

In Table 6.1, the student mean and median scores for the total and the modality scores increase over time. This is expected because all the tests, as stated earlier, are on the same vertical scale. The standard deviations for the total and each modality score are close to one another, except for Writing 2006 (27.8353) and Speaking 2005 (64.0931).
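Summary statistics such as those in Table 6.1 can be produced with a short step like the one below; the data set and variable names are hypothetical.

   /* Mean, median, and standard deviation for every score variable, as in Table 6.1. */
   proc means data=elp_scores mean median std maxdec=4;
      var read_2005-read_2009 writ_2005-writ_2009 spk_2005-spk_2009
          list_2005-list_2009 total_2005-total_2009;
   run;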

Table 6.1  Summary statistics of five years' ELP assessment: Total and modality scores

Variable (Year)        Median   Mean    SD
Reading (2005)         282.0    278.6   42.1803
Reading (2006)         329.5    331.0   43.0389
Reading (2007)         350.8    349.5   39.5536
Reading (2008)         376.8    375.9   47.4503
Reading (2009)         403.0    402.6   47.0733
Writing (2005)         304.0    290.6   39.6533
Writing (2006)         331.7    334.2   27.8353
Writing (2007)         365.8    359.1   36.3533
Writing (2008)         396.8    385.5   49.8195
Writing (2009)         420.0    415.9   53.0396
Speaking (2005)        302.7    313.2   64.0931
Speaking (2006)        361.7    353.6   48.3558
Speaking (2007)        413.2    404.4   53.2045
Speaking (2008)        424.0    407.6   53.9050
Speaking (2009)        420.0    426.6   46.0495
Listening (2005)       310.7    326.1   48.0251
Listening (2006)       350.5    356.0   42.3425
Listening (2007)       372.5    374.7   36.6014
Listening (2008)       400.5    398.1   37.9449
Listening (2009)       410.0    422.1   41.6613
Total Test (2005)      294.5    279.8   36.2909
Total Test (2006)      316.4    317.8   24.5508
Total Test (2007)      346.7    346.7   26.2381
Total Test (2008)      371.4    364.9   32.3474
Total Test (2009)      392.0    389.1   33.3038

Note: SD = standard deviation

Regression coefficients for the predetermined percentiles

The SAS program provides regression coefficients for each of the predetermined percentiles. Table 6.2 depicts the regression coefficients at the 50th percentile.

Table 6.2  Regression coefficient estimates of the SGP model at the 50th percentile

Parameter                  DF   Estimate    Standard Error
Intercept for Reading      1    131.713     16.6789
Reading (2005)             1    −0.0445     0.0446
Reading (2006)             1    0.1074      0.0351
Reading (2007)             1    0.2462      0.0397
Reading (2008)             1    0.4259      0.0325
Intercept for Writing      1    102.045     20.8958
Writing (2005)             1    −0.0676     0.0365
Writing (2006)             1    0.2264      0.0706
Writing (2007)             1    0.146       0.0432
Writing (2008)             1    0.5162      0.0227
Intercept for Speaking     1    181.26      17.5235
Speaking (2005)            1    0.0259      0.0121
Speaking (2006)            1    0.0394      0.0341
Speaking (2007)            1    0.1005      0.0322
Speaking (2008)            1    0.4664      0.0235
Intercept for Listening    1    132.871     16.1998
Listening (2005)           1    0.0745      0.0215
Listening (2006)           1    0.0353      0.0264
Listening (2007)           1    0.2141      0.0391
Listening (2008)           1    0.424       0.0371
Intercept for the Total    1    60.968      12.4649
Total (2005)               1    −0.0251     0.0231
Total (2006)               1    0.1386      0.0395
Total (2007)               1    0.2453      0.0439
Total (2008)               1    0.5617      0.0337

As shown in Table 6.2, the Year 4 score (2008) had the highest regression coefficient with respect to the total as well as the modality scores, relative to the Year 1 (2005), Year 2 (2006), and Year 3 (2007) scores. This indicates that this predictor was the most influential out of the four predictors across the total and the modality scores. This is intuitively understandable as students’ current behavior can be best estimated by their most immediate prior behavior. Furthermore, as can be seen in Table 6.2, the effect of the prior tests decreases with each preceding year, that is, the coefficients are much lower for the first few years compared to the most recent year, that is, Year 4 (2008). Some of the coefficient estimates, especially in Year 1 (2005), are close to zero, which implies they have very little influence on the prediction of Year 5 (2009) scores.

Examination of students' scores and their propensity to achieve proficiency for the modality and total scores

Table 6.3 depicts examples of students' current scores and their estimates for performing in each of the predetermined percentiles.

Table 6.3  Examples of students' predicted scale scores across the specified percentiles

Passing Cut  Subject    Student ID  Year 2009  10th  20th  30th  40th  50th  60th  70th  80th  90th
391          Reading    39          382        372   390   399   408   416   424   435   445   470
391          Reading    40          382        340   349   358   368   377   383   389   402   412
411          Writing    58          390        349   359   365   371   379   385   395   410   430
411          Writing    59          390        377   386   393   400   408   417   427   439   461
382          Speaking   127         335        343   353   366   377   388   401   470   470   470
382          Speaking   133         335        379   390   404   429   443   450   470   470   470
402          Listening  247         395        405   414   425   435   444   451   462   475   503
402          Listening  248         395        367   381   389   395   401   407   416   423   442
390          Total      34          364        350   358   365   369   374   380   386   395   402
390          Total      35          364        384   390   398   404   410   415   422   429   440
390          Total      731*        414        408   412   418   424   431   437   441   449   461
390          Total      734*        414        378   387   394   399   404   410   418   427   438

Note: *Students 731 and 734 have the same scale score and have passed the performance level cut (i.e., 390), but their estimates for entry into different percentiles are not the same.

In these examples, groups of two students with identical current scores, which are lower than the proficiency cut scores, are compared with each other with respect to their entry into the predetermined percentiles. The only exceptions are Student ID #731 and ID #734, who have the same current score but have both achieved proficiency.
In the table, the passing cut for reading is 391. Student ID #39 has a current score of 382 (Year 2009), which places him/her in the 10th percentile. Assuming that the expectation is for a student to score at the median, that is, the 50th percentile when compared with his/her peers, this student is underperforming. In this context, even if Student #39 achieved the required proficiency (≥ 391), he/she would still be underachieving because that would place this student only in the 20th percentile. In other words, this student has the propensity to achieve higher (i.e., at the 50th percentile) based on history, which indicates that he/she is only 30 percentile points away from performing in the 50th percentile when compared with that student's peers. On the other hand, Student ID #40, who has the same current score as Student ID #39, has already achieved the 50th percentile but has not met the proficiency score set at 391. This student has a very flat trend line dictated by his/her historical performance and would need to perform in the 80th percentile to achieve proficiency. This student has an uphill task in achieving proficiency that may require counseling and other intervention techniques, besides tutorial help, to change the slope of his/her trend line.
In writing, Student ID #58 and Student ID #59 both had an identical current score (390), which was below the established proficiency cut (411). Student ID #58, who is performing in the 60th percentile, needs to perform in the 90th percentile, when compared with that student's peers, to achieve the required proficiency. Scoring in the 90th percentile is a rather steep requirement in absolute terms but could be possible because the student is already performing in the 60th percentile. On the other hand, Student ID #59 needs to perform in the 60th percentile to achieve the required proficiency, holding his/her prior performance trend constant, even though he/she is currently performing below the median, in the 30th percentile, in comparison with his/her peers.
In speaking, neither Student ID #127 nor Student ID #133 achieved the required proficiency with their scores (335) in comparison with the passing score (382). However, Student ID #127 is only performing in the 10th percentile. Based on this student's history, the target of 382 is achievable at the 50th percentile, where he/she would not only achieve proficiency but also be working as an "average" student when compared with his/her peers. By the same token, Student ID #133 is also scoring in the 10th percentile but would require just a little extra effort, performing at the 20th percentile, in order to pass the test. In doing so, however, the student would still be performing below the median performance of his/her peers. In other words, this student has the propensity to do much better, perhaps with extra effort or an intervention technique that would address his/her poor performance vis-à-vis the average score of his/her peers.
In listening, Student ID #247 and ID #248 also have not achieved the required proficiency.

While Student ID #248 is already performing in the 40th percentile, he/she needs the 60th percentile to achieve the required proficiency. On the other hand, Student ID #247, who needs to perform in the 10th percentile to achieve the passing score (402), can do much better because he/she is currently performing much below the median, in the 10th percentile, when compared with his/her peers.
With respect to total scores, the same type of comparison can be made between Student ID #34 and Student ID #35. Student ID #34, who has scored in the 20th percentile, is performing below the median with respect to his/her peers, but his/her history indicates that this student will have to do even better in comparison with his/her peers to achieve the required proficiency, which lies in the 80th percentile. On the other hand, Student ID #35 has scored below the 10th percentile but needs to perform only in the 20th percentile to reach proficiency. As in some of the earlier examples, this student could do much better with respect to his/her propensity to achieve, since he/she is currently achieving far below the median level with respect to his/her peers. As a final example, we look at Students ID #731 and ID #734. Student ID #731 is not using his/her full propensity to achieve because the student is only performing in the 20th percentile in comparison with his/her peers. This student might fall below the teacher's radar for intervention since he/she has already achieved the required proficiency. On the other hand, Student ID #734 has reached proficiency and is performing in the 60th percentile when compared with his/her peers.
Table 6.3 also indicates that, for the speaking modality, there is no increase in the entry scores extending from the 70th percentile to the 90th percentile for Student ID #127 and ID #133. For these students, 470 is the entry score at the 70th through the 90th percentiles. One of the reasons for such an occurrence is that there are not enough history-based scores beyond the 70th percentile for evaluating the scores needed for the 80th and the 90th percentiles. In other words, the distribution of historical scores is badly skewed in covering the range of scores, especially at the upper end of the scoring spectrum. Hence, a score of 470 carries the percentile ranks from 70 to 90. These types of evaluations for precise ranking are necessary when evaluating students' performances on high-stakes examinations. In terms of formative assessment, however, while it is good to have greater discrimination in assigning an SGP rank, it is not crucial, because the intensity of intervention techniques can quickly change based on student performance in ongoing classroom assessments, especially when student performances are measured by the differences in a few consecutive percentile ranks.

Student performance analyses based on total and modality scores

Table 6.4 provides examples of each student's total score vis-à-vis the modality scores for deciphering areas of weakness and strength. SGP analyses can identify students who are proficient or nearly proficient with respect to their total score but who do not perform well in mainstream classrooms because they lack language proficiency in one or more of the modalities.

Table 6.4  Comparing each student's predicted growth percentiles across total and modality scores

Rows: Student ID 85, 40, and 484. For each of Reading, Writing, Speaking, Listening, and the Total score, the columns report the student's Year 2009 score, the SGP, and the estimate of growth needed for proficiency at the respective cut score (Reading = 391, Writing = 411, Speaking = 382, Listening = 402, Total = 390).

418 50th Achieved Below 30th Over Achieved 387 80th Achieved 518 30th Achieved 524 Percentile Percentile target 10th target 50th Percentile target Percentile target score for Percentile Percentile score for score for score for Prof. Prof. Prof. Prof. 80th 90th 382 50th 80th 390 40th 80th 420 60th Achieved 511 Beyond Achieved 388 Percentile Percentile target 90th Percentile Percentile Percentile Percentile Percentile target Percentile score for score for Prof. Prof. 30th 80th 364 60th Beyond 470 70th Achieved 371 382 70th 80th 369 70th Beyond Percentile Percentile Percentile 90th Percentile target Percentile Percentile Percentile 90th Percentile score for Percentile Prof.

426

Note: 1. Prof. = proficiency; 2. Std. = student

Deciphering who is proficient can be easily accomplished by examining the actual scores against the target score. However, as stated earlier, this type of information would not include how much effort is needed to achieve a required level of proficiency. Additional scrutiny of the entry score needed for proficiency at each modality percentile can inform us of the percentile growth needed in the modalities, not only to achieve overall English language proficiency but also to perform adequately in the respective modalities that are likely important for functioning adequately in regular classrooms.
In Table 6.4, while Student ID #85 has reached total language proficiency (Year 2009 score = 418), he/she has not reached the required level of proficiency in listening, which is one of the key components of academic success according to Seo et al. (2016). Overall, the student seems to be underperforming with respect to his/her peers on the total test as well as across all modalities except writing. However, he/she should not have much trouble achieving proficiency in this modality with extra help, because of his/her steep trajectory in listening (30th percentile). By the same token, the student should be encouraged to do much better in reading, where he/she has achieved proficiency but is only performing in the 30th percentile when compared with his/her peers.
On the other hand, Student ID #40 has not reached the required total proficiency, even though he/she is proficient in speaking and listening. However, this student has not missed the total proficiency by much (388, which is very close to the proficiency cut score of 390). Nevertheless, the fact that this student grew only to a level below the 40th percentile in comparison with his/her academic peers in writing is troubling, because it shows that the student has not improved his/her writing much in recent years. This student needs intervention with respect to writing for the uphill path he/she needs to take to achieve the required writing proficiency in the 80th percentile from his/her current position in the 40th percentile.
Among the three examples given in Table 6.4, Student ID #484 seems to be in the most trouble in achieving his/her academic proficiency in language acquisition. This student is not only performing very poorly with respect to his/her peers, but the student's trajectory of past scores is so flat that he/she has to perform in very high percentiles to reach the required proficiency in the total and all modalities except speaking. Based on academic history, the student may not be able to overcome the barrier of scoring in the 90th percentile even with his/her best effort at present. For this student, intervention such as counseling, besides "extra" tutorials, may be necessary to increase the slope of his/her trajectory (Taherbhai et al., 2014).

Discussion and conclusion

In this chapter, we have shown the advantages of using SGPs as a tool for assessing growth in English language learning as well as providing a norm-referenced reflection that indicates "how much" the student should grow based on a predetermined expectation of acceptable growth. This type of assessment is very

useful, especially when the purpose is to guide students through the language learning experience, because it creates realistic expectations for the student based on his/her potential to achieve. Personalized learning helps the teacher avoid holding unrealistic expectations for the student while, at the same time, helping those who can achieve much more than their classroom performance suggests. While ELLs' overall progress in ELP can be assessed simply by observing the relative target score employed for passing the ELP examination, an evaluation of ELLs' performance on the four modalities becomes paramount when assessing areas of weakness and strength within the ELP construct for each student (Taherbhai et al., 2014). Because of the use of compensatory scoring by many ELP assessments in the United States, students' proficiency in certain modalities could be so low that it could result in poor academic performance in regular English language learning classes even though the student might have achieved proficiency when measured by the total score on an ELP examination (Menken & Kleyn, 2009; Seo & Taherbhai, 2012; Taherbhai et al., 2014). As Betebenner (2009) points out, SGPs focus on quantifying changes through a norm-referenced identification of achievement instead of focusing on the magnitude of learning. In analyzing the information provided in Tables 6.3 and 6.4, it becomes evident from the use of SGPs that achieving a given level of proficiency can require differential effort from two students with the same current score (Taherbhai et al., 2014). When knowledge of comparative growth is lacking, instructional efforts are applied without a fair reckoning of students' growth potential. As such, students are often instructed based on an average performance criterion that lumps them together in an undifferentiated manner, instead of receiving remedial help that varies for each student based on different levels of achievement across the years (Seo et al., 2015; Taherbhai et al., 2014). As stated earlier, when the SGP model is applied as a formative assessment tool, students can benefit from individualized, tailor-made remedial activities. Such understanding of student requirements would likely allow for the allocation of scarce instructional resources in the most productive manner and avoid setting some students up for failure. Furthermore, providing teachers with their students' differential propensity for achieving proficiency may assist in mollifying those teachers who claim that it is unfair to be held accountable based on a single measure of progress for all students (Seo et al., 2015; Taherbhai et al., 2014). Much as in other growth models, students with missing prior scores would have to be eliminated from the SGP analyses. Other SGP models could be created for students with different consecutive numbers of prior scores, provided enough responses covering the range of scores are available for such analyses (Grady, Lewis, & Gao, 2010). It should be kept in mind that while more than one prior score is desirable for more accurate SGP estimates, the net effect on formative information is that it serves as a guideline for teachers' remedial actions. Any help in that direction is always useful as a starting point for intervention techniques. While including estimates for missing scores is a possibility

(Sanders, 2006), imputing missing values carries the risk of larger sampling errors and is technically quite challenging to incorporate in formative assessments. Projecting how much language learners need to grow, using SGP analyses, is based on holding the growth pattern constant. But like any dynamic situation, students' growth trajectories are likely to change; therefore, it behooves educators to monitor student progress continuously across the years, or earlier if possible. Using the SGP method for comparison may seem like a daunting task for some educators, particularly for those teachers whose ELLs' growth requirements for achieving proficiency may seem unattainable. Nevertheless, this concern does not undercut the useful information the method provides, namely, the desirable property of knowing how much growth is required. On the contrary, the SGP method for comparison can facilitate better allocation of finite resources for those students who are having problems, allowing educators to revisit these problems with creative teaching methods, one-on-one tutorials, motivational strategies, and parental involvement that provide the right type of intervention on a differential basis for each student.

References

Abedi, J. (2008). Classification system for English language learners: Issues and recommendations. Educational Measurement: Issues and Practice, 27(3), 17–31.
Betebenner, D. W. (2007). Estimation of student growth percentiles for the Colorado student assessment program. Retrieved from the Colorado Department of Education website: https://www.researchgate.net/profile/Damian_Betebenner/publication/228822935_Estimation_of_student_growth_percentiles_for_the_Colorado_Student_Assessment_Program/links/582d6ae908ae138f1bffa692/Estimation-of-student-growth-percentiles-for-the-Colorado-Student-Assessment-Program.pdf
Betebenner, D. W. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42–51.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5–31.
Braun, H. I. (2005). Using student progress to evaluate teachers: A primer on value-added models. Retrieved from Educational Testing Service website: http://www.ets.org/Media/Research/pdf/PICVAM.pdf
Choi, J., Suh, H., & Ankenmann, R. (2017). Estimation of student growth percentiles using SAS procedures. Retrieved from SAS support website: http://support.sas.com/resources/papers/proceedings17/0986-2017-poster.pdf
Cizek, G. J., & Bunch, M. B. (2006). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage Publications.
Dodge, J. (2009). What are formative assessments and why should we use them? Retrieved from http://www.scholastic.com/teachers/article/what-are-formative-assessments-and-why-should-we-use-them
Fu, L., & Wu, C.-s. P. (2017). SAS macro calculating goodness-of-fit statistics for quantile regression. Retrieved from https://statcompute.wordpress.com/2017/04/15/sas-macro-calculating-goodness-of-fit-statistics-for-quantile-regression/

Grady, M., Lewis, D., & Gao, F. (2010). The effect of sample size on student growth percentiles. Paper presented at the 2010 annual meeting of the National Council on Measurement in Education, Denver, CO.
Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Needham Heights, MA: Allyn and Bacon Publishing.
Harrell, F. E. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag New York.
Koenker, R. (2005). Quantile regression: Econometric society monographs. New York: Cambridge University Press.
LaFlair, G. T., Isbell, D., May, L. D. N., Gutierrez Arvizu, M. N., & Jamieson, J. (2017). Equating in small-scale language testing programs. Language Testing, 34(1), 127–144.
Lissitz, B., & Doran, H. (2009). Modeling growth for accountability and program evaluation: An introduction for Wisconsin educators. Retrieved from University of Maryland, Maryland Assessment Research Center for Education Success website: http://marces.org/completed/Lissitz%20(2009)%20Modeling%20Growth%20for%20Accountability.pdf
McGrane, M. (2009). Colorado's growth model. Retrieved from the Colorado Department of Education website: http://www.ksde.org/Portals/0/Research%20and%20Evaluation/Colorados_Growth_Model.pdf
Menken, K., & Kleyn, T. (2009). The difficult road for long-term English learners. Educational Leadership, 66(7), 26–29.
Nichols, P. D., Meyers, J. L., & Burling, K. S. (2009). A framework for evaluating and planning assessments intended to improve student achievement. Educational Measurement: Issues and Practice, 28(3), 14–23.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5–13.
Poehner, M. E., & Lantolf, J. P. (2005). Dynamic assessment in the language classroom. Language Teaching Research, 9(3), 233–265.
Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic assessments. Educational Measurement: Issues and Practice, 29(3), 25–38.
Saida, C. (2017). Creating a common scale by post-hoc IRT equating to investigate the effects of the new national educational policy in Japan. Language Assessment Quarterly, 14(3), 257–273.
Sanders, W. L. (2006). Comparisons among various educational assessment value-added models. Retrieved from SAS website: http://www.sas.com/govedu/edu/services/vaconferencepaper.pdf
Seo, D., McGrane, J., & Taherbhai, H. (2015). The role of student growth percentiles in monitoring learning and predicting learning outcomes. Educational Assessment, 20(2), 151–163.
Seo, D., & Taherbhai, H. (2012). Student growth percentiles for formative purposes and their application in the examination of English language learners (ELL). Paper presented at the 91st annual meeting of the California Educational Research Association, Monterey, CA.
Seo, D., Taherbhai, H., & Frantz, R. (2016). Psychometric evaluation and discussions of English language learners' learning strategies in the listening domain. International Journal of Listening, 30(1–2), 47–66.
Taherbhai, H., Seo, D., & O'Malley, K. (2014). Formative information using student growth percentiles for the quantification of English language learners' progress in language acquisition. Applied Measurement in Education, 27(3), 196–213.

Tognolini, J., & Stanley, G. (2011). A standards perspective on the relationship between formative and summative assessment. In P. Powell-Davies (Ed.), New directions: Assessment and evaluation (pp. 25–31). London, UK: British Council.
Van Roekel, D. (2008). English language learners face unique challenges. Retrieved from NEA Education Policy and Practice Department website: http://educationvotes.nea.org/wp-content/uploads/2010/05/ELL.pdf
Wang, M., Chen, K., & Welch, C. (2012). Evaluating college readiness for English language learners and Hispanic and Asian students. Retrieved from http://itsnt962.iowa.uiowa.edu/ia/documents/Evaluating-College-Readiness-for-English-Language-Learners-and-Hispanic-and-Asian-Students.pdf
WIDA Consortium. (2013). ACCESS for ELLs: Interpretative guide for score reports, Spring 2013. Retrieved from http://www.wrsdcurriculum.net/ACCESSInterpretiveGuide2013_1_.pdf
Young, M. J., & Tong, Y. (2016). Vertical scales. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 450–466). New York: Taylor and Francis.


Appendix A
SAS program codes for estimating SGPs

/* For total scores of language learners. The same code can be modified
   for the language learning modality scores of the students. */
data Shan_T;
  set select_1_select;
run;

/* Use of splines for smoothing the curve obtained from the conditioning variables */
proc transreg data=Shan_T;
  model identity(TTT_SS_09) = spline(Year_05) spline(Year_06)
                              spline(Year_07) spline(Year_08);
  output out=a2 predicted;
run;

/* Create the new data set to account for the inclusion of the spline function */
data File2_T;
  merge Shan_T a2;
run;

/* Use quantile regression with predetermined percentiles to obtain
   coefficients and entry points */
proc quantreg data=File2_T;
  model TTT_SS_09 = TTT_SS_05 TTT_SS_06 TTT_SS_07 TTT_SS_08
        / quantile=.10 .20 .30 .40 .50 .60 .70 .80 .90;
  id id;
  output out=outp1_T pred=p1;
run;

/* Change the variable names using the following SAS commands (optional) */
data shano_T;
  set outp1_T;
  P10_T = round(P11, 1);
  P20_T = round(P12, 1);
  P30_T = round(P13, 1);
  P40_T = round(P14, 1);
  P50_T = round(P15, 1);
  P60_T = round(P16, 1);
  P70_T = round(P17, 1);
  P80_T = round(P18, 1);
  P90_T = round(P19, 1);
run;
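For readers who prefer R, the same two steps (spline smoothing of the conditioning variables followed by quantile regression at the predetermined percentiles) can be sketched with the quantreg and splines packages. This is a minimal illustration only: the data frame scores, its variable names, and the simulated values are hypothetical stand-ins for the score file used above, not part of the original program.

# Minimal R sketch of the SGP steps above (hypothetical data and variable names)
library(quantreg)   # quantile regression
library(splines)    # B-spline bases for smoothing the conditioning variables

set.seed(1)
scores <- as.data.frame(matrix(round(rnorm(500 * 5, mean = 400, sd = 40)), ncol = 5))
names(scores) <- c("TTT_SS_05", "TTT_SS_06", "TTT_SS_07", "TTT_SS_08", "TTT_SS_09")

# Quantile regression of the 2009 score on spline-smoothed prior-year scores
taus <- seq(0.1, 0.9, by = 0.1)
fit  <- rq(TTT_SS_09 ~ bs(TTT_SS_05, df = 4) + bs(TTT_SS_06, df = 4) +
                       bs(TTT_SS_07, df = 4) + bs(TTT_SS_08, df = 4),
           tau = taus, data = scores)

# One column of conditional predicted scores per percentile (10th to 90th)
pred <- predict(fit)

# A student's SGP is (roughly) the highest percentile whose predicted score
# does not exceed the observed 2009 score
sgp <- 10 * rowSums(pred <= scores$TTT_SS_09)
head(sgp)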

7

Multilevel modeling to examine sources of variability in second language test scores

Yo In'nami and Khaled Barkaoui

Introduction

A major question in language assessment research concerns test validation, that is, what variables contribute to variability in test scores, and the extent of that variability. Examining these issues helps to draw more accurate (or more valid) inferences from test scores about examinees' abilities and to construct a validity argument (see Kane, 2006). Ideally, construct-relevant variables are the major contributors to such variability, while the contribution of construct-irrelevant variables to test score variability should be negligible. Identifying the contributing variables and estimating the size of their impact on test scores have traditionally been examined using data analysis techniques such as analysis of variance (ANOVA), multiple regression analysis, factor analysis, item response theory, and generalizability theory (see relevant chapters in this volume). These methods are useful in separating score variance into components and estimating the extent to which test score variance is explained by construct-relevant and construct-irrelevant factors. An equally important question in language assessment research is how performance on one assessment tool (i.e., a test) is related to performance on another test of the same or a different construct. If performance levels on Tests A and B are closely related, it would be possible to predict performance on Test A from the results of Test B. For example, one could predict writing proficiency from reading test scores, vocabulary depth from vocabulary size, and university grade point average (GPA) from high school GPA. These predictions typically have been examined using ANOVA, multiple regression analysis, and factor analysis (for more information on these methods, see Tabachnick & Fidell, 2014). One issue that is not explicitly addressed in the statistical methods listed above is the hierarchical or nested structure of test data, where each observation or response is not independent of the others (e.g., Hox, 2010). Examples of different forms of hierarchical structure are shown in Figure 7.1a–e. Nesting can occur with individuals (Figure 7.1a), with test items (Figure 7.1b), with scores or ratings (Figure 7.1c and d), and with repeated test times (Figure 7.1e). Each of these is defined as a "Level 1" variable. Classes (Figure 7.1a),

sections (Figure 7.1b), raters (Figure 7.1c and d), and students (Figure 7.1e) are defined as "Level 2" variables that are situated above the Level 1 variables. For example, when studies are conducted at schools, students are nested within classes/teachers, as shown in Figure 7.1a. This easily extends to cases where classes/teachers are situated within schools; schools are situated within school districts; school districts are situated within states/regions; and states/regions are situated within countries. Another example is reading (and listening) tests, where items are nested within sections, as shown in Figure 7.1b. Another typical example is essay scores nested within raters, as shown in Figure 7.1c and d. Finally, if students are repeatedly tested (time 1, time 2, time 3, …) as in longitudinal studies, the same students produce a series of responses/scores, and it can be said that responses/scores are nested within students, as shown in Figure 7.1e.

Figure 7.1a–e  Examples of nested data structures:
a) Level 2 (Class); Level 1 (Score per student)
b) Level 2 (Section); Level 1 (Difficulty per item)
c) Level 2 (Rater); Level 1 (Essay rating per student)
d) Level 2 (Prompt); Level 1 (Essay rating per student)
e) Level 2 (Student); Level 1 (Response per time point)

In all these examples, observations and responses at the lower level (i.e., Level 1) are not independent. If the dependencies are not modeled, the standard error (i.e., the inaccuracy) of a parameter estimate tends to be smaller, which increases the probability of finding statistically significant results by chance when they are not present (i.e., a Type I error) (e.g., Hox, 2010). These dependencies among observations/responses are common in language testing studies, but they cannot be modeled using traditional statistical methods. Multilevel modeling (MLM), also called hierarchical linear modeling, mixed effects modeling, or random effects modeling, is a powerful tool for identifying and estimating the sources of variability in nested test score data (e.g., Barkaoui, 2013; Linck & Cunnings, 2015; Luke, 2004; McCoach, 2010). Although MLM is a popular technique in educational research, it has rarely been used in language testing and validation research, except by Barkaoui (2010, 2013, 2014, 2015, 2016), Cho, Rijmen, and Novák (2013), Feast (2002), and Koizumi, In'nami, Asano, and Agawa (2016). Recent applications of MLM in applied linguistics studies include Murakami (2016) and Sonbul and Schmitt (2013). This chapter discusses how MLM can be used to examine sources of variability in second language (L2) test scores in cross-sectional designs. It begins with a review of MLM in L2 testing research, discusses the logic behind MLM, demonstrates the use of MLM by analyzing a set of L2 vocabulary test scores while describing the rationale and procedures for conducting MLM, and discusses further considerations when using MLM. Readers are also encouraged to consult Barkaoui (2010, 2013), which offer an accessible and comprehensive introduction to MLM in language testing. For MLM in longitudinal language-testing research, see Barkaoui and In'nami (Chapter 8, Volume II). For MLM in L2 acquisition research using the lme4 package in the R statistics program, see Linck and Cunnings (2015). For more recent and advanced MLM topics, see Harring, Stapleton, and Beretvas (2016).
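The inflation of Type I error can be illustrated with a small simulation. The sketch below is not from the chapter's data; it simply generates clustered scores with no true effect of a school-level predictor and shows that an ordinary regression that ignores the clustering rejects the null hypothesis far more often than the nominal 5%.

# Illustrative simulation: clustered data, no true effect of the Level 2 predictor
set.seed(2)
p_values <- replicate(1000, {
  school <- rep(1:15, each = 30)                      # 15 schools, 30 students each
  x      <- rep(rnorm(15), each = 30)                 # a school-level predictor (no real effect)
  y      <- rep(rnorm(15), each = 30) + rnorm(450)    # outcome with school-level clustering
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]    # p-value from ordinary regression
})
mean(p_values < .05)   # well above .05 because the clustering is ignored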

MLM in language testing

As described earlier, MLM has rarely been used in language testing and validation research, except by Barkaoui (2010, 2013, 2014, 2015, 2016), Cho et al. (2013), Feast (2002), and Koizumi et al. (2016). Of these, Barkaoui (2010) and Cho et al. (2013) use cross-sectional designs, as in the current chapter, and are reviewed here. Barkaoui (2010) demonstrated how MLM can be applied to examine variability in essay holistic scores due to rating criteria (e.g., organization, argumentation, linguistic accuracy [Level 1]) and rater experience (i.e., novice or experienced [Level 2]). The data structure is the same as in Figure 7.1c, with rating criteria as the Level 1 variable and rater experience as the Level 2 variable. In other words, the predictor variables were rating criteria (Level 1) and rater experience (Level 2), and the outcome variable was the holistic score (Level 1). The results showed that the experienced raters were harsher in their holistic

ratings and emphasized linguistic accuracy; the novice raters, on the other hand, gave more weight to argumentation and showed more variability in their holistic ratings. These results suggest that both rating criteria and rater experience need to be considered in writing research and that MLM is useful in estimating the contribution of these variables to variability in essay holistic scores. Depending on the purpose of the study, other variables such as the gender, age, and L1 background of the raters can also be added as Level 2 predictors. Cho et al. (2013) examined how TOEFL iBT integrated writing scores (Level 1) were related to prompt characteristics (Level 2), as shown in Figure 7.1d. Prompt characteristics refer to native speakers' perceptions of task difficulty measured using a questionnaire. Examples of prompt characteristics included (a) distinctiveness of the ideas presented in a lecture, (b) difficulty summarizing the ideas in a lecture, and (c) difficulty understanding the ideas in a reading passage. It should be noted that the questionnaire items included those about lectures ([a] and [b]) and those about a passage ([c]). The results showed that both (a) and (c) were significant predictors explaining variation in the writing scores—an unexpected finding, since the researchers assumed that prompt difficulty concerned the features of the lecture, not those of the passage. In sum, these applications of MLM in Barkaoui (2010) and Cho et al. (2013) suggest how MLM is useful with these types of design.

The logic behind MLM

As introduced earlier, MLM is a statistical method for modeling data with hierarchical or nested structures. Observations and responses in such structures are not independent. If they are not properly modeled, the results could lead to Type I errors. Regression is a special case of MLM (Raudenbush & Bryk, 2002; see Chapter 10, Volume I, for an introduction to regression). Regression is modeled as (Equation 7.1):

Yi = β0 + β1*(Xi) + ri,  (7.1)

where β0 is an intercept, β1 is a slope, ri is an error term (or a residual error), and the subscript i indicates a student. The intercept and slope are the same across students, but the test scores Y and X and the error vary across students. An intercept is the expected score of Y when X is zero, and a slope is the expected change in Y in relation to a unit change in X. Suppose we administer two vocabulary tests to students from 100 schools: one test of vocabulary size (how many words one knows) and one of depth of vocabulary knowledge (how much one knows about the words one knows). Suppose further that we are interested in predicting scores on the vocabulary depth test from the vocabulary size test scores. The relationship between these two tests is linear, and the vocabulary depth score is predicted by the

following equation: −0.584 + 0.3*(SIZE). Because the tests were conducted at 100 schools, a regression equation can be developed for each school as follows:

For School 1 students, Yi = 5.02 + 2.21*(SIZEi) + 4.25;
For School 2 students, Yi = 4.25 + 1.24*(SIZEi) + 3.76;
For School 3 students, Yi = 6.74 + 2.18*(SIZEi) + 2.10; and
...
For School 100 students, Yi = 4.22 + 1.11*(SIZEi) + 6.60.

Developing an equation for each school is not tractable. An MLM approach to this issue is to model all schools by introducing another subscript, j, as follows:

Yij = β0j + β1j*(SIZEij) + rij,  (7.2)

which replaces the 100 equations with a single equation and expresses interschool variability by modeling intercepts and slopes as fixed or random coefficients. These two key concepts, fixed coefficients (or fixed effects) and random coefficients (or random effects), need to be introduced to understand how MLM works (e.g., Luke, 2004; McCoach, 2010; Raudenbush & Bryk, 2002). Fixed coefficients are regression coefficients at higher levels that are fixed (i.e., they take the same value) and do not vary across data units at that particular level. This indicates, for example, that effects of schools on student vocabulary scores could hold true across schools. In contrast, random coefficients refer to coefficients at higher levels that are not fixed (i.e., they take different values) and vary across units at that particular level. For example, effects of schools on student scores could be stronger for School 1 than for Schools 2 and 3. As another example, effects of raters on essay ratings, as shown in Figure 7.1c, could hold true across raters and would therefore be fixed effects. On the other hand, effects of raters on essay ratings could be stronger for Rater 1 than for Raters 2 and 3 because Rater 1 mistakenly assigned similar scores to all essays. In this case, the effect would be random. Recall that the individual students' scores are at Level 1 and are nested within schools at Level 2. By introducing fixed and random coefficients, we can allow the intercept (β0j) and slope (β1j) in Equation 7.2 to vary across schools, as follows:

Level 1 Model:  Yij = β0j + β1j*(SIZEij) + rij  (7.2)
Level 2 Model:  β0j = γ00 + μ0j  (7.3)
                β1j = γ10 + μ1j  (7.4)
Combined:       Yij = γ00 + γ10*(SIZEij) + μ0j + μ1j*(SIZEij) + rij  (7.5)

In Equation 7.2, the vocabulary depth test score of student i (the letter i as in individual) in school j (the letter j, which comes after the letter i), Yij, is predicted by the mean score for vocabulary depth for school j (the intercept β0j), the expected change in the vocabulary depth test scores in relation to a unit change in the vocabulary size test scores (the slope β1j), and the residual error (rij). In Equations 7.3 and 7.4, the intercept and slope are denoted by γ00 and γ10, respectively, to differentiate them from β0j and β1j; γ00 and γ10 represent the average value of the Level 1 intercept and the average value of the Level 1 slope, respectively; γ00 and γ10 are accompanied by the residual errors (μ0j and μ1j, respectively), which indicate the variation in the intercept and slope across schools (e.g., Hox, 2010; Raudenbush & Bryk, 2002). This removes the need to develop a separate regression equation for each school. That is, by collapsing the equation over schools (rather than having one equation per school), the slopes and intercepts can be replaced with the mean slope and mean intercept across schools, but one also needs to account for the spread about these means (i.e., the residual errors). If the spread is minimal (e.g., the equations per school are very similar), then the slope and intercept coefficients can be modeled as fixed. If not, then they can be modeled as random. Equations 7.2 through 7.4 can be combined into a single equation. This results in Equation 7.5. As noted previously, the coefficients can be either fixed or random, which holds for both the intercepts and the slopes. Therefore, intercepts and slopes can be modeled to be fixed by removing residual errors (i.e., β0j = γ00 instead of β0j = γ00 + μ0j, and β1j = γ10 instead of β1j = γ10 + μ1j), suggesting no variation across schools in vocabulary size, depth, or both. Thus, there are four possible combinations of fixed and random coefficients with intercepts and slopes, and these are shown in Figure 7.2a–d. Figure 7.2a shows a random intercept and a fixed slope, suggesting that the average scores of the vocabulary depth test vary across schools while the predictive power of depth from size is similar across schools. Figure 7.2b shows a fixed intercept and a random slope, suggesting that the average scores of the vocabulary depth test do not vary across schools, but the predictive power of depth from size differs across schools. Differing interschool predictive power suggests that vocabulary depth is predicted well by vocabulary size in some schools but less so in other schools. Figure 7.2c shows a random intercept and a random slope, suggesting that the average scores of the vocabulary depth test vary across schools and so does the predictive power of depth from size across schools. Finally, Figure 7.2d shows a fixed intercept and a fixed slope, suggesting that both the average scores of the vocabulary depth test and the predictive power of depth from size are similar across schools. This conceptually equals ordinary regression. In sum, MLM allows for considering the hierarchical structure of data by introducing a higher level where intercepts and slopes are modeled to be fixed or random.

Figure 7.2a–d  Fixed and random coefficients with intercepts and slopes at Level 2:
a) Random intercept and fixed slope
b) Fixed intercept and random slope
c) Random intercept and random slope
d) Fixed intercept and fixed slope
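In software, these four combinations correspond to different random-effects specifications. The sketch below is illustrative only: it simulates a hypothetical data set (the objects d, depth, size, and school, and all numeric values, are invented for the example) and shows how each combination could be written with the lme4 package in R, which the chapter mentions as an alternative to HLM7.

# Hypothetical data: students nested within schools (values are arbitrary)
library(lme4)
set.seed(1)
n_schools <- 100; n_students <- 30
d <- data.frame(school = factor(rep(1:n_schools, each = n_students)))
d$size  <- round(rnorm(nrow(d), mean = 30, sd = 14))
school_effect <- rep(rnorm(n_schools, sd = 3), each = n_students)
d$depth <- round(7 + 0.3 * (d$size - 30) + school_effect + rnorm(nrow(d), sd = 2))

# The four combinations of fixed and random intercepts and slopes (Figure 7.2a-d)
m_a <- lmer(depth ~ size + (1 | school), data = d)          # a) random intercept, fixed slope
m_b <- lmer(depth ~ size + (0 + size | school), data = d)   # b) fixed intercept, random slope
m_c <- lmer(depth ~ size + (1 + size | school), data = d)   # c) random intercept and random slope
m_d <- lm(depth ~ size, data = d)                           # d) fixed intercept and fixed slope (ordinary regression)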

Sample study: MLM to examine relationship of vocabulary to speaking ability

To illustrate how MLM can be applied to language testing data, half of the data from Koizumi and In'nami (2013) were randomly selected and analyzed using MLM. This study originally examined to what extent L2 speaking ability could be explained by vocabulary breadth and depth. Among the data, we focused on the relationship between vocabulary size and depth for the sake of simplicity and illustrative purposes, while excluding speaking. The vocabulary test was designed to measure vocabulary size and depth in a decontextualized format. The vocabulary size test had 78 items that were randomly selected from the 3,000 most frequent lemmas in the JACET8000 (Japan Association of College English Teachers Basic Word Revision Committee, 2003), a corpus-based 8,000-word list aimed at Japanese learners of English. The vocabulary depth test consisted of 20 items to measure derivation, 17 items to measure antonyms, and 18 items to measure collocation. Of these, only derivation was included in the current analyses. Sample items are shown in Table 7.1.

Table 7.1  Sample items from the vocabulary size and depth tests

Size test (78 items): Write the English word that best corresponds to the Japanese word.
  1. 犬 (d             ) [Answer: dog(s)]
Depth test—Derivations (20 items): Change the form of each English word below according to the part of speech provided in [ ]. Write only one word. Do not write words with -ing or -ed.
  12. emphasis [Verb: to do the action of ∼] (             ) [Answer: emphasize (emphasise)]

Both tests were dichotomously scored based on predetermined, author-made criteria. The tests had high reliability (Cronbach's alpha = .96 for the size test and .93 for the derivation test). The size test had a larger number of items (78 items) than the depth (derivation) test (20 items). Although the number of items differed across the measures, the high reliability indicates that they represented vocabulary size and depth very well in the current study. The participants were 482 Japanese students at 15 junior and senior high schools who had been studying English as a foreign language for approximately three to six years. Of the 15 schools, 8 were public (4 junior high schools [n = 109 students] and 4 senior high schools [n = 197]) and 7 were private (3 junior high schools [n = 64] and 4 senior high schools [n = 112]). The participants'

proficiency ranged from "below A1" to B1 levels of the Common European Framework of Reference for Languages (CEFR), as judged by their self-reported grades on the EIKEN test (Eiken Foundation of Japan, 2017). We reported the participants' proficiency range on the CEFR—a widely used standard for describing language ability—as doing so would help readers understand the proficiency level of the current participants. Proficiency was just demographic information about the participants and was not used in the analyses. We acknowledge that the CEFR was developed in Europe and not developed for Asian countries. For the purpose of this study, two school types were compared (public, private). Three research questions were examined:

i   Does vocabulary size relate to vocabulary depth (knowledge about derivations)?
ii  Does derivation knowledge vary across school type (i.e., public or private)?
iii Does the relationship between vocabulary size and derivations vary across school type?

To address the research questions for this study, the vocabulary size and depth/derivation scores per student are considered Level 1 variables and the type of school is considered a Level 2 variable. These research questions correspond to the three types of relationships that can be examined using MLM (e.g., Mathieu, Aguinis, Culpepper, & Chen, 2012). The first type is the relationship between predictors and outcome variables at lower levels. In the current study, vocabulary size and derivations are both lower-level variables (called Level 1 predictors in MLM), and their relationship is analyzed in research question 1. Second, one may be interested in a cross-level direct effect: how higher-level predictors relate to a lower-level outcome variable. For example, the degree of knowledge about derivations students possess may differ between public and private schools. This is addressed in research question 2. We compare public and private schools because we noticed that public schools in Japan (or in our data) tend (or tended) to put more emphasis on and allocate more time to English instruction than do the private schools. Consequently, it was expected that students in public schools would achieve higher scores on the derivations test. Finally, MLM allows for investigations into a cross-level interaction effect between lower-level variables (i.e., vocabulary size and derivation) and a higher-level variable (i.e., school). In other words, question 3 asks whether the relationship between lower-level variables varies according to (or is moderated by) a higher-level variable. Such an effect is often a topic of great interest because it suggests the presence of moderator variables that could explain differing relationships among lower-level variables. In sum, MLM considers the relationship between predictors and outcome variables at lower levels, between higher-level predictors and a lower-level outcome variable (cross-level direct effect), and between predictors and outcome variables at lower levels in relation to higher-level predictors (cross-level interaction effect).


Data analysis

Recall that 482 students at 15 high schools took the vocabulary test. This indicates that students are nested within schools. More specifically, students are Level 1, while schools are Level 2. The Level 1 model included vocabulary size scores and derivation scores. The Level 2 model included schools coded according to school type (i.e., public [0] or private [1]). These relationships are presented in Figure 7.3. MLM models usually are developed and tested step by step, starting with the simplest model, with later models becoming more and more complex as more predictors are added (e.g., Barkaoui, 2013; Hox, 2002; Luke, 2004). Five models were examined. The simplest model (Model 1) includes no predictors at any level. The model divides the variation in the outcome (i.e., derivation scores) into Level 1 and Level 2 to examine whether enough variation exists to warrant further investigation. As will be shown here, variation in the outcome was found to be significant at both Levels 1 and 2. Accordingly, Model 2 added vocabulary size as a Level 1 predictor to examine the impact of vocabulary size on derivation scores. Similarly, Model 3 included vocabulary size as a Level 1 predictor. The only difference between the two models is that the impact of vocabulary size scores on derivation scores was assumed not to vary significantly across schools in Model 2, whereas it was assumed to vary significantly across schools in Model 3. Model 4 adds a Level 2, school-level predictor (school type) to examine a cross-level direct effect, namely, whether a higher-level predictor (i.e., school type) relates to a lower-level outcome variable (i.e., derivations). Finally, the fifth model (Model 5) uses the same Level 2 predictor, school type, to examine a cross-level interaction effect, namely, whether the relationship between lower-level variables (i.e., vocabulary size and derivation) varies according to or is moderated by the higher-level variable, school type. It should be noted that Models 1 through 5 all have a two-level structure, although the variables therein are specified differently, as explained below. Models were estimated using full information maximum likelihood (FML) and were compared using chi-square difference tests, the Akaike information criterion (AIC), and the proportion of variance explained (Hox, 2010; Singer & Willett, 2003). Chi-square difference tests are used to compare two models that are nested. Models are nested when one model can be specified by constraining parameters in the other model. We used the free, student version of the software HLM7 (Raudenbush, Bryk, Cheong, Congdon, & du Toit, 2016) to analyze the data.
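Although the chapter's analyses were run in HLM7, the same five-model sequence can be sketched with lme4 in R. Everything below is illustrative: the data frame vocab, its variable names, and the simulated values are hypothetical stand-ins for the study's data, and REML = FALSE requests full information maximum likelihood so that models differing in fixed effects can be compared.

# Hypothetical data mirroring the design: 482 students nested in 15 schools
library(lme4)
set.seed(3)
n_per_school <- c(rep(38, 8), rep(25, 7))                      # arbitrary school sizes
vocab <- data.frame(school     = factor(rep(1:15, times = n_per_school)),
                    schooltype = rep(c(rep(0, 8), rep(1, 7)), times = n_per_school))
vocab$size       <- pmax(1, round(rnorm(nrow(vocab), 30, 14)))
vocab$derivation <- pmin(20, pmax(0, round(4 + 0.27 * vocab$size +
                         rep(rnorm(15, sd = 3), times = n_per_school) +
                         rnorm(nrow(vocab), sd = 2))))

# Models 1-5, estimated with FML (REML = FALSE)
m1 <- lmer(derivation ~ 1 + (1 | school), data = vocab, REML = FALSE)                      # Model 1: null
m2 <- lmer(derivation ~ size + (1 | school), data = vocab, REML = FALSE)                   # Model 2: random intercept, fixed slope
m3 <- lmer(derivation ~ size + (1 + size | school), data = vocab, REML = FALSE)            # Model 3: random intercept and slope
m4 <- lmer(derivation ~ size + schooltype + (1 | school), data = vocab, REML = FALSE)      # Model 4: cross-level direct effect
m5 <- lmer(derivation ~ size + size:schooltype + (1 | school), data = vocab, REML = FALSE) # Model 5: cross-level interaction

anova(m1, m2)   # chi-square difference test; AIC is also reported
anova(m2, m3)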

Figure 7.3  Schematic graph of the nested structure of the current data: students (Level 1) nested within schools (Level 2). Predictor variables: vocabulary size test scores (Level 1) and school type (public or private; Level 2). Outcome variable: derivation test scores (Level 1).

Table 7.2  Descriptive statistics

                                    N     M      SD     Min   Max
Level 1
  Vocabulary size                   482   29.70  13.86   1     71
  Knowledge about derivation        482    8.36   4.98   0     20
Level 2: School                      15
  Public schools                      8
    Vocabulary size                 306   31.77  15.28   1     71
    Knowledge about derivation      306    8.79   5.48   0     20
  Private schools                     7
    Vocabulary size                 176   26.10  10.01   4     54
    Knowledge about derivation      176    7.60   3.89   0     19

Note: N = number; M = mean; SD = standard deviation

and the proportion of variance explained (Hox, 2010; Singer & Willett, 2003). Chi-square difference tests are used to compare two models that are nested. Models are nested when one model can be specified by constraining parameters in the other model. We used the free, student version of the software HLM7 (Raudenbush, Bryk, Cheong, Congdon, & du Toit, 2016) to analyze the data.

Results Table 7.2 shows descriptive statistics of the scores used for the analysis. Note that the sample size differs between levels. Level 1 had a sample size of 482, which is the number of students. Level 2 had a sample size of 15, which is the number of schools. The number of students in the public schools ranged between 7 and 72 (M = 38.13, SD = 28.09) and that for the private schools between 14 and 44 (M = 25.14, SD = 10.65). The statistics at Level 2 are based on the means of schools (i.e., aggregate) at Level 2.

Step 1: Assess the need for MLM The data were first analyzed to assess whether MLM was needed. MLM is required when data have a hierarchical structure and when responses vary across higher levels. If data are hierarchal but responses vary little across higher levels, there should be little bias in the results from ordinary regression and therefore no need to employ MLM. This variation was tested by constructing a “null model” (also called the “intercept-only model” [Hox, 2010], or the “unconditional model” [Raudenbush & Bryk, 2002]) and is defined as follows: Model 1: Level 1 Model :  Yij = β0j + rij (7.6)

Level 2 Model:  β0j = γ00 + μ0j  (7.3)
Combined:       Yij = γ00 + μ0j + rij  (7.7)

Equation 7.6 means that the derivation score of student i in school j, Yij, is predicted by the mean score for derivation for school j (the intercept β0j) and the residual error (rij). Equation 7.3 means that the mean score for derivation for school j (the intercept β0j) is predicted by the grand-mean score for the derivation test (γ00) and the residual error (μ0j). The grand mean here refers to the average of derivation scores across all 482 participants. The residual error (μ0j) indicates variation in the derivation test score among schools. Equations 7.3 and 7.6 can be combined into a single equation by substituting Equation 7.3 into Equation 7.6. The result is Equation 7.7. Note that no predictors (i.e., size at Level 1 and school type at Level 2) are added in Model 1. Based on the results for Model 1 in Table 7.3, the need for MLM can be examined by calculating variation in responses across higher levels in four ways. First, between-school variability (11.13) was larger than within-school (i.e., between-student) variability (9.82). These variability estimates are used to calculate an intraclass correlation (ICC). The ICC is the proportion of the variance explained by the school level, which was 0.53 (11.13/[11.13 + 9.82]), or 53%. This proportion is large for social science research, where ICC values usually range from .05 to .20 (Peugh, 2010), and it indicates the need to model the school level with MLM. Second, although not shown in Table 7.3, another way to examine the need to use MLM is to calculate the design effect: 1 + ICC*([the average sample size within each cluster] − 1). The design effect for Model 1 was 1 + 0.53*([482/15] − 1) = 17.50. Values over 1 indicate the violation of the assumption of independence of observations and suggest the need to use multilevel models (McCoach & Adelson, 2010). Third, reliability relates to the proportion of between-school variance in the intercept compared with the total variance (i.e., reliability = between-group variance/[between-group variance + error variance] [Raudenbush & Bryk, 2002]). The high reliability (.96) indicates that 96% of the variance in the intercept varied across Level 2 units (i.e., schools) and that the school level needs to be modeled. Fourth, the statistically significant variance of the intercept (11.13) indicates that the mean score varied significantly across schools, χ2 = 832.37, df = 14, p < .001. In sum, the results for the ICC, the design effect, the reliability, and the significance test of the variance of the intercept all indicate the need to model the school level using MLM. To the best of our knowledge, there is no clear guidance on what to do if the four indicators do not all point in the same direction; in that case, it would be sensible to run both ordinary regression and MLM and compare the findings.
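Continuing the illustrative lme4 sketch above (with the hypothetical vocab data frame and the null model m1), the ICC and the design effect can be computed from the estimated variance components as follows; the numbers in the comments refer to the chapter's reported values, not to the simulated data.

# ICC and design effect from the null model (illustrative, using the simulated data above)
vc     <- as.data.frame(VarCorr(m1))
tau00  <- vc$vcov[vc$grp == "school"]       # between-school variance (11.13 in Table 7.3)
sigma2 <- vc$vcov[vc$grp == "Residual"]     # within-school variance (9.82 in Table 7.3)
icc    <- tau00 / (tau00 + sigma2)          # reported as .53 in the chapter
deff   <- 1 + icc * (482 / 15 - 1)          # average cluster size of about 32 students; reported as 17.50
c(icc = icc, design_effect = deff)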

Table 7.3  Multilevel model results

Model 1 = null model; Model 2 = random intercept and fixed slope; Model 3 = random intercept and random slope; Model 4 = cross-level direct effect; Model 5 = cross-level interaction effect. Models 2 and 3 address research question 1, Model 4 addresses research question 2, and Model 5 addresses research question 3. Entered variables: Model 1, derivation; Models 2 and 3, derivation and size; Models 4 and 5, derivation, size, and school type.

                                          Model 1          Model 2          Model 3          Model 4          Model 5
Fixed effects (fixed regression coefficients), coefficient (SE)
Level 1 (n = 482)
  Intercept (γ00)                         7.06*** (0.88)   7.04*** (0.87)   7.04*** (0.87)   7.02*** (1.41)   7.04*** (0.87)
  Size (γ10)—slope                        –                0.27*** (0.01)   0.28*** (0.01)   0.27** (0.01)    0.28*** (0.02)
Level 2 (n = 15)
  School type (γ01)—intercept             –                –                –                0.04 (1.70)      –
  School type (γ11)—slope                 –                –                –                –                0.00 (0.02)
Random effects (variance components)
Level 1 (n = 482)
  Within-school (between-student)
    variance (r)                          9.82             5.15             5.09             5.15             5.15
Level 2 (n = 15)
  Between-school variance (μ0)            11.13            11.22            11.22            11.22            11.22
  Between-school variance (μ1)            –                –                0.00             –                –
  Chi-square (μ0; df)                     832.37*** (14)   1592.75*** (14)  1610.57*** (14)  1595.61*** (13)  1592.77*** (14)
  Chi-square (μ1; df)                     –                –                13.77 (14)       –                –
Intraclass correlation                    .53              .69              .69              .69              .69
Reliability
  Intercept (β0)                          .96              .98              .98              .98              .98
  Size (β1)                               –                –                .18              –                –
Model fit
  Deviance (no. of estimated parameters)  2520.34 (3)      2218.72 (4)      2213.75 (6)      2218.72 (5)      2218.72 (5)
  Model comparison test: chi-square (df)  –                301.62*** (1)b   4.97 (2)c        0 (1)d           0 (1)e
  AICa                                    2526.34          2226.72          2225.75          2228.72          2228.72
  R2y,ŷ                                   –                .18              .18              .18              .18
  R2r (Level 1)                           –                .48              .01              .00              .00
  R2μ0 (Level 2)                          –                .00              .00              .00              .00

Note: SE = standard error. aAkaike information criterion (deviance + 2*number of estimated parameters). bComparison between Models 1 and 2. cComparison between Models 2 and 3. dComparison between Models 2 and 4. eComparison between Models 2 and 5.
*p < .05; **p < .01; ***p < .001


Step 2: Build the level-1 model

To examine research question 1, Models 2 and 3 were tested. Model 2 was tested where vocabulary size test scores were added at Level 1 to examine the impact of vocabulary size on derivation scores. Model 2 is defined as follows:

Model 2:
Level 1 Model:  Yij = β0j + β1j*(SIZEij) + rij  (7.2)
Level 2 Model:  β0j = γ00 + μ0j  (7.3)
                β1j = γ10  (7.8)
Combined:       Yij = γ00 + γ10*(SIZEij) + μ0j + rij  (7.9)

In Equation 7.2, the derivation score of student i in school j, Yij, is predicted by the mean score for derivation for school j (the intercept β0j), the vocabulary size score of student i in school j weighted by the slope β1j, and the residual error (rij). At Level 2, the average score for derivation for school j (the intercept β0j) is predicted by the grand-mean score for the derivation test and the residual error (μ0j). The slope (β1j = γ10) is fixed (i.e., not random) as it lacks the residual error (μ1j), assuming that the effect of vocabulary size scores on derivation scores does not vary significantly across schools. Equations 7.2, 7.3, and 7.8 can be combined into a single equation. This results in Equation 7.9. Table 7.3 shows that the mean derivation score was 7.04, with a slope of 0.27, indicating that a one-point increase in vocabulary size relates to a 0.27-point increase in derivation, on average. The high reliability (.98), again, indicates that 98% of the variance in the intercept varies across Level 2 units and that the school level needs to be modeled. Finally, the statistically significant variance of the intercept (11.22) indicates that the mean score for the derivation test varied significantly across schools, χ2 = 1592.75, df = 14, p < .001. Models can be compared in MLM using chi-square difference tests, the AIC, and the proportion of variance explained. First, a comparison between Models 1 and 2 shows a significant improvement in fit in Model 2, Δχ2 = 301.62, df = 1 (4 − 3), p < .001, with the difference in the degrees of freedom between the two models equal to the difference in the number of parameters estimated between the two models. The result suggests that vocabulary size is a useful predictor of knowledge about derivations. Second, this was also supported by a smaller AIC (2226.72). Third, for the proportion of variance explained, we followed Singer and Willett (2003) and calculated two types of index. The first index is the squared correlation between observed and predicted values (R2y,ŷ). For Model 2, this was the correlation between (a) observed scores of knowledge about derivations and (b) predicted scores of knowledge

about derivations, as predicted by Yij = 7.04 + 0.27*(SIZEij) + 11.22 + 5.15. The correlation was .43 and its squared value was .18. The second index is the proportion of reduction in residual variance (R2r, R2μ0), calculated as ([previous model − subsequent model]/previous model). For Model 2 against Model 1, this was .48 ([9.82 − 5.15]/9.82) and approximately .00 ([11.13 − 11.22]/11.13), suggesting that 48% of the within-school variation in the derivation scores is explained by vocabulary size scores, while virtually none of the between-school variation is explained. Taken together, all these results suggest that Model 2 explains the data better than Model 1. For more information on the proportion of variance explained in MLM, see Hox (2010, pp. 69–78). Although the effect of vocabulary size scores on derivation scores was assumed not to vary significantly across schools in Model 2, this may not hold because the effect could be stronger at some schools than at others, leading to between-school variation in the predictive power of derivation from size. Additionally, knowledge about derivation might be predicted well by vocabulary size in some schools but less so in other schools. This between-school variation is modeled by adding a residual error (μ1j) to the slope (γ10) in Equation 7.8. This produces Equation 7.4. All the other equations, along with Equation 7.4, are as follows:

Model 3:
Level 1 Model:  Yij = β0j + β1j*(SIZEij) + rij  (7.2)
Level 2 Model:  β0j = γ00 + μ0j  (7.3)
                β1j = γ10 + μ1j  (7.4)
Combined:       Yij = γ00 + γ10*(SIZEij) + μ0j + μ1j*(SIZEij) + rij  (7.5)

The residual error (μ1j) is called a random effect or a variance component. The effect of vocabulary size scores (the slope) is now modeled to vary randomly across schools, and such variation is indicated by the size of a variance component. Table 7.3 shows that such variation (μ1j) was nil and nonsignificant, χ2 = 13.77, df = 14, p > .05. This was also supported by the low reliability for the vocabulary size slope (.18), suggesting little between-school variation in the effect of vocabulary size. In sum, the effect of vocabulary size on derivation did not differ significantly across schools and would be more sensibly modeled as a fixed, not random, effect. Thus, Model 2 was preferred over Model 3. This was also supported by a chi-square difference test comparing the relative fit of the two nested models, χ2 = 4.97 (2218.72 − 2213.75), df = 2 (6 − 4), p > .05. Although the AIC was slightly smaller for Model 3, the difference was negligible (2226.72 for Model 2

and 2225.75 for Model 3). The proportion of variance explained was identical (R2y,ŷ = .18), while the proportion of reduction in residual variance was negligible (R2r = .01, R2μ0 = .00). The answer to research question 1 is affirmative: vocabulary size relates to vocabulary depth (i.e., knowledge about derivations), and a one-point increase in vocabulary size leads to a 0.27-point increase in derivation, on average. This relationship, however, does not vary significantly across schools.
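Continuing the illustrative lme4 sketch, the two variance-explained indices described above can be computed directly from the fitted models; m1, m2, and the vocab data frame are the hypothetical objects created earlier, and the values in the comments are the chapter's reported figures rather than the simulated results.

# Proportion of variance explained for Model 2 (cf. Singer & Willett, 2003)
r2_yhat <- cor(vocab$derivation, fitted(m2))^2        # squared observed-predicted correlation (.18 in the chapter)

vc1 <- as.data.frame(VarCorr(m1))
vc2 <- as.data.frame(VarCorr(m2))
r2_level1 <- (vc1$vcov[vc1$grp == "Residual"] - vc2$vcov[vc2$grp == "Residual"]) /
              vc1$vcov[vc1$grp == "Residual"]          # reduction in within-school variance (.48)
r2_level2 <- (vc1$vcov[vc1$grp == "school"]   - vc2$vcov[vc2$grp == "school"]) /
              vc1$vcov[vc1$grp == "school"]            # reduction in between-school variance (about .00)
c(r2_yhat = r2_yhat, r2_level1 = r2_level1, r2_level2 = r2_level2)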

Step 3: Build the level-2 model

Since Model 2 was preferred over Model 3, the analysis could stop here, and there would be no need to add the Level 2, school-level predictor. Nevertheless, for illustration purposes, the Level 2 variable, school type (public or private), is added to examine research questions 2 and 3. First, doing so allows us to demonstrate how to examine a cross-level direct effect, namely, whether a higher-level predictor (i.e., school type) relates to a lower-level outcome variable (i.e., derivations). This concerns research question 2. Model 4 is as follows:

Model 4:
Level 1 Model:  Yij = β0j + β1j*(SIZEij) + rij  (7.2)
Level 2 Model:  β0j = γ00 + γ01*(SCHOOLTYPEj) + μ0j  (7.10)
                β1j = γ10  (7.8)
Combined:       Yij = γ00 + γ01*(SCHOOLTYPEj) + γ10*(SIZEij) + μ0j + rij  (7.11)

The school type is added to the intercept β0j in Model 2 (recall that it was preferred over Model 3). This results in Equation 7.10. Equations 7.2, 7.8, and 7.10 can be combined into a single equation. This results in Equation 7.11. Table 7.3 shows a nonsignificant cross-level direct effect of school type on derivation (γ01 = .04), suggesting that the mean level of knowledge about derivations can be considered the same in public and private schools. Thus, the answer to research question 2 is negative. Second, adding the school-level, school-type predictor also allows one to examine a cross-level interaction effect, namely, whether the relationship between lower-level variables (i.e., vocabulary size and derivation) varies according to or is moderated by a higher-level variable (i.e., school type). Model 5 is as follows:

Model 5:
Level 1 Model:  Yij = β0j + β1j*(SIZEij) + rij  (7.2)
Level 2 Model:  β0j = γ00 + μ0j  (7.3)
                β1j = γ10 + γ11*(SCHOOLTYPEj)  (7.12)
Combined:       Yij = γ00 + γ10*(SIZEij) + γ11*(SCHOOLTYPEj)*(SIZEij) + μ0j + rij  (7.13)

The school type is added to the slope β1j in Model 2 (recall that it was preferred over Model 3). Note that the regression coefficient (γ11) in Equation 7.12 does not have the subscript j (like γ11j) to indicate which school it refers to. This is because the regression coefficient (γ11) refers to all schools (i.e., the regression coefficient is the same across schools) and is therefore called a fixed (not random) coefficient. Because a model does not usually explain all of the variance, the unexplained variance can be modeled by adding a residual (μ1j) for each school to Equation 7.12, which then becomes β1j = γ10 + γ11*(SCHOOLTYPEj) + μ1j. However, based on the result for Model 3, where the effect of vocabulary size on derivation did not differ significantly across schools and was more sensibly modeled as a fixed, not random, effect, Equation 7.12 does not include a residual. By modeling the effect of vocabulary size on derivation as fixed across school type, we assume that the effect is reasonably similar across schools, even if it is not exactly the same. Also note that the cross-level interaction effect is modeled as γ11*(SCHOOLTYPEj)*(SIZEij) in Equation 7.13, where γ11*(SCHOOLTYPEj) is at Level 2 and (SIZEij) is at Level 1. This is less apparent in Equations 7.2, 7.3, and 7.12 but becomes noticeable in Equation 7.13 (Hox, 2010). Variables across different levels are combined; hence, this is called a cross-level interaction effect. As shown in Table 7.3, the school-type coefficient was nonsignificant (γ11 = .00), suggesting that whether students belong to public or private schools does not affect the strength of the relationship between vocabulary size and derivation. Thus, the answer to research question 3 is negative. This relationship can be considered the same in public and private schools.
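In the illustrative lme4 sketch, the cross-level direct effect and the cross-level interaction correspond to the schooltype and size:schooltype terms in the hypothetical models m4 and m5 fitted earlier; their fixed-effects tables show the coefficients that play the role of γ01 and γ11.

# Cross-level direct effect (Model 4) and cross-level interaction (Model 5)
summary(m4)$coefficients    # the schooltype row corresponds to the cross-level direct effect (γ01)
summary(m5)$coefficients    # the size:schooltype row corresponds to the cross-level interaction (γ11)
fixef(m4)                   # fixed-effect estimates only
fixef(m5)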

Considerations in MLM

Estimation methods and sample size

Two methods for estimating parameters are widely used in MLM: full (information) maximum likelihood (FML) and restricted maximum likelihood (RML). FML is probably the most common method because it provides unbiased estimates of parameters (i.e., parameter estimates in the sample are very close to those in a population) when the sample size is large. This is true of fixed coefficients but less so of random coefficients: estimated random coefficients are more biased (smaller) in FML than in RML (Kreft & de Leeuw, 1998). This could

influence inferences in applied studies, but Luke (2004), citing Snijders and Bosker (1999), states that the difference usually would be negligible and recommends using FML when the number of Level 2 units exceeds 30. (The number of Level 2 units in our study was 15, and the guidelines described in the next paragraph should be followed to obtain more accurate results.) Additionally, both fixed and random coefficients are considered when calculating FML, whereas only random coefficients are considered when calculating RML (Hox, 2010). In other words, if two models differ in fixed coefficients, they can be compared only by using FML (not RML). Thus, in many cases with large samples, FML could be more practical in terms of parameter estimation and model comparison. A common question is what sample sizes would be adequate for MLM. Previous studies, for example, suggest 30 and 30 (30 or more groups/units at Level 2 and 30 or more individuals per group/unit at Level 1 [Kreft, 1996, as cited in Hox, 2010]), 50 and 20 to examine cross-level interactions, and 100 and 10 to examine random effects (i.e., variances and covariances and their standard errors [Hox, 2010]). To the best of our knowledge, the most recent, comprehensive, and accessible study regarding adequate sample size is McNeish and Stapleton (2016). By reviewing relevant studies and conducting simulations, they offer the following guidelines for studies with a Level 2 sample size of 30 or below: (a) use RML to estimate random coefficients and use FML if models differ in fixed coefficients; (b) use FML followed by the Kenward-Roger adjustment to address the inflated Type I error caused by smaller fixed-effects standard errors (Kenward-Roger is available in SAS and R [for R, see Halekoh & Højsgaard, 2017]); (c) use FML followed by bootstrapping to estimate sampling variability, since point estimates could be biased (bootstrapping is available in MLwiN and R [for R, see Halekoh & Højsgaard, 2017]); or (d) use Bayesian Markov Chain Monte Carlo (MCMC) estimation methods (available in MLwiN) instead of RML and FML.
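For readers working in R, the sketch below shows how some of these options map onto lme4 and the pbkrtest package; it is an illustration under the assumption that the hypothetical vocab data frame from the earlier sketch is available, and pbkrtest is one R implementation of Kenward-Roger adjusted and parametric-bootstrap model comparisons of the kind referred to above.

# REML (default in lmer) versus FML, and small-sample adjustments
library(lme4)
library(pbkrtest)

m2_reml <- lmer(derivation ~ size + (1 | school), data = vocab)   # RML/REML estimation
m2_fml  <- update(m2_reml, REML = FALSE)                          # FML estimation
m0_reml <- update(m2_reml, . ~ . - size)                          # same model without the size effect

KRmodcomp(m2_reml, m0_reml)                                   # Kenward-Roger adjusted test of the size effect
PBmodcomp(m2_fml, update(m2_fml, . ~ . - size), nsim = 500)   # parametric bootstrap comparison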

Missing data

Although Level 2 data must be complete, Level 1 data can include missing values. This is one of the great benefits of using MLM, because Level 1 data, which are usually observations and responses, often contain missing values, for example, due to students skipping items. This flexibility of MLM is also of great value in longitudinal studies, where Level 1 data are students' responses at different points in time and Level 2 units are students: not all students show up for every test date, which results in missing data across time. Although MLM can be run with Level 1 missing data, missing data could be imputed using other software if issues such as nonconvergence of parameter estimation arise.
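A minimal sketch of what such Level 1 data might look like in long format, with each row representing one measurement occasion and with different students contributing different numbers of occasions; the values and column names are invented purely to show the layout.

```python
import pandas as pd

# Long-format repeated-measures data: occasions (Level 1) nested in students
# (Level 2). Student B is missing the third test date; MLM can still use
# B's two available rows rather than dropping the student entirely.
long_df = pd.DataFrame({
    "student":  ["A", "A", "A", "B", "B", "C", "C", "C"],
    "occasion": [1, 2, 3, 1, 2, 1, 2, 3],
    "score":    [45, 48, 52, 60, 61, 38, 40, 47],
    "gender":   ["F", "F", "F", "M", "M", "F", "F", "F"],  # time-invariant
})
print(long_df)
```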

Software

HLM7 (Raudenbush et al., 2016) is recommended for those conducting MLM for the first time. HLM7 allows one to specify models with ease and clarity, both of which we believe are particularly important given that users need to decide on

which variables to enter into the equation, at which level (e.g., Level 1 only, Level 2 only, or both), as random or fixed, and as group-mean centered or grand-mean centered (for details, please see the tutorial on the Companion website), among other decisions. A student version of the software can be downloaded for free (Scientific Software International, Inc., 2005–2016), with some restrictions, such as a cap on the maximum number of observations at each level. Textbook examples are available from the UCLA Institute for Digital Research and Evaluation (2017). Advanced users may prefer the lme4 package in R, because it can be combined flexibly with other R packages.

Discussion and conclusion

This chapter has discussed the application of MLM to the examination of sources of variability in L2 test scores. For this purpose, it has investigated the relationship between scores on two vocabulary tests in cross-sectional designs and whether the strength of that relationship changes across schools. Such investigations using MLM can offer useful insight into how variables of interest are related to each other while modeling the hierarchical or nested structure of data, in which one observation or response is not independent of the others. Stated differently, MLM allows for examining whether the relationship among lower-level, individual variables changes according to (or is moderated by) a higher-level, contextual variable. If dependencies among observations or responses are not considered, the results may have an inflated Type I error (e.g., Hox, 2010): we may then wrongly conclude that statistically significant effects exist when they do not. With hierarchical data, potential applications of MLM in language testing abound and are useful in better understanding how variables at different hierarchical levels (e.g., students' test scores at Level 1, students' school at Level 2) contribute to variability in test scores. MLM was used in Barkaoui (2010, 2013, 2014, 2015, 2016), Cho et al. (2013), Feast (2002), and Koizumi et al. (2016). The nested designs in these studies are commonplace (e.g., Raters 1, 2, and 3 rate essays of Students 1–3, Students 4–6, and Students 7–9, respectively [Barkaoui, 2010]), which suggests that many other studies have similar designs and could benefit from the application of MLM. At the same time, previous studies, when reanalyzed using MLM, could model contextual variables and might produce results that differ from those reported earlier. This line of inquiry using MLM can help researchers examine sources of variability in test scores and refine arguments for test score interpretation and use.

Acknowledgments

We thank Rie Koizumi, the editors, and reviewers for their valuable comments on earlier versions of this paper. This work was funded by the Japan Society for the Promotion of Science (JSPS) KAKENHI, Grant-in-Aid for Scientific Research (C), Grant Number 17K03023, which was awarded to the first author.


References

Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27(4), 515–535. doi:10.1177/0265532210368717
Barkaoui, K. (2013). Using multilevel modeling in language assessment research: A conceptual introduction. Language Assessment Quarterly, 10(3), 241–273. doi:10.1080/15434303.2013.769546
Barkaoui, K. (2014). Quantitative approaches for analyzing longitudinal data in second language research. Annual Review of Applied Linguistics, 34, 65–101. doi:10.1017/S0267190514000105
Barkaoui, K. (2015). The characteristics of the Michigan English Test reading texts and items and their relationship to item difficulty. Retrieved from http://www.cambridgemichigan.org/wp-content/uploads/2015/04/CWP-2015-02.pdf
Barkaoui, K. (2016). What changes and what doesn't? An examination of changes in the linguistic characteristics of IELTS repeaters' Writing Task 2 scripts. IELTS Research Report Series, 3. Retrieved from https://www.ielts.org/teaching-and-research/research-reports
Cho, Y., Rijmen, F., & Novák, J. (2013). Investigating the effects of prompt characteristics on the comparability of TOEFL iBT™ integrated writing tasks. Language Testing, 30(4), 513–534. doi:10.1177/0265532213478796
Eiken Foundation of Japan. (2017). Investigating the relationship of the EIKEN tests with the CEFR. Retrieved from http://stepeiken.org/research
Feast, V. (2002). The impact of IELTS scores on performance at university. International Education Journal, 3, 70–85.
Halekoh, U., & Højsgaard, S. (2017). Package 'pbkrtest': Parametric bootstrap and Kenward-Roger based methods for mixed model comparison. Retrieved from http://cran.r-project.org/web/packages/pbkrtest/pbkrtest.pdf
Harring, J. R., Stapleton, L. M., & Beretvas, S. N. (Eds.). (2016). Advances in multilevel modeling for educational research: Addressing practical issues in real-world applications. Charlotte, NC: Information Age Publishing.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York: Routledge.
Japan Association of College English Teachers (JACET), Basic Word Revision Committee. (Ed.). (2003). JACET List of 8000 Basic Words. Tokyo: Author.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger Publishers.
Koizumi, R., & In'nami, Y. (2013). Vocabulary knowledge and speaking proficiency among second language learners from novice to intermediate levels. Journal of Language Teaching and Research, 4(5), 900–913. doi:10.4304/jltr.4.5.900-913. Retrieved from http://www.academypublication.com/issues/past/jltr/vol04/05/02.pdf
Koizumi, R., In'nami, Y., Asano, K., & Agawa, T. (2016). Validity evidence of Criterion® for assessing L2 writing performance in a Japanese university context. Language Testing in Asia, 6(5), 1–26. doi:10.1186/s40468-016-0027-7
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. Thousand Oaks, CA: Sage Publications.
Linck, J. A., & Cunnings, I. (2015). The utility and application of mixed-effects models in second language research. Language Learning, 65(S1), 185–207. doi:10.1111/lang.12117
Luke, D. (2004). Multilevel modeling. Thousand Oaks, CA: Sage Publications.
Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. Journal of Applied Psychology, 97(5), 951–966.
McCoach, D. B. (2010). Dealing with dependence (part II): A gentle introduction to hierarchical linear modeling. Gifted Child Quarterly, 54(3), 252–256. doi:10.1177/0016986210373475
McCoach, D. B., & Adelson, J. L. (2010). Dealing with dependence (part I): Understanding the effects of clustered data. Gifted Child Quarterly, 54(2), 152–155. doi:10.1177/0016986210363076
McNeish, D. M., & Stapleton, L. M. (2016). The effect of small sample size on two-level model estimates: A review and illustration. Educational Psychology Review, 28(2), 295–314. doi:10.1007/s10648-014-9287-x
Murakami, A. (2016). Modeling systematicity and individuality in nonlinear second language development: The case of English grammatical morphemes. Language Learning, 66(4), 834–871. doi:10.1111/lang.12166
Peugh, J. L. (2010). A practical guide to multilevel modeling. Journal of School Psychology, 48(1), 85–112. doi:10.1016/j.jsp.2009.09.002
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., Congdon, R. T., Jr., & du Toit, M. (2016). HLM7: Hierarchical linear and nonlinear modelling. Lincolnwood, IL: Scientific Software International.
Scientific Software International, Inc. (2005–2016). HLM 7 student edition. Retrieved from http://www.ssicentral.com/index.php/products/hml/free-downloads-hlm
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford, UK: Oxford University Press.
Snijders, T. A. B., & Bosker, R. L. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage Publications.
Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge: Acquisition of collocations under different input conditions. Language Learning, 63(1), 121–159. doi:10.1111/j.1467-9922.2012.00730.x
Tabachnick, B. G., & Fidell, L. S. (2014). Using multivariate statistics (6th ed.). Harlow, Essex, UK: Pearson.
UCLA Institute for Digital Research and Evaluation. (2017). Multilevel modeling with HLM. Retrieved from http://stats.idre.ucla.edu/other/hlm/hlm-mlm/

8 Longitudinal multilevel modeling to examine changes in second language test scores

Khaled Barkaoui and Yo In'nami

Introduction

Two recent phenomena have resulted in a growing body of longitudinal second language (L2) test data. The first phenomenon is the growing number of test takers who repeat or retake English language proficiency (ELP) tests more than once, often because they failed to achieve a particular score on earlier attempts and need to take the test again in order to achieve a higher score that meets a certain cut score (e.g., for university admission). For example, Zhang (2008) reported that in the period from January to August 2007, approximately 250,000 candidates, or 10% of all candidates, repeated the Test of English as a Foreign Language Internet-based test (TOEFL® iBT) at least once; 12,000 of these candidates repeated the test once within 30 days. As for the International English Language Testing System (IELTS), according to Green (2005), 15,380 candidates took the IELTS Academic test more than once in the period January 1998 to June 2001. It is likely that the number of test takers who repeat (or retake) L2 proficiency tests will continue to increase.

The second phenomenon contributing to the rise of longitudinal L2 test data is the growing number of studies examining changes in L2 proficiency test scores after a period of L2 instruction (e.g., Elder & O'Loughlin, 2003; Ling, Powers, & Adler, 2014; O'Loughlin & Arkoudis, 2009). These studies rest on the following reasoning: if examinees' performance on the construct of interest (e.g., L2 proficiency) is expected to improve after L2 instruction, if such change can be detected with an L2 proficiency test, and if test scores do improve substantially after L2 instruction, then this provides support for the claim that the test measures the construct of interest (Xi, 2008). These studies also aim to establish evidence that the tests are sensitive to change in L2 proficiency over time. While most L2 proficiency tests are primarily intended to measure test taker ELP at a particular point in time, these tests are sometimes used to measure L2 development or progress over time and/or in relation to L2 instruction. However, in order to use these tests to make valid claims about L2 development and instruction effectiveness, more studies are needed to examine the sensitivity of these tests to changes in learner L2 proficiency and to instructional effects.

In both cases (i.e., test repetition and intervention studies), the data collected are longitudinal, or repeated-measures, data. Such data present two challenges: one concerns validity and the second concerns score analysis. First, as Barkaoui (2017) argued, when test takers are permitted to retake an English proficiency test as often as they wish, new issues can arise concerning the meaning of repeaters' test scores. A key assumption underlying the valid interpretation and use of repeaters' test scores is that changes in scores over time reflect true changes in the construct of interest (i.e., L2 proficiency) and are not due to construct-irrelevant factors. This means that the major source of variability in repeaters' scores over time is true change in the construct being measured. If changes in repeaters' test scores over time are associated with factors other than changes in L2 proficiency, then this would change the meaning of test scores and undermine tests' validity arguments (Barkaoui, 2017, 2019).

Second, studies that have examined repeaters' L2 test scores (e.g., Elder & O'Loughlin, 2003; Green, 2005; O'Loughlin & Arkoudis, 2009; Zhang, 2008) are methodologically limited. Specifically, these studies tend to use traditional techniques such as analysis of variance (ANOVA) and multiple regression analysis to analyze test scores and compare them across test takers and time (Barkaoui, 2014). These methods have several limitations, however. First, they only allow comparing test performance across two time points (e.g., pre-test versus post-test). This design can reveal changes that might occur across time but does not capture the rate or shape of change (Taris, 2000). Second, these techniques provide information about average (i.e., group-level) changes in test scores over time, overlooking interindividual variability in changes (Newsom, 2012; Weinfurt, 2000). Third, traditional techniques are based on several restrictive assumptions that are often difficult to meet in practice, such as completeness of data, equally spaced measurement occasions, and sphericity, which refers to the assumption that correlations between all pairs of scores across occasions are equal, irrespective of the time interval between occasions (Twisk, 2003; Weinfurt, 2000). These techniques also require that all test takers repeat the test an equal number of times and at the same time points (Bijeleveld et al., 1998; Hox, 2002; Twisk, 2003). Otherwise, data for test takers with an unequal number of tests and/or irregular spacing of test occasions are deleted to create a complete data set before data analysis. Multilevel modeling (MLM) provides a powerful alternative that addresses these limitations.

Multilevel modeling

MLM is a family of statistical models for analyzing data with nested or hierarchical structure (Barkaoui, 2013, 2014; Hox, 2002; Kozaki & Ross, 2011; Luke, 2008). Nested data means that observations at lower levels are nested within units at higher levels, such as students nested within classrooms and classrooms nested within schools. Longitudinal or repeated-measures data are also nested, with test scores obtained from the same individual over time being nested within the person (Henry & Slater, 2008; Hox, 2002). Here, the same

individual produces test scores over time (e.g., score at time 1, score at time 2, score at time 3…), and these scores are considered to be situated within the person. The point is that regardless of whether students are nested within classrooms, classrooms are nested within schools, or test scores are nested within persons, the data have a nested or hierarchical structure and can be analyzed using MLM.

MLM distinguishes between two levels of analysis in longitudinal or repeated-measures data: Level 1 observations or test scores that are nested in Level 2 units or test takers (see Chapter 7, Volume II). Given an outcome variable such as test scores, the Level 1 equation examines how the outcome changes within each test taker over time. The Level 1 equation includes two main parameters of change for each test taker: initial status, or the intercept of the test taker's trajectory, and the rate of change, which is the slope of the test taker's trajectory over time. Trends in change in test scores can be tested to find out if they are linear or nonlinear, and parallel change processes can be examined as time-varying predictors (Luke, 2008; Preacher, Wichman, MacCallum, & Briggs, 2008; Singer & Willett, 2003). Time-varying (or intraindividual) predictors are variables whose values change over time and include test takers' age, L2 proficiency, and other variables. In contrast, time-invariant (or interindividual) predictors are variables that are constant across time, such as test takers' first language (L1) and gender, which are included as Level 2 predictors in MLM (Luke, 2008).

The change trajectory within individuals can vary across individuals in terms of initial status (intercept) and/or rate of change (slope). At Level 2, test takers' initial status and change rates serve as dependent variables, and test taker factors (e.g., gender and L1) serve as predictors that may explain variability (across individuals) in initial status and in the rate of change in test scores over time. For example, when examining test repeaters' scores, MLM can estimate the rate and shape (e.g., linear, nonlinear) of change in repeaters' scores over time, whether the rate of change in scores varies across individuals, and what factors (e.g., test taker L1 and age) explain differences across test takers in terms of rate of change in test scores over time (Barkaoui, 2014; Hox, 2002; Luke, 2008). If individuals are nested within higher-level units (e.g., classrooms, programs, schools), these units could be added in order to examine the effects of contextual factors on changes in test scores over time (cf. Kozaki & Ross, 2011). MLM thus allows examining the effects of various individual and contextual factors on interindividual differences in test scores' change over time. As a result, MLM can address several important questions about change in test scores over time, such as:

• What is the form and structure of change in test scores over time?
• Is individual change linear, nonlinear, continuous, or discontinuous over time?
• What is the relationship between initial L2 proficiency level and rate of change in test scores?
• Do individuals and groups vary significantly in their initial level and rate of change in test scores over time?
• What individual and contextual factors explain differences in individual initial level and rate of change over time? (Bijeleveld et al., 1998; Luke, 2008; Singer & Willett, 2003)

Practically, MLM can handle both short and long time series and unbalanced data sets with varying numbers, timing, and spacing of testing occasions across and within test takers, which allows the use of all available data, including data from individuals with missing data on some occasions, assuming data are missing at random (MAR) (Hox, 2002; Luke, 2004). Finally, MLM can accommodate time-varying and time-invariant predictors of change, can handle multivariate and categorical outcome variables, and can accommodate more than two hierarchical levels, including, for example, time nested within test taker, test taker nested within classroom, and classroom nested within school.

Although MLM has been widely applied to longitudinal data in educational research, there has been little published research that uses MLM to analyze repeatedly measured L2 test scores (e.g., Barkaoui, 2016, 2019; Gu, Lockwood, & Powers, 2015; Koizumi, In'nami, Asano, & Agawa, 2016; Kozaki & Ross, 2011). Kozaki and Ross (2011), for example, used MLM to examine the effects of both individual and contextual factors and their interactions on changes in L2 learners' TOEIC (Test of English for International Communication) Bridge scores over two years. They found significant interindividual differences in changes in test scores, with various individual and contextual factors being significantly related to variation in growth trajectories across individuals. Gu et al. (2015) used MLM to analyze data from 4,606 students who took the TOEFL Junior test more than once between early 2012 and mid-2013 in order to examine the relationship between (a) the time interval between test occasions, used as a proxy for changes in underlying L2 proficiency because of learning, and (b) changes in test scores across test occasions. Gu et al. (2015) found that test takers obtained higher scores on the second administration and that there was a positive, statistically significant relationship between interval length and score gains. Generally, students with longer intervals between retesting exhibited greater gains in terms of both section and total test scores than did students with shorter intervals.

Method

Data set

To illustrate how MLM can be used to analyze L2 test repeaters' scores, a data set was obtained from Pearson that includes a sample of 1,000 test takers who each took the Pearson Test of English (PTE) Academic three or more times in the period April 2014 to May 2015. The variables used consist of the PTE Academic total scores, test dates, and test locations. The PTE Academic is a three-hour, computer-based test that assesses test takers' academic ELP in listening, reading, speaking, and writing. From the 1,000 test takers, more than half (59.7%) took the test three times; more than a fifth (22.6%) repeated the test four times, and less than a tenth (8.6%) repeated the test five times (see Table 8.1). The data set included 3,772 complete test records (that is, 3 to 15 complete test records per test taker). The interval between any two successive tests varied between two days (e.g., between tests 3 and 4) and 349 days (between tests 2 and 3). About two-thirds of the test takers (66%) were males; they ranged in age between 16 and 60 years (mean [M] = 26.33 years, standard deviation [SD] = 5.54) and were citizens of 72 different countries.

Table 8.1  Descriptive statistics for number of times test was taken (N = 1,000)

Number of Times Test was Taken    Number of Test Takers    %
3                                 597                      59.7
4                                 226                      22.6
5                                  86                       8.6
6                                  45                       4.5
7                                  26                       2.6
8                                   6                       0.6
9                                   6                       0.6
10                                  3                       0.3
11                                  2                       0.2
12                                  1                       0.1
14                                  1                       0.1
15                                  1                       0.1

Table 8.2 displays descriptive statistics for the PTE scores for test occasions 1 to 15. As Table 8.2 shows, the score scale for the PTE Academic ranges from 10 to 90, and as a general pattern, the mean PTE scores increase across test occasions 1 to 7, which involved 40 or more test takers.

Table 8.2  Descriptive statistics for PTE total scores by test occasion

Occasion    N        M       SD
1           1,000    47.43   10.23
2           1,000    49.34    9.61
3           1,000    50.74   10.23
4             403    51.49    9.70
5             177    51.68    9.97
6              91    51.44   11.10
7              46    53.28   10.97
8              20    49.90   12.94
9              14    48.64   13.04
10              8    53.50   16.22
11              5    49.80   18.69
12              3    47.00   20.30
13              2    53.50   17.68
14              2    56.50   16.26
15              1    71.00   —

Some of the research questions that can be asked in relation to the four variables in the data set (i.e., time, test location, number of previous tests, and initial ELP) include:

1. What is the rate of change in repeaters' PTE scores over time?
2. What is the shape of change in repeaters' PTE scores over time?
3. Does the rate of change in PTE scores over time vary across test takers?
4. What factors contribute to (a) variability in PTE scores at Test Occasion 1 as well as (b) variability in PTE scores over time?

To address research question 4, the following factors will be examined: experience with the test (i.e., number of previous tests taken), context (test location), and test taker initial overall ELP.

Data analysis

To address the aforementioned research questions, the computer program HLM6 (Raudenbush, Bryk, Cheong, & Congdon, 2004) was used. Given that the number of times the test was taken (3 to 15 times) and the length of the interval between tests (2 to 349 days) varied considerably across test takers, MLM was the most appropriate method to analyze the data. The outcome variable in the MLM analyses was the PTE score for each test taker for each test occasion. In addition to time and time2 (see following section), one Level 1 (i.e., time-varying) predictor and two Level 2 (time-invariant) predictors were included as follows (a data-preparation sketch follows this list):

Level 1 predictors:
• Time: Time since test 1, in months, to examine for linear change in test scores.
• Time2: To examine for quadratic change in test scores. As Field (2009) explained, a quadratic trend looks like an inverted U: "If the means in ordered conditions are connected with a line then a quadratic trend is shown by one change in the direction of this line" (Field, 2009, p. 792). For instance, a quadratic trend can show an initial increase in scores followed by a decline.
• Number of previous tests: Number of times PTE Academic was taken before (0 to 14).

Level 2 predictors:
• Initial ELP: Total score at Test Occasion 1 was used to examine whether and how differences in initial ELP relate to variability in changes in PTE scores over time. Total score at test 1 was grand-mean centered1 (M = 47.43).
• Context: Test center location at Test Occasion 1 was used as an indicator of the context where the test taker lived. Test center locations were classified into three categories using Kachru's (1992) concentric circles of English: inner circle (Australia, the UK, and the United States, 54.8%), outer circle (Ghana, Hong Kong, India, Kenya, Malaysia, Nigeria, Singapore, South Africa, Sri Lanka, and Zimbabwe, 31.8%), and expanding circle (all other countries, 13.4%). Two dummy variables2 were included (outer circle and expanding circle) with inner circle as the baseline category (coded 0).
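The sketch below shows, under assumed column names and an assumed file name, how such predictors might be derived from a raw record of test dates, total scores, and test-center classifications. It is only an illustration of the data preparation described above, not the authors' procedure (the analyses in this chapter were run in HLM6).

```python
import pandas as pd

# Hypothetical file with columns: taker, occasion_date, score, circle
# (circle = "inner", "outer", or "expanding", following Kachru, 1992).
df = pd.read_csv("pte_repeaters.csv")
df["occasion_date"] = pd.to_datetime(df["occasion_date"])
df = df.sort_values(["taker", "occasion_date"])

# Level 1: time in months since each taker's first test (~30.44 days/month),
# its square, and the number of previous tests taken.
first_date = df.groupby("taker")["occasion_date"].transform("min")
df["time"] = (df["occasion_date"] - first_date).dt.days / 30.44
df["time2"] = df["time"] ** 2
df["prev_tests"] = df.groupby("taker").cumcount()

# Level 2: grand-mean centered total score at Test Occasion 1.
first_score = df.groupby("taker")["score"].first()
df["init_score_c"] = df["taker"].map(first_score - first_score.mean())

# Dummy coding of context, with the inner circle as the baseline (0, 0).
df["outer"] = (df["circle"] == "outer").astype(int)
df["expanding"] = (df["circle"] == "expanding").astype(int)
```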

Following Singer and Willett (2003), several MLM models were built and evaluated to address the aforementioned research questions; each model addressed one or more of them. Specifically, the Level 1 equation examined how PTE scores changed for each test taker across test occasions and how changes in scores related to the number of times the test had been taken previously. Level 2 (test taker) included two predictors, initial ELP and context, in order to examine whether and to what extent the relationships between Level 1 predictors and PTE scores over time varied significantly in relation to these Level 2 factors. For each model, two main indices of model fit were examined: (a) the deviance statistic, which compares the fit of multiple models to the same data set, and (b) significance tests for individual parameters. Based on the results of these different models, a final model was built. The full maximum likelihood (FML) method of parameter estimation was used, as it allows comparison of models that differ in terms of fixed and random coefficients (Hox, 2002; Luke, 2004). Because of the small number of test takers who repeated the test five or more times (n = 86), only quadratic change was examined. Examining higher polynomial change trajectories (e.g., cubic change) would require a larger number of test takers who repeated the test five or more times.

Results

Tables 8.3 and 8.4 display the results for the various MLM models examined in the study. Tables 8.3 and 8.4 include four sets of statistics: fixed effects, random effects, reliability, and model fit. Fixed effects can be interpreted in the same way as beta coefficients in multiple regression analysis (see Seo & Taherbhai, this volume) and include the intercept of the outcome and a slope for each predictor; the slope indicates the strength of the association between each predictor and the outcome, controlling for the effects of other predictors in the model. MLM uses t-tests to assess whether, on average, the relationship between a given predictor and the outcome is significantly different from zero. As a rule of thumb, a coefficient reaches significance at p < .05 when its estimate is twice as large as its standard error (SE) (Hox, 2002). Random effects refer to the magnitude of variance in coefficients (i.e., intercept or slope) across test takers. Chi-square (X2) tests are used to test whether a random effect significantly departs from zero. A significant random effect indicates that the coefficient (e.g., the association

between a predictor and the outcome) varies significantly across test takers (Barkaoui, 2013; Hox, 2002). Reliability refers to the reliability of Level 1 random coefficients and represents the proportion of variance among Level 2 units that is systematic and, thus, can be modeled in the Level 2 equation using Level 2 variables.3 For example, if the estimated reliability of a coefficient is .70, this indicates that 70% of the differences among Level 2 units could be attributed to true variation among Level 2 units rather than measurement or sampling errors. Generally, a larger number of data points, a higher level of heterogeneity of true parameters of Level 2 units, and a lower measurement error are associated with higher reliability of parameters (Deadrick, Bennett, & Russell, 1997). Finally, model fit is assessed using the deviance statistic and the Akaike information criterion (AIC). The deviance for any one model cannot be interpreted directly, but it can be used to compare the overall fit of multiple models to the same data set (Hox, 2002; Luke, 2008). Generally, models with fewer parameters and lower deviance are better; however, a model with additional parameters will always have a lower deviance (Hox, 2002). Because deviance is sample-sensitive and has no penalty for sample size, its magnitude can change across samples. Consequently, researchers are advised to examine the AIC indices, rather than deviance, when comparing models (Luke, 2004). AIC is based on the deviance but incorporates penalties for a greater number of parameters. Generally, models with lower AIC fit better than models with higher AIC. The chi-square (X2) difference test assesses whether more complex models (i.e., models including more parameters) improve model fit significantly compared to less complex ones.
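As a small worked example of these fit indices, the following sketch reproduces the chi-square difference test and the AIC arithmetic using the deviance values and parameter counts reported for Models 1 and 2 later in Table 8.3.

```python
from scipy.stats import chi2

# Deviance and number of parameters for Models 1 and 2 (Table 8.3).
dev_m1, k_m1 = 24215.14, 3
dev_m2, k_m2 = 23940.67, 4

# Chi-square difference test: drop in deviance, df = difference in parameters.
x2 = dev_m1 - dev_m2            # 274.47
df_diff = k_m2 - k_m1           # 1
p_value = chi2.sf(x2, df_diff)  # far below .01

# AIC = deviance + 2 * (number of parameters) (Luke, 2004).
aic_m1 = dev_m1 + 2 * k_m1      # 24221.14
aic_m2 = dev_m2 + 2 * k_m2      # 23948.67
print(round(x2, 2), p_value, aic_m1, aic_m2)
```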

The first MLM model examined was a null model, estimated in order to obtain the following: (i) a partitioning of the total variation in test scores into its within-person (Level 1: across test occasions) and between-person (Level 2) components; (ii) a measure of dependence within each Level 2 unit by way of the intraclass correlation (ICC); and (iii) a benchmark value of the deviance that can be used to compare models. The ICC provides a measure of how much variance in test scores is associated with test taker (Level 2) as opposed to test occasion (Level 1). The equations for Model 1 are as follows:

Model 1
Level 1: Y = π0 + e
Level 2: π0 = β00 + r0        (8.1)

According to Model 1, a test score (Y) for a given test taker at a given test occasion is a function of an intercept (π0), or overall mean score across all test takers and occasions, and a random component (e), or unmodeled score variability at Level 1. At Level 2, the intercept (π0) is defined as a function of the regression intercept (β00) and a Level 2 random component (r0), or unmodeled variation across test takers. The results for Model 1 are reported in the second column of Table 8.3. They indicate that there was more interindividual variability (86.59) than intraindividual variability (16.16) in test scores.

Table 8.3  Models 1–4 for PTE total scores

                                      Model 1         Model 2         Model 3         Model 4
Fixed Effects (SE)
  Intercept (π0)
    Intercept (β00)                   49.58** (.30)   48.61** (.31)   48.08** (.31)   47.88** (.31)
  Time (π1)
    Intercept (β10)                                   .72** (.05)     1.65** (.11)    .56** (.14)
  Time2 (π2)
    Intercept (β20)                                                   −.14** (.02)    −.06** (.02)
  Number of previous tests (π3)
    Intercept (β30)                                                                   .92** (.09)
Random Effects
  Between-Test Taker Variance (r0)    86.59           88.73           88.45           87.64
  X2 (df = 999)                       21553**         24464.97**      25421.49**      26574.06**
  Within-Test Taker Variance (e)      16.16           14.54           13.89           13.18
Reliability
  Intercept (π0)                      .95             .96             .96             .96
Model Fit
  Deviance (#parameters)              24215.14 (3)    23940.67 (4)    23810.09 (5)    23652.58 (6)
  Model Comparison: X2 (df)                           274.47** (1)    130.58** (1)    157.51** (1)
  AIC                                 24221.14        23948.67        23820.09        23664.58

Notes: * p < .05; ** p < .01; AIC = deviance + 2*(number of parameters) (Luke, 2004, p. 34)

The ICC, or the proportion of variance at the test taker level, is estimated as .84 (86.59 / [86.59 + 16.16]). In other words, 84% of the variance in test scores is between test takers and 16% is variance across test occasions.4 The intercept of 49.58 in Model 1 is simply the average test score across all test takers and test occasions. The intercept variance is significant (X2 = 21553, df = 999, p < .01), indicating that the average test score varied significantly across test takers. Finally, the coefficient reliability estimate represents the proportion of between-test taker variance in the intercept coefficient that is systematic and thus can be modeled in the Level 2 equation using test taker-level variables (Raudenbush & Bryk, 2002). In this case, 95% of the variation in the intercept is potentially explicable by test taker-level predictors.
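The ICC arithmetic can be verified directly from the two variance components reported for Model 1; the short sketch below does so, and the closing comment notes where comparable quantities would come from in a statsmodels fit, which is an assumption for illustration only (the analyses here were run in HLM6).

```python
# Variance components from the null model (Model 1, Table 8.3).
between_variance = 86.59  # Level 2 (between-test taker) variance, r0
within_variance = 16.16   # Level 1 (within-test taker) variance, e

icc = between_variance / (between_variance + within_variance)
print(round(icc, 2))  # 0.84: 84% of score variance lies between test takers

# In a statsmodels MixedLM fit of an intercept-only model, the analogous
# values would be result.cov_re.iloc[0, 0] and result.scale.
```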

Model 2 added time as a linear predictor at Level 1, with no random component, to examine the rate of change in test scores over time (i.e., research question 1). The equations for Model 2 are as follows:

Model 2
Level 1: Y = π0 + π1*(TIME) + e
Level 2: π0 = β00 + r0
         π1 = β10        (8.2)

Here, Y is the outcome (i.e., test score), π0 is the intercept, and π1 is the slope coefficient for time. The slope for time is fixed at Level 2 (i.e., it does not have r1), while the intercept has a random component (r0). At Level 2, the intercept (β00) is the average test score at Test Occasion 1 (coded 0). The third column in Table 8.3 reports the results for Model 2. As shown in Table 8.3, Model 2 predicts a value of 48.61 (i.e., the average test score across all test takers at Test Occasion 1), which increases by .72 points, on average, each month. This increase is statistically significant (i.e., reliably different from zero). Fit statistics indicated that Model 2 fits the data significantly better than Model 1 (X2 = 24215.14 − 23940.67 = 274.47, df = 4 − 3 = 1, p < .01). The AIC for Model 2 is also smaller than that for Model 1.
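A quick way to read Model 2 is to compute the predicted mean PTE score at a few time points from the reported fixed effects; the following minimal sketch does just that.

```python
# Model 2 fixed effects (Table 8.3): intercept and linear slope for time.
intercept, slope = 48.61, 0.72

# Predicted mean score 0, 3, 6, and 12 months after Test Occasion 1.
for months in (0, 3, 6, 12):
    print(months, round(intercept + slope * months, 2))
```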

Model 3 added time2 as a predictor (with no random component) in order to examine whether the rate of change in test scores accelerates or decelerates over time (i.e., research question 2). The equations for Model 3 are as follows:

Model 3
Level 1: Y = π0 + π1*(TIME) + π2*(TIME2) + e
Level 2: π0 = β00 + r0
         π1 = β10
         π2 = β20        (8.3)

In this equation, π2 is the slope coefficient for time2. The slopes for time and time2 are fixed at Level 2 (i.e., do not have r's). Column 4 of Table 8.3 reports the results for Model 3. Fit statistics indicated that Model 3 fits the data significantly better than the linear model (i.e., Model 2) (X2 = 130.58, df = 1, p < .01). According to Model 3, the average initial test score (intercept) is 48.08. Because the coefficient for time2 is significantly different from zero, the meaning of the time coefficient changes. Specifically, the parameter associated with time (1.65) no longer represents a constant rate of change as in Model 2; it now represents the instantaneous rate of change at initial status. The significant time2 coefficient indicates that the rate of change in test scores changes over time. Time2, thus, represents the degree of curvature of the growth trajectory (Singer & Willett, 2003). Because the time coefficient is positive, the trajectory initially rises, with an average increase of 1.65 points for the first month. But because time2 is negative (−.14), this increase does not persist and decelerates over time. With each passing month, the magnitude of the increase in test scores diminishes by .14 points. As Singer and Willett (2003) explained, when scores exhibit a quadratic change pattern, the parameters time and time2 compete to determine the value of the outcome. "The quadratic term will eventually win because, for numeric reasons alone, time2 increases more rapidly than time. So, even though the linear term suggests that [test scores] increase over time, the eventual domination of the quadratic term removes more than the linear term adds and causes the trajectory to peak and then decline" (Singer & Willett, 2003, p. 216). Based on the results for Model 3, the moment when the quadratic trajectory flips over is −1.65/(2[−.14]) = 5.89 months (or 177 days) since Test Occasion 1.5 This means that test scores continue to rise for the first six months and then start to diminish.
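The reported turning point follows from the formula −π1/(2π2) for the peak (or trough) of a quadratic trajectory (Singer & Willett, 2003); a minimal check using the Model 3 estimates:

```python
# Model 3 fixed effects (Table 8.3): instantaneous linear rate and curvature.
linear, quadratic = 1.65, -0.14

# Peak of the quadratic trajectory, in months since Test Occasion 1.
peak_months = -linear / (2 * quadratic)
print(round(peak_months, 2))       # about 5.89 months
print(round(peak_months * 30, 0))  # roughly 177 days
```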

Model 4 added the time-varying predictor, which is the number of previous tests (with no random component), to Level 1 in order to examine whether this factor has a significant association with test scores. The equations for Model 4 are as follows:

Model 4
Level 1: Y = π0 + π1*(TIME) + π2*(TIME2) + π3*(Number of previous tests) + e
Level 2: π0 = β00 + r0
         π1 = β10
         π2 = β20
         π3 = β30        (8.4)

In this equation (see Column 5 of Table 8.3), π3 is the slope coefficient for the number of previous tests, which is fixed at Level 2. The model shows that when the number of previous tests is taken into account, the average test score at Test Occasion 1 is 47.88. The coefficient for linear change is .56 and that for quadratic change is −.06. Number of previous tests had a positive, significant association with test scores: for each additional test taken, there was an average increase of .92 points in test scores across test occasions. Model fit also improved significantly when number of previous tests was added (X2 = 157.51, df = 1, p < .01). This suggests that more experience with the test is associated with increases in overall test scores over time.

Next, we examined whether any of the Level 1 regression slopes (i.e., π1, π2, and π3) varied significantly across Level 2 units, that is, had a significant variance component (i.e., research question 3). This is the random-slopes model. Hox (2002) advised that random slope variation should be tested one slope at a time. This means developing and testing as many submodels as there are Level 1 predictors; in each submodel, only one Level 1 predictor is allowed to have a random slope. After deciding which slopes have significant variance among Level 2 units, all the significant variance components can be added in a final model, which is then compared to Model 4 in terms of model fit. For the current data set, Model 5 assessed whether the association between each Level 1 predictor (i.e., time, time2, and number of previous tests) and test scores (within test taker) varied significantly across test takers. Three submodels (of Model 5) were examined; in each submodel, only one Level 1 predictor was allowed to have a random slope (r). Each of these models was similar to Model 4 except that one Level 1 slope was left to vary randomly across test takers. In the first submodel, a varying random slope for time was specified (π1 = β10 + r1) to examine whether the association between time and test scores varied significantly across test takers. In the second submodel, a varying random slope for time2 was specified (π2 = β20 + r2), and in the third submodel,

a varying random slope for number of previous tests was specified (π3 = β30 + r3). Chi-square and deviance statistics were used to assess whether each association varied significantly across test takers. The results indicated that only time had a significant random variance component (i.e., the time slope varied significantly across test takers). Consequently, the final version of Model 5 was specified as follows:

Model 5
Level 1: Y = π0 + π1*(TIME) + π2*(TIME2) + π3*(Number of previous tests) + e
Level 2: π0 = β00 + r0
         π1 = β10 + r1
         π2 = β20
         π3 = β30        (8.5)

The Level 1 equation, the Level 2 intercept equation (π0), and the time2 and number of previous tests slope equations (π2 and π3) are as in Model 4. The time slope (π1) includes a new component (r1), indicating that this slope is assumed to vary significantly across test takers. The results for Model 5 are reported in the second column of Table 8.4. Chi-square tests indicate that the variance component for time (r1) is significantly different from zero (X2 = 1272.36, df = 999, p < .01). The model comparison test results show that modeling the time slope as random across test takers significantly improved model fit (X2 = 13.85, df = 2, p < .01). In other words, different test takers exhibited significantly different rates of linear change in test scores across test occasions. Note, however, that the reliability coefficient is low (.08), suggesting that the variance for the time slope is likely to be close to zero (Raudenbush & Bryk, 2002), which means that the rate of change in PTE scores over time varied across test takers, but the variation was very small (or negligible).

Model 6 added the time-invariant predictor context to the intercept equation at Level 2 in order to assess the association between context and test scores at Test Occasion 1 (i.e., research question 4a). The equations for Model 6 are as follows:

Model 6
Level 1: Y = π0 + π1*(TIME) + π2*(TIME2) + π3*(Number of previous tests) + e
Level 2: π0 = β00 + β01*(Outer Circle) + β02*(Expanding Circle) + r0
         π1 = β10 + r1
         π2 = β20
         π3 = β30        (8.6)

The only difference between Model 6 and Model 5 concerns the intercept equation (π0). This equation now states that the intercept varies as a result of the Level 2 predictor context, represented by two dummy variables (outer circle and expanding circle). The results for Model 6 are reported in the third column of Table 8.4. They show that test takers who took the test in outer circle countries obtained lower scores (by .30 points, on average) at Test Occasion 1 than did those who took the test in inner circle countries, but this difference was not statistically significant. However, test takers who took the test in expanding circle countries obtained significantly lower scores (by 5.04 points, on average) at Test Occasion 1 than did those who took the test in inner circle countries. Adding context as a predictor at Level 2 improved model fit significantly (X2 = 31.51, df = 2, p < .01). Although the dummy variable outer circle was not significantly associated with test scores at Test Occasion 1, it is kept in the model because the interpretation of the significant dummy variable (i.e., expanding circle) depends on the inclusion of all related dummy variables in the model (Field, 2009).

Table 8.4  Models 5–7 for PTE total scores

                                      Model 5             Model 6             Model 7
Fixed Effects (SE)
  Intercept (π0)
    Intercept (β00)                   47.85** (.31)       48.62** (.40)       48.60** (.40)
    Outer Circle (β01)                                    −0.30 (.65)         −0.26 (.66)
    Expanding Circle (β02)                                −5.04** (1.01)      −4.86 (1.01)
  Time (π1)
    Intercept (β10)                   .50** (.13)         .50** (.13)         .56** (.16)
    Total score (β11)                                                         −.03** (.01)
    Outer Circle (β12)                                                        −.04 (.11)
    Expanding Circle (β13)                                                    −.30* (.14)
  Time2 (π2)
    Intercept (β20)                   −.06** (.02)        −.06** (.01)        −.06** (.02)
  Number of previous tests (π3)
    Intercept (β30)                   .97** (.08)         .97** (.09)         0.96** (.09)
Random Effects
  Between-Test Taker Variance (r0)    86.96               84.42               84.64
  X2 (df)                             11011.86** (999)    10693.00** (997)    10844.28** (997)
  Time Slope Variance (r1)            0.16                0.16                0.16
  X2 (df)                             1272.36** (999)     1272.70** (999)     1258.04** (996)
  Within-Test Taker Variance (e)      12.73               12.72               12.54
Reliability
  Intercept (π0)                      .90                 .90                 .90
  Time (π1)                           .08                 .08                 .08
Model Fit
  Deviance (#parameters)              23638.72 (8)        23607.21 (10)       23601.23 (13)
  Model Comparison: X2 (df)           13.85** (2)         31.51** (2)         5.98 (3)
  AIC                                 23654.72            23627.21            23627.23

Notes: * p < .05; ** p < .01; AIC = deviance + 2*(number of parameters) (Luke, 2004, p. 34)

184  Barkaoui and In’nami of the significant dummy variable (i.e., expanding circle) depends on the inclusion of all related dummy variables in the model (Field, 2009). As noted earlier, the time slope varied significantly across test takers, which suggests that the rate of linear change in writing scores over time varied significantly across test takers. In order to explain the variance in the linear change in test scores, Model 7 included cross-level interactions for time (i.e., research question 4b). Specifically, initial test score and context were added in order to estimate the effects of test taker initial ELP and context on the rate of linear change in test scores over time. The equations for Model 7 are as follows: Model  7 Level 1

Model 7
Level 1: Y = π0 + π1*(TIME) + π2*(TIME2) + π3*(Number of previous tests) + e
Level 2: π0 = β00 + β01*(Outer Circle) + β02*(Expanding Circle) + r0
         π1 = β10 + β11*(Total Score at Test 1) + β12*(Outer Circle) + β13*(Expanding Circle) + r1
         π2 = β20
         π3 = β30        (8.7)
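In software that works with a single combined (mixed-model) equation rather than separate Level 1 and Level 2 equations, the cross-level terms in Equation 8.7 appear as products of time with the Level 2 predictors. The sketch below shows one way such a specification might look in Python's statsmodels; the prepared data file and the column names are assumptions carried over from the earlier data-preparation sketch, and this is not the HLM6 setup used in the chapter.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format file prepared as in the earlier sketch:
# score, time, time2, prev_tests, init_score_c, outer, expanding, taker.
df = pd.read_csv("pte_repeaters_long.csv")

# Level 1 terms, Level 2 main effects on the intercept, and cross-level
# interactions of time with initial score and context (Equation 8.7).
formula = ("score ~ time + time2 + prev_tests"
           " + outer + expanding"
           " + time:init_score_c + time:outer + time:expanding")

# Random intercept and random time slope for each test taker.
model = smf.mixedlm(formula, data=df, groups=df["taker"], re_formula="~time")
result = model.fit(reml=False)  # FML, as in the chapter
print(result.summary())
```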

The only difference between Model 7 and Model 6 concerns the time slope equation (π1). This equation now includes three new terms: one (β11) for the effects of initial total score and two (β12 and β13) for the effects of context on the association between time and test scores across test occasions. A significant t-test for any of these terms indicates that the factor significantly moderates the relationship between time and test scores. Results for Model 7 are presented in the last column of Table 8.4. They show that both initial total score and expanding circle had significant negative coefficients. Model fit, however, did not improve significantly (X2 = 5.98, df = 3, p > .05). This is interpreted as weak evidence that initial ELP and context moderated the rate of linear change in test scores over time. Nevertheless, given the exploratory nature of the analyses, Model 7 is considered the final model. Model 7 includes time, time2, and number of previous tests at Level 1, and initial ELP (i.e., total score at Test Occasion 1) and context at Level 2. It specifies changes in test scores across test occasions as a function of time and number of previous tests. Differences in test scores at Test Occasion 1 are specified as a function of differences in context, while differences in the rate of linear change in test scores over time are specified as a function of test taker initial ELP and context. According to Model 7, the average test score at Test Occasion 1 for test takers who took the test in an inner circle country (coded 0) was 48.60. On average, at Test Occasion 1, test takers who took the test in expanding circle countries obtained significantly lower scores by 4.86 points than did those who took

the test in inner circle countries. For each additional test the test takers took, there was a significant increase in total scores of .96 points, on average, across test occasions. This relationship did not vary significantly across test takers. Furthermore, Model 7 suggests that test scores initially increased (i.e., significant linear change), but this trend later slowed down (i.e., significant but negative quadratic change). Specifically, the trajectory initially rose, with an average increase of .56 points for the first month, but the rate of change decelerated over time. With each passing month, the magnitude of the increase in test scores diminished by .06 points. The moment when the quadratic trajectory flips over is −.56/(2[−.06]) = 4.33 months (or 130 days) since Test Occasion 1. That is, for test takers who took the test in inner circle countries, test scores continued to rise for the first four months and then started to decline.

The rate of linear change in test scores, however, varied significantly across test takers. Initial ELP and context seem to explain some of the variance in the rate of linear changes in scores. Specifically, test takers with higher initial test scores tended to have a smaller linear change (by .03 points per month for each one-point increase in initial total score, on average) than did test takers with lower initial total scores. Furthermore, test takers in expanding circle countries tended to have a smaller linear change (by .30 points per month, on average) compared to test takers in inner circle countries.

To allow comparison of the coefficients of the predictors in Model 7, which were measured on different scales, the coefficients were standardized following Hox (2002, p. 21), who proposed the following formula for computing the standardized coefficient for a given predictor: (unstandardized coefficient for the predictor × standard deviation of the predictor) / standard deviation of the outcome variable. Using this formula, the standardized coefficient for expanding circle effects on scores at Test Occasion 1 was −.16. In terms of change over time, number of previous tests had the largest effect (.15), followed by linear change (.10) and quadratic change (−.08). The standardized effect of initial overall ELP on linear change (−.03) was larger than that for the effects of context (−.01).

Context seems to contribute significantly to variability in test scores at Test Occasion 1 as well as variability in the rate of change in scores over time. Context could be associated with other factors and experiences (e.g., opportunities to learn and use English) that were not measured in this study but that could account for the relationship between context and PTE scores. It is also possible that test takers who repeated the test multiple times prepared for the test and/or engaged in other activities to improve their English before each test occasion. This could explain the significant association between number of previous tests taken and test scores over time.

The final model accounted for 22% of the within-person (across test occasions) variance in test scores ([16.16 − 12.54]/16.16 × 100), 4% of the variance between test takers ([88.73 − 84.64]/88.73 × 100), and none of the variance in the rate of linear change in test scores ([0.16 − 0.16]/0.16 × 100 = 0%).
As the significant variance coefficients for the intercept (84.64) and the time slope (0.16) indicate, the remaining variance in test scores and in the rate of linear change in

scores among test takers is not explained by the final model. Other test taker factors and covariates (e.g., test preparation, English instruction received between tests) may explain the remaining variance.
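The proportional reduction in variance reported above, and the Hox (2002) standardization formula, both reduce to simple arithmetic on the values in Tables 8.3 and 8.4; a minimal sketch:

```python
# Proportion of variance explained by the final model, relative to the
# baseline variance components used in the chapter (Tables 8.3 and 8.4).
within_base, within_final = 16.16, 12.54
between_base, between_final = 88.73, 84.64
slope_base, slope_final = 0.16, 0.16

print((within_base - within_final) / within_base)     # within person (reported as 22%)
print((between_base - between_final) / between_base)  # between test takers (reported as 4%)
print((slope_base - slope_final) / slope_base)        # time slope (0%, none explained)

# Standardized coefficient (Hox, 2002, p. 21):
# unstandardized coefficient * SD(predictor) / SD(outcome).
def standardized(b, sd_predictor, sd_outcome):
    return b * sd_predictor / sd_outcome
```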

Limitations

The previous findings should be interpreted with caution given the exploratory and correlational nature of the analyses and other limitations. First, the study did not consider variation in test forms across test takers and test occasions. Instead, it was assumed that the different forms of the PTE that were administered to the test takers in the data set were equivalent.6 Variability in test forms across test takers and test occasions could explain some of the remaining variance in test scores between and within test takers. Second, while the interval between test occasions varied between 2 and 349 days, no information was available as to whether any of the test takers included in the study engaged in any test preparation and/or activities to improve their ELP before or between tests. These variables (i.e., test preparation and English instruction) could explain the association between the number of previous tests taken and test scores over time as well as some of the variance in test scores across test takers and test occasions. Third, no information external to the test was available about the ELP of the test takers (e.g., scores on another ELP test or English course placement) to examine how differences in L2 proficiency relate to changes and differences in PTE scores. Finally, in MLM, the repeatedly measured variable is assumed to be free of measurement error (Hox, 2002; Preacher et al., 2008). Obviously, every test score involves some degree of measurement error, and this error is compounded when scores from repeated measures are examined. Latent growth modeling (LGM) addresses this limitation, as it allows modeling measurement error.7 Nevertheless, the previous analyses show how MLM can be used to estimate and explain variance in repeaters' test scores across test takers and test occasions.

Conclusion

This chapter has illustrated how MLM can be used to estimate the contribution of various factors to variability in repeaters' test scores across test takers and test occasions. Such research can provide important evidence concerning the validity (or lack thereof) of inferences based on repeaters' L2 test scores. For example, examining sources of variability in repeaters' test scores over time can help determine whether the main sources of variability in repeaters' scores over time are construct-relevant factors, such as changes in L2 ability (e.g., as measured through other L2 proficiency tests or L2 course placement), or other, construct-irrelevant factors, such as practice with the test (cf. Barkaoui, 2017, 2019). Such analyses can also help estimate how long it takes test takers to see significant and real score gains (Gu et al., 2015). Such research, however, needs to include relevant variables such as amount and nature of English

language instruction received before and between test occasions as well as relevant characteristics of test takers (e.g., L1, age, motivation) and context.

The data presented here illustrate how MLM can be used to examine and model variation in growth patterns within and among groups as well as among individual test takers. Growth patterns at the group level are captured by the time (π1) and time-squared (π2) coefficients. Within-group (or interindividual) differences in growth patterns are reflected in the variability of the time (r1) and time-squared coefficients across test takers; some individual test takers might not have followed the average growth pattern of their groups. Another advantage of MLM, not illustrated in this chapter, is that it can be used to model intraindividual variability in growth. As Borsboom (2005) explained, intraindividual growth refers to individual-specific growth patterns, such that one individual might exhibit a quadratic growth pattern while another exhibits a linear growth pattern. Each individual can thus have a specific growth pattern. At the group level, however, these individual patterns, taken together, might look like and be explained by a U-shaped quadratic function. HLM6 allows the visual inspection and comparison of individual test takers' growth patterns.

MLM can also be used to examine changes in test scores in relation to amount and nature of L2 instruction in studies using a pre-test/post-test design (e.g., Elder & O'Loughlin, 2003; Ling et al., 2014). This can be achieved by adding specific variables related to instruction (e.g., amount of instruction, type of instruction, etc.) as well as other relevant individual and contextual variables (e.g., test taker L1, age, context) at Level 2 to examine their effects on changes in test scores over time. Such lines of research can enhance our understanding of the effects of various individual, contextual, and instructional factors on L2 test performance as well as L2 development. This line of research can also address important questions concerning the shape and rate of change in L2 ability over time, how long it takes to observe a meaningful change (whether gain or decline) in L2 ability, how and why patterns of L2 development over time vary across individuals, the individual and contextual factors that influence the rate and shape of change in L2 ability over time, and whether and how differences in initial L2 proficiency relate to variability in between-person changes in L2 development over time (see Ortega & Iberri-Shea, 2005). Answering these questions can contribute to the validity arguments of L2 tests as well as to theories on L2 performance and development.

Notes

1. Centering facilitates the interpretation of the coefficients in MLM. In grand-mean centering, the grand mean (that is the mean across all Level 2 units) is subtracted from each score on the predictor for all Level 2 units (i.e., test takers) (see Barkaoui, 2014). 2. Dummy variables: “A way of recoding a categorical variable with more than two categories into a series of variables all of which are dichotomous and can take on values of only 0 or 1” (Field, 2009, p. 785). 3. Reliability is the ratio of the true parameter variance to the observed variance (which consists of true and error variances) (Deadrick, Bennett, & Russell, 1997).


4. These percentages do not include an estimate of the contribution of error variance.
5. The formula for computing "the moment when the quadratic trajectory curve flips over, at either a peak or a trough is (−π1/2π2)" (Singer & Willett, 2003, p. 216), which in this study is −1.65/(2[−.14]) = 5.89 months.
6. Obviously, evidence concerning the equivalence of test forms needs to be collected and provided as well.
7. However, unlike MLM, LGM cannot handle an unequal number and irregular spacing of test occasions within and across test takers (Hox, 2002; Preacher et al., 2008).

References
Barkaoui, K. (2013). An introduction to multilevel modeling in language assessment research. Language Assessment Quarterly, 10(3), 241–273.
Barkaoui, K. (2014). Quantitative approaches to analyzing longitudinal data in second-language research. Annual Review of Applied Linguistics, 34, 65–101.
Barkaoui, K. (2016). What changes and what doesn't? An examination of changes in the linguistic characteristics of IELTS repeaters' writing task 2 scripts. IELTS Research Report Series, No. 3. Retrieved from https://www.ielts.org/-/media/research-reports/ielts_online_rr_2016-3.ashx
Barkaoui, K. (2017). Examining repeaters' performance on second language proficiency tests: A review and a call for research. Language Assessment Quarterly, 14(4), 420–431.
Barkaoui, K. (2019). Examining sources of variability in repeaters' L2 writing scores: The case of the PTE-Academic writing section. Language Testing, 36(1), 3–25.
Bijeleveld, G. C. J. H., van der Kamp, L. J. T., Mooijaart, A., van der Kloot, W. A., van der Leeden, R., & van der Burg, E. (1998). Longitudinal data analysis: Designs, models, and methods. Thousand Oaks, CA: Sage Publications.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge, UK: Cambridge University Press.
Deadrick, D. L., Bennett, N., & Russell, C. J. (1997). Using hierarchical linear modeling to examine dynamic performance criteria over time. Journal of Management, 23(6), 745–757.
Elder, C., & O'Loughlin, K. (2003). Investigating the relationship between intensive EAP training and band score gain on IELTS. In IELTS Research Reports, Vol. 4 (pp. 207–254). IELTS Australia and British Council.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage Publications.
Green, A. (2005). EAP study recommendations and score gains on the IELTS academic writing test. Assessing Writing, 10(1), 44–60.
Gu, L., Lockwood, J. R., & Powers, D. E. (2015). Evaluating the TOEFL Junior® standard test as a measure of progress for young English language learners (Research Report ETS RR-15-22). Princeton, NJ: Educational Testing Service.
Henry, K. L., & Slater, M. D. (2008). Assessing change and intraindividual variation: Longitudinal multilevel and structural equation modeling. In A. F. Hayes, M. D. Slater, & L. B. Snyder (Eds.), The Sage sourcebook of advanced data analysis methods for communication research (pp. 55–87). Los Angeles, CA: Sage Publications.
Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Mahwah, NJ: Lawrence Erlbaum Associates.
Kachru, B. (1992). The other tongue: English across cultures (2nd ed.). Urbana, IL: University of Illinois Press.
Koizumi, R., In'nami, Y., Asano, K., & Agawa, T. (2016). Validity evidence of Criterion® for assessing L2 writing performance in a Japanese university context. Language Testing in Asia, 6(5), 1–26.
Kozaki, Y., & Ross, S. J. (2011). Contextual dynamics in foreign language learning motivation. Language Learning, 61(4), 1328–1354.
Ling, G., Powers, D. E., & Adler, R. M. (2014). Do TOEFL iBT scores reflect improvement in English-language proficiency? Extending the TOEFL iBT validity argument (Research Report No. RR-14-09). Princeton, NJ: Educational Testing Service.
Luke, D. A. (2004). Multilevel modeling. Thousand Oaks, CA: Sage Publications.
Luke, D. A. (2008). Multilevel growth curve analysis for quantitative outcomes. In S. Menard (Ed.), Handbook of longitudinal research: Design, measurement, and analysis (pp. 545–564). New York: Academic Press.
Newsom, J. T. (2012). Basic longitudinal analysis approaches for continuous and categorical variables. In J. T. Newsom, R. N. Jones, & S. M. Hofer (Eds.), Longitudinal data analysis: A practical guide for researchers in aging, health, and social sciences (pp. 143–179). New York: Routledge.
O'Loughlin, K., & Arkoudis, S. (2009). Investigating IELTS exit score gains in higher education. In IELTS Research Reports, Vol. 10 (pp. 1–86). IELTS Australia and British Council.
Ortega, L., & Iberri-Shea, G. (2005). Longitudinal research in second language acquisition: Recent trends and future directions. Annual Review of Applied Linguistics, 25, 26–45.
Preacher, K. J., Wichman, A. L., MacCallum, R. C., & Briggs, N. E. (2008). Latent growth curve modeling. Los Angeles, CA: Sage Publications.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. (2004). HLM6: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford, UK: Oxford University Press.
Taris, T. W. (2000). A primer in longitudinal data analysis. Thousand Oaks, CA: Sage Publications.
Twisk, J. W. R. (2003). Applied longitudinal data analysis for epidemiology: A practical guide. Cambridge, UK: Cambridge University Press.
Weinfurt, K. P. (2000). Repeated measures analyses: ANOVA, ANCOVA, and HLM. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 317–361). Washington, DC: American Psychological Association.
Xi, X. (2008). Methods of test validation. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education (2nd ed.), Volume 7: Language testing and assessment (pp. 177–196). Boston, MA: Springer Science+Business Media.
Zhang, Y. (2008). Repeater analyses for TOEFL iBT (Research Memorandum 08-05). Princeton, NJ: Educational Testing Service.

Section III

Nature-inspired data-mining methods in language assessment

9

Classification and regression trees in predicting listening item difficulty Vahid Aryadoust and Christine Goh

Introduction
While predicting test item difficulty has been an important stream of research in language assessment (e.g., Grant & Ginther, 2000; Sheehan & Ginther, 2001), most of the studies have employed linear models that have limiting assumptions such as linearity (representing the relationship between variables as a straight line) and normality (assuming that the data spread out like a bell curve) (Perkins, Gupta, & Tammana, 1995). If the assumptions of linearity and normality are met, the linear models can be highly useful (see Chapter 6, Volume II), but because test takers' cognitive processes leading to item difficulty are not strictly linear or normally distributed (Gao & Rogers, 2010), linear modeling for predicting item difficulty, such as regression modeling, might not be the optimal choice. As suggested by Perkins et al. (1995), most of the regression-based studies have not reported on the investigation of the assumptions of regression models (see Chapter 6, Volume II), and one might suspect that their low predictive power stems from violations of these assumptions. To address this limitation, we introduce a nonparametric classification data-mining technique called classification and regression trees (CART). Data mining refers to a group of advanced quantitative methods that are used to detect hidden patterns such as linear and nonlinear relationships and trends in data (Winne & Baker, 2013). Major applications of data mining include clustering, classification, prediction, and summarization, which are performed using, for example, decision trees and artificial neural networks (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). CART is one of several decision tree methods that are used in data mining and machine learning to optimize the precision of prediction and classification of outcomes. CART can identify the nonlinear relationships that connect independent variables (input) and dependent variables (output) (Breiman, Friedman, Olshen, & Stone, 1984); it does so by splitting observations (e.g., item difficulty in this chapter) into homogeneous subgroups or subspaces relative to the input variables (Breiman, 2001; Spoon et al., 2016). Compared with linear regression (see Chapter 6, Volume II), nonlinear regression (Chapter 10, Volume II), and hierarchical methods (see Chapter 7, Volume II), which constitute useful

"global" techniques, CART specifies granular IF-THEN rules, each of which captures complexity in one particular segment or subspace of the data. This chapter presents CART as a method to predict relationships between listening test item difficulty and textual features, which represent levels of mental representation. The methodology presented comprises several main steps (Aryadoust & Goh, 2014): delineating the theoretical framework, determining whether the analysis is a classification or a prediction problem, selecting optimal cross-validation methods, estimating the sensitivity, specificity, and receiver operating characteristic (ROC) curve of the model, extracting IF-THEN rules, and interpreting the results.

Classification and regression trees (CART)
Morgan and Sonquist's (1963) pioneering work in Automated Interaction Detection can be viewed as the main source of inspiration for nonparametric classification techniques including CART (Breiman et al., 1984; Quinlan, 1986, 1993). Decision trees is an umbrella term used to refer to a group of nature-inspired data-mining methods that are used in different fields including machine learning and statistics. When the dependent variable is continuous, decision trees (and, therefore, CART) are called predictive models, and when the dependent variable is discrete (categorical), the models are called classification trees (Strobl, Malley, & Tutz, 2009). In the most general terms, the purpose of CART modeling is to determine a set of IF-THEN rules that allow for accurate and precise classification or prediction of dependent variables (Breiman, 2001). Consider a group of test designers who wish to determine what accounts for (predicts) the difficulty level of the test items assembled in an item bank. They need to set predictive rules to envision estimated difficulty levels for the items before administering them to the target population. This means that they will have to choose items of the right difficulty for the students by controlling for the factors that determine or influence item difficulty. Here, the predicting factors are notated as X1, X2, X3, …, and Xn and item difficulty, which is the dependent variable, as Y. It is assumed that Y is a function of the predicting factors; that is, if we know the amount of X1, X2, X3, …, and Xn, then we can predict the amount of Y. The test developers in this scenario are searching for a useful method to derive predictions and estimate the amount of Y when the Xs change. This prediction can be described either in the form of linear functions (e.g., a regression model as discussed in Chapter 6, Volume II) or a nonlinear approach such as IF-THEN rules generated by CART (other pertinent nonlinear approaches include symbolic regression, which is discussed in Chapter 10, Volume II). In this example, the dependent variable, or Y, is a continuous variable with values ranging between a minimum and maximum index. In other scenarios, the dependent variable can be categorical (such as levels of proficiency) or binary (such as scores of dichotomous test items, or 0 and 1); here, the CART algorithm becomes a classification machine predicting the categorical and binary variables.
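To make the item-bank scenario concrete, the following minimal sketch fits a regression tree to synthetic data, predicting a continuous Y from four hypothetical predictors; a classification tree would be used instead if Y were categorical. The data, variable names, and settings are invented for illustration, and scikit-learn is used here as an open-source stand-in for the Salford Systems' CART software discussed later in this chapter.

```python
# Minimal sketch (assumed, synthetic data): a regression tree predicting a
# continuous difficulty index Y from hypothetical predictors X1...X4.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 4))                     # four hypothetical item features
# Nonlinear, invented "difficulty" signal plus noise
y = 0.8 * X[:, 0] - 0.5 * (X[:, 1] > 0) + rng.normal(scale=0.2, size=150)

reg_tree = DecisionTreeRegressor(max_depth=3, random_state=4).fit(X, y)
print("R-squared on the training data:", reg_tree.score(X, y))
```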

We discuss the underlying assumptions of CART in a hypothetical example in the following section. This example is provided to help readers, especially those who are new to CART modeling, develop a thorough understanding of the capabilities and requirements of CART.

A hypothetical example
Figure 9.1 presents a hypothetical scenario wherein the fairness of an assessment is questionable. In this scenario, a group of students take a language test that is used as an instrument to grant entry to university applicants. It has been claimed that the decisions made by test designers are influenced by sources of construct-irrelevant variance, most prominently gender and financial status of the test takers. Here, we aim to investigate whether this claim is supported by exploring the decisions made (pass/fail) as a function of gender and financial status. Rather than establish a linear relationship (e.g., regression) between the decisions (Y) and gender and financial status (X1 and X2, respectively), we model a hierarchical relationship between the Xs and Y. Our investigation shows that 40% of the applicants, namely those who did not obtain the cutoff score (6/10), failed; this group is represented by the first right branch or node on top. However, we have also observed that a number of the applicants who achieved the cutoff score were denied entry. Unable to find a logical explanation, we suspect that the decision as to whether the applicants scoring 6/10 and above are admitted might have been influenced by other factors, which we earlier speculated to be gender and financial status.

Figure 9.1  A hypothetical classification model for decision-making in a biased language assessment.

Let us look at the other nodes in the graph. As demonstrated in Figure 9.1, if a male applicant achieves the cutoff score, his chances of admission are far lower (a 35% failure rate) than a female applicant's (♀). In other words, the males (♂) who have achieved the cutoff score may not be admitted. The remainder of the applicants are female, but only those who have a high financial status have a chance of making it to the program (10%). This hypothetical scenario can be notated as IF-THEN rules as follows:
IF score < 6/10, THEN fail. (Rule 1)
IF score > 6/10, THEN determine the gender first. (Rule 2)
IF score > 6/10 and male, THEN fail. (Or "IF score > 6/10 and female, THEN pass".) (Rule 3)
IF score > 6/10, female, and wealthy, THEN admit, otherwise fail. (Rule 4)
In Figure 9.1, the top node "Did the student achieve 6/10?" is called the root node; it has two leaves or daughter nodes, "fail" and "gender," each representing the data partitions described by a simple IF-THEN rule as previously discussed. (Note that in actual CART modeling, the variables that are chosen to partition the dependent variable would have to minimize the classification error. If gender in Figure 9.1 results in the misclassification of the passing and failing students, then another variable has to be chosen for higher accuracy.) To elaborate on this concept, look at the first two daughter nodes, fail (right side) and gender (left side). All applicants in the gender node actually have scored higher than 6/10 (which is the cutoff), and there are no applicants in this node with a score less than the cutoff score. Thus, it is an error-free classification. CART algorithms start with a leading question that minimizes error in partitioning (i.e., it maximizes information about the dependent variables), resulting in the generation of a root node and two leaves. The leading question in the top node in Figure 9.1 is "Did the applicant achieve 6/10?" and in the left leaf is "What is the student's gender?" If these questions do not maximize information, another question will be tested to achieve the maximum information. Some partitioning techniques that are used in CART modeling include Gini, Symmetric Gini, Entropy, and Class Probability.1 There are a number of underlying concepts in CART analysis that are discussed and demonstrated in the next section.
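The following sketch shows how such IF-THEN rules can be recovered from data with a CART-style algorithm. It simulates the hypothetical admission scenario above and prints the decision rules of the fitted tree; the simulated data and cutoffs are invented, and scikit-learn's decision tree is an open-source analogue of, not the same implementation as, the Salford Systems' CART used later in this chapter.

```python
# Minimal sketch (assumed, simulated data): recovering IF-THEN rules for the
# biased admission scenario with a classification tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200
score = rng.uniform(0, 10, n)        # test score out of 10
male = rng.integers(0, 2, n)         # 1 = male, 0 = female
wealthy = rng.integers(0, 2, n)      # 1 = high financial status

# Simulate the biased decision rule described in the text
admitted = (score >= 6) & (male == 0) & (wealthy == 1)

X = np.column_stack([score, male, wealthy])
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, admitted)

# The fitted tree can be read off as IF-THEN rules, analogous to Rules 1-4
print(export_text(tree, feature_names=["score", "male", "wealthy"]))
```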

Description of CART In this section, we discuss in more detail the underlying concepts of CART analysis. The concepts include cross-validation, training and testing sets, pruning, and fit statistics.

Cross-validation One of the important advantages of CART is cross-validation, which is a method to investigate how the results of CART modeling would generalize to


Figure 9.2  Train-test cross-validation (top) versus k-fold cross-validation (bottom).

an independent data set. As demonstrated in Figure 9.2, there are two methods for cross-validation: train-test cross-validation and k-fold cross-validation. In the train-test cross-validation method, first an optimal CART solution with minimal error is generated by the learning algorithm exploring the training set. The model comprises a set of IF-THEN rules that are able to predict the amount of Y or dependent variable with the lowest amount of error (see Breiman, 1996, 2001, for technical information). Subsequently, the identified model is fitted into the validation or testing set (also known as the hold-out set), the independent data sample, to determine the amount of classification or prediction error across this sample. If the amount of error generated in the validation stage is also low, the model can be accepted as reliable. In contrast, if validation returns poor fit (i.e., high error and low R 2), the model is not reproducible or generalizable and has to be abandoned in favor of a more reliable solution (Breiman, 1996). For example, for the aforementioned Rules 1, 2, 3, and 4 to be reproducible, they must be able to predict the pass ⁄fail decision in any given validation set with the same parameters (i.e., gender, financial status, and cutoff score); otherwise, the rules yielded would not be reliable. Another method of validation in CART modeling is k-fold cross-validation (Nadeau & Bengio, 2003), which is presented in Figure 9.2. Each of the six sets (aka bins or folds) in Figure 9.2 constitutes an independent set that is used in both training and testing. For training the algorithm, first, one of the sets is randomly chosen and left out, resulting in five remaining sets (this is notated as “k-1,” with k being the number of sets). Second, the five remaining sets are used to train the algorithm and arrive at the optimal model. Third, the optimal model is tested across the left-out set. The three stages are repeated by leaving out another set from the six sets each time, finding the optimal model, and then fitting it into the left-out set. In the current scenario in Figure 9.2, this would result in a six-fold cross-validation procedure because the data have been split into six equally sliced sets (Nadeau & Bengio, 2003). At the end, the amount of error in testing procedures is averaged across the six cross-validation procedures. In k-fold cross-validation, therefore, all data will be used in both training and testing, as opposed to the train-test set cross-validation where training sets are not used in the cross-validation set (Breiman, 1996). Cross-validation is an essential stage in CART modeling. The goal of CART analysis is to generalize the results to unknown situations and the testing or cross-­ validation sets represent those situations. Cross-validation also precludes overfitting, a situation where a model fits the training set perfectly or closely but it would fail

to fit any testing set (Tetko, Livingstone, & Luik, 1995). In an overfitting model, fit statistics and parameter estimates are misleading because the model emulates not only the variance but also the random noise in the data. As such, the captured relationships between independent and dependent variables are unreliable or false. It is crucial to discover a model that approximates the relationship between variables in the population rather than in a random sample drawn from it. Both train-test and 10-fold cross-validation methods are performed and compared in the present study.
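The two cross-validation strategies can be sketched as follows with scikit-learn; the synthetic data stand in for any set of predictors and a binary dependent variable and are not the chapter's data.

```python
# Minimal sketch (assumed, synthetic data): hold-out versus k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Train-test (hold-out) cross-validation: learn on one subsample, test on the other
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=1)
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# k-fold cross-validation: each fold is held out once and the results are averaged
folds = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=folds)
print("10-fold mean accuracy:", scores.mean())
```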

Caveats of cross-validation There are a few caveats concerning the cross-validation techniques discussed earlier. The train-test method has been critiqued as not being able to consider or account for the observed variance in the training set. Therefore, it is not ideal for comparing CART algorithms (Dietterich, 2000). That is, while there should be a maximum number of data points in both training and testing sets, there is an inevitable trade-off between the size of training and testing sets: every data point that is taken from the training sample and used in the testing set results in the loss of a part of the data for algorithm training. Of course, if the samples are large enough to represent the population, this caveat will not be a serious problem anymore. Another caveat is that the train-test cross-validation is not appropriate for small samples because splitting small samples into train and test sets would result in much smaller samples and affect the accuracy of modeling. The caveat of k-fold cross-validation is that the k repetitions are dependent on one another to various extents, that is, the training data in a sample will overlap or be used for testing at other times, causing bias in modeling the variance. In very large data sets, it might be possible to generate multiple independent k-fold train-test samples and perform training and cross-validation procedures on each of them independently and, subsequently, to average the observed error and R 2 (Bengio & Grandvalet, 2004). While this procedure is useful, it would need a significantly large amount of data, which are often not easy to collect.

Growing rules, stopping rules, and pruning the trees Tree-growing (aka splitting and partitioning) refers to the process of selecting an independent variable that best separates or partitions the dependent variable into two nodes. There are multiple tree-growing rules that are used by the Salford Systems’ CART©, the CART algorithm used in this chapter, such as Gini index (aka Gini impurity), Symmetric Gini, twoing and ordered twoing, and class probability for classification trees. It is suggested that these tree-growing rules be compared to determine the best method for the data set at hand. An in-depth discussion of these techniques falls outside of the scope of this chapter, but since Gini is the default tree-growing technique in the Salford Systems’ CART ©, it will be discussed briefly here. Gini is an index to measure the purity of the generated nodes. That is, it measures the level of “mixedness” of the data assigned to each node. For example, if

the aim is to differentiate pass from fail cases in the aforementioned example of the university entrance examination, a 50%–50% pass–fail distribution in the nodes would return a high Gini index, indicating that pass and fail cases have not been differentiated maximally in the nodes. Where the dependent variable is binary, the poorest Gini index is 0.50 (indicating the highest level of noise) and the best is 0.00 (indicating no noise). CART uses stopping rules to determine when the partitioning of the output variable into left and right nodes should stop. Stringent stopping rules can result in premature tree-growing. To preclude this issue, the stopping rules can be relaxed so that a larger tree (maximal CART tree) can be grown and then pruned back to an optimal tree with the best fit for the training and testing sets (Breiman et al., 1984). Pruning includes reducing the size of the tree (like trimming it) and improving the parsimony and sensibility of CART models. The CART algorithm used in this chapter (the Salford Systems' CART©) leverages the concept of overgrowing trees that constitute the largest possible trees with no specific stopping rules (Salford Systems, 2018). This is to ensure that the model will not stop partitioning prematurely, unlike other CART algorithms that use stringent stopping rules and can result in prematurely grown trees. The overgrown (maximal) trees are then pruned back into an optimal tree that would fit both the training and testing sets.
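As a rough illustration of these ideas, the sketch below computes the Gini impurity of a node from class proportions and contrasts an overgrown tree with a pruned one. Scikit-learn's cost-complexity pruning (the ccp_alpha parameter) is an open-source analogue of, not identical to, the pruning implemented in the Salford Systems' CART, and the data and pruning value are invented.

```python
# Minimal sketch (assumed, synthetic data): node impurity and pruning back a maximal tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0.5 is worst for binary)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))   # 0.5 -> maximally mixed node
print(gini([1, 1, 1, 1]))   # 0.0 -> pure node

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
maximal = DecisionTreeClassifier(random_state=2).fit(X, y)                  # grown without a stopping rule
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=2).fit(X, y)   # pruned-back tree
print(maximal.tree_.node_count, "->", pruned.tree_.node_count)
```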

Estimation of fit
CART analysis yields a number of trees and, to select the optimal one, one has to compute and compare several fit statistics: R2 (aka the coefficient of determination), sensitivity, and specificity (Jensen, Muller, & Schafer, 2000). R2 shows how closely the data fit the predicted values, with higher R2 values indicating better fit. Sensitivity and specificity are computed using Equations 9.1 and 9.2:

Sensitivity = No. of true positives / (No. of true positives + No. of false negatives)  (9.1)

Specificity = No. of true negatives / (No. of true negatives + No. of false positives)  (9.2)

An optimal classification would achieve 100% sensitivity and specificity, but in reality, there is a trade-off between the two. The trade-off can be shown graphically as a receiver operating characteristic (ROC) curve (Swets, 1996) that plots the percentage of true positives (TP, or sensitivity) against false positives (FP, or 1 − specificity). By examining the area under the ROC curve, which ranges between 0 and 1, one will obtain further proof as to whether the model had a good fit to the data (Zhou & Qin, 2005).
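Equations 9.1 and 9.2 and the area under the ROC curve can be computed directly from a confusion matrix, as in the sketch below; all predictions and probabilities shown are invented for illustration.

```python
# Minimal sketch (assumed, invented predictions): sensitivity, specificity, and ROC area.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]                 # 1 = positive class
y_pred  = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]                 # predicted classes
y_score = [.9, .8, .4, .2, .1, .7, .95, .3, .6, .2]      # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)     # Equation 9.1
specificity = tn / (tn + fp)     # Equation 9.2
print(sensitivity, specificity, roc_auc_score(y_true, y_score))
```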

Variable importance index CART is a nonparametric technique and, accordingly, does not estimate regression weights. Instead, the variable importance index (VII) is estimated to show

the amount of contribution of each independent variable to the tree (Salford Systems, 2018). VII ranges between 0 and 100, with the variable in the root node (the node on top) often being the most important variable (VII = 100). VII indicates how effectively a variable contributes to the splitting of the dependent variable.
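A loose analogue of the VII can be obtained from impurity-based feature importances rescaled to a 0–100 range, as sketched below on synthetic data; the Salford software computes its index differently, so this is an approximation rather than a replication.

```python
# Minimal sketch (assumed, synthetic data): rescaled feature importances as a VII analogue.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=3)
tree = DecisionTreeClassifier(random_state=3).fit(X, y)

vii = 100 * tree.feature_importances_ / tree.feature_importances_.max()
for name, value in zip([f"X{i+1}" for i in range(X.shape[1])], vii):
    print(f"{name}: {value:.1f}")
```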

CART in language assessment
Despite its potential, CART modeling has not been extensively used in the language assessment literature. We will review two studies carried out by Gao and Rogers (2010) and Xi, Higgins, Zechner, and Williamson (2012). Gao and Rogers (2010) used CART to validate a cognitive model predicting the difficulty level of 40 reading test items from two test forms (each comprising 20 items). They generated the independent variables based on expert judgments and test takers' perceptions of their cognitive processes and then validated the model by using CART and multiple regression. In model training, the researchers reported R2 values of 0.9793 and 0.9929 for the two test forms, indicating that almost all variability in item difficulty was explained by the independent variables across the two test forms. As ideal as these results might appear, the models were not cross-validated, and the high R2 values are likely due to overfitting, which was discussed previously. In another study, Xi et al. (2012) used CART and linear regression to validate two automated systems for the scoring of oral proficiency in English. For CART modeling, the authors used the Gini index and performed 10-fold cross-validation. They concluded their study by claiming that multiple regression was preferable to CART modeling due to its parsimony and interpretability, although the models yielded comparable precision. It should be noted that neither of the two studies discussed here presented sensitivity, specificity, VII, or ROC. Another point to consider in CART modeling is that when data have a linear shape (as indicated by the linear correlation between the dependent and independent variables), linear regression models would be a more plausible choice than CART, and this could have been a reason for Xi et al.'s findings. In the next section, we will showcase the application of CART in predicting test item difficulty in language assessment. This illustrative example is based on the construction-integration model of comprehension (CI), a theoretical framework of comprehension (Kintsch, 1998; Kintsch & Van Dijk, 1978) that is reviewed in detail to help readers see how CART can be incorporated into theoretically motivated studies.

Sample study: Using CART to predict item difficulty
Theoretical framework of the sample study and Coh-Metrix
The aim of the study reported here is to differentiate easy listening test items from difficult test items using their psycholinguistic features. We use the construction-integration (CI) model of comprehension as the theoretical

framework of the study. We operationalize CI by using Coh-Metrix, which is a computational tool used for measuring surface and deep-level features of texts. We will provide a brief introduction to Coh-Metrix. The development of Coh-Metrix is informed by the CI model of comprehension, which represents comprehension as a three-level cognitive activity generating the surface level, the textbase, and the situation model (Kintsch, 1998). The surface level constitutes the linguistic level of comprehension (Kintsch & Mangalath, 2011) and consists of syntactical components that parse sentences and lexical components that retrieve word meaning (Kintsch, 2004). The textbase comprises the propositional representation of the sentences in the text but does not include comprehenders' implicit knowledge or inferences (Kintsch, 2004). Comprehenders at this level generate micropropositions, which represent local meaning, and macropropositions, which represent global representations of the text. Finally, the situation model is generated to fill in the gaps of the textbase through inference-making and generating links between the propositions (Kintsch, 1998). Coh-Metrix can be used to quantify some dimensions of the three aforementioned levels of comprehension.

Surface level
In addition to word count and verb/noun occurrences, two groups of textual features are measured at the surface level:
1 Syntactic density comprises noun phrases, preposition phrase density (incidence of preposition phrases), and left-embeddedness (the number of words before the main verb, which can also be considered an index of memory load).
2 Syntactic complexity comprises indices for adverbials, verbs, nouns, prepositions, passive phrases, negations, and infinitives.
Higher syntactic complexity leads to informationally dense texts that could be more difficult to process. Nevertheless, there is no agreement on whether syntactic density and complexity can exert a significant influence on text quality/difficulty. For example, whereas Crossley and McNamara (2010) found that syntactically easy texts were easily parsed, McNamara, Crossley, and McCarthy (2010) found no meaningful relationship between syntactic indices and the level of text difficulty (perceived by readers). In the Coh-Metrix literature, different dimensions of vocabulary are also measured such as CELEX (the Dutch Centre for Lexical Information) and Medical Research Council (MRC) indices such as imageability (a measure of concreteness), which are not relevant to the context of this study. The Flesch-Kincaid grade level index can also be used to measure text difficulty associated with the generation of the surface level representation. Coh-Metrix measures Flesch-Kincaid grade level as well.

The textbase The textbase consists of the representation of textual propositions and their explicit links (Kintsch, 1998). Coh-Metrix operationalizes various dimensions of

the textbase such as coreferences, a linguistic property that facilitates comprehension by using connecting words, phrases, and statements (McNamara & Kintsch, 1996). Some of the coreference indices that estimate lexical diversity include the measure of textual lexical diversity (MTLD) and type-token ratio (or TTR, which is the ratio of unique vocabulary in the text to total word frequency). In addition, the textbase also emerges from the comprehension of the incidence of logical connectives (words connecting two or more clauses in a grammatically correct fashion) as well as temporality (a measure of verb tense and aspect).

The situation model
The situation model is the deepest level of mental representation that surpasses the explicit textual meaning and the textbase. It refers to the mental representation that the comprehenders produce to describe the text at local and global levels (Tapiero, 2007). Situation models emerge from multiple text dimensions including emotions, time, intentionality, space, and causality that can lead to constructing coherent texts (Tapiero, 2007). Coh-Metrix uses multiple modules such as latent semantic analysis (LSA, Dumais, 2005), syntactic parsers (Charniak, 2000), and WordNet (Fellbaum, 2005) to measure the dimensions of the situation model. One of the measures of the situation model is LSA, which is a statistical approach to describing word knowledge (Kintsch, 1998). Some LSA indices include LSA for adjacent sentences, lexical connections, adjacent paragraphs, verbs, and given/new information. Sentence overlap, computed in Coh-Metrix by using WordNet, is another measurable dimension of the situation model that represents coherence in the text. Other commonly used indices for quantifying lexical dimensions of the situation model are hypernymy or superordination (general word classes that subsume other words) and polysemy (the presence of multiple meanings per word). These, however, have often (if not always) been unable to predict text difficulty in previous studies (Crossley & McNamara, 2010). It should be noted that research into listening using Coh-Metrix is significantly limited in scope and depth, but there are many studies that investigate the application of Coh-Metrix in reading and writing (e.g., Aryadoust & Sha, 2015). As research into listening using Coh-Metrix is limited, this section will discuss its application in reading research. In one study, the interplay of the readability of second language (L2) texts with content word overlap, lexical frequency, and syntactic similarity was explored by Crossley, Greenfield, and McNamara (2008). The researchers reported that Coh-Metrix was a more accurate indicator of reading difficulty than Flesch Reading Ease and Flesch-Kincaid Grade Level (R2 = .86). Adopting a similar approach, Crossley, Allen, and McNamara (2012) reported that the Coh-Metrix readability index would yield better results in classifying texts into the three levels than the other conventional readability formulas. We recognize that Coh-Metrix variables also appear to be more suited to the comprehension of written text. In contrast, listening comprehension requires

processing of spoken texts, which contain some grammatical and lexical features that differ from those of written texts. For example, spoken texts tend to have more connectives such as "and" that join a number of utterances together. There may also be fewer clausal embeddings than in written language, which displays a higher degree of grammatical complexity. Nevertheless, with this caveat about the difference between written and spoken texts, and with some caution exercised in interpreting the results, Coh-Metrix may be applied to investigate listening difficulty at the various levels of comprehension.

Methodology
Data and test materials
The data and test materials for this study were from the Michigan English Test (MET) listening, which was provided by Michigan Language Assessment (https://michiganassessment.org/). The data included item-level scores from seven autonomously administered tests taken by different groups of participants (n = 5039) comprising test takers from Brazil, Chile, Colombia, Costa Rica, and Peru (see Table 9.1). The seven tests consisted of 46 test items each, producing a total of 322 multiple-choice items (one of the items was left out in CART modeling due to poor fit; see below). The sample size ranged from the smallest in Form 3 (n = 564) to the largest in Form 1 (n = 963). Michigan Language Assessment also provided the test items and audio files, which were transcribed and analyzed through Coh-Metrix. Each test comprises three parts:
a Part one: 17 short dialogues, each followed by a corresponding test item with four options
b Part two: four long conversations, each followed by three or four comprehension test items with four options
c Part three: three academic mini-talks, each followed by three or four comprehension test items with four options

Table 9.1  Demographic information of the listening tests and test takers

                            Form 1   Form 2   Form 3   Form 4   Form 5   Form 6   Form 7   Total
No. of items                46       46       46       46       46       46       46       322
Mean age                    22.09    27.71    26.85    22.96    23.5     20.87    23.62    23.94a
Test takers' age range      10–84    15–62    13–71    13–68    14–82    11–62    10–68    NA
Sample size                 963      612      564      758      608      708      826      5039
Gender distribution: M      457      302      235      336      253      347      348      2278
Gender distribution: F      506      310      329      422      355      361      487      2770

Note: a average age


Generating variables for CART modeling
We initially estimated Rasch item difficulty indices and discretized them by utilizing a median split technique to classify items into low- and high-difficulty items. Second, the independent variables were generated through Coh-Metrix analysis. Third, theoretical correlates of the item difficulty indices were identified (Table 9.2) and, finally, the discretized item difficulty was statistically regressed on the Coh-Metrix variables through CART to identify the optimal predictive variables. Details are discussed below.

Dependent variable: Preliminary Rasch measurement
Each test form was submitted to the Rasch model analysis separately. We examined the Rasch model infit and outfit mean square (MnSq) indices to ascertain the psychometric validity of the data (see Chapter 4, Volume I). Infit MnSq is a weighted fit statistic that is sensitive to inliers, that is, answers targeted on the test takers, whereas outfit MnSq is an unweighted fit index that is sensitive to data patterns far from test takers' abilities. The most productive fit MnSq range for items is between 0.6 and 1.4, which was fulfilled by all but one item in the study. That item was left out due to poor fit statistics, yielding a 321-item pool for the subsequent CART analysis.

Table 9.2  Independent variables in the CART model

Variable                          Cognitive level   Definition
1. Word count                     Surface level     Text length measured by number of words
2. Temporality                    Surface level     Tense and other temporal factors, which can aid comprehenders
3. Content word overlap           Situation model   The degree of meaning overlap in content words in adjoining sentences
4. Given-new sentences' average   Situation model   An index of semantic overlap measured by latent semantic analysis
5. Type-token ratio               Surface level     The ratio of unique vocabulary to total word frequency
6. Logical connectives            Textbase          Incidence of logical connectives such as so, therefore, etc.
7. Causal verbs and particles     Situation model   Incidence of verbs and particles that indicate causality
8. Left embeddedness              Surface level     Words before main verb mean
9. Preposition phrase density     Surface level     Incidence or number of prepositions
10. Verb incidence                Surface level     Verbs play an important role in helping listeners understand.
11. Noun and verb hypernymy       Surface level     Incidence of superordinates that subsume a set of subcategories, e.g., mammal
12. Flesch-Kincaid grade level    Surface level     An index of text difficulty


Dependent variable: Discretization of item difficulty
We then converted the continuous item difficulty measures into a categorical variable using the Discretize software (https://www.datapreparator.com/discretize.html; Lui, Hussain, Chew, and Dash, 2002) with two difficulty levels: easy and difficult. We computed the median of the item difficulty measures and created a two-level item difficulty variable: values less than or equal to the median were placed in the "low" difficulty level and values greater than the median were placed in the "high" level. To test the accuracy of the conversion of items into low and high difficulty, we correlated the discretized values with the Rasch difficulty measures and found that the median split variable had a high correlation with them (0.81).
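A median split of this kind can be reproduced in a few lines of pandas, as sketched below with invented Rasch measures; the chapter's own discretization was done with the Discretize software.

```python
# Minimal sketch (assumed, invented Rasch measures): a median split into low/high difficulty.
import pandas as pd

difficulty = pd.Series([-1.2, -0.4, 0.0, 0.3, 0.8, 1.5])     # hypothetical Rasch difficulty measures
median = difficulty.median()
level = (difficulty > median).map({True: "high", False: "low"})
print(pd.DataFrame({"difficulty": difficulty, "level": level}))

# Correlation between the split and the continuous measures (a point-biserial correlation)
print(difficulty.corr((level == "high").astype(int)))
```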

Independent variables for CART We examined the test items and listening texts to identify the information necessary (IN) to answer each test item (Buck & Tatsuoka, 1998). We further used options and item stems as a part of the IN because the test takers would need to read these and listen to the oral text to find the correct answer per item. Next, based on our review of the previous research, we generated a list of potential independent variables that could predict item difficulty. We subjected the INs to Coh-Metrix analysis, correlated the Coh-Metrix factors with the discretized item difficulty (Crossley & Salsbury, 2010), and chose 12 variables with high correlations with the dependent variable (p < 0.05), which are presented in Table 9.2. The chosen variables measure the three levels of mental representation specified by Coh-Metrix, which was discussed earlier (see McNamara et al., 2014).
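One way to carry out such correlation-based screening is sketched below, using a point-biserial correlation between each candidate Coh-Metrix variable and the 0/1 difficulty variable; the data frame, column names, and the specific correlation coefficient used here are assumptions for illustration, not a description of the original analysis.

```python
# Minimal sketch (assumed, invented data): screening candidate variables against
# the discretized item difficulty and keeping those with p < 0.05.
import pandas as pd
from scipy import stats

df = pd.DataFrame({"difficulty": [0, 1, 0, 1, 1, 0, 1, 0],      # 0 = easy, 1 = difficult
                   "word_count": [12, 25, 10, 30, 28, 14, 26, 11],
                   "ttr":        [.70, .82, .68, .85, .80, .71, .83, .69]})

for col in ["word_count", "ttr"]:
    r, p = stats.pointbiserialr(df["difficulty"], df[col])
    print(f"{col}: r = {r:.2f}, p = {p:.3f}")
```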

Data Analysis
CART analysis
We performed the CART analysis using the Salford Predictive Modeler® (SPM) software, Version 8.0. We utilized the 12 independent variables generated by Coh-Metrix alongside the discrete item difficulty as the dependent variable. We chose SPM over other available software due to its unique properties. According to Salford Systems (2018, p. 2):

Salford Systems' CART is the only decision-tree system based on the original CART code developed by world-renowned Stanford University and University of California at Berkeley statisticians Breiman, Friedman, Olshen and Stone. The core CART code has always remained proprietary and less than 20% of its functionality was described in the original CART monograph. Only Salford Systems has access to this code, which now includes enhancements co-developed by Salford Systems and CART's originators.

As discussed earlier, we performed both train-test and 10-fold cross-validations. In the train-test cross-validation method, we randomly divided the data into approximately 65% training subsamples (208 items and the 12 independent variables generated by Coh-Metrix) and 35% testing subsamples (113 items and the 12 independent variables generated by Coh-Metrix; see Table 9.3). In the 10-fold cross-validation model, we generated 10 sets or bins of data and let the algorithm validate the model as discussed in the section on cross-validation. To generate tree splits, we tested the Gini, Symmetric Gini, and Entropy partitioning methods and compared the fit of the models. In the models generated for train-test cross-validation, we opted for the Entropy technique because it better emulated the underlying structure of the data—that is, it minimized errors and maximized information. In the models generated for 10-fold cross-validation, we opted for the Gini method of tree-growing for the same reason. Another consideration was the enhancement of the classification power of the analysis (minimization of errors). To do so, we set the minimum node sample size (aka Atom) and the terminal node size (aka MinChild) to 2 and 1, respectively (see the tutorial on the Companion webpage for further information). The default Atom and MinChild indices are 10, but the analyst can determine the most suitable values by investigating the precision of the models. Specificity and sensitivity statistics as well as misclassified cases were computed separately for the train-test and 10-fold cross-validation samples. Next, we chose the optimal model and generated the IF-THEN rules per node, which are essential in interpreting the CART results and specifying the mechanisms determining the difficulty level of the items. The VII for each independent variable was further estimated.

Table 9.3  Distribution of data in train-test cross-validation CART analysis

Class                  Sample   Number   Percentage
1 (low difficulty)     Learn    111      53.37%
                       Test     53       46.90%
                       Total    164      51.09%
2 (high difficulty)    Learn    97       46.63%
                       Test     60       53.10%
                       Total    157      48.91%
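The following sketch approximates these settings in scikit-learn (a 65/35 train-test split, entropy-based splitting, and very permissive node sizes analogous to Atom = 2 and MinChild = 1). The feature matrix is a synthetic stand-in for the 12 Coh-Metrix variables and the binary difficulty variable; this is not a re-analysis of the MET data, and scikit-learn's tree is not the SPM implementation.

```python
# Minimal sketch (assumed, synthetic data): an open-source approximation of the analysis settings.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=321, n_features=12, n_informative=6, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=8)

cart = DecisionTreeClassifier(criterion="entropy",    # Entropy splitting rule
                              min_samples_split=2,    # analogous to Atom = 2
                              min_samples_leaf=1,     # analogous to MinChild = 1
                              random_state=8).fit(X_train, y_train)
print("training accuracy:", cart.score(X_train, y_train))
print("testing accuracy:", cart.score(X_test, y_test))
```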

Results
Performance and misclassification
Table 9.4 demonstrates the classification outcome for the training and test samples. As demonstrated, all items in the training sample are classified accurately (error = 0.00), with approximately 43% error for easy/difficult items in the testing sample. The precision of classification across both samples is 84.74%, which is considerably high. Overall, this analysis points to the relatively high accuracy of the models. Other indicators of performance, specificity and sensitivity, are also well above 0.80, with an ROC value of 0.847. On the other hand, the 10-fold cross-validation analysis returned relatively poorer specificity, ROC, and precision, although sensitivity is slightly better in this model. Given the better fit statistics reported in Table 9.4, we opted for the train-test cross-validation model.

Table 9.4  Classification accuracy, specificity, sensitivity, and ROC

Class                              Number of cases   Number misclassified   Pct. error
Train
  Easy items                       111               0                      0.00%
  Difficult items                  97                0                      0.00%
Test
  Easy items                       53                23                     43.40%
  Difficult items                  60                26                     43.33%
Overall (training and testing)
  Easy items                       164               23                     85.98% (correct)
  Difficult items                  157               26                     83.44% (correct)

                     Train-test cross-validation   10-fold cross-validation
Specificity          83.44%                         71.97%
Sensitivity          85.98%                         87.80%
Precision            84.43%                         76.60%
ROC                  0.84708                        0.703

Variable importance index (VII)
Table 9.5 presents the 12 independent variables and their VII. The four top variables in terms of their VII belonged to the surface and textbase levels.

Table 9.5  CART-estimated variable importance indices

Variable                                                                               VII      Cognition level
1. Noun and verb hypernymy                                                             100.00   Surface level
2. Word count                                                                          97.74    Surface level
3. Logical connectives incidence                                                       93.58    Textbase
4. Prepositional phrase density                                                        73.46    Textbase
5. Incidence of causal verbs and particles                                             65.86    Situation model
6. Verb incidence                                                                      63.53    Surface level
7. Flesch-Kincaid grade level                                                          61.91    Surface level
8. Temporality (a text easability index)                                               56.75    Surface level
9. Average givenness of each sentence (a latent semantic analysis index)               55.40    Situation model
10. Type-token ratio (lexical diversity)                                               53.50    Surface level
11. Content word overlap (adjacent sentences)                                          41.40    Situation model
12. Words before the main verb (left embeddedness; an index of syntactic complexity)   38.47    Surface level

Nodes and IF-THEN rules The optimal model presented in Figure 9.3 has 40 nodes or splitting points. At each node, specific splitting rules are applied. Sample rules are presented in Table 9.6 (for space constraints, the remainder of the rules are not presented). For example, the topmost node comprises the rule “IF (Type-token ratio) ≤ 0.7835, THEN item is easy” with the high accuracy of 90%. This rule, which has been selected among an extremely large pool of possible rules by the SPM algorithm, splits the entire sample into easy and difficult items, with some impurity. Other rules are then applied to the remaining difficult ⁄easy items and they are split till the sample become indivisible. This suggests that while some items become easier merely due to their type-token ratio ≤ 0.7835, other easy items possess very different patterns such as the rule for node 13 where 11 input variables of certain values would collaboratively render the items at this node easy. This is a significant advantage of CART over linear regression models in which one universal rule is sought for the entire sample. The IF-THEN rules reached an overall accuracy of 84.74% (15.26% of error) which is significantly high.

Figure 9.3  CART model for the data (based on Gini method of tree-growing).

Table 9.6  Sample IF-THEN rules generated by the CART model

Node 1
  IF: (Type-token ratio) ≤ 0.7835
  THEN: Easy (classification accuracy 90%)

Node 5
  IF: (Type-token ratio) > 0.7835 AND (Prepositional phrase density) ≤ 204.16650 AND (Average givenness of each sentence) ≤ 0.07350 AND (Noun and verb hypernymy) ≤ 1.48450 AND (Word count) ≤ 21.50000 AND (Flesch-Kincaid grade level) > 5.08500 AND (Logical connectives incidence) > 69.04800 AND (Logical connectives incidence) ≤ 87.12100
  THEN: Difficult (classification accuracy 100%)

Node 7
  IF: (Type-token ratio) > 0.78350 AND (Prepositional phrase density) ≤ 204.16650 AND (Average givenness of each sentence) ≤ 0.07350 AND (Noun and verb hypernymy) ≤ 1.48450 AND (Logical connectives incidence) > 87.12100
  THEN: Difficult (classification accuracy 90.48%)

Node 13
  IF: (Type-token ratio) > 0.78350 AND (Average givenness of each sentence) ≤ 0.07350 AND (Logical connectives incidence) ≤ 54.09400 AND (Noun and verb hypernymy) > 1.61350 AND (Flesch-Kincaid grade level) > 4.82450 AND (Word count) > 11.00000 AND (Prepositional phrase density) > 23.80950 AND (Prepositional phrase density) ≤ 204.16650 AND (Verb incidence) ≤ 138.99600 AND (Incidence of causal verbs) ≤ 10.63850 AND (Temporality) > −27.79650
  THEN: Easy (classification accuracy 100%)

Node 17
  IF: (Type-token ratio) > 0.78350 AND (Prepositional phrase density) ≤ 204.16650 AND (Average givenness of each sentence) ≤ 0.07350 AND (Noun and verb hypernymy) > 1.48450 AND (Logical connectives incidence) > 54.09400 AND (Word count) ≤ 21.00
  THEN: Easy (classification accuracy 91.67%)

Node 21
  IF: (Type-token ratio) > 0.78350 AND (Prepositional phrase density) ≤ 204.16650 AND (Average givenness of each sentence) > 0.09250 AND (Verb incidence) ≤ 63.333
  THEN: Easy (classification accuracy 87.5%)

Node 22
  IF: (Type-token ratio) > 0.78350 AND (Prepositional phrase density) ≤ 204.16650 AND (Average givenness of each sentence) > 0.09250 AND (Verb incidence) > 63.33300 AND (Content word overlap) ≤ 0.17950 AND (Noun and verb hypernymy) ≤ 1.11250 AND (Word count) ≤ 30.00
  THEN: Easy (classification accuracy 100%)

Node 25
  IF: (Type-token ratio) > 0.78350 AND (Average givenness of each sentence) > 0.09250 AND (Verb incidence) > 63.33300 AND (Content word overlap) ≤ 0.17950 AND (Noun and verb hypernymy) ≤ 1.11250 AND (Word count) > 30.00000 AND (Prepositional phrase density) > 32.69550 AND (Prepositional phrase density) ≤ 204.1665
  THEN: Difficult (classification accuracy 88%)

Node 39
  IF: (Type-token ratio) > 0.78350 AND (Prepositional phrase density) ≤ 204.16650 AND (Average givenness of each sentence) > 0.09250 AND (Content word overlap) > 0.17950 AND (Word count) > 21.50000 AND (Verb incidence) > 63.33300 AND (Verb incidence) ≤ 106.576
  THEN: Difficult (classification accuracy 100%)

Discussion
In the sample study reported earlier, we found that the train-test cross-validation method outperformed the 10-fold cross-validation method with reasonably higher

accuracy. We find it very useful to compare both methods in language assessment research to ascertain the optimal fit of the selected model over potential rival models. We adopted CART modeling to examine the effect of surface-level, textbase, and situation model variables on listening item difficulty. The optimal model, which used the train-test cross-validation method, showed that the top four variables predicting item difficulty were surface- and textbase-level, testifying to the influence of these cognition levels on listening task difficulty. We would expect to find a more pronounced role for the indices that are associated with dimensions of the situation model, but this was not confirmed in the study. This finding can be interpreted in two ways. First, if test developers have designed most of the items to measure the surface and textbase levels with only a smaller number of them suitable for engaging the situation model, then the results lend supporting evidence to the validity argument of the test. On the other hand, the situation model is a crucial component of listening, and it would be essential that the test designers justify its underrepresentation in the test items. In addition, the findings should be confirmed in other forms

of the MET in future research to investigate potential differences between constructs across different test forms. It should be noted that since the CART modeling reported here uses a particular IF-THEN rule in each node, we cannot assume that the four top variables would influence the difficulty of all items in the same way. The IF-THEN rules can point us to the influence of the independent variables for each particular group. For example, in Table 9.6, the IF-THEN rules for test items in Nodes 5 and 7, which have a high degree of resemblance, indicate that difficult items possess a type-token ratio above 0.7835 alongside other attributes. Nevertheless, this is not generalizable to the entire data set; quite the reverse, for items in Node 13, the same attribute, in combination with other attributes, would result in a low degree of item difficulty in 100% of the items in that node. Overall, the analysis showed that the three levels of cognition affect the item difficulty of different groups of items in different ways and that there is no one linear way of defining the relationships between item difficulty and the levels of cognition predicted in CI.

Conclusion
This chapter presented the application of CART modeling in language assessment. The study reported here is one of the few studies in language assessment wherein the two most commonly used cross-validation methods were adopted and compared. Despite its potential, CART has not been adopted in language assessment research as widely as linear modeling, such as linear regression, has. One reason for this gap might be that linearity, which can be detected by, e.g., linear regression, is a more accessible paradigm than is nonlinearity, which CART seeks to detect. By integrating CART into statistical training in language assessment courses, we would be able to promote the application of the nonlinearity paradigm in language assessment research. In addition, while CART has been applied in language assessment, the methods used in the published research have limitations. In CART modeling, researchers should consider cross-validation and the comparison of splitting rule methods (e.g., Gini and twoing) to identify the best method for tree-growing. CART algorithms that are formulated based on stringent stopping rules would likely result in prematurely grown trees (see Aryadoust & Goh, 2014, for a comparison of the SPSS CART module and the SPM), and, accordingly, it is recommended that an algorithm that relaxes such strict stopping rules be used. As CART is a nonparametric technique, conventional fit and regression statistics such as beta coefficients are not applicable to determine model fit. In lieu of these, specificity, sensitivity, and ROC indices—which rarely have been reported in previous language assessment research—need to be considered in model fitting. CART can also generate models that are as precise and accurate as those produced by other advanced data-mining techniques, such as artificial neural networks (see Aryadoust & Goh, 2014).

Future research should consider the viability of CART modeling in other contexts such as predictive validity in reading, speaking, and writing assessment.

Acknowledgments This study was funded by the Spaan Fellowship Program of Michigan Language Assessment.

Note

1. A discussion of these techniques falls outside of the scope of this chapter. For more information, see Kezunovic, Meliopoulos, Venkatasubramanian, and Vittal (2014).

References Aryadoust, V., & Goh, C. C. M. (2014). Predicting listening item difficulty with language complexity measures: A comparative data mining study. SPAAN Working Papers, 2014-01. Ann Arbor, MI: Michigan Language Assessment. Available from http: ⁄ ⁄www. cambridgemichigan.org ⁄resources ⁄working-papers ⁄ 2 Aryadoust, V., & Sha, L. (2015). Predicting EFL writing ability from levels of mental representation measured by Coh-Metrix: A structural equation modeling study. Assessing Writing, 24, 35–58. DOI: 10.1016 ⁄ j.asw.2015.03.001 Bengio, Y., & Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5, 1089–1105. Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks ⁄Cole Advanced Books and Software. Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157. doi: 10.1177 ⁄026553229801500201 Charniak, E. (2000). A maximum-entropy-inspired parser. Proceedings of the NAACL 2000. Seattle, Washington, (pp. 132–139). Crossley, S. A., Allen, D., & McNamara, D. S. (2012). Text simplification and comprehensible input: A case for an intuitive approach. Language Teaching Research, 16(1), 89–108. doi: 10.1177 ⁄1362168811423456 Crossley, S. A., Greenfield, J., & McNamara, D. S. (2008). Assessing text readability using cognitively based indices. TESOL Quarterly, 42, 475–493. Crossley, S. A., & McNamara, D. S. (2010). Cohesion, coherence, and expert evaluations of writing proficiency. In R. Catrambone & S. Ohlsson, (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 984–989). Cognitive Science Society, Austin, TX. Crossley, S. A., & Salsbury, T. (2010). Using lexical indices to predict produced and not produced words in second language learners. The Mental Lexicon, 5(1), 115–147. doi: 10.1075 ⁄ml.5.1.05cro Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2), 139–158.

Dumais, S. T. (2005). Latent semantic analysis. Annual Review of Information Science and Technology, 38, 188.
Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 1–30). Menlo Park, CA: AAAI Press.
Fellbaum, C. (2005). WordNet and wordnets. In K. Brown et al. (Eds.), Encyclopedia of language and linguistics (2nd ed., pp. 665–670). Oxford, UK: Elsevier.
Gao, L., & Rogers, W. T. (2010). Use of tree-based regression in the analyses of L2 reading test items. Language Testing, 28(2), 1–28.
Grant, L., & Ginther, A. (2000). Using computer-tagged linguistic features to describe L2 writing differences. Journal of Second Language Writing, 9, 123–145. doi: 10.1093/poq/nfj012
Jensen, K., Muller, H. H., & Schafer, H. (2000). Regional confidence bands for ROC curves. Statistics in Medicine, 19, 493–509. doi: 10.1002/(SICI)1097-0258(20000229)1
Kezunovic, M., Meliopoulos, S., Venkatasubramanian, V., & Vittal, V. (2014). Application of time-synchronized measurements in power system transmission networks. Switzerland: Springer.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York: Cambridge University Press.
Kintsch, W. (2004). The construction–integration model of text comprehension and its implications for instruction. In R. Ruddell & N. Unrau (Eds.), Theoretical models and processes of reading (5th ed., pp. 1270–1328). Newark, DE: International Reading Association.
Kintsch, W., & Mangalath, P. (2011). The construction of meaning. Topics in Cognitive Science, 3, 346–370.
Kintsch, W., & Van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 363–394.
Lui, H., Hussain, F., Chew, L. T., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6, 393–423.
McNamara, D. S., Crossley, S. A., & McCarthy, P. M. (2010). Linguistic features of writing quality. Written Communication, 27, 57–86. doi: 10.1177/0741088309351547
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge, UK: Cambridge University Press.
McNamara, D. S., & Kintsch, W. (1996). Learning from text: Effects of prior knowledge and text coherence. Discourse Processes, 22, 247–288.
Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415–434.
Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239–281.
Perkins, K., Gupta, L., & Tammana, R. (1995). Predicting item difficulty in a reading comprehension test with an artificial neural network. Language Testing, 12, 34–53. doi: 10.1177/026553229501200103
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Kaufmann.
Salford Systems (2018). SPM users' guide: Introducing CART. Retrieved from http://media.salford-systems.com/pdf/spm8/IntroCART.pdf
Sheehan, K. M., & Ginther, A. (2001). What do passage-based multiple-choice verbal reasoning items really measure? An analysis of the cognitive skills underlying performance on the current TOEFL reading section. Paper presented at the Annual Meeting of the National Council on Measurement in Education.

Spoon, K., Beemer, J., Whitmer, J. C., Fan, J. J., Frazee, J. P., Stronach, J., … Levine, R. A. (2016). Random forests for evaluating pedagogy and informing personalized learning. Journal of Educational Data Mining, 8(2), 20–50.
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application and characteristics of classification and regression trees, bagging and random forests. Psychological Methods, 14(4), 323–348.
Swets, J. A. (1996). Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers. Mahwah, NJ: Erlbaum.
Tapiero, I. (2007). Situation models and levels of coherence: Toward a definition of comprehension. Mahwah, NJ: Erlbaum.
Tetko, I. V., Livingstone, D. J., & Luik, A. I. (1995). Neural network studies. 1. Comparison of overfitting and overtraining. Journal of Chemical Information and Modeling, 35(5), 826–833.
Winne, P. H., & Baker, R. S. J. D. (2013). The potentials of educational data mining for researching metacognition, motivation, and self-regulated learning. Journal of Educational Data Mining, 5(1), 1–8.
Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2012). A comparison of two scoring methods for an automated speech scoring system. Language Testing, 29(3), 371–394.
Zhou, X. H., & Qin, G. (2005). Improved confidence intervals for the sensitivity at a fixed level of specificity of a continuous-scale diagnostic test. Statistics in Medicine, 24, 465–477. doi: 10.1002/sim.1563

10 Evolutionary algorithm-based symbolic regression to determine the relationship of reading and lexicogrammatical knowledge

Vahid Aryadoust

Introduction

A few decades ago, Hofstadter (1986) wrote that physics theories relied heavily on linear modeling.1 He stated that "Nonlinear mathematical phenomena are much less well understood than linear ones, which is why a good mathematical description of turbulence has eluded physicists for a long time, and would be a fundamental breakthrough" (Hofstadter, 1986, p. 365). Nowadays, the field of language assessment seems to be facing the same gap in predicting language proficiency. For prediction, the field has widely applied linear regression models or similar techniques such as linear structural equation modeling (SEM; see Chapter 5, Volume II), which assume linearity. While there is no inherent flaw in any linear model, nonlinear techniques are much less understood in the field, and their application needs to be further investigated (Aryadoust, 2015). The development of a group of nonlinear models called optimization methods has been inspired (partly) by nature. Optimization refers to the process of rendering a system, product, or equation as good as possible (optimal) to model the underlying complexities of the data according to some prespecified requirements or rules (Bertsekas, 1995). In quantitative data analysis, optimization concerns maximizing the precision of a function by systematically choosing input values (aka independent variables) in order to emulate or approximate the value of an output (aka the dependent variable) (Chong & Żak, 2013). Put simply, optimization techniques can be viewed as advanced regression models (see Chapter 6, Volume II) that are used to predict the value of an output based on available inputs. According to Fister, Yang, Fister, Brest, and Fister (2013), there are multiple nature-inspired optimization techniques that can be used for prediction and/or classification, such as artificial neural networks (see Aryadoust & Goh, 2014), ant colonies (Dorigo & Stültze, 2004), the artificial bee colony algorithm (Dervis, 2010), artificial immune systems (De Castro & Timmis, 2002), and genetic and evolutionary algorithms (Koza, 2010), to name a few. The approach adopted in this chapter is based on the evolutionary algorithms method, wherein a basic model is first formulated and then bred (optimized step by step) over multiple generations (groups of suboptimal models) to arrive at the optimal solution that can predict an outcome (for example, predict test takers' reading ability

based on their lexical and grammatical knowledge). The optimal model has to be parsimonious (has low complexity) and precise (can predict with high accuracy). We propose that the evolutionary algorithms method can be adopted as a useful nonlinear modeling technique in language assessment research (Tenenbaum, De Silva, & Langford, 2000). To illustrate, we provide an example in which reading ability is predicted by its components. Researchers have generally assumed linear relationships between reading and language components such as vocabulary. However, in this chapter, we aim to show that nonlinear models can provide another perspective on comprehension that has more precision and accuracy. The chapter does not intend to represent linear (prediction) modeling in an unfavorable light; rather, it attempts to introduce evolutionary algorithm-based (EA-based) symbolic regression, a rigorous predictive technique that has been applied in previous language assessment research (Aryadoust, 2015, 2016). The approach introduced in this chapter builds predictive models by using "the symbolic functional operators2 selected by the researcher, and then by applying a genetic programming algorithm which results in the selection of input variables and a final set of models" (Aryadoust, 2015, p. 303). Model fit statistics are used to choose the best fitting model among the generated models (Schmidt & Lipson, 2009). Features and utility of EA-based symbolic regression are further discussed in the next section.

Nonlinearity paradigm: EA-based symbolic regression

Linearity is an appealing premise underlying numerous studies in language assessment and pedagogy, owing to its simplicity and "cultural attractiveness" (Nicolis, 1999, p. 1). Linearity reduces interactions to proportional relations, where a change in the cause (predicting or independent variable) would induce measurable changes in the effect (predicted or dependent variable). The linearity assumption is prevalent in the language assessment and pedagogy literature. For example, several researchers have reported weak to medium linear interactions among reading test performance, cognitive strategies, and metacognitive strategies (e.g., Phakiti, 2008; Purpura, 1997). In addition, it has been assumed that there is a linear relationship between students' communicative competence and academic performance (Cho & Bridgeman, 2012). An integrative paradigm has recently emerged that synthesizes scientific theories and mathematical modeling. The application of this concept-based mathematical paradigm has resulted in breakthroughs in various fields of study. For example, it has facilitated the representation of laser-plasma interactions (Rastovic, 2010) and the modeling of outbreaks of diseases in human populations over time (Liao, Hu, Wang, & Leeson, 2013) through nonlinear mathematical methods such as genetic algorithms. This paradigm has also led to important mathematical achievements in data modeling such as artificial neural networks (ANNs), adaptive neuro-fuzzy inference systems (ANFIS), and evolutionary computation methods including genetic algorithms, differential

evolutions, and ant colony optimization, to name a few (Eiben & Schoenauer, 2002). Despite their successful application in many fields of science, most of these methods have remained unknown to the field of language assessment. EA-based symbolic regression generates and develops rival models (Schmidt & Lipson, 2009) through examining, choosing, and breeding a set of ideal models (Fogel & Corne, 1998). One distinct benefit of using symbolic regression is that the solutions3 are enhanced through inspecting every discrete datum instead of treating the data as a whole. By utilizing the data, symbolic regression creates and constantly tweaks models to arrive at a best fit solution and discards the poorly fitting models that do not survive the competition with better fitting models. This can enable researchers to transfer the time-consuming task of model generation and estimation to the available powerful computer programs and to select the most suitable model by assessing the accuracy and appropriacy of the models proposed by the programs (Schmidt & Lipson, 2010).

Fundamental concepts in symbolic regression

EA-based symbolic regression identifies significant predictive variables (independent variables that can predict the dependent variable) and sets the constraints on the independent-dependent variable relationships through (i) deciding on the specific operators and functions, (ii) generating solutions, and (iii) choosing the most appropriate (best-fitting) model. These fundamental concepts are discussed in turn below.

Operators

First, EA-based symbolic regression chooses a list of mathematical functions from the basic data set (McRee, 2010). According to Dictionary.com (https://www.dictionary.com/), a function is "a relation between two sets in which one element of the second set is assigned to each element of the first set." There are a large number of mathematical functions, linear and nonlinear, that contain operators. For example, a simple linear regression can be viewed as a function that maps out the relationship between input and output variables through operators that signify the specific mathematical manipulations that we desire to perform. Well-known groups of functions include basic functions (e.g., subtraction [operator: −] and multiplication [operator: ×]), complicated functions such as circular functions (e.g., sine [operator: sin] and cosine [operator: cos]), and high-level functions such as exponential functions (e.g., natural logarithm [operator: ln x]) (Schmidt & Lipson, 2008).
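To make the idea of a researcher-selected operator pool concrete, the sketch below represents a few operators from each group as Python callables and composes them into a single candidate solution. The grouping, names, and example expression are purely illustrative assumptions and are not taken from Eureqa.

# Illustrative operator pool for symbolic regression, grouped roughly as in
# the text: basic, circular, and high-level functions.
import math

BASIC = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
CIRCULAR = {
    "sin": math.sin,
    "cos": math.cos,
}
HIGH_LEVEL = {
    "ln": lambda x: math.log(x) if x > 0 else float("nan"),  # guard the log domain
}

# A candidate solution is simply a composition of operators, e.g. 0.4*a + sin(b):
candidate = lambda a, b: BASIC["add"](BASIC["mul"](0.4, a), CIRCULAR["sin"](b))
print(candidate(1.2, 0.5))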

Model solving

Subsequently, EA-based symbolic regression utilizes evolutionary algorithms (EAs) that simulate mechanisms of Darwinian evolution such as breeding and variety (Fogel, 1999). EA-based symbolic regression starts with an initialization,

where mathematical solutions are developed before assessing their accuracy and relevance, which are measured by their fit indices (Gwiazda, 2006). The best-fitting solution is chosen as the parent solution, which is used as a baseline to develop future generations of solutions called offspring (Zelinka, Chen, & Celikovsky, 2010). EA-based symbolic regression preserves variety in the competing solutions to prevent premature convergence, whereby all competing solutions start forming up toward the same end solution. This is commonly known as a local optimum, which is an optimal solution within a subset of data rather than the entire data set (Schmidt & Lipson, 2010, 2011). Avoiding premature convergence results in better precision and higher accuracy, especially in nonlinear modeling wherein there might exist multiple local optima (Zelinka et al., 2010). There are several possible steps to bypass premature convergence, three of which have been included in Eureqa, the software used in the current chapter: crossover, mutation, and age-fitness Pareto optimization (AFPO) (Banzhaf, Nordin, Keller, & Francone, 1998). In crossover, two or more original parent solutions are chosen to reproduce offspring or child solutions (Holland, 1975; Koza, 2010). Eureqa applies crossover iteratively: it chooses an arbitrary point on each parent solution and divides the solution into two segments at that point (Schmidt & Lipson, 2008). Conceptually, this process resembles chromosomal crossover in animal reproduction, in which the paternal and maternal chromosomes combine by exchanging genetic materials (instead of genetic materials, EA-based symbolic regression exchanges operators). Next, the segments pair up or combine into two separate offspring; that is, the beginning of one parent solution merges with the end of the other parent solution and vice versa. The newly developed offspring solutions will then replace the parent solutions that did not have a good fit (Zelinka et al., 2010). Another method to circumvent premature convergence is mutation, which is the process of modifying an available solution to optimize its fit to the data (Gwiazda, 2006; Michalski, 2000). Compared with crossover, which is applied to approximately 50% of solutions, mutation is relatively underused in nonlinear modeling (used in only approximately 1% of solutions) (Zelinka et al., 2010). Like crossover, mutation makes the existing solutions more varied and thus helps avoid premature convergence. The operators appropriate for the evolving solutions will have a higher probability of being chosen for crossover and mutation (Michalski, 2000). The chance of each operator being chosen is a mathematical function of its average improvement (Schmidt & Lipson, 2010), which is estimated based on how the solutions are being developed. All operators initiate with equal chances of being chosen and are continuously revised according to the progress of model generation (Nicoară, 2009). Each time an operator is applied, the EA algorithm updates its odds of selection. The third process to avoid premature convergence is AFPO, which selects new solutions based on two criteria: age (how long the solution has existed) and fitness (how appropriate the solution is when applied) (Schmidt & Lipson, 2010). As new solutions are developed and refined continuously,

AFPO attempts to reduce errors and increase the extrapolative nature and accuracy of the models (Schmidt & Lipson, 2011). Conventional nonevolutionary algorithms keep only the best solutions. This contrasts with EAs, which preserve a whole set of solutions that fit the model to varying degrees. This aspect is a benefit over conventional methods, as it prevents local optimum traps (Koza, 2010). A local optimum is the best mathematical solution (e.g., a function) only in comparison with a small number of possible rival solutions, as opposed to the global optimum, which is the best solution among all possible solutions (Koza, 2010). With conventional algorithms, there is a chance that better solutions will be discarded if the algorithm gets stuck in a local optimum.
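The toy sketch below illustrates the recombination idea behind crossover and mutation. Real symbolic-regression engines such as Eureqa operate on expression trees and apply repair rules; here solutions are represented as token lists purely for brevity, so the offspring are not guaranteed to be syntactically valid expressions. All names and values are illustrative assumptions.

# Toy illustration of one-point crossover and point mutation on two parent
# "solutions" represented as token lists (not valid Eureqa syntax).
import random

random.seed(1)
parent_a = ["0.40", "*", "alpha", "+", "0.20", "*", "beta"]
parent_b = ["tanh", "(", "alpha", ")", "+", "0.30", "*", "beta"]

def crossover(p1, p2):
    """Cut each parent at a random point and swap the tail segments."""
    i, j = random.randrange(1, len(p1)), random.randrange(1, len(p2))
    return p1[:i] + p2[j:], p2[:j] + p1[i:]

def mutate(tokens, pool=("alpha", "beta", "0.10", "0.50")):
    """Replace one randomly chosen token to keep variety in the population."""
    out = tokens[:]
    out[random.randrange(len(out))] = random.choice(pool)
    return out

child_1, child_2 = crossover(parent_a, parent_b)
print(" ".join(child_1))
print(" ".join(mutate(child_2)))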

Selecting the best model

To select the ideal model in EA-based symbolic regression, one needs to inspect the current models against two benchmarks: verification of the accuracy of the model on a cross-validation sample and estimation of model complexity (Kordon, Pham, Bosnyak, Kotanchek, & Smits, 2002). In order to verify whether the ideal solutions are applicable to other data sets, EA-based symbolic regression splits the data set into two separate subsets: training and cross-validation or testing (Schmidt & Lipson, 2010); note that this process is the same as the training and testing stages in classification and regression trees (CART; see Chapter 9, Volume II). The training subset is used for developing solutions, while the cross-validation subset is used to evaluate the functionality of the selected solution. Cross-validation impedes overfitting, which happens when the solution emulates the measurement error instead of the mathematical nature of the data set.

Estimating fit

Multiple fit statistics are used in this chapter to evaluate the effectiveness of the cross-validated solutions:

i Mean absolute error (MAE) and mean squared error (MSE) estimate the difference between predicted and observed values. Low MAE and MSE indices suggest better fit.
ii The correlation coefficient (R) quantifies the correlation between observed and model-estimated statistics. It ranges between zero and one, with values above 0.7 suggesting a high correlation between model-generated and observed data.
iii The R² index indicates the proportion of the output explained by the independent variables. It ranges between zero and one, with values closer to one suggesting that the independent variables have high ability to predict the output.

In addition to cross-validation, model complexity is estimated to help identify the best model. In Eureqa, a weighting is assigned to every mathematical

operator in the solutions to measure its complexity. For instance, basic operators, like addition and subtraction, are assigned a weighting of one, while more complex operators, such as logistic and step functions, have a complexity of four. Total model complexity is calculated by adding up the weightings of all the operators in the solution. Usually, simpler solutions that return lower prediction errors are preferred. This could pose a dilemma where the researcher has to choose between model simplicity and theoretical appropriateness of the model (Schmidt & Lipson, 2010). One way to deal with this dilemma is to adopt the solution, revise and optimize the theoretical framework accordingly, and then test the accuracy and validity of the revised framework in other contexts.
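The following sketch computes the fit indices listed above and a simple complexity score obtained by summing operator weights. The weighting values mirror the idea described in the text (one for basic operators, four for the most complex ones); the exact weights used internally by Eureqa may differ, and the sample data are invented.

# Sketch of the fit indices (MAE, MSE, R, R²) and a weighted complexity score.
import numpy as np

def fit_indices(observed, predicted):
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    err = observed - predicted
    mae = np.mean(np.abs(err))                                           # mean absolute error
    mse = np.mean(err ** 2)                                              # mean squared error
    r = np.corrcoef(observed, predicted)[0, 1]                           # correlation coefficient
    r2 = 1 - np.sum(err ** 2) / np.sum((observed - observed.mean()) ** 2)  # proportion explained
    return {"MAE": mae, "MSE": mse, "R": r, "R2": r2}

OPERATOR_WEIGHTS = {"+": 1, "-": 1, "*": 1, "/": 2, "tanh": 3, "logistic": 4}  # illustrative weights

def complexity(operators):
    """Total complexity = sum of the weights of all operators in a solution."""
    return sum(OPERATOR_WEIGHTS.get(op, 1) for op in operators)

print(fit_indices([1.0, 2.0, 3.0], [0.9, 2.2, 2.7]))
print(complexity(["+", "*", "tanh"]))   # 1 + 1 + 3 = 5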

Sample study: Using EA-based symbolic regression to predict reading ability

Theoretical framework of the sample study

Over the past few years, English as a second/foreign language (ESL/EFL) assessment research has seen a growing interest in investigating the interrelations between reading comprehension processes and lexicogrammatical knowledge (Zhang, Goh, & Kunnan, 2014), with the latter being a significant predictor of the former according to Alderson (2000). Reading comprehension as a multistage process includes word recognition, in which readers recognize the orthographic, morphological, and phonological features of written words by matching them against their knowledge base in their long-term memory (Weir, 2005). Low-ability readers' lack of extended vocabulary and slow pace in associating written words with their mental lexicon would negatively affect their performance in reading comprehension (Alderson, 2000). On the other hand, high-ability ESL/EFL readers can read more effectively due to their ability to perform fast and automatic word recognition. This facilitates higher-level comprehension processes such as inference-making and thus faster and more efficient reading comprehension (Khalifa & Weir, 2009). After word recognition, ESL readers parse the sentence, a process that relies on the grammatical units of local text segments (Jung, 2009, p. 29). According to Jung (2009), poor grammatical knowledge can hinder global comprehension. Similarly, Weir (2005) highlighted that "grammatical resources" play a crucial role in aiding successful cognitive processing in reading. Successful word recognition and syntactic parsing allow readers to develop a propositional meaning for each sentence that does not necessarily include any "interpretive and associative factors which the reader might bring to bear upon it" (Field, 2004, p. 50). Successively, readers also infer various relationships from the text by extracting and internalizing implicitly stated information (Field, 2004). By extracting information from the text, discriminating between major and minor pieces of information, incorporating their background knowledge, and organizing the information in terms of importance (e.g., macro-propositions or main ideas

versus micro-propositions or details), readers assemble a mental representation of the text that is called the situation model (Kintsch, 1988). While it has been shown that grammar and vocabulary play essential roles in reading comprehension (Phakiti, 2008; Purpura, 1997), the construct of grammar and vocabulary knowledge has often been operationalized quite narrowly. Purpura (2004) argues that grammatical knowledge has three levels: form, meaning, and the pragmatic level. Form is concerned with accuracy at several levels, including the phonological, word formation, morphosyntactic, cohesion, and discourse levels. Each of these levels is associated with grammatical and pragmatic meaning; for example, morphosyntactic forms have literal functional meanings as well as sociolinguistic applications such as apologies and requests. Similarly, there are three aspects of vocabulary knowledge: form (e.g., spoken and written), meaning (concepts and associations), and use (e.g., register and function) (Nation, 2005). Together, these levels of language constitute learners' lexicogrammatical knowledge. Accordingly, it is essential that language tests tap into all the dimensions of lexicogrammatical knowledge and avoid underrepresenting the constructs. It is plausible that, in predictive modeling, underrepresented lexicogrammatical constructs would predict reading knowledge with less accuracy than expected because all the aforementioned aspects of lexicogrammatical knowledge would be required to comprehend the reading texts at surface and deep levels (Kintsch, 1998). In sum, ESL/EFL reading comprehension is likely to be best predicted by readers' vocabulary and grammatical knowledge, provided that these knowledge repertoires are well represented in assessment tools. However, the extant research on the use of lexicogrammatical knowledge as a predictor of reading comprehension has at least one methodological drawback, which is the application of correlation analysis or linear modeling (see Jung, 2009; Zhang et al., 2014). Data sets vary in the extent to which they deviate from linearity, and correlation coefficients are not resistant to outliers and nonlinear patterns. Accordingly, low correlations do not conclusively indicate a lack of interaction between two variables. Zhang et al. (2014) have alternatively suggested using structural equation modeling (SEM) instead (see Chapter 5, Volume II), which, despite its potential, is also based on matrices of correlations or covariances, and nonlinearity can somewhat affect the fit and convergence of models (unless an estimation method that is suitable for modeling nonlinearity is used). Thus, some parts of the data typically are discarded in linear modeling to resolve nonlinearity, which could result in the loss of useful knowledge in the data (Hair, Black, Babin, & Anderson, 2010).

Methodology

Data source and instruments

Data were obtained from the administration of a university English test battery to 340 EFL university candidates (male = 170; female = 170; aged between

28 and 45) who held master's degrees in humanities, engineering, and science fields in an Asian country. The English test is a requirement for all applicants who wish to gain entry to PhD programs at the university. The English test used in this study comprises 100 dichotomously scored multiple choice (MC) test items and is partly modeled after the paper-based Test of English as a Foreign Language (TOEFL PBT). It is organized into three primary parts, each assessing a particular language skill or ability, as follows:

i The reading comprehension testlet, comprising 40 multiple choice (MC) items testing students' comprehension of six passages, between 144 and 370 words long. The items assess students' ability to skim and scan, identify main ideas and detailed information, and surmise the meaning of unknown vocabulary.
ii The vocabulary testlet, consisting of 30 test items. In the first half (15 items), test takers are required to choose a synonym for a prompt word out of four options. In the second half (15 items), test takers are required to fill in gaps with the most fitting word out of four available options.
iii The grammar testlet, with 30 test items. Fifteen test items comprise sentences with four underscored words, one of which is grammatically inaccurate, and the test taker is required to identify it. Ten test items constitute fill-in-the-blank items where test takers must choose the correct word or phrase among four available options.

Data analysis

Psychometric validity of the testlets

It is important to ascertain the psychometric validity of tests before using the test scores for EA-based symbolic regression modeling. We used the Winsteps computer program, Version 4.2 (Linacre, 2018), to perform Rasch measurement on the data (see Chapters 4 and 7, Volume I). The testlets were calibrated separately because each testlet forms an independent dimension. In each Rasch measurement analysis, multiple psychometric features of the tests were examined, including item difficulty and fit, person ability and fit, reliability and separation, and unidimensionality. A full treatment of these concepts falls outside the scope of this chapter; various aspects and forms of Rasch measurement are discussed in Chapters 4, 5, and 7, Volume I.

Linear regression model

For comparison, the data were initially submitted to linear regression analysis with the ENTER method to assess linearity; R², adjusted R² (R² adjusted for the sample size), and regression coefficients for each independent variable were estimated (see Chapter 10, Volume I, for further information).
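As a hedged illustration of this baseline step, the sketch below fits a linear model with both predictors entered simultaneously (equivalent in spirit to SPSS's ENTER method) using the statsmodels library. The variables reading, vocab, and grammar are simulated placeholders standing in for the Rasch person measures of the three testlets, not the study's data.

# Minimal sketch of the linear baseline: simultaneous entry of both predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
vocab, grammar = rng.normal(size=340), rng.normal(size=340)              # placeholder measures
reading = 0.4 * vocab + 0.2 * grammar + rng.normal(scale=0.8, size=340)  # placeholder outcome

X = sm.add_constant(np.column_stack([vocab, grammar]))   # intercept + two predictors
ols = sm.OLS(reading, X).fit()
print(ols.params)                        # intercept and regression coefficients
print(ols.rsquared, ols.rsquared_adj)    # R² and adjusted R²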


EA-based symbolic regression analysis

Target expression and building blocks

The Eureqa4 computer package (Version 0.99.9 beta) (Schmidt & Lipson, 2013) was used for the EA-based symbolic regression analysis (see http://www.nutonian.com/products/eureqa/). The developers no longer offer the free academic version of Eureqa that was used to perform the analysis for this chapter.5 Reading comprehension proficiency was modeled as a function of vocabulary and grammar knowledge in Equation 10.1:

θR = f(α, β)    (10.1)

where θR is reading ability measured by the reading testlet; α is vocabulary knowledge measured by the vocabulary testlet; and β is grammar knowledge measured by the grammar testlet. Subsequently, pertinent mathematical operators such as addition and negation were chosen. Following Schmidt and Lipson (2009, 2010), we initiated the search for the optimal model by using simple or moderately complex mathematical operators and then complex operators: (i) basic operators, comprising constant, addition, subtraction, multiplication, and division; (ii) moderately complex squashing and history operators, comprising hyperbolic tangent, simple moving average and median, weighted moving average, and modified moving median; and (iii) inverse trigonometry operators, which are complex and comprise inverse hyperbolic tangent and cosine.6 The analysis continued for 35 hours. To optimize the depth of analysis, Eureqa was integrated with the Amazon EC2 Secure Cloud for approximately five additional hours to perform cloud computation, in which 126 CPU (central processing unit) cores were used to accelerate the processing. The data were divided into a training subsample (70%; n = 238) and a validation subsample (30%; n = 102) for the analysis.
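Because the Eureqa package used here is no longer freely available, the sketch below shows how a comparable setup could be assembled with the open-source gplearn library: a 70/30 train-validation split and a researcher-chosen function set evolved by genetic programming. The data are simulated, the parameter values are illustrative, and gplearn is named only as one possible substitute rather than as the tool used in this study.

# A gplearn-based sketch of genetic-programming symbolic regression with a
# 70/30 split, illustrating the kind of setup described in this section.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 2))                              # columns: vocabulary, grammar (simulated)
y = 0.4 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=340)

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=0)

est = SymbolicRegressor(
    population_size=2000, generations=30,
    function_set=("add", "sub", "mul", "div", "sin", "cos"),  # researcher-chosen operators
    parsimony_coefficient=0.01,                               # penalizes overly complex solutions
    random_state=0,
)
est.fit(X_train, y_train)
print(est._program)                                           # best evolved expression
print("validation R2:", r2_score(y_val, est.predict(X_val)))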

Fit, sensitivity percentages, and magnitudes

As noted earlier, multiple fit statistics were computed to assess the performance of the solutions: R, R², MAE, and MSE. In addition, negative and positive sensitivity7 magnitudes and percentages were computed for the variables in the chosen models. Sensitivity is estimated based on Equation 10.2 as follows:

Sensitivity = (∂Y / ∂x) × (std(x) / std(Y))    (10.2)

where ∂Y/∂x is the partial derivative of Y with respect to x; std(x) is the standard deviation of x, the independent variable; and std(Y) is the standard deviation of Y, the dependent variable.

Sensitivity gives the direction and magnitude of the association between independent and dependent variables. For example, a sensitivity value of 0.7 indicates that if variable X increases by one unit, then variable Y will increase by 0.7 units. This index is the sum of the positive and negative values: for example, if X impacts Y positively 75% of the time with a magnitude of 3 and impacts Y negatively 25% of the time with a magnitude of 2, the total effect will be 1.75 (3 × 0.75 − 2 × 0.25 = 2.25 − 0.5 = 1.75).
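The sketch below computes the sensitivity index of Equation 10.2 numerically, approximating the partial derivative with a central finite difference and averaging it over the observations; Eureqa's internal computation may differ in detail, and the model function and data here are illustrative assumptions only.

# Numerical sketch of Equation 10.2: (∂Y/∂x) scaled by std(x)/std(Y).
import numpy as np

def sensitivity(model, X, col, h=1e-4):
    X = np.asarray(X, float)
    up, down = X.copy(), X.copy()
    up[:, col] += h
    down[:, col] -= h
    dY_dx = (model(up) - model(down)) / (2 * h)            # per-observation partial derivative
    y = model(X)
    return np.mean(dY_dx) * X[:, col].std() / y.std()      # averaged, scaled sensitivity

model = lambda X: 0.35 * X[:, 0] + 0.36 * X[:, 1] - 0.18 * X[:, 0] ** 2   # illustrative model
rng = np.random.default_rng(0)
X = rng.normal(size=(340, 2))
print("vocabulary sensitivity:", sensitivity(model, X, col=0))
print("grammar sensitivity:", sensitivity(model, X, col=1))

# Total effect as the weighted sum of positive and negative components,
# matching the worked example in the text: 0.75*3 - 0.25*2 = 1.75
print(0.75 * 3 - 0.25 * 2)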

Examining performance and progress

To evaluate the performance of the proposed solutions and their progress across time, two statistics were used: stability and maturity. Stability is an index describing for how many "generations" the optimal solutions have made no further improvements, and maturity describes when the optimal solutions improve. Both have a range between zero and one, and when they approach one, the solutions are unlikely to improve further. At this point, the analysis should be stopped.
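Eureqa reports these indices internally; as a rough, purely illustrative analogue, one could track the share of recent generations in which the best error has not improved, as in the sketch below. This is an assumed stand-in for the idea of stability, not Eureqa's actual formula.

# Illustrative stability-style index: proportion of recent generations with
# no improvement in the best error; values near 1 suggest the search has stalled.
def stability(best_error_history, window=20, tol=1e-6):
    recent = best_error_history[-window:]
    no_gain = sum(1 for prev, cur in zip(recent, recent[1:]) if prev - cur < tol)
    return no_gain / max(len(recent) - 1, 1)

history = [0.80, 0.61, 0.45, 0.40, 0.39] + [0.39] * 20   # best error per generation (invented)
print(stability(history))                                 # close to 1.0: little recent improvement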

Results: Linear regression model

A linear regression analysis with the ENTER method was conducted to examine how precisely vocabulary and grammatical knowledge would predict reading comprehension test performance. The regression model was significant (R² = 0.368, R = 0.607, F[2,337] = 98.06, p < 0.001) and yielded an intercept of −0.161 (p < 0.001), thus giving the following solution:

θR = −0.161 + 0.406 × α + 0.203 × β    (10.3)

Equation 10.3 indicates that a one-unit increase in vocabulary knowledge (α) will induce a linear 0.406-unit change in reading ability, and a one-unit increase in grammar (β) will induce a linear 0.203-unit change in reading ability. The R² value of 0.368 indicates that 36.8% of the observed variance in reading is explained by vocabulary and grammar knowledge.

Results: EA-based symbolic regression analysis

Model estimation and selection

The analysis reached maturity and stability values of 97%, indicating that the solutions would not improve any further. As presented in Table 10.1, 22 solutions survived and continuously improved. The R coefficients vary between 0.242 (model ID = 1) and 0.721 (model ID = 22), suggesting varying degrees of correlation between model-estimated and observed reading abilities. In addition, R² coefficients range between 0.059 and 0.520, with MSE and MAE indices varying between 0.313 and 0.603 and between 0.415 and 0.602, respectively.

Table 10.1  Competing models with their R, R², MSE, and MAE indices

ID 22 (complexity 38): θR = 0.3606×β + 0.3495×α + 0.04242×sma(β, 8) + 0.8021×wma(α, 6)×smm(α, 8) + 0.3792×β² − 0.1398 − 0.7195×β×sma(β, 8) − 0.1828×α²; R = 0.721, R² = 0.520, MSE = 0.313, MAE = 0.415
ID 21 (complexity 36): θR = 0.3625×α + 0.3343×β + 0.06713×sma(β, 8) + 0.8021×sma(α, 8)×wma(α, 3) + 0.5033×β² − 0.1395 − 1.139×β×sma(β, 8) − 0.1603×α²; R = 0.708, R² = 0.494, MSE = 0.324, MAE = 0.423
ID 20 (complexity 30): θR = 0.2624×smm(β, 2) + 0.01434×α + 0.8021×wma(α, 8)² + 0.6742×tanh(1.084×α − 0.06106) − 0.09854 − 0.1273×α²; R = 0.696, R² = 0.478, MSE = 0.334, MAE = 0.434
ID 19 (complexity 28): θR = 0.2515×smm(β, 2) + 0.02613×α + 0.8021×wma(α, 2)×wma(α, 10) + 0.8021×tanh(1.084×α − 0.06106) − 0.09888 − 0.232×α²; R = 0.656, R² = 0.419, MSE = 0.372, MAE = 0.447
ID 18 (complexity 26): θR = 0.4394×α + 0.3765×smm(β, 2) + 0.8021×wma(α, 2)×wma(α, 10) − 0.122 − 0.2765×α²; R = 0.673, R² = 0.449, MSE = 0.352, MAE = 0.439
ID 17 (complexity 25): θR = 0.5127×β + 0.3279×α + 0.8021×sma(α, 6)×smm(α, 4) − 0.05713 − 0.1931×α²; R = 0.672, R² = 0.445, MSE = 0.355, MAE = 0.444
ID 16 (complexity 24): θR = 0.3955×mma(β, 8) + 0.8021×sma(α, 5)×sma(α, 8) + 0.6542×tanh(1.084×α − 0.06106) − 0.09814; R = 0.674, R² = 0.453, MSE = 0.350, MAE = 0.438
ID 15 (complexity 23): θR = 0.4858×β + 0.3596×α + 0.8021×sma(α, 6)×smm(α, 4) − 0.1466 − 0.1346×α²; R = 0.668, R² = 0.431, MSE = 0.364, MAE = 0.450
ID 14 (complexity 21): θR = 0.3568×β + 0.8021×sma(α, 5)×sma(α, 11) + 0.5565×tanh(1.084×α − 0.06106) − 0.1192; R = 0.668, R² = 0.444, MSE = 0.355, MAE = 0.450
ID 13 (complexity 20): θR = 0.4444×mma(α, 9) + 0.3761×β + 0.8021×sma(α, 7)² − 0.1203; R = 0.660, R² = 0.432, MSE = 0.364, MAE = 0.454
ID 12 (complexity 17): θR = 0.4699×β + 0.2577×α + 0.8021×sma(α, 5)×sma(α, 11) − 0.1404; R = 0.657, R² = 0.431, MSE = 0.365, MAE = 0.460
ID 11 (complexity 16): θR = (1.689×β + 0.8021×tanh(2.606×α − 0.1468) − 0.09958) / (4.417 − 1.42×β) − 0.09814; R = 0.649, R² = 0.421, MSE = 0.371, MAE = 0.466
ID 10 (complexity 15): θR = (1.616×sma(β, 5) + 1.139×β + 0.8696×α − 0.1161) / (4.586 − 1.42×β) − 0.09814; R = 0.640, R² = 0.406, MSE = 0.380, MAE = 0.470
ID 9 (complexity 14): θR = (1.139×β + 0.8021×tanh(2.256×α − 0.1271) − 0.06713) / (3.928 − 1.42×β) − 0.09814; R = 0.651, R² = 0.419, MSE = 0.372, MAE = 0.470
ID 8 (complexity 13): θR = (1.139×β + 0.8696×α + 0.8021×sma(β, 5) − 0.1161) / (4.087 − 1.42×β) − 0.09814; R = 0.631, R² = 0.399, MSE = 0.385, MAE = 0.474
ID 7 (complexity 12): θR = (1.139×β + 0.8021×tanh(1.084×α − 0.06106) − 0.06713) / (3.823 − 1.42×β) − 0.09814; R = 0.641, R² = 0.406, MSE = 0.380, MAE = 0.474
ID 6 (complexity 10): θR = (1.957×β + 0.8696×α − 0.1644) / (4.622 − 1.42×β) − 0.09814; R = 0.626, R² = 0.387, MSE = 0.392, MAE = 0.483
ID 5 (complexity 8): θR = (1.139×β + 0.8696×α − 0.1161) / (3.878 − 1.42×β) − 0.09814; R = 0.616, R² = 0.377, MSE = 0.399, MAE = 0.488
ID 4 (complexity 6): θR = (1.139×β − 0.06713) / (3.558 − 1.42×β) − 0.09814; R = 0.588, R² = 0.334, MSE = 0.426, MAE = 0.505
ID 3 (complexity 5): θR = 0.0301 + 0.5612×β; R = 0.580, R² = 0.334, MSE = 0.426, MAE = 0.510
ID 2 (complexity 3): θR = 0.4744×β − 0.1261; R = 0.533, R² = 0.285, MSE = 0.457, MAE = 0.528
ID 1 (complexity 1): θR = 1.139×β − 0.1653; R = 0.242, R² = 0.059, MSE = 0.603, MAE = 0.602

Note: α = vocabulary knowledge; β = grammar knowledge; θR = reading comprehension ability; sma = simple moving average; smm = smoothed moving average; wma = weighted moving average; tanh = hyperbolic tangent

As shown in Equation 10.4, the worst fitting model (R = 0.242; R² = 0.059; MSE = 0.603; MAE = 0.602) is expressed as follows:

θR = 1.139 × β − 0.1653    (10.4)

where only basic arithmetic operators are used. This is also the simplest model of all, as indicated by its complexity index (1). On the other hand, as shown in Equation 10.5, the fittest model (R = 0.721; R² = 0.520; MSE = 0.313; MAE = 0.415) is mathematically expressed as follows:

θR = 0.3606 × β + 0.3495 × α + 0.04242 × sma(β, 8) + 0.8021 × wma(α, 6) × smm(α, 8) + 0.3792 × β² − 0.1398 − 0.7195 × β × sma(β, 8) − 0.1828 × α²    (10.5)

where summation, subtraction, multiplication, power, simple moving average, simple moving median, and weighted moving average operators are used. As expected, the model is the most complex (complexity = 38) among all models; relative to the second-best model (R = 0.708; R² = 0.494; MSE = 0.324; MAE = 0.423), it has a larger R and R² and lower MSE and MAE, with only a two-unit difference in complexity. The functions in this model belong to the "moving averages" group, where average points are calculated based on the average of the previous subset of numbers. Due to their efficiency and ease of calculation, moving averages can be applied in modeling both time series and static data (e.g., see Aryadoust, 2015). The simple moving average (sma) calculates the mean of the previous data; the weighted moving average (wma) calculates the weighted average of the previous data; and the smoothed moving average (smm) is the exponential moving average of the previous data. Adding weights or exponents is an attempt to improve the precision of prediction.
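To illustrate these three building blocks, the sketch below computes simple, weighted, and smoothed (exponential) moving averages over a series of scores with pandas. The window sizes, weights, and data are illustrative assumptions and do not reproduce Eureqa's internal implementations.

# Sketch of the moving-average building blocks used in the fittest model:
# sma (simple), wma (weighted), and smm (smoothed/exponential) moving averages.
import numpy as np
import pandas as pd

beta = pd.Series(np.random.default_rng(0).normal(size=50))   # e.g., a series of grammar measures

sma_8 = beta.rolling(window=8).mean()                          # simple moving average over 8 points
weights = np.arange(1, 7)                                      # linearly increasing weights for wma
wma_6 = beta.rolling(window=6).apply(lambda w: np.dot(w, weights) / weights.sum(), raw=True)
smm_8 = beta.ewm(span=8, adjust=False).mean()                  # exponential (smoothed) moving average

print(pd.DataFrame({"sma(β,8)": sma_8, "wma(β,6)": wma_6, "smm(β,8)": smm_8}).tail())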

Sensitivity analysis

Table 10.2 presents the results of the sensitivity analysis, which examines the relative impact of vocabulary and grammar in the fittest model on students' reading comprehension. The far-left column presents the identifying numbers of the estimated models, which match the IDs presented in Table 10.1. The other columns include the variables (α and β), sensitivity, percentage and magnitude positive, and percentage and magnitude negative. Sensitivity coefficients of α (vocabulary) and β (grammar) in the fittest model are 1.056 and 0.426, respectively, with positive sensitivity percentages of 86% (magnitudeα = 1.114) and 99% (magnitudeβ = 0.429) and negative sensitivity percentages of 14% (magnitudeα = 0.683) and 1% (magnitudeβ = 0.004), suggesting that vocabulary has a considerably higher impact on reading proficiency than does grammar. This further indicates that an increase in α and β will lead to an increase in reading proficiency 86% and 99% of the time, respectively, and to a decrease 14% and 1% of the time, respectively. The greater impact of vocabulary is further evidenced by its higher sensitivity coefficient across the other models, except models 1 through 4.

Table 10.2  Sensitivity analysis of the variables in the models

ID  Variable  Sensitivity  % Positive  Positive magnitude  % Negative  Negative magnitude
22  α  1.056  86%  1.114  14%  0.683
22  β  0.426  99%  0.429  1%  0.004
21  α  0.947  84%  1.001  16%  0.656
21  β  0.432  84%  0.479  16%  0.183
20  α  1.245  85%  1.324  15%  0.823
20  β  0.357  100%  0.357  0%  0
19  α  1.075  88%  1.134  12%  0.633
19  β  0.284  100%  0.284  0%  0
18  α  0.974  86%  1.060  14%  0.459
18  β  0.478  100%  0.478  0%  0
17  α  1.103  86%  1.176  14%  0.656
17  β  0.576  100%  0.576  0%  0
16  α  1.299  82%  1.374  18%  0.950
16  β  0.523  100%  0.523  0%  0
15  α  1.160  87%  1.223  13%  0.729
15  β  0.544  100%  0.544  0%  0
14  α  1.137  82%  1.201  18%  0.850
14  β  0.461  100%  0.461  0%  0
13  α  1.449  85%  1.519  15%  1.0462
13  β  0.577  100%  0.577  0%  0
12  α  1.031  80%  1.077  20%  0.846
12  β  0.644  100%  0.644  0%  0
11  α  0.769  100%  0.769  0%  0
11  β  0.280  100%  0.280  0%  0
10  α  1.237  100%  1.237  0%  0
10  β  0.438  100%  0.438  0%  0
9  α  0.704  100%  0.704  0%  0
9  β  0.347  100%  0.347  0%  0
8  α  0.975  100%  0.975  0%  0
8  β  0.461  100%  0.461  0%  0
7  α  0.757  100%  0.757  0%  0
7  β  0.328  100%  0.328  0%  0
6  α  0.760  100%  0.760  0%  0
6  β  0.358  100%  0.358  0%  0
5  α  0.661  100%  0.661  0%  0
5  β  0.472  100%  0.472  0%  0
4  β  0.985  100%  0.985  0%  0
3  β  1  100%  1  0%  0
2  β  1  100%  1  0%  0
1  β  1  100%  1  0%  0

Note: α = vocabulary knowledge; β = grammar knowledge

Discussion

This chapter presented the application of EA-based symbolic regression in language assessment. The primary implication of the study is that although reducing the interdependence of cognitive attributes to strictly linear relationships could render interpretation simple, it nevertheless has a trade-off: it can reduce precision.

There are several aspects of EA-based symbolic regression that should be considered in interpreting nonlinear relationships among variables. First, one should inspect goodness of fit statistics (R and R² alongside MSE and MAE) and complexity indices. R is the correlation between the actual and the model-predicted values of the dependent variables; R² shows how close the data are to the regression function. In addition, in EA-based symbolic regression analysis, there is a large number of mathematical functions and operators from which researchers can choose to predict the amount of variation in dependent variables. The pool of operators would help improve the precision of the regression models significantly, although this is not without certain limitations. One limitation is the choice of operators and functions for the data, which can be addressed by adopting the Occam's razor principle. Occam's razor holds that where there are two competing explanations for an occurrence or phenomenon, the simpler one is the better choice. Accordingly, where there are two predictive models in a research study, the simpler model should be chosen, provided that the two models are equally precise and accurate. It is therefore important that researchers start off by testing less complex models and compare their MSE and MAE with those of models of higher complexity. A less complex model with low MSE and MAE indices is preferable (Aryadoust, 2015). Of course, this can be time-consuming and would require high computational power, which can be facilitated via cloud computing (see Aryadoust, 2015, for an example), which uses a group of remote computer servers hosted on the Internet to process or store data.

Another aspect of EA-based symbolic regression is the interpretation of the models. In simpler models, it is relatively easy to determine the implications of the model for theory postulation and prediction, as they typically take a linear form. In complex models, however, such as those containing, for example, trigonometry (sine, cosine, and tangent), one can apply either of two possible approaches. The first approach is informed by data mining research, where the accuracy of the model is pursued rather than the type of mathematical functions that are used to arrive at the desirable accuracy level. For example, ANFIS and artificial neural networks are two predictive and classification models where the functions are chosen according to model precision, which is measured by R, R², MSE, and MAE (Aryadoust, 2013; Aryadoust & Goh, 2014). The second approach to interpreting EA-based symbolic regression is to interpret the optimal models in terms of both the fit statistics and their content. The optimal model represented in Equation 10.5 includes sma, wma, smm, and square functions. In other words, while variations in some parts of the data are captured by an sma function, other parts can be explained by square values, smm, and wma. In addition, there is a certain amount of linearity in the data that can be captured by the 0.3606 × β and 0.3495 × α segments of the model, but the entire variation in the data cannot be captured by resorting to a linear paradigm. It can be said that despite the simplicity of linear models and the ease of interpreting them, if higher precision is desirable, it is important to investigate whether nonlinear models offer better solutions.
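The Occam's razor heuristic described above can be expressed as a small model-selection rule, as in the sketch below: among models whose error is within a small margin of the best, prefer the least complex one. The candidate list reuses a few values from Table 10.1, but the 5% margin is an arbitrary illustrative threshold, not one prescribed by Eureqa or by the study.

# Sketch of an Occam's razor shortlist: keep models whose MSE is within 5% of
# the best, then pick the least complex of those.
candidates = [
    {"id": 22, "complexity": 38, "MSE": 0.313},
    {"id": 21, "complexity": 36, "MSE": 0.324},
    {"id": 20, "complexity": 30, "MSE": 0.334},
    {"id": 13, "complexity": 20, "MSE": 0.364},
]

best_mse = min(m["MSE"] for m in candidates)
margin = 1.05 * best_mse                                   # within 5% of the best error
shortlist = [m for m in candidates if m["MSE"] <= margin]
choice = min(shortlist, key=lambda m: m["complexity"])
print(choice)    # the simplest of the nearly equally precise models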


Figure 10.1  A two-plate model depicting the relationship between reading and lexicogrammatical knowledge.

It is commonly held that the precision of prediction modeling in the social sciences is relatively low. However, it was shown that a nonlinear mechanism with higher precision, which is graphically illustrated in Figure 10.1, could underlie the relatively weaker proportional (linear) relationships between reading and the predicting variables. The nonlinear EA-based symbolic formula displayed in the lower plate of Figure 10.1 is interpretable using a rationale similar to that of linear regression. That is, the sensitivity statistics of the EA-based symbolic regression describe the type of relationship between the independent and dependent variables. It was shown that the relationship between the predictors and the dependent variable was primarily positive; that is, increases in the predictors (lexical and grammatical knowledge) would result in increases in reading performance in 75% of the cases. Nevertheless, it was also found that this positive nonlinear relationship does not hold for vocabulary in 1% of the cases and for grammar in 14% of the cases. It is suggested that this unexpected relationship might have been a cause for the linear regression model to have a higher error in prediction, as linear regression cannot capture such anomalous cases. The presence and nature of the anomalies would have certain implications for the theory of reading, its assessment, and quantitative modeling. First and foremost, while it is strongly associated with readers' lexicogrammatical knowledge, reading comprehension in a small portion of certain populations does not seem to be strongly associated with grammatical knowledge, as shown by the EA-based symbolic analyses. This finding could result from the nature of the grammar test in the study. Most grammar tests, such as the one in this study, assess form and meaning but overlook pragmatic meaning, which is integral to grammatical knowledge (Purpura, 2004) as well as reading (Kintsch, 1998). Underrepresented lexicogrammatical knowledge in assessments results in an underrepresented measure of these knowledge bases, which can affect the precision of prediction modeling. Another possible reason for the negative sensitivity of the grammatical knowledge could be that the subsample with negative

sensitivity might have compensated for their low grammatical ability through relying on, for example, vocabulary knowledge, which has a much stronger nonlinear association with reading; they might have also relied on other influential factors that were not tested in this study, such as cognitive and metacognitive strategies. These hypothetical inferences should be further investigated in future research. A commonality between the linear and nonlinear models in this chapter was that both indicated that a part of the observed variation in reading comprehension still remains unexplained by the lexicogrammatical variables. The theoretical implication of this finding is that rich lexicogrammatical knowledge would facilitate reading comprehension, as predicted by Kintsch (1998) and Khalifa and Weir (2009). However, although Kintsch (1998), Khalifa and Weir (2009), and others attach high importance to the role of lexicogrammatical knowledge in reading (e.g., Perfetti & Stafura, 2014), the rather low R² value yielded in the linear regression analysis is not fully consistent with the theoretical expectations of reading, as the unexplained proportion of variance was fairly high in the linear regression (63.2%) but relatively lower in EA-based symbolic regression (48%). Khalifa and Weir (2009) argue that readers' mental lexicon, comprising orthography, morphology, and phonology, is one of the primary knowledge bases that is activated when visual stimuli are transferred to the brain. Subsequently, word meaning is accessed and grammatical knowledge is applied to help the reader parse the message and achieve local and global comprehension. However, there is no indication of the linearity of the relationship between these cognitive attributes. In linear regression models, the data lying closer to the regression trendline follow a proportional orderliness, suggesting that the lexicogrammatical knowledge of some students is rather linearly related to their reading comprehension ability. Nevertheless, the data lying far from the line would melt "smoothly into chaos" (Hofstadter, 1986, p. 364), characterizing the relationship between reading and lexicogrammatical knowledge as being nonlinear. Finally, the effect of vocabulary knowledge in both the EA-based and linear regressions was greater than the effect of grammatical knowledge. This resonates with the findings of previous research and models of reading comprehension (e.g., Kintsch, 1998; Perfetti & Stafura, 2014). However, EA-based symbolic regression found a greater vocabulary impact on reading comprehension (as indicated by the sensitivity indices), suggesting that linearity assumptions can underestimate the postulated relationship.

Conclusion

It is possible to carry out scientific predictions with high precision. Due to the complexity of the cognitive mechanisms of humans, it is plausible to presume that every reader would follow a rather different pattern and that there are many hidden conditions in data, such as cases of disproportionately higher grammar knowledge than vocabulary (or vice versa) or well-developed reading and test-taking strategies. In contrast, some readers might have an extended vocabulary

repertoire but a comparatively poorer command of grammar and lower reading speed. These differences might strike observers as being less predictable than desired, but researchers should note the significant implications of such patterns in theory building and testing. This chapter presented a case for EA-based symbolic regression and examined its precision. The model was found to be more precise than linear models, providing a more realistic picture of the cognitively based data. It should be noted that we did not test for the linearity assumptions in the data (see Chapter 10, Volume I), but even in the presence of linear relationships between inputs and outputs, the problem of multicollinearity (high correlation between input variables) often affects the precision of linear regression. Thus, it is recommended that researchers explore the fit of EA-based symbolic regression and compare it with the more "conventional" linear techniques to generate a more profound insight into the underlying patterns of the data.

Notes

1. In this chapter, model or mathematical model is used as a general term to describe systems and/or data sets by using mathematical language and concepts.
2. In this chapter, we use symbolic functional operator to refer to a variety of mathematical operations such as addition (+), multiplication (× or *), subtraction (−), etc.
3. In this chapter, mathematical model, model, and solution are used interchangeably despite their (subtle) differences.
4. Nutonian, which was the company that developed Eureqa, has been acquired by DataRobot, which produces predictive software.
5. A list of software packages with similar genetic-based capabilities is available from http://geneticprogramming.com/software/.
6. Note that a description of these functions is not covered in this chapter because we aim to show their use rather than their shape and formal representations.
7. Sensitivity as measured in the EA-based symbolic regression analysis is different from sensitivity in the classification and regression trees (CART; see Chapter 9, Volume II).

References

Alderson, J. C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.
Aryadoust, V. (2013). Predicting item difficulty in a language test with an adaptive neuro fuzzy inference system. Proceedings of IEEE Workshop on Hybrid Intelligent Models and Applications (HIMA), 43–55. doi: 10.1109/HIMA.2013.6615021
Aryadoust, V. (2015). Evolutionary algorithm-based symbolic regression for language assessment: Towards nonlinear modeling of data-driven theories. Psychological Test and Assessment Modeling, 57(3), 301–337.
Aryadoust, V. (2016). Application of genetic algorithm-based symbolic regression in ESL writing research. In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and practice: The view from the Middle East and the Pacific Rim (pp. 35–46). Newcastle, UK: Cambridge Scholars Publishing.
Aryadoust, V., & Goh, C. C. M. (2014). Predicting listening item difficulty with language complexity measures: A comparative data mining study. CaMLA Working Papers, 2014-01. Ann Arbor, MI: CaMLA.

Banzhaf, W., Nordin, P., Keller, R., & Francone, F. (1998). Genetic programming: An introduction. San Francisco, CA: Morgan Kaufmann.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT® scores to academic performance: Some evidence from American universities. Language Testing, 29, 421–442.
Chong, E. K. P., & Żak, S. H. (2013). An introduction to optimization (4th ed.). Hoboken, NJ: Wiley.
De Castro, L. N., & Timmis, J. (2002). Artificial immune systems: A new computational intelligence approach. London: Springer.
Dervis, K. (2010). Artificial bee colony algorithm. Scholarpedia, 5(3), 6915.
Dorigo, M., & Stültze, T. (2004). Ant colony optimization. Cambridge, MA: MIT Press.
Eiben, A. E., & Schoenauer, M. (2002). Evolutionary computing. Information Processing Letters, 82(1), 1–6.
Field, J. (2004). An insight into listeners' problems: Too much bottom-up or too much top-down? System, 32, 363–377.
Fister, I., Jr, Yang, X.-S., Fister, I., Brest, J., & Fister, D. (2013). A brief review of nature-inspired algorithms for optimization. ELEKTROTEHNISKI VESTNIK, 80(3), 1–7.
Fogel, D. B., & Corne, D. W. (1998). An introduction to evolutionary computation for biologists. In D. B. Fogel (Ed.), Evolutionary computation: The fossil record (pp. 19–38). Piscataway, NJ: IEEE Press.
Fogel, L. J. (1999). Intelligence through simulated evolution: Forty years of evolutionary programming. New York: Wiley.
Gwiazda, T. G. (2006). Genetic algorithms reference. Vol. 1: Crossover for single-objective numerical optimization problems. Lomianki, Poland.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis: A global perspective (7th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Hofstadter, D. R. (1986). Metamagical themas: Questing for the essence of mind and pattern. New York: Bantam Books.
Holland, J. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.
Jung, J. (2009). Second language reading and the role of grammar. Teachers College, Columbia University Working Papers in TESOL and Applied Linguistics, 9(2), 29–48.
Khalifa, H., & Weir, C. (2009). Examining reading: Research and practice in assessing second language reading. Studies in Language Testing, Vol. 29. Cambridge, UK: UCLES and Cambridge University Press.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 92, 163–182.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge, UK: Cambridge University Press.
Kordon, A., Pham, H., Bosnyak, C., Kotanchek, M., & Smits, G. (2002). Accelerating industrial fundamental model building with symbolic regression: A case study with structure–property relationships. In D. Davis & R. Roy (Eds.), Proceedings of the Genetic and Evolutionary Computing Conference (GECCO) (pp. 111–116). New York: Morgan Kaufmann.
Koza, J. (2010). Human-competitive results produced by genetic programming. Genetic Programming and Evolvable Machines, 11, 251–284.
Liao, J.-Q., Hu, X.-B., Wang, M., & Leeson, M. S. (2013). Epidemic modelling by ripple-spreading network and genetic algorithm. Mathematical Problems in Engineering, 2013. Retrieved from http://dx.doi.org/10.1155/2013/506240

Linacre, J. M. (2018). Winsteps® (Version 4.2) [Computer software]. Beaverton, OR: Winsteps.com. Retrieved from www.winsteps.com
McRee, R. K. (2010). Symbolic regression using nearest neighbor indexing. Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO 10, 1983–1990.
Michalski, R. S. (2000). Learnable evolution model: Evolutionary processes guided by machine learning. Machine Learning, 38, 9–40.
Nation, P. (2005). Teaching vocabulary. Asian EFL Journal, 7(3). Retrieved from https://www.asian-efl-journal.com/September_2005_EBook_editions.pdf
Nicoară, E. S. (2009). Mechanisms to avoid the premature convergence of genetic algorithms. Economic Insights: Trends and Challenges, 61(1), 87–96.
Nicolis, G. (1999). Introduction to nonlinear science. Cambridge, UK: Cambridge University Press.
Perfetti, C., & Stafura, J. (2014). Word knowledge in a theory of reading comprehension. Scientific Studies of Reading, 18, 22–37.
Phakiti, A. (2008). Construct validation of Bachman and Palmer's (1996) strategic competence model over time in EFL reading tests. Language Testing, 25(2), 237–272.
Purpura, J. E. (1997). An analysis of the relationships between test takers' cognitive and metacognitive strategy use and second language test performance. Language Learning, 47, 289–294.
Purpura, J. E. (2004). Assessing grammar. Cambridge, UK: Cambridge University Press.
Rastovic, D. (2010). Quasi-self-similarity for laser-plasma interactions modelled with fuzzy scaling and genetic algorithms. In C. Myers (Ed.), Stochastic control (pp. 493–504). London: Intech.
Schmidt, M., & Lipson, H. (2008). Coevolution of fitness predictors. IEEE Transactions on Evolutionary Computation, 12(6), 736–749.
Schmidt, M., & Lipson, H. (2009). Distilling free-form natural laws from experimental data. Science, 324(5923), 81–85.
Schmidt, M., & Lipson, H. (2010). Comparison of tree and graph encodings as function of problem complexity. In D. Thierens, H. Beyer, J. Bongard, J. Branke, J. A. Clark, D. Cliff et al. (Eds.), GECCO 07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (Vol. 2, pp. 1674–1679). London: ACM Press.
Schmidt, M., & Lipson, H. (2011). Age-fitness Pareto optimization. Genetic Programming Theory and Practice, 8, 129–146.
Schmidt, M., & Lipson, H. (2013). Eureqa [Computer package], Version 0.99 beta. Retrieved from www.nutonian.com
Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Weir, C. J. (2005). Language testing and validation. Basingstoke, UK: Palgrave Macmillan.
Zelinka, I., Chen, G., & Celikovsky, S. (2010). Chaos synthesis by evolutionary algorithms. In I. Zelinka, S. Celikovosky, H. Richter, & G. Chen (Eds.), Evolutionary algorithms and chaotic systems (pp. 345–382). Berlin: Springer-Verlag.
Zhang, L. M., Goh, C. M. C., & Kunnan, A. J. (2014). Analysis of test takers' metacognitive and cognitive strategy use and EFL reading test performance: A multi-sample SEM approach. Language Assessment Quarterly, 11(1), 76–120.

Index

a posteriori estimates 41
abs(fcor) 84, 91
absolute fit indices: HDCM 84, 91, 93; SEM 108
acquisition order view 85
adaptive neuro-fuzzy inference system (ANFIS) 216, 228
ADCM see additive diagnostic classification model (ADCM)
additive diagnostic classification model (ADCM) 80, 81, 82
adjusted goodness of fit index (AGFI) 108
advanced item response theory (IRT) models: hierarchical diagnostic classification model (HDCM) 79–98; log-linear cognitive diagnosis modeling (LCDM) 56–78; mixed Rasch model (MRM) 15–32; multidimensional Rasch model 33–55
advanced statistical methods: longitudinal multilevel modeling 171–89; multilevel modeling (MLM) 150–70; structural equation modeling (SEM) 101–26; student growth percentiles (SGPs) 127–49
AFPO see age-fitness Pareto optimization (AFPO)
age-fitness Pareto optimization (AFPO) 218–19
AGFI see adjusted goodness of fit index (AGFI)
AIC see Akaike information criterion (AIC)
Akaike information criterion (AIC): log-linear cognitive diagnosis modeling (LCDM) 65, 66; longitudinal multilevel modeling 178; mixed Rasch model (MRM) 22, 23, 23; multidimensional Rasch model 40, 46; multilevel modeling (MLM) 158; penalty for model complexity 84; quantile regression (QR) 133; structural equation modeling (SEM) 108, 109
Amazon EC2 Secure Cloud 223
AMOS software 106, 120
analysis of covariance (ANCOVA) 6
analysis of covariance structures 101; see also structural equation modeling (SEM)
analysis of variance (ANOVA) 5–6
ANCOVA see analysis of covariance (ANCOVA)
ANFIS see adaptive neuro-fuzzy inference system (ANFIS)
ANN see artificial neural network (ANN)
ANOVA see analysis of variance (ANOVA)
ant colonies 215
ant colony optimization 217
artificial bee colony algorithm 215
artificial immune systems 215
artificial neural network (ANN) 215, 216, 224
attribute hierarchies 89, 90
attribute mastery classes 57
attribute mastery profile 57, 59
attribute profile 86, 95n4
automated interaction detection 194
base rate probabilities of mastery 60
Bayesian information criterion (BIC): LCDM and listening 65, 66; MRM and reading comprehension 22, 23, 23; penalty for model complexity 84; quantile regression (QR) 133; SEM 108, 109
Bayesian Markov Chain Monte Carlo (MCMC) 167
Bayesian methods 41
between-item dimensionality 36, 37, 38

BIC see Bayesian information criterion (BIC)
bifactor model 50n1
bifactor Rasch model 39, 40, 44, 49
Bollen-Stine p value 106
bootstrapping 106, 167
bottom-up cognitive processes 61
C-RUM see compensatory reparameterized unified model (C-RUM)
C-test 49
CAIC see consistent Akaike information criterion (CAIC)
Cambridge-Michigan Language Assessment (CaMLA) listening test see LCDM and listening
CB-SEM see covariance-based (CB) SEM
CDA see cognitive diagnostic assessment (CDA)
CDM see cognitive diagnostic model (CDM)
CDSLRC test see cognitive diagnostic L2 reading comprehension (CDSLRC) test
CEFR see Common European Framework of Reference for Languages (CEFR)
CELEX 201
centering 187n1
CFA see confirmatory factor analysis (CFA)
CFA model see confirmatory factor analysis (CFA) model
CFI see comparative fit index (CFI)
chi-square (χ2) 108
chi-square difference tests 157, 159, 163
classical test theory (CTT) 1, 57, 79
classification and regression trees (CART) 193–214; cross-validation 196–8; examination of fit 199; Gini index 198–9; goal 197; hypothetic example regarding fairness of assessment 195–6; IF-THEN rules 208, 209–10; nonparametric technique 211; overgrowth 199; overview 11; pruning 199; purpose 194; sample study see predicting item difficulty; tree-growing rules 198; variable importance index (VII) 199–200, 207, 207–8
classification trees 194, 198
cloud computing 228
coefficient of determination 199
cognitive diagnostic assessment (CDA): aim of 56; DCMs 56; defined 56; how it works 56; listening assessment 62; uses 56
cognitive diagnostic L2 reading comprehension (CDSLRC) test 58, 75
cognitive diagnostic model (CDM) 8
Coh-Metrix 200–1, 201–2
cohesive knowledge 89
cohesive relationships among sentences 94
Common European Framework of Reference for Languages (CEFR) 33, 42, 153
comparative fit index (CFI) 108, 110
compensatory diagnostic classification model (DCM) 58, 79
compensatory language skills 57
compensatory Rasch model 39–40
compensatory reparameterized unified model (C-RUM) 81, 82
complex structure items 58, 59, 64
complexity indices 228
concept-based mathematical paradigm 216
confirmatory diagnostic classification model (DCM) 56
confirmatory factor analysis (CFA) model 65, 102, 115, 115, 116
confirmatory Rasch bifactor model 39; see also bifactor Rasch model
conjecture-based approach 64
conjunctive diagnostic classification model (DCM) 80, 82
Conquest software 41, 45
consistent Akaike information criterion (CAIC) 22, 23, 23
construct-relevant variables 150
construct validation 16
convergent hierarchy 89, 90
coreference 202
correlated factors Rasch model 38, 44, 45
correlation coefficient (R) 219, 224–6
covariance 107
covariance-based (CB) SEM 107
covariance structure modeling 101; see also structural equation modeling (SEM)
criterion-referenced assessment 33, 135
cross-validation 196–8
crossover 218
CTT see classical test theory (CTT)
data mining 193; see also nature-inspired data mining
data modeling 216
DataRobot 231n4
daughter node 196

DCM see diagnostic classification model (DCM)
decision trees technique 194; see also classification and regression trees (CART)
DELTA see Diagnostic English Language Tracking Assessment (DELTA)
deterministic input, noisy, and gate (DINA) model 57, 82
deterministic input, noisy, or gate (DINO) model 82
developmental scale 127
deviance 40, 177
diagnostic classification model (DCM): additive 80, 82; applications 57; attribute mastery classes 57; attribute mastery profiles 57; attributes, items 83–4; categorization 81, 82; compensatory model 58, 80; confirmatory 56; conjunctive 80, 82; disjunctive 80, 82; general 81, 82; hierarchical see hierarchical diagnostic classification model (HDCM); limitations 79; model fit 84; multidimensional 56; non-compensatory model 58, 80; other measurement models, distinguished 57, 79; primary function 57; probabilistic 56; reparameterizations of each other 81; research studies 83; sample size 83; selecting the correct model 82–4; specific 80, 82
Diagnostic English Language Tracking Assessment (DELTA) 112–13, 114
DIF see differential item functioning (DIF)
differences in degree vs. differences in kind 15, 17
differential item functioning (DIF): differences in kind 17; item characteristic curves (ICCs) 19; logistic regression method 19; Mantel-Haenszel procedure 19; mean scores of two groups not comparable 17; multidimensionality 17; qualitative differences 17; Rasch-based DIF analysis method 4; test fairness 119; two-stage analysis 16; unidimensionality principle 16; what it means 16
DINA see deterministic input, noisy, and gate (DINA) model
DINO see deterministic input, noisy, or gate (DINO) model
DINO-like items 74
directional effects 107
Discretize software 204
disjunctive diagnostic classification model (DCM) 80, 82
divergent hierarchy 89, 90
dummy variables 177, 187n2
EA-based symbolic regression see evolutionary algorithm-based (EA-based) symbolic regression
ECPE see Examination for the Certificate of Proficiency in English (ECPE)
EFA see exploratory factor analysis (EFA)
EIKEN test 157
ELP see English language proficiency (ELP)
emotioncy 20
endogenous variable 102
English language proficiency (ELP): formative assessment of English language proficiency 134–43; generally 131; longitudinal multilevel modeling 171; predictive relationship between diagnostic and proficiency test 111–19
Entropy partitioning 206
Eureqa software 218, 219–20, 223, 231n4
evolutionary algorithm-based (EA-based) symbolic regression 215–33; advantages 217; estimating fit 219–20; greater precision than linear models 231; interpretation of models 228; local optimum 218, 219; model solving 217–18; offspring 218; operators 217; overview 11; parent solution 218; premature convergence 218; sample study see predicting reading ability; selecting the best model 219; total model complexity 220; what it does 217
Examination for the Certificate of Proficiency in English (ECPE) 58
examining changes in L2 test scores 174–86; data analysis 176–7; data set 174–5; fixed effects 177, 179, 183; full maximum likelihood (FML) estimation 177; HLM6 software 176; limitations 186; model fit 178, 179, 183; null model 178; quadratic change pattern 180; questions to be answered 176; random effects 177–8, 179, 183; random-slopes model 181; reliability 178, 179, 183; results 177–86; time-invariant predictors 176–7; time-varying predictors 176
examining relationship of vocabulary to speaking ability 156–66; assessing need for MLM 159–60; building level-1 model 163–5; building level-2 model 165–6; data analysis 158–9; HLM7 software 159; intraclass correlation (ICC) 160; multilevel model results 161–2; null model 159; questions to be answered 157; vocabulary depth and size tests 156, 156
exogenous variable 102
expected cross-validation index (ECVI) 108, 109
explained common variance 39
exploratory factor analysis (EFA) 7
FA see factor analysis (FA)
factor analysis (FA) 57, 62
factor loading 107
fairness of assessment 119, 195–6
feeling for the words (emotioncy) 20
fixed coefficients 154
fixed effects 154, 177, 179, 183
Flesch-Kincaid grade level index 201
FML estimation see full maximum likelihood (FML) estimation
formative assessment 130
formative assessment of English language proficiency 134–43; data analysis 136–7; data collection and analyses requirements 134; instrument 135; participants 135; proficiency cut scores 135; results 137–43; smoothing the quantile function 136, 136
formative variable 103
full maximum likelihood (FML) estimation 166–7, 177
fusion model 58
G-DINA see generalized deterministic, noisy, and gate (G-DINA)
G theory see generalizability theory (G theory)
Gauss-Hermite quadrature 41, 45
GDCM see general diagnostic classification model (GDCM)
GDM see general diagnostic model (GDM)
general diagnostic classification model (GDCM) 81, 82
general diagnostic model (GDM) 82
generalizability theory (G theory) 1, 3; multivariate G theory 3–4; univariate G theory 3
generalized deterministic, noisy, and gate (G-DINA) 81, 82, 85–6
genetic and evolutionary algorithms 215
Germany see psychometric structure of listening test

GFI see goodness of fit index (GFI)
Gini index 198–9
global information-based fit indices 66
global optimum 219
GOF 133; see also goodness of fit index (GFI)
goodness of fit index (GFI) 108, 133
goodness of fit statistics 228
grammar tests 229
grammatical knowledge 221
grand-mean centering 187n1
granularity of attributes 83
HDCM and reading comprehension 87–95; absolute fit indices 91, 93; analysis procedures 89–91; attribute dependencies 89, 90; hypothesized hierarchies 90, 91; likelihood ratio (LR) test 91; Q-matrix 88–9; relative fit indices 91, 93; research questions 87; results 91–3
hierarchical diagnostic classification model (HDCM) 79–98; advantages 95; attribute hierarchy 86; extension of G-DINA 85, 95n3; main effects 86–7; overview 9; parsimony advantage 86; sample study see HDCM and reading comprehension
hierarchical DINA (HO-DINA) model 82
hierarchical linear modeling 152; see also multilevel modeling (MLM)
hierarchical log-linear cognitive diagnosis modeling (HLCDM) 81
hierarchical structure 150–1, 151
HLCDM see hierarchical log-linear cognitive diagnosis modeling (HLCDM)
HLM6 software 173, 176
HLM7 software 167–8
HO-DINA see hierarchical DINA (HO-DINA) model
hold-out set 197
hypernymy 202
ICBCs see item characteristic bar charts (ICBCs)
ICC see intraclass correlation (ICC)
ICCs see item characteristic curves (ICCs)
IELTS see International English Language Testing System (IELTS)
IF-THEN rules 208, 209–10
imageability 201
incremental indices 108
inferencing 75, 89, 94
infit mean square (MnSq) 204
information criteria measures 66–7
interaction parameters 59, 60
interactive language skills 57
intercept 153, 155, 155, 173
intercept-only model 159
intercept parameter 59
interindividual growth 187
internal consistency reliability 1
International English Language Testing System (IELTS) 113–14, 171
intraclass correlation (ICC) 160
intraindividual growth 187
inversing 106
Iran see HDCM and reading comprehension; MRM and reading comprehension
IRT see item response theory (IRT)
item aggregation 106
item analysis 2–3
item characteristic bar charts (ICBCs) 69, 71, 72, 73
item characteristic curves (ICCs) 19, 35, 35
item response theory (IRT) 1, 57, 79; see also advanced item response theory (IRT) models
Japan see multilevel modeling (MLM)
just-identified model 104, 105
k-fold cross-validation 197, 198, 206
Kenward-Roger adjustment 167
kurtosis 106
L-shaped distribution 106
latent class analysis (LCA) 57
latent growth modeling (LGM) 186
latent semantic analysis (LSA) 202
latent variable 101
LCA see latent class analysis (LCA)
LCDM see log-linear cognitive diagnosis modeling (LCDM)
LCDM and listening 63–76; attribute mastery classifications 73–4; background 63; conjecture-based approach 64; inferencing 75; item characteristic bar charts (ICBCs) 69, 71, 72, 73; measurement model (item-attributes relationship) 66–72; model fit 65–6, 66; model specifications 65; MPlus software 64, 76; paraphrasing 75; parsimonious model 65; probabilities of correct responses 69, 70; Q-matrix 64; Q-matrix and item parameter estimates 68; simple/complex structure item 64; structural model (attributes' relationship) 72, 73; study limitations 76; tetrachoric correlations
lexical knowledge 89
lexicogrammatical knowledge 220, 229, 229, 230
LGM see latent growth modeling (LGM)
likelihood ratio (LR) test 84, 91
linear hierarchy 89, 90
linear logistic model (LLM) 81, 82
linear logistic test model 41
linear regression 6–7, 222, 224, 229
linearity 193, 216
listening: bottom-up and top-down cognitive processes 61; core listening skills 62; implicit grammatical knowledge 61; inferencing 63; L1 vs. L2 listeners 61; paraphrasing 62–3; perceptual processing 61
listening/reading comprehension 34
listening test see psychometric structure of listening test
LLM see linear logistic model (LLM)
local independence 4
local optimum 218, 219
log-linear cognitive diagnosis modeling (LCDM) 56–78; advantages 58, 74; base rate probabilities of mastery 60; general framework subsuming other core DCMs 57; generalized linear model framework 58; interaction parameters 59, 60; intercept parameter 59; main effect parameter 59; modeling most core DCMs 57; overview 8–9; predicting item responses 60; Q-matrix 64, 74; research studies 57–8; retrofitting 8, 56; sample study see LCDM and listening; simple/complex structure items 58, 59; structural parameters 60
logarithm 106
logistic regression method 19
longitudinal multilevel modeling 171–89; HLM6 software 173; intraindividual/interindividual growth 187; overview 10–11; repeaters' test scores 172, 174, 186; time-varying/time-invariant predictors 173; see also multilevel modeling (MLM)
LR test see likelihood ratio (LR) test
LSA see latent semantic analysis (LSA)
macroproposition 201
MAE see mean absolute error (MAE)

main effect parameter 59
MANCOVA see multivariate analysis of covariance (MANCOVA)
manifest variable 101
Mantel-Haenszel procedure 19
many-facet Rasch measurement (MFRM) 5, 41
Mardia's coefficient 106, 115
marginal maximum likelihood (MML) estimation 41
marker variable 104
mastery profile 59
maturity 224
maximum likelihood (ML) estimation 106, 107, 117
MCMC see Bayesian Markov Chain Monte Carlo (MCMC)
mean absolute error (MAE) 219, 224–6
mean squared error (MSE) 219, 224–6
measure of textual lexical diversity (MTLD) 202
measurement model 1, 102, 115
medical research council (MRC) indices 201
MET listening test see LCDM and listening; predicting item difficulty
MFRM see many-facet Rasch measurement (MFRM)
Michigan English Test (MET) listening test see LCDM and listening; predicting item difficulty
microproposition 201
missing data: multilevel modeling (MLM) 167; structural equation modeling (SEM) 120; student growth percentiles (SGPs) 144–5
mixed effects modeling 152; see also multilevel modeling (MLM)
mixed Rasch model (MRM) 15–32; class-specific item profiles 18; drawback/limitation 29; heuristic purposes 18; invariance 18; latent classes 17; manifest variables 19; overview 7–8; previous applications of MRM 19–21; purpose 19; sample size 29; sample study see MRM and reading comprehension; survey research 29; test validation 29; unidimensionality 29; uses/advantages 28–9; what it does 17–18
ML method see maximum likelihood (ML) estimation
MLM see multilevel modeling (MLM)
MLwiN 167
MML estimation see marginal maximum likelihood (MML) estimation
model fit: absolute fit indices 84; diagnostic classification model (DCM) 84; log-linear cognitive diagnosis modeling (LCDM) 65–6, 66; longitudinal multilevel modeling 178, 179, 183; mixed Rasch model (MRM) 22, 23; multidimensional Rasch model 40; relative fit indices 84; structural equation modeling (SEM) 107–9
modern theoretical models of language comprehension 34
Monte Carlo methods 41
MPlus software 64, 76
MRC indices see medical research council (MRC) indices
MRCMLM see multidimensional random coefficients multinomial logit model (MRCMLM)
MRM see mixed Rasch model (MRM)
MSE see mean squared error (MSE)
MTLD see measure of textual lexical diversity (MTLD)
multicollinearity 106, 231
multidimensional diagnostic classification model (DCM) 56
multidimensional random coefficients multinomial logit model (MRCMLM) 41
multidimensional Rasch model 33–55; advantages 48; between-item dimensionality 36, 37, 38; cognitive diagnostic model (CDM), compensatory/non-compensatory models 39–40; complex assessment designs 33; confirmatory Rasch bifactor model 39; Conquest software 41; correlated factors model 38; deviance 40; dichotomous data 34; explained common variance 39; item characteristic curves (ICCs) 35, 35; item response scoring 33; model fit 40; nested factor model 38; overview 8; polytomously scored item responses with multiple outcomes 36; Rasch testlet model 49; receptive L1 and L2 competencies 41–2; sample study see psychometric structure of listening test; unexplained variance components 49; within-item dimensionality 36, 37, 38, 48
multidimensional tests 1
multidimensionality 29

multigroup SEM analysis 119
multilevel modeling (MLM) 150–70; estimation methods 166–7; fixed/random coefficients 154; HLM7 software 167–8; intercept/slope 153, 155, 155; logic behind MLM 153–6; longitudinal data see longitudinal multilevel modeling; missing data 167; potential uses 10, 157, 168, 187; repeatedly measured variable 186; sample size 167; sample study see examining relationship of vocabulary to speaking ability; underlying premise 10
multivariate analysis of covariance (MANCOVA) 6
multivariate generalizability theory 3–4
multivariate normality 106, 116
mutation 218
MX2 84, 91
nature-inspired data mining: classification and regression trees (CART) 193–214; symbolic regression 215–33
nature-inspired optimization 215
nested data 172
nested data structures 150–1, 151
nested factor Rasch model 38, 45
NFI see normed fit index (NFI)
NIDO see noisy input, deterministic, or gate (NIDO) model
NNFI see non-normed fit index (NNFI)
noisy input, deterministic, or gate (NIDO) model 82
non-compensatory diagnostic classification model (DCM) 58, 80
non-compensatory Rasch model 40
non-compensatory reparameterized unified model (NC-RUM) 82
non-normed fit index (NNFI) 108
norm-referenced assessment 8, 135
normality 193
normed fit index (NFI) 108
null model 159, 178
Nutonian 231n4
observed variable 101
Occam's razor 228
offspring 218
OLS see ordinary least square (OLS)
optimization 215
ordinary least square (OLS) 128
orthogonality 39
outfit mean square (MnSq) 204
over-identified model 104, 105
overfitting 197–8
overgrowth 199
overview of book 2–11
paragraph knowledge 89
paraphrasing 75
parceling 106
parent solution 218
parsimonious model 65
parsimony advantage 86
parsing 220
partial credit model (PCM) 5, 41
Partial Credit Rasch model 4
partial least square (PLS) SEM 107
path 102
PCM see partial credit model (PCM)
Pearson Test of English (PTE) Academic 174–5
percentile 129
personalized learning 144
PIRLS see Progress in International Reading Literacy Study (PIRLS)
PISA see Programme for International Student Assessment (PISA)
plausible value estimates 41
PLS-SEM see partial least square (PLS) SEM
PLS software 107
polysemy 202
predicting item difficulty 200–1; Coh-Metrix 200–1, 201–2; data analysis 205–6; data and test materials 203; demographic information 203; dependent variable (discretization of item difficulty) 205; dependent variable 204; discussion 208–12; IF-THEN rules 208, 209–10; independent variables 204, 205; nodes 208; performance and misclassification 206–7; situation model 202; surface level 201; textbase 201–2; variable importance index (VII) 207, 207–8
predicting reading ability 220–7; data source and instruments 221–2; linear regression model 222, 224, 229; model estimation and selection 224–6; psychometric validity of testlets 222; sensitivity 223–4; sensitivity analysis 226, 227; target expression and building blocks 223; theoretical framework 220–1; Winsteps software 222

predictive fit indices 108–9
predictive models 194
predictive relationship between diagnostic and proficiency test 111–19; background 111–12; data analysis 115–17; DELTA 112–13, 114; hypothesized conceptual framework of study 113; IELTS 113; limitation of study 119; multivariate normality 116; questions to answer 112; results 117–18; sample 114–15; standardized/unstandardized estimates 118, 118; test fairness 119
predictive validity 112
premature convergence 218
probabilistic diagnostic classification model (DCM) 56
proficiency cut scores 135
Programme for International Student Assessment (PISA) 6, 41
Progress in International Reading Literacy Study (PIRLS) 41
prompt characteristics 153
pruning 199
psychometric structure of listening test 42–8; Conquest software 45; estimation of multidimensional Rasch models 45–6; instruments 43–4; results 46–8; sample 43; two rounds of analysis 44–5; unidimensional Rasch model 44
psychometric validity of testlets 222
PTE Academic see Pearson Test of English (PTE) Academic
Q index 23, 23
Q-matrix: defined 57; HDCM 88–9; LCDM 64, 74
q-vector 89, 95n5
quadratic change pattern 180
quantile regression (QR) 128; see also formative assessment of English language proficiency
Quantitative Data Analysis for Language Assessment (Aryadoust/Raquel): overview (advanced methods – volume II) 7–11; overview (fundamental techniques – volume I) 2–7
quantitative vs. qualitative differences 15, 17
quantitative methods 1
R (correlation coefficient) 219, 224–6
R statistics program 152, 167, 168
R² 199, 219, 224–6
random coefficients 154
random effects 154, 164, 177–8, 179, 183
random effects modeling 152; see also multilevel modeling (MLM)
random-slopes model 181
Rasch, Georg 33
Rasch-based DIF analysis method 4
Rasch measurement see Rasch model
Rasch model: bifactor model 39, 40, 44, 49; brief explanation of 4; compensatory model 39–40; correlated factors model 38, 44, 45; criterion-referenced interpretation of test scores 33; many-facet Rasch measurement (MFRM) 5, 41; MRM see mixed Rasch model (MRM); multidimensional model see multidimensional Rasch model; nested factor model 38, 45; non-compensatory model 40; probabilistic model 1; quantitative differences 15; Rasch-based DIF analysis method 4; Rasch model equation 33, 35; unidimensional model 29n1; unidimensional Rasch measurement 4–5, 44, 45
Rasch model equation 33, 35
Rasch residuals 4
Rasch testlet model 49
rater-mediated language assessment 5
rating scale model (RSM) 5
reading comprehension: EA-based symbolic regression 220, 229; hierarchical diagnostic classification model (HDCM) 87–95; mixed Rasch model (MRM) 21–8; word knowledge 94
Reading Test Strategy Use Questionnaire 7
receiver operating characteristic (ROC) 199
receptive skills 41, 42
reduced reparameterized unified model (rRUM) 57, 58
reflective variable 103
regression 153
regression weights/coefficients 107
relative fit indices: HDCM 84, 91, 93; SEM 108
reliability 178, 179, 183, 187n3
research studies see sample studies
restricted maximum likelihood (RML) estimation 166–7
retrofitting 8, 56, 87
RFI 108; see also relative fit indices

RML see restricted maximum likelihood (RML) estimation
RMSEA see root mean square error of approximation (RMSEA)
ROC see receiver operating characteristic (ROC)
root mean square error of approximation (RMSEA) 108
root node 196
rRUM see reduced reparameterized unified model (rRUM)
RSM see rating scale model (RSM)
rule-space methodology 57
Salford Systems' CART 198, 205; see also classification and regression trees (CART)
sample size: diagnostic classification model (DCM) 83; mixed Rasch model (MRM) 29; multilevel modeling (MLM) 167; structural equation modeling (SEM) 105, 120
sample size adjusted BIC (ssBIC) 65, 66
SAS program codes 148–9
scaling 107
SEM see structural equation modeling (SEM)
sensitivity: classification and regression trees (CART) 199; EA-based symbolic regression 223–4, 231n7
sensitivity analysis 226, 227
sentence overlap 202
simple structure items 58, 64
situation model 202, 221
skewness 106
slope 153, 155, 155, 173
smoothing the data set 136, 136
Soper's calculator 105
specific diagnostic classification model (DCM) 80
specificity 199
splitting 198
spuriousness 102
square roots 106
SRMR 108; see standardized root mean square residual (SRMR)
ssBIC see sample size adjusted BIC (ssBIC)
stability 224
standardized estimates 118, 118
standardized root mean square residual (SRMR) 108
structural equation modeling (SEM) 101–26; advantages 101–2, 119; covariance-based (CB) SEM 107; data preparation 105–6; directional effects 107; exogenous/endogenous variables 102; five-step process 103, 110; formative/reflective variables 103; item aggregation/parceling 106; latent/observed variables 101; limitations 119–20; marker variable 104; measurement model 102; missing data 120; model fit and model interpretation 107–9; model identification 104–5; model specification 104; multicollinearity 106; over-, just-, or under-identified model 104, 105; overview 9; partial least square (PLS) SEM 107; purpose 102; research studies 111; sample size 105, 120; sample study see predictive relationship between diagnostic and proficiency test; spuriousness 102; structural model 102; univariate/multivariate normality 106; variance/covariance 107
structural parameters 60
student growth percentiles (SGPs) 127–49; "adequate," "good," or "enough" classification 134; devising appropriate lesson plan for each student relative to his peers 129; facilitating better allocation of finite resources 145; fit 133; how it works 128–9, 131–2; missing scores 144–5; overview 10; percentile, defined 129; prior scores 128, 129, 131; sample study see formative assessment of English language proficiency; SAS program codes 148–9
summarizing 89, 94
summative assessment 130
superordination 202
surface level 201
symbolic regression see evolutionary algorithm-based (EA-based) symbolic regression
syntactic complexity 201
syntactic density 201
syntactic parsers 202
syntactic parsing 220
t-test 5–6
10-fold cross-validation 206; see also k-fold cross-validation
test fairness 119, 195–6
Test of English as a Foreign Language Internet-based test (TOEFL iBT) 153, 171

Test of English as a Foreign Language Paper-based test (TOEFL PBT) 222
Test of English for International Communication (TOEIC) 174
test repeaters' scores 172, 174, 186
test validation 29, 150
testlet 49
testlet model 49
tetrachoric correlations 60, 72, 73
textbase 201–2
theoretical models of language comprehension 34
time-invariant predictors 173, 176–7
time-varying predictors 173, 176
TLI see Tucker-Lewis index (TLI)
TOEFL PBT see Test of English as a Foreign Language Paper-based test (TOEFL PBT)
TOEIC see Test of English for International Communication (TOEIC)
top-down cognitive processes 61
top/root node 196
train-test set cross-validation 197, 198, 205–6
tree-growing rules 198
TTR see type-token ratio (TTR)
Tucker-Lewis index (TLI) 108
type-token ratio (TTR) 202
UCLA Institute for Digital Research and Evaluation 168
unconditional model 159
under-identified model 104, 105
unexplained variance components 49
unidimensional Rasch model 4–5, 44, 45
unidimensionality 29, 29n1
univariate and multivariate statistical analysis 5–7
univariate generalizability theory 3
univariate normality 106
unstandardized estimates 118, 118
unstructured hierarchy 89, 90
use-involvement view 85, 94
variable importance index (VII) 199–200, 207, 207–8
variance 107
vertical scale 127
VII see variable importance index (VII)
vocabulary knowledge 221, 230
WINMIRA software 22
Winsteps software 222
within-item dimensionality 36, 37, 38, 48
word knowledge 94
word recognition 220
WordNet 202
ZQ 23, 23