Advances in Rasch Analyses in the Human Sciences [1st ed.] 9783030434199, 9783030434205

This volume follows the publication of Rasch Analysis in the Human Sciences. This new book presents additional topics not covered in that first volume.


English | Pages: XVI, 306 [313] | Year: 2020


Table of contents:
Front Matter ....Pages i-xvi
Introduction: For the Second Time (William J. Boone, John R. Staver)....Pages 1-11
Principal Component Analysis of Residuals (PCAR) (William J. Boone, John R. Staver)....Pages 13-24
Point Measure Correlation (William J. Boone, John R. Staver)....Pages 25-38
Test Information Function (TIF) (William J. Boone, John R. Staver)....Pages 39-55
Disattenuated Correlation (William J. Boone, John R. Staver)....Pages 57-63
Understanding and Utilizing Item Characteristic Curves (ICC) to Further Evaluate the Functioning of a Scale (William J. Boone, John R. Staver)....Pages 65-83
How Well Are Your Instrument Items Helping You to Discriminate and Communicate? (William J. Boone, John R. Staver)....Pages 85-92
Partial Credit Part 1 (William J. Boone, John R. Staver)....Pages 93-112
Partial Credit Part II (How to Anchor a Partial Credit Test) (William J. Boone, John R. Staver)....Pages 113-129
The Hills…with the Partial Credit Model (William J. Boone, John R. Staver)....Pages 131-146
Common Person Test Equating (William J. Boone, John R. Staver)....Pages 147-158
Virtual Equating of Test Forms (William J. Boone, John R. Staver)....Pages 159-172
Computing and Utilizing an Equating Constant to Explore Items for Linking a Test to an Item Bank (William J. Boone, John R. Staver)....Pages 173-185
Rasch Measurement Estimation Procedures (William J. Boone, John R. Staver)....Pages 187-198
The Importance of Cross Plots for Your Rasch Analysis (William J. Boone, John R. Staver)....Pages 199-214
Wright Maps (Part 3 and Counting...) (William J. Boone, John R. Staver)....Pages 215-253
Rasch and Forms of Validity Evidence (William J. Boone, John R. Staver)....Pages 255-266
Using Rasch Theory to Develop a Test and a Survey (William J. Boone, John R. Staver)....Pages 267-285
Presentation and Explanation Techniques to Use in Rasch Articles (William J. Boone, John R. Staver)....Pages 287-302
Some Concluding Thoughts… (William J. Boone, John R. Staver)....Pages 303-306
Correction to: Advances in Rasch Analyses in the Human Sciences (William J. Boone, John R. Staver)....Pages C1-C2


William J. Boone · John R. Staver

Advances in Rasch Analyses in the Human Sciences

William J. Boone Miami University Oxford, OH, USA

John R. Staver Purdue University West Lafayette, IN, USA

ISBN 978-3-030-43419-9    ISBN 978-3-030-43420-5 (eBook)
https://doi.org/10.1007/978-3-030-43420-5

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

This book is dedicated to my wife, Rose— Little Flower Comet, 78 Mohawk, Valiant Musketeer Commuter, Kolping Westsider, and Eastside Walker…. Love, Bill

Preface

It has been a number of years since my first Rasch book (Rasch Analysis in the Human Sciences) was published. During the years between completing that first Rasch book and completing this second book (Advances in Rasch Analysis in the Human Sciences), I have had the good fortune to learn a little more about Rasch each day—be it through my teaching of Rasch to colleagues and students or through articles I have authored and coauthored. I cannot imagine having worked in another field. To measure is so central to what humans do, and Rasch methods help one measure! I am in debt to so many individuals as my Rasch journey continues onward—in particular to my students who ask probing questions, my students who have helped me improve the manner in which I explain things, and my students who have helped me slow down.

As I learn and explain Rasch, I am continually thankful for the opportunity to have interacted with Ben Wright and Mike Linacre. Ben served as my Ph.D. advisor and director at the University of Chicago. Although Ben is no longer with us, I feel as if I can hear his voice when I read his many articles. When I reflect upon the help that Mike Linacre has provided to me as I use Rasch, it is difficult to find all the appropriate words of praise and thanks. Mike authored Winsteps, a program that one can easily use. But the hallmark, in my mind, of Mike's help is his willingness to aid with any Rasch inquiry. When Mike sends an email, the last line is most often "OK?". Mike means that last part—he does want to know if there is more information that might be of help. Thank you so much, Mike.

I am also in debt to my colleagues with whom I have authored Rasch articles and with whom I have had Rasch discussions. Thank you Birgit Neuhaus, Andrea Möller, Maik Walpuski, Andreas Borowski, Kerstin Kremer, Charlotte Ringsmose, Grethe Kragh-Müller, Kirsten Schlüter, Hans Fischer, Hendrik Härtig, Philipp Schmiemann, Sabine Fechner, Susanne Bögeholz, Sabina Eggert, Jan Barkmann, Christian Förtsch, Claudia von Aufschnaiter, Andreas Vorholzer, Julia Arnold, Rüdiger Scholz, Susanne Weßnigk, Knut Neumann, Lisa Kenyon, William Romaine, Vanes Mesic, Kathy Malone, Chris Wilson, Molly Stuhlsatz, Cari Abell, Ross Nehm, Xiufeng Liu, Saed Sabah, Jason Abbitt, Darrel Davis, Kristy Brann, Amity Noltemeyer, Oli Tepner, Cristian Gugiu, Eva Fenwick, Jürgen Mayer, Maja Planinic, and Michael Peabody. Also, I am in great debt to Ashley George, Hailey Ward, and Amy Zhupikov for many Rasch editing ideas. And of course I am so grateful to my co-author John Staver.

Throughout this book, readers will see a range of names in the text and the activities. Many of those names (Billy, Rose, Katie, Stefanie, Amy, Dave, Joe, and Zhenya) honor those I love and those who tolerated my working on this book for years on end at my favorite Cincinnati Starbucks. Of course, life would not be wonderful without my friends Joe, Bob, Dennis, John, Mike, Charlie, and Chizuko, who frequently ask me how my "thinking day" at Starbucks is going.

My hope is that the readers of this book will apply the lessons of each chapter as they conduct their own measurement—be it with a test or a survey. There is a lot that still needs to be done within the Rasch community in terms of spreading the word about good measurement. When very high-stakes tests and surveys are developed, there is a high probability that Rasch methods will be used. However, when not-so-high-profile tests or surveys are developed, there is a good chance that raw scores are used for statistical analyses and that non-Rasch methods are used to develop the instruments. I feel that we are not yet at the point where those using and developing tests and surveys know that, just as they must conduct statistical tests to compare different groups of respondents, they must also use Rasch methods. I hope this book provides each reader something of use—be it a technique for a Rasch analysis needed for a paper or talk, a teaching technique to utilize with students, or an explanation that allows one to further deepen one's appreciation of what it means to carefully measure.

Cincinnati, OH, USA
William J. Boone

Contents

1 Introduction: For the Second Time  1
   Tie Ins to Rasch Analysis in the Human Sciences (Book 1)  1
   Organization and Rationale of the Book and the Chapters  1
   A Few Warm Up Reminders and Exercises from RAHS Book 1  3
   Topics Appearing in Advances in Rasch Analysis in the Human Sciences  4
   This Chapter: “Introduction for the Second Time”  4
   Chapter 2: “Principal Component Analysis of Residuals (PCAR)”  4
   Chapter 3: “Point Measure Correlation”  5
   Chapter 4: “Test Information Function (TIF)”  5
   Chapter 5: “Disattenuated Correlation”  5
   Chapter 6: “Understanding and Utilizing Item Characteristic Curves (ICC) to Further Evaluate the Functioning of a Scale”  6
   Chapter 7: “How Well Are Your Instrument Items Helping You to Discriminate and Communicate?”  6
   Chapter 8: “Partial Credit Part I”  6
   Chapter 9: “Partial Credit Part II (How to Anchor a Partial Credit Test)”  7
   Chapter 10: “The Hills…with the Partial Credit Model”  7
   Chapter 11: “Common Person Test Equating”  7
   Chapter 12: “Virtual Equating of Test Forms”  7
   Chapter 13: “Computing and Utilizing an Equating Constant to Explore Items for Linking a Test to an Item Bank”  8
   Chapter 14: “Rasch Measurement Estimation Procedures”  8
   Chapter 15: “The Importance of Cross Plots for Your Rasch Analysis”  8
   Chapter 16: “Wright Maps (Part 3 and Counting…)”  9
   Chapter 17: “Rasch and Forms of Validity Evidence”  9
   Chapter 18: “Using Rasch Theory to Develop a Test and a Survey”  9
   Chapter 19: “Presentation and Explanation Techniques to Use in Rasch Articles”  9
   Chapter 20: “Some Concluding Thoughts…”  10
   References  11

2 Principal Component Analysis of Residuals (PCAR)  13
   Tie Ins to RAHS Book 1  13
   Introduction  14
   Cross Plotting to Explore Dimensionality  18
   Cross Plot Type 1  18
   Cross Plot Type 2  19
   Keywords and Phrases  21
   Potential Article Text  21
   Quick Tips  21
   Data Sets (Go to http://extras.springer.com)  22
   Activities  22
   References  24

3 Point Measure Correlation  25
   Tie Ins to RAHS Book 1  25
   Introduction  26
   Reviewing Point Measure Correlations for Tests  29
   Experiment #2 With Test Data  31
   Keywords and Phrases  34
   Potential Article Text  35
   Quick Tips  35
   Data Sets (Go to http://extras.springer.com)  36
   Activities  36
   References  38

4 Test Information Function (TIF)  39
   Tie Ins to RAHS Book 1  39
   Introduction  40
   Computing a TIF Graph  40
   What to Do with the TIF Plot? What Is Useful for the Researcher?  42
   Further Experiments  44
   Final Experiment  48
   Some Final Thoughts on the Test Information Function Curve  51
   Keywords and Phrases  52
   Potential Article Text  52
   Quick Tips  52
   Data Sets (Go to http://extras.springer.com)  53
   Activities  53
   References  54

5 Disattenuated Correlation  57
   Tie Ins to RAHS Book 1  57
   Introduction  58
   Keywords and Phrases  61
   Potential Article Text  61
   Quick Tips  62
   Data Sets (Go to http://extras.springer.com)  62
   Activities  62
   References  63

6 Understanding and Utilizing Item Characteristic Curves (ICC) to Further Evaluate the Functioning of a Scale  65
   Tie Ins to RAHS Book 1  65
   Introduction  66
   How to Review and Interpret Model ICCs and Empirical ICCs  68
   How to Read and Use the Plot  69
   Keywords and Phrases  77
   Potential Article Text  77
   Quick Tips  77
   Data Sets (Go to http://extras.springer.com)  77
   Activities  78
   References  82

7 How Well Are Your Instrument Items Helping You to Discriminate and Communicate?  85
   Tie Ins to RAHS Book 1  85
   Introduction  86
   Communicating and Evaluating the Manner in Which Test Items Discriminate Students  86
   Computing the Gap  88
   Keywords and Phrases  91
   Potential Article Text  91
   Quick Tips  91
   Data Sets (Go to http://extras.springer.com)  92
   Activities  92

8 Partial Credit Part 1  93
   Tie Ins to RAHS Book 1  93
   Introduction  94
   A Data Set Containing a Mix of Partial Credit and Dichotomous Items  95
   Exploring and Understanding the Data Set  96
   The Key to the Partial Credit Analysis Is the Judicial Use of ISGROUPS  100
   The Person Measures from a Partial Credit Analysis  104
   Person Fit, Item Fit, Reliability  104
   ISGROUPS=0 GROUPS=0  105
   Keywords and Phrases  107
   Potential Article Text  108
   Quick Tips  108
   Data Sets (Go to http://extras.springer.com)  108
   Activities  109
   References  112

9 Partial Credit Part II (How to Anchor a Partial Credit Test)  113
   Tie Ins to RAHS Book 1  113
   Introduction  114
   Thinking About Anchoring Partial Credit Tests  114
   Steps to Anchor a Partial Credit Test  115
   Keywords and Phrases  125
   Potential Article Text  125
   Quick Tips  125
   Data Sets (Go to http://extras.springer.com)  126
   Activities  126
   References  129

10 The Hills…with the Partial Credit Model  131
   Tie Ins to RAHS Book 1  131
   Introduction  132
   What Happens When the Master’s Partial Credit Model Is Used, When GROUPS=0, When GROUPS=AABBCCC?  135
   How to Make Use of Figure 10.6 and Figure 10.7 for Your Research?  138
   Evaluating Rating Scale Functioning  138
   Communicating Data Analysis  139
   Keywords and Phrases  142
   Potential Article Text  142
   Quick Tips  143
   Data Sets (Go to http://extras.springer.com)  143
   Activities  143
   References  145

11 Common Person Test Equating  147
   Tie Ins to RAHS Book 1  147
   Introduction  148
   A Slight Twist  151
   Keywords and Phrases  156
   Potential Article Text  156
   Quick Tips  156
   Data Sets (Go to http://extras.springer.com)  157
   Activities  157
   References  157

12 Virtual Equating of Test Forms  159
   Tie Ins to RAHS Book 1  159
   Introduction  160
   First Steps for Virtual Equating  161
   Keywords and Phrases  170
   Potential Article Text  170
   Quick Tips  170
   Data Sets (Go to http://extras.springer.com)  170
   Activities  171
   References  172

13 Computing and Utilizing an Equating Constant to Explore Items for Linking a Test to an Item Bank  173
   Tie Ins to RAHS Book 1  173
   Introduction  174
   A Final Comment  179
   Keywords and Phrases  182
   Potential Article Text  182
   Quick Tips  183
   Data Sets  183
   Activities  183
   References  184

14 Rasch Measurement Estimation Procedures  187
   Tie Ins to RAHS Book 1  187
   Introduction  188
   Different Estimation Procedures  188
   Step 1: The JMLE Estimation Procedures of Winsteps and FACETS  189
   Step 2: What Are the Meaningful Differences, if any, Between the Estimation Procedures Used in Winsteps and Other Software Programs?  190
   Added Steps to Understand the Similarities and Differences of Rasch Programs  193
   The Bottom Line  195
   Keywords and Phrases  196
   Potential Article Text  196
   Quick Tips  197
   Data Sets  197
   Activities  197
   References  198

15 The Importance of Cross Plots for Your Rasch Analysis  199
   Tie Ins to RAHS Book 1  199
   Introduction  199
   Cross Plots to Understand Your Data  200
   Error-vs-Person Measure  200
   Sample Cross Plots to Aid Your Analysis and Your Articles  204
   Item Measure-vs-Item Error Cross Plot  204
   Item Measure-vs-Item MNSQ Outfit  204
   Cross Plots from Separate Analyses  206
   Cross Plots Used in Chaps. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14  208
   Cross Plots Used in Articles  209
   Keywords and Phrases  211
   Potential Article Text  211
   Quick Tips  211
   Data Sets (Go to http://extras.springer.com)  212
   Activities  212
   References  214

16 Wright Maps (Part 3 and Counting...)  215
   Tie Ins to RAHS Book 1  215
   Introduction  216
   Recent Wright Maps in the Literature  216
   What Can Be Learned and Improved Upon Through Review of the Wright Maps in These and Other Articles?  217
   The Common Wright Map Format  217
   Surprises in Presentation of Wright Maps  217
   Font  217
   Wrap Around  220
   M, S, T  223
   Added Cleaning Up and Simple Additions You Should Consider  225
   Units  226
   Beyond the Basic Wright Map Output  227
   Side by Side Wright Maps  227
   Piling Up Items (or Persons) on a Wright Map  232
   Wright Maps for Surveys  234
   How to Present Items? How to Present Respondents?  236
   Person Labels in a Wright Map  237
   Special Additions to Wright Maps  240
   Final Touch Ups  243
   A Potential Wright Map Checklist  246
   Keywords and Phrases  248
   Potential Article Text  248
   Quick Tips  248
   Data Sets (Go to http://extras.springer.com)  249
   Activities  249
   References  251

17 Rasch and Forms of Validity Evidence  255
   Tie Ins to RAHS Book 1  255
   Introduction  256
   Content Validity Evidence  257
   Construct Validity Evidence  257
   Predictive Validity Evidence  258
   Concurrent Validity Evidence  259
   Statistical Validity Evidence  259
   Fit Validity  260
   How Validity Evidence Is Sometimes Addressed in Rasch Research Articles  261
   Keywords and Phrases  263
   Potential Article Text  263
   Quick Tips  264
   Data Sets  264
   Activities  264
   References  265

18 Using Rasch Theory to Develop a Test and a Survey  267
   Tie Ins to RAHS Book 1  267
   Introduction  268
   Designing a Test Using Rasch Theory  268
   What Does a Test Developer Do If no Variable Is Supplied to Guide Your Development of a Test? How Might You Go About Defining the Variable?  269
   Use a Theory  269
   Blueprints and Item Writing  270
   The Next Steps  273
   The Bottom Line  278
   What About Surveys? Does This Same Technique Work?  279
   Keywords and Phrases  281
   Potential Article Text  282
   Quick Tips  282
   Data Sets  282
   Activities  282
   References  285

19 Presentation and Explanation Techniques to Use in Rasch Articles  287
   Tie Ins to RAHS Book 1  287
   Introduction  288
   Item Tables  288
   Which Columns to Include?  289
   Entry Number Column or Name Column?  290
   Total Score and/or Total Count Column  292
   Item Measure and Model S.E.  293
   Item Fit  293
   Point Measure Correlation  293
   Person Measure Tables  294
   Final Thoughts  297
   Keywords and Phrases  297
   Potential Article Text  297
   Quick Tips  298
   Data Sets (Go to http://extras.springer.com)  298
   Activities  298
   References  301

20 Some Concluding Thoughts…  303
   Reference  306

Chapter 1

Introduction: For the Second Time

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Hi Chizuko, look what I found…it's a new Rasch book…
Chizuko: Can I see it?
Charlie: No problem. I just flipped through it, and it is a second book by Boone and Staver. It looks like they are hitting a lot of additional topics that introductory Rasch learners are interested in. Also, it looks like they use the same teaching techniques that they used in Rasch Analysis in the Human Sciences (Boone, Staver & Yale, 2014).
Chizuko: I get the book first…

Tie Ins to Rasch Analysis in the Human Sciences (Book 1) Many techniques were presented in our first book to help researchers and students who are completely unfamiliar with Rasch, learn how to apply and think about Rasch. This second book presents added topics that were not considered in what we refer to as RAHS Book 1.

Organization and Rationale of the Book and the Chapters Shortly after completing RAHS Book 1, we started thinking about the many introductory Rasch topics that we did not hit in our initial book. This book is an effort to present a range of added topics that come up in our classes and workshops. Also, included are a few topics that are rarely discussed, that we feel can greatly help the beginning researcher and student deepen their Rasch thinking. One of our chapters concerns the importance of what we call cross plots. We have found the use of cross

© Springer Nature Switzerland AG 2020 W. J. Boone, J. R. Staver, Advances in Rasch Analyses in the Human Sciences, https://doi.org/10.1007/978-3-030-43420-5_1

1

2

1  Introduction: For the Second Time

plots helps to better understand the Rasch data one is sifting through when conducting a Rasch analysis. Each chapter begins with a conversation between two new students, Charlie and Chizuko, and chapters end with these same inquisitive students. We hope as readers review the banter between Charlie and Chizuko, they will nod their heads in agreement. The questions posed by our two students are quite common to those learning Rasch for the first time. Lightbulbs go on, and off, as the material is better understood. Throughout each chapter we also include Formative Assessment Checkpoints where we briefly review what we believe is a key take away point from the chapter. Often these questions are similar to those we have heard in one form or another from our students. Throughout each chapter readers will see that, for the most part, we attempt to teach a little bit about a Rasch topic, and then we demonstrate how the topic can be explored through the use of Winsteps software from Mike Linacre (Linacre, 2018). As we did in RAHS Book 1 in chapters of this book we make use of the writing of colleagues in forums such as Rasch Measurement Transactions (RMT), and journals such as the Journal of Applied Measurement. Some individuals learning Rasch may be interested in different software for Rasch analysis. Those reading this book will, we feel, be able to master specific Rasch topics. It will then be up to each reader to apply the technique using the Rasch software of his/her choice. We feel that Winsteps is very user-friendly, inexpensive, and most importantly, exceedingly well documented. The Winsteps Manual is hundreds of pages long, and every table and plot is explained in detail. Another reason for our using Winsteps is that we trust the calculations. The author, Mike Linacre, has been working with Rasch, and authoring Rasch computer code, for many years. Two additional points about this second book, Advances in Rasch Analysis in the Human Sciences, need to be made. This book has very little mathematics. This is not to say that mathematics is not a critical part of Rasch work. However, we chose to present limited mathematics. If readers are interested in the calculations of particular Rasch analyses steps, then we suggest identifying a number of Rasch resources (both old and new) to explore the mathematics. Finally, the purpose of this book is not to provide an overview of all the resources that are available to readers, nor to discuss many of the debates within the Rasch community or the debates between those who use Rasch and those who do not. The purpose of this book is to provide some added nuts and bolts guidance to the beginner. As readers review the table of contents, you will see that we provide a number of chapters that help one learn a new Rasch technique using Winsteps (e.g. common person equating, principal component analysis of residuals). Readers will also see some topics that have to do with a broad Rasch area that we feel is important. For example, we have a chapter entitled “Wright Maps (Part 3 and counting…).” This chapter is the third Wright Map chapter we have presented (there are two Wright Map chapters in RAHS Book 1). This chapter discusses how Wright Maps might be improved for publications.


This book also provides a number of hands-on chapters that include some topics that are not often discussed. For example, we provide a chapter entitled "The Importance of Cross Plots for Your Rasch Analysis," in which we discuss the need for cross plotting to better understand your Rasch analysis. A second example is the chapter entitled "Presentation and Explanation Techniques to Use in Rasch Articles." This chapter furnishes tips and ideas that help readers better present their Rasch results to others (e.g. how to create a meaningful and useful output table of item measures).

As each chapter comes to a close, we always check in with our two students and listen to a recap of what they have learned. We also provide a section entitled Keywords and Phrases, where we provide phrases and words that beginning learners might practice and use as they write Rasch articles and craft presentations. As we did in RAHS Book 1, we provide a sample article text (Potential Article Text) that could serve as a model for the researcher using the techniques outlined in the chapter. For many Rasch learners, there is always an interest in better understanding exactly how one might write up the results of a Rasch analysis. Following the sample text, we provide a list of Quick Tips as well as Data Sets that can be used to tackle the activities that we have authored for each chapter. We provide a question and an answer for the majority of the activities. We believe that many Rasch books provide only limited practice for readers to attempt; we hope our exercises fill this gap. References are provided for each chapter, as well as Additional Readings.

To finish this brief introduction to Advances in Rasch Analysis in the Human Sciences, we feel it is helpful to provide a set of warm-up exercises that can help readers remember some key aspects of RAHS Book 1, followed by a listing of the chapters of this book and a brief rationale for each chapter. There are of course many added topics that one could present in a third book! We hope readers will learn and enjoy as we explain how we have tackled a number of Rasch topics in workshops and classes.

Readers should note that throughout this book, as we did in Rasch Analysis in the Human Sciences, we often use data sets generated from use of the STEBI instrument of Enochs and Riggs (1990). At times we simply refer to the STEBI, but for interested readers the citation is the following: Enochs, L. G., & Riggs, I. M. (1990). Further development of an elementary science teaching efficacy belief instrument: A preservice elementary scale. School Science and Mathematics, 90(8), 694–706. The two scales of the instrument are a Self-Efficacy scale (SE) and an Outcome Expectancy scale (OE).

A Few Warm Up Reminders and Exercises from RAHS Book 1

• Find a test data set in RAHS Book 1 and create a control file.
• Find a survey data set in RAHS Book 1 and create a control file.
• What are the problems with the techniques that were used (before Rasch) to evaluate test data and survey data?
• What is the purpose of investigating misfit? If a person misfits in an analysis, what might that mean? If an item misfits, what might that mean?
• Look at your test analysis. What is the meaning of an item having a higher logit measure than another test item? What is the meaning of a survey item having a higher logit measure than another survey item?
• Repeat the same exercise above, but with persons.
• How does one link one test to another test? Why is linking needed?
• How and why might you convert a logit scale to a scale that varies from 0 to 1000? (A short sketch of this arithmetic follows this list.)
• Create a Wright Map for the test data set and the survey data set. Interpret the Wright Map.
• For the two analyses, review and interpret the score measure table. How is the nonlinearity of raw scores revealed in the table?
• What are some techniques that you can use to investigate the functioning of a rating scale?
• Can you explain in a few sentences why missing data causes fewer problems in a Rasch analysis?
• What is the difference between the Rasch perspective and an IRT perspective?
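For the rescaling exercise above, the arithmetic is simply a linear transformation of the logit measures; Winsteps supports this kind of user rescaling through its UMEAN= and USCALE= settings (see the Winsteps Manual). Below is a minimal Python sketch, with illustrative rescaling values chosen so that measures between roughly -5 and +5 logits land between 0 and 1000, and with the dichotomous Rasch model included for reference. Because the transformation is linear, the interval properties of the logit scale are preserved.

```python
import math

def rasch_probability(theta, delta):
    """Dichotomous Rasch model: probability of a correct response for a
    person with measure theta on an item with difficulty delta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def rescale(logit, umean=500.0, uscale=100.0):
    """Linear rescaling of a logit value onto a user-friendly metric.
    The umean/uscale values here are illustrative only."""
    return umean + uscale * logit

print(round(rasch_probability(1.0, 0.0), 2))  # 0.73: person 1 logit above item
print(rescale(-2.3), rescale(2.3))            # 270.0 770.0
```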

Topics Appearing in Advances in Rasch Analysis in the Human Sciences

This Chapter: “Introduction for the Second Time”

Chapter 2: “Principal Component Analysis of Residuals (PCAR)” This chapter provides an overview of how PCAR can be used to investigate the dimensionality of an instrument. In RAHS Book 1 we introduced readers to the concept of fit and how fit can be used to investigate whether or not the items/respondents are fitting the Rasch model. Chapter 2 provides steps (e.g. which parts of Winsteps tables to review, what plots to make) that one can take to further investigate the dimensionality of an instrument.

Topics Appearing in Advances in Rasch Analysis in the Human Sciences

5

Chapter 3: “Point Measure Correlation” This chapter outlines how point measure correlation can be used to aid an analysis. In particular, how miscoding (not flipping items that need to be flipped, and/or a wrong answer key for an item) can be identified through review of point measure correlation, and how point measure correlation can be used to identify items that may not contribute to measurement. In RAHS Book 1 we helped readers learn how to flip and why to flip. When a survey item has negative wording then it is important to “flip” the coding that is used for the initial spread sheet used for an analysis. For instance, if coding of 1 is used for Agree and a code of 0 is used for Disagree, then a negative item will have a coding of 0 for Agree and 1 for Disagree. Also, point measure correlations ties to our RAHS Book 1 work with fit in which we explained how fit can be used to identify items that may not help us measure with confidence.

Chapter 4: “Test Information Function (TIF)” This chapter provides readers with the steps needed to evaluate and report TIF. In RAHS Book 1 we considered many issues to investigate the functioning of an instrument, and TIF provides another possible tool for researchers. Some researchers provide TIF plots in their articles and reports, and a number of our students have asked for details of TIF for a Rasch analysis. Winsteps allows one to easily generate TIF plots.

Chapter 5: “Disattenuated Correlation” A discussion is presented concerning how a Disattenuated Correlation can be investigated, and how this correlation can provide guidance as to the dimensionality of an instrument. In RAHS Book 1 we presented a number of techniques to investigate dimensionality, The techniques of this chapter provides an added tool for Rasch researchers to investigate dimensionality. With Winsteps one can compute a Dissattenuated Correlation and use general rules of thumb to help one evaluate the dimensionality of an instrument.


Chapter 6: “Understanding and Utilizing Item Characteristic Curves (ICC) to Further Evaluate the Functioning of a Scale” This chapter provides the steps one might take when investigating model ICC curves and empirical ICC curves with regard to attempting to identify items that might be problematic for a scale. In RAHS Book 1 we introduced readers to techniques such as Item Outfit MNSQ to identify items that might not be working well. This chapter provides added tools for readers wishing to investigate how their items are working for them. We provide guidance as to how to generate ICC curves and how to interpret those curves, all with the goal of improving the measurement properties of an instrument.

Chapter 7: “How Well Are Your Instrument Items Helping You to Discriminate and Communicate?” This chapter provides an introduction to a table in Winsteps that we do not think is used enough (or it may be used, but few report in an analysis report). This table helps us to evaluate one aspect of how well items are helping to discriminate respondents. As pilot items are evaluated for a test or survey, it is important to have a range of items which allows one to detect differences between respondents. Having items that maximize “discrimination” helps one conduct better measurement. This chapter also presents how discrimination data for items can be included in a Wright Map to further guide the selection of items for the final version of a measurement instrument.

Chapter 8: “Partial Credit Part I” In RAHS Book 1 we considered both multiple choice tests and rating scale surveys, but we did not consider partial credit data sets. In this first of two Partial Credit chapters, we present some of the techniques we use to help students learn how past analyses of partial credit data, before Rasch, were flawed. We also show readers how to add a simple command in their control file to conduct a partial credit analysis of data.


Chapter 9: “Partial Credit Part II (How to Anchor a Partial Credit Test)” In RAHS Book 1 we provided details to readers as to the hows and whys of item anchoring. We discussed in RAHS Book 1 how to anchor with a multiple-choice test, and we described the steps needed to anchor a rating scale survey. This chapter helps readers learn how to anchor a partial credit instrument.

Chapter 10: “The Hills…with the Partial Credit Model” In RAHS Book 1 we talked readers through how to read probability curves that are provided in Table 21.1 of Winsteps. In this new chapter, we help readers understand and use such curves when a partial credit analysis of a data set is evaluated. We explain how to generate these plots, how to interpret the plots, and how to report such plots in articles.

Chapter 11: “Common Person Test Equating” A topic presented in RAHS Book 1 was equating through the use of common items. In this chapter, we provide guidance on how researchers can utilize common persons to link two (or more) scales that measure the same trait. With common persons it is possible (when the same trait is measured) to compute the item measures for two different scales and to have the item measures expressed on the same scale. In this chapter we outline the steps one must take in a Winsteps analysis and we talk readers through Winsteps tables which help one check and report an analysis which has used Common Person Equating.

Chapter 12: “Virtual Equating of Test Forms” In this chapter, we present a topic that has come up with our students, namely how one might equate if one does not have any common items? Of course, there are requirements to any steps that might be taken in an effort to equate two tests that do not have common items. In this chapter, we utilize what we feel is a gem (of many gems) in the Winsteps Manual where Mike Linacre describes how one might link

8

1  Introduction: For the Second Time

two tests with no common items. In order to understand the rationale and thinking for virtual equating readers will need to apply what they have learned from the linking chapter of RAHS Book 1. We believe this new chapter helps readers further develop their skills with linking. Not only when there might not be any common items, but also when there are common items available for linking.
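One way to read the approach: pair items across the two forms by judged content equivalence (pseudo-common items), cross plot their difficulties, and use a best-fit line through the pairs to carry Form B measures into the Form A frame of reference. The sketch below shows only that arithmetic; the judgment of which items are equivalent does the real work, and the Winsteps Manual should be consulted for the recommended details:

```python
import numpy as np

def virtual_equating_line(difficulties_a, difficulties_b):
    """Fit a line through content-paired item difficulties from Form A
    and Form B (no literal common items). One simple choice of best-fit
    line: slope from the SD ratio, passing through the two means."""
    a = np.asarray(difficulties_a, dtype=float)
    b = np.asarray(difficulties_b, dtype=float)
    slope = a.std(ddof=1) / b.std(ddof=1)
    intercept = a.mean() - slope * b.mean()
    return slope, intercept  # Form A equivalent = slope * (Form B value) + intercept
```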

Chapter 13: “Computing and Utilizing an Equating Constant to Explore Items for Linking a Test to an Item Bank” In chapter 13, we further explore the issue of linking tests. In this chapter, we explain one method used by the State of Ohio with regard to linking two tests. Just as taking on virtual equating helps one better understand the nuances of linking presented in RAHS Book 1, we feel exposure to equating constants also greatly helps one better understand linking as well.

Chapter 14: “Rasch Measurement Estimation Procedures” For those learning and attempting to apply Rasch, a decision one has to make is which software to use. There are many programs that provide researchers with the ability to conduct a Rasch analysis. As readers will know from RAHS Book 1, we feel that Winsteps is exceedingly user friendly and we trust the code. Winsteps is well documented in the lengthy and detailed Winsteps Manual. Often we are asked: Does it make a difference which program is used to compute person measures and item measures? In this chapter, we provide an overview of the issue and present some exercises we have carried out for students and colleagues.

Chapter 15: “The Importance of Cross Plots for Your Rasch Analysis” One important skill for the Rasch student to master is the ability to make use of cross plots in order to better understand their data, their instrument, and the steps they might take to maximize the measurement functioning of their instrument. In this chapter, we talk readers through our take on the importance of cross plots for a Rasch analysis. We suspect that cross plots may be underutilized by researchers and also by those teaching Rasch techniques to new learners. As readers page through this book, as well as RAHS Book 1, they will see many cross plots. In this chapter we detail added cross plots that can help researchers in their Rasch work.


Chapter 16: “Wright Maps (Part 3 and Counting…)” In RAHS Book 1 we presented two chapters regarding the topic of Wright Maps— how to read them and how to construct them. In Advances in Rasch Analysis in Human Sciences we present a lengthy chapter concerning how to make sure your Wright Maps are well presented. We review some do’s and don’ts with your Wright Maps. We feel that readers should invest as much time as possible in their Wright Maps, and should present Wright Maps that help them advance and explain their research.

Chapter 17: “Rasch and Forms of Validity Evidence” In articles and presentations, students and researchers often speak about the validity evidence they have addressed in their Rasch analysis. In this chapter, we make use of an exceedingly useful Rasch Measurement Transactions article (Baghaei, 2008) that we have used in our classes as we discuss validity and Rasch. We also provide a number of examples of how Rasch authors have considered validity. The details of this chapter expand the discussions regarding validity evidence in RAHS Book 1, which primarily concerned construct validity. Here, we discuss construct validity evidence and consider some added aspects of validity evidence that have been proposed.

Chapter 18: “Using Rasch Theory to Develop a Test and a Survey” In many classes and workshops we are asked for a set of steps to develop a survey or test using Rasch techniques. In this chapter, we present an introductory set of steps that a researcher might take to do just that. We have learned over the years that using Rasch theory and Rasch software is really an art. There is, of course, no one set of rules one should follow to develop a test or survey using Rasch. In this chapter, we present just one set of potential steps that one might follow.

Chapter 19: “Presentation and Explanation Techniques to Use in Rasch Articles” When conducting a Rasch analysis, many steps will not be reported in articles and presentations, and there will always be a limit to the number of tables and figures that can be printed in an article. In this chapter, we provide a lengthy set of tips to help researchers, students, and colleagues decide what to present, and how to present their results, in Rasch talks and publications. As most individuals using Rasch are specialists in other fields, we feel such tips will help improve the quality of articles and presentations.

Chapter 20: “Some Concluding Thoughts…” In this chapter we present some final thoughts regarding Advances in Rasch Analyses in the Human Sciences. Our hope is that readers look over the book, test out the homework we provide, and review the readings we suggest. There are now many technical and philosophical books concerning Rasch, and we hope readers will consider reviewing those works. We hope that this book provides a non-technical introduction for anyone wishing to learn Rasch.

Why this ordering of topics? The ordering was informed by our theory and by checking our theory with students and colleagues as to what chapter order they thought made sense. We view Chaps. 2, 3, 4, 5, 6 and 7 as covering basics of a Rasch analysis that were not presented in RAHS Book 1. Chapters 8, 9 and 10 all involve the topic of partial credit data sets. Partial credit data is indeed something that should be tackled after one has a firm grip on the application of Rasch techniques to rating scale data and dichotomous data. Chapters 11, 12 and 13 all involve the topic of linking, be it the linking of a multiple-choice test, a survey, or a partial credit test. Chapter 14 involves the topic of the software that one can use to conduct Rasch. We felt that for most readers of an introductory book this type of material is of interest, but only after they have mastered the ins and outs of common Rasch techniques. With Chap. 15 we begin the final set of chapters, which are presented to help readers improve their articles and analyses. In these chapters there is far less number crunching and more thinking about how to put the pieces together to present your Rasch findings. Chapter 15 concerns cross plots for an analysis, and Chap. 16 explores how different Wright Maps can be used for analyses, talks, posters and presentations. Chapter 17 discusses validity, and Chap. 18 provides a road map for the development of a test and a survey using Rasch techniques. Chapter 19 provides a wide range of ideas on how best to present your Rasch findings. Chapter 20 is a brief finale of thoughts.

Charlie and Chizuko: Two Colleagues Conversing
Chizuko: I can’t wait to read the other chapters, Charlie. It looks to me as if many of the Rasch topics I read about in social science research articles are considered in this book. Also, I like that this book seems to be a hands-on, “talk me through the steps” type of book. I like theoretical books, but I like practical ones as well.
Charlie: I have to agree. I really enjoy learning about Rasch, but sometimes it is a little bit scary. Where do I learn the techniques? How do I interpret output? How do I write things up? I think this book will calm my nerves, just like Rasch Analysis in the Human Sciences.


References
Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145–1146.
Boone, W., Staver, J., & Yale, M. (2014). Rasch analysis in the human sciences. Dordrecht, the Netherlands: Springer.
Enochs, L. G., & Riggs, I. M. (1990). Further development of an elementary science teaching efficacy belief instrument: A preservice elementary scale. School Science and Mathematics, 90(8), 694–706.
Linacre, J. M. (2018). Winsteps® Rasch measurement computer program. Beaverton, OR: Winsteps.com.

Chapter 2

Principal Component Analysis of Residuals (PCAR)

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Chizuko, I would like to conduct a Principal Component Analysis of Residuals (PCAR). Do you have any idea why and how someone does that with Rasch?
Chizuko: Charlie, I think the key thing is that when we conduct a PCAR, we are taking an added step in evaluating how well the data fit the Rasch model. If the data fit, we have additional evidence that we are measuring as we wish to do (to measure as if we have a ruler or meter stick, to measure one variable).
Charlie: Ok, that makes sense as a goal.
Chizuko: I think with PCAR, as well as with other tools we have with Rasch, many people seem to forget that a key question is always, “what is the purpose of the measurement device?” What do we hope to measure?

Tie Ins to RAHS Book 1
PCAR provides an added technique for researchers to evaluate whether or not the data fit the Rasch model. Person Fit and Item Fit are two techniques for evaluating the fit of data to the Rasch model and for evaluating the presence of a single variable for measuring; PCAR provides another tool. In our first book, we encouraged readers to think before leaping with respect to many measurement issues. This is particularly true when investigating the behavior of items on a single dimension. What is the evidence that an item might not be part of a single trait? What is the fit of the item? And what evidence can be collected from PCAR to further investigate dimensionality?


Introduction
In RAHS Book 1, we presented a wide range of Rasch techniques that are used by researchers, for example, the investigation of fit and the creation of Wright Maps. Fit naturally provides researchers with a technique that allows them to investigate dimensionality. In this chapter we provide an added technique that can be used to investigate dimensionality: Principal Component Analysis of Residuals (PCAR). When Rasch is used, we think of the variable, and we investigate whether or not our data fit the Rasch model. When the data do not fit, we need to pause and think. It may be that the data we collected are not suitable for computation of useful measures along a variable.

As readers learn about PCAR, it is important to remember that noise is always present in data, and confounding traits that impact the computation of a measure will always be present. The key question to ask is whether our measure allows us to accomplish what we wish to do, even if it is not perfect. In Bond and Fox (2007), there are some wonderful paragraphs that describe a high jump competition. Yes, there is crowd noise; yes, there may be wind; but in the end, the height jumped is viewed as a useful measure of the competitor’s ability. When we construct measures, we must consider many issues, and PCAR is one valued technique for analyzing the usefulness of a measure. This technique can be more difficult for the beginner to understand; that is why we left it out of RAHS Book 1.

PCAR (also named PCA) is a type of analysis that allows researchers to investigate the measurement functioning of instruments. PCAR makes use of Rasch measures to explore whether there is evidence of an added dimension in a data set, and what the strength of that evidence might be. Thus readers will note that in a PCAR analysis one is using the Rasch measures of items and respondents. Linacre (1998) provides thoughtful guidance on what PCAR with Rasch is:

The aim of the factor analysis of Rasch residuals is thus to attempt to extract the common factor that explains the most residual variance under the hypothesis that there is such a factor. If this factor is discovered to merely “explain” random noise, then there is no meaningful structure in the residuals. In this kind of investigation, neither factor rotation nor oblique axes appear to be relevant.

To understand PCAR, readers must note that Rasch PCAR is an analysis of residuals; when we conduct a PCAR, we examine data patterns after we have used Rasch techniques to measure with one dimension. The next step is to consider two comments made by Mike Linacre:

An ideal of the Rasch model is that all the information in the data be explained by the latent measures. Then the unexplained part of the data, the residuals, is, by intention, random noise. In particular, after standardization of each residual by its model standard deviation, the noise should follow a random normal distribution. (1998, p. 636)

Readers must remember that when we use Rasch, during some parts of our analysis we are looking to see if the data fit the model. We do this when we look for misfitting persons and misfitting items. It should seem logical to also investigate whether or not the noise is random, as doing so provides additional details as to how well a data set fits the Rasch model, and thus sets the stage for confident computation of measures. This assessment of random versus nonrandom noise is a key concept that we try to keep in mind as we use PCAR in our work.

Formative Assessment Checkpoint #1
Question: What are the core underpinnings of PCAR?
Answer: The Rasch model (a definition of measurement) predicts that data residuals should be random noise. If a principal component analysis of Rasch residuals only explains random noise, then we have evidence of no additional meaningful factors in the data.
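For readers who want to see the arithmetic behind this check, here is a minimal sketch in Python. It is our own illustration, not the Winsteps computation: it assumes a dichotomous data set and that the person measures (theta) and item difficulties (delta) have already been estimated, for example exported from Winsteps PFILE= and IFILE= output. All variable names are ours.

import numpy as np

def first_contrast_eigenvalue(X, theta, delta):
    # X: persons-by-items matrix of 0/1 observations
    # theta: person measures in logits; delta: item difficulties in logits
    # Rasch-expected probability of success for each person-item pair
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
    # Standardized residuals: (observed - expected) / model standard deviation
    Z = (X - P) / np.sqrt(P * (1.0 - P))
    # Principal components of the item-by-item correlations of the residuals
    R = np.corrcoef(Z, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
    # The largest eigenvalue plays the role of the "1st contrast" discussed below
    return eigenvalues[0]

If the data fit the Rasch model, the residuals behave like random noise and this value stays small; in the pages that follow we discuss the rule of thumb (an Eigenvalue near 2.0) used to judge it.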

To learn how to use PCAR to add guidance to a Rasch analysis, we supply the file cf se all as part of the data sets for this chapter. This file allows us to evaluate the responses of students to the Self-Efficacy component of the STEBI (Enochs & Riggs, 1990). Parts of the output table (Table 23) are provided in Fig. 2.1.

TABLE 23.0 e:se-all.cf 8 flips ZOU540WS.TXT Mar 23 2020 9:51
INPUT: 154 PERSON 13 ITEM REPORTED: 86 PERSON 13 ITEM 7 CATS WINSTEPS 4.4.8

Table of STANDARDIZED RESIDUAL variance in Eigenvalue units = ITEM information units
                                           Eigenvalue   Observed          Expected
Total raw variance in observations          24.7946     100.0%            100.0%
  Raw variance explained by measures        11.7946      47.6%             49.1%
    Raw variance explained by persons        5.4789      22.1%             22.8%
    Raw variance explained by items          6.3157      25.5%             26.3%
  Raw unexplained variance (total)          13.0000      52.4%   100.0%    50.9%
    Unexplned variance in 1st contrast       2.1190       8.5%    16.3%
    Unexplned variance in 2nd contrast       1.6298       6.6%    12.5%
    Unexplned variance in 3rd contrast       1.5171       6.1%    11.7%
    Unexplned variance in 4th contrast       1.3689       5.5%    10.5%
    Unexplned variance in 5th contrast       1.3160       5.3%    10.1%

Approximate relationships between the PERSON measures
Contrast  PCA Item Clusters  Pearson Corr.  Disattenuated Corr.  Pearson+Extr  Disattenuated+Extr
1         1 - 3              0.4601         0.6397               0.5039        0.7018
1         1 - 2              0.6507         1.0000               0.6677        1.0000
1         2 - 3              0.7382         1.0000               0.7495        1.0000

TABLE 23.2 e:se-all.cf 8 flips ZOU540WS.TXT Mar 23 2020 9:51
CONTRAST 1 FROM PRINCIPAL COMPONENT ANALYSIS
STANDARDIZED RESIDUAL LOADINGS FOR ITEM (SORTED BY LOADING)
Positive loadings: A .69 (8, c112 q18); B .53 (7, c111 q17-f); C .48 (1, c96 q2 se); D .41 (3, c99 q5); E .35 (6, c106 q12); F .18 (12, c117 q23)
Negative loadings: a -.52 (4, c100 q6-f); b -.50 (13, c118 q24-f); c -.35 (5, c102 q8-f); d -.29 (2, c97 q3 se-f); e -.27 (10, c115 q21-f); f -.12 (11, c116 q22-f); g -.05 (9, c113 q19-f)

Fig. 2.1  Part of the output of Table 23.0, 23.1 and 23.2 from a Winsteps analysis of Self-Efficacy STEBI data. (Table from Winsteps)


Note the heading of this table: Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units). What is of importance in these data? How do we interpret these data with the goal of learning whether or not there might be evidence of an additional factor in our data set? Our first step utilizes guidance provided through the Winsteps Help Tab under the topic “Dimensionality: contrasts & variances”:

This is the “1st contrast” (or first PCA component in the correlation matrix of the residuals). If the Eigenvalue of the first contrast is small (usually less than 2.0), then the first contrast is at the noise level and the hypothesis of random noise is not falsified in a general way.... If not, the loadings on the first contrast indicate that there are contrasting patterns in the residuals. The absolute sizes of the loadings are generally inconsequential. It is the patterns of the loadings that are important. We see the patterns by looking at the plot of the loadings in Winsteps Table 23.2, particularly comparing the top and bottom of the plot. (Linacre, 2018a, 2018b)

Review of the table presented in Fig. 2.1 shows an Eigenvalue for the first contrast a little above 2.0 (2.1190). If that value were less than 2.0, then we would assert that the observed noise was random. In that case, there would be no evidence from the PCAR analysis of a possible added factor in our analysis of the Self-Efficacy data. Since our data are near a potential cut off, let us explore what researchers might do if the value were higher than 2.0. What other pieces of information are important to note?

Our next step in a PCAR analysis is to review the plot of Table 23.1 (Fig. 2.2) and to consider some questions. First, we look at the plot that provides the location of each item as a function of Contrast 1 loadings (y-axis) and item measures (x-axis). The Clusters, which are identified on the far right-hand side of the diagram, are of particular usefulness. In this example Cluster 1, Cluster 2, and Cluster 3 are noted in the last column of the plot in Fig. 2.2. By reviewing Fig. 2.1 you will be able to link the items noted by letters in Fig. 2.2 with the STEBI item names. Rasch PCAR looks for data patterns that are not expected. We are looking to see whether groups of items share patterns of unexpectedness. If item groups do share patterns, then those items might mark a second variable. The questions below (from the Winsteps Help Tab) are those that we often use to help us interpret the plot:

What is the secondary dimension? Is the secondary dimension big enough to distort measurement? What do we do about it? (Linacre, 2018a, 2018b)

To explore the secondary dimension, we look at the contrast between the content of the items at the top (noted with letters such as A, B, C…) and at the bottom (noted with lower case letters such as a, b, c…) of the contrast plot in Winsteps Table 23.2 (Fig. 2.2). For instance, if the items (for a simple math test) at the top are addition items and the items at the bottom are subtraction items, then a secondary dimension might exist. However, one could also argue that one is dealing with one dimension of math, with addition at one end and subtraction at the other end. In the case of our analysis, are there differences in the types of items at the top of the plot


Fig. 2.2  A portion of the Winsteps Table 23 that must be reviewed if researchers do not observe an Eigenvalue of less than 2.0

(in Fig. 2.2, the items identified with letters A-q18, B-q17, C-q2, D-q5, E-q12 and F-q23) and items at the base of the plot (in Fig. 2.2, the items identified with the letters a-q6, b-q24, c-q8, d-q3, e-q21, f-q22 and g-q19)? Upon review of the item text, we did not find a clear difference in item themes.

For the next issue, “is the secondary dimension big enough to distort measurement?”, Mike Linacre through the Winsteps Help Tab (Linacre, 2018a, 2018b) advises us that “the secondary dimension needs to have the strength of at least two items to be above the noise level”. In our example, we see the strength (Eigenvalue) in the first column of numbers in Table 23.0. In this analysis, the Eigenvalue is 2.1. Thus, the possible secondary dimension is marginally above the predicted noise level.

For the question “What do we do about it?”, often our decision is to take no action. On a mathematics test, we might get a big contrast between algebra and word problems. We know the topics are different themes, but they are both part of mathematics. We do not want to remove either of them, and we do not want separate algebra measures and word problem measures. In our example, there does not seem to be evidence of items possibly lying along a different trait or of items at different ends of a continuum. The PCAR we have conducted does not suggest removal of items. From the Winsteps Help Tab:

In summary, look at the content (wording) of the items at the top and bottom of the contrast: items A, B, C and items a, b, c. If those items are different enough to be considered as different variables (similar to “height” and “weight”), then the items should be split into separate analyses. If the items are part of the same dimension (similar to “addition” and “subtraction” on an arithmetic test), then no action is necessary. You are seeing the expected covariance of items in the same content area of a dimension. (Linacre, 2018a, 2018b)

Formative Assessment Checkpoint #2
Question: If the unexplained variance in the 1st contrast has an Eigenvalue >2.0, does this mean that you do not have one trait?
Answer: No. If the Eigenvalue is >2.0, then you investigate the types of items that seem to be in different Clusters in the Standardized Residuals Contrast Plot of Table 23 of Winsteps. You then think about the types of items in potential clusters. It could be that you are really seeing, among many possibilities, evidence for items at two different ends of one continuum, one variable.

Cross Plotting to Explore Dimensionality

Cross Plot Type 1
Cross plotting can be conducted when we are trying to assemble guidance regarding the strength and importance of potential additional variables that may impact measuring with an instrument. This cross plotting makes use of a comment by Linacre:

The impact of a residual factor on the measurement system can be easily determined. Extract two subsets of items representing the opposite poles of the factor. Plot each item’s subset difficulty against its original difficulty to see in what manner the item construct is being distorted. Only perturbation that has an impact on the empirical meaning or use of the measures is of concern. (1998, p. 636)

To implement this tip, use Fig. 2.2, in which we presented the Standardized Residual Contrasts. A researcher first needs to identify the items that represent the opposite poles of the factor. In the case of Fig. 2.2, readers should look at the right side of the diagram and identify the column that lists the Cluster. In this figure you will see the numbers 1, 2 and 3. To carry out the cross plot that compares subset difficulty and original difficulty, we suggest that, to start, readers consider only the items in Clusters 1 and 3 (this means all items except items F, G, and f, which are all part of Cluster 2). The next steps for a researcher, with a sketch of Steps 1-3 following the list, are:

Step 1. Compute the person measures using only the Cluster 1 items, and then cross plot those person measures against the person measures computed using all the items.


Step 2. Compute the person measures using only the Cluster 3 items, and then cross plot those person measures against the person measures computed using all the items.
Step 3. After you have completed the cross plots, review them. Did the different mix of items make any difference in the person measures? If you wish, compute the correlation between the person measures based on the different mixes of items. If there is a high correlation, you might conclude that you are dealing with one variable.
Step 4. There is another twist that you might attempt. Compute the item difficulties using only the Cluster 1 items. Then cross plot the item difficulties you computed using only the Cluster 1 items against the item difficulties you computed (for the Cluster 1 items) using all the items.
Step 5. Compute the item difficulties using only the Cluster 3 items. Then cross plot the item difficulties you computed using only the Cluster 3 items against the item difficulties (for the Cluster 3 items) you computed using all the items.
Step 6. You would expect that if you are dealing with one variable, then there should not be a large shift in item difficulty.
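As promised above, here is a minimal Python sketch of Steps 1-3. It assumes the person measures have already been computed in three separate Winsteps runs (all items, Cluster 1 items only, Cluster 3 items only) and exported to the hypothetical files named below; the file names and variable names are ours.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical exports of person measures (logits), one value per person
all_items = np.loadtxt('person_measures_all_items.txt')
cluster1 = np.loadtxt('person_measures_cluster1.txt')
cluster3 = np.loadtxt('person_measures_cluster3.txt')

for label, subset in [('Cluster 1', cluster1), ('Cluster 3', cluster3)]:
    r = np.corrcoef(subset, all_items)[0, 1]  # the Step 3 correlation
    plt.figure()
    plt.scatter(all_items, subset)
    # Identity line: persons far from it are those whose measures shift
    lims = [min(all_items.min(), subset.min()), max(all_items.max(), subset.max())]
    plt.plot(lims, lims)
    plt.xlabel('Person measure, all items (logits)')
    plt.ylabel(f'Person measure, {label} items (logits)')
    plt.title(f'{label} vs. all items, r = {r:.2f}')
plt.show()

The same loop, applied to item difficulties instead of person measures, implements Steps 4 and 5.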

Cross Plot Type 2
When you run Winsteps and select Table 23, you will be asked if you wish to “Display Table 23” and if you wish to “Display Scatterplot of person estimates for the 1st contrast”. Select both of those options. By selecting the scatterplot you will be provided with cross plots of person measures computed using only specific item clusters. Thus, if your analysis has three potential clusters, you will be provided with: (1) a cross plot comparing person measures using only Cluster 1 items and using only Cluster 2 items, (2) a cross plot comparing person measures using only Cluster 1 items and using only Cluster 3 items, and (3) a cross plot comparing person measures using only Cluster 2 items and using only Cluster 3 items. You will want to visually compare the person measures as a function of Cluster. If most of the person measures fall between the error lines, that is evidence supporting a single variable. Also of great importance is that Table 23 computes the correlations for you in each of these plots. These are values you will want to cite in your articles. For instance, look at Fig. 2.1. In that figure you can, for example, look up the correlation of the person measures computed using only Cluster 1 items and only Cluster 3 items; that value is .7018.

To finish up this chapter, just a few added comments. There are many Winsteps tables which help one explore PCAR. In some of those tables one can explore person contrasts by loading, and item contrasts by persons. First, please take heed of a comment provided by Mike Linacre in the Winsteps Manual (p. 394): “Please do not interpret this as a usual factor analysis. These plots show contrasts between opposing factors, not loadings on one factor.” Finally, the way in which the loadings can be utilized is also discussed in the Winsteps Manual (p. 394).
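The disattenuated correlations that Table 23 reports can also be approximated by hand with the classical disattenuation formula: divide the observed correlation by the square root of the product of the two reliabilities. The Python sketch below is our own illustration of that formula; it does not reproduce the additional adjustment for extreme scores behind the “+Extr” columns.

import numpy as np

def disattenuated_r(x, y, rel_x, rel_y):
    # x, y: person measures from the two item clusters
    # rel_x, rel_y: Rasch person reliabilities reported for each run
    r = np.corrcoef(x, y)[0, 1]
    return r / np.sqrt(rel_x * rel_y)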


Formative Assessment Checkpoint #3
Question: What is the role of conducting cross plots of person measures when conducting a PCAR to investigate the possible presence of a secondary dimension in the data?
Answer: An investigation of PCAR might suggest that a subset of items marks one trait and a second set of items involves another trait. A cross plot of person measures allows one to investigate whether or not person measures were greatly altered as a function of item subset. If there is not a large number of persons whose measures change (person measures outside the confidence bands of a cross plot), then such a plot would suggest that one is not seeing two different variables.

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Ok. I think I have a handle on PCAR.
Chizuko: Then tell me what you think you know!
Charlie: Well, the first thing is it makes sense to me that there is a certain level of predicted noise when we use the Rasch model. I was reminded of this when I thought of fit statistics. We don’t expect a perfect Guttman pattern in student answers, but we don’t want to see something that is too wild.
Chizuko: Is that where we look at the unexplained variance in the 1st contrast and we want to see an Eigenvalue of 2.0 or less?
Charlie: Yes, exactly. If an Eigenvalue is above 2.0, we have evidence we are seeing more noise than we would expect.
Chizuko: What are the other things you look at?
Charlie: Well, this is not all of what I consider, but I usually look at the part of Table 23 with the following heading: Contrast 1 From Principal Component Analysis Standardized Residual Loadings for Item (Sorted by Loading). When I look at this table, I look to see if the items with positive loadings seem different in meaning from items with negative loadings. I also remind myself over and over not to think of key buzzwords in Factor Analysis when I look over these tables.
Charlie: Then as I collect information, I may plot the person measures with one potential set of items defining the trait against the measures computed with a different set of items (to define the trait) suggested from PCAR. I look to see if my conclusions about respondents would be different as a function of the items defining a measure.
Chizuko: Would you not agree, Charlie, that when we look at PCAR, it is a similar effort as when we look at fit? We start off an instrument and data collection with a theory, which supports one variable. Then we conduct an analysis, and we look for deviations from the Rasch model predictions. When we conduct a PCAR, we look for evidence of noise that is above what we would predict. Just as with fit, we expect there always to be noise. We just need to investigate whether our measurement is degraded, and we need to think through what we hope to measure.
Charlie: You know I always agree with you, no matter the scale!


Keywords and Phrases
PCAR (Principal Component Analysis of Residuals)
Unexplained variance in the 1st contrast
Eigenvalue
Cross plot of person measures to investigate dimensionality

Potential Article Text
For this example, we make use of text that was authored in an article:

Rasch principal component analysis (PCA) of the residuals was used to test against the hypothesis of unidimensionality. The variance in UE motor function explained by the S-WMFT was analyzed. Unidimensionality was supported when the variance explained by the first dimension exceeded 50%, and an Eigenvalue of the unexplained variance in the first contrast was less than 2. (Chen et al., 2012, pp. 1020–1021)

Quick Tips
From the Winsteps Manual:

Please do not interpret Rasch-residual-based Principal Components Analysis (PCA) as a normal factor analysis. These components show contrasts between opposing factors, not loadings on one factor. Criteria have yet to be established for when a deviation becomes a dimension; therefore, PCA is indicative, but not definitive, about secondary dimensions. In conventional factor analysis, interpretation may be based only on positive loadings. In the PCA of Residuals, interpretation must be based on the contrast between positive and negative loadings. Winsteps is doing a PCA of Residuals, not of the original observations; therefore, the first component (dimension) has already been removed. We are looking at secondary dimensions, components or contrasts. When interpreting the meaning of a component or a factor, the conventional approach is only to look at the largest positive loadings to infer the substantive meaning of the component. In Winsteps PCA, this method of interpretation can be misleading because the component is reflecting opposing response patterns across items by persons. We, therefore, need to identify the opposing response patterns and interpret the meaning of the component from those. These are the response patterns to the items at the top and bottom of the plots. (Linacre, 2018a, 2018b)

An Eigenvalue above 2.0 suggests the possibility of a second trait.

From the Winsteps Manual: “Sample size: A useful criterion is 100 persons for PCA of items, and 100 items for PCA of persons, though useful findings can be obtained with 20 persons for PCA of items, and 20 items for PCA of persons” (Linacre, 2018a, 2018b).

In addition to a PCAR of items, one can conduct a PCAR of persons (Table 24 of Winsteps).


Ask yourself: is the observed noise in the data above what one would expect in data conforming to the Rasch model?

PCAR is not Factor Analysis.

Make sure to ask yourself, are you measuring what you want to measure? Also, remind yourself that you need a measurement instrument that allows you to do your work. Some measurement instruments used by researchers are not perfect but can still be used.

Always remember to cross plot person measures and item measures to investigate the impact of potential items that might define another trait. Table 23 provides you with cross plots of person measures as a function of cluster. Also remember that Table 23 provides you with the correlations of the person measures cross plotted in the scatterplots it provides.

Data Sets (Go to http://extras.springer.com)
cf se all

Activities
Activity #1
Please explain in your own words the point of considering noise when one conducts a PCAR. (Refer to the text if you are stuck.)

Activity #2
In the Potential Article Text section of this chapter, we present text that appeared in a published article. Find two other published articles that report an investigation of PCAR as part of their efforts to conduct a Rasch analysis.
Answer: Here are two sample articles:
“Rasch Analysis of the Ocular Surface Disease Index (OSDI)” (Dougherty, Nichols, & Nichols, 2011)
“Rasch Analysis of the Dutch Version of the Oxford Elbow Score” (de Haan, Schep, Tuinebreijer, Patka, & den Hartog, 2011)


Activity #3
Take a control file for one of your own data sets. Use the Quick Tips presented in this chapter to investigate your instrument. If you think there might be evidence for more than one variable, remember to cross plot. Also, remember to think about what you wish to measure.

Activity #4
When good measurement is conducted, a lot of thinking takes place concerning the trait that one is measuring. How does such thinking help in the interpretation of many Rasch indices and Rasch techniques such as PCAR?
Answer: Thinking about what one wants to measure helps alert one to the meaning of items, which may or may not appear to be part of a trait. For example, a math test may have been constructed for 15-year-olds, and the test may contain a mix of item topics (e.g. algebra, geometry); by including a mix of items and attempting to measure, one is in essence stating that one can mark the trait from less to more with a range of math item topics. By thinking about how one wants to measure, it is then possible to identify items in a PCAR and ask whether suggestions of a second variable make sense.

Activity #5
Read the article “Rasch modelling of a scale that explores the take-up of physics among school students from the perspective of teachers” (Oon & Subramaniam, 2011). Review the steps the authors take with regard to the analysis of PCAR. What steps match what was outlined in this chapter? Are there additional steps with respect to PCAR that are taken?

Activity #6
Read the article “Psychometric validation of the Manual Ability Measure-36 (MAM-36) in patients with neurologic and musculoskeletal disorders” (Chen & Bode, 2010). Review the steps in which PCAR and Winsteps were used to evaluate a data set. What steps were used? Are there steps and interpretations that you can add to your Quick Tips list for conducting a PCAR using Winsteps?


References
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Chen, C. C., & Bode, R. K. (2010). Psychometric validation of the Manual Ability Measure-36 (MAM-36) in patients with neurologic and musculoskeletal disorders. Archives of Physical Medicine and Rehabilitation, 91(3), 414–420.
Chen, H. F., Wu, C. Y., Lin, K. C., Chen, H. C., Chen, C. P., & Chen, C. K. (2012). Rasch validation of the streamlined Wolf Motor Function Test in people with chronic stroke and subacute stroke. Physical Therapy, 92(8), 1017–1026.
de Haan, J., Schep, N., Tuinebreijer, W., Patka, P., & den Hartog, D. (2011). Rasch analysis of the Dutch version of the Oxford elbow score. Patient Related Outcome Measures, 2, 145–149.
Dougherty, B. E., Nichols, J. J., & Nichols, K. K. (2011). Rasch analysis of the Ocular Surface Disease Index (OSDI). Investigative Ophthalmology & Visual Science, 52(12), 8630–8635.
Enochs, L. G., & Riggs, I. M. (1990). Further development of an elementary science teaching efficacy belief instrument: A preservice elementary scale. School Science and Mathematics, 90(8), 694–706.
Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis (PCA)? Rasch Measurement Transactions, 12(2), 636.
Linacre, J. M. (2018a). Winsteps® Rasch measurement computer program User’s Guide. Beaverton, OR: Winsteps.com.
Linacre, J. M. (2018b). Winsteps® Rasch measurement computer program Help Tab. Beaverton, OR: Winsteps.com.
Oon, P. T., & Subramaniam, R. (2011). Rasch modelling of a scale that explores the take-up of physics among school students from the perspective of teachers. In R. F. Cavanaugh & R. F. Waugh (Eds.), Applications of Rasch measurement in learning environments research (pp. 119–139). Rotterdam, the Netherlands: Sense Publishers.

Additional Readings
A good article that shows the mathematics of PCAR and also considers the application of PCAR in Rasch analysis. The data set is a national data set of job satisfaction in Italy:
Brentari, E., & Golia, S. (2007). Unidimensionality in the Rasch model: How to detect and interpret. Statistica, 67(3), 253–261.
An excellent paper explaining the differences between factor analysis and Rasch-based analyses:
Sick, J. (2011). Rasch measurement and factor analysis. SHIKEN: JALT Testing & Evaluation SIG Newsletter, 15(1), 15–17.
Two excellent and brief articles authored by Ben Wright comparing factor analysis and Rasch measurement:
Wright, B. D. (1994). Comparing factor analysis and Rasch measurement. Rasch Measurement Transactions, 8(1), 350.
Wright, B. D. (2000). Conventional factor analysis vs. Rasch residual factor analysis. Rasch Measurement Transactions, 14(2), 753.

Chapter 3

Point Measure Correlation

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Chizuko, I have heard of Point Measure Correlation, but I am not sure how and why I should grasp and apply it. Do you have any background on this?
Chizuko: Well, yes Charlie, I have had some experience with this. Do you remember the extensive work that we have done considering whether or not an item behaves as it should? Of course, the behavior of an item can, in part, be gauged by the amount of its misfit.
Charlie: Yes, I do recall that… but what does that have to do with Point Measure Correlation?
Chizuko: Remember, we always want to do a good job using a set of items to help us measure a person. We could use a test (right/wrong) or we could use a survey. Our goal is to make sure that the items help us measure a trait, and that there is an overall pattern in respondents’ answers that is aligned with our predictions. As you learn more about point measure correlation, you will realize that we can use some fairly simple techniques to identify items that are acting in an odd way. I think you will see that this work is similar to the work we have already done with fit. Okay?
Charlie: I got it! Tell me more!

Tie Ins to RAHS Book 1
In this chapter, we revisit a topic that we covered, in part, in RAHS Book 1. That topic involved the issue of negative (reverse coded) items. Recall that many of the STEBI self-efficacy items are negatively worded. Readers of our first book will recall that we are not big fans of items that need to be flipped. In some instances, the insertion of a negative word (in an effort to keep the reader of a survey alert) really makes a survey more of a reading survey. In this chapter we discuss how the point measure correlation of items can be quickly reviewed in Winsteps and what this has to do with flipping. Also, we discuss how point measure correlation can be used to identify items that may not help us measure the trait of interest. In RAHS Book 1 we learned that item fit can be used to identify items that may not belong to a trait. In this chapter we learn an added technique that can be used to identify items that may be miscoded and/or may not be part of a single trait.

Introduction
Herein, as well as in our first book, we stress the importance of monitoring the functioning of a measurement instrument: How do items behave? Do the measures of items match what we predict? Do the behaviors of items fit the Rasch model? Another technique that we have found useful for evaluating the functioning of an instrument is to review the Point Measure Correlation of the items. In this chapter, we introduce readers to this technique, which can be incorporated into a Rasch analysis. If readers conduct a Rasch analysis of their data and review many of the Winsteps tables, e.g., the Item Measure Table, Item Entry Table, and Item Correlation Table, it is easy to find the column labeled PTMEASUR-CORR. This column presents the Point Measure Correlations, and we will use these values to learn more about our measurement instrument. In the Winsteps documentation (https://www.winsteps.com/winman/correlations.htm), Mike Linacre (2018) provides this summary of how the point measure correlation is computed:

Point-measure correlation for all observations. Computes the Pearson point-measure correlation coefficients, rpm between the observations and the measures, estimated from the raw scores including the current observation or the anchored values. Measures corresponding to extreme scores are included in the computation.
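Taking Linacre’s description at face value, the computation for a single item can be sketched in a few lines of Python (our own illustration, not Winsteps code):

import numpy as np

def point_measure_correlation(item_responses, person_measures):
    # Pearson correlation between the observations on one item (the 0-6
    # ratings, or the 0/1 scores on a test item) and the person measures
    return np.corrcoef(item_responses, person_measures)[0, 1]

Respondents with more of the trait should tend to give higher responses to an item that marks the trait, so a well-behaved item should show a positive value.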

Where is the Point Measure Correlation? How do we interpret it? For this analysis, the control file cf se-all.cf will be used. This control file includes the now familiar 13 self-efficacy items of the STEBI. Readers should carefully note that the control file uses the line NEWSCR=6543210 and the line RESCOR=0101101011101 to rescore the negative items of the STEBI. Readers should also review the CODES line of the control file and note that the original coding of the STEBI was 0123456. Readers will then be able to note that, for example, the second Self-Efficacy item was recoded in the analysis. To confirm this, look at the RESCOR line and find a 1, for instance the second number presented in the string 0101101011101. By looking at the NEWSCR line and the CODES line in the control file, readers can see that we have told Winsteps that a 0 (for these flipped items) will be recoded as a 6, a 1 will be recoded as a 5, and so on. Readers’ next step toward better understanding the manner in which analysts might use the Point Measure Correlation is to run Winsteps with the control file cf se-all.cf and retrieve Table 26.1 (Fig. 3.1).
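The flipping performed by NEWSCR=6543210 together with the RESCOR flags can be mimicked outside Winsteps. Here is a small Python sketch using our own (hypothetical) file and variable names:

import numpy as np

# Hypothetical matrix of raw responses (persons x 13 items), coded 0-6
responses = np.loadtxt('stebi_se_raw.txt')
# RESCOR-style flags: 1 = flip the item, 0 = leave it alone (0101101011101)
flip = np.array([0,1,0,1,1,0,1,0,1,1,1,0,1], dtype=bool)

recoded = responses.copy()
recoded[:, flip] = 6 - recoded[:, flip]  # 0->6, 1->5, ..., 6->0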

Introduction

27

TABLE 26.1 e:se-all.cf 8 flips ZOU149WS.TXT Mar 23 2020 10:23
INPUT: 154 PERSON 13 ITEM REPORTED: 153 PERSON 13 ITEM 7 CATS WINSTEPS 4.4.8
ITEM STATISTICS: CORRELATION ORDER
[Item table sorted by point measure correlation: the PTMEASUR-CORR values run from .42 (c96 q2 se) up to .71 (c113 q19-f), and no item shows a negative value.]

Fig. 3.1 Winsteps Table 26.1 displays the output of the analysis as organized by Point Measure Correlation. The Point Measure Correlation is listed in the column entitled PTMEASUR-CORR. The Point Measure Correlation of item q2 se is .42. Please note that the far right of the table allows one to identify each item. In this analysis, the values range from .42 to .71

Fig. 3.2 An edited control file in which the second Self-Efficacy item was not recoded, even though the survey item, due to item wording, needed to be recoded before conducting an analysis

TITLE=''
FORMAT=(13A1)
;
;
ITEM1=1
NI=13
NEWSCR=6543210
;RESCOR=0101101011101
RESCOR=0001101011101
CODES=0123456
XWIDE=1
STBIAS=Y
DISTRT=Y
EXTRSC=0.3
FITP=3.0
FITI=3.0
LOCAL=Y
CURVES=111
;umean=
;uscale=
&END

To begin explaining the usefulness of the Point Measure Correlation, we make an edit in our control file. Readers should place a semicolon (;) in front of the RESCOR line, and then retype the line with one change: make the new line RESCOR=0001101011101. Astute readers will see that the second self-efficacy item read in the Winsteps analysis will no longer be recoded. This item was a negatively worded item and needed to be flipped for any analysis. However, by making the change above, we pretend that we forgot to flip this item. This means that q3 se-f, which should have been flipped, was not flipped! In Fig. 3.2 we present a segment of our new control file, cf se forget something.

28

3  Point Measure Correlation

Readers should note that our new control file includes the change: the second of the SE items will not be flipped, even though the item, due to its wording, needs to be flipped. Our next step is to conduct a Winsteps analysis with the newly edited control file. However, before you run the file, can you predict what the impact will be on the Point Measure Correlations?

In Fig. 3.3 we present the results of the analysis with the miscoded data. Readers should note that the unflipped survey item has a negative Point Measure Correlation (−.43). This is noteworthy because it is the only item with a negative point measure correlation. Readers should appreciate that the Point Measure Correlation, in our view, is a very useful tool for identifying potential miscoding of data. Hopefully, readers will also notice the high MNSQ Outfit and high MNSQ Infit for this item, which resulted from conducting the analysis with the miscoding. Readers will recall that high misfit of the sort observed is often the result of miscoded data; however, many other reasons exist for misfit. We therefore suggest that when analysts conduct their analyses, one of the first pieces of output they review is the Point Measure Correlation for each item. A negative point measure correlation may mean an item is not helping us measure. (A small sketch for automating this screening follows the checkpoint below.)

Formative Assessment Checkpoint #1
Question: Why should I review the Point Measure Correlation column? What should I look for?
Answer: As a general rule of thumb, a negative Point Measure Correlation may suggest a coding problem and/or an item that does not match our conception of the trait we are measuring. An analyst hopes to see positive Point Measure Correlations for all items of a survey or test.
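As promised, the screening itself is easy to automate once the item table has been exported. A tiny Python sketch, assuming a hypothetical CSV export of the item table with columns named ITEM and PTMA (match these names to however you export your own table):

import pandas as pd

items = pd.read_csv('item_table.csv')  # hypothetical export of the item table
flagged = items[items['PTMA'] < 0]     # candidates for a miscoding review
print(flagged[['ITEM', 'PTMA']])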

TABLE 26.1 C:\Users\Bill Boone\Desktop\cf se for ZOU442WS.TXT Mar 23 2020 10:40
INPUT: 154 PERSON 13 ITEM REPORTED: 153 PERSON 13 ITEM 7 CATS WINSTEPS 4.4.8
ITEM STATISTICS: CORRELATION ORDER
[Item table sorted by point measure correlation: the unflipped item c97 q3 se-f now tops the list with a PTMEASUR-CORR of -.43 (Infit MNSQ 3.48, Outfit MNSQ 4.86); all other items show positive values ranging from .51 to .81.]

Fig. 3.3 The results of an analysis when an item that should have been flipped was not flipped. The item that was miscoded (not flipped) for the analysis is the only item with a negative Point Measure Correlation. (Table from Winsteps)


Reviewing Point Measure Correlations for Tests

Just as Point Measure Correlations can be used to identify possibly miscoded survey items, such techniques can also be used for tests. Continuing our explanation, we provide a control file entitled cf point measure correlation of a test. As a first step, we review the control file as we did for the survey analysis. Figure 3.4 presents much of this control file. This multiple-choice test consisted of 22 items, and answers were coded with a 1 for the selection of item answer A, a 2 for the selection of item answer B, and so on. The most important point for readers to note is the answer key for the test: readers should look at the line that begins KEY1. The students’ answers are provided as numbers, and in this figure we provide the data for the first three students.

Following creation of the control file, the researchers conducted a Winsteps analysis of the test data. Table 26.1, the result of that analysis, is provided below in Fig. 3.5. Examination of the Point Measure Correlation column shows that values range from a low of .05 to a high of .62.

We now conduct a brief experiment for, and with, readers. First, we begin with the original test control file. Next, we alter the answer key for one item of the test (Q32). The correct answer for this item is 4 (student answer D), and we replace 4 with 3. So, we have inserted the wrong answer for Q32 in the control file. In Fig. 3.6 we provide a part of this new control file.

TITLE=''
FORMAT=(11A1,26(1X,1x),22(1A1,1X))
; The answer key for the test
KEY1="1334141221342344313221"
ITEM1=12
NI=22
NAME1=1
; The person ID is a total of 11 columns wide
NAMLEN=11
CODES="12345"
XWIDE=1
*
&END
Q27
Q28
Q29
Q30
Q31
Q32
Q33
Q34
Q35
Q36
Q37
Q38
Q39
Q40
Q41
Q42
Q43
Q44
Q45
Q46
Q47
Q48
END NAMES
1762354xxx,1,1,1,4,1,4,1,3,1,4,4,2,2,3,3,3,4,1,2,4,1,2,2,3,2,1,1,4,3,4,1,4,3,2,5,4,3,4,5,1,3,k,k,k,k,k,k,k,,k,k,k
2459395xxx,1,2,1,4,1,1,3,3,1,2,2,3,4,1,3,1,4,2,2,1,1,3,3,3,2,2,1,4,3,4,1,4,3,3,5,5,3,4,4,2,2,4,3,4,3,2,2,1,,2,2,1
3752190xxx,2,1,1,4,2,4,4,3,1,1,4,3,2,4,4,3,4,k,1,3,1,3,3,3,2,1,1,3,3,4,1,4,4,2,2,5,3,4,2,1,4,4,1,2,3,1,2,k,,k,k,k

Fig. 3.4  A component of a test control file. Of importance for readers to note is that each student’s answers are presented using the numbers. Also, readers should note that the control file includes an answer key. This allows Winsteps to determine if the student has, or has not, correctly answered each item
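To make the role of the KEY1 line concrete, here is a small Python sketch (ours, not part of Winsteps) that scores one student’s coded answers against the key exactly as KEY1 instructs Winsteps to do. The answer string is hypothetical:

key     = "1334141221342344313221"
answers = "1314141223342344413221"  # hypothetical coded responses of one student

# Score 1 when the coded answer matches the key, else 0; in the real data
# file, codes such as 'k' would be treated as missing responses.
scores = [1 if a == k else 0 for a, k in zip(answers, key)]
print(sum(scores), "of", len(key), "items correct")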


TABLE 26.1 ZOU254WS.TXT Mar 23 2020 10:59
INPUT: 64 PERSON 22 ITEM REPORTED: 62 PERSON 22 ITEM 2 CATS WINSTEPS 4.4.8
ITEM STATISTICS: CORRELATION ORDER
[Item table for the correctly keyed test, sorted by point measure correlation: PTMEASUR-CORR values range from .05 (Q28) to .62 (Q47), and no item shows a negative value. With the correct key, 34 of 61 students answer Q27 correctly (measure -.39) and 54 of 61 answer Q32 correctly (measure -2.38).]

Fig. 3.5 The results of evaluating a multiple-choice data set in which the answer key is correctly entered. Note the absence of any test items with a negative (−) point measure correlation. (Table from Winsteps)

Fig. 3.6 The control file for the test data set in which the answer key for the sixth test item (Q32) is incorrect. The correct answer is 4, but this answer key presents the answer as 3

TITLE=''
FORMAT=(11A1,26(1X,1x),22(1A1,1X))
; The answer key for the test
KEY1="1334131221342344313221"
ITEM1=12
NI=22
NAME1=1
; The person ID is a total of 11 columns wide
NAMLEN=11
CODES="12345"
XWIDE=1
*
&END
Q27
Q28

In Fig. 3.7 we provide Winsteps Table 26.1 from this analysis. Of importance, readers should now observe that item Q32 exhibits a negative point measure correlation (−.22). Just as the survey item that was not flipped was flagged by a negative point measure correlation, an item that is miskeyed is also flagged by a negative point measure correlation. As the result of the miscoded answer key, readers can observe that this item exhibits a negative point measure correlation. Thus, the same pattern (a negative point measure correlation) that was observed for the miscoded survey data is also observed for a test with an incorrect answer key.


TABLE 26.1 ZOU931WS.TXT Mar 23 2020 11:09
INPUT: 64 PERSON 22 ITEM REPORTED: 62 PERSON 22 ITEM 2 CATS WINSTEPS 4.4.8
ITEM STATISTICS: CORRELATION ORDER
[Item table for the test with the miskeyed Q32, sorted by point measure correlation: Q32 now tops the list with a PTMEASUR-CORR of -.22 (score 3 of 61, measure 2.82, Outfit MNSQ 2.42); all other items show positive values from .11 to .61.]

Fig. 3.7  Winsteps Table 26.1 for the analysis of a multiple-choice test. Item Q32 has a negative point measure correlation after we edited the answer key, placing a wrong answer as the answer for the sixth item on the multiple-choice test (Q32)

Fig. 3.8  The edited control file for the test data. The answer key has been edited so that a wrong answer is provided as the right answer for the first item of the test. The correct answer 1 in the key for the first item (Q27) was changed to 2

TITLE=''
FORMAT=(11A1,26(1X,1x),22(1A1,1X))
; The answer key for the test
KEY1="2334141221342344313221"
ITEM1=12
NI=22
NAME1=1
; The person ID is a total of 11 columns wide
NAMLEN=11
CODES="12345"
XWIDE=1
&END
Q27
Q28

Experiment #2 With Test Data

We now replicate the same experiment, but we use the first item of the test. First, we restore the correct answer key entry for Q32. Then, to miscode the test data, we insert an error for the first item in the test (Q27). Reviewing the control file in Fig. 3.4 and the line KEY1, it is possible to note that the correct answer for test question Q27 was coded as 1 (students selecting A were coded 1, students selecting B were coded 2, and so forth). Next, we insert an error in the answer key by miscoding 2 as the correct answer for the first item on the test (Q27). This control file is named cf miscode point measure correlation of a test, and part of the control file is presented in Fig. 3.8.


Below is Fig. 3.9 (Table 26.1 of Winsteps) for the analysis of the test data using the control file with the intentional error in the answer key. However, we now see something strange. The miscoded item, Q27, does not exhibit a negative point measure correlation; the observed value is .08. Why might that be the case, and how should we think about this? If we look behind the scenes at the number of students who "correctly" answered this item, we can observe that only 11 of 61 respondents matched the miscoded key; that is, 11 students selected 2 for this item. Examining the original analysis, we observed that 34 of 61 students correctly answered item Q27. Therefore, we should first realize that the recoding (the error in the answer key for this item) causes the item to appear harder than it actually is. Now, why do we not see a negative point measure correlation as we did for the survey data and the initial experiment with this test data? To answer this question, we suggest that readers review a brief article by Benjamin Wright in which he considers Rasch fit statistics and point-biserials. In that article, Wright makes the following comment with respect to point-biserials: We don't know where the value we are observing is placed in the possible range. We don't know whether that value is acceptable, undesirably large or undesirably small. The rpbis is a misfit statistic but of unknown size and significance. All we know for certain, (and this is

TABLE 26.1 C:\Users\Bill Boone\Desktop\cf for fi ZOU555WS.TXT Mar 23 2020 11:18
INPUT: 64 PERSON 22 ITEM REPORTED: 62 PERSON 22 ITEM 2 CATS WINSTEPS 4.4.8
--------------------------------------------------------------------------------
PERSON: REAL SEP.: .53 REL.: .22 ... ITEM: REAL SEP.: 2.52 REL.: .86
ITEM STATISTICS: CORRELATION ORDER
------------------------------------------------------------------------------------------
|ENTRY TOTAL TOTAL MODEL| INFIT | OUTFIT |PTMEASUR-AL|EXACT MATCH| |
|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR. EXP.| OBS% EXP%| ITEM |
|------------------------------------+----------+----------+-----------+-----------+------|
| 14 11 58 1.39 .35|1.14 .74|1.22 .81| .07 .25| 79.3 81.1| Q40 |
| 17 14 31 .14 .38|1.17 1.46|1.19 1.33| .08 .31| 51.6 63.5| Q43 |
| 2 18 61 .77 .29|1.14 1.12|1.27 1.41| .08 .29| 68.9 71.8| Q28 |
| 1 11 61 1.46 .34|1.11 .59|1.19 .71| .08 .25| 82.0 81.9| Q27 |
| 7 9 61 1.71 .37|1.04 .25|1.29 .88| .13 .23| 85.2 85.2| Q33 |
| 18 9 31 .91 .41|1.06 .39|1.22 .94| .16 .29| 77.4 72.1| Q44 |
| 20 11 29 .52 .40|1.07 .52|1.04 .28| .22 .30| 62.1 66.4| Q46 |
| 12 39 59 -.95 .29|1.08 .70|1.08 .58| .23 .33| 66.1 70.4| Q38 |
| 6 54 61 -2.46 .42| .97 -.01|1.01 .15| .28 .24| 88.5 88.5| Q32 |
| 3 40 61 -.92 .29|1.01 .10|1.08 .56| .30 .32| 68.9 69.9| Q29 |
| 4 30 61 -.16 .27|1.02 .22|1.03 .33| .30 .33| 67.2 62.7| Q30 |
| 15 23 32 -1.11 .41| .98 -.07| .96 -.07| .31 .28| 78.1 72.6| Q41 |
| 5 33 61 -.38 .27| .99 -.12| .98 -.19| .35 .33| 68.9 62.9| Q31 |
| 9 25 60 .16 .28| .98 -.25| .94 -.44| .36 .32| 63.3 64.5| Q35 |
| 11 41 58 -1.19 .30| .93 -.49|1.05 .34| .37 .32| 77.6 73.1| Q37 |
| 13 32 58 -.44 .28| .97 -.33| .94 -.49| .38 .33| 65.5 63.8| Q39 |
| 16 12 31 .44 .39| .92 -.54| .88 -.71| .43 .31| 71.0 66.1| Q42 |
| 10 24 60 .24 .28| .88 -1.35| .86 -1.12| .47 .31| 75.0 65.0| Q36 |
| 8 24 60 .24 .28| .86 -1.59| .81 -1.56| .50 .31| 71.7 65.0| Q34 |
| 19 18 30 -.48 .39| .86 -1.15| .81 -1.26| .52 .30| 63.3 64.6| Q45 |
| 21 13 28 .11 .40| .81 -1.75| .78 -1.72| .59 .31| 75.0 62.8| Q47 |
| 22 13 27 .00 .40| .80 -1.86| .77 -1.83| .60 .30| 74.1 62.3| Q48 |
|------------------------------------+----------+----------+-----------+-----------+------|
| MEAN 22.9 49.0 .00 .34| .99 -.2|1.02 .0| | 71.8 69.8| |
| P.SD 12.3 14.6 .96 .06| .10 .9| .16 1.0| | 8.3 7.7| |
-------------------------------------------------------------------------------------------

Fig. 3.9  Winsteps Table 26.1 of the analysis of the test data when the answer for the first item of the test (Q27) is incorrectly entered in the answer key. A 2 has been entered for the correct answer when the correct answer is 1


useful in detecting miscoded data), is that negative rpbis means that the observed responses to that item contradict the general meaning of the test. (Wright, 1992, p. 174)

Summarizing the evidence, we conclude that when analysts observe a positive point measure correlation even for intentionally miscoded data, the phenomenon is due in part to the issue Wright describes above, namely that the size and significance of the correlation can be unknown. The second part of Wright's quotation is equally important: a negative correlation is observed when students' responses to an item – in our example, item Q27 – contradict the meaning of the test. Our recoding of the answer key with the wrong answer did not result in item Q27 contradicting the meaning of the test. The crucial point is that readers should not assume that all coding errors will be "caught" through a correlation review. Thus, we encourage readers to review the point measure correlation of every item in a survey or a test, but not to assume that such a review will always catch every mistake. This inability to spot all miscoding is also revealed by the fit values for item Q27 following the analysis with the wrong answer key (Outfit MNSQ 1.19). Even though we know the wrong answer has been inserted into the answer key, the MNSQ does not flag the item for strange behavior. Readers should also note that point measure correlations can help spot items that may not contribute to measurement, and that, just as PCAR (Chap. 2) can be used to explore the dimensionality of an instrument, point measure correlations can also be used to investigate this issue.

Formative Assessment Checkpoint #2

Question: Why might an error in coding for a survey or a test answer key not produce a negative point measure correlation?

Answer: Quoting Wright: "The rpbis is a misfit statistic but of unknown size and significance" (1992, p. 174). For the authors of this book, this means we can never be sure that a point measure correlation will flag a problem with an answer key or the coding of survey data, or will reveal whether an item contributes to measurement. The key is to use multiple techniques to help spot problems with item coding, answer keys, and how (and if) items define a trait. This is consistent with the fact that one uses many techniques to evaluate the fit of items to the Rasch model.

Charlie and Chizuko: Two Colleagues Conversing

Chizuko: Alright Charlie, it's quiz time on point measure correlations. Ready?

Charlie: I'm as ready as I'll be without my coffee.

Chizuko: Let's pretend you have been given a survey data set to evaluate. You are looking over the point measure correlation column. What are you scanning for and why?


Charlie: Well… I could pull up any of the item tables – item entry, item measure – but I like to pull up Table 26 because it is organized by correlation order, from lowest point measure correlation to highest. Then I scan for items with a negative point measure correlation.

Chizuko: If you see an item with a negative value, what does that tell you?

Charlie: Well, this is really where thinking before leaping comes in. Do you remember when we look at fit, we try to understand our fit problems? We might ask why an item misfits, and what could cause the misfit. Sometimes we are not sure why an item is misfitting, so we keep the item in the analysis, but we try our best to monitor the item over time. It's the same deal with point measure correlation. When I see a negative point measure correlation, I could be in a situation in which a survey item is miscoded, or there is an error in an answer key. Thus, when an item has a negative point measure correlation, I check the answer key, and I also check the survey coding. For surveys, it is often the case that an item that should have been flipped was not flipped; for a test, there may have been an error in the answer key. I also now know that an item with a negative point measure correlation may be an item that does not help us measure a trait (e.g., it might be poorly worded, or it might not measure the trait in the way the other items of the instrument do).

Chizuko: That is really, really good Charlie. Can I add something?

Charlie: I can't say no, can I?

Chizuko: You are just jealous. Do you remember in book one that the authors suggested creating a "Mr. Positive" (this person had the highest self-efficacy one could have on the scale) and a "Mr. Negative" (this person had the lowest self-efficacy one could have on the scale) regarding self-efficacy? The goal was to be able to better understand the meaning of measures. For example, what did a positive measure represent? What did a negative measure represent? Well, my brainstorm is the following: If you create a Mr. Positive and a Mr. Negative, you are really forced to think about the coding, and that could help you spot a coding error before data are entered. Also, even if data have been entered, adding a Mr. Positive or a Mr. Negative into your data set would help you spot errors in coding.

Charlie: Nice. Moreover, by creating a perfect test student (Henry) and a totally wrong test student (Pete), you could accomplish the same thing. You would have to know the answers to the test, but then you could compare your answer key to the answer key that you have been given!

Keywords and Phrases

Point Measure Correlation
Error in answer key


Negative point measure correlations may indicate an error in the coding of a survey. For example, an item that should have been flipped may not have been flipped.
A negative point measure correlation for test data may suggest a problem in the answer key.
Not all coding problems will show up with a negative point measure correlation.
Interpreting point measure correlation is somewhat like interpreting fit statistics: an item or person flagged by a negative point measure correlation needs additional investigation.
A negative point measure correlation may suggest an item is not helping us measure.

Potential Article Text

Following the entry of survey data and the construction of a control file, a Rasch Winsteps analysis of the data was conducted. The initial analysis was conducted, in part, to identify any items that might need to be recoded (flipped). Items exhibiting a negative point measure correlation would be candidates for further investigation, to determine whether an item that should have been reverse coded was not, or whether an item was reverse coded when it did not need to be. In this analysis, no items needed to be flipped. All point measure correlations were positive. The positive point measure correlations provided one piece of evidence of the items working together to define the trait.

Quick Tips

Error in coding of survey data: some flipped items may not have been flipped. A point measure correlation that is negative may suggest an item that is degrading measurement.

Use Winsteps Table 26.1 and review the "PT-MEASURE CORR" as well as the "EXP" column. The EXP column provides the point measure correlation expected if the test/survey item is answered as predicted by the Rasch model. In Fig. 3.3, look at question c97. One can see a negative PT-MEASURE CORR but a positive EXP. This suggests something is amiss: perhaps a coding issue, perhaps a problem with how the item defines the trait.

Before you panic about an item misfitting, review the point measure correlations of items. You may discover an item with an incorrect answer key, or an item that needed to be "flipped". Once you correct the answer key or flip the item, you may discover that the item is well functioning as indicated by MNSQ.

If you wish to check the setting of PTBISERIAL in your control file, and you do not see that phrase in your control file, simply click on Output Files on the gray


menu tab when you run Winsteps. Then click on Control variable list=. The list that you will see provides the settings of all control variables for your specific control file.

Ben Wright (1992), in his article "Point-Biserials and Item Fits," quotes one author as writing: "Ideally, it is recommended that items have point-biserials ranging from 0.30 to 0.70" (Allen & Yen, 1979).

Ben Wright also provides some useful guidance to those wishing to use and understand point-biserials and item fits. With regard to the potential range of 0.30–0.70 noted above, he comments: "A rule such as this cuts off the very easy and very hard items, and may even eliminate good-fitting on-target items" (1992, p. 174). From the Winsteps Manual (bold and underline added by authors for emphasis):

The point-biserial correlation (or the point-polyserial correlation) is the Pearson correlation between the observations on an item (or by a person) and the person raw scores or measures (or item marginal scores or measures). These are crucial for evaluating whether the coding scheme and person responses accord with the requirement that "higher observations correspond to more of the latent variable" (and vice-versa). These are reported in Table 14 for items and Table 18 for persons. They correlate an item's (or person's) responses with the measures of the encountered persons (or items). In Rasch analysis, the point-biserial correlation, rpbis, is a useful diagnostic indicator of data mis-coding or item mis-keying: negative or zero values indicate items or persons with response strings that contradict the variable. Positive values are less informative than INFIT and OUTFIT statistics. (Linacre, 2018)

Data Sets (Go to http://extras.springer.com)

cf test2
cf se-all.cf
cf se forget something
cf point measure correlation of a test
cf miscode point measure correlation of a test
cf test2 error key data

Activities

Activity #1

Using the control file cf test2, check to see what the setting is for PTBISERIAL.

Answer: The setting for PTBISERIAL for this control file is No.


Activity #2

For the above activity, does this mean that no point measure correlation is computed?

Answer: A point measure correlation is computed. To see the specifics of the calculation, go to the Winsteps Help Tab and look up PTBISERIAL and the setting of "No"; there you will be able to read the details of the calculations.

Activity #3

Using the control file cf test2, conduct a Winsteps analysis of the point measure correlations of the test items. Is there evidence that there might be an error in the answer key? Is there evidence that the items may work together to measure?

Answer: Following a review of the point measure correlations presented in the output table, we do not see any negative values. This provides one piece of evidence suggesting that the answer key may be error free and that the set of items may help measure the trait.

Activity #4

Change the answer key for cf test2 so that incorrect answers are scored as correct for the first three items of the test. This means the answers for Q1, Q2 and Q3 in the answer key are each presented as a 1 (for the selection of an A on a bubble sheet). Then conduct a Winsteps analysis with this incorrect answer key. We provide that new control file as cf test2 error key data.

Answer: An analyst reviewing Table 26 sees that Q1 is listed with a point measure correlation of −.11, and that item Q3 is listed with a value of .00. An analyst can observe that a miscoding will not always produce a negative value. In Table 26 there is no MNSQ listing for Q2. How can that be? By reviewing a different item table (for example, Table 14.1), an analyst is able to observe that no students "correctly" answered item Q2 (with the wrong answer keyed as correct). As a result, the item receives an extreme (maximum) measure, and much error exists in the item calibration. The table does report a point measure correlation of .00 for item Q2. One moral for those evaluating a test is that it is also important to review the answer key for items that no one has answered correctly.

Activity #5

For the analysis of cf test2 error key data, look at the other items of the test. Are any negative point measure correlations observed?

Answer: In fact, a negative point measure correlation of −.04 is observed for Q14. The answer for this item is correct in the answer key. This provides


evidence that the presence of a negative point measure correlation simply shines a light on an item that needs to be investigated further. This is very similar to the examinations we have conducted to understand why items misfit.

Activity #6

Using the Winsteps Help Tab, review the different settings that are possible for the control variable PTBISERIAL.

Answer: A variety of settings can be made. For beginners, we recommend the setting used in this chapter, a setting of No.

Activity #7

Read the entire article by Ben Wright (1992), "Point-Biserial Correlations and Item Fits." What are some added point-biserial correlation tips that can be added to a tip list?

References

Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Prospect Heights, IL: Waveland Press, Inc.
Linacre, J. M. (2018). Winsteps® Rasch measurement computer program user's guide. Beaverton, OR: Winsteps.com.
Wright, B. D. (1992). Point-biserial correlations and item fits. Rasch Measurement Transactions, 5(4), 174.

Additional Readings

Kelley, T., Ebel, R., & Linacre, J. M. (2002). Item discrimination indices. Rasch Measurement Transactions, 16(3), 883–884.
Linacre, J. M. (2008). The expected value of a point-biserial (or similar) correlation. Rasch Measurement Transactions, 22(1), 1154.
Stenner, A. J. (1995). Point-biserial fit indices. Rasch Measurement Transactions, 9(1), 416.

Chapter 4

Test Information Function (TIF)

Charlie and Chizuko: Two Colleagues Conversing

Charlie: Chizuko, I have yet another question for you.....

Chizuko: I'm the answer machine....

Charlie: Funny... anyway… there is something that I occasionally see in Rasch articles. It is called the Test Information Function (TIF). What's it all about?

Chizuko: TIF… yeah, I have seen that discussed in some Rasch articles.... I think sometimes people present TIF in an article or in a talk to say something about a test. It basically presents a curve that a researcher can sometimes use to make some assertions about how a test is functioning.

Tie Ins to RAHS Book 1

Many techniques were presented in our first book to help researchers interpret how well a test functions. Of course, many issues must be considered. For example, a researcher can report person reliability for a test, but that should always be only one part of the story. In this chapter, we present a technique that can be used to learn how a test functions. As readers will see, we talk through the technique, but in many ways we think using Wright Maps is more powerful than what can be learned with TIF. However, TIF is sometimes used in articles, and therefore we feel it is important to explain what TIF is all about. Just as fit and Wright Maps are usually discussed in a Rasch article, it may be that a researcher should also address TIF in his or her article.


Introduction

We present a range of issues in this book as well as in our first book. We provide several indices and plots to help researchers better learn Rasch techniques. In many of our examples, we concentrate on presenting several plots that help researchers understand the ins and outs of a technique. We do the same with regard to computing the TIF of a test using Winsteps. Readers will see that we emphasize interpreting the shape of the plot and experimenting with different analyses of a single test in order to better understand how we might use TIF in a Rasch analysis of a data set. Just as we recommend researchers consider the use of PCAR (Chap. 2) and point measure correlation (Chap. 3) for analyses they carry out, we feel that TIF should also be explored in an analysis.

Computing a TIF Graph

To compute a TIF graph, a researcher can follow several simple steps using Winsteps.

Step 1. Run your Rasch analysis of the test data. For our experiments in this chapter we use a multiple-choice test data set. The control file, which is included as one of the data sets for this chapter, is named cf test.

Step 2. Upon completion of your Rasch analysis, click on the button named Graphs, which is located in the gray bar at the top of the Winsteps screen. The Graphs button is located between the Excel/RSSST button and the Data Setup button (see Figs. 4.1a and 4.1b).

Step 3. When you click on the Graphs button, a drop-down menu will appear that allows you to select "Test Information Function." When you select Test Information Function, the TIF plot for your test will appear. Fig. 4.2 presents the TIF plot for the control file cf test. Also displayed are the different graphing options for the TIF plot.

Fig. 4.1a  The part of the Winsteps display where a researcher can begin the process that allows for the computation of the TIF graph


Fig. 4.1b  By clicking on the Graphs button one is presented with an option of exploring the Test Information Function. Part of the display from Winsteps

Fig. 4.2  The Test Information Function from the analysis of the data set contained in the control file cf test. Winsteps output


What to Do with the TIF plot? What Is Useful for the Researcher?

Okay, you have a TIF plot. Now, what do you do with the plot? How do you interpret the data? What is useful for the researcher? First, it is important, as for all graphs, to take note of the scale along the horizontal axis as well as the scale along the vertical axis. The horizontal axis can be interpreted as displaying the measurement range of the test. In this test, the measurement range varies from −6 to 6 logits. The vertical scale is explained in the Winsteps Help Tab:

Select by clicking on "Test Information" or from the "Graphs" menu. This shows the Fisher information for the test (set of items) on each point along the latent variable. The Test Information Function reports the "statistical information" in the data corresponding to each score or measure on the complete test. Since the standard error of the Rasch person measure (in logits) is 1/square-root (information), the units of test information would be in inverse-square-logits. See also TCCFILE= and Table 20. (Linacre, 2018)
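For readers who want to see where the curve's numbers come from, here is a minimal hand computation of a TIF. For dichotomous Rasch items, the Fisher information an item contributes at ability theta is p(1 − p), where p is the Rasch probability of success, and the TIF is the sum across items. The item difficulties below are invented for illustration; in practice they would come from your Winsteps item measure table.

import numpy as np

# Hypothetical item difficulties in logits (a real analysis would take
# these from the Winsteps item measure table).
difficulties = np.array([-2.0, -1.3, -0.6, 0.0, 0.3, 0.9, 1.6])

def test_information(theta, b):
    # Dichotomous Rasch model: p = 1 / (1 + exp(-(theta - b))),
    # item information = p * (1 - p), TIF = sum over items.
    p = 1.0 / (1.0 + np.exp(-np.subtract.outer(theta, b)))
    return (p * (1.0 - p)).sum(axis=1)

theta = np.linspace(-6.0, 6.0, 241)
info = test_information(theta, difficulties)
print(f"peak information {info.max():.2f} at {theta[np.argmax(info)]:+.2f} logits")

# As the Help Tab quote notes, the person-measure standard error at any
# theta is 1 / sqrt(information).
se = 1.0 / np.sqrt(info)
print(f"smallest standard error {se.min():.2f} logits")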

Now what to do? How do we interpret this graph? How might we use the graph? Again, the Winsteps Help Tab provides some important guidance: In practice, the values of the information function are largely ignored, and only its shape is considered. We usually want the test information function to peak: a) where the most important cut-point is (criterion-referenced tests) b) where the mode of the sample is (norm-referenced tests) The width of the information function is the effective measurement range of the test. (Linacre, 2018)

In the case of this test, if we had decided upon a specific cut point (for example, a pass/fail cut point at the 1 logit location), that is where a researcher would like to see the peak of the curve. As stated in the guidance above from the Winsteps Help Tab, for a norm-referenced test (thus not a criterion-referenced test!) the peak of the curve should be located where the measures of most respondents are located. Thus, in our example, we would hope that the mode of test respondents' measures would be at, or near, the 0 logit measure.

When the authors of this book were first learning how best to use the TIF, one of the steps we took to expand our understanding was to create a Wright Map for the same analysis that we presented immediately above in Fig. 4.2. One of our goals was to relate the TIF plot to the Wright Map. Another goal was to compare what a researcher can learn from the TIF plot with what can be learned from the Wright Map. Fig. 4.3 presents the Wright Map for the test.

What can be learned from the Wright Map that also seems to be learned, to some degree, from the TIF plot? First, from a Wright Map perspective, the difficulty of the test items ranges from roughly 2 logits to −2 logits. We believe that the much larger range of −6 to 6 for the TIF plot simply extends the scale above 2 and below −2. A researcher could do so with the Wright Map, but the functioning of the test seems to be best in the range of the items that are presented on the test. Therefore,


Fig. 4.3  The Wright Map from the analysis of the control file cf test. Winsteps output

we feel that the Wright Map is probably more useful in terms of estimating the measurement range of the test. We now proceed to consider where a researcher would like the TIF curve to peak, and how the location of the peak relates to the goal of the test. In a broad sense, in the case of a test in which we might want to compare respondents, we would want to see if the average person measure is near the average item measure. If the average item measure is substantially higher than the average person measure, a researcher could say that the item targeting for this sample of respondents was not as good as we might want


to observe. It seems to us that an evaluation of test item targeting on the Wright Map is similar to the idea that the peak of the TIF curve should be located at, or near, the mode of respondents. Our view is that although these two ideas seem to overlap, in reality a researcher can learn much more from the Wright Map, even with the single goal of evaluating how well the test works for measuring (and comparing) respondents. With the Wright Map, as we know from the two Wright Map chapters of RAHS Book 1, we can look at the distribution of items in the test, and we can identify which items are helping us differentiate respondents as a function of person ability. Thus, it seems to us, with the Wright Map we can get a lot of information that we can use to help us investigate a testing instrument. Now let's proceed to the use of the TIF for assessing how good a test might be in terms of building a criterion-referenced test. With the TIF guidance from the Winsteps Help Tab, we noted that the peak of the curve should occur at the cut point for a criterion-referenced test. This should make sense: if we want to make, for example, a pass or fail decision, we as test developers would want to have the maximum test information at the cut point. In terms of our Wright Map, what information do we get? The first piece of information is that wherever we place the cut point, we can see which items are near the cut point. This allows us to see which items are helping us differentiate respondents whose measures are near the cut point. Also, of course, we can identify those items that are not helping very much in the measurement of respondents near the cut point. These items (which do not help us from a measurement perspective) should be those items far away from the cut point.

Further Experiments

To further explore what is, and what is not, provided by TIF, we present readers with a number of experiments that we conducted. To begin, we added a control line to our control file, following guidance from Mike Linacre for creating TIF plots (from the Winsteps Help Tab: "To increase the maximum possible Measure range, set EXTREMESCORE=.01 or other small value"). Fig. 4.4 presents the plot when the command EXTREMESCORE=.01 is added to the control file. The result of adding this line is added detail at the extreme ends of the scale. An edited version of cf test is provided as cf test w Extremescore equal .01. We feel that, as is the case with many commands, it is possible to increase the detail of the plot, but relatively little is gained by adding this detail to this plot. The height of the information value remains at 5.5, the same value as seen in Fig. 4.2.

The next experiment we conducted to better understand TIF (and also to better understand the pros and cons of TIF in comparison to a Wright Map) involved the intentional removal of test items from the test. More specifically, we removed all difficult and middle-of-the-road difficulty items. Thus, we were curious: what would be the form and shape of the


Fig. 4.4  The TIF plot resulting from use of the control line EXTREMESCORE=.01. Winsteps output

TIF graph if we took the identical data set but used only the easy items in the test? Also, we wondered what the form of the Wright Map would be for this very same analysis. The control file that we created with only easy test items is named cf test only easy items w Extremescore equal .01. This file excludes all items with a difficulty greater than −.5 logits. Figure 4.5 presents the TIF for this test, which has only easy items and thus poor targeting of item difficulty.

When a researcher compares Figs. 4.4 and 4.5, he or she can observe that the curves have similar shapes, but a key difference is that in Fig. 4.5 the curve peaks at an "Information" value of about 2.25, in comparison to the value of 5.5 in the other plots. A researcher would of course predict that if the information function provides some sort of summary of how well a test is measuring, then the information value for a test composed of only easy items should be lower than that observed when there is a good mix of items in terms of item difficulty. As we did for the previous experiment, we also created a Wright Map for this same data set. Figure 4.6 presents this Wright Map for the control file that had only easy items.
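The hand computation shown earlier in this chapter can mirror this experiment. The item difficulties below are invented, so the exact numbers will not match Figs. 4.4 and 4.5, but the pattern does: dropping the harder items lowers the peak and shifts it toward the easy end of the scale.

import numpy as np

def test_information(theta, b):
    # Dichotomous Rasch TIF: sum over items of p * (1 - p).
    p = 1.0 / (1.0 + np.exp(-np.subtract.outer(theta, b)))
    return (p * (1.0 - p)).sum(axis=1)

theta = np.linspace(-6.0, 6.0, 241)
all_items = np.linspace(-2.0, 2.0, 25)      # a 25-item test, well spread
easy_only = all_items[all_items < -0.5]     # mimic dropping the harder items

for label, b in [("all items", all_items), ("easy items only", easy_only)]:
    info = test_information(theta, b)
    print(f"{label:15s} peak {info.max():.2f} at {theta[np.argmax(info)]:+.2f} logits")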


Fig. 4.5  The Test Information Function for a test in which only easy items were presented to respondents. This is a TIF for the same set of respondents as in Figs. 4.2 and 4.4. Winsteps output

Certainly, in a publication a researcher could present the TIF plot, but again a researcher can see so much more in Fig. 4.6. For example, a researcher can immediately see that the items are too easy for a majority of respondents, but a researcher can also see the ordering of those items from most easy to easy. Also, a researcher can observe where items are with respect to person measures, to identify which respondents are the largest number of logits away from items and which respondents are the smallest number of logits away from items.

Formative Assessment Checkpoint #1

Question: What are the key aspects of the test information plot that a researcher should review and possibly report in a paper or article?

Answer: The shape of the curve (is it bell shaped?); the value of the peak of the curve, in units of the vertical information axis; and the range of the horizontal axis in logits.

Fig. 4.6  The Wright Map resulting from the analysis of the data set that has only easy items. Winsteps output


Final Experiment

Below we provide a final experiment that we conducted to decide when, how, and why a researcher might include a Test Information Function in a Rasch paper or article. For this experiment, we conducted a Rasch analysis using only the three most difficult items and the three easiest items of the data set. The name of the control file for this final experiment is cf test 3 diff items 3 easy items w Extremescore equal .01. Below we again provide the test information curve as well as the Wright Map for the same analysis.

We begin by looking at the shape of the Test Information Function curve for this analysis (Fig. 4.7). As readers will be able to see, the curve is decidedly not bell shaped. The curve does not have a single peak; rather, there appear to be two rough peaks. The horizontal range along the X scale is about what we have seen in the earlier plots. However, the maximum values for the vertical axis are much lower than in the two

Fig. 4.7  The test information curve for an analysis in which three very easy items and three very difficult items are presented to respondents. The information value is about 1.0. The curve is not bell shaped. Winsteps output


preceding experiments that we conducted. In our first analysis, using all 25 items of the test, our maximum information value was about 5.5. In our second experiment, we included only the easy items of the test (nine easy items in total); the maximum information value was about 2.5. In this experiment, we utilized three easy items and three difficult items, and the maximum test information value fell to around 1.0.

In our two earlier experiments, we presented readers with the Wright Map for each analysis as well as the TIF. We do the same for this last experiment. The Wright Map below (Fig. 4.8) provides the location of the test items, the locations of respondents, and the now familiar indices such as the mean of the person measures and the mean of the item measures. As was the case for our discussion of the other

Fig. 4.8  The Wright Map for the analysis of the data set when only three easy items and only three difficult items are presented to respondents. The Wright Map provides the location of each item along the logit scale. The map also provides the location of each respondent along the logit scale. Winsteps output


test information curves and Wright Maps, we believe what a researcher is able to see and understand about test functioning goes far beyond what he or she can see with the Test Information Function. Certainly, the two odd peaks in the test information curve give a hint that something is amiss. However, the Wright Map is far more revealing. With the Wright Map a researcher can immediately see the two clusters of items, identify which items are in each cluster, and identify the respondents who most likely have and have not correctly answered each of the test items. Generally, we feel that these experiments suggest that evaluating a Wright Map, along with providing some of the Rasch indices concerning reliability, provides a more useful story with regard to instrument functioning.

Formative Assessment Checkpoint #2

Question: What are some pros and cons of using the TIF and a Wright Map to summarize some measurement properties of a test?

Answer: The TIF curve can be quickly evaluated as to whether or not it is bell shaped. The location of the peak of the curve can be used to summarize where the maximum measurement of the test may take place. The Wright Map provides greater detail as to the measurement functioning of the test, in that a researcher can review the distribution of item difficulty and review the location of test items with respect to the ability level of test takers.

To finish this chapter on the TIF, we present some of the text that researchers have authored in articles presenting the results of computing a test information function with Winsteps. Readers will see that those articles generally present the curve, comment on the range of the curve along the horizontal scale, and comment on the shape of the curve and the location of its peak. Often, but not always, authors will include information regarding standard errors of measurement in their test information curve plots.

In 2015, Wang et al. considered the test information curve in their article entitled The Development and Psychometric Properties of the Patient Self-report Neck Functional Status Questionnaire (NFSQ). In this article the authors wrote: "We assessed test precision using the Test Information Function (TIF) and standard error. We plotted the TIF using data generated from the NFSQ items, which indicate the level of information across the range of the construct's continuum" (p. 685). They also added, "as shown by the TIF curve in Figure 1, the TIF was bell shaped…" (p. 688).

In 2007, Lund, Fisher, Lexell and Bernspång reported on the Test Information Function in their article entitled Impact on Participation and Autonomy Questionnaire: Internal Scale Validity of the Swedish Version for Use in People with Spinal Cord Injury. In that article, the authors wrote: "The Test Information Function (Fig. 4) supports lack of sensitivity of the scale, likely due to the limited range and number of gaps (especially at mid-range of the scale) between items (Fig. 3)."


(p. 159). Later in the article the authors wrote: "In accordance with this, the test information function and SEs for persons indicated insufficient sensitivity" (p. 161). The authors then utilize a Wright Map to interpret the shortfall of the item distribution. The TIF plot of these authors looks much like Fig. 4.7 of this chapter.

A final example of how authors summarize the results of an analysis of the test information curve is provided by Lu et al. in their 2013 article Measurement Precision of the Disability for Back Pain Scale by Applying Rasch Analysis. In this article the authors use added phrases and observations that were not present in the two previous examples:

The Test Information Curve and SE according to person ability are shown in Figure 3. The shape of the test information curve was bell-shaped with its maximum at the middle of the person measure scale. The maximum information point was −0.34 logit, so a person in the disability level (−0.34) of the measure would provide the maximum information for disability due to back pain. The value of the maximum test information function of our study was more than 12. (Lu et al., 2013, p. 6)

Some Final Thoughts on the Test Information Function Curve

The TIF curve can be generated by Winsteps and is sometimes reported in Rasch articles. For beginners, it is easy to look at a curve and evaluate whether its shape is that of a bell. It is also easy to report the maximum value of the information axis of the plot generated by Winsteps. The quotes from the three articles provide some examples of how such data have been reported in Rasch articles.

As we considered the test information curve, we honestly discussed what we felt the curve did and did not provide. On one hand, it provides something that can be reported in a paper; on the other hand, if a researcher has space limitations for an article, we believe that what can be learned from a Wright Map is much more powerful than what can be reported from a review of a test information curve. As readers will know quite well by now, a Wright Map allows a researcher to observe the location of items and persons. With the test information curve, that is not possible. In the first experiment of this chapter, we generated a curve (Fig. 4.2), and certainly we could report on the shape of the curve as well as the maximum value of the information function. However, when a researcher reviews Fig. 4.3, he or she can see so much more: for example, a researcher can identify items that are redundant and identify parts of the trait in need of added items.

Charlie and Chizuko: Two Colleagues Conversing

Charlie: Okay Chizuko, I'm all ready for quiz time on test information curves.

Chizuko: Well, I guess we will see what you know! Here goes.... what do you think you get from a test information curve? What's the big deal?

Charlie: Well, when I think about generating a test information curve, it seems to me that I am attempting to generally summarize the measurement functioning of a test. I can look at the range of logits on the horizontal axis of the plot. That


gives me an idea of the measurement range of the test. Also, I can look at the value of the information function (the vertical axis of the Winsteps plot). And I can look at the shape of the curve; I would like to see a bell shape. The location of the peak of the curve would be the location of the maximum measurement potential of the test.

Chizuko: Anything else you can think of, Charlie?

Charlie: Well, I guess an added thought. I can include such information in a paper I am writing. I don't think reporting on the test information curve will hurt the paper, but I am not sure that I have gained a lot. I think the shape of the curve might be a good way to show readers that a test is doing at least a reasonable job of measuring. Most people are familiar with bell shaped curves in statistics, and they may know that a bell shaped curve in many situations can be a good thing! But I think that the Wright Maps that I often create are more informative and useful. With a Wright Map, I can quickly identify which items are redundant and which items are well targeted to respondents. Also, if I wish to cite some indices that summarize the functioning of a test, I can cite indices such as person reliability, item reliability and strata.

Chizuko: Grade of A, but keep your nose to the grindstone!

Keywords and Phrases

Test Information Function (TIF)

Potential Article Text

As a component of the STX project, the TIF of the instrument was evaluated. The primary goal of this specific TIF analysis was to evaluate the shape of the TIF graph. Figure 4.5 provides the TIF graph from the STX data set. A bell-shaped curve was observed. The range of the TIF was from approximately −3 to 3, and the peak of the curve was at 0. The researchers wished to set a cut point at 0; thus the alignment of the curve peak and the cut point suggests a measurement instrument that will allow one to discern between those respondents who passed the test and those who did not.

Quick Tips

From the Winsteps Manual:

In practice, the values of the information function are largely ignored, and only its shape is considered. We usually want the test information function to peak:


1. where the most important cut-point is (criterion-referenced tests)
2. where the mode of the sample is (norm-referenced tests)

The width of the information function is the effective measurement range of the test. (Linacre, 2018)

The authors of this book feel that generally the Wright Map is more useful in attempting to communicate the strengths and weaknesses of a measurement instrument. Additionally, the summary statistics provided by Winsteps can help express the functioning of an instrument.

Data Sets (Go to http://extras.springer.com)

cf test
cf test w Extremescore equal .01
cf test only easy items w Extremescore equal .01
cf test 3 diff items 3 easy items w Extremescore equal .01
cf test activity 1 TIF chp

Activities

Activity #1

We provide a control file for a multiple-choice test. The name of the control file is cf test activity 1 TIF chp. Please conduct a Winsteps analysis of the TIF for this control file, and author text that might be appropriate for an article. The figure described below is Fig. 4.2 of this chapter.

Answer: In an article an analyst might write: Below we provide a plot of the test information function of the data set as computed with Winsteps. The information function peaks at about 5.25, the curve is bell shaped, and the range of the test appears to extend from −6 logits to 6 logits.

Activity #2

What can be observed in the data set of Activity #1 when reviewing the Wright Map from a Winsteps analysis?

Answer: The Wright Map from the analysis shows that the range of respondents extends from a high of about 2.25 logits to a low of about −1.5 logits. The maximum item difficulty is about 1.75 logits, and the lowest item difficulty is around −1.75 logits. There are about 13 respondents who have measures higher than the most difficult test item. Generally, there could be a greater number of difficult items presented on the test.


References

Linacre, J. M. (2018). Winsteps® Rasch measurement computer program user's guide. Beaverton, OR: Winsteps.com.
Lu, Y. M., Wu, Y. Y., Hsieh, C. L., Lin, C. L., Hwang, S. L., Cheng, K. I., et al. (2013). Measurement precision of the disability for back pain scale by applying Rasch analysis. Health and Quality of Life Outcomes, 11(1), 119.


Lund, M. L., Fisher, A. G., Lexell, J., & Bernspång, B. (2007). Impact on participation and autonomy questionnaire: Internal scale validity of the Swedish version for use in people with spinal cord injury. Journal of Rehabilitation Medicine, 39(2), 156–162.
Wang, Y. C., Cook, K. F., Deutscher, D., Werneke, M. W., Hayes, D., & Mioduski, J. E. (2015). The development and psychometric properties of the patient self-report Neck Functional Status Questionnaire (NFSQ). Journal of Orthopedic and Sports Physical Therapy, 45(9), 683–692.

Additional Readings

Salzberger, T. (2003). When gaps can be bridged. Rasch Measurement Transactions, 17(1), 910–911.

Chapter 5

Disattenuated Correlation

Charlie and Chizuko: Two Colleagues Conversing

Charlie: Chizuko, I have heard that some researchers try to compare how well one set of items defines a trait compared to another set of items. For example, a researcher might have a 20-item test, and s/he wants to see which set of items best fits the Rasch model. I'm a little confused....

Chizuko: Well Charlie, this is my thought... we should always start with thinking. We should examine whether or not a set of items measures a single trait, and whether we could argue from theory that the items mark a trait. Once I have my theory and have made my predictions, I usually start my analysis by examining item fit. That is how I start; however, there are two steps that I often take to investigate other aspects of dimensionality. One technique is to use the PCAR steps that are outlined in this book. Another technique that I use to evaluate dimensionality is to look at a Disattenuated Correlation.

Tie Ins to RAHS Book 1

In our last book we presented a detailed discussion concerning fit. By looking at fit, researchers can, in part, begin to investigate whether the items of a test or survey appear to define one single useful variable. In this chapter we discuss the Disattenuated Correlation, which provides some added analysis steps that can help researchers investigate potential multidimensionality. This technique can be used with PCAR (Chap. 2 of this book) and point measure correlation (Chap. 3 of this book) to investigate dimensionality.


Introduction

To investigate dimensionality, we can begin by taking a set of items and using theory to separate the items into two different item sets that may define two different traits. For this example, we use all 23 items (13 Self-Efficacy items, 10 Outcome Expectancy items) of the STEBI. Because we have a theoretical base, we will separate the items into the Self-Efficacy (SE) items and the Outcome Expectancy (OE) items. Now, to investigate a part of the dimensionality riddle, we will perform two Rasch analyses with Winsteps. One analysis includes only the SE items, and the other analysis includes only the OE items. Following each analysis, researchers should save the person measures in the Winsteps format that allows a cross plot, by using Plots from the Winsteps menu and then selecting Compare statistics: Scatterplot.

We then suggest the following: Run each analysis; then, using Plots and Compare statistics: Scatterplot, make a plot of the person measures for each analysis. Plot the same person measures against each other. Such a procedure will produce a straight line. Save both Excel sheets. Next, pull up both Excel sheets (the one with the SE person measures and the one with the OE person measures). Remove the person measures and person errors in columns C and D of the SE person measure spreadsheet, and replace those two columns by inserting columns C and D of the OE analysis (the OE person measures and the OE person measure errors) into the SE person measure spreadsheet. Now save this sheet, but use a new name, perhaps Cross plot SE P measures and OE P measures.

The next step is to look at a calculation that Mike Linacre provides for researchers in the scatterplot worksheet, at the base of all of the person measure data. The key line to look at is the Disattenuated Correlation (see Fig. 5.1). In an email to us, Mike Linacre indicated the following: "If the Disattenuated Correlation is close to 1.0, then there is no statistical evidence that the two sets of items are measuring different things for this sample" (May 14, 2012). Thus, for this sample analysis, researchers might predict that, since the OE items should mark a different trait than the SE items, the Disattenuated Correlation should be far from 1.0. In our example the Disattenuated Correlation is −0.08632, which provides evidence that the OE items measure a different trait than the SE items.

Formative Assessment Checkpoint #1

Question: Why is use of the Disattenuated Correlation important in a Rasch analysis?

Answer: When a researcher is conducting a Rasch analysis, it is important to investigate the number of variables that might be present. Hopefully the instrument being used was designed with the goal of measuring one trait, one variable. Fit provides one way to investigate dimensionality. An added technique is investigation of the Disattenuated Correlation.


Fig. 5.1  The Excel sheet resulting from cross plotting the SE measures and OE person measures against each other using the Compare statistics: Scatterplot option of Winsteps

We wish to better understand some possible rules of thumb that may be considered in evaluating the level of a Disattenuated Correlation, and below we offer some guidance provided to the authors in two emails from Mike Linacre:

Roughly speaking, we look at the Disattenuated Correlations (otherwise measurement error clouds everything):
Correlations below 0.57 indicate that person measures on the two item clusters have half as much variance in common as they have unique. (Cut-off for probably different latent variables?)
Correlations above 0.71 indicate that person measures on the two item clusters have more than half of their variance in common, so they are more dependent than independent.
Correlations above 0.82 are twice as dependent as independent. (Cut-off for probably the same latent variable?)
Correlations above 0.87 are three times as dependent as independent.
If the Disattenuated Correlation is close to 1.0, then there is no statistical evidence that the two sets of items are measuring different things for this sample. (personal communication, July 18, 2013)
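The variance arithmetic behind these rules of thumb is easy to verify: the squared correlation is the proportion of variance the two sets of person measures share. A quick check of the quoted thresholds:

# Shared variance is r squared; "dependent vs. independent" in Linacre's
# guidance is the ratio of shared to unshared variance.
for r in (0.57, 0.71, 0.82, 0.87):
    shared = r * r
    ratio = shared / (1.0 - shared)
    print(f"r = {r}: shared variance {shared:.2f}, shared/unique ratio {ratio:.1f}")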

In the second email to William Boone on September 27, 2014, Linacre suggests that a Disattenuated Correlation of 0.7 is a key one. He said: Most discussions of correlation assume that the data are point-estimates (no measurement error or measurement error too small to matter), so they are really talking about correlations without error = Disattenuated Correlations.


Here is the meaning of correlations according to http://fxtrade.oanda.com/analysis/currency-correlation and many other websites:

0 to 0.2 – Negligible, very weak correlation
0.2 to 0.4 – Weak, low correlation (not very significant [in the common usage of "significant", not the statistical usage])
0.4 to 0.7 – Moderate correlation
0.7 to 0.9 – Strong, high correlation
0.9 to 1.0 – Very strong correlation

For me, 0.7 is a critical value, because above 0.7 more than 0.71∗0.71 = more than 50% of the variance in each variable is shared in common with the other variable. This seems also to be a critical value in Factor Analysis. (Linacre, personal communication, September 27, 2014)

Given the suggestions of Mike Linacre, we suggest that a 0.7 correlation could be used as a cutoff to identify a correlation as significant or non-significant. Finally, to aid readers, we also provide some details and a sample calculation for the Disattenuated Correlation. The calculation is made easier by the following formula provided by Mike Linacre: "If R1 and R2 are the reliabilities of the two subtests, and C is the correlation of their ability estimates reported by Excel, then their latent (error-disattenuated) correlation approximates [C/sqrt (R1∗R2)]" (personal communication, September 27, 2014). Mike Linacre has indicated that Spearman's 1904 publication can be referenced for this formula.

Now let's complete the computation using the following information from the analysis conducted to create the spreadsheet presented in Fig. 5.1, along with details provided in the Excel sheet. The SE data analysis has a person reliability of .87. The OE data analysis has a person reliability of .76. The correlation of the SE person measures and the OE person measures from the scatterplot spreadsheet is −.07187. Using the equation above to compute the Disattenuated Correlation:

−.07187 / sqrt ( R1 ∗ R2 )



Using the person reliability values:



−.07187 / sqrt(.87 ∗ .76)
−.07187 / sqrt(.6612)
−.07187 / .813 = −.088

The value of −.088 is the Disattenuated Correlation. This is the same value as the one reported in the spreadsheet, after accounting for rounding.
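The same check takes only a few lines of code. The values below are the ones reported above for the SE/OE example:

import math

C = -0.07187          # correlation of the SE and OE person measures
R1, R2 = 0.87, 0.76   # person reliabilities of the two analyses

disattenuated = C / math.sqrt(R1 * R2)
print(round(disattenuated, 3))  # -0.088, matching the hand calculation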


Formative Assessment Checkpoint #2

Question: What is the rule of thumb (critical value) for the Disattenuated Correlation?

Answer: The rule of thumb, as suggested by Mike Linacre (see above text), is .70.

Charlie and Chizuko: Two Colleagues Conversing

Charlie: Okay Chizuko, I think I have a handle on this business of a Disattenuated Correlation....

Chizuko: Well then show off and start talking!

Charlie: I feel really confident in thinking about fit. For example, using the MNSQ Outfit value of a test item informs my thoughts about whether or not an item may be part of a trait.

Chizuko: Why are you talking about fit here?

Charlie: I think to understand the Disattenuated Correlation (and PCAR), it is important to grasp the basics first. When I do a Rasch analysis, I start with theory, and I can make an argument about why I think I have one variable. After doing an analysis, I might look at a Disattenuated Correlation if I think there may be two different sets of items in my instrument that measure two different variables. However, I would do this only if I think I might have two variables.

Chizuko: Ok, keep rolling...

Charlie: Well… I might review the Disattenuated Correlation and look to see whether it is above .7 or not. If it is above .7, I have some evidence that I have one variable, because it did not make much difference in the comparison of the sample of respondents whether I used one scale or the other scale.

Chizuko: Perfect!

Keywords and Phrases
Disattenuated Correlation
Multidimensionality
Dimensionality

Potential Article Text
To explore the question of whether or not a set of items is multidimensional, one strategy is to review the Disattenuated Correlation between the person measures computed with one set of items (trait 1) and another set of items (trait 2). In this research, an analysis was conducted using 15 items from the XYZ scale and 10 items from the ABC scale. Both scales have been proposed by the instruments’ developers to measure the trait of self-efficacy. Rasch measures were computed for scales XYZ and ABC (thus two measures were computed per respondent). A Disattenuated Correlation was computed and found to have a value of .8, which falls above the accepted borderline value of 0.7. This result suggests that there is evidence, from a Disattenuated Correlation perspective, that the two scales measure the same trait.

Quick Tips
A Disattenuated Correlation is available through Winsteps when comparing the person measures computed with one set of items against a separate set of person measures computed for the same respondents with a different set of items.
The Disattenuated Correlation can begin to help researchers evaluate the assertion that one set of items defines a different trait than another set of items.
Disattenuated Correlations of .7 and higher suggest a strong correlation.
The Disattenuated Correlation can be computed with knowledge of the correlation coefficient and the person reliability of each analysis.
Make sure to explore dimensionality with PCAR.
Make sure to use a theory to evaluate dimensionality, and do not base a conclusion about dimensionality on a single analysis or a single magic number.
As provided by Mike Linacre: “If R1 and R2 are the reliabilities of the two subtests, and C is the correlation of their ability estimates reported by Excel, then their latent (error-disattenuated) correlation approximates [C/sqrt (R1∗R2)]” (personal communication, September 27, 2014).

Data Sets (Go to http://extras.springer.com)
cf test

Activities
Activity #1
A control file cf test is provided for a subject test with 25 items. The test was authored to measure one trait. Compute the Disattenuated Correlation for a Test A defined with items 1–13 and a Test B defined with items 14–25.



Activity #2
Conduct a by-hand calculation of the Disattenuated Correlation for the data set of Activity #1.

Activity #3
Utilizing the guidance provided in this chapter, evaluate the strength of the evidence (the Disattenuated Correlation) that one or more dimensions might be present.

Activity #4
After reading the chapter concerning Principal Component Analysis of Residuals (PCAR), conduct a PCAR for the two tests of the above activities. Is there evidence based on the PCAR of potential added dimensions, or is there evidence supporting the assertion of one dimension being measured by the set of 25 items?

References
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.

Additional Readings
DeMars, C., & Linacre, J. M. (2004). Mapping multidimensionality. Rasch Measurement Transactions, 18(3), 9990–9991.
Schumacker, R. E., & Muchinsky, P. M. (1996). Disattenuating correlation coefficients. Rasch Measurement Transactions, 10(1), 479.

Chapter 6

Understanding and Utilizing Item Characteristic Curves (ICC) to Further Evaluate the Functioning of a Scale

Charlie and Chizuko: Two Colleagues Conversing
Charlie: There is something in Rasch analysis that I have heard bantered about from time to time. It is the phrase Item Characteristic Curve; also, sometimes people say ICC. However, I am not sure how I can use the curve. Can you help me?
Chizuko: Well Charlie, it is very interesting that you mention this, because I was just reading an article where the authors present an ICC and use some of that information to make an argument supporting their instrument. Let me show you a little bit about how to use an ICC, and how you might consider incorporating the curve into your analysis. How does that sound?
Charlie: Sounds super-duper to me!

Tie Ins to RAHS Book 1
In RAHS Book 1, we outlined several techniques that can be used to evaluate the functioning of a scale and to spot potential problems in the way our items define a variable. Examples include how items are distributed in terms of difficulty, and the shape and relationship of the Category Probability Curves in a rating scale analysis. ICCs can also be used, for instance, to identify items that may not work for a scale. For example, the steeper part of an ICC is a region of more discrimination. In RAHS Book 1 we helped readers learn how to interpret common plots used in Rasch analysis; in this chapter we continue that work by helping readers learn how to read and interpret ICC plots.





Introduction
We hope readers know and appreciate that when a researcher develops a scale, or uses an already developed scale, many quality control steps must be taken to ensure that the instrument will function in a manner that produces good measurement. One step is to identify items that may not fit the Rasch model (misfitting items may not define the same variable as do the other items of the scale). Other techniques for monitoring instrument functioning involve investigating which parts of the trait are marked by items. In this chapter, we present Item Characteristic Curves (ICCs) and introduce readers to these curves to allow additional evaluation of a survey or test. Our goal is to build the strongest possible measurement instrument, and ICCs can help us. To begin our ICC work, we will review some past Rasch concepts. We introduce two key ideas that will help readers grasp the details of ICCs. First, readers should reflect on the ogive. Second, readers should embrace the think-before-one-leaps concept, which we have emphasized extensively in our first book, as well as in this book. To begin, we consider the ogive and what it means. As readers will recall, when a Rasch analysis is conducted, it is possible to make a plot of the raw-score-to-measure relationship for a test or a survey. Such a plot is provided in Table 20 of Winsteps (named “Score Table”). In fact, Table 20 provides a plot of the relationship between all possible raw scores and all possible measures for an instrument. Readers should recall that the curve observed in such a plot confirms the nonlinear nature of raw scores and measures. This plot presents a curve, which is the ogive. Below are a Score Table and plot (Fig. 6.1) from an analysis of a control file entitled cf n40 Jordan Activity. Data were supplied by our colleague Saed Sabah. The data consisted of students’ answers to survey items using the following rating scale: 1 (almost never), 2 (seldom), 3 (sometimes), 4 (often), 5 (almost always). We see these numbers on the vertical axis. The instrument items were developed by Campbell, Abd-Hamid, & Chapman (2010). The items in the data set are provided below.

C1 I conduct the procedures for my investigation.
C2 The investigation is NOT conducted by my teacher in front of the class.
C3 I am actively participating in investigations as they are conducted.
C4 I have a role as investigations are conducted.
D1 I determine which data to collect.
D2 I take detailed notes during each investigation along with other data I collect.
D3 I understand why the data I am collecting are important.
D4 I decide when data should be collected in an investigation.

In this analysis, with the coding of 1 to 5 for the rating scale categories, a higher person measure represents someone who more often experiences the types of activities that were surveyed. Thus, if Bob has a high logit measure, Bob is answering with a high raw rating scale number for many of the survey items.



Fig. 6.1  Part of Table 20 following analysis of an eight-item survey (5 rating scale steps) completed by 40 respondents. Winsteps output

A high Rasch person measure in this analysis means use of the higher rating scale categories and identifies a respondent who is reporting that he or she does the surveyed tasks often. A low Rasch person measure means use of the lower rating scale steps and identifies a respondent who reports rarely (less frequently) doing the surveyed activities. In Fig. 6.1, the nonlinear shape of the curve is of particular importance, as the plot allows researchers to estimate the Rasch measure for each respondent utilizing any of the possible raw scores. Given eight survey items that could be answered with a rating scale step coded as 1 (almost never), 2 (seldom), 3 (sometimes), 4 (often), or 5 (almost always), the raw scores displayed in the graph and figure range from a minimum of 8 raw score points to a maximum of 40 raw score points. Furthermore, one can observe that the lower the measure of a respondent, the less frequently the respondent is reporting that something has taken place. For example, if Ted has a raw score of 8, this means that Ted responded 1 (almost never) for each of the eight survey items he answered. Ted has a measure of almost −5 logits. If Greg has a raw score of 40, this means that Greg answered 5 (almost always) for each of the eight survey items. Note that Greg has a measure of 5.67 logits. One more part of our preamble remains: the importance of thinking before leaping. Throughout this book, as well as in RAHS Book 1, we encourage readers to think before leaping and to consider many types of evidence before deciding on their instrument. One example of this is reviewing items with respect to fit. An item might be observed to have an Outfit MNSQ greater than 1.30, but should that item be removed? Perhaps only a few respondents answered in a strange manner, and following removal of those respondents (or at least not using those respondents for the calibration of items along a scale) the item fits. Perhaps an item misfits, but upon review, it becomes clear that the misfit is not the result of a true problem with the item; review reveals that the item was not typed correctly in the survey (perhaps a critical word was misspelled). Another possibility is to monitor the functioning of an item over time to see whether the item’s measurement misbehavior is consistent. Our point is that many types of evidence can be collected, pooled, and reviewed to make an informed decision with regard to many important measurement decisions. In this chapter, we provide another technique that can be used to evaluate some aspects of a measurement tool’s behavior, and we tie this work into our understanding of the ogive from previous chapters.
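Below is a minimal Python sketch, offered under stated assumptions, of why Table 20’s curve is an ogive. The function rsm_expected_score and the numerical item difficulties (deltas) and thresholds (taus) are hypothetical illustrations, not values from the actual analysis; the sketch simply shows that summing the model-expected rating across the eight items yields a raw score that climbs nonlinearly from near 8 to near 40 as the person measure grows. The same single-item expected-score function traces the model ICC discussed later in this chapter.

import math

def rsm_expected_score(theta, delta, taus):
    # Expected rating (coded 1..5) on one survey item under the Andrich
    # rating scale model: category probabilities come from cumulative
    # logits, and the expectation is the probability-weighted rating.
    logits = [0.0]  # cumulative logit for the bottom category
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - tau))
    expos = [math.exp(v) for v in logits]
    total = sum(expos)
    return sum((k + 1) * e / total for k, e in enumerate(expos))

# Hypothetical difficulties for the eight items and shared thresholds;
# the real values come from the Winsteps analysis of cf n40 Jordan Activity.
deltas = [-1.2, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 1.4]
taus = [-2.1, -0.6, 0.4, 2.3]
for theta in (-5, -2, 0, 2, 5):
    raw = sum(rsm_expected_score(theta, d, taus) for d in deltas)
    print(theta, round(raw, 1))  # raw score climbs nonlinearly toward 40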

How to Review and Interpret Model ICCs and Empirical ICCs
To begin our investigation of ICCs, and to use these curves and plots to identify items that may be problematic for an instrument, we first need to know how to generate such curves. Following a Rasch analysis of the file cf n40 Jordan Activity, the first step is to click on the Graphs button of Winsteps. Following this step, one needs to select the option Category Probability Curves. What one should see is the example plot presented below (Fig. 6.2). It is important to see the label “Measure relative to item difficulty” along the x-axis. The next step is to click on the button (lower right) labeled Absolute x-axis. When you click on the button, you will get the plot presented in Fig. 6.3. Now, click on the button labeled Exp+Empirical ICC. It is the plot that results from this click that we utilize to help identify items that may not define our single variable as researchers would like to observe, with regard to our use of Rasch measurement theory.



Fig. 6.2  A category probability curve for a survey item. Winsteps output

To learn to read this plot (Fig. 6.4), we could look at any of the item plots (each of the eight survey items has a plot), but for now we click the button named Next Curve, and then click until we reach the fourth item (Fig. 6.5).

How to Read and Use the Plot
We consider the solid line that has an ogive shape. This line extends from a value of −8 to 5 along the horizontal axis. This is our model ICC line. It is the line that researchers expect for the relationship between a person’s measure and the raw rating the person would provide for that item. For any measure of a respondent, it is possible to compute (using this plot) the raw score we would predict the respondent would select for a specific item. Moreover, the reverse is true as well; for any specific raw score on a specific item, it is possible to note the overall measure of the respondent.



Fig. 6.3  The graph that results when Absolute x-axis is clicked. Winsteps output

Through this plot, Mike Linacre in Winsteps is helping us remember a few things that we should know when we conduct a Rasch analysis. In the case of a test, if we know the difficulty of a test item, and we know whether the person correctly answered the item, we can then work toward computing the person’s measure. Also, when a researcher knows the ability of a respondent and knows the difficulty of a test item, then the researcher can work toward predicting the raw score for that person on that item. This idea is at the core of Rasch measurement. Now, what about the X symbols that are presented and the line that connects the symbols? In the Winsteps Manual, Mike Linacre explains these X values and the line for the Empirical ICC in the following manner:

Line: This shows the empirical (data-descriptive) item characteristic curve. Each black “x” represents observations in an interval on the latent variable. The “x” is positioned at the average rating (y-axis) at the average measure (x-axis) for observations close by. “Close by” is set by the empirical slider beneath the plot. The blue lines are merely to aid the eye discern the trend. The curve can be smoothed with the “smoothing” slider. The Points button controls whether points+lines, points or lines are displayed. Buttons are described in Graph window. The markers on the empirical ICCs are the data-points. The lines between the markers are interpolations. Think of the empirical ICC for an item. This summarizes the scored responses to the item. There is one scored response by each person. Each person has an ability. The abilities are stratified into ability ranges. The size of the range is shown by the slider below the middle of the graph. In each ability range, one marker is plotted. The x-axis is the average ability of the persons in that range. If there is no person, then there is no marker. The y-axis is the average of the scored responses by the persons. The lines between the markers are to help our eyes see the pattern of the markers more clearly. If each ability range is very wide, then there may be only one or two markers. If each ability range is very narrow, there could be one marker for each observation. There is an art to adjusting the width of the ability ranges to produce the most useful empirical ICC. (Linacre, 2018)

Fig. 6.4  An initial plot from using the steps detailed to this point. Winsteps output

Fig. 6.5  The ICC produced by Winsteps for the fourth item of the survey

As we attempt to understand this plot, a first step is to count the number of Xs in the plot (a total of ten). Below is the person measure table (Fig. 6.6), which allows researchers to quickly identify the number of different observed measures. If there are ten Xs in the plot, we should see ten different measures. A total of 16 different measures were observed for students answering all items, but in this plot researchers see only ten X symbols. How can this be? The smaller-than-expected number of Xs emerges from the scaling of the plot and the way in which an X is plotted. If a researcher makes the Empirical Interval smaller for the plot, additional Xs will appear for this item in this data set. If a researcher makes the Empirical Interval larger, it is possible that some X symbols will disappear. Thus, the number of observed X symbols depends upon the scaling of the plot. Below we present the plot (Fig. 6.7) when the smallest Empirical Interval is selected and when a large Empirical Interval is selected. Researchers will be able to see that the number of X values plotted depends upon the scaling selected for the plot. Now a researcher does indeed observe 16 different X marks, one X mark for each measure that was computed through the Rasch analysis. Below (Fig. 6.7, left) a small Empirical Interval (.01) is selected. This is done through the slider, which is located fourth from the right at the base of the figure. Next, a larger Empirical Interval is selected (Fig. 6.7, right); the plot for the same item is made with an Empirical Interval of 1.78. Finally, what is the meaning of the lines (without the X marks) that are located above and below the ogive line? These two lines are 95% confidence interval lines. If an X falls outside either of these lines, then for the item being reviewed there is evidence of a divergence between the expected pattern of responses (for this item) under the Rasch model and what was observed. This situation is important for the work that we are now conducting to investigate the functioning of our measurement scale. As we worked with this plot, Mike Linacre offered a tip to us through an email on February 19, 2016 that the confidence lines should be “reasonably smooth”. Another tip from Mike Linacre in his email is that we should expect to see some Xs located outside the confidence intervals. For our discussion with this small data set, we simply ask readers to ignore the jagged confidence intervals in our example, as we work toward helping readers learn how to read and use these plots!
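As a rough Python sketch of the binning logic Linacre describes (not Winsteps’ exact algorithm), the Xs can be thought of as per-stratum averages; the default interval of .46 mirrors the Empirical Interval mentioned above, and shrinking it produces more plotted points:

import statistics

def empirical_icc_points(measures, ratings, interval=0.46):
    # Stratify (person measure, observed rating) pairs for one item into
    # ability ranges of roughly the given width; each non-empty stratum
    # yields one plotted X at (mean measure, mean observed rating).
    bins = {}
    for m, r in zip(measures, ratings):
        bins.setdefault(round(m / interval), []).append((m, r))
    return sorted(
        (statistics.mean(m for m, _ in pairs),
         statistics.mean(r for _, r in pairs))
        for pairs in bins.values()
    )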

[Winsteps person measure table (see Fig. 6.6): columns ENTRY NUMBER, TOTAL SCORE, TOTAL COUNT, MEASURE, MODEL S.E., INFIT (MNSQ, ZSTD), OUTFIT (MNSQ, ZSTD), PTMEASUR-AL (CORR., EXP.), EXACT MATCH (OBS%, EXP%), PERSON]

Fig. 6.6  The person measure table for the analysis. Sixteen different measure values are presented. Winsteps output

Fig. 6.7  Left: The plot when a small Empirical Interval is selected. Winsteps output. Right: The plot when a large Empirical Interval is selected. Winsteps output



Fig. 6.8  The plot for item 4 with some added details. If an X were present in the region of the circle, this would be quite unexpected; it would mean a respondent answered this item in quite an unexpected way. Winsteps output

We now ask readers to look through all eight pictures that are provided for the analysis of these data. Readers should observe that, overall, the Xs are located where researchers would like to observe them (within the region marked by the two confidence lines). Now comes the important analysis technique for readers. If a researcher observes a number of Xs outside the lines for an item, perhaps the item is not behaving well enough to support the quality of measurement the researcher wishes to conduct. Readers should note that this is similar to what we considered when reviewing item fit. Above (Fig. 6.8) we provide the ICC plot for this item. Readers should notice that we circle one potential region where, if we were to observe one or more Xs, there might be a potential problem with the item. Note that the circle is outside the region marked by the confidence bands. The circle to the lower right (if it had an X in it) represents the presence of a respondent who was predicted to use a high rating scale category (a measure of just below 4 logits on the x-axis, using the model ICC, predicts a selection of a rating scale category coded just below a 5).



If a researcher were to observe an X in the region marked by the lower right circle, the researcher could see that a person with a measure of about 4 logits would be predicted to have answered with a rating scale category coded 4 or 5 (follow the vertical line upward from about 4 logits; when the line hits the ogive, follow the horizontal line to the vertical axis). However, if readers were to observe an X or Xs for this item in the region of the lower circle, this would be evidence that the reaction of respondents to this item did not follow our theory for this item. An X in the region of the circle would mean that the respondent (or respondents) summarized by that X answered with a 1 or 2 (raw score). This can be seen by looking at where the horizontal line extending from the circle intersects the vertical axis. This would be quite unexpected for a respondent of this measure. In this data set, a person with a measure of 4.0 logits would be someone who typically was reporting that the surveyed activities took place frequently and would be predicted to often use a raw rating of 4 or 5 for his or her answers. Readers can apply this thinking to the situation in which Xs are outside the confidence bands in the region where we plotted a triangle in Fig. 6.8. For an X to be located in this triangle, a researcher would consider a respondent with a low measure (about −4). Such a person, with the raw score coding used for these survey data, would be predicted to have answered by using low rating scale categories (e.g. 1 or 2). This can be seen by using the ogive. However, if we observe an X in the region of the triangle, this means that this person (with a low measure) actually used a high rating scale category for his or her answer. This is quite unexpected. There is another aspect of looking at ICCs: the region of more discrimination for an item is where the curve is steepest (Linacre, 2002).

Formative Assessment Checkpoint #1
Question: How do I read and use the plots that have an x-axis of “Measure on latent variable” and a y-axis of “Score on Item”?
Answer: First, review the range of numbers on the x- and y-axes. On the x-axis, a researcher will see measures that range from a negative logit value to a positive logit value. On the y-axis, a researcher will observe raw score values that represent the different numbers used to code the raw rating scale. Also displayed is an ogive that shows the measure-to-raw-score relationship. The lines on either side of the ogive are 95% confidence intervals. An X plotted outside smooth confidence bands is evidence of some unexpected behavior on the part of a respondent.

The bottom line for those developing or monitoring measurement instruments is to review such ICC plots and then look for the presence of data points outside the 95% confidence interval bands. When such data points fall outside the bands, consider what the predicted response of a respondent of a specific measure would be. A researcher should then use the plot to identify the divergence between what is predicted and what is observed for respondents. Are the respondents with a high measure unexpectedly selecting a specific rating category? What is unexpected in what is observed? Then a decision must be made regarding what to do with the item.

Formative Assessment Checkpoint #2
Question: When looking at the plot “Measure on the latent variable” and “Score on Item”, where would one hope to see Xs (which represent respondents), and where would one hope not to see Xs?
Answer: One expects, when data are perfect (they never are), to see the Xs near the ogive line. When Xs are far away from the line, it suggests that something might be amiss with the respondent. Also, there might be something amiss with the item. Looking at such plots provides another technique by which the quality of an instrument and data can be evaluated.

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Ok, it’s taken me a bit of time, but I am ready for your usual quiz.
Chizuko: You sure?
Charlie: Yes, I’m ready to give it my all.
Chizuko: Well Charlie, tell me what the overall goal of the tips in this chapter has been. Why should you think about these issues?
Charlie: Wow! You are giving me a lot of room here. I think what I really appreciate about this chapter are the techniques, which provide us with an additional way to look at how our items are functioning within the instrument. In these plots, if I see a number of Xs outside the confidence intervals, and if I am confident in my confidence intervals, then I need to think about what the Xs mean. What was unexpected in what happened? So, thinking precedes leaping. I need to keep reminding myself about that!
Chizuko: Okay, a solid A! Anything you want to add?
Charlie: Well well well...I think another thing is to understand the plots and the ideas in these plots. It was really important to remember some basics that I learned in RAHS Book 1. For example, I needed to seriously understand ogives, I needed to be able to understand the meaning of a higher person measure, and I needed to understand what raw score would be observed for respondents of particular measures. If I had not developed some skills with Table 20, I might have had some problems. One last thing that really helped me was to do an analysis with a multiple-choice test and then to compare that analysis to the survey analysis in this chapter. That allowed me to go more quickly between an analysis for a multiple-choice test and a rating scale analysis. I can now see how the plots are similar. Also, I think it is very helpful to know that where an ICC is steep, that is a region of more discrimination for an item.
Chizuko: Nice work. You are really on a roll!



Keywords and Phrases
Item Characteristic Curve (ICC)
Empirical Interval
95% Confidence Interval Lines
Ogive

Potential Article Text
For the development of the Quality of Life Instrument (QLI), a number of steps were taken to evaluate the measurement properties of the 20-item survey. In addition to the evaluation of item fit, Item Characteristic Curves were reviewed for each survey item. One item, survey item 7, suggested unexpected responses on the part of some patients. Several patients who had high quality of life measures provided unexpected (low) ratings for this survey item. A number of Xs can be seen outside the confidence bands presented in Figure 2. Discussions with the research team resulted in a group consensus that this survey item could be interpreted differently by respondents with high measures and low measures. For the final version of the instrument, this survey item was removed.

Quick Tips
Utilizing the menu option Graphs, and the subsequent buttons Absolute x-axis and Exp+Empirical ICC, will provide a plot of the Empirical ICC and a set of confidence intervals.
Data points lying outside the confidence bands may be evidence of an item not matching what is predicted by the Rasch measurement model.
When you are correctly viewing the plots, you should see the following labels: “Measure on Latent Variable” (x-axis) and “Score on Item” (y-axis).
Where the ICC is steep, the item is more discriminating than where the ICC is flat.

Data Sets (Go to http://extras.springer.com)
cf n40 Jordan Activity (data provided by Dr. Saed Sabah)
cf Turkish Sci Educ Data (data provided by Dr. Sibel Telli)
cf stebi all 23 items
cf 25 GCKA (data provided by Dr. Kathy Trundle)



Activities
Activity #1
Repeat the analysis conducted in the text using the same data set, cf n40 Jordan Activity. Compare the plots from your analysis to the two plots presented in the text.
Answer: As long as you click the correct buttons and you review the correct data sets, you will see the same plots as presented in the text.

Activity #2
For item C4 of Activity #1, print out your graph that displays “Measure on the latent variable” and “Score on item.” Color in with a colored marker a region outside of the confidence bands. Then explain, in words, what it would mean if an X (or Xs) were to lie in the region you colored. Make sure to use the ogive to see what would be expected of a respondent with a particular measure and what was seen. What would be unexpected about a respondent who would lie inside your colored region?
Answer: For any region, readers will be able to use the ogive and the two axes to compute the expected answer for any specific person measure. Also, a researcher will be able to compute what a person marked by a particular X would have been expected to answer (on the raw rating scale) for a specific measure. Readers should be able to explain what is unexpected about what was observed with an X outside the confidence bands.

Activity #3
Use the same data set as for Activity #1 and conduct an analysis for item C1. Count the number of Xs that you see in the initial plot. How many Xs do you see? How many different measures were observed in your data set? Explain what you see.
Answer: An initial analysis conducted with cf n40 Jordan Activity reveals that for item C1, there are a total of ten Xs above and below the ogive curve. The number of Xs is less than the total number of different Rasch measures observed in the data set (there are a total of 16 measures in the data set). This discrepancy is the result of the Empirical Interval used to plot the data. When we evaluated our data set, the Empirical Interval was .46 (we observed ten Xs as well). However, if one changes the Empirical Interval to .01, 16 Xs can be seen.



Activity #4
Using the data set cf Turkish Sci Educ Data, conduct an analysis and first review the measure table (Table 20) for the data set. The data set was collected using part of the TOSRA [refer to the Test of Science Related Attitudes Handbook (TOSRA) (Fraser, 1981)]. Below we provide the numbering nomenclature and item text for the one subset of TOSRA items that we provide in this data set. The subset was named “Enjoyment of Science Lessons” and was viewed as being a single variable. Data were entered for the positive items using the following coding: Strongly Agree (5), Agree (4), Neither Agree nor Disagree (3), Disagree (2), and Strongly Disagree (1). For negative items, the following coding was used: Strongly Agree (1), Agree (2), Neither Agree nor Disagree (3), Disagree (4), and Strongly Disagree (5) (this means the flipped data were entered in the data set). Negative items are 6, 13, 19, and 31.

• What is the meaning of a higher person measure?
• What is the meaning of a lower person measure?
• What is the meaning of a higher raw rating score?
• What is the meaning of a lower raw rating score?

Answer: A higher person measure means a person is more interested in science. A lower person measure means a person is less interested in science. A higher raw score for a respondent means the respondent is more positive toward science. A lower raw score means a respondent is less positive toward (less interested in) science.



Activity #5
Repeat the steps outlined in this chapter for the Activity #4 data set and review the ICC plot for the first item of the data set. Identify the measure axis, the raw score axis, and the confidence bands, and count the number of Xs observed. Is the number of Xs observed the same as the number of different respondents in the data set?
Answer: The measure axis is the x-axis, and the raw score axis is the y-axis. The x-axis ranges from about −6 logits to 6 logits. The y-axis ranges from 0 to 4. The locations of raw ratings of 1, 2, and 3 are also plotted. The number of Xs that can be observed is five. In total, 16 different measures can be observed. This can be checked by reviewing the person measure table for this analysis.

Activity #6
Can you alter the Empirical Interval so that you can observe a total of 16 Xs?
Answer: Yes. By changing the Empirical Interval you can observe a total of 16 Xs.

Activity #7
If you evaluated a multiple-choice test data set, how do you think the plots would be similar or dissimilar to those you would observe with the analysis of a rating scale data set?
Answer: The plots would be organized in a very similar manner. However, the vertical axis would range from a low of 0 to a high of 1.

Activity #8
We provide a data set for a multiple-choice test, cf 25 GCKA. Conduct an analysis and review the plot for the first item of the test. What do you see?
Answer: Analysis of this data set reveals a curve that is very similar to a curve for a survey data set. The only big difference is that the vertical axis ranges from 0 to 1. All other techniques that can be used to understand the survey data can be used to understand this plot. The plot for the first test item (named Q3) has a total of seven Xs, with an x-axis ranging from −6 to 6 (the range of the curve) and a y-axis with a maximum of 1 and a minimum of 0. Changing the Empirical Interval to .01 results in a total of 13 Xs being plotted. By reviewing the person measure table of Winsteps, one can note that there are a total of 13 different measures earned by the respondents of this data set.

Activity #9
Which plot from Activity #8 shows a good distribution of Xs within the confidence bands?
Answer: There are a number of such items, but if a researcher reviews Q23, he or she can see an item with well-placed Xs.

Activity #10
For the Winsteps plot of Q3 (provided below using the file cf 25 GCKA), where would a region of Xs (outside the confidence bands) suggest an unexpected miss on the part of a respondent? Where would a region of Xs (also outside the confidence bands) suggest an unexpected “correct” answer on the part of respondents? What might be an explanation for an unexpected miss and an unexpected correct answer?
Answer: A region to the lower right of the curve (plotted with the circle) would be an unexpected area in which to see an X outside the confidence bands. An X in this region would be a respondent with a high measure who had an unexpected miss on an item; for such a respondent, a researcher would predict a high raw score. The same logic can be used for a region at the upper left of the curve. We have inserted a box in this portion of the plot. An X observed in such a region would represent a respondent with a low measure who earned a high score on the item, which is very unexpected. The X to the lower right of the curve might be a very capable student who was not concentrating, and therefore missed an item. An X to the upper left might be a low-performing student who made a lucky guess.



Activity #11
Conduct the same type of analysis for the data set cf stebi all 23 items. For this data set, all STEBI items (both Self Efficacy and Outcome Expectancy) are included. The rating scale consists of six steps, from Strongly Disagree (1) to Disagree (2) to Barely Disagree (3) to Barely Agree (4) to Agree (5) to Strongly Agree (6). A respondent with a higher person measure is a respondent with more confidence in himself or herself and in what students can achieve. A person with a high person measure will have a high raw score total across answers to the survey items.

References
Campbell, T., Abd-Hamid, N. H., & Chapman, H. (2010). Development of instruments to assess teacher and student perceptions of inquiry experiences in science classrooms. Journal of Science Teacher Education, 21(1), 13–30.
Linacre, J. M. (2002). Estimating item discriminations. Rasch Measurement Transactions, 16(1), 868.
Linacre, J. M. (2018). Winsteps® Rasch measurement computer program user’s guide. Beaverton, OR: Winsteps.com



Additional Readings An overview of the Rasch measurement model as presented by rehab-scales.org. Item characteristic curves are discussed: Rasch Measurement Model. (n.d.). Retrieved from http://www.rehab-scales.org/rasch-measurement-model.html.

Chapter 7

How Well Are Your Instrument Items Helping You to Discriminate and Communicate?

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Hi Chizuko...I am thinking about some added steps that I might take to fine-tune a test that I have, and also a survey that I am using. I’ve looked at several things such as Item Fit and Person Fit. I’ve also looked at my Wright Map to gauge how my instrument’s items might help me to locate persons on a trait. Given all of this, I’d like to learn about something else...
Chizuko: And what might that something else be?
Charlie: Well…I’d like to get an idea of how my items allow me to discriminate between students who understand a topic and students who do not.
Chizuko: Okay. Let me show you. There is yet another technique that we can use to learn how well our items work with a group of respondents.

Tie Ins to RAHS Book 1
In our first book, we discussed a large number of techniques that can be employed to evaluate the manner in which a measurement instrument is functioning (e.g. Which items are redundant? Which item may define a different trait?). Moreover, we stressed the importance of being able to explain the meaning of a measure, such as by thoughtfully using Wright Maps. In this chapter, we introduce readers to a number of additional techniques that allow us to communicate how respondents appear to be interacting with each test item. Also, we introduce readers to a technique that will help them better evaluate the strength with which items can help discriminate across students. This chapter also ties to the previous chapter of this book (Chapter 6) in that Chapters 6 and 7 both present techniques that help researchers evaluate how items may differ in their ability to discriminate.





Introduction
As readers will recall, when a Rasch analysis is completed, researchers must almost always consider the interplay of items and persons. This interplay can be explored through a number of plots and additional analysis techniques. In this chapter, we present one particular plot and analysis technique that we have found to be quite helpful as we have attempted to improve our measurement instruments. Specifically, we consider an assessment of the range of “discrimination” an item has. Items that do a good job of discriminating help us detect differences between test takers, and items that do a poor job of discriminating may not help us very much. Just as a test’s items exhibit a wide range of difficulties, they can also exhibit a wide range of discrimination. We believe item discrimination should be one of numerous factors researchers consider as they decide which items to keep in a test and which items to remove.

Communicating and Evaluating the Manner in Which Test Items Discriminate Students
Below (Fig. 7.1) we provide one of the tables available from Winsteps (Table 2.6). This table was constructed with a control file for a test data set that contained 25 items and 75 respondents (cf test). The table is presented in a form familiar to most readers. The vertical axis provides the order and rough spacing of the test items from easiest (least difficult) to hardest (most difficult). The spacing is rough in that each item is presented on its own horizontal line. We observe that Item 2 is the easiest item, and Item 30 is the hardest item. Thus, the vertical axis goes from a negative logit value (base of plot) to a positive logit value (top of plot). For a component of the work in this chapter, we concentrate on the two values (0, 1) that are reported for each item in the plot. We begin by considering the location of the 0 and the 1 for Item 2 using the horizontal axis metric. Eyeballing the axis, it appears as if the 0 for Item 2 is located approximately at a value of −.9 logits, and the 1 for Item 2 at approximately .95 logits. What is the significance of these values for Item 2, and how do we use this information to help us measure with better accuracy? The first important point is that the value of −.9 logits for the 0 of Item 2 indicates the typical measure of the students who did not correctly answer Item 2. Perhaps not surprisingly, the value of the 1 marks the location of the typical student who correctly answered Item 2. These values, with regard to typical answers, are important when a researcher wants to explain how groups of students differed in their responses to this item.



Fig. 7.1  Table 2.6 from a Winsteps analysis of a 25-item, 75-respondent multiple choice test instrument. Correct answers are noted with a 1, and incorrect answers are noted with a 0. The vertical axis presents items organized from easiest (Item 2 at the bottom) to most difficult (Item 30 at the top). The horizontal gap between the 0 and the 1 for each item provides an assessment of the discrimination of the test item. Larger gaps represent more discrimination, and smaller gaps represent less discrimination. Think of items that do a better job of discriminating as those items which help you do a better job of figuring out where a person is on the variable

There is, however, another exceedingly important aspect of this plot, and this aspect focuses on the distance between the locations of the 0 and the 1 for Item 2. These locations mark the average measure of the sample respondents who correctly answered the item (denoted with the number 1) and of those who incorrectly answered the item (denoted with the number 0). The gap between the two locations provides a gauge of how well the item differentiated the students who knew the answer to Item 2 from those students who did not. Larger gaps represent larger discrimination. Using knowledge of the gap between the 0 and the 1, which items in this test appear to do a good job of discriminating respondents? Moreover, which items do a mediocre job of discriminating respondents? A visual review reveals that Item 2 and Item 35 have some of the largest gaps between the 0 and the 1. Which items do the poorest job of discriminating respondents? A visual inspection reveals that Items 10, 9, and 30 have the smallest gaps between the 0 and the 1.

Formative Assessment Checkpoint #1
Question: How do I interpret Winsteps Table 2.6? How does the table help me explain the performances of the respondents? How does the table help me better gauge how well my instrument is functioning?
Answer: First, identify the vertical axis and the horizontal axis of the plot. Then, pick one of the test items listed on the vertical axis. For that item, note the location of the 0 and the 1 (for a test that was scored using 0 for incorrect and 1 for correct) using the scale of the horizontal axis. These locations mark the average performance of those respondents who incorrectly and correctly answered the test item. Most important, however, an instrument developer must note the size of the gap between the location of the 0 and the 1 for each item. A larger gap indicates more discrimination between students, and a smaller gap indicates less discrimination between students.

Computing the Gap
In our analyses of instruments, one additional step we take when using Table 2.6 to evaluate the discrimination level of an instrument’s items is to review Table 13.3. This table provides the exact locations of each 0 and each 1 along the horizontal axis of Table 2.6. In Fig. 7.2 we provide a portion of Table 13.3.

Fig. 7.2  A portion of Table 13.3 from the Rasch Winsteps analysis used to create Table 2.6. This table allows readers to identify the exact location of the 0 and 1 values presented in Winsteps Table 2.6



For Item 30, Table 13.3 reports a mean respondent measure of .55 for those respondents who incorrectly answered Item 30, and a mean respondent measure of .97 for those who correctly answered Item 30. As it is possible to compute the gap between the 0 and the 1 for each item (.97 − .55 = .42), we often take the step of computing the gap for every item. We then plot those values next to the test items presented on a Wright Map. Below we present such a map with the added gap values (see Fig. 7.3). We also present part of Table 2.6, in which we add the gap information for three of the test items. In our first book, we emphasized that a legion of issues must be considered and weighed when deciding upon the mix of items ultimately to be used in an instrument. The gap value is by no means the only issue; however, we suggest that the gap value is a critical part of the mix of issues to be considered when evaluating an instrument. A second, and equally important, issue focuses on the communication of results. Wright Maps can be used for explaining the results of an investigation.
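In code, the gap is nothing more than a subtraction; the following minimal Python sketch (our own illustration) uses the Item 30 values from Table 13.3:

def discrimination_gap(mean_correct, mean_incorrect):
    # Gap between the average measure of respondents who answered the
    # item correctly (the 1 in Table 2.6) and those who answered it
    # incorrectly (the 0); larger gaps mean more discrimination.
    return mean_correct - mean_incorrect

# Item 30 from Table 13.3: means of .97 (correct) and .55 (incorrect)
print(round(discrimination_gap(0.97, 0.55), 2))  # 0.42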

Fig. 7.3  Part of a Wright Map (from Winsteps) of the instrument’s items, organized from least difficult to most difficult. Next to each item is the gap value of the items. Also provided is Table 2.6 of Winsteps in which one can also type in the gap information



Moreover, an added component of communication should be the prudent use of Winsteps Table 2.6. Analysts can observe the amount of item discrimination (the size of the gap) and also identify the typical measure of the respondents who correctly answered (in this data set example) the test items and of those respondents who did not. By adding the gap information to a Wright Map as one sorts through which items to retain and which to discard, we believe an analyst can better develop a test. We suggest that some of the key aspects of item functioning a researcher will want to report are item fit, the distribution of items along the trait, item reliability, item separation, and an assessment of each item’s discrimination through computation of the “gap” as detailed in this chapter.

Formative Assessment Checkpoint #2
Question: How do I compute the distance between the 0 and the 1 in Winsteps Table 2.6?
Answer: Winsteps Table 13.3 will allow one to compute the gap. A plot such as Table 2.6 will allow one to see the gaps between the 0s and the 1s for each item, but Table 13.3 will allow one to compute the exact distance of the gap.

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Well Chizuko, you really are quite good at explaining things.
Chizuko: If that is the case, then tell me a little of what you have learned, and why these new bits of information are important!
Charlie: I’ll begin with a broad comment...I think that I have a good feel for many important aspects of measurement using Rasch and Winsteps. For example, I look at item fit (often Outfit) and I look at person fit, item reliability, and person reliability. I also review the order of steps to look for disordering. I really look at lots of things. For example, I have not mentioned all of the ways I use a Wright Map to evaluate my instrument and to communicate my results.
Chizuko (interrupting politely): Ok, what does this have to do with Table 2.6 that I showed you?
Charlie: I think the important point is that Table 2.6 provides yet another way for me to evaluate the quality of the instrument that I am trying to develop. Also, Table 2.6 allows me to communicate the typical performance of a test respondent who correctly answered or did not correctly answer an item. The techniques that you have shared with me today provide me with even more ways to explain my work to others and allow me to take additional steps as I develop my measurement instrument!
Chizuko: Nice going Charlie. You know what strikes me about this table and how we use it? It is hard for me to believe that, before Rasch, instrument developers looked at a limited number of issues when developing their instruments, and, of course, nonlinear data were often treated as linear!

Keywords and Phrases
The average measure of respondents who correctly answered item X can be seen through use of Table 2.6.
The gap between the average measure of respondents who correctly answered item X and the average measure of respondents who did not correctly answer item X can be used to evaluate the discrimination of the test item.

Potential Article Text
A Rasch analysis was conducted on a 30-item multiple choice test data set. In addition to evaluating the fit of items and respondents to the Rasch model, a number of analysis steps were taken to evaluate the manner in which test items discriminated between lower-measure and higher-measure respondents. Table 2.6 of Winsteps and Table 13.3 facilitated both visual and numerical analyses of item discrimination. Most important was the computed measure gap between respondents who correctly or incorrectly answered each test item. As the test needed to be reduced in length, items with the smallest gaps were identified and evaluated with respect to item difficulty and the expected ability range of future test takers. As test takers’ ability is predicted to cover the range of the measurement scale, an equal number of low-discriminating items were removed from the low, middle, and high ends of the measurement scale. A total of six lower-discriminating test items were removed from the test instrument: two items were easy, two items exhibited middle difficulty, and two items were difficult.

Quick Tips
To use Table 2.6, run your Rasch analysis. Recall that for a test in which correct answers are coded with a 1 and incorrect answers are coded with a 0, the test items are organized along the vertical axis from easiest (bottom) to most difficult (top).
To identify the average measure of respondents who received a 0 for an item, draw a vertical line from the 0 for that item to the horizontal axis, which reports the respondent measure. You can follow the same steps for items correctly answered.
The distance between the 0 and the 1 for each item provides an assessment of the discrimination of the test item. The larger the gap between the 0 and the 1 for an item, the larger the discrimination.



Data Sets (Go to http://extras.springer.com)
cf test
cf discrim chp mc test
cf second test for discrim chp

Activities
Activity #1
Repeat the analysis of this chapter with the control file that was used to generate the figures. That file is cf discrim chp mc test.

Activity #2
Repeat the analysis of this chapter with a new test item data set. Use the file cf second test for discrim chp to conduct the analysis. Identify those test items that exhibit low discrimination and high discrimination. For one of the items, identify the average measure of the students who incorrectly answered the test item and the average measure of the students who correctly answered the test item. Review your knowledge of the discrimination of each test item. Then use a Wright Map to identify three items that you might remove from the test if you were to remove items based only on item discrimination. Author a paragraph in which you explain your use of the Wright Map and the discrimination values to select the three items that might be removed from the test.

Chapter 8

Partial Credit Part 1

Charlie and Chizuko: Two Colleagues Conversing
Charlie: OK, I feel pretty good about my understanding of how to evaluate a multiple-choice test. But what about a partial credit test? Such tests are so common in many settings.
Chizuko: Well Charlie, I’m pleased to tell you that once you have thought about the ins and outs of multiple-choice tests, the conceptual jump to understanding Rasch analysis of partial credit tests is not that difficult. You know that many tests include a mix of dichotomous items and partial credit items, and you have already done some other Rasch thinking that will help you.
Charlie: Mmmmm….how about a hint?
Chizuko: What can you think of that might be coded as 1, 2, 3, 4…something that you have looked at quite often?
Charlie: I have it; it’s a rating scale!
Chizuko: Yup, that is right Charlie. When you think about partial credit data, you can also use some of the skills you have mastered when you evaluate a rating scale data set with Rasch techniques.

Tie Ins to RAHS Book 1
In this chapter readers will see several links to our first book. In RAHS Book 1, we routinely helped readers remember the importance of thinking about the meaning of a number. What does the symbol mean? What does the symbol not mean? A second tie in is our use of Wright Maps throughout this chapter. In RAHS Book 1, we used Wright Maps to explore tests in which dichotomous items are presented. In this chapter, we extend that work to tests with partial credit items. RAHS Book 1 introduced readers to ISGROUPS=. In that book, we also presented a chapter that talked readers through the use of this command to facilitate the analysis of survey data when different rating scales are used to measure a single trait. In this new chapter concerning partial credit data, we also revisit this command. Yes, indeed, researchers could try to tackle a partial credit data set right off the bat. However, by working up to a partial credit data set, we think that readers can develop a more thorough understanding of a partial credit analysis. In our first book, we often discussed thinking about the meaning of measuring. Doing that helped readers to be careful with thinking about, among many things, the meaning of numbers used to code data. A nice byproduct of being careful is learning not to mix numbers that merely seem to be the same with one another (without some pondering). For example, we have learned not to assume that selecting "Agree" for item 2 of a survey has the same meaning as selecting "Agree" for item 10 of the same survey. This issue will help us as we learn how to evaluate data sets that include partial credit items. In this chapter, we also make extensive use of Wright Maps to help decode and highlight the measurement properties of a test that includes both multiple choice items and partial credit items. One more connection to our first book is the use of the control file command ISGROUPS, which allows us to specify different groupings of rating scales in one Rasch analysis. Naturally, the techniques we have introduced in Chaps. 2, 3, 4, 5, 6, and 7 of this book can be used with partial credit data.

Introduction
Tests that include partial credit items are commonplace in many evaluation instruments and systems. Frequently, evaluations include a set of dichotomous items as well as a few partial credit items. For example, a medical certification examination may include a large number of multiple-choice items and three essay items that will be scored by expert judges. Before the advent of Rasch analysis, researchers and evaluators typically counted the number of correctly answered right/wrong items, and then added the number of points awarded for the essay answers. For example, if a candidate correctly answered 15 of 20 multiple-choice items and received a 2/4 for essay #1, a 3/4 for essay #2, and a 3/4 for essay #3, the candidate would be awarded 23 points (15+2+3+3=23). Readers should already be wiggling in their chairs as they think of the mistakes that have been made. One mistake is acting as if the 20 multiple-choice items are of the same difficulty. Another error is acting as if 3 points earned on essay #2 has the same meaning as 3 points for essay #3. Clearly, several issues are layered upon one another, and these issues must be considered in an analysis of partial credit data. Fortunately, Rasch analysis allows us to conduct an accurate analysis of a partial credit data set when care has been taken to measure one trait. For readers interested in a technical discussion of partial credit analysis, we suggest an article authored by Masters (1982). Below we provide some initial guidance about partial credit analysis in a nontechnical manner.


A Data Set Containing a Mix of Partial Credit and Dichotomous Items
In this chapter, we present a partial credit data set which can be evaluated using Winsteps software. As with analyses that we presented in RAHS Book 1, we use the data set to help readers examine and practice Rasch measurement techniques. The data set that we supply is in an Excel sheet named Edited Data Set for Partial Credit Chapter PC 3, 14, 21, 25. Throughout this text we will use the phrase Winsteps to indicate the Rasch software that we use.
Formative Assessment Checkpoint #1
Question: In the data set, Edited Data Set for Partial Credit Chapter PC 3, 14, 21, 25, why might it be incorrect to simply compute a raw score total for a student named Henry?
Answer: Raw data (both dichotomous and partial credit) must not be assumed to be linear. Regarding partial credit items, it is not logical to assume that a raw score of 2 earned for item #3 has the same meaning in terms of student understanding as a raw score of 2 earned for item #25.

Figure 8.1 presents the data. Reviewing the data set, how is it organized? Columns A-B, D-M, O-T, and V-X contain the dichotomous items; a 1 is used to identify a correct answer and a 0 is used to identify an incorrect answer. Columns C, N, U and Y contain the partial credit items. Review of the full data set and the headers reveals that items Q3 and Q21 are worth a maximum of 2 points, whereas the other two partial credit items (Q14 and Q25) are worth up to 4 points each.

Fig. 8.1 The first 11 respondents answering the 25-item test, as presented in an Excel spreadsheet


Exploring and Understanding the Data Set
The next steps are exploring, thinking, and learning why data sets with partial credit items require a careful Rasch analysis. We will start this exploration with the construction of a control file and an initial analysis of the test data. The control file created by an immediate, unaltered control file construction from the Excel data is provided as cf straight up, and the Wright Map from the analysis of the data set using cf straight up is provided in Fig. 8.2. We supply cf straight up in the data sets for this chapter. As readers will see, we provide this straight up analysis as a way of making a point about a partial credit analysis. We believe this point will help readers better understand and explain the need for a careful partial credit analysis of data sets. What can be observed, and how might the pattern in the Wright Map be explained? First, remember from the coding that was used that a higher number (raw score) received by a respondent indicates a better response. This means that a higher person measure is supposed to indicate a higher ability student. Moreover, a higher item measure should indicate a harder item. However, why are the partial credit items in a group at the base of the plot? Why are the dichotomous items at the top of the plot? Why is there no mixing of items? Understanding this pattern of items is a reader's first big step toward understanding a partial credit data set and the types of issues that must be considered when evaluating such data.


Fig. 8.2  The Wright Map resulting from the Winsteps analysis of the data set using the control file cf straight up. The phrase “pc 0 1 2” beside items Q3 & Q21 and the phrase “pc 0 1 2 3 4” beside items Q14 & Q25 have been added to help identify the partial credit items, as well as to communicate the range of points that can be earned for each partial credit item. All dichotomous items (e.g. Q11) are simply plotted with just an item name. Either 0 or 1 points are possible for the dichotomous items


Fig. 8.3  The top third of the control file cf straight up. Winsteps, with this control file, treats all items as having a maximum of 4 points possible
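To make concrete what the figure shows, a minimal sketch of the key lines of cf straight up (the title and column positions here are hypothetical; the supplied cf straight up file is the authoritative version):

TITLE = "cf straight up - no rating scale structure declared"
ITEM1 = 1        ; hypothetical column of the first item response
NI    = 25       ; 25 test items
CODES = 01234    ; Winsteps is told only that codes 0-4 are valid for every item

Because no grouping information is supplied, every item is treated as if the full 0-4 rating scale applied to it.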

When the cf straight up (Fig. 8.3) control file is used to evaluate the data, as far as the program is concerned, the highest value that could be earned for any of the items is a 4, and the lowest is a 0. This can be seen in Fig. 8.2. When Winsteps runs, all that is known is that CODES=01234. This explains why the dichotomous items appear toward the top of the Wright Map, whereas the partial credit items appear toward the bottom of the Wright Map for this data set. When only 0s and 1s are possible for some items, those items appear more difficult in comparison to partial credit items for which 0s, 1s, 2s, 3s, and 4s are observed. Thus, in Fig. 8.2 it should make sense that the dichotomous items are plotted as difficult compared to the partial credit items. If only a maximum of a 1 can be earned on an item, then, compared to an item on which a 4 has been earned, it appears as if the dichotomous item is more difficult than the partial credit item. How then might the ordering of the partial credit items from easy to hard be explained, given the coding and the confusion that results if one immediately conducts a Rasch analysis with the data set? The easiest item (Q14) has possible raw score points of 0, 1, 2, 3, or 4; however, this item is followed by an item (Q21) with possible points of 0, 1, or 2. In turn, moving up the Wright Map, the next item (Q25) can be answered 0, 1, 2, 3, or 4. The next item moving up the Wright Map, Q3, can be answered 0, 1, or 2. Why are the two easiest items not the two items with possible points of 0, 1, 2, 3, or 4? The answer is that the ordering of items in terms of their apparent difficulty depends on how students answered the items. In our example, it was easier for students to receive a 2 for item Q21 than it was to receive a 4 for item Q25. At this point, readers should again review Fig. 8.2. This plot of item difficulty displays the calibrations of items that result if an analyst considers the nonlinear nature of raw scores but treats the ratings of all items as belonging to the same group. That is what we are doing through the use of the cf straight up control file. The first part, considering the nonlinear nature of raw scores, is a familiar topic for readers. If ordinal data are used to compute a measure, then the Rasch model must be used to compute linear measures. The second part, not distinguishing between the meaning of a 1 for dichotomous item Q23 and the meaning of a 1 for partial credit item Q3, is a slightly new topic.


We say slightly because in our first book we considered this issue when we discussed the steps needed to combine rating scales. Before making some alterations in the control files so that we can move forward to conducting an accurate partial credit Rasch analysis, we repeat some comments that are extremely difficult for many researchers and students to digest initially. Below, we present an edited Wright Map with only a few of the test items (Fig. 8.4). When we work with students, our first action is to remind them how all items of a test should mark part of a trait. We reiterate that for a set of dichotomous items, although a respondent gets one point for a correct answer, in reality, 1 point does not have the same meaning for all items. Getting the correct answer and 1 point for item Q10 demonstrates a higher mastery of a topic than getting the correct answer and 1 point for item Q18. Our next big leap is VERY difficult for our students. We ask them whether they would agree that, if a few partial credit test items (0, 1, 2 points possible) were added to the test, it would be reasonable not to assume that the rating scale functions for items Q10 and Q18 in the same way it functions for Q21 and Q3. Usually, our students ultimately start to understand the issue. If we ask the students to summarize our comments, most will typically say: (a) We have taken care to evaluate a ten item dichotomous test in such a manner that raw scores are not treated as linear; (b) therefore, we now must consider that just because a partial credit item is worth more points, the 2 points awarded to John for the correct solution to item Q3 might not represent a higher level of ability than the 1 point John was awarded for the correct solution to item Q10 (a dichotomous item). Sometimes, several class sessions are needed for the majority of the class to ponder and eventually process this idea. Clearly, not treating raw scores as measures is a foreign idea when learners first start with Rasch measurement. Equally clear is the difficulty of weighing the meaning of a 1 for item Q3 against a 1 for item Q10.

Formative Assessment Checkpoint #2
Question: How can it be that analysts should not view a score of 2 for a partial credit test item with possible scores of 0, 1 and 2 as being the same as a score of 2 for a partial credit item with possible scores of 0, 1, 2, 3, and 4?
Answer: Analysts should not view all items as having the same difficulty. Answering one multiple-choice item correctly should not be viewed as providing the same information about a respondent as answering a different multiple-choice item correctly. Moreover, the meaning of a score of 2 for a 2-point partial credit item (0 points, 1 point, 2 points) should not be viewed as the same as a score of 2 for another 2-point partial credit item (0 points, 1 point, 2 points). Furthermore, earning a 2 for a 4-point (0, 1, 2, 3, 4) partial credit item should not be assumed to have the same meaning as earning a 2 for a 2-point (0, 1, 2) partial credit item. For example, perhaps the 4-point item is exceedingly difficult; a raw score of 2 would then represent a far higher level of respondent ability than a score of 2 represents for the 2-point item.


Fig. 8.4  A partial Wright Map of 4 test items. Two items are dichotomous, and two items are partial credit

An additional technique that we use with our students to help them begin to think about partial credit analyses is to present Winsteps Table 2.2 (Fig. 8.5), which provides the Rasch half-point thresholds. Our students are already familiar with this plot, because it has been used to bring meaning to the measures of rating scale data such as the STEBI (Enochs & Riggs, 1990), which was used in RAHS Book 1. We first point out to our students that, as expected, there is an ordering of items that is supposed to run from easiest (at the bottom of the plot) to hardest (at the top of the plot). We point out that this ordering of items from supposedly the easiest to the hardest is the same as that seen in Fig. 8.2. We then point out that the pattern of 0s, 1s, 2s, 3s, and 4s is the same for each item. For example, the space for region 1 is the same width for all items, and region 3 is the same width for all items. We remind our students that unless we inform our Rasch analysis program of some nuances in the rating scale, the rating scale will be set to the same jumps from 1 to 2, and so on, for each item. The last point we make to our students is to emphasize that regions of 0s, 1s, 2s, 3s and 4s are present for all items. Students then correctly assert that many of the items were ones with a maximum score of 1; it was not possible to earn a 2 or 3 or 4 on many items. At this point, many students start to develop an uneasy gut feeling that, indeed, something is amiss from a measurement perspective when an analyst does not take into consideration that three rating scales are used in this data set: a rating scale of only 0 and 1; a rating scale of only 0, 1 and 2; and a rating scale of only 0, 1, 2, 3, and 4.


[Winsteps Table 2.2 (74 persons, 25 items, 5 categories; Winsteps 4.4.8): the expected score zones for each of the 25 items, with every item displaying the identical 0 : 1 : 2 : 3 : 4 category structure]

Fig. 8.5  Winsteps Table 2.2 presents the results of an analysis of the 25 test items without noting in the analysis that the data set included dichotomous items and partial credit items with a rating scale of 0-1-2, as well as partial credit items with a rating scale of 0-1-2-3-4. This means the GROUPS command has not been used

The Key to the Partial Credit Analysis Is the Judicious Use of ISGROUPS
Readers should now appreciate that evaluating a partial credit data set requires the application of a number of important measurement techniques that have been presented for the analysis of dichotomous tests and also for rating scale surveys. The key in the analysis of the current data set is to indicate which test items are paired with which rating scales. In this data set, Q3 and Q21 are rated with a 3-category scale (0, 1, 2). Items Q14 and Q25 are rated with a 5-category scale (0, 1, 2, 3, 4). The remaining items, Q1-Q2, Q4-Q13, Q15-Q20, and Q22-Q24, are dichotomous items with a two-step rating scale (0, 1).


Fig. 8.6  The first third of the control file used to evaluate the partial credit test data when the analysis takes into consideration that the test includes three different types of items with respect to rating scale. The complete control file is named cf pc 2
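To complement the figure, a minimal sketch of the lines at the heart of cf pc 2 (column positions and the title are hypothetical; the complete file is supplied with the chapter data sets):

TITLE    = "cf pc 2 - three rating scale groupings"
ITEM1    = 1       ; hypothetical column of the first item response
NI       = 25      ; 25 test items
CODES    = 01234   ; all codes observed anywhere in the data set
; one letter per item, in data set order:
; A = dichotomous (0-1), C = three categories (0-1-2), B = five categories (0-1-2-3-4)
ISGROUPS = AACAAAAAAAAAABAAAAAACAAAB

Reading the string letter by letter, positions 3 and 21 (items Q3 and Q21) carry a C, positions 14 and 25 (items Q14 and Q25) carry a B, and every other position carries an A.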

By indicating in our Rasch analysis which items use which rating scales, we are noting that, indeed, movement from 0 to 1 for the dichotomous items has the same directional meaning along the variable as, say, movement from 3 to 4 for Q25 (an increase in the numerical value of the label awarded to a respondent represents a "better" response regardless of the rating scale of the item). However, by indicating which items use which rating scale, we are able to take into account that the jump from, for example, a 1 to a 2 may not have the same meaning for Q14 and Q21. Figure 8.6 presents the revised control file. The important line is the ISGROUPS line. This line contains a notation made to emphasize the different types of items. Type A items are dichotomous (0-1). Type C items have three categories (0-1-2). Type B items have five categories (0-1-2-3-4). It is this line which tells the Rasch program that there are three types of items in the test. The first "A" of the ISGROUPS line tells the program that the first item read in is an item for which 0 or 1 points are possible. The second "A" tells the program that the second item read in is also an item for which 0 or 1 points are possible. Now look at the third letter (a "C"). This letter tells the program that the third item of the data set is a type C item (one can earn 0, 1, or 2 points). The ordering of the letters refers to the item type and the placement of each item in the data set. For instance, the final letter indicates that the last item of the data set is of item type B. Analysis of the data using cf pc 2 results in the Wright Map presented below (Fig. 8.7). There are several patterns in the item ordering and spacing that should now make sense, considering our comments above. First and foremost, the test items are ordered by difficulty along the trait. Some dichotomous items (e.g. Q15, Q10, Q22) are displayed as marking more difficult (more advanced, if you will) portions of the trait than some of the items for which one could receive higher raw score ratings. For example, items Q14 (0-4 points) and Q3 (0-2 points) have lower item measures than three dichotomous items.


Fig. 8.7  The Wright Map resulting from the use of ISGROUPS (this is the same as the GROUPS command) in the control file for a Winsteps analysis. Unlike the straight-up analysis, in which all dichotomous items are displayed as more difficult than the partial credit items, one now sees a mix of items (0/1 point items, 0/1/2 point items, 0/1/2/3/4 point items) along the trait. This makes sense in that, for example, partial credit items could be very easy or very difficult


Fig. 8.8  Table 2.2 of Winsteps when the partial credit analysis uses ISGROUPS to acknowledge the three different item rating scales present in the data set

Another important aspect of the plot is that one partial credit item (Q21, 0-2 points) exhibits a lower item measure than many dichotomous items. The next step in developing a gut feeling for, and an understanding of, including and analyzing partial credit items in a test analysis is revealed by what can be observed in Table 2.2 of Winsteps (Fig. 8.8). Several aspects of our analysis can be seen in Fig. 8.8. First, and also seen in the Wright Map, is an ordering of test items from easier to more difficult (easier items toward the bottom and more difficult items toward the top), an ordering in which some dichotomous items are more difficult than some partial credit items. Second, consider the jumps for the two 0-4 point items. If readers look at the spacing for 0-1, 1-2, 2-3, and 3-4, they can see that the spacing is the same for both of the 4-point items. If readers look at the jumps for the two 0-2 point items for 0-1 and 1-2, they will see that the jumps are the same for both 2-point items. A third observation is that the jumps for the 0-2 point items are not the same as the jumps for the 0-4 point items. This is a reflection of the analyst telling Winsteps that one should not assume that the rating scale for the 2-point items, for example, is the same as the rating scale for the 4-point items.


At this point, to finish our introduction to students, we ask them to place Figs. 8.5 and 8.8 next to each other, and then to note the differences and explain those differences aloud. The meaning of the respondents' measures on the 25-item test can best be appreciated by drawing a vertical line upward through Fig. 8.8. Readers can add a line representing a respondent who receives the average person measure (1.57 logits) for this group of respondents. This line allows an analyst to observe the meaning of the average measure of this group of respondents; in particular, it shows what response would be predicted from the average respondent to each and every item.

The Person Measures from a Partial Credit Analysis
One issue that is sometimes forgotten by those first learning Rasch is that Rasch is used to evaluate the functioning of an instrument. Rasch can be used to make corrections in a data set and to take into consideration the types of issues we have stressed to this point in the chapter; for example, that earning 1 point on a dichotomous item may represent a more advanced location along the trait than receiving 4 points for one of the partial credit items. Another comment must be made, namely, when conducting a Rasch partial credit analysis of data, it is important to use the person measures that are computed from the analysis. For example, if data are collected from 200 students (100 males, 100 females) using a partial credit test, it is important to make the appropriate corrections in a control file using a technique such as that facilitated by ISGROUPS, and then to use the resulting person measures for any subsequent statistical analysis. Experts in Rasch analysis will roll their eyes, but many times during workshops some participants forget they need to use Rasch person measures for their statistical analyses.
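A minimal sketch of how those person measures can be captured, assuming the ISGROUPS string sketched earlier and a hypothetical output file name (PFILE= writes one row of Rasch statistics, including the person measure, per respondent):

ISGROUPS = AACAAAAAAAAAABAAAAAACAAAB   ; the partial credit structure described above
PFILE    = person-measures.txt         ; person measures to carry into later statistics

The measures written to the PFILE, not the raw scores, are what belong in any subsequent t-test, regression, or other statistical comparison of, say, males and females.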

Person Fit, Item Fit, Reliability
We emphasized in our first book that analysts can confidently and accurately explore a multitude of measurement issues through Rasch measurement. The analysis outlined herein for a partial credit examination is somewhat more complicated than the analysis of a survey in which all items are evaluated with the same rating scale. However, the same rules and techniques can be used to monitor the quality of a partial credit instrument, evaluate partial credit data quality, and communicate partial credit findings. Thus, it is important to evaluate issues such as item fit and person fit!

Formative Assessment Checkpoint #3
Question: Can the techniques used to evaluate item quality (e.g., MNSQ item fit) also be applied when a test includes partial credit items?
Answer: Yes. The same fit, reliability, and data-quality techniques presented for dichotomous tests and rating scale surveys apply to a partial credit analysis.

[Winsteps Table 3.2: the summary of the rating scale category structure, reporting for each of categories 0 through 6 the Andrich Threshold, the category score-zone boundaries, and category fit statistics]

Fig. 10.4  The Andrich Thresholds present the location of the intersection of the curves from one rating scale category to the next. For example, at −1.44 logits the curve for rating scale category 0 intersects the curve for rating scale category 1. A second example, at 1.77 logits, the curve for rating scale category 5 intersects the curve for category 6. We have marked where you can find the exact value of the category 0 and category 1 curve intersection in Table 3.2. Tables provided by Winsteps

What Happens When the Masters Partial Credit Model Is Used, When GROUPS=0, When GROUPS=AABBCCC?
It should make sense to readers that when the Masters Partial Credit Model is used (meaning there is no assumption that the step from one rating scale category to the next is the same across all items), some changes will occur in the types of tables and plots provided above. Moreover, important new interpretation and instrument management possibilities may open up for the researcher. Below, in Fig. 10.5, is a control file identical to the control file already presented. However, the phrase Groups=0 has been inserted to conduct an analysis in which a researcher does not assume that the rating scale functions in a similar manner for all the rating scale items of a survey. We now have a control file that allows us to conduct a revised analysis, one in which each rating scale functions in a unique manner for each item.
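Consistent with Fig. 10.5, a minimal sketch of the change (the item count, codes, and title are hypothetical; the authoritative file is se-all groups equal zero.cf):

TITLE  = "STEBI Self-Efficacy items - Groups=0 analysis"
NI     = 13       ; hypothetical number of survey items
CODES  = 123456   ; hypothetical rating scale codes
GROUPS = 0        ; each item's rating scale is free to function in its own manner

Everything else in the control file can remain exactly as it was for the common rating scale analysis; Groups=0 is the single change.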


Fig. 10.5  The control file for an analysis made with the command Groups=0. In such an analysis, the rating scales of items are not assumed to function in an identical manner. This file is named se-all groups equal zero.cf

Let us review the impact on the Hill plots that are possible through Winsteps. Below is Fig. 10.6, which provides the Hills for the same two items presented immediately above.
Formative Assessment Checkpoint #2
Question: Why are the Hills different from item to item when I use the Masters Partial Credit Model?
Answer: When one uses the Masters Partial Credit Model, one does not assert that the manner in which the rating scale works is identical for each of the items. Thus, the plots of the curves may be (and most often are) different.

The important aspect of this plot, that the pattern of the Hills is different between the two survey items, is critical for readers to note. This should not be surprising, as we indicated that we did not make the assumption that the rating scale functioned in an identical manner for each item. A second observation is also important. The pattern of the Hills observed utilizing Groups=0 is different from the pattern when Groups=0 is not used, which means an analyst assumes the rating scale functions in the same manner for all items. Using Fig. 10.4, we showed readers that Table 3.2 provides the details of the logit location at which one curve intersects the next. Figure 10.7 below provides the tables that are constructed using the Groups=0 command. In these tables, and also in Fig. 10.6, readers can see that the location at which adjacent rating scale categories are equally probable differs from survey item to survey item.
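For readers who would like the algebra behind these item-specific Hills, a sketch of the Masters (1982) partial credit model follows; here $B_n$ is the person measure, $D_i$ is the item difficulty, and $F_{ij}$ denotes the Andrich thresholds estimated separately for each item $i$ (that separate estimation is what Groups=0 requests):

$$P_{nik} = \frac{\exp\left(\sum_{j=1}^{k}\left(B_n - D_i - F_{ij}\right)\right)}{\sum_{h=0}^{m_i} \exp\left(\sum_{j=1}^{h}\left(B_n - D_i - F_{ij}\right)\right)}, \qquad \sum_{j=1}^{0} \equiv 0.$$

Adjacent category curves $k-1$ and $k$ intersect where $B_n = D_i + F_{ik}$; these intersections are the Andrich Thresholds reported in Table 3.3, and with Groups=0 they may sit at different locations for each item.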


Fig. 10.6  The Hills of two items following an analysis utilizing Groups=0. Note that the pattern of Hills is quite different for these two items. Also, note that the Hill pattern of the items in Fig. 10.6 is different than the Hill pattern in Fig. 10.3. Plots provided by Winsteps


Fig. 10.7  Part of Table 3 provided through a Groups=0 analysis of the survey data. The important aspect of these tables is that one can observe different Andrich Thresholds as a function of survey item. We provide Winsteps Table 3.3 for the same two Self-Efficacy items presented and discussed previously

How to Make Use of Figure 10.6 and Figure 10.7 for Your Research?
Evaluating Rating Scale Functioning
One technique considered in our first book was the numerous ways in which a researcher could evaluate the manner in which the rating scale functions for an instrument. In that previous work, we specifically considered the Andrich model, in which the rating scale steps were considered to function in an identical manner for all items. As researchers might predict, an extension of this type of technique is to use Groups=0, or a Groups command for a partial credit analysis, and then to review the Hills for each survey item. In particular, analysts might investigate whether or not the thresholds increase in a monotonic manner for adjacent categories for each survey item.


Researchers have published numerous articles using this technique to investigate whether or not the thresholds increase monotonically for adjacent categories. Two that present such an investigation are a paper involving a spinal cord injury instrument (Itzkovich et al., 2002) and a paper concerning a back pain scale (Lu et al., 2013). In both studies each item's pattern of Hills was reviewed, and that information was used to inform an overall assessment of the scale's functioning. We can imagine that such a review can be used to:
(a) Evaluate the overall functioning of the rating scale when Groups=0 is utilized.
(b) Identify, and perhaps remove, some survey items that exhibit strange disordering of adjacent thresholds.
(c) Identify specific survey items that exhibit disordering of adjacent thresholds, and then experiment with combining rating scale steps for those specific items.
If researchers utilize point (c), it is important to note that, since Groups=0, each rating scale is viewed as possibly functioning in a different manner. It is therefore possible to recode data (for example, combine rating scale categories) for a single item (or a set of items) and not to do so for other items, as shown in the sketch below.
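A minimal sketch of such an item-specific recode, assuming a hypothetical 13-item survey in which only item 4 shows disordering between the categories labeled 3 and 4 (in Winsteps, IREFER= assigns each item to a rescoring group, and the IVALUEx= lines give the rescored value for each code listed in CODES=):

IREFER  = AAABAAAAAAAAA   ; one letter per item: item 4 is in rescoring group B
CODES   = 0123456         ; hypothetical codes as collected
IVALUEA = 0123456         ; group A items: categories left as collected
IVALUEB = 0123356         ; group B (item 4 only): every 4 is rescored as a 3
GROUPS  = 0

Rerunning the analysis after such a recode, and re-examining the Hills and Andrich Thresholds for the recoded item, reveals whether the disordering has lessened or disappeared.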

Communicating Data Analysis
We suggest a second way that researchers might utilize the Hills plots for communicating research results when Groups=0 is used and each item is reviewed. The plots in Fig. 10.6 use a vertical axis that provides a probability value from 0 to 1 and a horizontal axis labeled "Measure relative to item difficulty." These are the forms of the Hill plots that are quite often found in articles. There exists, however, another option in Winsteps that allows researchers to communicate the meaning of the Hills for a respondent and for groups of respondents. In Fig. 10.8 we provide a screen shot of the menu that allows one to generate the rating scale curves. In Fig. 10.9 below we provide the Hill plot for the third survey item of the STEBI. When we created this plot (using the menu of Fig. 10.8 by first clicking Graphs, then clicking Probability Category Curves), we also clicked on another button named Absolute x-axis. We used the Adjust minimum X-value and the Adjust maximum X-value options to ensure that we made use of the entire x-axis in our plots. The resulting plot is in Fig. 10.9, and we will describe how this plot can be of immense use to researchers. Please note that when a researcher clicks on Absolute x-axis, s/he obtains the plot below, and the button that stated Absolute x-axis changes to read Relative x-axis, which would restore the original scaling. How can a researcher make use of this plot? Let us pretend that a subgroup of respondents taking the survey had a group mean measure of 2.0 logits. Using this plot, a researcher can mark the location of this subgroup and bring meaning to the measure of the respondents for this item (remember, we are looking at item Q3 of the SE scale) and the rating scale.


Fig. 10.8  The Winsteps Graphs menu, from which one can generate the probability category curve plots as well as other plots

We explain this idea in the following manner to our students: the vertical line that we inserted in the plot intersects the Agree curve for this item at approximately a probability value of .34, and the same line intersects the Strongly Agree curve at approximately a value of .60. This means that if we wish to explain the meaning of a group measure (or an individual measure) for this item and this rating scale, we can do so.


Fig. 10.9  The curves for a rating scale and a single survey item. The probability for each potential person measure is provided. The probability for a respondent at 2.0 logits is marked for two of the curves (the Strongly Agree curve and the Agree curve). Remember, Groups was set to 0. Plot provided by Winsteps

This provides the opportunity to explain the meaning of a measure in a different, but equally important, manner than a Wright Map. Using plots such as those provided above, a researcher can take each survey item and, for any measure, explain the probability of a specific response. Just find the location of a respondent, or a group of respondents, and it is possible to compute the chance (the probability) of a specific response. Remember, since you set Groups=0, the plot for each item will more than likely differ, as the rating scale is not being forced to be the same for all items.
Charlie and Chizuko: Two Colleagues Conversing
Charlie: OK, I think I get this Groups=0 thing, how Groups=0 impacts the Hills, and how the Hills are impacted by a partial credit analysis.
Chizuko: Then tell me more!
Charlie: OK. First, I might decide to conduct an analysis of a rating scale by considering each rating scale to potentially be operating in a different manner. By this I mean the ordinal scale is ordinal for each instrument item, but I do not assume everything is identical. The command Groups=0 allows me to treat each rating scale as potentially spaced out differently.
Chizuko: Is there more, or is that it? It is interesting, but is there more?


Charlie: Well…indeed there is more. When we use Groups=0 and we look at the Hills discussed in book one, we first notice that the plots are read in the same way as when a straight rating scale model is used. But then we notice that the plots for each item are different.
Chizuko: Why is that?
Charlie: Well…it is because we have allowed each item to have its own operating rating scale. Each item has the same number of rating categories, but each item can have a rating scale that is functioning a little bit differently.
Chizuko: How do we use such plots?
Charlie: Interesting…remember with the Hills in RAHS Book 1, we looked for step ordering (and disordering)? We can do the same for all these individual plots.
Chizuko: What do you do if you have disordering in a few items, but not in all items?
Charlie: Well, honestly, I think it is best to see if there is an overall pattern with all my items. If there are just a few problem items, maybe I would remove the items if the pattern seems very strange. I think another smart step would be not to remove an item, but rather to recode data just for that item. With that recoding, I might get my hoped-for monotonically increasing Andrich Thresholds. Also, I think it is pretty nice that I can plot a person measure (or a group measure) on the plots, and then I can explain how that person (or group of persons) would have had a high probability of answering for that item. That allows me to really explain things. This all works for a partial credit analysis too!

Keywords and Phrases
Groups=0, Groups=AABBCC
Andrich Thresholds
Rating Scale
Category Probability

Potential Article Text
The analysis of the data set made use of the GROUPS=0 command in Winsteps. This command allows researchers to conduct an analysis in which the rating scale of each item is viewed as potentially functioning in a different manner. A review of the category probability curves was conducted for each of the 13 survey items. One item (#4) exhibited potential step disordering. A second analysis was conducted in which the categories Strongly Agree and Agree were collapsed into a single category. Following that step, the item did not exhibit any step disordering. By using a version of these plots in which the vertical axis is probability and the horizontal axis is absolute, it was also possible to explain the meaning of the measures of females and males with respect to each of the 13 survey items.


Quick Tips
Use Groups=0 for an analysis in which each item can have a different rating scale structure.
By clicking the Graphs button of Winsteps and then the Probability Category Curves option, analysts can generate the Hill plots for an analysis. Note that when Groups=0 is used, analysts will see a unique plot of Hills for each instrument item.
Review the Andrich Threshold ordering for each item. If the thresholds do not increase monotonically for adjacent categories, then investigate. It may be necessary to do some recoding (combine categories in a logical manner).
Probability Category Curves can be used to explain the meaning of a measure. This is most easily done by selecting Graphs, then clicking Probability Category Curves, then selecting the button named Absolute x-axis.

Data Sets (Go to http://extras.springer.com)
se-all groups equal zero.cf
se-all.cf
cf stebi just OE

Activities
Activity #1
Run an analysis of se-all.cf (a data set with all STEBI Self-Efficacy items). Do not use GROUPS=0. Review all the Hill plots. What do you see?
Answer: All the Hill plots are the same.
Activity #2
Run an analysis of se-all.cf. Use GROUPS=0 to ensure that the rating scale is treated as different for each survey item. Do you see different patterns for different items?
Answer: Yes, there is a different Hill pattern for each item.
Activity #3
Identify those items that exhibit limited (or no) step disordering.


Activity #4
Identify those items that exhibit significant step disordering (look at the curves, look at the Andrich Thresholds). Which items are perhaps the worst regarding step disordering? Then, for the worst item, can you identify which categories are disordered?
Activity #5
For one of the items with a disordering problem, do a by-hand recode of your data only for that item. In your recode for that item, experiment with recoding of adjacent categories. For example, you may see disordering for the categories labeled 3 and 4, so recode the categories as a 3 and change all 4 values, only for that item, to a 3. Then rerun your analysis and see if the disordering for that item has lessened or disappeared.
Answer: You pick the item, and you do the analysis.
Activity #6
Run an analysis without GROUPS=0 for the control file cf stebi just OE. Verify that all the Hill plots present the same picture.
Activity #7
Run an analysis of cf stebi just OE, but use GROUPS=0 to ensure that the rating scale is treated as different for each survey item. Do you see a different pattern for different items?
Answer: Yes, there is a different Hill pattern for each item.
Activity #8
Identify those items for which there is limited (or no) step disordering.
Activity #9
Identify those items that exhibit significant step disordering. Which items are perhaps the worst regarding step disordering? Then, for the worst item, can you identify which categories are disordered?


Activity #10
For one of the items with a disordering problem, do a by-hand recode of your data only for that item. In your recode for that one item, experiment with recoding of adjacent categories. For example, you may see disordering for the categories labeled 3 and 4, so recode the categories as a 3 and change all 4 values, just for that item, to a 3. Then rerun your analysis and see if the disordering for that item has lessened or disappeared.
Answer: You pick the item and you do the analysis.
Activity #11
For one of the Hill plots from the analysis of Activity 7, pretend the group mean measure of males in the data set was 1.25 logits. Using the Absolute x-axis button, can you identify which rating scale category was most probable?
Activity #12
Most of this chapter made use of Groups=0 to take into consideration that each rating scale might function in a different manner for each survey item. Run cf stebi just OE without GROUPS=0 and compute the person measures of the data set. Then run the same control file with GROUPS=0 and compute the person measures (use PFILE). After you have computed the two sets of person measures, create a cross plot to investigate the impact of using or not using GROUPS=0 for this analysis. Did it make a big difference to use GROUPS=0?
Answer: Make your cross plot in the software package of your choice. If there are few respondents off-diagonal, it does not appear (at least from the perspective of a cross plot) that it made much difference to use or not use GROUPS=0. Note there are many other reasons to use, or not use, GROUPS=0.

References

Enochs, L. G., & Riggs, I. M. (1990). Further development of an elementary science teaching efficacy belief instrument: A preservice elementary scale. School Science and Mathematics, 90(8), 694–706.
Itzkovich, M., Tripolski, M., Zeilig, G., Ring, H., Rosentul, N., Ronen, J., et al. (2002). Rasch analysis of the Catz-Itzkovich spinal cord independence measure. Spinal Cord, 40(8), 396–407.
Lu, Y. M., Wu, Y. Y., Hsieh, C. L., Lin, C. L., Hwang, S. L., Cheng, K. I., et al. (2013). Measurement precision of the disability for back pain scale-by applying Rasch analysis. Health and Quality of Life Outcomes, 11(1), 119.


Additional Readings

Gothwal, V. K., Bharani, S., & Reddy, S. P. (2015). Measuring coping in parents of children with disabilities: A Rasch model approach. PLoS One, 10(3), e0118189.
Linacre, J. M. (1999a). Category disordering (disordered categories) vs. threshold disordering (disordered thresholds). Rasch Measurement Transactions, 13(1), 675.
Linacre, J. M. (1999b). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103–122.
Linacre, J. M. (2006). Demarcating category intervals. Rasch Measurement Transactions, 19(3), 341–343.
van der Wal, M. B. A., Tuinebreijer, W. E., Bloemen, M. C. T., Verhaegen, P. D. H. M., Middelkoop, E., & van Zuijlen, P. P. M. (2012). Rasch analysis of the Patient and Observer Scar Assessment Scale (POSAS) in burn scars. Quality of Life Research, 21(1), 13–23.

Chapter 11

Common Person Test Equating

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Chizuko, I've got an interesting data set…I have three tests…and there are no common items across the tests. So, that means no luck on using item anchoring to link the scales. But I do have three students who took Test A and Test B. Also, I have three students who took Test B and Test C. I also have three students who completed Test C and Test A. I'm wondering, but I am not sure, if I might somehow be able to use those nine people to help me link the scales?
Chizuko: You are asking about a technique called common person equating. I have used this technique to link some scales. At first it might be a little difficult to understand, but if you think through the logic of item anchoring, you will see that we can use the same logic to conduct person anchoring.

Tie Ins to RAHS Book 1
In our first book, we introduced readers to the concept of, and the steps needed for, anchoring scales using common items. Such item anchoring allows researchers, among many processes, to express the items of many different tests on the same scale and to compute the measures of all respondents regardless of the test form completed. Just remember that we must be linking the same variable. This chapter makes use of the above-mentioned processes, and the steps show how common persons are used to link scales. For example, if two different Self-Efficacy scales were developed with entirely different sets of items, the two scales can be linked if we have some test takers complete both scales. Moreover, all items of both scales can be displayed on a Wright Map and expressed on the same measurement scale. Readers should appreciate that the techniques presented in RAHS Book 1, as well as the previous chapters of this book, should be used to evaluate the measurement properties of a scale when common persons are used to link two or more scales.


Introduction
Given the information in our first book, we can present a very useful Rasch technique herein. For example, our previous work with item linking should have helped readers understand the concept and use of item linking. A second, but similar, process that requires time to digest is called common person equating. This process is closely tied to common item equating, but it takes some thinking to understand and to recognize when this technique will help researchers solve a measurement issue. What situation would require common person equating? For example, a researcher has collected data from students via three different tests that are supposed to test the same trait. Some students completed only Test A, other students completed only Test B, and still other students completed only Test C. Moreover, several other students completed Tests A and B, and other students completed Tests B and C. Finally, some students completed Tests A and C. Most important, there are no common items across the three tests. Figure 11.1 presents a schematic of these data. Of course, in the real world, we would have designed a test with more than five items. Also, we would have followed the advice that we offered in RAHS Book 1 with regard to the development and use of instruments. What can be done with this data set? There are no common items (item anchors) to link the tests, and there seems to be no way to place every test taker on the same scale. Are we facing an insurmountable problem? Let us step back and think first about measuring a single variable with one test. In the example just described, we have three separate tests, and each test separately measures the same variable. The fact that each test measures the same variable serves as the key to placing all test items on the same metric and placing all test takers on the same metric, regardless of the fact that no common items exist across the three tests. To understand how this works, our first step is to remind readers, students, and colleagues alike that person measures and item measures are on the same scale. This can be documented by reviewing the Rasch model and noting that Di cannot be subtracted from Bn if the units are not identical!

$$B_n - D_i = \ln\left(\frac{P_{ni}}{1 - P_{ni}}\right)$$
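As a quick numerical illustration of why shared units matter (the values here are hypothetical), the model can be rearranged to give the probability of success:

$$P_{ni} = \frac{e^{B_n - D_i}}{1 + e^{B_n - D_i}}, \qquad \text{so } B_n - D_i = 1.0 \;\Rightarrow\; P_{ni} = \frac{e^{1.0}}{1 + e^{1.0}} \approx .73.$$

A person one logit above an item's difficulty has about a .73 probability of success on that item, no matter which test form the item appears on.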

Taking step two, we ask students to rewrite the data matrix that we presented in Fig. 11.1. In so doing, we ask students to plot the test takers on the horizontal axis and list the tests on the vertical axis (Fig. 11.2). When readers complete this plot and reflect for a moment, they will realize that they observe the same pattern of links that they observed when we wrote about common item anchoring, the case in which we link tests by having common items across different forms of a test. In this plot, readers can see that persons are the links that tie the three test metrics to a single metric. Three students (ID 26, 27, 28) link Forms A and B. Three other students (ID 39, 40, 41) link Forms B and C. Three other students (ID 51, 52, 53) link Forms C and A.


Fig. 11.1  An Excel data spreadsheet in which three tests were administered to a sample of respondents. Most respondents completed only one test. Three common respondents (ID 26, 27, 28) serve as a link between Test A and Test B. Three other common respondents (ID 39, 40, 41) serve as a link between Test B and Test C. Three still other common respondents (ID 51, 52, 53) serve as a link between Test C and Test A. If a respondent had completed Tests A, B, and C, that respondent would also serve as a link

Fig. 11.2  The Excel data matrix in which the data from Fig. 11.1 are presented. However, the tests are presented as rows (where respondents usually are in our data matrix) and the respondents are presented in columns (where test items are usually located in our data matrix)

What doors does linking through persons open for a researcher? If a researcher has separate forms of a test or different forms of a survey, the scales of the different instruments can be linked as long as some respondents take multiple forms of the test. Another important aspect of common person linking is that, if all scales measure the same trait, by using common person equating a researcher can: (a) express all the items of a number of surveys (or all items of a number of tests) on the same Wright Map, and (b) compute a respondent measure on one scale, regardless of the survey form (or test form) completed. Making the jump from understanding how common items can be used to link metrics to understanding how common persons can be used to link metrics is difficult. A key point for readers to digest is: just as the presence of common items across different forms allows researchers to align scales, persons who take two forms of a test (or survey) can allow researchers to align the scales of the two forms. To apply this technique of common person equating, we now guide readers through the steps outlined above using a data set we have organized to reflect those steps. We have organized a file named common person eq ss two for readers. This file consists of test takers who took Form A or Form B or Form C of a test. The file is that presented in Fig. 11.1. Three students (ID 26, 27, 28) are common because they took Forms A and B. Three other students (ID 39, 40, 41) are common because they took Forms B and C, thereby linking Forms B and C. Three other students (ID 51, 52, 53) are common because they took Forms A and C, thereby linking Forms C and A. Following the organization of the spreadsheet, readers can construct a Winsteps control file. We provide this file as cf common per eq. Below are the resulting person entry table (Fig. 11.3) and item entry table (Fig. 11.4). All respondents and items are expressed on the same metric. This means, for example, that although students 11-25 took only Test A, they are expressed on the same metric as students 42-50, who only completed Test C. All test items are expressed on the same scale. Therefore, by reviewing a Winsteps Wright Map (Fig. 11.5), a researcher can observe all items for all test forms. This means that a researcher can compare, for example, the location of any item along the trait regardless of test form. One can also compare students regardless of the test form completed. This example of common person equating expresses test takers' measures on a single scale defined by all the items of Tests A, B, and C.
Formative Assessment Checkpoint #1
Question: What is the relationship between item anchoring and person anchoring?
Answer: When we use Rasch, we always consider one variable. Also, when we think about Rasch, we know that we can think of persons interacting with a variable and that persons are located somewhere on a variable. We can also think of items marking a variable. Just as we can tie two tests to the same scale by using common items, we can also tie two different scales to the same scale by using common persons. This is possible because the location of a person on a trait should always be the same, regardless of the test a student might take. Of course, the tests must involve the same variable.
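A minimal sketch of what the start of cf common per eq might look like (column positions and the title are hypothetical; responses left blank because a respondent did not take a given test are, by default, treated by Winsteps as not administered):

TITLE = "Common person equating - Tests A, B, and C analyzed together"
NAME1 = 1        ; hypothetical column where the person ID begins
ITEM1 = 4        ; hypothetical column of the first of the 15 responses
NI    = 15       ; five items from each of Tests A, B, and C
CODES = 01       ; dichotomous scoring throughout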



Fig. 11.3  The person entry table of the Winsteps analysis of the spreadsheet data for three different test forms when common persons are used to link the scales for each test to one scale

A Slight Twist

Above we described the logic of using common persons to link scales. The logic is very similar to linking tests through common items, which we presented in our first book. However, another way exists to link through common persons. To link three different tests, a researcher could also follow a different procedure. For example, if an analysis of Test A is made, then the measures of the common persons from Test A can be used to link Test A and Test B through an analysis of only the Test B data. This procedure can be continued to link Test B to Test C using the measures of


TABLE 14.1 common person eq ss two.xls  ZOU041WS.TXT  Apr 6 2020 14:59
INPUT: 43 PERSON  15 ITEM  REPORTED: 43 PERSON  15 ITEM  2 CATS  WINSTEPS 4.4.8
--------------------------------------------------------------------------------
PERSON: REAL SEP.: .53  REL.: .22 ... ITEM: REAL SEP.: 1.85  REL.: .77

ITEM STATISTICS: ENTRY ORDER
-----------------------------------------------------------------------------------------------
|ENTRY   TOTAL  TOTAL           MODEL|   INFIT  |  OUTFIT  |PTMEASUR-AL|EXACT MATCH|          |
|NUMBER  SCORE  COUNT  MEASURE  S.E. |MNSQ  ZSTD|MNSQ  ZSTD|CORR.  EXP.| OBS%  EXP%| ITEM     |
|------------------------------------+----------+----------+-----------+-----------+----------|
|     1     18     21    -1.86   .67 |1.12   .43|1.03   .25|  .26   .36| 75.0  81.4| Test A Q1|
|     2     16     21    -1.11   .57 | .76 -1.01| .71  -.81|  .62   .43| 87.5  71.4| Test A Q2|
|     3     17     21    -1.45   .61 |1.01   .13| .96   .05|  .38   .40| 81.3  76.3| Test A Q3|
|     4     17     21    -1.45   .61 | .98   .05|2.26  2.23|  .25   .40| 81.3  76.3| Test A Q4|
|     5     19     21    -2.38   .78 |1.22   .56|1.31   .62|  .09   .30| 87.5  87.2| Test A Q5|
|     6      8     16      .43   .63 |1.01   .15|1.06   .27|  .59   .60| 78.6  74.7| Test B Q1|
|     7      8     16      .43   .63 | .69 -1.02| .58 -1.21|  .75   .60| 78.6  74.7| Test B Q2|
|     8      9     16      .04   .63 | .59 -1.66| .48 -1.46|  .79   .59| 85.7  72.4| Test B Q3|
|     9     11     16     -.78   .66 |1.11   .48| .93   .10|  .52   .55| 64.3  74.3| Test B Q4|
|    10      5     16     1.69   .68 |1.20   .69|1.03   .25|  .52   .58| 64.3  76.8| Test B Q5|
|    11      8     15     1.51   .58 | .90  -.33|1.21   .73|  .49   .46| 73.3  70.8| Test C Q1|
|    12      5     15     2.53   .60 |1.10   .50| .92  -.02|  .34   .39| 60.0  70.3| Test C Q2|
|    13     10     15      .80   .62 |1.51  1.58|1.57  1.33|  .07   .46| 46.7  74.3| Test C Q3|
|    14     11     15      .40   .66 | .56 -1.42| .41 -1.32|  .81   .45| 80.0  77.3| Test C Q4|
|    15      9     15     1.17   .60 | .86  -.46| .86  -.32|  .58   .46| 80.0  73.0| Test C Q5|
|------------------------------------+----------+----------+-----------+-----------+----------|
| MEAN    11.4   17.3      .00   .63 | .98   -.1|1.02    .0|           | 74.9  75.4|          |
| P.SD     4.6    2.6     1.40   .05 | .25    .9| .45   1.0|           | 11.1   4.2|          |
-----------------------------------------------------------------------------------------------

Fig. 11.4 The Winsteps item entry table. Note that all test items are expressed on the same metric

the common respondents in Test B. Below we outline the steps a researcher would take to implement this procedure for common person linking. You can follow these steps using the data presented in Fig. 11.1.

Step 1. Conduct an analysis of only the Test A data and compute a measure for each person and each item.

Step 2. Examining the data, let's pretend that persons with entry numbers 16, 17 and 18 are the persons common to Forms A and B. Pretend that these three test takers have the following measures: person entry #16 (ID 26) is 1.0 logits; person entry #17 (ID 27) is .5 logits; and person entry #18 (ID 28) is -.25 logits.

Step 3. Conduct an analysis of only the Test B data, with the common persons anchored to the person measures determined through the Step 1 analysis. Persons ID 26, ID 27, and ID 28 are the first, second and third persons to be evaluated when analyzing the Test B data set. This means that these common persons linking Test A and Test B have entry numbers 1, 2 and 3 when the Test B data are evaluated. To anchor the Test B data using our knowledge of where these respondents fall on the scale, we insert the commands shown in Fig. 11.6 into our Winsteps control file for our analysis of the Test B data. Remember, this is the information we place in the control file when we evaluate only the Test B data and wish to link to the Form A scale.

Insertion of these values will ensure that the three test takers have these measures on the Test A metric when we evaluate the Test B data. Of importance is that this step aligns the scale of Test A with the scale of Test B. Remember, you will add


[Fig. 11.5: a Winsteps Wright Map (MEASURE | PERSON - MAP - ITEM) spanning roughly +3 to -3 logits. Persons (X's) appear to the left of the vertical line and all 15 items to the right, ranging from Test C Q2 (most difficult, near +2.5 logits) at the top, through the Test B items near the middle of the scale, down to Test A Q5 (easiest, near -2.4 logits) at the bottom.]

Fig. 11.5  The Wright Map resulting from the common person equating analysis using Winsteps. Three different tests were administered to a sample of students. Through common person equating, in which some respondents take more than one test form, all persons and items can be expressed along the same metric regardless of test form


Fig. 11.6  The command to link from Form B to Form A using persons as anchors
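A minimal sketch of such a person-anchor block, using the entry numbers and pretend measures from Step 2 above (the exact content of Fig. 11.6 may differ), is:

PAFILE=*
1  1.00   ; entry 1 (ID 26) anchored at its Test A measure of 1.0 logits
2   .50   ; entry 2 (ID 27) anchored at .5 logits
3  -.25   ; entry 3 (ID 28) anchored at -.25 logits
*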

Fig. 11.7  The command to link Test B to Test C using common persons. Students with IDs 39, 40 and 41 will be entry numbers 1, 2 and 3 when evaluating the Test C data
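Likewise, a minimal sketch of the Fig. 11.7 command for the Test C analysis, using the pretend measures for students 39, 40 and 41 given in Step 4 below, is:

PAFILE=*
1   .40   ; entry 1 (ID 39) anchored at its measure from the linked Test A-Test B analysis
2   .80   ; entry 2 (ID 40)
3   .99   ; entry 3 (ID 41)
*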

these lines of code to your control file in the section that contains the program commands; for example, you could enter these commands under the "CODES" line of the control file.

Step 4. This step is a continued iteration of Step 3, but in this case we anchor the common persons between Test B and Test C. We have already used linking people to connect Test A to Test B; now we link Test B and Test C. To start linking Test B to Test C with common persons, we write down the person measures, from the analysis that linked Test A to Test B, of the students with IDs 39, 40 and 41. Let us pretend that student 39 has a measure of .40, student 40 has a measure of .80, and student 41 has a measure of .99. Remember, these are the measures of these three students computed as the result of carrying out Step 3 (Fig. 11.7).

As was the case in our earlier explanation, it is important to note that these three students (IDs 39, 40 and 41) will be entry numbers 1, 2 and 3 when we evaluate the Test C data. We find it easier to place the linking students at the top of our data file, but that is not a requirement. This means that when we evaluate only the Test C data, we will use the command presented in Fig. 11.7 in our control file. By using this command when evaluating only the Test C data, we link through persons to the Test B data. Because we have already linked Test A to Test B, all three tests are expressed using the same metric.

Readers should look at the three students with IDs 51, 52 and 53. Notice that these three students serve as a link between Test C and Test A; we could use them to link Test A to Test C. For researchers interested in common person equating, there are articles one might consider reviewing (e.g., León, 2008).

We will close with a final comment. We have shown how common person equating can be used to put test items from different tests on the same metric and how the technique can be used to place survey items from different surveys on the same metric. We have also demonstrated that, by using common persons, all respondent measures can be expressed on a single metric regardless of the test or survey form completed. It is important to note that just as item linking (common items) requires one to evaluate the stability of an item through techniques such as DIF, the same issue must be considered with person linking. If one wishes


to link with persons over time, then one would have to evaluate differential person functioning. If a person being used to link from time point 1 to time point 2 drifts in his or her location on the trait, then it may be that the person should no longer be used to link one test to another test, or one survey to another survey.

Formative Assessment Checkpoint #2
Question: How does one conduct common person equating in one spreadsheet? How does one conduct common person equating using "person anchors"?
Answer: To conduct common person equating with one spreadsheet, one needs persons who have answered items from two or more instruments that measure the same trait. When wishing to use "person anchoring", run an analysis of the respondents who answered one instrument measuring the trait and write down the person measures that are computed. Then conduct a second analysis with the persons who answered a second instrument measuring the same trait, making sure to add the anchored persons to your control file. When you add these people to your file, you will be able to calibrate the items of instrument 2 on the same scale on which the instrument 1 items are calibrated. Alternatively, you can put all the data for the persons and the tests in one file, create the control file, and then evaluate the data.

Charlie and Chizuko: Two Colleagues Conversing
Chizuko: Well, you have had a lot of information presented to you. Would you be up for impressing me with what you learned?
Charlie: Sure. I guess I would explain common person equating to people in the following manner: when we use item anchoring, we are linking scales. Knowing where an item is on our trait can be used to set the location of the item on the trait when we are evaluating a new data set. Item anchoring is a little like figuring out where you are on two paper maps. If both maps have the same three mountains plotted, then you can figure out where you are on both maps; having some known mountain locations lets you locate yourself on a map. The same is true for person anchoring. Take the example of two tests, Test A and Test B. If I have some people who take both tests, I can make use of the fact that those persons will be at the same location on the trait, regardless of the test they take. So, just as I make use of items not moving around on a trait to do item anchoring, I can use the same ideas to link scales using person anchors. When I am following the steps to person anchor, I can stack a data set and simply run the data. Another option is to conduct a number of analyses and then use person measures to anchor. Of course, it is important to make sure that I keep track of person entry numbers!
Chizuko: Sounds good to me, but remember, in a week I will ask you the same question!


Keywords and Phrases
Common person anchoring
Common person equating
PAFILE

The concept of person anchoring shares and applies the same ideas as item anchoring. Using Rasch techniques, we are always careful to measure one variable. We know that items mark the trait, and we know that persons are located on the trait. It does not make any difference which items a person attempts; the person can still be located on the trait. Also, it does not matter which persons take an item; we can still locate the item on the trait. These ideas allow researchers to anchor scales with common items and to anchor scales using common persons.

Potential Article Text
The Newton Physics Project (NPP) collected data from a sample of 3000 students. Three separate groups of 900 students completed different tests (Test A, Test B, Test C) involving the topic of electricity. In addition to the responses of these 2700 students, 150 other students were administered Tests A and B, and another 150 students were administered Tests B and C. Although the three tests contained no common items, common person equating was used to link all three forms and to express all student measures on the same scale. Data were entered into an Excel spreadsheet with each student as a row of data and each column as a test item. Entering all test data into a single spreadsheet allowed a single Winsteps analysis to be conducted in which the common respondents across forms linked all test forms to one metric.

Quick Tips
Use PAFILE to anchor one instrument to another instrument through common person anchoring.
To use common person equating, it is necessary to have a set of respondents answer two or more different instruments that measure the same trait. Thus, if there are two instruments measuring Self Efficacy and a set of respondents answers both instruments, it is possible to calibrate all items on the same scale using common person equating.
For a survey, anchor the items and the rating scale.


Data Sets (Go to http://extras.springer.com)
cf common per eq
common person eq ss two
Common Person Equating Spread Sheet
Activity 1 Chp Common Person Equating

Activities
Activity #1
Create a sample data set in which 11 students complete Test D, which has 5 test items. Also, 14 students complete Test E, which has 6 items. Three of the students who completed Test D also completed Test E.
Answer: The file Activity 1 Chp Common Person Equating includes a potential data set. Note that the data have been stacked so that all respondents are included. The three persons serving as links can be identified as respondents 9, 10 and 11. These linking persons (the common persons) are easy to spot because their responses can be seen across the entire spreadsheet from left to right.

Activity #2
Conduct a Rasch analysis with the data set of Activity #1. Verify that all respondents are expressed on the same metric and that all items are on the same metric.

References
León, A. B. (2008). Common-item (or common person) equating with different test discriminations. Rasch Measurement Transactions, 22(3), 1172.

Additional Readings
Hong, I., Woo, H. S., Shim, S., Li, C. Y., Yoonjeong, L., & Velozo, C. A. (2018). Equating activities of daily living outcome measures: The functional independence measure and the Korean version of the modified Barthel index. Disability and Rehabilitation, 40(2), 217-224.
Li, C. Y., Romero, S., Bonilha, H. S., Simpson, K. N., Simpson, A. N., Hong, I., et al. (2016). Linking existing instruments to develop an activity of daily living item bank. Evaluation and the Health Professions, 41(1), 25-43.


Masters, G. N. (1985). Common-person equating with the Rasch model. Applied Psychological Measurement, 9(1), 73-82.
Taylor, W. J., & McPherson, K. M. (2007). Using Rasch analysis to compare the psychometric properties of the Short Form 36 physical function score and the Health Assessment Questionnaire disability index in patients with psoriatic arthritis and rheumatoid arthritis. Arthritis and Rheumatism, 57(5), 723-729.
Yu, C. H., & Popp, S. E. O. (2005). Test equating by common items and common subjects: Concepts and applications. Practical Assessment, Research & Evaluation, 10(4), 1-19.
We recommend that readers review chapter 5 of Best Test Design:
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Chapter 12

Virtual Equating of Test Forms

Charlie and Chizuko: Two Colleagues Conversing
Chizuko: Oh gosh, Charlie, I am in a bind. I am sitting here with two simple mathematics tests that were given to two groups of students. The big, big headache is the absence of any common items. With no common items, I can't do any item anchoring. I think I am at a dead end. There are also no common persons.
Charlie: Well, well…it seems the tables have turned, and I get to help you out.
Chizuko: Okay…fill me in…
Charlie: In the Winsteps Manual, Mike Linacre provides some ideas regarding how an analyst might do what Mike names "Virtual Equating of Test Forms." Basically, he suggests some steps researchers might take in a situation such as yours. The steps involve identifying items that measure the same part of the trait, then creating some plots to see if those items might define the same parts of the trait, and finally making some basic changes in the control file. Ultimately, it is possible to work toward equating your two tests. An analyst must think extensively and carefully, but it is possible. May I show you a little experiment I did to try out the virtual equating of test forms?

Tie Ins to RAHS Book 1
Our first book includes a chapter in which we show how Rasch theory and Winsteps can help researchers link two test forms. Among many points, one main point is that having common items makes it possible to link the scales of two tests that measure the same variable. In this chapter, we provide an additional example concerning equating. In this example, no common items exist (so, no identical items), but numerous items exist that a researcher could hypothesize measure the same part of the trait. Knowing these items, thinking carefully, and taking some simple steps, an analyst may attempt to link two tests even if no items are perfect matches (e.g., identical in wording) across the two test forms. Of course, extensive thinking must occur to



evaluate if the tests are well linked and if the linking makes sense. Although this chapter concerns linking with items, the broad thinking needed to understand such linking connects to the concepts we just presented in Chap. 11.

Introduction
Readers will remember that in RAHS Book 1 we presented a number of steps needed to link two different tests that measure the same variable. The key aspect of what we previously presented is that to "link", or "equate", two tests, there must be common items (items which appear in both tests). These common items allow one to anchor a test to another test. If you are rusty on the equating steps, you might consider rereading that chapter of RAHS Book 1.

In this chapter, we explain how an analyst might equate two tests that have no common items. The key aspect of this equating, in our mind, is that an analyst must have tests and items that measure the same variable. This must be present if a researcher hopes to measure along a variable, and it is what a researcher must have when s/he tries to equate two tests that have no common items. This topic is described in the Winsteps Manual (Linacre, 2018). For this chapter we will use the Winsteps Manual as our guide, but a very nice article by Luppescu (2005), which appeared in Rasch Measurement Transactions, is also helpful.

As we talk readers through the steps needed to equate, we make use of some basic mathematics and temperature concepts that many readers will have encountered in their pre-college years. When a researcher wishes to convert Fahrenheit temperatures to Celsius temperatures, it is possible, by plotting temperature data (in Celsius and Fahrenheit), to draw a line that expresses the relationship between Fahrenheit and Celsius. This line has a slope and an intercept, and knowledge of that slope and intercept allows a researcher to quickly convert Celsius to Fahrenheit. As readers work through the ideas presented in this chapter, they will notice we use the same ideas.

Finally, readers will hopefully recall a chapter in our earlier book in which we mentioned the ability to rescale from logits to user-friendly scales (all still based on a logit scale). Those rescaling steps made use of the commands USCALE and UMEAN in Winsteps. We also make use of these commands here, in the work of ensuring that the scale used for one test is on the same metric as the scale used for the other test.

Formative Assessment Checkpoint #1
Question: What is one requirement to equate two instruments?
Answer: To equate two instruments, one must have a single variable. Thus, if one wishes to equate two tests, one must have two tests that measure the same variable.


First Steps for Virtual Equating
In RAHS Book 1, we described in detail how two tests can be anchored to each other using common items. Doing so ensures that the two scales are linked. To attempt to conduct virtual equating (equating in which there are no common items, when one compares two tests measuring the same variable), a researcher must take several steps. As we did in RAHS Book 1, as well as in chapters of this book, we make use of the writing of colleagues in venues such as Rasch Measurement Transactions (RMT), journals such as the Journal of Applied Measurement, and the extensive, helpful, and detailed Winsteps Manual. Below, we provide the text supplied by Mike Linacre (2018). First, we review these steps; then we follow these steps with a data set. As we follow these steps, we make use of Rasch person measures and item measures and conduct a number of experiments using cross plotting. Throughout, we continuously think about what it means to measure a single trait. Thus, as in every chapter of RAHS Book 1 and this book, extensive thinking and pondering must be present.

Here are the steps that Mike Linacre suggests. These steps are also provided in Luppescu's Rasch Measurement Transactions article (2005).

Virtual Equating of Test Forms
The two tests share no items or persons in common, but the items cover similar material.
Step 1. Identify pairs of items of similar content and difficulty in the two tests. Be generous about interpreting "similar" at this stage. These are the pseudo-common items.
Step 2. From the separate analyses, cross plot the difficulties of the pairs of items, with Test B on the y-axis and Test A on the x-axis. The slope of the best-fit line (i.e., the line through the point at the means of the common items and through the (mean + 1 S.D.) point) should have slope near 1.0. If it does, then the intercept of the line with the x-axis is the equating constant. First approximation: Test B measures in the Test A frame of reference = Test B measure + x-axis intercept.
Step 3. Examine the scatterplot. Points far away from the joint-best-fit trend-line indicate items that are not good pairs. You may wish to consider these to be no longer paired. Drop the items from the plot and redraw the trend line.
Step 4. The slope of the best fit is: slope = (S.D. of Test B common items) / (S.D. of Test A common items). Include in the Test B control file:
USCALE = the value of 1/slope
UMEAN = the value of the x-intercept
and reanalyze Test B. Test B is now in the Test A frame of reference, and the person and item measures from Test A and Test B can be reported together (Linacre, 2018).

Okay, those are the steps. Now let's work through them!


We have two math tests: A and B.

Step 1. Identify pairs of items of similar content and difficulty in the two tests. Be generous about interpreting "similar" at this stage. These are the pseudo-common items (Linacre, 2018).

We try to identify Test B items that match Test A items; by match we mean that the Test B items assess content similar to what the Test A items assess. We hypothesize that these items are located on the same part of the trait, that is, that they have similar difficulties. The matching is based upon the content of the items! So, if we had a math test, Test A might have an item which is 4 + 4 = ?, and Test B might have 3 + 3 = ?. Although these two items are different, we may hypothesize that each item marks the same part of the trait. This is an example of match items!

In total, we have identified 13 items of Test A that might match with 13 items of Test B. Figure 12.1 provides, in part, a summary of the set of 13 match items from the two tests. For example, item 5 of Test B (11 + 12 = ?) is matched with item 9 of Test A (13 + 14 = ?). At this stage in the equating practice, we make a hypothesis as to which pairs of items may mark the same part of the single trait being measured by the test. This proposed matching is based upon our theory of what it means to measure respondents along the trait. Our predictions may not be perfect, but we base them on a theory. If, at this point in the process, we cannot identify potential match items, then we would stop.

Step 2. From the separate analyses, cross plot the difficulties of the pairs of items, with Test B on the y-axis and Test A on the x-axis. The slope of the best-fit line (i.e., the line through the point at the means of the common items and through the (mean + 1 S.D.) point) should have slope near 1.0. If it does, then the intercept of the line with the x-axis is the equating constant (Linacre, 2018).

To proceed with Step 2 of virtual equating, we need the item calibrations (item difficulties) of the Test A and Test B items. Of course, we must run a Rasch analysis of

Fig. 12.1  An Excel table listing Test B item names (column B) and Test A item names (column E). Column A provides the Rasch measure from the Test B analysis. Column D provides the Rasch measure from the Test A analysis


the Test B items. By running Test B, we can compute the item measures needed for the cross plot mentioned above. We then run a Rasch analysis of the Test A items. At this stage, we remind readers that all of the techniques we urged readers to follow in RAHS Book 1 and this book must still be applied (for example, investigation of item fit and so forth), but in this chapter we concentrate on the steps and thinking needed to conduct the virtual equating.

Once the match items are identified and the item difficulties are calibrated, it is useful to make an Excel sheet summarizing which items are the match items. Also, it is important to know, and note, the Rasch calibrations of those items. Readers should remember that we have used separate analyses of Test B and Test A. Also, remember that we guessed at which items match up! Of course, some Test B items may not serve as match items, and there may also be items from Test A that do not serve as match items. Figure 12.1 provides the item calibrations. To stay organized, we provide the names of the items for each of the two tests.

Our next step is to make a cross plot of the item calibrations for Test A and Test B based upon our matching predictions. Readers should remember that only the matched items will be plotted. This should make sense: if an item on one test does not have a match item on the other test, an analyst could not plot it on the cross plot! In Fig. 12.2 we provide a cross plot of all 13 matches. Test B items are plotted on the vertical axis, and Test A items are plotted on the horizontal axis.

In Fig. 12.2, the authors have provided a best fit line that approximates the relationship between the Test A item calibrations and the Test B item calibrations. The location of one match item is identified; look at Fig. 12.1 to confirm the plot of this match item. A line has been added to the plot to allow readers to identify the x-intercept of the line (this is the value of X when the value of Y is 0). From our plot (and simply for talking readers through the steps of virtual equating), let us say that the x-intercept has a value around -.05. When readers think about mathematics classes they have taken, they will be familiar with the term y-intercept, which is the value of Y for a line when the value of X is 0. The same techniques are used for the computation of an x-intercept; however, an analyst must identify where the line has a value of Y = 0, as opposed to X = 0.

Mike Linacre points out that the slope of the line should be about 1.0, and if the slope is indeed about 1.0, then the equating constant (the number we must use to convert the Test B measures to the Test A frame of reference) will use only the x-intercept value to convert a Test B measure to the Test A scale (2018). This step should make sense if readers reflect upon the steps needed to convert from Celsius to Fahrenheit, a common mathematical problem often presented to students in the United States. The equation to convert from C to F is:

T(°F) = T(°C) × 9/5 + 32

If an analyst were to plot temperatures in Celsius and Fahrenheit for, say, 13 temperatures, s/he would have a plot with a perfect straight line. The slope of the line


Fig. 12.2  An SPSS cross plot of the 13 match item pairs. There were no common items on Test A and Test B, but match pairs are items that have been hypothesized to mark the same portion of the trait. A rough best fit line has been inserted by the authors, ensuring that roughly half of the dots are located above the best fit line and roughly half below it. Also added are lines helping one identify the x-intercept of the line

would be 9/5 (which is 1.8), and the y-intercept of the line would be 32. Now, if we pretended (we are just pretending) that the slope of the C/F temperature line was 1, then the equation of the line converting C to F would simply be



T ° F =  T ° C × 1  + 32 ( )  ( ) 

which can be simplified as

T ° F = T ° C + 32

( )

( )



This is exactly the case with our plot and our data from our Test B and Test A Rasch analyses. Mike Linacre is stating that if our slope is near 1, then the equating constant (the number we will use to convert from one test frame of reference (test scale) to another frame of reference (test scale)) is simply the x-intercept of our plot (2018).


Using Mike Linacre's comments in Step 2: "If it does, then the intercept of the line with the x-axis is the equating constant" and "First approximation: Test B measures in the Test A frame of reference = Test B measure + x-axis intercept" (2018). For our data set, what does this mean if we wish to make a first approximation of one item from Test B being expressed in the Test A frame of reference (the Test A scale)? Suppose test item 2 of Test B has a measure of -1.57 in the Test B frame of reference. To translate that item to the Test A frame of reference:

Item 2 of Test B in the Test A frame of reference = -1.57 + (-.05) = -1.62.



What we have just done is very important in our effort to equate two tests that do not share common items but do measure the same trait. From the steps we took, we made a calculation that allows us to take the calibrations of the Test B items and convert them to the scale used to express the Test A items. Now that we know the measures of the Test B items on the Test A scale, we can set (anchor) the Test B items to the Test A scale. We can then conduct a Winsteps analysis of the Test B data set (anchored to the values that put the Test B items on the measurement scale of Test A) and compute person measures for test takers who completed Test B, but express those persons on the Test A scale.

Formative Assessment Checkpoint #2
Question: What does a consideration of two different types of temperature scales have to do with linking two tests?
Answer: There is a relationship between Celsius and Fahrenheit, and this relationship can be used to convert any Celsius temperature to Fahrenheit and vice versa. One does not need to have observed all possible Celsius temperatures to be able to compute this conversion. The equation to convert from one temperature scale to the other can be used with any C and any F temperatures.

Step 3. Examine the scatterplot. Points far away from the joint-best-fit trend-line indicate items that are not good pairs. You may wish to consider these to be no longer paired. Drop the items from the plot and redraw the trend line (Linacre, 2018).

Mike Linacre points out in the Winsteps Manual that if an item is far from the line, it might be a match pair that does not work all that well for us (2018). There could be many reasons for the items not matching. In our plot we can see that there is an item located at about -1.00 on the Test B axis and at about 0.00 on the Test A axis. We can see from our Excel spreadsheet that this is item 8 for Test B and item 35 for Test A. One option is to remove that pair, replot, calculate the slope of the new line, and then compute the x-axis intercept. Below we provide a new plot with item 8 of Test B (matched with item 35 of Test A) removed from the plot.


Also, we provide the computed slope of the line. It is important to note that the type of steps we are conducting (identification of items that do not match our expectations of good measurement) is very similar to steps we have conducted in RAHS Book 1 as well as in other parts of this book. For example, when DIF for items is investigated, if an item seems not to behave in a similar manner for two groups of respondents, we might consider removing the item. One way an analyst can identify such an item in a DIF analysis is through cross plots very similar to those we have made above.

From the plot in Fig. 12.3 we can see that the slope of the best fit line through the 12 remaining match pairs is .856. SPSS (Fig. 12.4) also provides us with the y-intercept of the line (a value of .061).

Step 4. The slope of the best fit is: slope = (S.D. of Test B common items) / (S.D. of Test A common items). Include in the Test B control file:
USCALE = the value of 1/slope
UMEAN = the value of the x-intercept

Fig. 12.3  An SPSS cross plot of the Test A item measures and the Test B item measures. One of the match pairs has been removed from the plot, and a best fit line has been computed using SPSS


Fig. 12.4  Computation of the slope and y-intercept (not x-intercept) of the best fit line computed through use of SPSS. The slope of the line is .856 and the y-intercept is .061

and reanalyze Test B. Test B is now in the Test A frame of reference, and the person and item measures from Test A and Test B can be reported together (Linacre, 2018).

We are almost finished. Through this most recent plot (Fig. 12.3) and analysis (Fig. 12.4), we can use knowledge of the slope of the line (from the plot of the match points) and the y-intercept to work toward values that can be entered into a Winsteps analysis to convert the Test B frame of reference to the Test A frame of reference. Remember, this final step handles the case in which the slope of the line is not 1 and the x-intercept is not 0. Because we know the slope of the line, we can solve for the x-intercept that we need for our Winsteps run that converts Test B to the Test A frame of reference. Remember, the x-intercept is the value of X for which Y is 0. To find it, we must solve the equation of the line for X when Y is 0. Below, we provide the mathematics to solve for the x-intercept using the slope and y-intercept values we have computed.

y = mx + b



y = .856 X + .061

To solve for the x-intercept (remember the x-intercept is the value of X when Y is 0)



0 = (.856)X + .061
-.061 = .856X
-.061 / .856 = X
-.07 = X

Now we can move forward. This step allows us to tell Winsteps what changes must be made in the analysis to place the Test B items on the same scale as the Test A items when the best fit line through the match pairs does not have a slope of 1 and does not have an intercept of 0.


Fig. 12.5  The control file with the appropriate UMEAN and USCALE inserted
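A minimal sketch of such a control file region follows; the TITLE, NI, ITEM1 and CODES lines are hypothetical placeholders (the actual file in Fig. 12.5 may differ), and only UMEAN and USCALE carry the values computed in this chapter:

TITLE = "Test B expressed in the Test A frame of reference"   ; hypothetical title
NI = 30                ; hypothetical number of Test B items
ITEM1 = 1              ; hypothetical starting column of the responses
CODES = 01
UMEAN = -.07           ; the x-intercept of the best fit line
USCALE = 1.17          ; 1/slope = 1/.856
&END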

In the Winsteps Manual, Mike Linacre writes that we must set the value of USCALE and the value of UMEAN in our analysis (2018). If readers need to refresh their memory regarding what USCALE and UMEAN do, they can review those terms using the Winsteps Help tab. For Step 4, the suggestions for conducting virtual equating are:
USCALE = the value of 1/slope
UMEAN = the value of the x-intercept
and reanalyze Test B. Test B is now in the Test A frame of reference, and the person and item measures from Test A and Test B can be reported together (Linacre, 2018).

In our experiment the slope of the line is .856; thus, the value of USCALE will be 1/.856 = 1.17, and the value of UMEAN will be -.07 (Fig. 12.5). Readers can now use these values of UMEAN and USCALE in the analysis of the Test B data. After that analysis, the values of items and persons are on the same scale as that of Test A. Part of the Test B control file, with this correction supplied through UMEAN and USCALE, appears in Fig. 12.5. Finally, with regard to


the slope, Mike Linacre (personal communication, April 24, 2018) kindly indicated to us that the slope calculation above is a simple approximation. If one wishes to compute the exact value, naturally, the statistics package of your choice can make an exact computation.

In this chapter, we made use of our thinking with respect to what it means to measure one variable. We also took on the challenge of investigating the possibility of equating two tests that contain no common items, but in which some items possibly measure the same part of the trait. To potentially equate, it is first important to identify potential match items. Then, it is important to cross plot and investigate the overall pattern of the matched items that were calibrated through two separate runs of Winsteps. When an item pair seems off (not near the best fit line), it is prudent to remove that pair and replot. Finally, once the slope of the line and an intercept can be computed (in our example, with the specific assignment of tests to the x- and y-axes and our goal of expressing Test B in the Test A frame of reference), it is possible, through use of the commands USCALE and UMEAN, to express one test in the frame of reference of another test.

Clearly, for those working with Rasch techniques, the hope is that as tests are designed, linking items (common items) can be included in different forms of a test. However, cases exist in which researchers wish to use data that were collected with two or more tests that measure the same trait but unfortunately have no common items or common persons.

This chapter makes use of thinking presented throughout RAHS Book 1 and this book. For example, our discussion of equating constants (when common items are present to link a test) in another chapter of this book links to the steps described here. In RAHS Book 1 we also presented readers with information regarding rescaling using UMEAN and USCALE when readers wish to create user-friendly scales. The change of items from a logit scale to a user-friendly scale of, say, 200 to 1200 shares some common steps with those outlined in this chapter.

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Hi Chizuko…how is it going with your two math tests that have no common items? I have been losing sleep over your conundrum…did you ever figure out what you might be able to do with no linking items?
Chizuko: Yeah…at first I was a little panicky, but I figured out how to link the tests. The key thing from the get-go is whether or not I am really measuring the same trait with the two tests. I think I am, so that was a good first step. The next step, a critical one, was to determine whether there are any items that I could hypothesize are at the same location on the math trait. I was able to find about 10 items that I think matched. Once I had those items, I used this chapter, which makes use of some ideas from Mike Linacre in the Winsteps Manual. I had to do some cross plots and compute the slope of a line as well as an intercept. Then, once I was confident in those values, I was able to input values into Winsteps for UMEAN and USCALE, and that allowed me to express one test in the frame of reference (on the same measurement scale) as the other test.
Charlie: Very impressive!


Keywords and Phrases
Match items
Slope
Intercept
Best fit line
Cross plotting item difficulties
Using UMEAN and USCALE to express items on a test in the frame of reference of another test

Potential Article Text
For the SWOPA study reported in this paper, a technique named virtual equating was used to link the two math tests to the same logit scale. The two math tests both presented items concerning the topic of addition; however, there were no common items shared by the tests. Through virtual equating, it was nevertheless possible to link the two tests. In order to do so, it is critical that the same variable be measured by both tests. Second, one must have some test items on each of the two tests that one can hypothesize mark the same location on the trait. For this project, 8 items were identified on math Test A and math Test B that were similar in content (e.g., item 3 of Test A was 15 + 4 = ?, item 6 of Test B was 14 + 5 = ?). Each of the two data sets was used to determine Rasch item difficulties through separate analyses. A cross plot was made using the Test A common item difficulty values (x-axis) and the Test B common item values (y-axis). The slope of the line using the 8 sets of item difficulties was 1. This allowed an equating constant to be computed (the intercept of the line with the x-axis provided the equating constant).

Quick Tips
When working toward equating, one must always work with the same variable.
In order to equate, one must compute the y-intercept and the slope of the best fit line. These are the values used to set UMEAN and USCALE in the control file to express a test on the same metric as another test.

Data Sets (Go to http://extras.springer.com)
None.


Activities
Activity #1
For the following set of Fahrenheit temperatures, compute the Celsius temperature: 10 F, -20 F, 20 F, 31 F, 32 F, 33 F, 60 F, 120 F.
Answer: 10 F is -12 C, -20 F is -28 C, 20 F is -6 C, 31 F is -.5 C, 32 F is 0 C, 33 F is .5 C, 60 F is 15.5 C, 120 F is 48.8 C.

Activity #2
Cross plot the temperature data above. Use F for the x-axis and C for the y-axis. Draw a best fit line. What is the slope of the line? What is the y-intercept?
Answer: Using the plotting and statistics program of your choice, you will see that the slope of the best fit line is 5/9 (about .56) and that the y-intercept is about -17.8 on the vertical axis (from C = (F - 32) × 5/9). The line increases from left to right, which tells you that you should end up with a positive slope as you compare C (y-axis) and F (x-axis) temperature values. The intercept on the y-axis is an added factor that you will use as you convert C to F, or F to C. The fact that the line has a slope and a y-intercept value that is not 0 tells you that your conversion from C to F and from F to C will involve both a slope term and an intercept term.

Activity #3
Another temperature scale is named Kelvin (K). Please convert the following Celsius temperature values to Kelvin: -50 C, 0 C, 40 C, 200 C.
Answer: -50 C is 223 K, 0 C is 273 K, 40 C is 313 K, 200 C is 473 K.
The formula to convert from C to K is: Degrees K = 1 × Degrees C + 273. Note the slope of the line is 1, and the intercept is 273.

Activity #4
Cross plot the C and K temperature data of Activity #3. Plot C on the vertical axis and plot K on the horizontal axis. Compute the slope of the best fit line and compute the y-intercept of the line.


Use the computed slope and y-intercept to author the equation that allows one to compute a Celsius temperature for any Kelvin temperature. Answer:

C = (1 × K ) − 273



where 1 is the slope of the line from Activity #4 and -273 is the y-axis intercept.

Activity #5
Try to find two instruments (tests, surveys) that measure the same trait. Can you find potential match items?

References
Linacre, J. M. (2018). Winsteps® Rasch measurement computer program user's guide. Beaverton, OR: Winsteps.com.
Luppescu, S. (2005). Virtual equating. Rasch Measurement Transactions, 19(3), 1025.

Additional Readings
Lee, O. K. (1992). Calibration matrices for test equating. Rasch Measurement Transactions, 6(1), 202-203.
Ryan, J., & Brockmann, F. (2011). A practitioner's introduction to equating with primers on Classical Test Theory and Item Response Theory. Washington, DC: Council of Chief State School Officers. Retrieved from www.ccsso.org/Documents/Equating%20HandbookCoverANDinterior.pdf
Wolfe, E. W. (2000). Equating and item banking with the Rasch model. Journal of Applied Measurement, 1(4), 409-434.
A very clear article written by Ben Wright, in which he discusses 8 issues that must be considered when equating tests. The issues considered in Wright's article impact any type of equating: common item equating, common person equating, and the virtual equating described in this chapter.
Wright, B. D. (1993). Equitable test equating. Rasch Measurement Transactions, 7(2), 298-299.

Chapter 13

Computing and Utilizing an Equating Constant to Explore Items for Linking a Test to an Item Bank

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Hey Chizuko, I thought you were done with work today? Why are you still here?
Chizuko: Ah…it's the same old story, an email request. Someone sent me test data collected in Spring 2012 and additional test data collected in Spring 2015. The two tests contain some common items; I have been asked to link the tests and to make sure that the items are not moving around too much on the trait. Do you have any idea how I might do this? I have read about equating constants, but I have not had much exposure to that concept.

Tie Ins to RAHS Book 1
In our first book, we presented several issues to consider when researchers want to link tests. For example, using anchor items allows researchers to produce the same measurement scale regardless of which test form is used. We also included a chapter that introduced readers to the idea of Differential Item Functioning (DIF). The example we discussed focused on investigating whether the way in which a trait was defined by a set of items was the same as a function of gender. If an item appeared to define the trait in a different way as a function of gender, we suggested several next steps: for example, removing the item from the analysis. Another possibility is to view the item as two items, one male item and one female item, each with its own calibration.

In this chapter we provide an introduction to how researchers might look at the changes in items (the drift of items) when they attempt to link a test that is administered at two time points using some common items. Readers will see that we provide some new ideas, but these concepts also touch upon our previous chapters on item anchors. Readers also have to be alert to a second issue, which we touched upon in RAHS Book 1: a logit on one scale does not necessarily have the same



meaning as a logit on another scale. If the same set of items has been used at two time points to collect data, the logit values computed from separate analyses of the two data sets cannot be assumed to have the same meaning at both time points unless anchoring has been used. Readers should appreciate that just as it is important to evaluate differences in item calibration prior to linking one test to another, such evaluations are also important when linking with persons (Chap. 11 of this book).

Introduction
In this chapter, we tackle a common issue that arises when researchers wish to link two tests. This issue goes a bit beyond some of our discussions in RAHS Book 1, but this chapter also uses the concepts of RAHS Book 1 to help us explain the issue to readers. In essence, we are interested in identifying whether some items "drift" too much from one test administration to another. If an item drifts too much (its calibration has changed from one administration of the instrument to another), we might need to remove that item from our analysis.

Setting the stage for our discussion, readers should reflect on the whole point of item anchoring. Recall that when items are common to two different test forms, item anchoring can link the two forms. In our previous chapter on linking, we discussed questions such as: How many anchors are needed? Where should the anchors be? What are the mechanics of anchoring with Winsteps? Readers also learned how critical it is to achieve linked scales. In another previous chapter (in RAHS Book 1) that is critical to understanding this new chapter, we introduced the often-difficult concept of DIF. DIF involves understanding that a researcher wants the location of items along the trait to be relatively stable. If one has a test, and one is interested, for example, in comparing the performance of males and females, then it is important to investigate whether or not the items of the test function in a similar manner for males and females.

In this chapter, we present some steps that a researcher might take to investigate the movement of test items (the drift of items) when attempting to link a test form to a test item bank. Guiding readers through the steps, we make use of two paragraphs of a technical report authored by the State of Ohio (2006): "March 2006 Ohio Achievement Tests Administration Statistical Summary". This technical report details the steps the State of Ohio took to investigate, in our opinion, the potential movement of items (drift) when comparing test item calibrations in an item bank (bank items) and test item calibrations computed as the result of the administration of a test (current items). We focus on parts of that discussion step by step and explain the process. As readers review the steps we provide, it is helpful to remember that many techniques are used to check on the movement of items when attempting to link a test. Fortunately, all of the techniques at their core contain the same general philosophy. A 2008 report by the American Institutes of


Research (AIR) provides a nice summary of different techniques employed by various research groups (Cohen, Jiang, & Yu, 2008). The AIR report comments that the State of Ohio and the State of Washington used the technique we describe below (Cohen et al., 2008). But now let us move on to the steps that this Ohio research group reported in 2006 as they investigated the drift of test items. As we move through the steps, we provide the text from the report and then talk readers through each step.

In this chapter we utilize Winsteps-computed item difficulties, and then we make some "by hand" computations that allow us to calculate an equating constant, which is then used to help us identify items (current items) that may have drifted too far from the "bank items" for high quality measurement. It is important to note that when item anchors are used in Winsteps, one is provided with the "displacement" of an item. That information is, of course, very useful for the researcher and provides an added way of investigating drift. We have found that the steps we provide in this chapter to introduce the computation of an equating constant, and how an equating constant can be used, are helpful for advancing our students' Rasch learning. This also helps them better appreciate the "displacement" value that Winsteps provides when one conducts an analysis with anchored item values.

So let us begin our work with computing and using an equating constant. Please remember that for this lesson we are making use of the Ohio technical report cited above. Our initial guidance, which we call Step 1, is provided here: "Calibration and equating proceeded through four steps. First, the March 2006 operational difficulty values were computed from the early return data and compared to the 'bank' or reference difficulty values" (Ohio Department of Education, 2006, p. 14). March 2006 refers to the date at which item difficulties were computed for the Ohio Department of Education.

What does this mean? First, we pretend that test items were administered in Spring 2012, and these items were viewed as the item bank. Item difficulty values were computed from the Spring 2012 administration of the test. We also pretend that the Spring 2012 data collection involved a very large sample; moreover, we are confident that these item difficulty values are really good ones! Therefore, the Spring 2012 test data can be used to produce the bank difficulty values, which can also be called the "reference difficulty" values or the "item bank difficulties". We also pretend that an administration of the test in Spring 2015 is referred to as the "current administration"; the items calibrated using the Spring 2015 data collection provide the "current" item difficulties.

Now let us go on to Step 2: "The mean difference between the current and the bank difficulties of the anchor items is called the equating constant" (Ohio Department of Education, 2006, p. 14).

What does this mean? The statement says that if you take all of the anchor items, the items that appear on both the Spring 2012 test and the Spring 2015 test, you can calculate the average difference in difficulty between them; this is called the equating constant. Moreover,


computation of the mean difference refers to the phrase "computed and compared" of Step 1 above.

Below (Fig. 13.1) are some sample item difficulties. The data labeled Bank Difficulties & Master Item ID provide a list of items (with ID designations) and their item difficulties (in logits). These are the item "bank" difficulties. Imagine that these items were administered to a large group of students and that we have strong confidence in the item calibrations. The data labeled Current Difficulties & Master Item ID provide the item difficulties using only the 2015 data. These are the "current" item difficulties. It is these "current" difficulty values that we will compare (using a correction) to the "bank" difficulties. If there is too much of a difference (too much drift) between the "current" difficulty of an item and the "bank" difficulty of an item following the correction, then it would not be wise to use the item as an anchor. Thus, the data on the left-hand side ("Bank Difficulties") are from 2012, and the data on the right-hand side are the difficulties from the 2015 data.

Following the computation of the two averages, the equating constant is computed by a simple subtraction! By computing an equating constant, we are able to put the "current" items on the same logit scale used to express the difficulty of the "bank" items. Using these two values, the equating constant can be computed:

.4825 (Bank Average) - .5675 (Current Average) = -.085 (The Equating Constant)

Then the equating adjustment can be applied to all current items:

Current item difficulty + equating constant = Current item difficulty expressed on the Bank scale

To review our steps so far: we have conducted a Rasch analysis with the current data set (which we named Spring 2015), so we know the logit values of all the items on the test. We are particularly interested in the values for the Spring 2015 items that also appear in the item bank. These are the common items that might serve as anchors to link a test form to an item bank.

Fig. 13.1  Sample item difficulties comparing the current and bank difficulties in terms of logits and the corresponding Master Item ID


Following the computation of the measures of the Spring 2015 test items, we can compute the average measure of the common items (the items that appeared in both Spring 2012 and Spring 2015) from the analysis of the Spring 2015 data. Also, we can compute the average of the item measures from the bank (which we named Spring 2012), again using only those items that appear in both the current test and the item bank. To compute the equating constant, we calculate the difference between the Spring 2015 average and the Spring 2012 average. Of course, we must use only those items that appear both in the item bank and in the new form of the test!

Step 3: "…the equating constant was added to each March 2006 difficulty value so that the mean item difficulties from the March 2006 administration and from the bank were equal" (Ohio Department of Education, 2006, p. 14).

Translation: This means that the value of the equating constant was added to the items common to the item bank (from the Spring 2012 test) and the current test (Spring 2015). More specifically, the equating constant is added to the difficulty values of the common items as computed from the Spring 2015 data analysis. The net effect of adding the equating constant to each of the Spring 2015 item calibrations is that the average item difficulties (for the common items) of the Spring 2012 test and the Spring 2015 test are on the same logit scale. Let's now see how an analyst follows these steps. Below (Fig. 13.2) we provide the mathematics in which the equating constant is added to all item difficulties from the current Spring 2015 test.

Fig. 13.2  Current item difficulties with the equating constant added


Now we typically do a calculation to check our equating constant calculation: (1) compute the average of the current item difficulties that have, in effect, been corrected with the equating constant; and (2) compare this average to the average of the eight test items from the item bank. The average value of all of the current Spring 2015 test items, after adding the equating constant of −.085 to each, is .4825:

(−.585 + .115 − .185 + .245 + .795 + .685 + 1.215 + 1.575) / 8 = .4825

This value, .4825, is indeed the average of the bank difficulties; we computed it in our Step 2 calculations. Now that we have computed the equating constant and checked our work, what is the next step?

Step 4: Next, the adjusted current values were subtracted from the bank values to identify the item with the largest absolute difference between the two values. If the absolute value of the difference was greater than 0.3 logits, the item was eliminated as an anchor item (Ohio Department of Education, 2006, p. 14).

Below (Fig. 13.3) are the linked current values. These are the item difficulties of the current Spring 2015 test after the equating constant has been included. Also included are: (a) the item difficulty values for the calibrations of the items from the item bank (think Spring 2012 data); (b) the corrected current test item values; (c) the computation of the difference between the two values for each item; and (d) identification of the item with the largest absolute difference via these computations.

Fig. 13.3  Corrected current item difficulties and bank difficulties, showing the difference between the two logit values


A review of the differences between the corrected item difficulty and the item bank difficulty reveals that the largest absolute difference is .415 logits, for item A1403. If we use the rule of .3 logits to identify items that have drifted too much, we would no longer use that item as an anchor. The item has drifted too much to be of use.

Step 5: "This procedure was repeated until the largest difference between an adjusted current value and bank value was less than 0.3 logits. This procedure ensured that the items used to anchor the operational test to the reference scale were stable" (Ohio Department of Education, 2006, p. 14).

This means that one goes back to Step 2 with the reduced set of anchor items. The process repeats itself until no items exhibit a drift of more than .3 logits. The steps that we outlined above are just one of many techniques that researchers use when deciding which items they may or may not use when tests are linked. For readers who are interested in details regarding other techniques, we suggest reviewing Cohen, Jiang, and Yu's work, "Information-Weighted Linking Constants" (2008).

Formative Assessment Checkpoint #1
Question: What is one way to evaluate whether or not an item on a test can be anchored to an item bank measure?
Answer: Compute the average difficulty of the common items twice: once from the new test data, and once from the item bank values. Then compute an equating constant and add its value to the item measures of the new test. This puts each "current" item on the same logit scale as that of the respective "bank" item. If an item has not drifted too much, the corrected test item measure and the bank measure should differ by less than 0.3 logits. But it is important to note that a number of other rules of thumb can be proposed.
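Steps 4 and 5 amount to a simple loop: compute the constant, correct the current values, find the worst item, drop it if it exceeds the cutoff, and repeat. Below is a minimal Python sketch of that loop; the function name, the dictionary layout, and the data values are ours for illustration, not taken from Fig. 13.1 or from any Rasch software.

```python
# A minimal sketch of the Step 4-5 loop. Bank and current difficulties
# are stored in dicts keyed by Master Item ID; the values below are
# illustrative only.
def select_anchor_items(bank, current, cutoff=0.3):
    """Repeat Steps 2-5 until no remaining item drifts by `cutoff` logits or more."""
    items = set(bank) & set(current)               # common items only
    while items:
        # Step 2: equating constant from the common-item averages.
        constant = (sum(bank[i] for i in items) / len(items)
                    - sum(current[i] for i in items) / len(items))
        # Steps 3-4: correct the current values; find the largest drift.
        drift = {i: abs(bank[i] - (current[i] + constant)) for i in items}
        worst = max(drift, key=drift.get)
        if drift[worst] < cutoff:                  # Step 5: stable anchor set
            return items, constant
        items.remove(worst)                        # drop the drifted item; repeat
    return set(), 0.0

bank = {"A1401": -0.50, "A1402": 0.30, "A1403": 0.90, "A1404": 1.21}
current = {"A1401": -0.45, "A1402": 0.38, "A1403": 1.45, "A1404": 1.28}
anchors, constant = select_anchor_items(bank, current)
print(f"Usable anchors: {sorted(anchors)}; equating constant = {constant:.3f}")
```

In this toy run, one item (A1403) drifts by more than .3 logits after correction and is dropped; the loop then finds the remaining three items stable and returns them as the usable anchors.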

A Final Comment

The technique that we outlined above employs techniques that we have not yet discussed herein, or in RAHS Book 1. Appreciation of overlap between this chapter and other book topics will help readers deepen their understanding of Rasch. First, in Rasch we are always interested in a single construct, in one variable. With one variable we can measure and advance. As part of the quality control that we employ when measuring with a test or survey, we want to make sure that our instrument functions in a consistent fashion from time point to time point, and from group to group. In our first book, we wrote at length about DIF. In many of our examples, we asked readers to consider the measurement of males and females with a test. We stressed that it was important to evaluate the manner in which a set of test items defined the trait for males, and the way test items defined the trait for females.


If there was a difference in the way items defined the trait for the two subgroups, then such an occurrence might impact the quality of our measurement. Such shifting of an item along the trait can be visualized by placing a Wright Map of items calibrated using only males next to a Wright Map of items calibrated using only females, and then looking for large shifts in the ordering and spacing of items. Of course, we demonstrated how researchers would evaluate those shifts statistically. Just as we helped readers realize that a shift in items when comparing males and females is something to consider when advancing toward high-quality measurement, we point out that the same issues are present when a researcher wishes to anchor a scale to a set of items from an item bank. Of course, the advantage of being able to anchor is that a scale can be linked at, in this example, two time points. This allows a researcher to measure, as in this example, the growth of test takers. However, when anchoring a set of items, it is important to evaluate the degree to which an item may have drifted between the time point of the new test and the item difficulty computed for the item bank. Once a researcher follows the steps as outlined above, in the end the researcher will have items that do not differ by more than .3 logits when comparing the bank item difficulty values and the current test item difficulty values. Getting to a point at which no items differ by more than .3 logits can be viewed as the point at which items have not drifted so much as to impact the fidelity of the measures.

Formative Assessment Checkpoint #2
Question: What is meant by the term drift? What does drift have to do with Rasch thinking?
Answer: Drift has to do with the manner in which the difficulty of an item may change from time to time. In Rasch we understand that the difficulty of an item can be viewed as marking a trait. This is important, in that the ability of a person is determined by how a person answers each item and the difficulty of the item. A test item with a higher difficulty than a person's ability level would be predicted to be incorrectly answered (for a right/wrong test). If the difficulty of an item has changed, this can impact decisions made about a person's ability. When linking two forms of a test, it is important to evaluate how much, and to what degree, the difficulty of an item has changed. Finally, when equating two forms, it is possible to compute an equating constant. One can think of this constant as a tool that helps one evaluate item drift.

What should be done with items that have drifted and that you thus do not want to use for linking your scales? One option is to remove the drifted item from your analysis. Another option is to retain the drifting test item, but not to anchor that item. From a measurement perspective, this means that the item, although identical in wording and format for both test administrations, needs to be viewed as two different items.


As we have made presentations on this issue, we have found that, for those not familiar with measurement techniques, the phrase "two items are identical, but we treat the items as different" expresses a very difficult concept to understand. As a result, when we have a large number of items in a test, we often will completely remove such items from our analysis. However, there are cases when items are at a premium, and we retain the item even though it needs to be viewed as a different item. As we have mentioned, when items are anchored with Winsteps, one is provided with a displacement value. The Winsteps Manual (Linacre, 2018) suggests "items with displacements of more than 0.5 logits, that are also bigger than the item S.E.s, are candidates for recalibration".

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Well… I see you look like you are wrapping up your work for the day. Does that mean that you understand how to make use of an equating constant?
Chizuko: I hope this does not surprise you, but yes, I do get it. Here is the thing that helped me understand the issue: I do feel I understand DIF… it makes complete sense to me that, if I have a test, I need to make sure that the way items mark the trait for men and women is the same. It does not mean that I am saying that men or women perform better, but what I AM saying is that my measurement instrument must work in the same way for both groups. The marks on my meter stick must be sample independent. This is the way a real meter stick in the world functions.
Charlie (sheepishly interrupting): But what is your point about DIF and this topic of an equating constant and drift?
Chizuko: Getting to it. I finally understand that the same concepts, and concerns, we have with DIF are also present when we attempt to link test items to an item bank. When we use items from an item bank, it is very important to look at whether or not the items have shifted along our ruler. If the current items have shifted (drifted) too much (in this case, more than .3 logits), then we should not anchor a current item to a bank value. This is because there has been too much drift with the item. The bottom line is that when we follow the procedures outlined in Steps 1–5, we can identify the items that are common between our test and the item bank that we can confidently use as anchors. Next, since I want to anchor my new test to the scale of past tests, I would use the item measures from our item bank.
Charlie: That is really cool! Not only the steps to anchor, but also that the concepts one has to understand for DIF are the same concepts one has to know to understand item drift! In this chapter we learned how to compute an equating constant, which in the end allowed us to confidently identify items that may have drifted too much. By too much drift, I mean drift that might compromise the measurement that we want to conduct.


Keywords and Phrases
Equating constant
Current test item difficulty
Item bank
After adding the equating constant to the current test item difficulty, is the corrected test item difficulty within 0.3 logits of the item bank difficulty?
The topic of item drift (and using an equating constant) makes use of similar thinking as is needed to understand DIF.

Potential Article Text

In 2010, the Calhoun School District developed a science test item bank of 400 multiple-choice items using a sample of 10,000 students. The item difficulties (the bank difficulties) of those items were calibrated using Winsteps. In 2019, a new test was administered to district students. The new test included 15 anchor test items as well as 20 new test items. In order to anchor the 2019 test to the scale defined by the item bank, a Winsteps analysis of the 2019 data was conducted. Next, the average item difficulty of the 2019 anchor items (the current items) was computed. This average item difficulty (current items) was compared to the average item difficulty of the same 15 test items present in the item bank (bank items). The difference between these two averages allowed an equating constant to be computed (Bank Average − Current Average = Equating Constant). This value was then added to the item measure values computed for the 2019 test items. This can be expressed as "current item difficulty + equating constant = current item difficulty expressed on bank scale." Finally, a comparison was made between the corrected 2019 test item difficulties and the item bank difficulties. When an item exhibited a difference of more than .3 logits when comparing item bank difficulties to corrected 2019 item measures, the item was removed as a potential anchor. This procedure was repeated until no item exhibited a drift of more than .3 logits. In total, two of the 15 items exhibited drift larger than .3 logits, and these items were removed from the analysis. Therefore, 13 items of the 2019 test were anchored to the item measure values computed for the item bank.


Quick Tips
In this chapter, five steps are outlined that one can follow to compute the equating constant needed to link one test form to another test form.
When interested in investigating "drift," consider using the displacement value that is reported by Winsteps for an item when an item anchor is used in an analysis.

Data Sets
None needed.

Activities
Activity #1
Repeat steps 1–5 of the procedure for the following data:

Answer: The answers to these steps are provided in the chapter text.


Activity #2

Calculate the average of the item bank difficulties, and calculate the average item difficulty of the current test.
Answer: Bank Average is .5575. Current Average is .59.
Activity #3
In your own words, attempt to describe to a colleague why it is important to investigate item drift (how much the item difficulty of an item on a test differs from the item difficulty computed for an item bank).

References
Cohen, J., Jiang, T., & Yu, P. (2008). Information-weighted linking constants. American Institutes for Research. Retrieved from http://assessment.air.org/psychometrics_docs/linking_constants.pdf
Linacre, J. M. (2018). Winsteps® Rasch measurement computer program user's guide. Beaverton, OR: Winsteps.com.
Ohio Department of Education. (2006). March 2006 Ohio achievement tests administration statistical summary. Retrieved from https://education.ohio.gov/getattachment/Topics/Testing/Testing-Analysis-and-Statistics/Statistical-Summaries-and-Item-Analysis-Reports/March2006-OGT-Statistical-Summary.pdf.aspx


Additional Readings
No author listed. (2004). Equating constants with mixed item types. Rasch Measurement Transactions, 18(3), 992.
Rentz, R., & Bashaw, W. (1975). Equating reading tests with the Rasch model. Volume II: Technical reference tables. Washington, DC: Distributed by ERIC Clearinghouse.
A lengthy discussion between Mike Linacre and others concerning the ins and outs of equating: Old Rasch Forum – Rasch on the Run: 2011, 403. Identity & Empirical Lines (2011). Retrieved from https://www.rasch.org/forum2011.htm

Chapter 14

Rasch Measurement Estimation Procedures

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Hi Chizuko, I've been wondering about something… It seems that different procedures are used to compute Rasch person measures and item measures. Some programs use one procedure, and other programs use another procedure. The common goal is to compute a Rasch measure… do you think, in my work, it makes any difference which procedure I use?
Chizuko: That's a great question, Charlie. I think there must be some ins and outs to this question, probably similar to what a researcher thinks about when deciding whether to examine Infit or Outfit. There might be some pros to one method over another, and vice versa, but in the end, maybe for the work that you and I do, it does not make much difference. Why don't we think, read a little, and conduct some experiments?

Tie Ins to RAHS Book 1

In RAHS Book 1 we presented many different Rasch topics – for example, MNSQ Infit, MNSQ Outfit, item reliability, person error, and so on. There are, of course, mathematical calculations that are used to compute these and other values in Winsteps. Many of these mathematical steps are outlined in Best Test Design (Wright & Stone, 1979) and Rating Scale Analysis (Wright & Masters, 1982). In this chapter, we explore whether using different applications of the Rasch model (and different Rasch software programs) to compute person measures and item measures really makes much of a difference.



Introduction

One topic that we did not address in our first book concerns whether or not the choice of an estimation technique to compute Rasch item measures and person measures makes much difference. In most situations, when researchers in education, medicine, or business wish to develop a measurement instrument and compute person measures and item measures, they are experts in fields other than psychometrics. Thus, they must depend upon those who think about measurement day and night. In this chapter, we present some basic background on some of the many estimation procedures that are used in Rasch software to compute Rasch item measures and person measures. While estimation procedures exhibit differences, our view is that few differences are great. For this chapter, we will rely primarily on a simple experiment that we conducted with colleagues using a data set evaluated by two Rasch programs. Two highly recommended readings for this chapter are "Rasch Model Estimation: Further Topics" (Linacre, 2004) and "Understanding Rasch Measurement: Estimation Methods for Rasch Measures" (Linacre, 1999). A third resource regarding different estimation procedures is the extensive Winsteps Manual (Linacre, 2018).

Different Estimation Procedures

As Rasch analysis has been applied for the development of measurement instruments and for the computation of person measures and item measures, psychometricians have proposed a range of estimation procedures. An entire book would be required to consider and compare all estimation techniques. A table provided on the Rasch Measurement website (http://www.rasch.org/software.htm) reports many of the estimation procedures for the numerous software programs that allow analysts to conduct a Rasch analysis at some level. Herein we discuss only a portion of the full list of Rasch programs and estimation procedures reported on the rasch.org website. To start off, here are some key programs and the estimation procedures they use: Winsteps (JMLE, PROX); FACETS (JMLE, PROX); ConQuest (MMLE, JMLE); RUMM2030 (PMLE, WMLE); WINMIRA (CMLE); XCalibre (EM); Quest (JMLE); jMetrik (JMLE, PROX); ConstructMap (MMLE). Needless to say, numerous estimation procedures have been and are currently being used in Rasch software programs. To consider the implications (and the lack of implications) of the estimation procedure a program uses, we begin with a brief, mostly nontechnical introduction to the topic of estimation procedures and present a set of issues that can be considered when pondering estimation methods for Rasch software programs. The text we author below is based on comments and observations made by Mike Linacre in two articles entitled "Estimation Methods for Rasch Measures" (Linacre, 1999) and "Rasch Model Estimation: Further Topics" (Linacre, 2004). We suggest that readers consider reading both articles.

Where is the starting point? We suggest two initial steps. Step 1 is to review how Winsteps ultimately starts with a set of respondents and their answers to items (right/wrong or rating scale) and computes Rasch measures. Step 2 is to compare the person measures and item measures computed with two different Rasch programs that use different Rasch estimation methods.

Step 1: The JMLE Estimation Procedures of Winsteps and FACETS

The Winsteps Manual (2018) provides guidelines to help researchers develop a quick understanding of how the Winsteps program uses Joint Maximum Likelihood Estimation (JMLE). Below we provide text from the Winsteps Manual to facilitate readers' understanding of these details. Here we briefly consider the PROX estimation method, which is used to set the stage for JMLE estimations. It is important to note that JMLE also goes by the name of UCON, Unconditional Maximum Likelihood Estimation. JMLE was proposed by Wright & Panchapakesan in 1969.

Winsteps implements two methods of estimating Rasch parameters from ordered qualitative observations: JMLE and PROX. Estimates of the Rasch measures are obtained by iterating through the data. Initially all unanchored parameter estimates (measures) are set to zero. Then the PROX method is employed to obtain rough estimates. Each iteration through the data improves the PROX estimates until they are usefully good. Then those PROX estimates are the initial estimates for JMLE which fine-tunes them, again by iterating through the data, in order to obtain the final JMLE estimates. The iterative process ceases when the convergence criteria are met. These are set by MJMLE=, CONVERGE=, LCONV= and RCONV=. Depending on the data design, this process can take hundreds of iterations (Convergence: Statistics or Substance?). When only rough estimates are needed, force convergence by pressing Ctrl+F or by selecting "Finish iterating" on the File pull-down menu. (Linacre, Winsteps Manual, 2018, p. 574)

This means that Rasch measures in Winsteps are computed by first implementing procedure #1 (PROX) to compute initial estimates of each person measure and item measure. These person measures and item measures are then "fine-tuned" using procedure #2 (JMLE). The Winsteps Manual also provides the following comment regarding JMLE:

JMLE "Joint Maximum Likelihood Estimation" is also called UCON, "Unconditional maximum likelihood estimation". It was devised by Wright & Panchapakesan, www.rasch.org/memo46.htm. In this formulation, the estimate of the Rasch parameter (for which the observed data are most likely, assuming those data fit the Rasch model) occurs when the observed raw score for the parameter matches the expected raw score. "Joint" means that the estimates for the persons (rows) and items (columns) and rating scale structures (if any) of the data matrix are obtained simultaneously. The iterative estimation process is described at Iteration. (Linacre, Winsteps Manual, 2018, p. 574)
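To make the "observed raw score matches the expected raw score" idea concrete, here is a toy Python sketch of a JMLE-style iteration for dichotomous data. It is emphatically not the Winsteps algorithm (there is no PROX start, no convergence options, and no bias correction), and every name in it is ours; it simply illustrates the matching logic and assumes no person or item has a perfect or zero score.

```python
import math

def jmle_sketch(X, iters=100, step_cap=1.0):
    """Toy JMLE-style estimation for a persons-by-items 0/1 matrix X.

    Alternates Newton updates so each person's and item's expected raw
    score approaches its observed raw score. Assumes no extreme scores.
    """
    n_persons, n_items = len(X), len(X[0])
    theta = [0.0] * n_persons   # person measures (logits)
    delta = [0.0] * n_items     # item difficulties (logits)

    p = lambda t, d: 1.0 / (1.0 + math.exp(d - t))  # Rasch success probability

    for _ in range(iters):
        # Update each person: move theta until expected score matches observed.
        for n in range(n_persons):
            probs = [p(theta[n], delta[i]) for i in range(n_items)]
            resid = sum(X[n]) - sum(probs)              # observed - expected
            info = sum(q * (1 - q) for q in probs)      # Fisher information
            theta[n] += max(-step_cap, min(step_cap, resid / info))
        # Update each item (sign flipped: harder items are answered correctly less often).
        for i in range(n_items):
            probs = [p(theta[n], delta[i]) for n in range(n_persons)]
            resid = sum(X[n][i] for n in range(n_persons)) - sum(probs)
            info = sum(q * (1 - q) for q in probs)
            delta[i] -= max(-step_cap, min(step_cap, resid / info))
        # Anchor the scale: center the item difficulties at zero.
        mean_d = sum(delta) / n_items
        delta = [d - mean_d for d in delta]
    return theta, delta

# Tiny illustrative data set: 4 persons x 3 items, no extreme scores.
X = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [1, 1, 0]]
theta, delta = jmle_sketch(X)
print([round(t, 2) for t in theta], [round(d, 2) for d in delta])
```

Each Newton step nudges a measure until its expected raw score equals its observed raw score; centering the item difficulties on each pass anchors the logit scale, just as Rasch software must anchor its scale somewhere.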


Formative Assessment Checkpoint #1
Question: Why are different mathematical procedures used in different Rasch programs to estimate a person measure and an item measure? Don't all the different programs use the Rasch model?
Answer: First, it is important to remember that the theory and thinking of Georg Rasch is the centerpiece of Rasch programs. Measures are computed along a single variable. Analysts attempt to evaluate the fit of the data to the Rasch model. Person measures and item measures are computed. Second, when an analyst uses a data set to compute a person measure and an item measure, data are taken from the real world, with all the noise of such data. This is, in our mind, where the different mathematical steps used in different Rasch programs come into play. Many analysts argue that one procedure is better than another procedure. In our experiments, we assert that generally few major differences exist between the different procedures. Moreover, each procedure has minor pluses and minuses.

Step 2: What Are the Meaningful Differences, if Any, Between the Estimation Procedures Used in Winsteps and Other Software Programs?

Before discussing and dissecting some of the differences between the variety of estimation procedures, we note that many highly technical published and presented papers have asserted a preference for one estimation method over another. In our work, quite honestly, we want first and foremost to use the Rasch model, for all of the reasons detailed in our first book and this book. Second, if the specific estimation procedures result in nearly identical results, for our work it does not matter which procedure was used to compute our item measures and person measures. Just as a meter stick will never be perfectly constructed, or perfectly used, our question to ourselves as practitioners in psychology, medical research, and education is simply: Are the measures good enough for us to confidently conduct our research? In our work with Winsteps, we have found this to be the case. Moreover, as we will demonstrate, in terms of estimating the person measures and item measures of a data set, we have confidence in the other estimation procedures, too. But we prefer the well-documented Winsteps program.

To explore the impact (and lack of impact) of using different software, we conducted a simple experiment that we recommend readers also conduct with their own data sets of varying sizes (different samples of respondents, different numbers of items). In this experiment, we, along with our colleague Carina Gehlen (a Ph.D. student of Maik Walpuski) in Essen, Germany, computed the Rasch person measures for a large sample of respondents using Winsteps. We then computed the person measures using a different Rasch program (ConQuest). Each data set investigated via Rasch will have a unique sample size, a unique number of test items, and a unique pattern of item difficulty. Also, the distribution of person measures will be unique to a sample. Therefore, this little experiment is unique to our data set! The data set we used in our experiment is composed of 1043 students who completed a test built using 160 items. Figure 14.1 below presents a cross plot of the person measures from a Winsteps analysis and a ConQuest analysis of the data. As is clear from the cross plot, the measures lie on a line, indicating that the ordering and spacing of the test takers do not change as a function of which analysis program was used. Continuing to explore the relationship between the person measures computed using these two programs, we computed a correlation coefficient. Its value is .999 (see Fig. 14.2). An additional component of our experiment compared the ordering and spacing of test items as a function of the Rasch analysis program utilized. Figure 14.3 presents a scatter plot of the item measures, which were computed with Winsteps and with ConQuest.

Fig. 14.1 A SPSS scatter plot of the person measures for a sample of 1043 test takers. The Winsteps person measures are presented on the x-axis, and the ConQuest measures are presented on the y-axis

Correlations

                                       Person_Winsteps   Person_Conquest
Person_Winsteps   Pearson Correlation        1                .999
                  Sig. (2-tailed)                             .000
                  N                        1043               1043

Fig. 14.2  The Pearson Correlation coefficient when person measures computed with Conquest are compared to the person measures computed with Winsteps. Correlation computed through use of SPSS

Fig. 14.3 A SPSS scatter plot of item measures computed using Winsteps and Conquest Rasch software. The Winsteps item measures are presented on the x-axis, and the Conquest item measures are presented on the y-axis. No meaningful differences in item ordering and spacing are observed

The scatter plot reveals little, if any, change in the ordering and spacing of test items along the trait. As was done for the investigation of person measures, a correlation coefficient was computed to compare item ordering and spacing as a function of the analysis program. That correlation coefficient was .998 (see Fig. 14.4). The nearly identical person measures computed using both software programs predict what is observed for the test item measures: if the person measures remain the same, the way items define the trait likely did not change as a function of the software program used.
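Readers who export measures from two programs can reproduce this comparison in a few lines. Below is a minimal Python sketch, assuming both sets of person measures have been saved to one CSV; the file and column names are hypothetical placeholders, not output of either program.

```python
# A minimal sketch of the two-program comparison, assuming the measures
# were exported to a CSV. File and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("person_measures.csv")      # columns: winsteps, conquest

r = df["winsteps"].corr(df["conquest"])      # Pearson correlation
print(f"Pearson r = {r:.3f}")

# Cross plot: a tight straight line means ordering and spacing agree.
plt.scatter(df["winsteps"], df["conquest"], s=10)
plt.xlabel("Winsteps person measure (logits)")
plt.ylabel("ConQuest person measure (logits)")
plt.show()
```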

Correlations

                                     Item_Winsteps   Item_Conquest
Item_Winsteps   Pearson Correlation        1             .998
                Sig. (2-tailed)                          .000
                N                        160             160

Fig. 14.4  The correlation between Winsteps item measures and Conquest item measures. A very strong correlation is observed between the item measures computed with the two software programs. Computed through use of SPSS

Obviously, our experiment is for one data set, one specific sample of respondents, and one specific sample of test items. Each sample is unique regarding the number of items and respondents. Also, the distribution of person ability and item difficulty is unique to our data set. We suggest that readers who wish to compare different software programs simply conduct comparisons similar to those presented here.

Added Steps to Understand the Similarities and Differences of Rasch Programs

For those who are interested in learning and applying Rasch to other fields of research, be they medicine or education, a great deal of time is required to master a single software program. For example, how should researchers use the bells and whistles of a program? How should researchers interpret tables and plots? Therefore, comparing multiple Rasch programs, no matter the ease of program use, can be time consuming, as such analyses depend upon fluency with each specific Rasch program. We encourage readers to review two plots in Linacre's 1999 article entitled "Understanding Rasch Measurement: Estimation Methods for Rasch Measures" in the Journal of Outcome Measurement; the figures of that article to review are its Figs. 2 and 3. These two plots present results similar to ours. Primarily, the plots present the relationship between person measures computed with Winsteps, which uses JMLE to reach a person measure, and person measures computed using other software that uses alternative estimation techniques. Below we provide text describing that comparison.

Using a common data set, Winsteps was run in its default mode, which does not correct for estimation bias. These person measures were compared to person measures computed for Quest (which uses JMLE), RUMM (which uses a Pairwise estimation), Conquest (which uses MMLE) and LPCM (which uses CMLE). (Linacre, 2018)

The pattern observed in our experiment, presented in Fig. 14.1, shows no significant difference in person measures computed as a function of the Rasch programs utilized. Following his analysis, in which person measures were compared for numerous Rasch programs, Linacre states:

Each Rasch estimation method has its strong points and its advocates in the professional community. Each also has its shortcomings. Nevertheless, when the precision and accuracy of estimates are taken into account (Wright, 1998), all methods produce statistically equivalent estimates. Care needs to be taken, however, when estimates produced by different computer programs or estimation methods are to be compared or placed on a common measurement continuum. (Linacre, 1999, p. 402)

Formative Assessment Checkpoint #2
Question: Many different Rasch programs are available, and some use different estimation procedures. What techniques might be used to check the impact, if any, of using one program or another?
Answer: One of the best procedures to check the impact of using one program as opposed to another is making some cross plots. Using cross plots, a researcher can identify any difference between person measures. Another strategy is to read the articles that are referenced in this chapter; researchers will encounter discussions of some common misunderstandings regarding estimation procedures, and the analyses of data sets in the referenced articles help one better discern the impact or non-impact of one estimation procedure over another. Finally, although parts of the manual are technical, searching the extensive and detailed Winsteps Manual for details regarding estimation procedures will reap substantial benefits.

Readers may wonder about the meaning of (L-1)/L in some of the figures of Linacre (1999). In the Winsteps Manual, Mike Linacre comments:

Ben Wright and Graham Douglas discovered that the multiplier (L-1)/L is an approximate correction for JMLE item bias, where L is the test length. For a two-item test, this correction would be (2-1)/2 = 0.5. It is implemented in Winsteps with STBIAS=YES. (Linacre, 2018)
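As a toy illustration of the size of this multiplier (our arithmetic and invented values, not Winsteps output), assuming the correction is applied as a simple multiplier to the logit estimates, as the quote describes:

```python
# Toy illustration of the (L-1)/L correction for JMLE estimation bias.
# The uncorrected estimates below are invented for illustration.
L = 25                                    # test length (number of items)
correction = (L - 1) / L                  # 0.96 for a 25-item test

uncorrected = [-1.20, 0.35, 2.10]         # hypothetical JMLE logit estimates
corrected = [round(correction * m, 3) for m in uncorrected]
print(correction, corrected)              # 0.96 [-1.152, 0.336, 2.016]
```

Note how small the adjustment is for a test of realistic length; this helps explain why the STBIAS setting matters so little in the analysis below.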

Of course, a natural question to pose is: What is the impact of this correction on the estimates of person measures and item measures in a Winsteps analysis? Below (Fig. 14.5) we present an analysis that we conducted with a data set. The item measures plotted on the vertical axis are from the analysis without the correction (STBIAS=NO), and the item measures plotted on the horizontal axis are from the analysis with the correction (STBIAS=YES). The data set consisted of 44 items and 324 respondents. As readers will be able to see, at least for this data set, there was no significant change in the item measures as a function of the STBIAS setting. Readers are encouraged to create their own cross plot using the person measures as well.

[Scatter plot for Fig. 14.5; y-axis: Item Measures With STBIAS=No; x-axis: Item Measures With STBIAS=Yes]

Fig. 14.5  A cross plot investigating the impact of STBIAS. Item measures of a data set are cross plotted as a function of STBIAS setting. Cross plot provided through Winsteps

The Bottom Line

What is the potential bottom line for readers? Based on our past work, our reflections on the plots we present, and our past discussions with those who are interested in different estimation procedures, we affirm that it does not make that much difference which estimation procedure is used. Each estimation procedure has some pros and cons. In the end, however, for constructing instruments and computing person measures and item measures, the choice of a program does not appear to broadly impact the work to be done following a Rasch analysis (e.g., interpreting a Wright Map, evaluating a PCAR). There may be some situations, very specific ones, in which one program might be slightly better in some way than another. However, all of us work with a wide range of data sets (numbers of respondents, distribution of respondents along a trait, numbers of items, and distribution of items along a trait); therefore, it seems to us that, from a computation standpoint, the choice among the commonly used programs makes little difference. The choice will depend upon what sort of analysis a researcher wishes to conduct and, of course, personal preferences. We close by quoting an article authored by Mike Linacre:

The investigations of 50 years of Rasch estimation are confirmed. All these estimation methods (CMLE, MMLE, JMLE) are equally good for practical work when implemented proficiently. However, their capabilities do differ considerably in other areas, such as the analysis of datasets with missing data. (Linacre, 2016, p. 1548)

Charlie and Chizuko: Two Colleagues Conversing
Chizuko: Well, Charlie, what do you think about these different estimation procedures, and what differences do they make for your work on our data sets?
Charlie: Hmmmmmm… okay, here is the deal… to me the bottom line is: Are we using a Rasch program? Then, in terms of which program one uses, I think that it does not matter which one is used. Which Rasch program you use may depend on what types of plots and tables are available from a software program, as well as on the level of documentation that is available in its user's manual.
Chizuko: I did find a quote that I hang over my work space… maybe you will be interested? Mike Linacre told us:

Each Rasch estimation method has its strong points and its advocates in the professional community. Each also has its shortcomings. Nevertheless, when the precision and accuracy of estimates are taken into account (Wright, 1998), all methods produce statistically equivalent estimates. (Linacre, 1999, p. 402)

Charlie: OK, I think I have a handle on the issue! I understand the details, but honestly the cross plots are what really helped me appreciate the issue.
Chizuko: Great! Let's try some on our own!

Keywords and Phrases
Rasch estimation procedures
Joint Maximum Likelihood Estimation (JMLE)
Marginal Maximum Likelihood Estimation (MMLE)

Potential Article Text

The Rasch analysis program Winsteps was utilized for the analysis of this data set. This program, as do all Rasch programs, uses Rasch theory and the mathematical expression of Rasch theory to compute person measures and item measures. The Winsteps program uses a two-step process. The first step is the so-called PROX (Normal Approximation Estimation Algorithm) method (Wright & Stone, 1999), which is used to determine initial person measures and item measures. The second step is JMLE (Joint Maximum Likelihood Estimation). This step enables the final estimates of person and item measures to be computed.


Quick Tips
A range of Rasch estimation procedures can be used to compute item measures and person measures. The two estimation procedures used in Winsteps are PROX and JMLE.
Two excellent articles are:
Linacre, J. M. (1999). Understanding Rasch measurement: Estimation methods for Rasch measures. Journal of Outcome Measurement, 3(4), 381–405.
Linacre, J. M. (2004). Rasch model estimation: Further topics. Journal of Applied Measurement, 5(1), 95–110.
Generally, in articles reporting the results of a Rasch analysis, it is important to simply provide some citations to Rasch; but for most articles, other than citing the software used, reviewers and readers will not be interested in details of the estimation procedure unless the article is written for a psychometric audience.

Data Sets
None

Activities
Activity #1
Open the Winsteps Manual and search for the term JMLE.
Activity #2
Go to the following YouTube links and view the explanations by Mike Linacre about the steps taken to calculate calibrations. If in doubt, search "YouTube Rasch Linacre Rasch Model Estimation".
https://www.youtube.com/watch?v=LvE8npeSjZ0
https://www.youtube.com/watch?v=X2zaib5VDnk&t=4s
https://www.youtube.com/watch?v=O6xH2lKbWgc&t=3s


Activity #3
Select a data set and compute person measures and item measures with Winsteps and another Rasch program. Cross plot the person measures and item measures.
Answer: Your cross plot should, in almost all cases, reveal no significant difference in the estimates.

References
Linacre, J. M. (1999). Understanding Rasch measurement: Estimation methods for Rasch measures. Journal of Outcome Measurement, 3(4), 381–405.
Linacre, J. M. (2004). Rasch model estimation: Further topics. Journal of Applied Measurement, 5(1), 95–110.
Linacre, J. M. (2016). Is JMLE really that bad? No, it's actually rather good! Rasch Measurement Transactions, 29(4), 1548–1549.
Linacre, J. M. (2018). Winsteps® Rasch measurement computer program user's guide. Beaverton, OR: Winsteps.com.
Wright, B. D. (1998). Estimating Rasch measures for extreme scores. Rasch Measurement Transactions, 12(2), 632–633.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: Mesa Press.
Wright, B. D., & Masters, G. M. (1982). Rating scale analysis. Chicago, IL: Mesa Press.
Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29(1), 23–48.
Wright, B. D., & Stone, M. H. (1999). Measurement essentials (2nd ed.). Wilmington, DE: Wide Range Inc.

Additional Readings
This article provides added comparisons in the use of different estimation procedures:
Linacre, J. M., Chan, G., & Adams, R. J. (2013). An "estimation bias" shootout in the wild west: CMLE, JMLE, MMLE, PMLE. Rasch Measurement Transactions, 27(1), 1403–1405.
This article provides an overview of JMLE. Remember that JMLE was initially named UCON:
Wright, B. D., & Bell, S. R. (1977). Verifying the unconditional estimation procedure for Rasch item analysis with simulated data. Statistical Laboratory, Department of Education, The University of Chicago, September 1977.

Chapter 15

The Importance of Cross Plots for Your Rasch Analysis

Charlie and Chizuko: Two Colleagues Conversing
Chizuko: So, Charlie, what in the world is going on with all those cross plots?
Charlie: Well, Chizuko… I've been playing around with something that has really helped me think more about each of my Rasch analyses. Some of the plots help me make decisions while I am doing my Rasch analysis, but they probably will not appear in my papers. Other plots I will probably put in my next paper.
Chizuko: Charlie, can you tell me about the plots and how they helped you?
Charlie: Sure… here we go!

Tie Ins to RAHS Book 1

Many topics were presented in RAHS Book 1. Generally, the topics emphasize how to conduct a basic Rasch analysis, how to work toward explaining a Rasch analysis, and ultimately how to write up an analysis. In this chapter we make use of several cross plots that can be easily constructed for a Rasch analysis. We talk readers through why they should consider such plots, how such plots can improve their understanding of Rasch, and how they might use such plots in an article. Readers will recall that in this book there are also numerous instances of cross plots being used to explore a Rasch issue; for example, the exploration of PCAR makes use of cross plots, as does the investigation of TIF.

Introduction

Cross plots are used to investigate data in many fields of work. For example, a plot of years of education versus income shows a trend that, generally, more education means a higher salary. In our Rasch work with students and colleagues, we have found that cross plots can play a very important role in our measurement work, and such plots can be used to further expand researchers' understanding of Rasch. For the examples that we begin with, we will make use of two control files. One file is cf se survey. This file is composed of the responses of 75 respondents to the 13-item Self-Efficacy scale (Enochs & Riggs, 1990) that we used throughout RAHS Book 1 and in parts of this current book. The second file we make use of is cf test. This file is composed of 75 respondents who answered a 25-item multiple-choice test. The last piece of information important for readers to be aware of is the ease with which cross plots can be made with the Winsteps program. Many useful cross plots appear in the tables provided by Winsteps, and many added cross plots can be constructed through the cross-plot option of Winsteps. Thus, the tools and resources we use for this chapter are two control files and the cross-plot capability of Winsteps.

Cross Plots to Understand Your Data

Error-vs-Person Measure

The easiest way to make your cross plot is to run your Rasch analysis on the data set of your choice. For this example we use the file cf se survey. Once you have run the control file, go to the Plots option of Winsteps (Fig. 15.1), click on Plots, and select the option that is entitled Compare statistics: Scatterplot. At this point you are presented with a window that looks like the one below (Fig. 15.2). Starting to learn how to use this plotting option, an analyst must first note that one can make plots for Items and for Persons. We can see that in the first line of the picture we provide in Fig. 15.2. The next most important part of this box is the line that says "Plot this (left x-axis)" and the line that says "and this (right y-axis)." This is where you indicate to Winsteps which variables you wish to plot on the y-axis of your scatter plot and which variables you wish to plot on the x-axis. If you click on the downward arrow on the same line, you can see all the different variables that can be plotted. Finally, look for the two locations on the screen that say "this analysis." This is where you indicate where the data being plotted come from. For our work in this chapter, we will make plots with this analysis of the data of cf se survey. However, it is important to be aware that it is possible to make scatter plots of Rasch data from different analyses.

Fig. 15.1  A view of Winsteps. Take note of the Plots option


Fig. 15.2  Scatterplot option in Winsteps

For example, for the survey data we have used, it is possible to conduct an analysis of only the male respondents and an analysis of only the female respondents, and then use this Winsteps plotting option to plot the item measures computed from the male data set versus the item measures computed from the female data set.

Below we provide a picture (Fig. 15.3) that shows how a researcher would select the appropriate buttons to make a plot of person measures versus the error of each person measure. After the selections have been made as shown in the figure, an analyst must simply click the button OK; then a new small window pops up (Fig. 15.4). This window allows you to select the manner in which you wish to plot your data in the scatter plot. Our example has 75 respondents, thus we select Marker for our plot. But, as we will see later, sometimes it is quite useful to plot the label of a person or item.

Fig. 15.3  The scatterplot menu in Winsteps

Fig. 15.4  The Winsteps window allowing one to select different symbols for plots

Below we provide the plot (Fig. 15.5) that resulted from the use of the steps we describe above. Readers should observe that the plot displays each of the 75 respondents, with each person's measure plotted against the error of that measure. Having made this plot, what use is it to us? How are we helped by such a plot? What might it teach us about Rasch? How might such a plot be important for our analysis? The first thing a researcher can see is that the error of measurement is not the same for all of the respondents. In fact, many different levels of measurement error exist. Upon review of the plot, we can also see two lines made by the dots (if we were to "connect the dots"). The lower line begins with a dot around a measure of −1.0 logits. If one follows that sequence of dots from the left of the plot to the right of the plot, the error decreases, then increases. The minimum error on the plot for this set of dots is around .20 logits. Then, as a researcher moves toward higher and higher measures, the error reported for each dot increases. Also, notice there are a handful of dots that have larger error than the dots along the curving line we just considered. What is up with these dots? These dots result when a smaller number of Self-Efficacy items were answered by respondents. For each of these respondents, a total of 6 items were answered, as opposed to all 13 items. Recall that with Rasch it is possible to compute a measure for respondents regardless of how many items are answered and which items are answered. However, when fewer items are answered, for most situations, there will be a larger error of measurement. Two final points: What is up with the person who has a very high measure (around 7.5 logits)? Also, why should it make sense that this person also has a very high error of measurement?


[Scatter plot for Fig. 15.5, titled "SE STEBI data used to construct initial cf for fit chp.xlsx"; y-axis: Standard error; x-axis: Measure]

Fig. 15.5 A Winsteps cross plot of person measures and the standard error of each measure

If a researcher uses the Winsteps table that is organized by person measure (Table 17), the researcher can identify this person (dot) as someone who answered all 13 Self-Efficacy items with the highest rating scale category (a 6). This is, then, the case in which a researcher knows the person has a very high self-confidence, but, since they "topped out" on the scale, we do not know how much more confidence they have. As a result, a researcher will see a high error of measurement. Finally, when readers run this plot that we provide, take note that if you put your cursor over any of the dots, you can see the coordinates of the dot. Thus, if you place the cursor over this person who topped out on the scale, you can see that the measure is 7.52 and the error is 1.85.

Formative Assessment Checkpoint #1
Question: Can you construct a similar plot (item measures-vs-item errors; person measures-vs-person errors) using the control file cf test that is supplied for this chapter? Before you run the plot, try to predict what you will see in your plot.
Answer: You will see a similar pattern. However, how much of the "U" you will see will depend upon the range of item difficulty in your data set and the range of person ability.
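The same measure-vs-error plot can also be scripted from exported person statistics. Below is a minimal Python sketch, assuming the person statistics have been saved from Winsteps to a CSV (for example, via an output file); the file and column names are hypothetical placeholders to be adjusted to your own export.

```python
# A minimal sketch of the measure-vs-error cross plot, assuming person
# statistics were exported to a CSV. File and column names below are
# hypothetical placeholders, not fixed Winsteps output names.
import pandas as pd
import matplotlib.pyplot as plt

persons = pd.read_csv("person_stats.csv")    # columns: MEASURE, SE

plt.scatter(persons["MEASURE"], persons["SE"], s=12)
plt.xlabel("Person measure (logits)")
plt.ylabel("Standard error (logits)")
plt.show()
# Expect a "U": error is smallest where items are well targeted and
# grows toward the extremes (and for persons who answered fewer items).
```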


Sample Cross Plots to Aid Your Analysis and Your Articles

Of course, thousands of plots can be made. In this section of the chapter we provide some go-to plots that we often make for our analyses and also for our articles. Also, as you may already know, numerous plots are observed in the tables provided by Winsteps. Each of these plots is helpful in some manner as you work with Rasch. Finally, before we move on to the types of cross plots, we strongly recommend that readers edit their cross plots before presenting them to audiences. In the two Wright Map chapters of RAHS Book 1, and in the two Wright Map chapters of this book, we have emphasized repeatedly the importance of editing out pieces of your Wright Map that are not needed for your figure. Moreover, we have emphasized enhancing aspects of Wright Maps (for example, using different fonts for different item types presented in a Wright Map) to highlight a point you wish to make. The same advice is critical for cross plots. Include only the information that is needed to make your point.

Item Measure-vs-Item Error Cross Plot

This plot is very similar to the plot discussed above. In most analyses, the number of items will be far smaller than the number of respondents, so be prepared to see only a few dots. In many cases researchers will see a "U" pattern of dots, or at least part of a "U", as they look over the pattern of dots. We feel the most important parts of this plot are the dots (the items) at the extreme ends of the plot. These are often the items with the greatest item error. The question we ask ourselves is whether or not this amount of error is acceptable for the work we are doing. Thus, if we are presenting a discussion of the item ordering and spacing on our Wright Map, do the items at the extremes have so much error that we must be careful about the assertions we make regarding comparisons of item difficulty among items?

Item Measure-vs-Item MNSQ Outfit

As we stressed in RAHS Book 1, one of the aspects of scale construction that we think about most often focuses on the fit of items and the fit of persons. As readers now know, it is critical to evaluate the fit of the data to the requirements of the Rasch model. If items do not fit, then we do not want to use those items to define our construct. If people do not fit, we do want to pause and think about why the persons might not fit the Rasch model expectations. With regard to fit, we do a lot of work during an analysis considering the fit of items. One cross plot that we use and value is the plot of item measure versus item MNSQ Outfit.

[Scatter plot for Fig. 15.6, titled "SE STEBI data used to construct initial cf for fit chp.xlsx"; y-axis: Outfit mean-square; x-axis: Measure]

Fig. 15.6 A Winsteps cross plot of item measures and item MNSQ outfit

It is possible to review a table of items sorted by MNSQ Outfit (Table 10 of Winsteps). Moreover, it is possible to review a table sorted by item measures (Table 13 of Winsteps). However, the interplay of item measures and item fit can be confusing as a researcher jumps from one table to another. Above we provide our plot (Fig. 15.6) for the Self-Efficacy survey data. With this plot we have selected Label, so the item name we provided to Winsteps is plotted. From a Rasch perspective, what do we learn from this plot? The main thing we learn is that one item (Q3) appears to have a MNSQ Outfit that is higher than we might like to see (looking at the Excel worksheet that was used to make the cross plot, we can quickly see that the MNSQ Outfit is 1.64). Certainly, we will follow the steps we outlined in RAHS Book 1 to investigate the ins and outs of this item's fit. However, this plot helps us appreciate that, in terms of item measure, this potentially problematic item is not at the extremes of the measurement scale [some items have higher measures (there are items plotted to the right of Q3) and some items have lower measures (there are items plotted to the left of Q3)]. Also, there are items with measures similar to that of Q3, but those items have more acceptable MNSQ Outfit values. This means that, from a scale construction point of view, we can quickly see with this plot that if we removed this item, we would not be opening up a large gap in the coverage of our trait with items. Moreover, we already have some items in our scale that appear to mark the trait in the same general region as Q3. Finally, we wish to mention that Table 9 of Winsteps can be used to review the same information as we have provided above. We find, however, that it is faster for us to use this plotting capability of Winsteps, and we find that being able to select the label of each item also helps us to quickly make a plot that we can use.

Formative Assessment Checkpoint #2
Question: Using the file cf test, create the same plot as described above for the survey data. Select Label in order to see the name of each item that is plotted. What pattern with items do you see in the plot?
Answer: After you have made the plot, look over the range of item measures and the values of Outfit MNSQ for those items. Generally, for this set of test items, it appears that some of the difficult items and some of the easy items are the items that have the higher values of Outfit MNSQ.
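If you prefer to script this check, the same plot can be built from exported item statistics. Below is a minimal Python sketch, assuming the item statistics have been saved to a CSV; the file and column names are hypothetical placeholders to be adjusted to your own export.

```python
# A minimal sketch of the item measure vs. MNSQ Outfit plot, assuming
# item statistics were exported to a CSV. File and column names are
# hypothetical placeholders, not fixed Winsteps output names.
import pandas as pd
import matplotlib.pyplot as plt

items = pd.read_csv("item_stats.csv")        # columns: NAME, MEASURE, OUTFIT_MNSQ

plt.scatter(items["MEASURE"], items["OUTFIT_MNSQ"], s=12)
for _, row in items.iterrows():
    plt.annotate(row["NAME"], (row["MEASURE"], row["OUTFIT_MNSQ"]))  # label each item
plt.axhline(1.0, linestyle="--")             # ideal mean-square value for reference
plt.xlabel("Item measure (logits)")
plt.ylabel("Outfit MNSQ")
plt.show()
```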

Cross Plots from Separate Analyses

Let us pretend, at least for the purposes of this chapter, that we wish to quickly explore the impact upon person measures of the removal of items from the 25-item test that was used to collect the data provided in the file cf test. Let us pretend that we have decided to shorten the test. Two candidates for shortening the test are the removal of item Q8 and item Q36. These are two items that have item measures very similar to those of two other test items (Q31, Q35). But we want to explore the impact of removing items Q8 and Q36. For example, will the measures computed for each respondent be altered so that we would not reach the same conclusions as to how all the students in the data set compared? This means: does removal of these two items (creating a test of 23 items) change the ordering of respondents? In essence, with regard to student ordering, does it make much difference to have a 23-item test or a 25-item test?

The plot can be made in a number of ways. One technique that we use is to conduct two separate Rasch analyses, one with all test items and one with the 23 test items [using IDFILE to delete the two items Q8 (item entry 7) and Q36 (item entry 25)]. Our next step to make this plot is to simply run each of the control files, and then make a plot of person measures (x-axis) versus person measures (y-axis) for the same analysis. As we are plotting the same measures against each other, we will get a straight line. We provide the plot from cf test below. When we made the plot, we unchecked the box that concerns confidence bands (Fig. 15.7).

The next step is quite simple. Take the worksheet (Fig. 15.8) provided for this plot and remove the column of data that is provided in column C. Then insert the column C data from the other analysis into the now blank column C of the initial analysis. In Fig. 15.8 we provide the first seven rows of the Excel sheet from the cross plot of the person measures we made from the analysis of the full 25-item test. At this point, we make sure to save this Excel sheet, which now has the person measures from the use of the 25-item test (column B) and the person measures from the use of the 23-item test (column C). This is shown in Fig. 15.9.

[Fig. 15.7 appears here. The Winsteps plot is titled “25 items 75 people no high persons Fall 2012 OGT No Partial Credit Items Boone.x”; both axes are labeled Measure and run from −2 to 3 logits.]

Fig. 15.7  A Winsteps cross plot of identical measures

        Col A   Col B     Col C     Col D
Row 1
Row 2   Entry   Measure   Measure   Person
Row 3   1       1.94      1.94      1M
Row 4   2       1.64      1.64      2M
Row 5   3       2.3       2.3       3M
Row 6   4       1.94      1.94      7M
Row 7   5       -0.28     -0.28     9M

Fig. 15.8  The first seven rows of the Excel sheet used for the cross plot of Fig. 15.7

Following this saving of the data set, we select scatter plot from the lower left tab of the Excel sheet. At this point we are provided with the plot of the person measures from the two analyses (provided below in Fig. 15.10). Also, in the data files for this chapter, we provide the Excel sheet we created (Data file for cross plots of 25 item test 23 item test). What do we learn from this plot? We learn that although there are some changes in the ordering of the persons as a function of the test used (25-item test, 23-item test), there is no large-scale change in the ordering of respondents beyond what would be expected through measurement error.
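For readers who wish to run the shortened analysis themselves, the item deletion mentioned above can be requested with the Winsteps IDFILE= command. A minimal sketch (the entry numbers are the ones given earlier in this chapter; the semicolon comments are ours):

IDFILE=*
7  ; delete Q8 (item entry 7)
25 ; delete Q36 (item entry 25)
*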


        Col A   Col B     Col C     Col D
Row 1
Row 2   Entry   Measure   Measure   Person
Row 3   1       1.94      1.84      1M
Row 4   2       1.64      1.84      2M
Row 5   3       2.3       2.22      3M
Row 6   4       1.94      1.84      7M
Row 7   5       -0.28     -0.10     9M

Fig. 15.9  Part of an Excel sheet that includes person measures from the 25-item test (column B) and the person measures from the 23-item test (column C)

[Fig. 15.10 appears here. The Winsteps scatterplot is titled “25 items 75 people no high persons Fall 2012 OGT No Partial Credit Items Boone.x”; both axes are labeled Measure and run from −2 to 3 logits.]

Fig. 15.10 The Winsteps scatterplot of the person measures using the 25-item test and the 23-item test

Cross Plots Used in Chaps. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14

A range of cross plots is utilized in this book: Figs. 2.2, 4.2, 6.1, 8.10, 8.11, 12.2, 14.3, and 14.5.


Cross Plots Used in Articles

For the final part of this chapter on the importance of cross plots, we provide some examples of how cross plots have been presented in the Rasch literature. We suspect that a lot of cross plot work might not be mentioned in articles, in that such steps allow a researcher to move from A to B to C in an analysis. We do feel a judicious increase in the use of cross plots in Rasch articles might be helpful. Below we provide full citations of a number of articles that have used cross plots in a Rasch analysis. We also provide a brief summary and discussion of the data that were cross plotted.
Elementary Education Majors’ Views on Evolution: A Comparison of Undergraduate Majors Understanding of Natural Selection and Acceptance of Evolution (Hermann, 2016). Cross plot of person measures on one scale (natural selection) versus person measures on another scale (acceptance). This type of plot is important because, if person measures are along a line, that suggests the two different scales may measure the same trait. This type of comparison is similar to the discussion we presented comparing the person measures with the 25-item test and the 23-item test.

A Rasch Analysis of Assessments of Morning and Evening Fatigue in Oncology Patients Using the Lee Fatigue Scale (Lerdal et al., 2016). The authors of this study presented a cross plot of person raw scores for a 13-item scale versus person measures for a revised scale. We believe that such a plot can be used in articles to show the nonlinear relationship between the raw scores and the logit measures.

User Acceptance of a Touchless Sterile System to Control Virtual Orthodontic Study Models (Wan Hassan et al., 2016). In this study, a cross plot of person measures from two different scales is provided. The authors discuss, in part, what happens when a researcher moves up the measurement scale for one of the measures, and what a researcher sees in terms of changes with the second measure.

210

15  The Importance of Cross Plots for Your Rasch Analysis

Self-Reported Frequency and Perceived Severity of Being Bullied Among Elementary School Students (Chen, 2015). This study presents a cross plot of item measures. Two different scales were presented to respondents; however, the items on both scales addressed the same topic. This allowed each item to be calibrated on each of the two scales, and also allowed a cross plot of item difficulties to be constructed.

Evaluating the Psychometric Properties of the Foreign Language Classroom Anxiety Scale for Cypriot Senior High School EFL Students: The Rasch Measurement Approach (Panayides & Walker, 2013). As part of this study, an experiment was conducted to investigate the impact of a scale having a different number of items. One cross plot was of person measures using a 25-item scale and a 28-item scale. Another plot explored the impact of using a 24-item scale or a 28-item scale.

The Flow Experience: A Rasch Analysis of Jackson’s Flow State Scale (Tenebaum, Fogarty, & Jackson, 1999). In this analysis, two cross plots were presented. One cross plot presented the location of item calibrations for a scale when only males were used to calibrate the items (y-axis) and when only females were used to calibrate the items (x-axis). A second cross plot was made. A set of items was calibrated. One calibration was done with a sample of elite athletes. The other calibration was done with a sample of non-elite athletes. The cross plot allowed an investigation of the impact of athletic status upon the calibration of instrument items.

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Well, here we go super researcher, I wanted to show you something that is kind of cool.
Chizuko: Ready to hear you out as always.
Charlie: Ok, here is the neat thing. I think I have kind of forgotten to make use of something I have seen in many classes I have taken - in Business, in Geology, in Physics, and in Medicine. That thing is the cross plot.
Chizuko: Seems pretty simple to me, x-versus-y, what’s the big deal?
Charlie: Here is the deal. Cross plots really help me understand data. Also, cross plots can be used to clearly communicate some of my Rasch work.


It is important to explain, of course, what is being plotted, and why, and what one sees in the plot. However, it is really a good way to get to know your data, play with your data, experiment, and communicate your results. I know that page limits are important in writing an article. That might be one reason why I don’t see cross plots that often. However, I am not all that sure that cross plots are being used by researchers as much as they could be used. I think the simple cross plot will help me conduct my Rasch analyses. Also, I think cross plots could help me do a better job of teaching Rasch to my colleagues.

Keywords and Phrases
Cross plot
Compare statistics: Scatterplot

Potential Article Text

The Lincoln Park project utilized Rasch measurement techniques to develop a new instrument, the LP1. One technique by which the instrument was evaluated was an examination of the fit of respondents. Below we provide a cross plot of the person Outfit MNSQ values and person Infit MNSQ values for each of the 2000 respondents. The cross plot was reviewed in an effort to identify those respondents who exhibited both high Outfit MNSQ values and high Infit MNSQ values. A total of 17 students were identified. We then reviewed the information we had regarding each of the 17 students to investigate whether there was a commonality among them. It was determined that all the students were from one classroom. It was decided to remove these students from the analysis.
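If an analyst needed to carry out such a removal in Winsteps, one option is a person deletion list. A minimal sketch, with hypothetical person entry numbers (in a real analysis, the 17 entry numbers would come from the cross plot review):

PDFILE=*
104 ; hypothetical person entry numbers
212
318
*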

Quick Tips

Winsteps provides a number of techniques that allow one to very quickly cross plot data. Consider cross plotting your data to better understand Rasch and to better understand your data set. Cross plots should also be considered for inclusion in papers and presentations to explain your findings.


Data Sets (Go to http://extras.springer.com)
cf se survey
cf test
Test data for cross plot chp
SE STEBI data for cross plot chp

Activities

Activity #1
Create a control file for the file named Test data for cross plot chp. Then conduct a Winsteps analysis. Then make a cross plot of item MNSQ Outfit (x-axis) and item MNSQ Infit (y-axis). What pattern is seen?
Answer: We suggest that when you make your cross plot, you select Label. This will help you quickly identify the items. The plot reveals a small range of differences in the MNSQ Infit values, but a larger range of MNSQ Outfit values. Also, there appears to be a trend that the items with larger Outfit values are also the items with larger Infit values. One aspect of the plot that an analyst might consider is whether items with very similar values (e.g. item 4 & item 9) are similar in some manner. A review of the text of each item might reveal some commonalities in the items.

Activity #2
Using the test control file you have created, make a cross plot of raw scores for persons (x) and the Rasch measures (y) of the persons. When you make this plot, select Marker for your plot. What do you see? How can you explain this pattern of dots to students you are teaching?
Answer: The lowest raw score dot is at a 6, and the highest raw score dot is at a 22. Why is this? This is because, for this test and set of students, the lowest performing student earned 6 raw score points, and the highest performing student earned 22 raw score points. When we look at the plot, we can see that the pattern of dots is not a straight line. There is a curve to it. This is because we are seeing the ogive in the plot, which is the pattern we see when we convert the nonlinear raw scores to the linear Rasch measures (the formulas presented after the activities show why this ogive shape appears).


Activity #3
Using the test control file you have created, make a cross plot of person measure (x-axis) and person Outfit MNSQ (y-axis). What do you notice in the pattern of the dots? What might be the implications for your analysis?
Answer: Reviewing the plot, there appears to be a pattern that the person Outfit MNSQ is greatest for the higher and lower person measures (so, measures at the two extremes). For the middle range of person measures, the Outfit MNSQ seems to be lower. This pattern should make sense to readers, in that Outfit MNSQ helps one look at unexpected responses for the test takers (did a person answer an item in a surprising way given his or her ability level?).

Activity #4
For this activity, you are first going to make a prediction. After you make your prediction, you will make a cross plot. What is your prediction if you use the test data set to make a cross plot of the person measures (x-axis) and the person standard error (y-axis)? What is the pattern you expect to see? Why that pattern? When you make your cross plot, make sure to select Marker as the manner in which you wish to display your data.
Answer: You will have made a prediction of the pattern you would expect to see. Hopefully it matches the cross-plot pattern you see with your Winsteps plot. The plot that you see is a “U” shape. Very high person measures have higher errors, very low person measures have higher errors, and the persons with middle of the road measures have the lowest error. The way to understand the pattern is that when one has a high measure, one knows that a test taker did well on the test, but we don’t know how much better they could have done had there been more items of higher difficulty. The same is true for low performing students. If the students have a low measure, we do not have a good idea as to what the students know; for that type of student, it would have been helpful to have more easy items on the test. It should make sense that, for the students with the middle of the road measures, we would predict they would have answered a number of items correctly and also answered a number of items incorrectly. We think of this as bracketing in the persons. By having a nice mix of items (harder and easier than the person’s ability level), we are able to more confidently figure out their measure (thus a lower standard error). The standard error formula presented after the activities shows this “U” shape directly.

Activity #5
Repeat Activity #1 but use the data set SE STEBI data for cross plot chp.


Activity #6
Repeat Activity #2 but use the data set SE STEBI data for cross plot chp.

Activity #7
Repeat Activity #3 but use the data set SE STEBI data for cross plot chp.

Activity #8
Repeat Activity #4 but use the data set SE STEBI data for cross plot chp.
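For readers who want to see why the ogive of Activity #2 and the “U”-shaped error pattern of Activity #4 arise, a compact sketch using the dichotomous Rasch model may help. Under the model, the probability that a person of measure $\theta$ succeeds on an item of difficulty $\delta_i$ is

$$P_i(\theta) = \frac{e^{\theta - \delta_i}}{1 + e^{\theta - \delta_i}}.$$

In the idealized case of a test of $L$ items that all share one difficulty $\delta$, the measure corresponding to raw score $r$ (for $0 < r < L$) is

$$\hat{\theta}(r) = \delta + \ln\frac{r}{L - r},$$

an S-shaped (ogive) function of the raw score. The standard error of a person measure is the inverse square root of the test information:

$$SE(\hat{\theta}) = \frac{1}{\sqrt{\sum_i P_i(\hat{\theta})\,[1 - P_i(\hat{\theta})]}}.$$

Because $P_i(1 - P_i)$ is largest when $\hat{\theta}$ is near $\delta_i$, information peaks for middle of the road measures and falls off at the extremes, which produces the “U” shape of the standard errors.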

References

Chen, L. M. (2015). Self-reported frequency and perceived severity of being bullied among elementary school students. Journal of School Health, 85(9), 587–594.
Enochs, L. G., & Riggs, I. M. (1990). Further development of an elementary science teaching efficacy belief instrument: A preservice elementary scale. School Science and Mathematics, 90(8), 694–706.
Hermann, R. S. (2016). Elementary education majors’ views on evolution: A comparison of undergraduate majors understanding of natural selection and acceptance of evolution. Electronic Journal of Science Education, 20(6), 21–44.
Lerdal, A., Kottorp, A., Gay, C., Aouizerat, B. E., Lee, K. A., & Miaskowski, C. (2016). A Rasch analysis of assessments of morning and evening fatigue in oncology patients using the Lee Fatigue Scale. Journal of Pain and Symptom Management, 51(6), 1002–1012.
Panayides, P., & Walker, M. J. (2013). Evaluating the psychometric properties of the foreign language classroom anxiety scale for Cypriot senior high school EFL students: The Rasch measurement approach. Europe’s Journal of Psychology, 9(3), 493–516.
Tenebaum, G., Fogarty, G. J., & Jackson, S. A. (1999). The flow experience: A Rasch analysis of Jackson’s Flow State Scale. Journal of Outcome Measurement, 3(3), 278–294.
Wan Hassan, W. N., Abu Kassim, N. L., Jhawar, A., Shurkri, N. M., Baharin, N. A., & Chan, C. S. (2016). User acceptance of a touchless sterile system to control virtual orthodontic study models. American Journal of Orthodontics and Dentofacial Orthopedics, 149(4), 567–578.

Additional Readings
None

Chapter 16

Wright Maps (Part 3 and Counting...)

Charlie and Chizuko: Two Colleagues Conversing
Charlie: Here we go again Chizuko...I’ve got another question for you...ready?
Chizuko: You know I am always ready for Rasch questions....full speed ahead.
Charlie: Ok here is the deal....I’ve been looking through about 40 articles that have been recently published using Rasch analysis. I have noticed that some articles present Wright Maps, some do not present Wright Maps, and some articles do a better job than other articles of making use of Wright Maps. So, I was wondering what your take on this is....how can the use of Wright Maps in recent articles help me do a better job of communicating in my own work?
Chizuko: To tell you the truth Charlie, sometimes I think you are sleeping...but now I can see that you are thinking - although maybe with your eyes closed!

Tie Ins to RAHS Book 1

In our first book, we devoted two chapters to Wright Maps (Person-Item Maps). Those chapters primarily talked readers through the process of creating a Wright Map from a multiple-choice test, and we introduced readers to some of the ins and outs of interpreting and using Wright Maps. In this book, we used Wright Maps to help readers understand some of the nuances of partial credit analyses. This chapter provides added details to help readers utilize Wright Maps in their work. We believe that Wright Maps are one of the most useful aspects of Rasch measurement because such maps allow researchers to readily explain research results to readers. However, we feel that researchers often do not invest enough time in preparing their Wright Maps. Often, Courier font is used for a final figure in a paper. It does not take much effort to switch to a different font.


Also, often the default plotting is used by researchers. That means in some cases respondents are plotted with a dot or a #. For final figures it is possible to plot all the respondents, and in some cases that is preferable. In this lengthy chapter we provide a wide range of techniques to help you maximize the communicative power of Wright Maps.

Introduction

Wright Maps are causing substantial changes in research. This change focuses on some easy-to-understand components of Rasch, and also some more subtle issues. For example, Wright Maps provide the location of item difficulty that is corrected for the nonlinearity of raw scores. We say subtle because sometimes beginners need a bit of time to appreciate that raw scores from a multiple-choice test are nonlinear. In terms of easy-to-understand issues, we believe that the ability to plot the difficulty of items (be they multiple choice test items, partial credit test items, or survey items) allows analysts to see a story, and also tell a story, with the data. Most analysts are familiar with looking at a hierarchy and interpreting the ordering of a set of items. Since we completed our first book, we have seen more and more presentations of Wright Maps in the literature. We have seen some mistakes in published Wright Maps, we have seen Wright Maps that produce great clarity for a study, and we have seen Wright Maps where the addition of more detail could immensely improve the use of the Wright Map. Although this is the third chapter we have dedicated to Wright Maps, we suspect this is not the last chapter concerning Wright Maps that we will author.

Recent Wright Maps in the Literature

To prepare for this chapter, we reviewed Rasch articles that appeared in the literature between 2003 and 2017. We reviewed only a subset of the articles that have been published, but we believe that these articles are a representative sample of the journals that have published Rasch articles and of the range of studies that have used Rasch analysis methods and Wright Maps in the last few years. Our goal in this chapter is to help readers consider ways in which they may further expand and improve their use of Wright Maps. We encourage readers to browse the list of Additional Wright Map Examples at the end of this chapter. These articles, as well as this chapter’s References and Additional Readings, help readers to quickly see and understand the diverse and cross-discipline usage of Wright Maps.


What Can Be Learned and Improved Upon Through Review of the Wright Maps in These and Other Articles?

The Common Wright Map Format

Readers are now quite familiar with the organization of Wright Maps as provided by Winsteps. A vertical line is presented, scaled in units of logits. On the left side of the Wright Map are X symbols. These X symbols represent respondents. If a data set is small, usually an X represents a single respondent. If a data set is large, then an X represents a number of respondents. At the base of the Wright Map, Winsteps provides information on the number of respondents represented with an X. On the right side of the vertical line, the items of the test or survey are presented. The symbol used to indicate the location of an item is the name of the item as it appears in the control file. Finally, to set the stage for our discussion of Wright Maps, we remind readers that the left side of the Wright Map presents respondents from lowest measure (at the base of the Wright Map) to highest measure (at the top of the Wright Map). On the right side of the Wright Map, the item difficulties are presented from the lowest item measure (at the base of the Wright Map) to the highest item measures (at the top of the Wright Map). This is the case when we use the coding conventions utilized in RAHS Book 1.

Surprises in Presentation of Wright Maps

Font

Our first point for readers is how surprised we are that the vast majority of Wright Maps are presented in the exact same font as seen in Winsteps. We hold nothing against Courier font, but we do not understand why authors and presenters do not take sufficient time to alter their Wright Map font from Courier to another font of their choice. Part of the issue is simply that for computer output, it is of no concern that a table or a plot is in Courier font. But for a publication, the use of a more modern font is much more appealing to readers. Yes, we know that the interpretation of the Wright Map in Courier font is no different than the interpretation of a Wright Map in Times Roman font. But, the non-Courier font Wright Map is much more pleasing to the eye. The key is, if you use a non-Courier font, make sure to line up what you type. Thus, if you decide to make a Wright Map using a more pleasing-to-the-eye font, make sure to put your Wright Map with the nicer font next to the original Wright Map with Courier font. This is important in order to check that your map’s items and persons line up in the way presented in the Courier font plot.


An example of a paper in which the Wright Map is used as is from Winsteps is the Wright Map presented in “Development and Validation of an Iranian Voice Quality of Life Profile (IVQLP) Based on a Classic and Rasch Rating Scale Model (RSM)” (Dehqan, Yadegari, Asgari, Scherer, & Dabirmoghadam, 2017) and in “Further Validation of the Inventory of Mental Toughness Factors in Sport (IMTF-S)” (Stonkus & Royal, 2015). An example of a Wright Map that is presented in a font other than Courier is in “Development and Validation of a Short Version of the Cornell Scale for Depression in Dementia for Screening Residents in Nursing Homes” (Jeon et al., 2016). When transferring a plot from Winsteps in Courier font to another font, some tricks must be used, but the tricks are not difficult. When using a font other than Courier, it is important to make sure that the text lines up as it does in the original Courier plot. To line up parts of the map, we have found it helps simply to use the tab key. Below we present part of a Wright Map in Courier font (Fig. 16.1), and also we present the same Wright Map in Times Roman font (Fig. 16.2). As readers will see, when an immediate conversion to a different font is made, even with the careful use of tab keys and space bars, there is not an exact line up of the parts, as is seen when Courier font is used. We think this might be one reason why many researchers use the Wright Map as it is presented initially through Winsteps in their publications. We suggest that when researchers have the final version of their Wright Map, they should consider spending a little time creating their own Wright Map. This can be easily done through knowledge of the difficulties of each item and the measures of each respondent.

Fig. 16.1  Part of a Wright Map from Winsteps in Courier font



Fig. 16.2  The same Wright Map from Winsteps as above in Times Roman font

For the sake of comparison to our first and second Wright Maps, we present the same information. Our point in this initial example is that although both plots are accurate, it is much more pleasing to the eye to use a font other than Courier.

Formative Assessment Checkpoint #1
Question: Why is font an important issue in Wright Maps?
Answer: Font is important for Wright Maps, as it is for any written or presented document. Font must be readable and clear. But there is another issue, and it has to do with how the characters which are entered line up. There are many ways to line things up. One way that helps is the use of Courier font. But the Courier font is not all that pretty. So, if you do not use Courier font, then switch to a nice-looking font, but make sure to line things up. As you attempt to use some other fonts, look up “fixed-width” and “monospaced” fonts. Such fonts line up as does Courier font.


Wrap Around

When a Wright Map is created by Winsteps, and when many test items are at a similar calibration and/or the names of the items are very long, a problem may occur in which a break appears in the vertical line for the Wright Map. We have seen this wrap around in many Wright Maps. We believe this is just something authors have missed. Also, it is possible that when figures are forwarded to publishers, a formatting issue occurs. Below we present a Wright Map (Fig. 16.3) for one of our data sets in which we typed in a long name for questions No 68, No 75, and No 76. We typed an extended name for these three items to help readers see the problem that occurs with wrap around. Below, readers will see the impact of the long name for No 75 and No 76. Because the names of these items are so long, the two items (at least on the plot) seem to not be plotted at the same location as is seen in the initial plot with short item names (Fig. 16.4). As the names of No 75 and No 76 are so long, No 76 was shifted down the page.


Fig. 16.3  A Wright Map from Winsteps in which a long name is provided for a number of items. The long name can cause wrap around, which causes a downward shift in part of the plot. This can be noted in the plot at the location where the vertical line between persons and items has a gap

Surprises in Presentation of Wright Maps

221

Readers should take note of the now present gap in the line. Whenever readers review Wright Maps and see this break in the line, it is likely the result of this wrap around. Also note that the long name for item No 68 did not cause a problem, in that the item’s name was not so long as to cause the wrap around. Wrap around can also occur for items when a very large number of items have a similar item difficulty. Below we present how the Wright Map would appear if items No 78-No 84 were included, and these seven added items all had the same item difficulty as item No 76 (Fig. 16.5). In this case, the wrap around occurs not due to the length of the item name, but rather due to the result of having so many items at the same item difficulty level. Again, readers should note the tell-tale break in the line. How can a researcher avoid this type of wrap around? One technique that we recommend to our students (Fig. 16.6) is to experiment with changing the orientation of the paper used for the Wright Map. Often, rotating the paper to landscape avoids the wrap around.

Fig. 16.4  The Wright Map from Winsteps resulting from the use of a short name as opposed to the long name as presented in Fig. 16.3


Fig. 16.5  A Wright Map from Winsteps. Wrap around can occur if a large number of items have a similar calibration

Below (Fig. 16.6) is the Wright Map when a researcher uses landscape orientation and does a bit of fine tuning with the page margins. Notice that the problem with the vertical line has disappeared. This rotation of the paper, in our example, solves the wrap around issue for both the case of the long item name as well as the case in which many items have a similar item calibration. To finish up this particular aspect of tips with Wright Maps, please make sure to review your Wright Maps for the line break as we have presented here. If a line break is present, you must edit your Wright Map so the items are in the “Wright” location! If you do not take care to spot line breaks, an item may not appear where it should be on the logit scale; it will appear one line lower than where it should be placed.


Fig. 16.6  The Wright Map from Winsteps that results from the rotation of a page. This provides enough space to avoid wrap around

A final comment: readers will note that in Fig. 16.6 each “#” represents 4 respondents. Using the command TFILE in Winsteps, one can alter the number of respondents that are plotted with a “#”. This can be helpful as you create your final Wright Map for a talk or paper. It might be that initially you use the default setting of Winsteps, but that for your final paper you change the number of respondents who are plotted with a “#”. It could be that you wish to plot all your respondents.

Formative Assessment Checkpoint #2
Question: What is the whole deal with wrap around?
Answer: With wrap around, one not only ends up with a break in the line of the Wright Map, but one also ends up with an item (or a person) not appearing where it should be on the Wright Map. An item that is wrapping around gets shifted down the Wright Map and thus appears to have a lower measure than its actual measure.

M, S, T

When Winsteps is used to create a Wright Map, each map will include the symbols M, S and T on both sides of the vertical line. M on the item side of the Wright Map marks the mean of the item difficulties.


The two S symbols on the item side mark one standard deviation above and below the item mean. The two T symbols on the item side mark two standard deviations above and below the mean on the item side. M, S and T are also used to provide the same information for the person measures. Some authors of Wright Maps use the maps as is from Winsteps. For example, see the two Wright Maps presented in “What Is the Best Measure for Assessing Diabetes Distress? A Comparison of the Problem Areas in Diabetes And Diabetes Distress Scale: Results From Diabetes MILES-Australia” (Fenwick et al., 2016). But if we review the two Wright Maps of “Educational Leadership Effectiveness: A Rasch Analysis” (Sinnema, Ludlow, & Robinson, 2016), we can see that the authors have chosen not to retain the symbols of M, S, and T on both the item and person sides of the Wright Map. Some authors have decided not to include the symbols S and T, but have retained the M for both persons and items [see “Rasch Analysis of the Locomotor Capabilities Index-5 in People With Lower Limb Amputation” (Franchignoni et al., 2007)]. Some authors have chosen to present, for example, an M on one side of the Wright Map, but an M and S on the other side of the Wright Map [see “Educational Intervention in Patients Undergoing Orthognathic Surgery: Pilot Study” (Sousa, Turrini, & Poveda, 2015)]. In some cases, we believe that an S or a T is not presented simply because that part of the Wright Map is not being provided to readers. Rather, only the part of the Wright Map that is useful for the explanation and analysis conducted by the researcher is presented. An example of this style is “Using Rasch Analysis to Explore What Students Learn About Probability Concepts” (Mahmud & Porter, 2015). In this article’s Wright Map, an upper T and a lower T are presented on the person side of the Wright Map, but only the upper T is presented on the item side of the Wright Map. A second, and we believe a very good, reason for not presenting the S or T on the Wright Map simply has to do with the question of how much a researcher wishes to present in a figure. If the S and T symbols are not used as part of the analysis presented in a paper, then we believe it is much better to remove the S and T. Many readers of your Rasch articles are not going to be Rasch experts. The simpler you can make your figures, the better.

Formative Assessment Checkpoint #3
Question: What do M, S, and T show in a Wright Map? Should you use M, S, and T?
Answer: M shows the location of a mean (be it persons or items), S shows the location of one standard deviation from the mean, and T shows the location of two standard deviations. In statistics, the standard deviation, for normally distributed data, can be used to help one see the spread of the data. In a Wright Map you might consider including at least the M and the S in your plots, in particular if you wish to provide readers of an article an overview of the spread of your data. With a Wright Map, that of course can mean the spread of the items and the spread of the persons.


Added Cleaning Up and Simple Additions You Should Consider

The basic Wright Maps provided in an analysis can be immensely powerful for your research. We suggest readers consider some additional clean ups and easy additions. At the top of the Fig. 16.1 Wright Map is the text <more> and <rare>. If we had plotted the full Wright Map for Fig. 16.1, we would see at the base of the Wright Map the text <less> and <freq>. This can be seen in the Wright Map presented in “Research Report: Screening for Depression in Pregnant Women From Côte d’Ivoire and Ghana: Psychometric Properties of the Patient Health Questionnaire-9” (Barthel, Barkmann, Ehrhardt, Schoppen, & Bindt, 2015). One clean up that authors have used is simply writing the word frequent instead of freq in their Wright Map [see “Using Rasch Analysis to Explore What Students Learn About Probability Concepts” (Mahmud & Porter, 2015)]. Another change, one that we personally prefer, is the following for a final Wright Map for a publication: (a) insert words at the top and bottom of the vertical list of person measures, and (b) insert words at the top and bottom of the vertical list of item difficulties. These words, of course, will relate to the types of items administered to respondents, and the words will also be dependent upon the way in which respondents can answer the items. In the study “P-Drive: Implementing an Assessment of On-Road Driving in Clinical Settings and Investigating Its Internal and Predictive Validity” (Patomella & Bundy, 2015), the authors (on their Wright Map) placed the phrase “More able” at the top of the person measures, and at the base of the person measures the authors added the phrase “Less able drivers.” On the item side of the Wright Map, the authors added the phrase “More difficult” at the top of the Wright Map, and the phrase “Easier items” was added at the base of the Wright Map on the item side. If we were to make an edit to those additions, we would suggest using the phrases “More able drivers” and “More difficult items.” In the study “Educational Intervention in Patients Undergoing Orthognathic Surgery: Pilot Study” by Sousa et al. (2015), the authors used the phrases “more” and “less” on the person side, and “high” and “low” on the item difficulty side. As a header on the person side, the authors used a title in caps PERSON ABILITY and a title on the item side ITEM DIFFICULTY. We believe the key issue is to always include an identifier that explains what it means for an item to be at the low end of a scale, what it means for an item to be at the high end of a scale, what it means for a person to be at the high end of a scale, and what it means for a person to be at the low end of a scale. It is also important to use these phrases in your article or talk. A final example, but by no means the last example for readers, is the work of Stelmack et al. (2004). In their Wright Map, the authors added two bold vertical arrows.


On the left side (the person side of the Wright Map), the arrow has a notation at the top of the arrow of “Persons with least visual ability” and a notation at the base of the arrow of “Persons with most visual ability.” For the arrow on the right side of the Wright Map, readers will see the notation “Least Difficult Items” (above the arrow) and “Most Difficult Items” (below the arrow). We find it very helpful to add such clear guidance for readers, even if the information on how to read the Wright Map is contained in the figure caption and/or the article text. The bottom line is that it is critical to add guidance to readers as to the meaning of going up or down the Wright Map, both for the respondent side of things as well as for the items. Make sure to label the top and bottom of the scales for respondents and items. If you do not do so, it will be very difficult for a reader to understand the meaning of moving up and down your Wright Map.

Units

A basic Wright Map produced by Winsteps will allow you to do so much. However, when you present the Wright Map in your work, you will want, somewhere, to identify the units of your scale. When you run Winsteps, the measures for items and persons are in logits (unless you rescale the logit units). So, you may want to add a notation in your map with the word logits near your vertical scale. You must be careful, however, if you rescale to a user-friendly value, not to label the scale as being logits. In the article “The Match Between Everyday Technology in Public Space and the Ability of Working-Age People With Acquired Brain Injury to Use It” (Malinowsky & Larsson-Lund, 2015), the authors did do a rescaling (the vertical scale of the Wright Map ranges from 40 to 90), yet still added the term logits to the Wright Map. An example of a study presenting a Wright Map with a rescaled, revised scale is “Challenging Instructors to Change: A Mixed Methods Investigation on the Effects of Material Development on the Pedagogical Beliefs of Geoscience Instructors” (Pelch & McConnell, 2016). The Wright Maps in this study are scaled from 20 to 70; however, no units are presented in the Wright Map. An example of a Wright Map with the presentation of the word logits is in the work “Difficulty of Healthy Eating: A Rasch Model Approach” (Henson, Blandon, & Cranfield, 2010). Generally, most of the scales do not have units on the Wright Map [e.g. “The Environmental Action Scale: Development and Psychometric Evaluation” (Alisat & Riemer, 2015)]. If you are not rescaling, then make sure to note the units as logits. If you rescale, it is common to create a name for your rescaled units, and then to make sure you explain in your text how you rescaled the logits. But the bottom line of bottom lines is to make sure you label your scales with your units. Your tables, as well as your Wright Maps, must have units.
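For readers wondering how such rescaling is requested, Winsteps provides user-scaling commands. Below is a minimal sketch with illustrative values of our own choosing (they are not the values used in the studies cited above):

; rescale the logit metric to user-friendly units
UIMEAN = 50  ; place the mean item difficulty at 50 units
USCALE = 10  ; report 10 units per logit

With these settings, an item one logit above the item mean would be reported at 60 units rather than 1 logit.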


Formative Assessment Checkpoint #4
Question: You run some test data collected at a pre time point, and you collect some data at a post time point. You create a Wright Map for the pre data analysis. You create a Wright Map for the post data analysis. Why should you not put the two Wright Maps side by side and make some conclusions? Both maps are in units of logits. What’s the issue?
Answer: Logits can be unique to an analysis. Even if you are running the same test data, a logit (in this example) does not have the same meaning at the pre time point and the post time point if you have not anchored. So, be careful when you make use of side by side Wright Maps. If you wish to compare the same scale in the same units, make sure you have anchored.
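One common way to carry out such anchoring in Winsteps is to write the item calibrations from the pre analysis to a file and then anchor the post analysis to those values. A minimal sketch, with a file name of our own invention:

; in the PRE control file: write the item calibrations to a file
IFILE = pre_items.txt

; in the POST control file: anchor the items at the pre calibrations
IAFILE = pre_items.txt

With the post items anchored in this way, the pre and post person measures are expressed on the same scale, and side by side Wright Maps of the two time points can be compared directly.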

Beyond the Basic Wright Map Output

Thus far we have presented a brief discussion on how to make sure your Wright Map output is accurate (meaning not letting wrap around misrepresent the location of an item) and on whether you should retain all the information that is printed on your Wright Map.

Side by Side Wright Maps

A very useful way to make a point with Wright Maps involves the side by side presentation of two (or more) Wright Maps. Of course, two side by side Wright Maps usually imply two separate Rasch analyses in a paper or presentation. Moreover, there must be a reason for presenting the two maps side by side. The side by side presentation of two Wright Maps can be very useful in a paper. In a moment we will discuss why such side by side maps can be very useful. Before we get to this point, we want to mention some things that researchers will want to observe and check. Usually when researchers place Wright Maps side by side, the two Wright Maps present items and respondents along the same variable. Moreover, the researcher’s goal or hope is to communicate some patterns in the two Wright Maps. Maybe to comment upon the different locations of respondents on the two Wright Maps? Maybe to comment upon the different locations of some items on the two Wright Maps? If an author wishes to comment about two Wright Maps which involve the same scale, and suggest some observations, the researcher must ensure that the two maps are on the same metric. This does not just mean that logits are used to express the scale on both maps; rather, one must ensure that the meaning of a logit is the same from map to map.


Fig. 16.7  Two Wright Maps from a Winsteps analysis of a pre and a post data set. As the result of item anchoring, the test items are in the same difficulty location. If items are not in the same location from pre to post, then it may be that items were not anchored, and thus it may mean one cannot compare respondents

In Fig. 16.7, we present two Wright Maps. One map presents pre-person measures and pre-item measures. The second map presents post-person measures, and importantly, the second map is anchored to the pre-item difficulty values (ensuring that the same scale is used to compare pre and post respondents). That is why the items are at the same item difficulty location on both Wright Maps. We present two Wright Maps in Fig. 16.8 that can be quite tempting for beginners in their use of Rasch. In this figure, an analysis was made of the GCKA pre-data, and a Wright Map was created. A second analysis was made using the GCKA post-data; however, no anchoring was conducted to link the scales. Beginners are exceedingly tempted to present both maps, and to suggest conclusions regarding changes that occurred from pre to post for respondents and items.

Beyond the Basic Wright Map Output 2

+

1

XX

XXXX XXXXX XXXXXXXXX 0

X XXXXXXX XXXXXX XXXXXXXXX XX

-1

XX

| | | | | | | | T| | |S + | | | | S| | | | | | | +M M| | | | | | | | S| | | + |S |

2

Q25,

Q23,

229

Q35,

+

Q39,

1

Q19,

XXX XXXX

Q45,

XXXXX

Q3,

XXXXX Q31, Q27, Q21,

Q37,

0

Q43, Q7, Q13,

X XXXXXXX XXXXXXX

Q11,

XX

Q15, Q29,

XXXXXXXXX

-1

XXXX

| | | | | | | | | T| |S + | | | | S| | | | | | | M+M | | | | | S| | | | | | + T|S |

Q25, Q19,

Q39,

Q35, Q23, Q21, Q3, Q31, Q7, Q37, Q27,

Q45,

Q43,

Q13,

Q11, Q15,

Q29,

Fig. 16.8  A Winsteps analysis of pre and post data from the same measurement instrument. If one were investigating, for example, DIF, then no item anchoring takes place, and it is not unexpected to see shifts in item difficulty. It will require further investigation on the part of the researcher to evaluate if significant DIF is present

The problem is, although the pre analysis results in data expressed in logits, and the post analysis results in data expressed in logits, the meaning of the logits is not the same. Thus, we warn researchers learning Rasch: if you wish to make comparisons with side by side Wright Maps, be aware that the scales (even of the same variable) cannot be assumed to be the same unless some sort of anchoring is conducted. If you wish to make some sort of conclusion using two side by side Wright Maps for the same scale, the bottom line is to carefully think through whether your side by side Wright Maps need to be anchored. If you are investigating DIF, then there would not be any anchoring, because you are attempting to see whether there are any shifts in items. But if you are investigating something else, you will most often want to make sure you have the same scaling for the two Wright Maps.


There are some instances in which a researcher may wish to present two Wright Maps that are side by side, but not on the same scale. One instance is when a researcher wants to show through a figure the types of changes in the measurement system that are taking place when, for example, a researcher measures a sample of girls and a sample of boys. Below we provide two Wright Maps (Fig. 16.9) to make this point. The Wright Map on the left is an analysis for boys taking a test, and the Wright Map on the right is an analysis of girls taking the test. Again, a researcher must ensure that readers understand that the scale is not the same scale for the two groups. However, it can be useful to present two such maps, and then to identify those items that seem to have moved when a researcher is comparing the two maps. This is what can be done to study DIF. Readers should reread that chapter if they wish. An excellent Wright Map that uses this technique to point out how items have shifted between groups is “Parent Perceptions of Children’s Leisure and the Risk of Damaging Noise Exposure” (Carter, Black, Bundy, & Williams, 2016).

Fig. 16.9  Two side by side Wright Maps utilizing Winsteps. One map is from an analysis of just boys. The other map is from an analysis of just girls. Arrows can be added to identify those test items which seem to have shifted


That study presented two side by side Wright Maps, added arrows to point out items that had moved, and, in the figure containing the two Wright Maps, provided the results of a DIF analysis of the instrument items. In Fig. 16.9 we include an arrow to show the shift of an item. This is what the authors do in the article “Parent Perceptions of Children’s Leisure and the Risk of Damaging Noise Exposure” (Carter et al., 2016). In this article, the authors use one color for items going up and another color for items going down. Of course, researchers will want to evaluate DIF to make sure an item shift is meaningful. We provide this arrow simply to show researchers how they might conceive of communicating any shift that is observed. Another example of two side by side Wright Maps that do not have to be on the same scale concerns studies in which data are collected on two or more variables. In many studies a test might be given in two different subjects. Those two subjects can of course be considered as two different variables. In many instances it is very useful for the researcher to place two Wright Maps side-by-side for the two variables. Then the researcher, for example, attempts to look for the story in each variable and also what changes are taking place in each variable (as defined by test items) as one goes further up the variable. Although the two variables are different, it can be very useful to the researcher to use these side by side maps to better identify patterns in the data. For example, does the change in math item difficulty match the changes a researcher sees in the change in science item difficulty? A very good article that presents two side-by-side Wright Maps, and presents a hybrid of the points we have made above, is the study “Rasch Analysis of a New Hierarchical Scoring System for Evaluating Hand Function on the Motor Assessment Scale for Stroke” (Sabari, Woodbury, & Velozo, 2014). In this study, two different sets of items were evaluated to measure hand function. One set of items involved hand movement and another set of items involved hand activities. The two Wright Maps (one for each of the traits) were placed side-by-side in two separate figures. The key issue, we think, is to make sure figure captions and text address whether two side by side Wright Maps involve the same variable. Side by side Wright Maps are, we feel, often underutilized to tell the story of your data. It is through a side by side Wright Map that readers will be able to more easily see the shift and movement in items and respondents. In cases where anchoring is taking place, there should be no movement of items in a side by side Wright Map (in the case of item anchoring). If you are presenting two side by side Wright Maps for two different variables, think through very carefully whether you wish to make the side by side figure. You will not want readers to think you are discussing the same variable in both maps.


Formative Assessment Checkpoint #5
Question: When presenting side-by-side Wright Maps, what are some key issues you must be careful to consider?
Answer: When presenting side-by-side Wright Maps, an author wants the reader to make some sort of comparison. For example, are persons in different locations from one Wright Map to the other? Are items in a different location? If side-by-side Wright Maps are made, in many instances it will be important to ensure that the scale is the same for the two Wright Maps. So, make sure the same logits are being shown in each Wright Map. Also, make sure the physical distance (on paper) per logit is the same in each Wright Map.

Piling Up Items (or Persons) on a Wright Map

When creating a Wright Map, it is important to make sure that a researcher has a scale that is wide enough to plot the range of data for the persons and items on a Wright Map. If all persons have measures that range from 2 to −2 logits, and all items have measures that range from 1 to −1 logits, then a researcher who wants to accurately plot the location of each person and item must have a Wright Map scale that runs from 2 to −2 logits. We have seen some situations with Wright Maps when a piling up of items (or persons) exists due to the scaling of the Wright Map. It is possible that in some cases this was the author’s intent, but we think in many cases it was not. Below we present a Wright Map for a data set with a scale that creates a piling up of items (Fig. 16.10). The important point for readers is to be aware of the possibility of piling up of items or persons. Clearly, if a piling up of items exists at the top or bottom of the Wright Map, it does not mean that all items cut the variable at that location! For example, a piling up of items at the top of the Wright Map just means that all the items at the top of the map have at least the maximum value possible on the printed Wright Map. Thus, if the maximum value of the scale is 2.0 logits, and there are a number of items listed on a line at the top of the Wright Map, these items could have measures of 2.0 logits and higher. Also, if a piling up of persons exists at the top or bottom of the Wright Map, it does not mean that all those respondents have that measure. In our review of the paper “Educational Intervention in Patients Undergoing Orthognathic Surgery: Pilot Study” (Sousa et al., 2015), it looks to us that such a piling up of items in the Wright Map might be present. So, just be careful! You can always refer to your item measure table and your person measure table to check whether a piling up of items or persons exists!


Fig. 16.10  The original plot of persons and items when a scale of −2 to 2 is used. A second plot with a scale of −1 to 1 is also provided. Note the piling up of items and to some degree the piling up of persons. Wright Maps provided by Winsteps


Fig. 16.11  A Winsteps Wright Map presenting person measures and item difficulty for a 13-item rating scale survey

Wright Maps for Surveys

Side-by-side Wright Maps for surveys can be very useful communication tools. However, an issue exists that we have seen a number of times, and we want to warn readers about it. This warning also extends to the analysis of partial credit data. Readers will recall our frequent use of the self-efficacy data set collected through use of the Enochs and Riggs (1990) survey to explain Rasch techniques. We present a Wright Map (Fig. 16.11) for the control file cf se survey. On the left side are the person measures, and on the right side are the item measures. The locations of the persons are accurate. The locations of the items are accurate. However, when evaluating how persons are measured by the survey items and how the items cut the variable, a researcher must be careful.


Such care is necessary because the items are answered with a rating scale. This means that the items are really working for a researcher to measure the respondents over a larger range of the scale. This, in turn, means that just because an item, for example Q6, is above a person, it does not mean that the item is not helping to measure the person. To consider this issue, let us look at Q8 of Fig. 16.11. Although that item is below all the respondents, that item will still be helping in some manner to measure these respondents. This is because the item is answered with a rating scale. If the rating scale were Strongly Agree, Agree, Disagree and Strongly Disagree, it may be that this item is not being answered by the respondents using the full scale. It may be that respondents are using only Strongly Agree and Agree to answer items. Table 2.2 of Winsteps, for example, can be used to help a researcher begin to understand how the rating scale helps to measure respondents. In this chapter, we will not discuss this table, but we want readers to be careful about their use and interpretation of Wright Maps when a rating scale is used, or when partial credit data are used. From time to time we see authors present a Wright Map constructed from rating scale data, and then some assertions are made about the scale that are, unfortunately, a bit overboard. Such an assertion would be valid for a test with dichotomous items, but not with rating scale items (when there are more than two rating scale steps). Why does this problem occur? We believe that once researchers have learned how to read a Wright Map for dichotomous data, they wish to immediately extrapolate their Wright Map reading techniques to a Wright Map created from rating scale data. The bottom line is to try your best not to have the problem of numerous respondents and items being plotted on one line at the top and bottom of your plot. If the respondents and items do not have the same measure, then consider expanding your scale so that you can see exactly where the items and persons are located.

Formative Assessment Checkpoint #6
Question: You have completed an analysis of a rating scale survey data set (six steps to the scale). You have reviewed your Wright Map, and you notice that three items are higher than all the respondents. The three items are slightly above the sample of respondents. Does this represent a mistargeting of three survey items? Are these three items not helping you very much in your analysis?
Answer: In the case of the three survey items noted, things may not be that bad. The reason is that the items are rated with a rating scale. This means that for these three items, maybe some rating scale categories would be predicted to have been rarely selected by respondents, but it is also the case that some of the rating scale categories could be predicted to have been used by respondents. Be careful about using all the techniques for Wright Maps built with dichotomous items when evaluating Wright Maps from survey items.


How to Present Items? How to Present Respondents?

When using a Wright Map, researchers present the measures of persons and items. Most Wright Maps have a limited number of items (most tests and surveys have fewer items than respondents). This means that normally, each item can be presented in some manner in the Wright Map. For the respondents, this is often not the case. If a large number of persons exists, Winsteps presents respondent measures with a symbol that represents a certain number of respondents, while a period (.) indicates a smaller number of respondents. For example, in the work of Wells, Bronheim, Zyzanski, and Hoover (2015), the authors present a Wright Map in which a pound sign (#) is used to indicate five respondents, and a period (.) is used to indicate one to four respondents. This is the common plotting technique used in Winsteps, and it is the plotting technique we have commonly used in our work! We believe that in some cases it might be advantageous to use a technique in which a symbol (for example an X) is used to plot all of the respondents, even if a large number of respondents exists. We think that in some cases, perhaps, an X for each respondent can better visually communicate respondents' measures. Pensavalle and Solinas (2013) used an X to plot a very large number of respondents (each X represented one to three respondents).

Below we present a command (Fig. 16.12) to fine-tune our plotting of a Wright Map (Fig. 16.13). The first Wright Map is the default output; for this data set each # represents two respondents. The second map is the plot that we are able to request by inserting the following command into our control file. By inserting this command, we were able to plot each respondent, meaning each # represented one respondent. This command allows us to specify, for a table (we chose Table 1 of Winsteps, which also provides a Wright Map), how many persons and how many items will be represented by a #. The first 1 that readers see in the command refers to the table a researcher wants to alter. Then four dashes are used, then a 1 is entered to indicate the number of persons per #, and a third 1 is used to indicate the number of items to be represented by a #. In Fig. 16.13 we first present the default Winsteps Wright Map plot, and then we present the Winsteps Wright Map produced through use of the TFILE command listed below (Fig. 16.12). Notice the default map plots using a # for 2 respondents and a dot for 1 respondent. The Wright Map with the added TFILE command results in each respondent being plotted with a single symbol. If a researcher wants to show the item names, it is a simple step to edit the Wright Map and insert item names.

Fig. 16.12 The Winsteps command which allows one to specify details of an output table. This example uses table 1
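Because Fig. 16.12 may not reproduce clearly here, the following is a sketch of the command exactly as described in the text above; the surrounding control file is assumed, and the layout of the option fields follows the description of the figure rather than any one Winsteps version.

; request Table 1 with one respondent per # and one item per #
TFILE=*
1 - - - - 1 1   ; table number, four default fields (-), persons per #, items per #
*

The dashes tell Winsteps to keep the default settings for the intervening fields; only the two trailing values (persons per symbol and items per symbol) are changed.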


Person Labels in a Wright Map

We have presented a common type of output for a Wright Map up to this point in our Wright Map discussion. We used a symbol to indicate a person and an item name to indicate each item. Inclusion of an item name can be quite informative in terms of spotting patterns and stories in the ordering of items that define a trait. It can also be very helpful to plot persons in a Wright Map so that person labels of some sort are used. Below (Fig. 16.14) we present such a plot. This plot is a Wright Map of the Knox Cube Test data that is available with Winsteps. This test is also used, in part, in the book Best Test Design (Wright & Stone, 1979). For readers considering whether to add a designation to the person variable, and to plot these designations, we have provided a plot using Table 1.0 of Winsteps. The control file was altered to read an F or M as the name of each respondent.

Fig. 16.13  The default Wright Map for Winsteps in which each # represents two respondents and a second Wright Map for the same data set in which all respondents are plotted


The letter M indicates a respondent was male, and the letter F indicates the respondent was female. It is important not to confuse the single letter M showing the location of the mean respondent measure with the location of each individual male respondent; this mean measure is shown with a letter M very close to the 0 logit value. The important aspect of this plot for readers is that, as in all of our recent Wright Maps, the test items are organized from least difficult at the base of the map to most difficult at the top of the map. Also, the respondents are organized from lowest ability (at the base of the map) to highest ability (at the top of the map). Each respondent is plotted as a function of gender (the name of each respondent was either an F or an M). What is gained by such a plot with person names? A researcher can use all the statistics in the world, but the human brain is very good at looking for patterns in data, and potential patterns can suggest specific statistical tests that might be important to conduct. In this plot it appears there is no difference in the distribution of male and female respondents. Also, such Wright Map plots with respondents (shown with some sort of identifier) can help greatly when discussing a data set and what was observed in a Wright Map.

Fig. 16.14  A Wright Map from Winsteps which utilizes an M or an F to plot the person measure for each respondent


In our review of Wright Maps, only rarely have we seen researchers present a Wright Map with a person identifier. One example of this plotting technique is "User Acceptance of a Touchless Sterile System to Control Virtual Orthodontic Study Models" (Wan Hassan et al., 2016). The bottom line for your Wright Map presentation of respondents is that although you might present each respondent with one symbol, it might better tell your story if you plot your respondent measure data using some sort of easily understood code. This will allow you, and your readers, to identify patterns in the distribution of respondents.
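As a concrete illustration of the approach just described, the fragment below sketches how a control file might read a one-character gender code as the person label. This is a hypothetical sketch: the column positions, item count, and data records are illustrative only.

; plot an F or M for each respondent on the Wright Map
NAME1  = 1      ; person label starts in column 1 of each data record
NAMLEN = 1      ; person label is one character long (F or M)
ITEM1  = 3      ; item responses begin in column 3 (layout illustrative)
; example data records:
; F 111011011100100000
; M 110110101000000000

With this labeling in place, the Table 1 map plots each respondent as an F or an M, as in Fig. 16.14.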

Special Additions to Wright Maps

As we have detailed, a Wright Map consists of many parts. As readers can see, we believe strongly that researchers should up their game with regard to the Wright Map. There are just so many reasons to take time with your Wright Map. In this final section, we present some of the fine-tuning we have seen in Wright Maps that goes beyond what we have discussed up to this point. One type of Wright Map was presented in "The Hierarchy of Ethical Consumption Behavior: The Case of New Zealand" (Wooliscroft, Ganglmair-Wooliscroft, & Noone, 2013). In this Wright Map, the authors decided to present not only the items of their data set, but also an indication of a Stage of Change to the right of the items. This was done by providing the name of each stage, and also indicating where a stage began and ended. We have edited our Knox Cube Wright Map and added fictitious stage names to show readers how these authors organized their own Wright Map (Fig. 16.15).

[Fig. 16.15 presents the Knox Cube Wright Map annotated with fictitious stage-of-development bands, from top to bottom: High School, Middle School, Elementary School 5-6, and Elementary School 3-4.]

Fig. 16.15  A Wright Map in which different stages are presented. The stages are determined by using theory and also reviewing the location of item difficulty. The left side of the Wright Map is from Winsteps


We believe that this organization of a Wright Map is very useful in certain situations, for example, when a researcher wishes to discuss the variable and how it increases in some manner from the bottom of the map to the top of the map. For readers who may know only a little about the test items, it is possible to summarize what it means to move up the map. Another aspect of this type of map, which we discussed in RAHS Book 1, is that the widths of the stages are not the same. This occurs because the stages are defined by considering which items students should correctly answer when they are at a specific stage of development. Also, when using a map such as the one we present above, it can be useful to describe stages with words, as opposed to phrases describing how old the students are (as we have done in our Wright Map). In the article "The Hierarchy of Ethical Consumption Behavior: The Case of New Zealand," the following words were used to describe what the authors named "Stage of Change": Maintenance, Action, Preparation, Contemplation, Precontemplation (Wooliscroft et al., 2013, p. 69). The bottom line is that, yes, one can create a Wright Map with just items and respondents. However, if there is a story in the items whereby you can classify what it means to move up the variable, then it is very helpful to add information regarding different bands of difficulty as one moves along the variable. These bands might refer to the item topics (addition, subtraction, and so on) and/or to different performance levels of respondents (grade level, proficiency).

A close relative of this way of editing a Wright Map is provided by the authors of "Education Accountability and Principal Leadership Effects in Hong Kong Primary Schools" (Hallinger & Ko, 2015). As we have done before, we edit our Knox Cube Test data to summarize how those authors edited and fine-tuned their Wright Map (Fig. 16.16). In their work, the authors named each of the regions defined by the solid line boundaries, and those names appeared vertically to the left of each region. The authors also used colors for each different region of the Wright Map, and they varied the intensity of color within a region as a function of whether the area was for respondents or items. For example, in our plot, for the region of persons and items falling between approximately 3.9 logits and 1.2 logits, the part of the region for persons might have been colored dark blue, and the part of the same region for items might have been colored light blue. One of the most useful additions you should consider for your Wright Maps is a marking of different regions to indicate different levels of a trait. Of course, your markings must be based on theory. But if you can decide upon different regions of your trait, then show them in your Wright Map.


Fig. 16.16  Potential boundary lines determined by item topic. These boundaries can be named and identified with differing colors. Initial Wright Map provided by Winsteps


Formative Assessment Checkpoint #7
Question: Why might it be useful to mark regions of a Wright Map with lines indicating different levels of a variable?
Answer: As readers know, the ordering of items along a variable should tell a story. The ordering of persons should tell a story as well. We suggest that readers attempt, when possible, to mark the location of different levels of a variable on their Wright Map. Often policymakers are interested in knowing where respondents lie with regard to a variable, and it helps to have levels indicated. Also, it can be helpful to researchers to know where different items lie with respect to a variable. When you can start to tell a story, or expand a story that someone else has already started, we suggest considering adding lines to your Wright Map to mark different parts of a variable.

Final Touch Ups

Using Wright Maps to communicate results clearly is key to their successful use. As readers will know, the Wright Map can help tell many stories: how well the instrument functioned, how students might progress in their learning, or how patients might exhibit aspects of a disease. In this final part of the chapter, we review some of the techniques we have seen in Wright Maps, and we comment on which techniques seem to work and which do not. Our goal is that readers can develop the best Wright Maps possible.

One technique of editing a Wright Map concerns the highlighting of test items. In the article "Development of an Instrument to Measure a Facet of Quality Teaching: Culturally Responsive Pedagogy," Boon and Lewthwaite (2015) highlighted key items that appeared in the Wright Map. They used highlighting to indicate the location of types of items. For example, imagine you have created a Wright Map that involves different types of simple mathematics items. In such a case, the math test items that involve division might be highlighted with one color, and all the items that involve addition might be highlighted in a different color. When color presentations are possible, we believe that color does make a difference. But, of course, if no colors are possible, then it is possible to use techniques such as underlining items or placing a square around an item as a way of identifying a type of item. The key point is that if you have different groups of items in your Wright Map, then it is helpful to highlight those groups using colors or at least different fonts.

Certainly, using an identifier with an item on a Wright Map need not serve only the goal of identifying groups of items (e.g., in terms of content or skills exhibited with an item). For example, in Ilian, Parry, and Coloma's work, Rasch Measurement for Child Welfare Training Evaluators, the presenters provided a Wright Map in which they underlined certain items and identified those items as "Negative Discrimination and/or Higher Difficulty Compared to One or More Other" (Ilian, Parry, & Coloma, 2010).


Fig. 16.17  The use of underline and bold to ID specific items in a Wright Map provided by Winsteps

Fig. 16.18  One way of item naming: a technique in which the item text is written after the item number. This is a good technique, but it does not work well when an item has very long text. Initial Wright Map provided by Winsteps

Above we provide a Wright Map (Fig. 16.17) in which we pretend that the three items that are bolded and underlined exhibited negative discrimination in an analysis.


The final issue that we wish to discuss is providing item text in a Wright Map. When items are plotted in a Wright Map, it is very easy for researchers and readers to read the map (from, say, bottom to top) and to note how items change in difficulty from easy to hard. If only the item name (e.g., Q41) is used in the map, the reader has no idea of the content of the item. Certainly, one way of communicating the content of the item might be through the highlighting of different item types in the Wright Map. In addition to highlighting items, authors have used other techniques. One technique is a shorthand item name that quickly summarizes an item. Having a short name will help the item name to be printed in an understandable manner in the Wright Map. If the item name is too long, then readers will often get a jumble of text that is hard to follow. Figure 16.18 shows one example of such item naming. In this example the item number is presented (e.g., Q7) as well as the item presented to respondents (2 × 10). Below we present the type of technique that Zain, Samsudin, Rohandi, and Jusoh used in their study, "Using the Rasch Model to Measure Students' Attitudes Toward Science in 'Low Performing' Secondary Schools in Malaysia" (2010). In that study, arrows were used to guide the reader to the full text of each item. Teoh, Alrasheedy, Hassali, Tew, and Samsudin (2015) used a similar arrow technique. In Fig. 16.19 we show how the use of arrows and the insertion of text can improve a Wright Map. There are, we feel, pros and cons to this procedure. Adding item text can make it easier for a reader to interpret the Wright Map, but due to spacing, the line of text (e.g., "I am confident teaching") may not sit at the exact difficulty level shown for the item (e.g., Q7).

Fig. 16.19  Arrows can be useful to create space in a plot. This can help one, for example, provide the full text of items. Initial Wright Map provided by Winsteps
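To produce the kind of shorthand item names shown in Fig. 16.18, the item labels can be edited where they are listed in the control file, between &END and END NAMES. The fragment below is a minimal, hypothetical sketch; the labels themselves are illustrative.

&END
Q1 2+3        ; item number followed by shorthand item text
Q7 2x10
Q15 12+17
END NAMES

Keeping each label short also helps prevent the wrap-around problem explored in Activity #2 at the end of this chapter.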


A Potential Wright Map Checklist

Each Wright Map is unique, but make sure to spend time constructing your Wright Map so that you can best tell the story seen in your data. Wright Maps provide the opportunity to plot person measures and item measures on a linear scale, and that scale is the same for persons and items. Make the most of your Wright Map by considering the additions we detail above, and by making sure you do not make some of the simple errors that are common in many Wright Maps.

• Do you need to present both persons and items to make your point? If your interest concerns the change in item difficulty, and the story told by the item order and spacing, then skip the persons.
• Do you need to show the items in the Wright Map? Perhaps your main story concerns the differences in the persons? In that case maybe you only need to show the persons in your Wright Map. However, one of the great advantages of Rasch measurement techniques is that we can explain the meaning of a person measure. When someone has a measure for a test of dichotomous items, we can use our Wright Map to explain which items the person would be predicted to have answered correctly and which not.
• Have you shown your scale?
• Are your items identified in a meaningful manner in your Wright Map? Have you described your items, using the full item text or shorthand?
• When you plot your person measures, do you want to use a code to indicate what type of person had which measure? Sometimes plotting the person measure data in this manner will help you, and your readers, see patterns in the data. Maybe you plot an F for the female test takers and an M for the male test takers.
• Winsteps provides many notations on the Wright Map. Do you need all the notations? For example, maybe consider including only the notations for the location of the mean item and the mean person. Thus, remove the S and T on the Wright Map unless you will be using those notations in your text.
• When you have your final Wright Map, have you considered the type of font? Does the font clearly and aesthetically express the map? Have you added a solid vertical line in your Wright Map?
• Have you added notations on your Wright Map to indicate what it means to be the highest measure respondent, the lowest measure respondent, the highest item measure, and the lowest item measure?
• If comparing two or more Wright Maps, have you thought carefully about whether the maps need to be on the same scale? If two maps are not for the same variable, then of course you do not have the same scale. But if you are comparing two maps for the same variable, there are cases when you must ensure the scales are the same. A scale being the same does not just mean that logits are reported for both scales.
• Have you considered adding details to the Wright Map that will help you tell the story that you wish to tell? If you are interested in stages that respondents might be moving through, then you will want to indicate which items represent which stages.


• Have you been careful with wrap around for your items and your respondents?
• Spend time on your Wright Map; do not simply cut and paste it.
• When you plot items on your Wright Map, there may be several items along one line of the Wright Map (meaning that, for the purposes of graphing, the items are shown as being at the same measure). Consider the order of the items along the line. It may be that you want the items to appear in numerical item ID order (e.g., Q2, Q14, Q21, Q26). Such an ordering is easier for a reader to review and link to other tables (such as the Item Entry Table) that you most probably have presented earlier in your article. However, it could be that you should order items along one line using some other criterion. For example, if we imagine that there were graphing and nongraphing items on a test, you might want to present all the items along a single line sorted by graphing/nongraphing and then by item number (G-Q2, G-Q21, NG-Q14, NG-Q26).

Charlie and Chizuko: Two Colleagues Conversing
Chizuko: Well, well Charlie....it seems to me that I have not seen you for days...I keep seeing you reading these articles and circling Wright Maps. What's going on?
Charlie: Wow! I have learned a lot about Wright Maps, and now I have so many ideas as to how I can better use Wright Maps. Many Rasch articles have a Wright Map and use the map to evaluate instrument functioning, while some use Wright Maps to tell a story. But I think many Wright Maps can be a LOT better.
Chizuko: Can you give me some examples that I can think about? I'd like to improve my Wright Maps in my papers.
Charlie: Okay here is the first thing...I was sort of amazed, most of the time it looks to me as if people use the Winsteps output "as is" in their papers. The "as is" output is perfect for running the analysis, but I think the maps can be made to look much better for a final presentation. Also, sometimes there is information that can be removed, and some information that can be added.
Chizuko: Like what?
Charlie: The Courier font is totally okay as I run analyses. But, just as I take Winsteps tables for publications and put the tables in another font, I would like to do that for my Wright Maps as well.
Chizuko: Other things.....
Charlie: Well I can't just spew out everything in one sitting, but I think it is important to both remove some things from my Wright Map, and also add some things. For example, I am not sure I need to show the S and the T in my Wright Map, but I think in every Wright Map, I need to show some sort of label that indicates what it means for a person to have a low or high measure. Also, I need some sort of label to indicate what it means for an item to have a low or high measure.
Chizuko: Wow! Wow!
Charlie: Okay just one more thing...lots of times in publications I noticed side-by-side Wright Maps presented to show a story.


Sometimes it looks like it might be a Wright Map for a pre-data set and a Wright Map for a post-data set. The deal is, it looks like sometimes the scales were not anchored. In a lot of cases you need to anchor those scales....

Keywords and Phrases
Wright Map
Wrap around
Logits
When comparing side-by-side Wright Maps, consider whether the two analyses need to be anchored.
For final Wright Maps, use a font that is attractive.
Provide guidance to readers so they can easily interpret the meaning of high and low measures of both respondents and items.
Consider, when appropriate, labeling different stages in Wright Maps.

Potential Article Text
Two side-by-side Wright Maps are presented. The left Wright Map presents the results of the pre-data collection, and the right Wright Map presents the results of the post-data collection. The scales of the two Wright Maps were linked through item anchoring, in which the pre-item measures were used as anchors for the post items. As a result of the linking, the items appear in identical locations in both Wright Maps. The growth in respondents can be seen in the upward shift of the 3561 test takers; this shift appears as a higher M value of respondents in the post-Wright Map compared to the pre-Wright Map.

Quick Tips
Use TFILE to plot respondents on a Wright Map.
Watch for wrap around.
When presenting side-by-side Wright Maps for the same variable, make sure your Wright Maps are anchored to the same scale. If the two maps do not present the same variable, be careful about presenting the Wright Maps side-by-side.


Data Sets (Go to http://extras.springer.com)
cf se survey
pre test for Wright Map part 3 chp
post test for Wright Map part 3 chp

Activities

Activity #1
Take one of your own test data sets with items that can be scored 0 and 1 and run an analysis. First, review your Wright Map. How many persons are indicated with a # and how many items with an ID? Then, use the TFILE command in the control file to change the number of persons or items that are shown with a #. Print out both Wright Maps: before the use of TFILE and after the use of TFILE. Did it make a difference which type of plot was used? Can you do a better job interpreting your data with one particular plot?

Activity #2
Edit your test control file. More specifically, type out an extensive one-line descriptor for each of the test items in your control file where you find the names of each item. After you type in these extensive names, save your control file. Then rerun your analysis and review your Wright Map. What has been the impact of the very long item names?
Answer: The long names can cause wrap around. A long item name can make it appear as if other items are shifted downward. Such wrap around can make it difficult to read the Wright Map and know where items are located.

Activity #3
Go back to your initial Activity #1 Wright Map, before you made any edits in your control file. Look at the range of your persons and items. Do you have items and persons that are greater in measure than one logit, or smaller in measure than negative one logit? If you do, make an edit in your control file by adding the command MRANGE = 1 (see the sketch below). This will provide a plot that ranges only from 1 logit to −1 logit. Run your analysis. Do you see items and/or persons piled up at 1 logit and/or −1 logit? If you do, why would such a plot be problematic?
Answer: When items or people are piled up at the top or bottom of the scale, that can cause a big problem: all the items/persons that are piled up look as if they have the same measure, but in reality, they may not have the same measure.
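For Activity #3, the added command is a single line anywhere in the control section of the control file:

MRANGE = 1    ; display Wright Maps from +1 to -1 logits

Items or persons with measures beyond this range will appear piled up at the extremes of the plot, which is exactly the interpretive hazard the activity asks you to notice.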


Activity #4
This chapter includes two data sets. One data set is a pre-data set (pre test for Wright Map part 3 chp) and the other is a post-data set (post test for Wright Map part 3 chp). Create the two control files, and generate a Wright Map for the pre-data set and a Wright Map for the post-data set. Place the two Wright Maps next to each other. As the Wright Maps are not anchored to each other, you may see a different distribution of items on each Wright Map. Also, you may see that each map is plotted with different maximum and minimum logit values. After you have created these two Wright Maps, use the linking chapter of RAHS Book 1 (and item anchors) to link the pre-data set to the post-data set. After you have linked the pre- and post-data to the same scale, generate a Wright Map for the pre-data and a Wright Map for the post-data (you will use your pre control file to get the pre-Wright Map, and you will use your post control file to get the post-Wright Map; make sure to use item anchors when you are conducting your post-test analysis). Then place the two Wright Maps next to each other, and you will see the items are in the same locations.
Answer: It is only through linking that two analyses can be placed on the same scale. Just because you might collect data with the same instrument, you cannot assume that the way in which items define the scale will be identical from data collection to data collection. As this is the case, you must carefully link. Without linking, we feel you should not put two Wright Maps side-by-side for the same scale, unless you are attempting some first steps at interpreting DIF.

Activity #5
Find a data set in which you have both pre-data and post-data. First, run the pre-data and print out a Wright Map. Then, run just the post-data and print out the Wright Map. Place the Wright Maps next to each other. What do you see? What is potentially amiss?
Answer: Depending upon your data set, you will most likely see that a different range of logits is displayed on each Wright Map. Also, you will most likely see that items are not in the exact same locations. Most often this is due to your not having anchored the items.
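For the anchoring step in Activity #4, an item anchor file can be supplied in the post-test control file. The fragment below is a hypothetical sketch: the entry numbers and logit values are illustrative only, and in practice the anchor values come from the pre-test item calibrations (for example, from the pre-test item output).

; post-test control file fragment: anchor items to pre-test calibrations
IAFILE=*
1  -1.23    ; item entry number, then its pre-test measure in logits
2   0.47    ; (values illustrative)
3   1.05
*

When these anchors are applied, the post-test Wright Map is expressed on the pre-test scale, so side-by-side maps show the items at identical locations.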


References

Alisat, S., & Riemer, M. (2015). The environmental action scale: Development and psychometric evaluation. Journal of Environmental Psychology, 43, 13–23.
Barthel, D., Barkmann, C., Ehrhardt, S., Schoppen, S., & Bindt, C. (2015). Screening for depression in pregnant women from Côte d′Ivoire and Ghana: Psychometric properties of the Patient Health Questionnaire-9. Journal of Affective Disorders, 187, 232–240.
Boon, H. J., & Lewthwaite, B. (2015). Development of an instrument to measure a facet of quality teaching: Culturally responsive pedagogy. International Journal of Educational Research, 72, 38–58.
Carter, L., Black, D., Bundy, A., & Williams, W. (2016). Parent perceptions of children's leisure and the risk of damaging noise exposure. Deafness & Education International, 18(2), 87–102.
Dehqan, A., Yadegari, F., Asgari, A., Scherer, R. C., & Dabirmoghadam, P. (2017). Development and validation of an Iranian voice quality of life profile (IVQLP) based on a classic and Rasch rating scale model (RSM). Journal of Voice, 31(1), 113.
Enochs, L. G., & Riggs, I. M. (1990). Further development of an elementary science teaching efficacy belief instrument: A preservice elementary scale. School Science and Mathematics, 90(8), 694–706.
Fenwick, E. K., Rees, G., Holmes-Truscott, E., Browne, J. L., Pouwer, F., & Speight, J. (2016). What is the best measure for assessing diabetes distress? A comparison of the problem areas in diabetes and diabetes distress scale: Results from diabetes MILES-Australia. Journal of Health Psychology, 23(5), 667–680.
Franchignoni, F., Giordano, A., Ferriero, G., Muñoz, S., Orlandini, D., & Amoresan, A. (2007). Rasch analysis of the Locomotor Capabilities Index-5 in people with lower limb amputation. Prosthetics and Orthotics International, 31(4), 394–404.
Hallinger, P., & Ko, J. (2015). Education accountability and principal leadership effects in Hong Kong primary schools. Nordic Journal of Studies in Educational Policy, 2015(3). https://doi.org/10.3402/nstep.v1.30150
Hassan, W. N. W., Abu Kassim, N. L., Jhawar, A., Shurkri, N. M., Baharin, N. A. K., & Chan, C. S. (2016). User acceptance of a touchless sterile system to control virtual orthodontic study models. American Journal of Orthodontics and Dentofacial Orthopedics, 149(4), 567–578.
Henson, S., Blandon, J., & Cranfield, J. (2010). Difficulty of healthy eating: A Rasch model approach. Social Science & Medicine, 70(10), 1574–1580.
Ilian, H., Parry, C., & Coloma, J. (2010, May). Rasch measurement for child welfare training evaluators. Presented at the National Human Services & Training Evaluation Symposium, Berkeley, CA.
Jeon, Y. H., Liu, Z., Li, Z., Low, L. F., Chenoweth, L., O'Connor, D., et al. (2016). Development and validation of a short version of the Cornell scale for depression in dementia for screening residents in nursing homes. The American Journal of Geriatric Psychiatry, 24(11), 1007–1016.
Mahmud, Z., & Porter, A. L. (2015). Using Rasch analysis to explore what students learn about probability concepts. Journal on Mathematics Education, 6(1), 1–10.
Malinowsky, C., & Larsson-Lund, M. (2015). The match between everyday technology in public space and the ability of working-age people with acquired brain injury to use it. British Journal of Occupational Therapy, 79(1), 26–34.
Patomella, A. H., & Bundy, A. (2015). P-Drive: Implementing an assessment of on-road driving in clinical settings and investigating its internal and predictive validity. American Journal of Occupational Therapy, 69(4), 1–8.
Pelch, M. A., & McConnell, D. A. (2016). Challenging instructors to change: A mixed methods investigation on the effects of material development on the pedagogical beliefs of geoscience instructors. International Journal of STEM Education, 3(1), 1–18.
Pensavalle, C. A., & Solinas, G. (2013). The Rasch model analysis for understanding mathematics proficiency – A case study: Senior high school Sardinian students. Creative Education, 4(12), 767–773.


Sabari, J. S., Woodbury, M., & Velozo, C. A. (2014). Rasch analysis of a new hierarchical scoring system for evaluating hand function on the motor assessment scale for stroke. Stroke Research and Treatment, 2014(2), 1–10.
Sinnema, C., Ludlow, L., & Robinson, V. (2016). Educational leadership effectiveness: A Rasch analysis. Journal of Educational Administration, 54(3), 305–339.
Sousa, C. S., Turrini, R. N. T., & Poveda, V. D. B. (2015). Educational intervention in patients undergoing orthognathic surgery: Pilot study. Journal of Nursing Education and Practice, 5(5), 126–134.
Stelmack, J., Szlyk, J. P., Stelmack, T., Babcock-Parziale, J., Demers-Turco, P., Williams, T., et al. (2004). Use of Rasch person-item map in exploratory data analysis: A clinical perspective. Journal of Rehabilitation Research & Development, 41(2), 233–241.
Stonkus, M. A., & Royal, K. D. (2015). Further validation of the inventory of mental toughness factors in sport (IMTF-S). International Journal of Psychological Studies, 7(3), 35–45.
Teoh, B. C., Alrasheedy, A. A., Hassali, M. A., Tew, M. M., & Samsudin, M. A. (2015). Perceptions of doctors and pharmacists towards medication error reporting and prevention in Kedah, Malaysia: A Rasch model analysis. Advances in Pharmacoepidemiology and Drug Safety, 4(5). https://doi.org/10.4172/2167-1052.1000192
Wells, N., Bronheim, S., Zyzanski, S., & Hoover, C. (2015). Psychometric evaluation of a consumer-developed family-centered care assessment tool. Maternal and Child Health Journal, 19(9), 1899–1909.
Wooliscroft, B., Ganglmair-Wooliscroft, A., & Noone, A. (2013). The hierarchy of ethical consumption behavior: The case of New Zealand. Journal of Macromarketing, 34(1), 57–72.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: Mesa Press.
Zain, A. N. M., Samsudin, M. A., Rohandi, R., & Jusoh, A. (2010). Using the Rasch model to measure students' attitudes toward science in "low performing" secondary schools in Malaysia. International Education Studies, 3(2), 56–63.

Additional Readings

Boone, W. J. (2008). Teaching students about Rasch maps. Rasch Measurement Transactions, 22(2), 1163–1164.
Luntz, M. (2010). Using the very useful Wright Map. Measurement Research Associates Test Insights, January 2010. Retrieved April 11, 2018, from https://rasch.org/mra/mra-01-10
Ohio graduation tests interpretive guide family reports: Understanding your student's test scores. (Spring 2014). Retrieved April 11, 2018, from https://education.ohio.gov/getattachment/Topics/Testing/Ohio-Graduation-Test-OGT/Ohio-Graduation-Test-AssessmentResources/OGT_Sp14_FamilyGuide.pdf.aspx

Additional Wright Map Examples

Aryadoust, V., & Shahsavar, Z. (2016). Validity of the Persian blog attitude questionnaire: An evidence-based approach. Journal of Modern Applied Statistical Methods, 15(1), 417–451.
Bansilal, S. (2015). A Rasch analysis of a grade 12 test written by mathematics teachers. South African Journal of Science, 111(5–6), 1–9.
Chen, S., Zhu, X., & Kang, M. (2017). Development and validation of an energy-balance knowledge test for fourth- and fifth-grade students. Journal of Sports Sciences, 35(10), 1004–1011.


Daud, N. S. M., Daud, N. M., & Abu Kassim, N. L. (2005). Second language writing anxiety: Cause or effect? Malaysian Journal of ELT Research, 1(1), 1–19.
Herrmann-Abell, C., & Deboer, G. (2017). Investigating a learning progression for energy ideas from upper elementary through high school. Journal of Research in Science Teaching, 55(1), 68–93.
Holmefur, M. M., & Krumlinde-Sundholm, L. (2016). Psychometric properties of a revised version of the Assisting Hand Assessment (Kids-AHA 5.0). Developmental Medicine & Child Neurology, 58(6), 618–624.
Kozusznik, M. W., Peiró, J. M., Lloret, S., & Rodriguez, I. (2015). Hierarchy of eustress and distress: Rasch calibration of the Valencia eustress-distress appraisal scale. Central European Journal of Management, 2(1–2), 67–79.
Ludlow, L. H., Enterline, S. E., & Cochran-Smith, M. (2008). Learning to teach for social justice-beliefs scale: An application of Rasch measurement principles. Measurement and Evaluation in Counseling and Development, 40(4), 194–214.
Moral, F. J., Rebollo, F. J., Paniagua, M., & Murillo, M. (2014). Using an objective and probabilistic model to evaluate the impact of different factors in the dehesa agroforestry ecosystem. Ecological Indicators, 46, 253–259.
Ning, H. K. (2017). A psychometric evaluation of the achievement goal questionnaire – Revised in Singapore secondary students. Journal of Psychoeducational Assessment, 35(4), 424–436.
Oliveira, Í. M., Taveira, M. D. C., Cadime, I., & Porfeli, E. J. (2016). Psychometric properties of a career exploratory outcome expectations measure. Journal of Career Assessment, 24(2), 380–396.
Pesudovs, K., Garamendi, E., Keeves, J. P., & Elliott, D. (2003). The activities of daily vision scale for cataract surgery outcomes: Re-evaluating validity with Rasch analysis. Investigative Ophthalmology & Visual Science, 44(7), 2892–2899.
Wang, Y. C., Cook, K. F., Deutscher, D., Werneke, M. W., Hayes, D., & Mioduski, J. E. (2015). The development and psychometric properties of the patient self-report neck functional status questionnaire (NFSQ). Journal of Orthopaedic & Sports Physical Therapy, 45(9), 683–692.

Chapter 17

Rasch and Forms of Validity Evidence

Charlie and Chizuko: Two Colleagues Conversing
Charlie: You know Chizuko, when I read a Rasch article, in almost any field…there seems to be some sort of discussion of validity. The thing is…
Chizuko: Yes…go on…
Charlie: Well, there seem to be many, many types of validity to be considered…some articles only discuss a few steps that were used to evaluate the forms of validity evidence of an instrument, while other authors discuss a wide range of steps…so what is best to do?
Chizuko: Mmmmmm, this is how you might start thinking of things…there are many types of validity evidence…and there are researchers who think and write of validity evidence day and night. My thought is that some central types of validity evidence are discussed quite a bit in Rasch articles, and those are the ones I'm thinking I might help you learn about.
Charlie: Then let's get going!

Tie Ins to RAHS Book 1

As readers will see in this chapter, a number of validity considerations were presented throughout different chapters of both RAHS Book 1 and this book. For example, in Chapter 13 of RAHS Book 1 we used the term construct validity. In this chapter of Book 2, we introduce the term fit validity (Baghaei, 2008). Although this term is new to our two books, we devoted an entire chapter to the starting steps of fit in Chapter 8 of RAHS Book 1. In this new chapter, we summarize considerations of validity that are found in many different spots throughout RAHS Book 1, and we introduce additional validity considerations that we have seen utilized by authors using Rasch. We realize that there have been many, many discussions in the literature concerning "validity" evidence, both by researchers using Rasch and those not using Rasch. The purpose of this chapter is to provide an introduction to a range of validity evidence issues that might be considered through Rasch.


By no means are we attempting to summarize all the writing that has been presented regarding validity evidence; rather, we provide added details on some of the ways in which Rasch has been considered with regard to aspects of validity evidence. Our goal is not only that readers of RAHS Book 1 finish that book with an appreciation that their Wright Maps should provide some evidence of construct validity (item difficulty should match that predicted from theory), but also that readers of this book more often explicitly address added validity issues in their written Rasch articles. Often we read Rasch articles in which only one type of validity is discussed (construct validity). We believe that future Rasch articles could be strengthened through consideration of additional validity topics.

Introduction

Often when we sit down with colleagues who are not Rasch experts, and those colleagues are using an instrument, be it a survey or a test, they ask us: "How do I evaluate the validity of our instrument using Rasch?" Responding to this question, we begin by helping our colleagues learn that many types of validity evidence exist, and some types of validity evidence can be explored via Rasch analysis. Our next step is a discussion of the validity evidence that we have seen most often in Rasch papers as researchers work to outline how they support the inferences they wish to make with their measures. One problem with the publishing process is the strict word or page limits imposed, although limiting the content of an article to the key results makes it more manageable to read. Our hope is that researchers will carefully consider the different types of validity evidence we present here. This should help researchers choose which validity discussions to include in their Rasch papers and written works. Of course, many journals now provide authors with the opportunity to provide electronic appendices, which may allow researchers to include wider discussions of validity.

We begin by using six categories of validity (content validity, construct validity, predictive validity, concurrent validity, statistical validity, and fit validity) discussed by Baghaei (2008) in his excellent Rasch Measurement Transactions article. Then we present some added components of validity that researchers might consider as part of their Rasch analysis. We readily acknowledge there have been, and we think will continue to be, debates regarding the topic of validity. The purpose of this chapter is not to touch upon those debates and attempt to provide some sort of summary of all that could be considered with regard to the topic of validity. Rather, we wish to point out to our readers a number of validity issues that can be discussed in the most basic of Rasch articles. Often we see a lack of discussion regarding the topic of validity in articles that have developed instruments or utilized existing instruments. We believe that some added time spent considering validity issues can help improve Rasch analyses. Below we present our consideration of a number of aspects of validity, using the categories presented by Baghaei (2008).


Content Validity Evidence

When an instrument is created using Rasch thinking and Rasch analysis of data, a consideration of the variable being measured is required. Throughout RAHS Book 1, and in much of this book, we emphasized the importance of considering one variable and one trait when one is trying to measure something. A thoughtful way of considering one variable includes identifying content for the test items or survey items that taps into that construct. For instance, if the goal is to create a math test for addition, a researcher should be able to write out potential math addition test items that would measure the trait of mathematics addition. When discussing the content validity evidence of a test or survey, a researcher must be able to argue why the items of the test or survey provide content validity evidence. We have found that documents such as standards from professional organizations and models proposed by researchers are excellent resources to support a consideration of the content validity evidence of an instrument. To help readers consider the topic of content validity evidence, we suggest reading one of many examples of a Rasch presentation addressing content validity (Hudgens & Marquis, 2012). The bottom line: as researchers craft their articles or presentations, they should make sure to address the content validity of their items. Usually developers of a new instrument will explain how they crafted and vetted items, but often they will not explicitly explain how taking such careful steps in item development supports the content validity of their instrument.

Construct Validity Evidence

When we think of construct validity evidence, we must consider the following question: Do the items of our test or survey follow a predicted ordering and spacing on our Wright Map based on our understanding of the construct? Needless to say, to make such an assessment of construct validity evidence, a researcher needs a Wright Map that presents the items as a function of difficulty order. Equally important, a researcher needs some sort of information that provides the theory (which items should be hypothesized to be where on the Wright Map). Where might this theory come from? Just as it is most helpful to have documents and writings to support the argument for the content validity evidence of items, it is equally important to have documents that provide suggestions, outlines, and models of the location of the items of a test or survey as one moves up the Wright Map from less to more. For example, Piaget's theory of cognitive development could be used to predict the location of different test items (where a test item is a task presented to the student, which the student either solves correctly or does not). We recommend readers consider the following Rasch article that takes up the topic of construct validity evidence: de Jong et al. (2012). Construct validity of two pain behavior observation measurement instruments for young children with burns by Rasch analysis. Pain, 153(11), 2260–2266.


Formative Assessment Checkpoint #1
Question: Which Rasch technique is useful for evaluating the construct validity evidence of an instrument?
Answer: A Wright Map is a useful technique to assess construct validity evidence. More specifically, one can evaluate whether the ordering and spacing of items on the Wright Map matches what would be predicted from theory.

Predictive Validity Evidence

In his article considering validity, Baghaei asks, "does the person ability hierarchy make sense?" (2008, p. 1145). Another way to consider this question is to ask: does the location of the person measures on the Wright Map make sense? More specifically, are students who we would predict to be high achievers on a test indeed the high achievers? Likewise, are those students we would predict to be low achievers on a test the lower achieving students? With respect to surveys, it is helpful to return to the STEBI instrument (Enochs & Riggs, 1990) that we have used frequently in RAHS Book 1 and this book. Recall that the STEBI provides a measure of Self-Efficacy. Thus, if a researcher is evaluating the predictive validity evidence of the STEBI, the researcher would ask: are those students with high measures (high self-confidence) those whom one would predict to have high self-confidence? Moreover, are those students one would predict to have low self-confidence indeed those who have the lowest measures? In the case of preservice teachers, do those preservice teachers with more classroom experience have a higher measure than those who have less classroom experience?

When evaluating the predictive validity evidence of an instrument, a researcher must of course not only look at individual students, but also at groups of students. For instance, if a test was administered at ten different schools, are the schools that a researcher would predict to be high performing actually high performing? Moreover, are the schools predicted to be low performing indeed low performing? With respect to the STEBI, are groups of students at a confidence level one would predict? Often a researcher will have a gut feeling as to which schools should be performing at a higher level and which at a lower level. One could, of course, also conduct a statistical test to see whether the predicted high performing schools were performing at a higher level than other schools and, if there were differences, ask whether those differences matched predictions based upon other evidence.

When we review past Rasch articles, we often do not see an assessment of predictive validity evidence. Of course, one might claim that predictive validity could be evaluated by reviewing the raw score totals of respondents. Needless to say, there are a number of problems with such an approach. First, raw scores are not measures. Second, with Rasch we can better understand what each respondent can or cannot do, as a result of items being expressed on the same scale as persons. This means it is possible to look not only at the pure respondent hierarchy and ask if the ordering makes sense, but also at the person ordering and what each respondent would be predicted to have answered on the survey or test.


This, we feel, provides an added detail by which the predictive validity of an instrument can be evaluated.

Concurrent Validity Evidence

Regarding concurrent validity evidence, Baghaei writes: "do the person ability measures correlate well with other test instruments probing the same latent variable?" (Baghaei, 2008, p. 1145). As was the case with predictive validity, evaluating this type of validity evidence makes use of Rasch person measures. More specifically, do the Rasch measures of respondents correlate with measures from other instruments measuring the same or a similar trait? For example, if data are collected for a mathematics test of young students, do the Rasch measures of the students from the test correlate moderately with the measures the same students receive when evaluated with other mathematics tests? If we consider the STEBI instrument (Enochs & Riggs, 1990), does the Self-Efficacy measure of respondents correlate with other measures that also provide an assessment of Self-Efficacy?

We have found that many research studies do not include an evaluation of concurrent validity evidence, perhaps because many studies do not include an alternative instrument (or measure) that would allow a comparison of respondents' measures from two instruments of interest. Another problem is that even when other instruments have been used to evaluate a respondent, it is important that both instruments have been evaluated with Rasch techniques. Just as it would not make sense to conduct statistical analysis of raw test data or raw survey data, it does not make sense to evaluate concurrent validity evidence using instruments that have been scored using only raw scores. Also, it is important to point out that researchers using Rasch should consider evaluating divergent validity evidence. In an investigation, a researcher might compute a Rasch person measure for a sample of students for a specific variable such as self-efficacy. When the researcher compares the sample of student measures for self-efficacy to another variable that they predict should be uncorrelated, the researcher should observe no correlation between the two variables. The bottom line for researchers is to consider evaluating concurrent validity. But please take heed of our concern: it makes little sense to conduct an analysis of concurrent validity without making use of Rasch measures for the other tests and surveys being utilized.

Statistical Validity Evidence

An added type of validity considered by Baghaei (2008) is what he terms statistical validity. More specifically, he asks: "Does the instrument distinguish between high and low abilities with sufficient statistical certainty?" (p. 1145). In our work, we approach this issue in the same manner.


Imagine that for a group of students we have post-data to help us classify students ahead of time into groups that we classify as high achieving, middle achieving, and low achieving. If an instrument has statistical validity evidence, then the Rasch measures from the instrument for the low achieving students (identified through use of the post-data) should be statistically different from the Rasch measures for the high achieving students (also identified through use of the post-data). To understand why an instrument might not have statistical validity evidence (meaning it is not able to differentiate between high and low achievers), it is helpful to consider the Wright Map that we can create and interpret for tests and surveys. When we see in a Wright Map that only a limited number of test items exist along the range of person abilities, we can begin to appreciate why an instrument might not exhibit statistical validity evidence. If an instrument does not have items that help differentiate the respondents, then it makes sense that it is more difficult to determine the differences between high and low achievers. A tip for readers: a person separation index is one way to help evaluate the number of groups differentiated in a Rasch analysis. The Winsteps help tab (Linacre, 2018) provides this useful guidance: Person separation is used to classify people. Low person separation (