Constructing Measures: An Item Response Modeling Approach [2 ed.] 1032261668, 9781032261669

Constructing Measures introduces a way to understand the advantages and disadvantages of measurement instruments. It exp

200 23 13MB

English Pages 394 [395] Year 2023

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Constructing Measures: An Item Response Modeling Approach [2 ed.]
 1032261668, 9781032261669

Table of contents :
Cover
Endorsements
Half Title
Title
Dedication
Copyright
Contents
List of figures
List of tables
Preface
Aims of the Book
Audiences for the Book
Structure of the Book
Learning Tools
Using the Book to Teach a Course
Help for Old Friends
Acknowledgements
Part I A Constructive Approach to Measurement
1 The BEAR Assessment System: Overview of the “Four Building Blocks” Approach
1.1 What Is “Measurement”?
1.1.1 Construct Modeling
1.2 The BEAR Assessment System
1.3 The Construct Map
1.3.1 Example 1: The MoV Construct in the Data Modeling Assessments
1.4 The Items Design
1.4.1 Example 1: MoV Items
1.4.2 The Relationship between the Construct and the Responses
1.5 The Outcome Space
1.5.1 Example 1: The MoV Outcome Space
1.6 The Wright Map
1.6.1 Example 1: The MoV Wright Map
1.6.2 Return to the Discussion of Causation and Inference
1.7 Reporting the Results to the Measurer and Other Users
1.8 Using the Four Building Blocks to Develop an Instrument
1.9 Resources
1.10 Exercises and Activities
Part II The Four Building Blocks
2 Construct Maps
2.1 The Construct Map
2.2 Examples of Construct Maps
2.2.1 Example 1: The Models of Variability (MoV) Construct in the Data Modeling Curriculum
2.2.2 Example 2: A Social and Emotional Learning Example (RIS: The Researcher Identity Scale)
2.2.3 Example 3: An Attitude Example (GEB: General Ecological Behavior)
2.2.4 Example 4: A 21st Century Skills Example (LPS Argumentation)
2.2.5 Example 5: The Six Constructs in the Data Modeling Curriculum
2.2.6 Example 6: A Process Measurement Example—Collaborative Problem-Solving (CPS)
2.2.7 Example 7: A Health Assessment Example (PF-10: Physical Functioning 10)
2.2.8 Example 8: An Interview Example (CUE: Conceptual Underpinnings of Evolution)
2.2.9 Example 9: An Observational Instrument—Early Childhood (DRDP)
2.2.10 Example 10: The Issues Evidence and You (IEY) Science Assessment
2.3 Using Construct Mapping to Help Develop an Instrument
2.4 Examples of Other Construct Structures
2.5 Resources
2.6 Exercises and Activities
3 The Items Design
3.1 The Idea of an Item
3.2 The Facets of the Items Design
3.2.1 The Construct Facet
3.2.2 The Secondary Design Facets
3.3 Different Types of Item Responses
3.3.1 Participant Observation
3.3.2 Specifying (Just) the Topics
3.3.3 Constructed Response Items
3.3.4 Selected Response Items
3.3.5 Steps in Item Development
3.4 Building-in Fairness through Design
3.4.1 What Do We Mean by Fairness Here?
3.4.2 Universal Design
3.5 Resources
3.6 Exercises and Activities
4 The Outcome Space
4.1 The Qualities of an Outcome Space
4.1.1 Well-defined Categories
4.1.2 Research-based Categories
4.1.3 Context-specific Categories
4.1.4 Finite and Exhaustive Categories
4.1.5 Ordered Categories
4.2 Scoring the Outcome Space (the Scoring Guide)
4.3 General Approaches to Constructing an Outcome Space
4.3.1 Phenomenography
4.3.2 The SOLO Taxonomy
4.3.3 Guttman Items
4.4 A Unique Feature of Human Measurement: Listening to the Respondents
4.5 When Humans Become a Part of the Item: The Rater
4.6 Resources
4.7 Exercises and Activities
5 The Wright Map
5.1 Combining Two Approaches to Measurement
5.2 The Wright Map
5.2.1 The Rasch Model
5.2.2 Visualizing the Rasch Model Parameters: The Wright Map
5.2.3 Modeling the Response Vector
5.2.4 Linking the Construct Map and the Wright Map
5.3 The PF-10 Example (Example 7)
5.4 Reporting Measurements
5.4.1 Interpretation and Errors
5.4.2 The PF-10 Example (Example 7), Continued
5.5 Resources
5.6 Exercises and Activities
Textbox 5.1 Making sense of logits
Part III Quality Control Methods
6 Evaluating and Extending the Statistical Model
6.1 More Than Two Score Categories: Polytomous Data
6.1.1 The PF-10 Example (Example 7), Continued
6.2 Evaluating Fit
6.2.1 Item Fit
6.2.2 Respondent Fit
6.3 Resources
6.4 Exercises and Activities
Textbox 6.1 The Partial Credit Model
Textbox 6.2 Calculating the Thurstonian Thresholds
7 Trustworthiness, Precision, and Reliability
7.1 Trustworthiness in Measurement
7.2 Measurement Error: Precision
7.3 Summaries of Measurement Error
7.3.1 Internal Consistency Coefficients
7.3.2 Test–Retest Coefficients
7.3.3 Alternate Forms Coefficients
7.3.4 Other Reliability Coefficients and Indexes
7.4 Inter-rater Consistency
7.5 Resources
7.6 Exercises and Activities
8 Trustworthiness, Validity, and Fairness
8.1 Trustworthiness, Continued
8.1.1 Crafting a Full Validity Argument
8.2 Evidence Based on Instrument Content
8.2.1 Instrument Content Evidence for Example 2, the Researcher Identity Scale-G
8.3 Evidence Based on Response Processes
8.3.1 Response Process Evidence Related to Example 8—The DRDP
8.4 Evidence Based on Internal Structure
8.4.1 Evidence of Internal Structure at the Instrument Level: Dimensionality
8.4.2 Dimensionality Evidence for Example 2: The Researcher Identity Scale-G
8.4.3 Evidence of Internal Structure at the Instrument Level: The Wright Map
8.4.4 Wright Map Evidence from Example 2: The Researcher Identity Scale-G
8.4.5 Evidence of Internal Structure at the Item Level
8.4.6 Item-level Evidence of Internal Structure for the PF-10 Instrument
8.5 Evidence Regarding Relations to Other Variables
8.5.1 “Other Variables” Evidence from Two Examples
8.6 Evidence Based on the Consequences of Using the Instrument
8.7 Evidence Related to Fairness
8.7.1 Differential Item Functioning (DIF)
8.7.2 DIF Evidence for the RIS-G
8.8 Resources
8.9 Exercises and Activities
Part IV A Beginning Rather than a Conclusion
9 Building on the Building Blocks
9.1 Choosing the Statistical Model
9.1.1 Interpretation of Thurstone’s Requirement in Terms of the Construct Map
9.2 Comparing Overall Model Fit
9.3 Beyond the Lone Construct Map: Multidimensionality
9.4 Resources
9.5 Exercises and Activities
Textbox 9.1 Showing that Equation 9.5 Holds for the Rasch Model
Textbox 9.2 Statistical Formulation of the Multidimensional Partial Credit Model
10 Beyond the Building Blocks
10.1 Beyond the Construct Map: Learning Progressions
10.2 Beyond the Items Design and the Outcome Space: Process Measurement
10.3 Beyond the Statistical Model: Considering a More Complex Scientific Model
10.4 Other Measurement Frameworks: Principled Assessment Designs
10.4.1 Example: Evidence-centered Design
10.4.2 Going “Outside the Triangle”
10.5 A Beginning Rather Than a Conclusion
10.5.1 Further Reading about the History of Measurement in the Social Sciences
10.5.2 Further Reading about Alternative Approaches
10.5.3 Further Reading about the Philosophy of Measurement
10.6 Exercises and Activities
Appendix A The Examples Archive
Appendix B Computerized Design, Development, Delivery, Scoring and Reporting—BASS
Appendix C The BEAR Assessment System (BAS): Papers about its Uses and Applications
Appendix D Models of Variation Materials
Appendix E The General Ecological Behavior Items
Appendix F The Item Panel
Appendix G Matching Likert and Guttman Items in the RIS Example
Appendix H Sample Script for a Think-aloud Investigation
Appendix I The Item Pilot Investigation
Appendix J Results from the PF-10 Analyses
References
Index

Citation preview

“I think it would be hard to overstate the importance of Mark Wilson’s Constructing Measures for researchers and practitioners engaged in the construction and validation of measures of human properties. This volume provides usable, concrete guidance for constructing instruments, including but not limited to educational tests, survey-based measures, and psychological assessments, and is particularly remarkable for its comprehensive treatment of the entire (iterative) process of instrument design, including construct definition, item writing and vetting, and quality control via thoughtfully chosen psychometric models (in particular, the Rasch model and its extensions). Further, it is written in an accessible style and would be a great entry point for non-specialists, but also provides sufficient rigor for those who wish to more deeply understand both the mathematical and conceptual foundations of measurement.” —Andrew Maul, Associate Professor of Education, University of California, Santa Barbara “Professor Wilson is one of the world’s outstanding leaders in measurement. I have used the first edition of his book in all of my graduate courses on measurement. His book takes the complex process of constructing measures, and breaks it into four building blocks. These building blocks can be used by anyone who seeks to create useful and defensible measures in the human sciences. The new edition promises to introduce a new generation of students and researchers to the essential aspects of constructing measures.” —George Engelhard, Professor of Educational Measurement and Policy, The University of Georgia “This volume is an excellent and important update to the original Constructing Measures. Broadly applicable to educational measurement and assessment, it should be in every university library collection. Faculty and students will find this volume helpful for many courses.” —Kathleen Scalise, Professor at the University of Oregon (Education Studies and School Psychology) “Twenty years ago, the first edition of this book opened for me the door to the magic world of measurement. In this second edition, Mark unpacks complex and abstract measurement concepts into easy-to-follow building blocks, grounded in real-world examples. This book is an ideal choice for instructors who are about to teach an introductory course in measurement and for students eager to foray into the measurement world.”  —Lydia Liu, Principal Research Director, Education Testing Service

CONSTRUCTING MEASURES

Constructing Measures introduces a way to understand the advantages and disadvantages of measurement instruments. It explains the ways to use such instruments, and how to apply these methods to develop new instruments or adapt old ones, based on item response modeling and construct references. Now in its second edition, this book focuses on the steps taken while constructing an instrument, and breaks down the “building blocks” that make up an instrument—the construct map, the design plan for the items, the outcome space, and the statistical measurement model. The material covers a variety of item formats, including multiple-choice, open-ended, and performance items, projects, portfolios, Likert and Guttman items, behavioral observations, and interview protocols. Each chapter includes an overview of the key concepts, related resources for further investigation, and exercises and activities. A variety of examples from the behavioral and social sciences and education—including achievement and performance testing, attitude measures, health measures, and general sociological scales—demonstrate the application of the material. New to this edition are additional example contexts including a cognitive/achievement example and an attitude example, and a behavioral example and new concentrations on specific measurement issues and practices such as standard-setting, computer-delivery and reporting, and going beyond the Likert response format. Constructing Measures is an invaluable text for undergraduate and graduate courses on item, test, or instrument development; measurement; item response theory; or Rasch analysis taught in a variety of departments, including education, statistics, and psychology. The book also appeals to practitioners who develop instruments,

including industrial/organizational, educational, and school psychologists; health outcomes researchers; program evaluators; and sociological measurers. Mark Wilson is a Distinguished Professor in the Berkeley School of Education at the University of California, Berkeley, who specializes in measurement and statistics. His research focuses on the establishment of a framework for measurement practice informed by the philosophy of measurement, on statistical models that are aligned with scientific models of the construct, and on instruments to measure new constructs.

CONSTRUCTING MEASURES An Item Response Modeling Approach Second Edition

Mark Wilson

To Penelope Jayne Thomas: This book has been brewing all your life; Now it is time to drink.

Designed cover image: © Mark Wilson First published 2023 by Routledge 605 Third Avenue, New York, NY 10158 and by Routledge 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN Routledge is an imprint of the Taylor & Francis Group, an informa business © 2023 Mark Wilson The right of Mark Wilson to be identified as author of this work has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers. Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. ISBN: 978-1-032-26166-9 (hbk) ISBN: 978-1-032-26168-3 (pbk) ISBN: 978-1-003-28692-9 (ebk) DOI: 10.4324/9781003286929 Typeset in Bembo by Apex CoVantage, LLC

CONTENTS

List of figures xvi List of tables xx Prefacexxii   Aims of the Book  xxii   Audiences for the Book  xxiii   Structure of the Book  xxiii   Learning Tools  xxiv   Using the Book to Teach a Course  xxv   Help for Old Friends  xxvi Acknowledgementsxxviii PART I

A Constructive Approach to Measurement

1

1 The BEAR Assessment System: Overview of the “Four Building Blocks” Approach

3

1.1 What Is “Measurement”?  3 1.1.1 Construct Modeling  6 1.2 The BEAR Assessment System  7 1.3 The Construct Map  8 1.3.1 Example 1: The MoV Construct in the Data Modeling Assessments  11

x Contents

1.4 The Items Design  15 1.4.1 Example 1: MoV Items  16 1.4.2 The Relationship between the Construct and the Responses  18 1.5 The Outcome Space  19 1.5.1 Example 1: The MoV Outcome Space  21 1.6 The Wright Map  25 1.6.1 Example 1: The MoV Wright Map  26 1.6.2 Return to the Discussion of Causation and Inference  31 1.7 Reporting the Results to the Measurer and Other Users 32 1.8 Using the Four Building Blocks to Develop an Instrument 34 1.9 Resources 37 1.10 Exercises and Activities  37 PART II

The Four Building Blocks

39

2 Construct Maps

41

2.1 2.2

The Construct Map  41 Examples of Construct Maps  45 2.2.1 Example 1: The Models of Variability (MoV) Construct in the Data Modeling Curriculum 46 2.2.2 Example 2: A Social and Emotional Learning Example (RIS: The Researcher Identity Scale) 47 2.2.3 Example 3: An Attitude Example (GEB: General Ecological Behavior)  48 2.2.4 Example 4: A 21st Century Skills Example (LPS Argumentation)  51 2.2.5 Example 5: The Six Constructs in the Data Modeling Curriculum  53 2.2.6 Example 6: A Process Measurement Example— Collaborative Problem-Solving (CPS)  56 2.2.7 Example 7: A Health Assessment Example (PF-10: Physical Functioning 10)  58

Contents  xi

2.3 2.4 2.5 2.6

2.2.8 Example 8: An Interview Example (CUE: Conceptual Underpinnings of Evolution)  60 2.2.9 Example 9: An Observational Instrument— Early Childhood (DRDP)  62 2.2.10 Example 10: The Issues Evidence and You (IEY) Science Assessment  64 Using Construct Mapping to Help Develop an Instrument 65 Examples of Other Construct Structures  67 Resources 69 Exercises and Activities  69

3 The Items Design

71

3.1 The Idea of an Item  71 3.2 The Facets of the Items Design  74 3.2.1 The Construct Facet  75 3.2.2 The Secondary Design Facets  78 3.3 Different Types of Item Responses  81 3.3.1 Participant Observation  83 3.3.2 Specifying (Just) the Topics  83 3.3.3 Constructed Response Items  84 3.3.4 Selected Response Items  85 3.3.5 Steps in Item Development  88 3.4 Building-in Fairness through Design  90 3.4.1 What Do We Mean by Fairness Here?  90 3.4.2 Universal Design  93 3.5 Resources 94 3.6 Exercises and Activities  94 4 The Outcome Space 4.1 The Qualities of an Outcome Space  96 4.1.1 Well-defined Categories  98 4.1.2 Research-based Categories  99 4.1.3 Context-specific Categories  100 4.1.4 Finite and Exhaustive Categories  101 4.1.5 Ordered Categories  102 4.2 Scoring the Outcome Space (the Scoring Guide)  103 4.3 General Approaches to Constructing an Outcome Space 104

96

xii Contents

4.4 4.5 4.6 4.7

4.3.1 Phenomenography 104 4.3.2 The SOLO Taxonomy  108 4.3.3 Guttman Items  112 A Unique Feature of Human Measurement: Listening to the Respondents  118 When Humans Become a Part of the Item: The Rater  122 Resources 125 Exercises and Activities  125

5 The Wright Map

127

5.1 Combining Two Approaches to Measurement  127 5.2 The Wright Map  133 5.2.1 The Rasch Model  134 5.2.2 Visualizing the Rasch Model Parameters: The Wright Map  139 5.2.3 Modeling the Response Vector  145 5.2.4 Linking the Construct Map and the Wright Map 147 5.3 The PF-10 Example (Example 7)  149 5.4 Reporting Measurements  153 5.4.1 Interpretation and Errors  154 5.4.2 The PF-10 Example (Example 7), Continued 156 5.5 Resources 158 5.6 Exercises and Activities  159 Textbox 5.1 Making sense of logits  140 PART III

Quality Control Methods 6 Evaluating and Extending the Statistical Model 6.1 More Than Two Score Categories: Polytomous Data  165 6.1.1 The PF-10 Example (Example 7), Continued 174 6.2 Evaluating Fit  178 6.2.1 Item Fit  179 6.2.2 Respondent Fit  184

163 165

Contents  xiii

6.3 Resources 190 6.4 Exercises and Activities  191 Textbox 6.1 The Partial Credit Model  171 Textbox 6.2 Calculating the Thurstonian Thresholds 174 7 Trustworthiness, Precision, and Reliability

193

7.1 Trustworthiness in Measurement  193 7.2 Measurement Error: Precision  195 7.3 Summaries of Measurement Error  203 7.3.1 Internal Consistency Coefficients  204 7.3.2 Test–Retest Coefficients  206 7.3.3 Alternate Forms Coefficients  207 7.3.4 Other Reliability Coefficients and Indexes  207 7.4 Inter-rater Consistency  209 7.5 Resources 212 7.6 Exercises and Activities  212 8 Trustworthiness, Validity, and Fairness 8.1 Trustworthiness, Continued  215 8.1.1 Crafting a Full Validity Argument  216 8.2 Evidence Based on Instrument Content  217 8.2.1 Instrument Content Evidence for Example 2, the Researcher Identity Scale-G  219 8.3 Evidence Based on Response Processes  222 8.3.1 Response Process Evidence Related to Example 8—The DRDP  223 8.4 Evidence Based on Internal Structure  224 8.4.1 Evidence of Internal Structure at the Instrument Level: Dimensionality  225 8.4.2 Dimensionality Evidence for Example 2: The Researcher Identity Scale-G  227 8.4.3 Evidence of Internal Structure at the Instrument Level: The Wright Map  229 8.4.4 Wright Map Evidence from Example 2: The Researcher Identity Scale-G  230 8.4.5 Evidence of Internal Structure at the Item Level 230

215

xiv Contents

8.5 8.6 8.7 8.8 8.9

8.4.6 Item-level Evidence of Internal Structure for the PF-10 Instrument  232 Evidence Regarding Relations to Other Variables  233 8.5.1 “Other Variables” Evidence from Two Examples 235 Evidence Based on the Consequences of Using the Instrument 237 Evidence Related to Fairness  238 8.7.1 Differential Item Functioning (DIF)  239 8.7.2 DIF Evidence for the RIS-G  240 Resources 242 Exercises and Activities  242

PART IV

A Beginning Rather than a Conclusion 9 Building on the Building Blocks

245 247

9.1 Choosing the Statistical Model  247 9.1.1 Interpretation of Thurstone’s Requirement in Terms of the Construct Map  252 9.2 Comparing Overall Model Fit  257 9.3 Beyond the Lone Construct Map: Multidimensionality 259 9.4 Resources 267 9.5 Exercises and Activities  269 Textbox 9.1 Showing that Equation 9.5 Holds for the Rasch Model  250 Textbox 9.2 Statistical Formulation of the Multidimensional Partial Credit Model 260 10 Beyond the Building Blocks 10.1 Beyond the Construct Map: Learning Progressions  271 10.2 Beyond the Items Design and the Outcome Space: Process Measurement  275 10.3 Beyond the Statistical Model: Considering a More Complex Scientific Model  282 10.4 Other Measurement Frameworks: Principled Assessment Designs 291

271

Contents  xv

10.4.1 Example: Evidence-centered Design  292 10.4.2 Going “Outside the Triangle”  294 10.5 A Beginning Rather Than a Conclusion  298 10.5.1 Further Reading about the History of Measurement in the Social Sciences  298 10.5.2 Further Reading about Alternative Approaches 299 10.5.3 Further Reading about the Philosophy of Measurement 300 10.6 Exercises and Activities  301 Appendix A The Examples Archive Appendix B Computerized Design, Development, Delivery, Scoring and Reporting—BASS Appendix C The BEAR Assessment System (BAS): Papers about its Uses and Applications Appendix D Models of Variation Materials Appendix E The General Ecological Behavior Items Appendix F The Item Panel Appendix G Matching Likert and Guttman Items in the RIS Example Appendix H Sample Script for a Think-aloud Investigation Appendix I The Item Pilot Investigation Appendix J Results from the PF-10 Analyses

302 304 306 316 319 321 324 328 331 333

References336 Index353

FIGURES

1.1 The National Research Council (NRC) Assessment Triangle 6 1.2 The four building blocks in the BEAR Assessment System (BAS) 7 1.3 Illustration of a generic construct map, incorporating qualitative person-side waypoints 10 1.4 The Models of Variation (MoV) construct map. Note that the distances between waypoints on this (hypothetical) map are arbitrary 12 1.5 Illustration of a spinner 13 1.6 The Piano Width task 17 1.7 A picture of an initial idea of the relationship between construct and item responses 18 1.8 A picture of the Construct Modeling idea of the relationship between degree of construct possessed and item responses 19 1.9 A segment of the MoV Outcome Space (from Appendix D) 22 1.10 Sketch of a Wright map 27 1.11 The Wright Map for MoV29 1.12 The Revised MoV construct map 30 1.13 The “four building blocks” showing the directions of causality and inference 32 1.14 A Group Proficiency Report for a set of students on the MoV construct33 1.15 The instrument development cycle through the four building blocks 35 2.1 The Construct Map, the first building block in the BEAR Assessment System (BAS) 42 2.2 A generic construct map in construct “X” 43

Figures  xvii

2.3 A sketch of the construct map for the MoV construct of the Data Modeling instrument 46 2.4 The construct map for the RIS 47 2.5 The GEB items arrayed into four consecutive sets 49 2.6 Sketch of a construct map in general ecological behavior 50 2.7 The Argumentation construct map 52 2.8 Three constructs represented as three strands spanning a theory of learning 54 2.9 Exploring: the first strand in the CPS Process Framework 57 2.10 The Complete CPS Process Framework 58 2.11 A sketch of the construct map for the Physical Functioning subscale (PF-10) of the SF-36 health survey 59 2.12 The final CUE learning progression 61 2.13 A view of the DRDP measure Identity of Self in Relation to Others (SED1) 63 2.14 A sketch of the construct map for the Using Evidence construct66 2.15 Illustration of a simple ordered partition 68 3.1 The Items Design, the second building block in the BEAR Assessment System (BAS) 74 3.2 Detail of the Argumentation construct map 77 3.3 The prompt for the “Ice-to-Water-Vapor” task 77 3.4 The two sets of MoV items: CR and SR 78 3.5 A subject-by-process blueprint for an immunology test 80 3.6 Levels of pre-specification for different item formats 82 3.7 Exploring student responses to the Rain Forest/Desert Task in the CUE interview 85 3.8 Coding elements developed by the CUE project 86 3.9 An example of a multiple-choice test item that would be a candidate for ordered multiple-choice scoring 86 3.10 An example item from the RIS-G (Guttman response format items) 87 4.1 The four building blocks in the BEAR Assessment System (BAS) 97 4.2 An open-ended question in physics 105 4.3 A phenomenographic outcome space 107 4.4 The SOLO Taxonomy 108 4.5 A SOLO task in the area of History 109 4.6 SOLO scoring guide for the history task 110 4.7 A sketch of the construct map for the Using Evidence construct of the IEY curriculum 112 4.8 The SOLO-B Taxonomy 113 4.9 Sketch of a “Guttman scale” 114

xviii Figures

4.10 Guttman’s example items (from Guttman, 1944, p.145) 115 4.11 Some Likert-style items initially developed for the Researcher Identity Scale (RIS) 116 4.12 An example item from the RIS-G (Guttman response style item) 117 4.13 An early version of a LPS Ecology item 120 4.14 Example notes from a think-aloud session 121 5.1 Thurstone’s graph of student success on specific items from a test versus chronological age 131 5.2 The four building blocks in the BEAR Assessment System (BAS) 133 5.3 Representation of three possible relationships between respondent location and the location of an item 135 5.4 Relationship between respondent location (θ) and probability of a response of “1” for an item with difficulty 1.0 137 5.5 Figure 5.4 reoriented so that respondent location is on the horizontal axis 138 5.6 Item response functions for three items 141 5.7 A generic Wright map  142 5.8 The Wright Map for the dichotomized PF-10 instrument 150 5.9 Standard errors of measurement for the PF-10 155 5.10 A Group Report for the PF-10 157 5.11 The Individual Scores report for Respondent 2659 158 6.1 Item response function for Xi = 0 for an item with difficulty 1.0 logits 169 6.2 The cumulative category response functions for a polytomous item 170 6.3 The Wright map for the trichotomous PF-10 instrument 175 6.4 Cumulative category characteristic curves, and matching fit results for VigAct 176 6.5 Cumulative category characteristic curves, and matching fit results for WalkBlks 182 6.6 Kidmap for Respondent 63 on the PF-10 187 6.7 Kidmap for Respondent 1030 on the PF-10 188 6.8 Kidmap for Respondent 1308 on the PF-10 189 7.1 A BASS report for a respondent with a score of 13 on the PF-10(poly)197 7.2 The standard errors of measurement for the PF-10(poly)— with no missing data (each dot represents a score) 199 7.3 The standard error of measurement for the PF-10(poly)— including missing data (each dot represents a unique response vector) 200 7.4 The information amounts for the PF-10(poly), with no missing data (each dot represents a score) 201 8.1 A sample item from the RIS-G instrument showing the match to waypoints 222

Figures  xix

8.2 The qualitative “inner loop” of the BEAR Assessment System (BAS): shown using dashed lines 224 8.3 Wright map for the RIS-G instrument 231 8.4 Relationships (linear and quadratic) between the PF-10 scale and age 236 9.1 Item response functions: all IRFs have the same shape 253 9.2 Item response functions: IRFs have different shapes 254 9.3 Conceptual sketch of three construct maps forming a learning progression260 9.4 The Data Modeling constructs represented via a multidimensional Wright map 262 9.5 The Data Modeling constructs represented via a multidimensional Wright map with DDA adjustment 265 9.6 Item threshold estimates for Cha6 and Cos4 268 10.1 A prerequisite relationship among two waypoints in different constructs273 10.2 Prerequisite links between two Data Modeling constructs 274 10.3 The structured constructs model (SCM) representation of the Data Modeling constructs 275 10.4 Two views from the Olive Oil Task: Player A, top panel; Player B, bottom panel 277 10.5 Mutual game states for the Olive Oil Task, and a successful pathway 278 10.6 A representation of log-stream data for a time sequence in the Olive Oil Task 279 10.7 Wright map for the CPS indicators 282 10.8 Siegler’s Rule Hierarchy 285 10.9 Siegler’s balance beam tasks 286 10.10 Sketch of the Saltus model 288 10.11 The CIA Triangle (top) and Black’s “vicious triangle” (bottom) 295 10.12 The Revised CIA Triangle 296 BM1 The BASS software modules 305 BM2 The complete MoV Outcome Space 316 BM3 Mapping Likert-style items into Guttman-style items 325 BM4 Sample script for a think-aloud investigation 328

TABLES



1.1 2.1 5.1 6.1 6.2 7.1

7.2 8.1

8.2 8.3 8.4 9.1 9.2

9.3 9.4 10.1 10.2 10.3 10.4

Scoring guide for the “Piano Width” Item 1(b) 24 Items in the PF-10 59 Items in the PF-10, indicating waypoints 151 Item fit results 181 Three response patterns for the PF-10 scale 185 Outline of various measurement error summaries (aka reliability coefficients) 209 Layout of data for checking rater consistency 212 Comparisons between the unidimensional and multidimensional models 228 Correlations among the subdimensions of the RIS-G (disattenuated) 228 Item statistics for the PF-10 example (for selected items) 234 DIF results for RIS-G 241 Model comparisons for the PF-10 data 257 Relative model fit between unidimensional and sevendimensional models for the Data Modeling data 263 Correlations of the seven dimensions hypothesized for Data Modeling264 Summary of results for Data Modeling prerequisites 267 States of play in the Olive Oil Task 278 Descriptions of the “events” recorded in the Olive Oil log-stream 280 Siegler’s predicted success rates for children at different Rule Hierarchy levels for the different types of balance beam tasks 287 Comparison of deviances for Rasch and Saltus models for three data sets 290

Tables  xxi

10.5 Estimates of probability of group membership and locations for selected respondents A.1 Descriptions and discussions of the Examples in the text T5.1 Logit differences and probabilities for the Rasch model D.1 The MoV item threshold estimates J.1 PF-10 Dichotomous data: item difficulty estimates J.2 PF-10 Dichotomous data: respondent location estimates and standard errors of measurement J.3 PF-10 Polytomous data: item parameter estimates and fit results J.4 Respondent estimates and sem(θˆ  ) for the PF-10 (polytomous)

291 303 140 318 333 333 334 334

PREFACE

It is often said that the best way to learn something is to do it. Because the purpose of this book is to introduce the principles and practices of sound measurement, it is organized in the same way as an actual instrument development would be ordered, from an initial idea about the property that is to be measured to gathering evidence that the instrument can successfully be used to provide measurement on that property. In general, the properties to be measured will be properties of human beings, attributes such as abilities and attitudes, and also propensities toward behaviors, preferences, and other human attributes. Aims of the Book

After reading this book, the reader should be in a position to recognize how a measurement instrument can be seen as the embodiment of an argument, specifically the argument that the instrument can indeed measure the property that it is intended to measure. As well, the reader will also be able to (a) responsibly use such instruments, (b) critique the use of such instruments, and (c) apply the methods described in the book to develop new instruments and/or adapt old ones. The aim is that by understanding the process of measurement, the reader will indeed have learned the basics of measurement and will have had an opportunity to see how they integrate into a whole argument (see, in particular, Chapter 8). This book attempts to convey to readers the conceptual measurement ideas that are behind the technical realizations of measurement models and does include technical descriptions of the models involved but does not attempt to open the “black box” of parameter estimation and so on. For example, although Chapters 5

Preface  xxiii

and 6 do introduce the reader to the conceptual basis for a statistical model, the Rasch model, and do give a motivation for using that particular formulation, the book does not attempt to go into details about the formal statistical basis for the model. This seems a natural pedagogic order—find out what the scientific ideas are, then learn about the technical way to express them and work with them. This book is designed to be used as either (a) the core reference for a first course in measurement or (b) the reference for the practical and conceptual aspects of a course that uses another reference for the other, more technical (i.e., statistical), aspects of the course. In approach (a), the course would normally be followed by a second course that would concentrate on the mathematical and technical expression of the ideas introduced here.1 But, in approach (b) it may be that some instructors will prefer to teach both the concepts and the technical expression at the same time—for such, the best option is to read a more traditional technical introduction in parallel with reading this book. Audiences for the Book

In order to be well-prepared to read this book, the reader should have (a) an interest in understanding the conceptual basis for measurement and (b) a certain minimal background in quantitative methods, including knowledge of basic descriptive statistics, and a familiarity with standard errors, confidence intervals, correlation, t-tests, elementary regression topics, and a readiness to learn about how to use a software application for quantitative analysis. This would include first- and/or second-year graduate students, but also undergraduates with sufficient maturity of interests and experience. Structure of the Book

This book is organized to follow the steps of a particular approach to conceiving of and constructing an instrument to measure a property. So that the reader can see where they are headed, the account in the book starts off in the first chapter with a summary of all the constructive steps (“building blocks”) involved, using a single context taken from educational achievement testing. Then this brief account is expanded upon in Chapters 2 through 6, developing the reader’s understanding of the building blocks approach. Chapter 2 describes the “construct map”—that is, the idea that the measurer has of the property that is being measured. The construct is the conceptual heart of this approach—along with its visual metaphor, the construct map. Chapter 3 describes the design plan for the “items,” the ways that prompt the respondent to give useful information about the property, such as questions, tasks, statements, and performances. Chapter 4 describes the “outcome space,” the way that these responses are categorized and then scored as indicators of the property. Chapters 5 and 6 describe the statistical model that

xxiv Preface

is used to organize the item scores into measures. This statistical model is used to associate numbers with the construct map—to “calibrate” the map. This initial set of chapters (1–6) constitutes the constructive part of measuring. The next two chapters describe the quality control part of the process. This starts already in Chapter 6, which describes how to check that the item scores are operating consistently in the way that the statistical model assumes they do. Chapters 7 and 8 describe how one can investigate the trustworthiness of the resulting measurements. Chapter 7 describes how to check that the instrument has demonstrated sufficient consistency to be useful—termed, generally, the “reliability” evidence. Chapter 8 describes how to check whether the instrument does indeed measure what it is intended to measure—termed the “validity” evidence. Both of these latter two chapters capitalize on the calibrated construct map as a way to organize the arguments for the instrument. The final chapters, Chapters 9 and 10, are quite different—they discuss ways to extend the “building blocks” approach and how to go beyond it, respectively, and thus (it is hoped) instigate the beginning of the reader’s future as a measurer, rather than functioning as a conclusion. Learning Tools

Each chapter includes several features intended to help the reader follow the arguments presented therein. Each chapter begins with an overview of the chapter and a list of the “key concepts”—these are typically single words or phrases that refer to the main ideas that are introduced and used in the chapter. After the main body of the chapter, there is a section called “Resources” that the reader can consult to investigate these arguments and topics further. There is also a set of “Exercises and Activities” for the reader at the end of the chapters. These serve a dual purpose: (a) prompting the reader to try out some of the strategies used in the chapter for themselves and to extend some of the discussions beyond where they go in the text, and (b) encouraging the reader to carry out some of the steps needed to apply the content of the chapter to developing an instrument. Several chapters also include textboxes—these give useful details that readers can follow up on but which might otherwise delay readers’ progress through the chapter. There are also ten appendices at the end of the book. These contain a variety of information that will be useful to the reader who is looking for more detail at specific points in the narrative. Some describe aspects of the instrument development process in more detail than is provided in the text, and some others record the details of the results of computer analyses of data, parts of which are used in the text. There are several other resources that are designed to support the reader; these are available at a website noted in Appendices A and B. First, there is the “Examples Archive” (Appendix A). In the text, ten Examples are used in various places to provide concrete contexts for the concepts being discussed. To supplement these accounts, the Examples Archive on the website includes information

Preface  xxv

on each of these Examples, including background information, descriptions of the construct, the instrument, data sources, analyses, and selected analysis results. There they are recorded in considerable detail so as to allow the reader to work through the instrument development and investigation steps to completion. In particular, this is very useful to illustrate the ways that the approach can vary under differing circumstances. Second, all the computations are carried out with a particular program—the BEAR Assessment System Software (BASS)—see Appendix B. Access to this is included on the website. Using the Book to Teach a Course

This book is the core reference for the first of a series of courses of instruction about measurement taught by the author in the Berkeley School of Education at the University of California, Berkeley, since 1986. It evolved into its current form over several years, with the format changing in response to successes and failures and student comments. Thus, the book has a quite natural relationship to a course that is based on it. The chapters form a sequence that can be used as a core for (a) a 14- to 15-week semester-length course where the students create their own instruments, or (b) for two 8-week quarter-length courses where the students design and develop an instrument in the first course, take advantage of the break to gather some relevant data using the instrument, and in the second quarter learn to analyze the data and develop a report on the reliability and validity evidence they have gathered. In my own teaching of the course, I structure it as a large-scale problem-based learning exercise for the students, where the problem they face is to develop, try out, and investigate an instrument. The students choose their own substantive aim for this exercise, deciding on a specific property they wish to measure and making their way through the building blocks and investigations. This makes it more than just a “concepts and discussion” course—it becomes an entry point into one of the core areas of the professional world of measurement. But, at the same time, this does not make it into a mere “instrument development” course—the purpose of having the students take the practical steps to create an instrument is to help them integrate the many ideas and practices of measurement into a coherent whole. This process of conceptual integration is more important than the successful development of a specific instrument—indeed a flawed instrument development can be more efficacious in this way than a successful instrument development. (Students sometimes think that when their plans go awry, as they often do, I am just trying to make them feel good by saying this, but it really is true.) The commitment needed to follow an instrument through the development process from Chapter 2 to Chapter 8 is really quite considerable. In order to do it individually, most students need to have a genuine and specific interest in the construction of a successful instrument to carry them through these steps in good

xxvi Preface

spirit. This is not too hard to achieve with the many students in a master’s or doctoral program who need to develop or adapt an instrument for their thesis or dissertation. If, however, a student is too early in their program, where they have, say, not yet decided on a dissertation or thesis topic, then it can be somewhat artificial for them to engage in the sustained effort that is required. For such students, it is more practicable to treat the instrument design as a group project (or perhaps several group projects), where many of the practical steps are streamlined by the planning and organization of the instructor. Students benefit greatly from the interactions that they can have when they hear about one another’s examples as they progress through the class. If the instructor follows the advice in the “Exercises and Activities” sections of the chapters, then each major stage of instrument development is shared with other members of the class. This is particularly important for the “item panel” (Chapter 3) and reporting the results of the estimation of the parameters of the statistical model (Chapter 5) steps. The nature of the array of Examples and the range of measurement procedures included reflect the range of types of instruments that students typically bring to the course. For example, students bring polytomous instruments (such as surveys and attitude scales) more often than dichotomous ones (such as multiple-choice tests)—that is why, for example, there is little fuss made over the distinction between dichotomous and polytomous items, often a considerable stumbling block in measurement courses. Many students bring achievement or cognitive testing as their topics, but this is usually only a plurality rather than a majority—students also commonly bring attitude and behavioral topics to the class, as well as a variety of more exotic topics such as measurement of organizations and even non-human subjects. Although the book is written as a description of these steps and can be read independently without any co-construction of an instrument, it is best read concurrently with the actual construction of an instrument. Reading through the chapters should be instructive, but developing an instrument at the same time will make the concepts concrete and will give the reader the opportunity to explore both the basics and the complexities of the concepts of measurement. Help for Old Friends

This book is, as I am sure you have noticed, a second edition, and there have been quite a few changes since the first. The main change is that there are more example contexts offered to illustrate the points being made (i.e., via the Examples). This should need no explanation—it is partly as a result of comments and suggestions I have received concerning the first edition. There has been a notable change in the terminology used in the text. As the reader will find, the book uses a special word for the distinct points along a

Preface  xxvii

construct that are used to help define the construct and to interpret the results: these are called waypoints in this edition and are defined and discussed at length in Chapters 1 and 2. In the previous edition, the term used for this was “level”; the reason for this change is that the word “level” has many meanings and uses, and using it with a specialist definition had caused both conceptual and linguistic problems for readers and learners. A second notable change is that in the second edition I have refrained from terming the statistical model used in estimating the fourth building block as the “measurement model,” which was the usage in the first edition. Although this is the typical usage in psychometrics and social science measurement more generally, I have come to see this as a serious misnomer. Indeed, the statistical model is just exactly that—a model that embodies the instrument developer’s ideas about how the item scores can be related to latent variables. But that is not all of “measurement,” and hence it is misleading to say it is the measurement model—as that implies that all the rest of the components of measurement are not important (see Wilson, 2011, for a discussion of this error in (my) usage). Especially given what I have learned in the last ten years about measurement philosophy (see, e.g., Mari et al., 2023), I am convinced that for psychometrics to be a useful science it must get over its “statistics envy” and embrace a larger vision of itself. In fact, the humble BAS itself deserves the label “measurement model” far more than any statistical model. As noted earlier, the computing support has moved online and to a new platform BASS (Wilson et al., 2019) (see Appendix B). This is associated with other changes—the new program does much more than the old one (the old one was merely a data analysis and results output program). The new program supports the whole building blocks approach from developing the construct map to summarizing reliability and validity evidence and reporting to respondents. The new program also comes with different levels of user interface, from a passive observer role, examining the development steps and results outputs for the Examples, through being an active user on small scale, to a professional testing and assessment platform (see the website noted in Appendix B). Note 1  A book based on this second course is currently in preparation—please look out for it.

ACKNOWLEDGEMENTS

The BAS four building blocks used in this volume have been developed from joint work with Geoff Masters and Ray Adams (Masters, Adams, and Wilson, 1990; Masters and Wilson, 1997). This work was inspired by the foundational contributions of Benjamin Wright of the University of Chicago. There are also important parallels with the “evidentiary reasoning” approach to assessment described in Mislevy, Steinberg and Almond (2003) and in Mislevy, Wilson, Ercikan, and Chudowsky (2003)—see section 10.4. I would like to acknowledge the intellectual contributions made by these authors to my thinking and hence to this work. The students of Measurement in Education and the Social Sciences I (EDUC 274A, initially 207A) in the Berkeley School of Education at the University of California, Berkeley, have, through their hard work and valuable insights, been instrumental in making this work possible. In particular, I would like to thank my valued colleagues who have co-taught the course with me, Karen Draney and Perman Gochyyev, as well as the many members of my research group “Models of Assessment,” who have over the years provided close and critical readings of the first edition: Derek Briggs, Nathaniel Brown, Brent Duckor, John Gargani, Laura Goe, Cathleen Kennedy, Jeff Kunz, Ou Lydia Liu, Quiang Liu, Insu Paek, Deborah Peres, Mariella Ruiz, Juan Sanchez, Kathy Scalise, Cheryl Schwab, Laik Teh Woon, Mike Timms, Marie Wiberg, and Yiyu Xie. Many colleagues have contributed their thoughts and experiences to the first edition volume. I  cannot list them all but must acknowledge the important contributions of the following: Ray Adams, Alicia Alonzo, Paul De Boeck, Karen Draney, George Engelhard Jr., William Fisher, Tom Gumpel, PJ Hallam,

Acknowledgements  xxix

Machteld Hoskens, Florian Kaiser, Geoff Masters, Bob Mislevy, Stephen Moore, Pamela Moss, Ed Wolfe, Benjamin Wright, and Margaret Wu. The many revisions in the second edition have required support from multiple reviewers and critical friends—I thank them for their dedication to persist through multiple versions. They include Haider Ali Bhatti, Nathaniel Brown, Aubrey Condor, Karen Draney, Brent Duckor, Himilcon Inciarte, Florian Kaiser, Julian Levine, James Mason, Andy Maul, Rebecca McNeil, Smriti Mehta, Linda Morell, Veronica Santelices, Kathy Scalise, Bob Schwartz, David Stevens, Josh Sussman, Sean Tan, Mike Timms, Yukie Toyama, Diana Wilmot, Xingyao (Doria) Xiao, Mingfeng Xue, and Shih-Ying Yang. I would particularly like to thank those who read through full versions of the entire manuscript. The team that worked on the BASS software has also made important contributions: Anna (Anita) Maria Albornoz Reitze, David Torres Irribarra, Karen Draney, and Perman Gochyyev. Again, thank you to you all. Of course, any errors or omissions are my own responsibility. Mark Wilson, Berkeley, California, March 2023

PART I

A Constructive Approach to Measurement

1 THE BEAR ASSESSMENT SYSTEM Overview of the “Four Building Blocks” Approach

There is nothing as practical as a good theory. —Kurt Lewin (1943)

1.1  What Is “Measurement”?

Measurement is widely practiced in many domains, such as science, manufacturing, trade, medicine, health, psychology, education, and management. It is the aim of this book to focus particularly on measurement in domains where the intent is that human attributes (or properties) are to be measured—attributes such as their achievements, their attitudes, or their behaviors. Typically, these are measured using instruments such as psychological scales, achievement tests, questionnaires, and behavioral checklists. The reason for gathering the measurements may range from making a decision about just a single person, to making decisions about social groups (such as schools and businesses), including groups involved in a research study (such as a psychological experiment, or a design experiment); sometimes the context does not require an explicit decision-making context, but instead the purpose is to monitor and track certain aspects of the measurements such as changes over time. The general approach to measurement adopted here is one that is especially pertinent to the social domains, but to physical and biological sciences as well. A general definition is that measurement is an empirical and informational process, designed on purpose, whose input is an empirical property of an object and that produces information in the form of values of that property. (Mari et al., 2023, p. 25) DOI: 10.4324/9781003286929-2

4  The BEAR Assessment System

In this definition, the term “property” is used for the real-world human characteristic that we wish to measure, generally labeled as an attribute, or a latent trait, but also more specifically (depending on context) as an ability, an attitude, a behavior, etc. Thus, measuring is a designed process, not merely a matter of finding something adventitiously, and the outcome is a value (sometimes a number, sometimes an output category), a piece of information about the person. Important qualities that should pertain to measurements are objectivity and intersubjectivity. Objectivity is the extent to which the information conveyed by the measurement concerns the property under measurement and nothing else. Intersubjectivity is enhanced to the extent that that information is interpretable in the same way by different measurers in different places and times. Going beyond this very basic definition, measurement is also characterized by an assessment of the quality of the information conveyed by the output (Mari et  al., 2023, pp. 58–63): “every measurement is tainted by imperfectly known errors, so that the significance which one can give to the measurement must take account of this uncertainty” (ISO, 1994: Foreword). The evaluation of random and systematic variations in measurements is traditionally ascertained through summaries of the typical variations in measurements over a range of circumstances (i.e., random errors) and investigation of biases in the measurement (i.e., systematic errors), the two together helping to establish the trustworthiness of the measurement. In the social sciences, these two aspects have been referred to, broadly, as relating to the reliability of the measurements and the validity of the measurements, respectively, and these will be examined in greater detail in Chapters 7 and 8. Other terms are used in the physical sciences such as precision and trueness, respectively, combining together to determine accuracy of the measurement (Mari et al., 2023, pp. 52–58).2 The approach adopted here is predicated on the idea that there is a single underlying attribute that an instrument is designed to measure. Many surveys, tests, and questionnaires are designed to measure multiple attributes. Here it will be assumed that, at least in the first instance, we can consider those characteristics one at a time, so that the full survey or test is seen as being composed of several instruments each measuring a single attribute (although the instruments may overlap in terms of the items). This intention is established by the person who designs and develops the instrument (the instrument developer) and is then adopted by others who also use the instrument (the measurers). Correspondingly, the person who is the object of measurement will be called the respondent throughout this book—as they are most often responding to something that the measurer has asked them to do—although that term will be made more specific in particular contexts, such as “student” where the context is within education, “subject” where it is a psychological study, “patient” when it involves medical practice, etc. Note that, although the central focus of the applications in this book is the measurement of human

The BEAR Assessment System  5

TEXTBOX 1.1  SOME USEFUL TERMINOLOGY In this volume, the word instrument is defined as a technique of relating something we observe in the real world (sometimes called “manifest” or “observed”) to an attribute that we are measuring that exists only as a part of a theory (sometimes called “latent” or “unobserved”). This is somewhat broader than the typical usage, which focuses on the most concrete manifestation of the instrument—the items or questions. This broader definition has been chosen to expose the less obvious aspects of measurement. Examples of the kinds of instruments that can be subsumed under the “construct mapping” framework are shown in this and the next several chapters. Very generally, it will be assumed that there is a respondent who is the object of measurement—sometimes the label will be changed depending on an application context, for example, a subject in a psychological context, a student or examinee in education, a patient in a health context. Also, very generally, it will be assumed that there is a measurer who seeks to measure something about the respondent; and when the instrument is being developed, this may be made more specific by referring to the instrument developer. While reading the text the reader should mainly see yourself as the measurer and/or the instrument developer, but it is always useful to be able to assume the role of the respondent as well.

attributes, this is not a limitation of the general procedures described—they can be applied to properties of any complex object—and this will be commented upon when it is pertinent. The measurements that result from applying the instrument can be seen as the result of a scientific argument (Kane, 2006) embodied in the instrument and its design and usage. The decision may be the very most basic measurement decision that a respondent has a certain value on the attribute in question (as in the Basic Evaluation Equation—Mari et  al., 2023, p. 130), or it may be part of a larger context where a practical or a scientific decision needs to be made. The building blocks of the BAS that are described subsequently can thus be seen as a series of steps that can be used as the basis for this argument. First, the argument is constructive; that is, it proceeds by constructing the instrument following a design logic based on the aforementioned definition of measurement (this occupies the contents of Chapters 2–5). Then the argument is reflective, proceeding by gathering data on the instrument’s functioning in an empirical situation, and interpreting the resulting information on whether the instrument did indeed

6  The BEAR Assessment System

function as planned in terms of validity and reliability (this occupies the contents of Chapters 6–8). Thus, in this book, the concept that is being explored is more like a verb, “measuring,” than a noun, “measurement.” In general, the approach here can be seen as being an embodiment of Mislevy’s sociocognitive approach to human measurement (Mislevy, 2018) and also as an example of Principled Assessment Design (see Chapter 10 and also Ferrara et al., 2016; Nichols et al., 2016; Wilson & Tan, 2023). There is no claim being made here that the procedures described subsequently are the only way to make measurements—there are other approaches that one can adopt. The aim is not to survey all such ways to measure, but to lay out one particular approach that the author has found successful over the last three and a half decades of teaching measurement to students at the University of California, Berkeley, and consulting with people who want to develop instruments in a wide variety of areas. 1.1.1  Construct Modeling What is the central aim of the BAS?

The general approach to measurements that is described in this book, which I call construct modeling, is based on a constructive way of understanding the process of measurement. It is closely related to the approach taken by the (US) National Research Council (NRC) in a NRC Committee report on the status of educational assessment at the turn of the century (NRC, 2001). The Committee laid out what has become a broadly accepted formulation of what should be the foundations of measurement in that field (and more broadly, in social sciences). According to the Committee (see the “NRC Assessment Triangle” in Figure 1.1): First, every assessment is grounded in a conception or theory about how people learn, what people know, and how knowledge and understanding progress

Cognition

Interpretation FIGURE 1.1 

Observation

The National Research Council (NRC) Assessment Triangle.

The BEAR Assessment System  7

over time. Second, each assessment embodies certain assumptions about which kinds of observations, or tasks, are most likely to elicit demonstrations of important knowledge and skills from students. Third, every assessment is premised on certain assumptions about how best to interpret the evidence from the observations in order to make meaningful inferences about what students know and can do. (p. 16) In Figure 1.1, the three foundations are labeled as Cognition, Observation, and Interpretation. The foundations, then, are seen as constituting a guide for “the process of collecting evidence to support the types of inferences one wants to draw . . . referred to as reasoning from evidence” [emphasis in original] (Mislevy, 1996, p.  38). Thus, construct modeling can also be seen as an example of evidencecentered design for measurement (Mislevy et al., 2003). 1.2  The BEAR Assessment System What are the parts of the BAS, and how do they relate to measuring?

The BEAR Assessment System (BAS; Wilson & Sloane, 2000) is an application of construct modeling. It uses four “building blocks” to address the challenges embodied in the NRC Triangle: (a) construct map, (b) items design, (c) outcome space, and (d) the Wright map. These building blocks are shown in Figure 1.2 in the form of a cyclical sequence that occurs during assessment development, a cycle which may also iterate during that development. Each of these four building blocks is an application of (parts of) the three foundations from the NRC

Items Design

Construct Map

Wright Map FIGURE 1.2 

Outcome Space

The four building blocks in the BEAR Assessment System (BAS).

8  The BEAR Assessment System

Triangle. Hence, the foundations are also seen as being principles for assessment development. The match, in sequence, is as follows: (a) The construct map is the embodiment of the principle of Cognition. (b) The items design is the practical plan for carrying out Observation. (c) The outcome space and the Wright map jointly enable Interpretation. This correspondence is explained subsequently in the respective sections of Chapter 1, using a single example (specifically, Example 1) to help make the points concrete, and then each of the next four chapters is devoted to one of the building blocks, in turn, giving further examples to elucidate the range of application. In this chapter, the four building blocks will be illustrated with a recent example from educational assessment—an assessment system built for a middle school statistics curriculum that leans heavily on the application of learning sciences ideas in the “STEM” (science/technology/engineering/mathematics) domain, the Data Modeling curriculum (Lehrer et al., 2014). The Data Modeling project carried out jointly by researchers at Vanderbilt University and the University of California, Berkeley, was funded by the US National Science Foundation (NSF) to create a series of curriculum units based on real-world contexts that would be familiar and interesting to students. The goal was to make data modeling and statistical reasoning accessible to a larger and more diverse pool of students along with improving preparation of students who traditionally do not do well in STEM subjects in middle school. The curriculum and instructional practices utilize a learning progression to help promote learning. This learning progression describes transitions in reasoning about data and statistics when middle school students are inducted into practices of visualizing, measuring, and modeling the variability inherent in contextualized processes. In the Data Modeling curriculum, teaching and learning are closely coordinated with assessment. 1.3  The Construct Map How will the attribute be described?

The most obviously prominent feature of the measurement process is the instrument—the test, the survey, the interview, etc. But, when it comes to developing the measurement process itself, the development of the instrument is actually somewhat down that track. Pragmatically, the first inkling is typically embodied in the purpose for which an instrument is needed and the context in which it is going to be used (i.e., involving some sort of decision). This need to make a decision is often what precipitates the idea that there is an attribute of a person that needs to be measured. Thus, even though it might not be the first step in measurement development, the definition of the attribute must eventually take center place.

The BEAR Assessment System  9

Consistent with current usage, the attribute to be measured will be called the construct (see Messick (1989) for an exhaustive analysis). A construct could be a part of a theoretical model of a person’s cognition, such as their understanding of a certain set of concepts, or their attitude toward something, or it could be some other psychological construct such as “Need for Achievement” or a personality trait such as Extraversion. It could be from the domain of educational achievement, or it could be a health-related construct such as “Quality of Life,” or a sociological construct such as “rurality” or migrants’ degree of assimilation. It could relate to a group rather than an individual person, such as a work group or a sports team, or an institution such as a workplace. It can also take as its object something that is not human, or composed of humans, such as a forest’s ability to spread in a new environment, a volcano’s proclivity to erupt, or the weathering of paint samples. There is a multitude of potential constructs—the important thing here is to have one that provides motivation for developing an instrument, a context in which the instrument might be used, and, ideally, a theoretical structure for the construct. The idea of a construct map is a more precise concept than “construct.” First, we assume that the construct we wish to measure has a particularly simple form—it extends from one end of the construct to another, from high to low, or small to large, or positive to negative, or strong to weak. The second assumption is that there are consecutive distinguishable qualitative points between the extremes. Quite often the construct will be conceptualized as describing successive points in a process of change, and the construct map can then be thought of as being analogous to a qualitative “roadmap” of change along the construct (see, e.g., Black et  al., 2011). In recognition of this analogy, these qualitatively different locations along the construct will be called “waypoints”—and these will, in what follows, be very important and useful in interpretation. Each waypoint has a qualitative description in its own right, but, in addition, it derives meaningfulness by reference to the waypoints below it and above it. Third, we assume that the respondents can (in theory) be at any location in between those waypoints—that is, the underlying construct is dense in a conceptual sense. There have been historically preceding concepts that have been formative in developing the idea of a construct map. Each of them features some aspects of the idea described in the preceding paragraph, but none are quite the same. A very common example is the set of traditional school grades (A to F, etc.). Probably the most prominent example in the assessment literature is Bloom’s Taxonomy, which focuses on behavioral objectives (in education) as central planning tools for educational curricula, and which features hierarchies of levels of objectives for the cognitive (Bloom, et  al., 1956) and affective (Krathwohl et  al., 1964) domains—for example, the levels of the cognitive domain have the following labels: Remember, Understand, Apply, Analyze, Evaluate, and Create (from the latest revision, Anderson et al. (2001)). Historically, this has been broadly used

10  The BEAR Assessment System

in educational circles around the world, and its status as a predecessor must be acknowledged. However, there are important distinctions between Bloom’s Taxonomy and the concept of a construct map: (a) Bloom’s Taxonomy is a list of objectives, designed to help plan a sequence of instruction—there is no theoretical necessity to see them as defining a single construct. (b) The levels of the Taxonomy, being behavioral objectives, are not necessarily good targets for designing assessments (although many educational researchers have used them in that way). (c) The Taxonomy is seen as being universal, ranging across almost any cognitive or affective variable, which contrasts with the required specificity of the construct in the construct map. (d) There is no posited relationship between the underlying construct and the Taxonomy’s equivalent of the waypoints (i.e., “Bloom’s levels”). Other important precedents include the stage theories of Jean Piaget (regarding cognitive development; Flavell (1963), Inhelder & Piaget (1958)), and the learning hierarchies of Robert Gagné (regarding conceptual learning; Gagné (1968)). Each of these has similarities and distinctions from the concept of a construct map (for discussion of each, see Wilson (1989)). In Figure  1.3, there are four illustrative waypoints (the small circles)—these will be defined within the theoretical context (i.e., semantically), and are ordered

Highest qualitatively distinguished point (4) Intermediate point (3)

Intermediate point (2)

Lowest qualitatively distinguished point (1)

FIGURE 1.3 Illustration

of a generic construct map, incorporating qualitative personside waypoints.

The BEAR Assessment System  11

within that same theory. The line running between the waypoints represents the (dense) possible locations where the respondents are located. The waypoints, although substantively ordered, as noted, are not necessarily equally spaced apart on the metric of the construct—and so, that is also exemplified in Figure 1.3. At this point in the introduction, there is no metric that has been developed—no scale, so to speak, so no expectations on this can usefully be entertained at this point. To reiterate, at this initial stage of instrument development, the construct map is still an idea, a latent rather than a manifest conception. For those familiar with psychometric models in the social sciences, one can think of a construct map as a special sort of unidimensional latent variable (i.e., the underlying latent variable for a monotone unidimensional latent variable model3 (Junker & Ellis, 1997)). It is special in the sense that the underlying latent variable has, so far, no specific scale, but it does have particular (conceptual) locations on it (the waypoints) which must be derived from the substantive content of the construct. 1.3.1 Example 1: The MoV Construct in the Data Modeling Assessments

Both the Data Modeling curriculum (introduced at the end of Section 1.2), and its assessment system are built on a set of six constructs or strands of a learning progression (the complete set of all six are described in Section 9.3). These have been designed to describe how students typically develop as they experience the Data Modeling curriculum. Conceptual change along these constructs was encouraged by instructional practices that served to engage students in versions of core professional practices of statisticians and data scientists which had been adapted to a school context and to be appropriate for young students. In the classrooms, students developed and iterated how to visualize, measure, and model variability. In the account here, the primary focus is on the assessment aspects of the learning progression, mainly concentrating on the development and deployment of the assessments. Further information on the instructional aspects of the learning progression can be found in Lehrer et al. (2020). In this chapter, we will focus on one of the topics that the Data Modeling researchers developed: “Models of Variation” (abbreviated as MoV). As students’ conceptions of variation develop, the MoV construct describes their progression in creating and evaluating mathematical models of random and non-random variation. These models initially focus on the identification of sources of variability, then advance to the incorporation of chance devices to represent the (mathematical) mechanism of those sources of variability, and, at the most sophisticated point, involve the judgment of how well the model works (i.e., model fit) by examining how repeated model simulations relate to an empirical sample. In the Data Modeling curriculum, a student’s ideas about models and modeling are scaffolded by classroom practices that encourage students (usually

12  The BEAR Assessment System

FIGURE 1.4 The

Models of Variation (MoV) construct map. Note that the distances between waypoints on this (hypothetical) map are arbitrary.

working together in small groups) to develop and critique models of data generation processes. A construct map consists of an ordered list of typical waypoints that students reach as they progress through a series of ways of knowing: in the case of MoV, this represents how the students in the Data Modeling curriculum typically learn how to devise and revise models of variability. The book’s initial example of a construct map is the MoV construct map and this is shown in Figure 1.4. Details about the MoV example have been previously published in Wilson & Lehrer (2021). At the first point-of-interest, shown at the bottom of Figure 1.4 and labeled “MoV1,” students associate variability with particular sources of variability, which the curriculum encourages by asking students to reflect on random processes characterized by signal and noise. For example, when considering variability of measures of the same object’s width, students may consider variability as arising from errors of measurement, and “fumbles” made by some measurers because they were not sufficiently precise. To be judged as being at this initial point, students should say something about one or more sources of variability but not go so far as to implicate chance origins to variability. Note that, in actually recording such observations, it is also common that some students will make statements that

The BEAR Assessment System  13

do not reach this initial point on MoV, and hence these might be labeled as being at MoV0 (by implication, below MoV1), but this story is a little more complicated, as will be noted in Section 1.5.1. When students get to the next waypoint, Waypoint MoV2, they informally begin to order the relative contributions of different sources to variability, using wording such as “a lot” or “a little.” Thus, students are referring (implicitly though not explicitly) to mechanisms that they believe can cause these distinctions, and they typically predict or explain the effects on variability. The move from MoV2 up to MoV3 is an important transition in students’ reasoning. At MoV3, students begin to explicitly think about chance as contributing to variability. In the Data Modeling curriculum, students initially experience and investigate simple devices and phenomena where it is agreed (by all in the classroom) that the behavior of the device is “random.” An example of one such device is the type of spinner illustrated in Figure 1.5. The students have initial experiences with several different kinds of spinners (different numbers of sectors, different shapes, etc.). Then they are engaged in a design exercise where students are provided with a blank (“mystery”) spinner and are asked to draw a line dividing the spinner into two sectors which would produce different proportions of the outcomes of the spinner, say 70% and 30%. This is then repeated for different proportions and different numbers of categories, so that students understand how the geometry of the spinners embody both the randomness and the structural features of chance. The conceptual consequences of such investigations are primarily captured in another construct (i.e., the Chance construct, see Section 9.3), but in terms of the modeling, students also are led to understand that chance can be a source of variability, even in modeling. At MoV4, the challenge to students is to transition from thinking about single (random) sources of variability (such as the spinner in Figure  1.5) to

FIGURE 1.5 

Illustration of a spinner.

14  The BEAR Assessment System

conceptualizing variability in a process as emerging from the combination of multiple sources of variation, some of which might be random and some not. For example, a distribution of repeated measurements of an attribute of an object might be thought of as a combination of a fixed amount of a respondent’s attribute (which they might think of as the “true amount”) and one or more components of chance error in the measurement process. Within the Data Modeling curriculum, one such exercise involves teams of students measuring the “wingspan” of their teacher (i.e., the width of the teacher’s reach when their arms are spread out). Here the random sources might be the following: (a) The “gaps” that occur when students move their rulers across the teacher’s back (b) The “overlaps” that occur when the endpoint of one iteration of the ruler overlaps with the starting point of the next iteration (c) The “droop” that occurs when the teacher becomes tired and their outstretched arms droop The students not only consider these effects, but, in this curriculum, they also go on to model them using assemblies of spinners, initially in their usual physical format (i.e., using spinners), and advancing to a virtual representation on a computer (which is intended to make the simulations easier). When they move on to MoV5, students arrive at an evaluative point, where they consider variability when judging the success of the models they have devised previously. They run multiple simulations and observe that one run of a model’s simulated outcomes may fit an empirical sample well (e.g., similar median and inter-quartile range values, similar “shapes,” etc.) but the next simulated sample might not. In this way, students are prompted by the teacher to imagine running simulations repeatedly, and thus can come to appreciate the role of multiple runs of model simulations as a tool that can be used to ascertain the success of their model. This is a very rich and challenging set of concepts for students to explore, and, eventually, grasp. In fact, even when assessing the abilities of students in this domain at the college entry level, it has been found that the accurate understanding of sampling statistics can be rather poorly mastered (see, e.g., Arneson et al., 2019). The Data Modeling MoV construct map is an example of a relatively complete construct map. When a construct map is first postulated, it may often be much more nebulous. The construct map is refined through several processes as the instrument is being developed. These processes include (a) explaining the construct to others with the help of the construct map, (b) creating

The BEAR Assessment System  15

items that you believe will lead respondents to give responses that inform the waypoints of the construct map, (c) trying out those items with a sample of respondents, and (d) analyzing the resulting data to check if the results are consistent with your intentions, as expressed through the construct map. These steps are illustrated in the three building blocks discussed in the next three sections of this chapter. 1.4  The Items Design What are the critical characteristics of the items?

The next step in instrument development involves thinking of ways in which the theoretical construct embodied in the construct map could be manifested via a real-world situation. At first, this may be not much more than a hunch: a context where the construct is involved or plays a determining role. Later, this hunch will become more crystallized, and settle into certain patterns. The time-ordered relationship between the items and the construct is not necessarily one way as it has just been described in the previous section. Oftentimes, the items will be thought of first, and the construct will be elucidated only later—this is simply an example of how complex a creative act such as instrument construction can be. The important thing is that the construct and the items should be distinguished, and that eventually, the items are seen as prompting realizations of the construct within the respondent. For example, the Data Modeling items often began as everyday classroom experiences and events that teachers have found to have a special significance in the learning of variability concepts. Typically, there will be more than one real-world manifestation used in the instrument. These parts of the instrument are generically called “items,” and the format in which they are presented to the respondent will be called the items design, which can take many forms. The most common ones are the multiple-choice format used in achievement testing and the Likert-type format from surveys and attitude scales (e.g., with responses ranging from “strongly agree” to “strongly disagree”). Both are examples of the “selected response” item type, where the respondent is given only a limited range of possible responses and is forced to choose among them. There are many variants of this, ranging from questions on questionnaires to consumer rankings of products. In contrast, in other types of items, the respondent may produce a “constructed response” within a certain mode, such as an essay, an interview, a performance (e.g., a competitive dive, a piano recital, or a scientific experiment)—usually these will be judged using a scoring guide or “rubric” which functions in way that is similar to the way a construct map functions.

16  The BEAR Assessment System

In  all these examples so far, the respondent is aware that they are being observed, but there are also situations where the respondent is unaware of the observation. A person might be involved in a game, for example, where an observer (human or automated) might record a certain suite of behaviors without the gamer being aware of the situation. Of course, in addition, the items may be varied in their content and mode: interview questions will typically range over many aspects of a topic; questions in a cognitive performance task may be presented depending on the responses to earlier items; items in a survey may use different sets of options—and some may be selected and some constructed in a variety of response formats. 1.4.1  Example 1: MoV Items

In the case of the Data Modeling assessments, the items are deployed in number of ways: (a) as part of summative pretest and posttest, (b) as part of meso-level assessment following units of instruction (Wilson, 2021), and (c) as parts of micro-level assessment in the form of prompts for “assessment conversations” where a teacher discusses the item and student suggested responses with groups of students. To illustrate the way that items are developed to link to the construct map, consider the Piano Width task, shown in Figure 1.6. This task capitalizes on the Data Modeling student’s experiences with ruler iteration errors (i.e., the “gaps and laps” noted in the previous section) in learning about wingspan measurement as a process that generates variability. Here the focus is on Item 1 of the Piano Width task: the first part—1(a)—is intended mainly to prompt the student take one of two positions. There are, of course, several explicit differences in the results shown in the two displays, but Item 1(a) focuses the students’ attention on whether the results show an effect due to the students’ measurement technique (i.e., short ruler versus long ruler). Some students may note that the mode is approximately the same for both displays and consider that sufficient to say “No.” Thus, this question, although primarily designed to set up the part 1(b) question, is also addressing a very low range on the MoV construct—these students are not able to specify the source of the variation, as they are not perceiving the spread as being relevant. Following this setup in Item 1(a), the second part, Item 1(b), does most of the heavy lifting for MoV, exploring the students’ understanding of how measurement technique could indeed affect variation, and targeting MoV2. This question does not range up beyond that, as no models of chance or chance devices are involved in the question.

The BEAR Assessment System  17

FIGURE 1.6 The

Piano Width task.

18  The BEAR Assessment System

1.4.2  The Relationship between the Construct and the Responses

The initial situation between the first two building blocks can be depicted as in Figure 1.7. Here both the construct and the items are only vaguely known, and there is some intuitive relationship between them (as indicated by the curved dotted line). Causality is often unclear at this point, perhaps the construct “causes” the responses that are made to the items, or perhaps the items existed first in the measurement developer’s plans and hence could be said to “cause” the construct to be developed by the measurement developer. It is important to see this as an important and natural step in instrument development—a step that often occurs at the beginning of instrument development—and may recur many times as the instrument is tested and revised. Unfortunately, in some instrument development efforts, the conceptual approach does not go beyond the state depicted in Figure 1.7, even when there are sophisticated statistical methods used in the data analysis (which, in many cases, do indeed assume a causal order). This unfortunate abbreviation of the instrument development process, which is mainly associated with an operationalist view of measurement (Mari et al., 2023), will typically result in several shortcomings: (a) Arbitrariness in choice of items and item formats, (b) No clear way to relate empirical results to instrument improvement, (c) An inability to use empirical findings to improve the conceptualization of the construct. To avoid these issues, the measurer needs to build a structure that links the construct closely to the items—one that brings the inferences as close as possible to the observations. One way to do that is to see causality as going from the construct to the items—the measurer assumes that the respondent “has” some amount of the

Construct

FIGURE 1.7  A  picture

Responses to items

of an initial idea of the relationship between construct and item responses.

The BEAR Assessment System  19

Causality Responses to items

Construct Inference FIGURE 1.8 A picture

of the Construct Modeling idea of the relationship between degree of construct possessed and item responses.

construct, and that amount of the construct is conceived of as a cause of the responses to the items in the instrument that the measurer observes. That is the situation shown in Figure 1.8—the causal arrow points from left to right. However, this causal agent is latent—the measurer cannot observe the construct directly. Instead, the measurer observes the responses to the items, and must then infer the underlying construct from those observations. That is, in Figure 1.8, the direction of the inference made by the measurer is from right to left. It is this two-way relationship between the construct and the responses that is responsible for much of the confusion and misunderstanding about measurement, especially in the social sciences—ideas about causality get confounded with ideas about inference, and this makes for much confused thinking (see Mari et al., 2023). The remaining two building blocks embody two different steps in that inference. Note that the idea of causality here is an assumption, and the analysis does not prove that causality is in the direction shown, it merely assumes it goes that way. In fact, the actual mechanism, like the construct, is unobserved or latent. It may be a much more complex relationship than the simple one shown in Figure 1.8. Until more extensive research might reveal the nature of that complex relationship, the measurer may be forced to act as though the relationship is the simple one depicted. 1.5  The Outcome Space How can responses be categorized so the categories are maximally useful?

The first step in the inference process illustrated in Figure 1.8 is to decide which aspects of the response will be used as the basis for the inference, and how those aspects will be categorized and scored. The result of all these decisions will be called the Outcome Space in this book. Examples of familiar outcome spaces include the following:

20  The BEAR Assessment System

(a) The categorization of question responses into “true” and “false” on a test (with subsequent scoring as, say, “1” and “0”), (b) The recording of Likert-style responses (Strongly Agree to Strongly Disagree) on an attitude survey, and their subsequent scoring depending on the valence of the items compared to the underlying construct. Less common outcome spaces would be the following: (c) The question and prompt protocols in a standardized open-ended interview (Patton, 1980, pp. 202–205) and the subsequent categorization of the responses, (d) The translation of a performance into ordered categories using a scoring guide (sometimes called a “rubric”). Sometimes the categories themselves are the final product of the outcome space, and sometimes the categories are scored so that the scores can (a) serve as convenient labels for the outcomes categories and (b) be manipulated in various ways. To emphasize this distinction, the second type of outcome space may be called a “scored” outcome space, whereas the first might be thought of as a “named” outcome space, for example, nominal with an ordered tendency noted, at least for now. The resulting scores play an important role in the construct mapping approach. They are the embodiment of the “direction” of the construct map (e.g., positive scores go “upward” in the construct map). The distinction between the outcome space and the items design, as described in the previous section, is not something that people are commonly aware of, and this is mainly due to the special status of what are probably the two most common item formats—the Likert-style item common in attitude scales and questionnaires and the multiple-choice item common in achievement testing. In both item formats, the items design and the outcome space have been collapsed—there is no need for the measurer to categorize the responses as that is done by the respondents themselves. And in most cases, the scores to be applied to these categories are also fixed beforehand. However, these common formats should be seen as “special cases”—the more generic situation is where the respondent constructs their own responses, most commonly in a written (e.g., an essay) or verbal (e.g., a speech or an interview) format, but it could also be in the form of a performance (e.g., a dive) or a produced object (e.g., a statue). In this constructed response type of outcome space, the responses are selected into certain categories by a rater (sometimes called a “reader” or a “judge”). The rater might also be a piece of software that is part of an automated scoring system, as featured in the latest educational technology using machine-learning algorithms. That the constructed response form is more basic becomes clear when one sees that the development of the options for the

The BEAR Assessment System  21

selected responses will in most cases include an initial development iteration that uses the free-response format (we return to this point in Section 3.3). That the Likert-style response format, in many cases, does not require such an initial step may be seen as an advantage for the developers, but see the discussion about Example 2 in Section 4.3.3 for a discussion of that. In developing an outcome space for a construct map, several complications can arise. One is that when the designed waypoints are confronted with the complications of actual responses to items, sometimes there is found to be a useful level of subcategories within (at least some) waypoints. These can be useful in several ways: (i) they give more detail about typical responses, and hence help the raters make category decisions, (ii) they can be related to follow-on actions that might result from the measurements (e.g., in the case of achievement tests, point teachers toward specific instructional strategies), and, in some situations (iii) they may give hints that there may be a finer grain of waypoints that could form the basis for measurement if there were more items/responses, etc. In case (iii), this may be denoted in the category labels using letters of the alphabet (“MoV2A,” etc.), though other denotations may be more suitable when implications of ordering are not warranted. (This is exemplified in the example in the next section.) A variation on this is that sometimes there are categorizations that are considered “somewhat higher” than a certain waypoint, or “somewhat below” in which case “+” and “−” are simply added to the label for the waypoint: Mov2+ or MoV2−, for instance. In describing an outcome space, it is very helpful to the reader to show examples of typical responses at each waypoint—these are called exemplars in this book. These are useful for both selected and constructed response type items. Mostly the exemplars will consist of very clear examples of responses for each waypoint, in order to help with interpretation of the waypoint. But when the outcome space is to be used by raters, it is also helpful to include adjudicated difficult cases, sometimes called “fence-sitters.” 1.5.1  Example 1: The MoV Outcome Space

The outcome space for the Data Modeling Models of Variability construct is represented in Appendix D (in Figure A.1)—the whole outcome space is not shown here due to its length. Glancing at the Appendix, one can see that the Data Modeling outcome space is conceptualized as being divided into five areas corresponding to the five waypoints in the construct map shown in Figure 1.4, running from the most sophisticated at the top to the least sophisticated at the bottom. Here, the focus is on just one waypoint, as shown in Figure 1.9. The columns in this figure represent the following (reading from left to right): (a) The label for the waypoint (e.g., “MoV2”) (b) A description of the waypoint

22  The BEAR Assessment System



M 0

V 2

Informally describe the contribution of one or more sources of variability to variability observed in the system.

MoV2 B





MoV2 A

FIGURE 1.9 

Describe how a process or a change in process affects variability.

Informally estimate the magnitude of variation due to one or more sources.





"When we used the ruler, there were more mistakes (more gaps and laps), but when we switched to the tape measure, there were fewer mistakes. So, the measurements with the ruler are more spread out than the measurements with the tape measure." "If we all carefully measure the height of the flagpole, we will all be pretty close to its real height; so our measurements will look like a hill." "When we first plant and measure each plant's height, the height can't be less than 0. So the plants' heights on day 3 will be skewed to the left." "The amount of water makes a lot of difference in how tall the plants grow but where we put them in the pot doesn't matter as much." "The error due to gaps and laps is more than the error due to misreading the ruler."

A segment of the MoV Outcome Space (from Appendix D).

(c) Labels for possible intermediate waypoints (d) A description of each of those (e) Exemplars of students’ responses to items for each of the intermediate points In Figure 1.9, we can note that MoV2 has been divided into two intermediate points: MoV2A and MoV2B. These distinguish between (a) an apprehension of the relative contributions of the sources of variability (MoV2A) and (b) demonstration of an informal understanding of the process that affects that variability (i.e., MoV2B), respectively. These are seen as being ordered in their sophistication, though the relative ordering will also depend on the contexts in which these are displayed (i.e., we would expect this ordering within a certain context, but across two different contexts MoV2A may be harder to achieve than MoV2B). The outcome space for a construct is intended to be generic for all items in the full item set associated with that construct. When it comes to a specific item, there is usually too much detail which precludes putting it all into documents like the one in Appendix D. Hence, to make clear how the expected responses to a specific item relate to the outcome space and to aid raters in judging the responses, a second type of document is needed, a scoring guide, which is focused on a specific item, or, when items are of a generic type, on an item representing the set. For example, the scoring guide for the Piano Width Item 1(b) (which was shown in Figure 1.6) is shown in Table 1.1. As noted earlier, the most sophisticated

The BEAR Assessment System  23

responses we usually get to the Piano Width task are at MoV2, and typically fall into one of two MoV2 categories after “Yes” is selected for Item 1(a). MOV2B: The student describes how a process or change in the process affects the variability. That is, the student compares the variability shown by the two displays. The student mentions specific data points or characteristics of the displays. For example, one student wrote: “The Meter stick gives a more precise measurement because more students measured 80-84 with the meter stick than with the ruler.” MOV2A: The student informally estimates the magnitude of variation due to one or more sources. That is, the student mentions sources of variability in the ruler or meter stick. For example, one student wrote: “The small ruler gives you more opportunities to mess up.” Note that this is an illustration of how the construct map waypoints may be manifested into multiple intermediate waypoints, and, as in this case, there may be some ordering among responses of the same category (i.e., MoV2B is seen as a more complete answer than MoV2A). Less sophisticated responses are also found: MoV1: The student attributes variability to specific sources or causes. That is, the student chooses “Yes” and attributes the differences in variability to the measuring tools without referring to information from the displays. For example, one student wrote: “The meterstick works better because it is longer.” Of course, students also give unclear or irrelevant responses, and commonly this category might be labeled as “MoV0,” with the “0” indicating that it is a waypoint that is below the lowest one that has been defined so far in the development process (i.e., below MoV1, as shown in Figure 1.4). But the Data Modeling developers were very attentive to the responses they observed in their initial rounds of data collection, and they went a step further: they postulated two lower waypoints. For decidedly irrelevant responses such as “Yes, because pianos are heavy,” they labeled them as “No Link(i)” and abbreviated this as “NL(i).” The Data Modeling researchers labeled this as “No Link” because this response does not provide strong evidence that the student’s response links to one of the postulated waypoints MoV1 to MoV5. However, in the initial stages of instruction in this topic, students also gave responses that were not clearly at MoV1 but were judged to be better than completely irrelevant (i.e., somewhat better than NL(i)). Typically, these responses contained relevant terms and ideas but were not accurate enough to warrant labelling as MoV1. For example, one student wrote: “No, Because it equals the same.” This type of response was labeled “No Link(ii)” (abbreviated as NL(ii)), and was placed lower than MoV1 but above NL(i) on the construct map—see Table 1.1). Moreover, it was found that at the beginning of the instruction on this

24  The BEAR Assessment System TABLE 1.1  Scoring guide for the “Piano Width” Item 1(b)

Label

Description of Response

MoV2B Describe how a process or change in the process affects the variability.

Students chooses “yes” while mentioning the data explicitly when discussing the variability.

Examples • “Yes, using the meter stick they had more on 80–84, they had 12. Using the small ruler they had less on 80-84, they had 8. They had less than the others.” • “More kids got the same answer when they used the meter stick because over half of the circles are 80–84. That means the meter stick is better.” • “The meter stick doesn’t have to be flipped so it’s easier to get a more accurate measurement. You will have a tighter clump near where the true measure is.” • “Yes, there are more circles in the middle and not as many on the end when you use the meter stick.” • “On the meter stick they got a lot of 80-84 measurements, but the ruler didn’t”

MoV1A Attribute variability • Because when you move the ruler it can change to specific sources or the length, but with the meter stick you just lay it causes. down. The graph shows that. • Yes, it did affect it because it is less accurate if you measure it in centimeters. • Yes, because with the small ruler you will be Student chooses “yes” morelikely to make a mistake than with the big and attributes the ruler. differences in variability • The meterstick works better because it is longer. to the measuring tools • Because you skip less space if you use a longer tool. without referring to • It coul information from the displays. NL(ii)

• “No” The student does • “No, Because it equals the same” not recognize that • “Yes. They should use cm.” the measurements are affected by the different tools. Student chooses ”no” and might state that the tools would give the same measure, or student chooses “yes” but gives no clear reason why.

NL(i)

Unclear or irrelevant response.

M

Missing Response

• “The two groups used different tools.” • “Yes, because pianos are heavy.”

The BEAR Assessment System  25

topic, students were struggling to give responses to even the very easiest questions, and so, to help with these “pretest” situations, the NL(ii) response category was maintained for this purpose. This phenomenon, where the observed responses from the respondents give new insights into the categories of the outcome space, is quite common in the first few iterations of the BAS cycle. Here it has resulted in the addition of two waypoints below the initially identified set. We will see a second sort of modification that arises in the next section, based on the results from the use of the Wright map. In part, this Example has been chosen to show how complications like this can arise, and how they can be incorporated into the results without losing the essential ideas. 1.6  The Wright Map How can measurement data be analyzed to help evaluate the construct map?

Once the initial versions of the outcome space and the individual item scoring guides have been established, the next step is to study the instrument’s empirical behavior by administering it to an appropriate sample of respondents. This results in a data set composed of the codes or scores for each person in the sample. The second step in the inference is to relate these scores back to the construct. This is done through the fourth building block, which we will term the Wright map. Effectively, the Wright map is composed of two parts, each of which is very important, but we refer to them together as the “Wright map” because the representation that we call the Wright map brings these two parts together—indeed, they have to work together in order to achieve the aims of the measurement. The first part is the application of a statistical model (sometimes called a “psychometric model” because this approach is often used in psychometrics). This statistical model is used to transform the item-based codes based on the waypoints (scored into integers 0, 1, 2, etc.) and estimate respondent locations on a metric that enables comparison of the results between different respondents and different occasions. Simply put, the Wright map enables the translation of the scored item responses into a metric that can be related back to the construct map. This enables the examination of the item locations in that metric to empirically examine how well they match the waypoints. Thus, when the two parts of the Wright map work together successfully, they enable interpretation of the measurements in terms of the construct map’s waypoints. In this book, we will estimate item and respondent parameters using a particular type of statistical model called the “Rasch” model. This model is named after Georg Rasch, who discovered that it has interesting properties. It is defined in Section  5.2.1 and one of its most important properties is discussed in Section 9.1. It is also called a one-parameter logistic (1PL) model: the “one parameter” referred to here is the single item parameter, for the item difficulty. There

26  The BEAR Assessment System

is also one parameter for the respondent’s ability. Both of these will be explained in Section 5.2 and the sections following that. A related model, the partial credit model (PCM), will also be used to carry out the estimation mentioned in the previous paragraph. These models are suitable for situations where the construct (a) is reasonably well thought of as a single construct map (as noted earlier), and (b) has categorical observations. Other situations are also common and these will be discussed in later chapters. In this chapter, we will not examine the statistical equations used in the estimation (these are central to Chapters 5 and 6) but will instead focus on the main products in terms of the Wright map. The interpretation of the results from the estimation is aided by several types of graphical summaries. The graphical summaries we will use here have been primarily generated using a particular computer application, BASS (see Appendix B). Other software can be used for the estimation step, and several of them also generate many of the same graphs (e.g., ConQuest (Adams et al., 2020) and TAM (Robitzsh et al., 2017)). The most important of these graphical summaries for our purposes in this chapter is the “Wright Map” (you can see examples of this in Figures 1.10 and 1.11). This graph capitalizes on the most important feature of a successful analysis using the Rasch model: the estimated locations of the respondents on the construct underlying the construct map, which can be matched to the estimated locations of the categories of item responses. This allows us to relate our hypotheses about the items that have been designed to link to specific construct map waypoints through the response categories. This feature is crucial for both the measurement theory and measurement practice in a given context: (a) in terms of theory, it provides a way to empirically examine the structure inherent in the construct map, and adds this as a powerful element in studying the validity of use of an instrument; and (b) in terms of practice, it allows the measurers to “go beyond the numbers” in reporting measurement results to practitioners and consumers, and equip them to use the construct map as an important interpretative device. Beyond the Wright map, the analysis of the data will include steps that focus on each of the items, as well as the set of items, including item analysis, individual item fit testing, and overall fit testing, as well as analyses of validity and reliability evidence. These will be discussed in detail and at length in Chapters 6–8. For now, in this chapter, the focus will be on the Wright map. 1.6.1  Example 1: The MoV Wright Map

Results from the analysis of data that were collected using the Data Modeling MoV items were reported in Wilson and Lehrer (2021). The authors used a sample of 1002 middle school students from multiple school districts to calibrate these items. In a series of analyses carried out before the one on which the following results are based, they investigated rater effects for the constructed response items, and no statistically significant rater effects were found, so these are not included in the analysis.

The BEAR Assessment System  27

But also see Sections 4.5 and 7.4 for further comment on rater effects. Specifically, the authors fitted a partial credit statistical model, a one-dimensional Rasch-family item response model (Masters, 1982), to the responses related to the MoV construct. The statistical model results are reported in terms of item parameters called “Thurstone thresholds” that correspond to the differences between successive waypoints on the construct map. For each item, they used the threshold values to describe the empirical characteristics of the item (Adams et al., 2020; Wilson, 2005). The way that the item is displayed on a Wright map is as follows: (a) If an item has k score categories, then there are k − 1 thresholds on the Wright map, one for each transition between the categories. (b) Each item threshold gives the ability location that a student must obtain to have a 50% chance of success at the associated scoring category or above, compared to the categories below. (The units of this scale are in logits which are the logarithm of the odds—see Textbox 5.1 for ways to interpret logit units.) For example, suppose a fictitious “Item A” has three possible score categories (0, 1, and 2): in this case, there will be two thresholds. This is illustrated on the right-hand side of Figure  1.10. Suppose that the first threshold has a value of

Respondents

Item-Thresholds 1-

0- Item A,Threshold 1/2

X

FIGURE 1.10 

-1- Item A,Threshold 0/1

Sketch of a Wright map.

28  The BEAR Assessment System

−1.0 logits: this means that a student at that same location, −1.0 logits (shown as the “X” on the left-hand side in Figure 1.10), has an equal chance of scoring in category 0 compared to the categories above (categories 1 and 2). If their ability is lower than the threshold value (−1.0 logits), then they have a higher probability of scoring in category 0; if their ability is higher than −1.0, then they have a higher probability of scoring in either category 1 or 2 (than 0). These thresholds are, by definition, ordered: in the given example, the second threshold value must be greater than −1.0—as shown, it is at 0.0 logits. Items may have more than two thresholds or just one threshold (e.g., dichotomous items such as traditional multiple-choice items). The locations of the MoV item thresholds are graphically summarized in the Wright map in Figure 1.11, simultaneously showing estimates for both the students and items on the same (logit) scale. Moving across the columns from left to right on Figure 1.11, one can see the following. (a) The logit scale. (b) A histogram (on its side) of the respondents’ estimated locations, including the number of respondents represented by each bar of the histogram. (c) The location of students at each raw score (labeled Total Score). (d) A  set of columns, one for each waypoint on the construct map, with the labels for each waypoint printed at the bottom (such as “MoV1”) showing the thresholds4 for each item. The interpretation for the threshold locations shown here is explained in detail in Section 5.3. It is sufficient for now for the reader to note how the thresholds increase “up” the logit scale as the waypoints move from low to high. (e) A column indicating the bands for each of the waypoints in the construct map. Note that the bands for MoV2 and MoV3 have been combined in the Wright map—see the second paragraph in the following for a discussion of this. (f) The logit scale (again, for convenience). Note also that the legend for the item labels is shown at the bottom: so, for example, Item 9 is named as “Model2” (i.e., the second question in the Model task) in the legend. What is needed now is to relate these item thresholds back to the waypoints in the MoV construct map (i.e., as illustrated in Figure 1.4). A standard-setting procedure called construct mapping (Draney & Wilson, 2010–2011) was used to develop empirical boundaries between the sets of thresholds for each waypoint and thus create interpretative “bands” on the Wright map. The bands are indicated by the horizontal lines across Figure 1.11. The bands indicate that the thresholds fall quite consistently into ordered sets, with a few exceptions, specifically the following thresholds: NL(ii) for Soil, MoV1 for Piano4, and MoV4 for Model3.

9.4

MoV5

4

4 8.3

MoV4 22

2

Logits

3.3

6.3

5.3 8.2 7.2 7.1

5.2 1.2

4.1 1.1 5.1 8.1

9.1

2.1

6.1

3.1

3.2

2.2

6.2 4.2

MoV2&3

9.2 2.3

MoV1

1.3

0

NL(ii)

-2

NL(i)

NL(ii)

1: Piano4

2: Building4

The Wright Map for MoV.

3: Building2

MoV1

4: Rock

5: Piano2

MoV2&3

6: Model3

7: Soil

MoV4

8: Model1

9: Model2

MoV5

The BEAR Assessment System  29

-2

FIGURE 1.11 

Logits

0

2

9.3

20 19 18 17 16 15 14 13 12 11 10

30  The BEAR Assessment System

In the initial representations of this Wright map, it was found that the thresholds for the waypoints MoV2 and MoV3 were thoroughly mixed together in terms of their locations on the logit scale. A large amount of time was spent on exploring this, both quantitatively (reanalyzing the data) and qualitatively (examining item contents and talking to curriculum developers and teachers about the apparent anomaly). The conclusion was that for these two waypoints, although there is certainly a necessary hierarchy to their lower ends (i.e., there is little hope for a student to successfully use a chance-based device such as a spinner to represent a source of variability [MoV3] if they cannot informally describe such a source [MoV2]), these two waypoints can and do overlap quite a bit in the classroom context, and hence it makes sense that they overlap on the logit scale. Students are still improving on MoV2 when they are initially starting on MoV3, and they continue to improve on both at about the same time. Hence, at least formally, while it was decided to uphold the distinction between Mov2 and MoV3 in terms of content, it also seemed best to (a) maintain the conceptual distinction between MoV2 and MoV3, but to label the segment of the scale (i.e., the relevant band) as “MoV2&3.” Thus, this is an example of how results expressed in the Wright map can modify the construct map (the MoV construct map ended up with a combined waypoint and was modified as in Figure 1.12), and also the outcome space (in this case, the consequence of the combination is that the two waypoints MoV2 and MoV3 will now be scored as 2). In addition, although the details are not shown here, the items and scoring guides for the three “off-band” thresholds

FIGURE 1.12  The Revised MoV construct map.
MoV5: Account for variability among different runs of model simulations to judge adequacy of model.
MoV4: Develop emergent models of variability.
MoV3: Use a chance device to represent a source of variability or the total variability of the system.
MoV2: Informally describe the contribution of one or more sources of variability to variability observed in the system.
MoV1: Identify sources of variability.

noted earlier were examined and considered for revision: this is an example of how the Wright map results can also affect the items design. Of course, it was also the case that the MoV outcome space was modified directly from the observed results by adding the NL(i) and NL(ii) response categories.

As noted earlier, the conceptualization of a specific construct map starts off as an idea mainly focused on the content of the underlying construct and relates to any extant literature and other content-related materials concerning that construct. But eventually, after iterating through the cycle of the four building blocks, it will incorporate both practical knowledge of how items are created and designed, and empirical information relating to the behavior of individual items (discussed in detail in Chapters 3 and 4) as well as to the behavior of the set of items as a whole (as represented in the results, especially the Wright map).

The interpretive bands on the Wright map in Figure 1.11 can thus be used as a means of labeling estimated student locations with respect to the construct map, in particular for the waypoints NL(i) to MoV4. For example, a student estimated to be at 1.0 logits can be seen to be at approximately the middle of the MoV2&3 band. Thus, they could be interpreted as being at the point of most actively learning (specifically, succeeding at the relevant points approximately 50% of the time) within the construct map waypoints MoV2 and MoV3. That is, they are able to informally describe the contribution of one or more sources of variability to the observed variability in the system, while at the same time being able to develop a chance device (such as a spinner) to represent that relationship. The same student would be expected to succeed more consistently (approximately 75% of the time) at MoV1 (i.e., identifying sources of variability), and much less often (approximately 25% of the time) at MoV4 (i.e., developing an emergent model of variability). Calculation of these probabilities depends on the distance between the item and student locations in the logit metric, which is explained in Section 5.2—see especially Textbox 5.1.
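
The relationship between logit distance and success probability that underlies these 50%, 75%, and 25% figures can be sketched in a few lines of code. The following minimal illustration (not taken from BASS) treats a single, dichotomous threshold under a Rasch-type model; the threshold locations used here are hypothetical values chosen so that the distances from the student come out at roughly +1.1, 0.0, and −1.1 logits.

# A minimal sketch of how success probability depends on the distance between a
# student location (theta) and a threshold location (delta) under a Rasch-type
# model for a single threshold. Illustrative only; see Section 5.2 and Textbox 5.1.
import math

def p_success(theta, delta):
    """Probability of success when the student is at theta and the threshold at delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

student = 1.0  # a student in the middle of the MoV2&3 band (logits)
for label, delta in [("MoV1-type threshold", -0.1),
                     ("MoV2&3-type threshold", 1.0),
                     ("MoV4-type threshold", 2.1)]:
    # With these hypothetical threshold locations, the distances are roughly
    # +1.1, 0.0, and -1.1 logits, giving probabilities near 75%, 50%, and 25%.
    print(f"{label}: distance = {student - delta:+.1f} logits, "
          f"P(success) = {p_success(student, delta):.2f}")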

1.6.2  Return to the Discussion of Causation and Inference

What are the directions of causation in the BAS?

Looking back to the discussion about the relationship between causation and inference, previously illustrated in Figure 1.8, we can now elaborate that diagram by plugging the four building blocks into their appropriate places (see Figure 1.13). In this figure, the arrow of causality goes directly from the construct to the item responses—it does not go through the outcome space or the Wright map because (presumably) the construct would have “caused” the responses whether or not the measurer had constructed a scoring guide and measurement model. This sometimes puzzles people, but it clearly displays the distinction between the latent causal link and the manifest inferential link. The initial vague

FIGURE 1.13  The “four building blocks” showing the directions of causality and inference. (Causality runs from the Construct Map to the Items Design; inference runs back from the Items Design through the Outcome Space and the Wright Map to the Construct Map.)

link (as in Figure 1.8) has been replaced in Figure 1.13 by a causal link, and, in particular, the (undifferentiated) inference link in Figure 1.8 has been populated by two important practical tools that we use in measuring the construct: the outcome space and the Wright map.

1.7  Reporting the Results to the Measurer and Other Users

How can the results from the theoretical and empirical development be used to help measurers interpret the measurements?

There are numerous ways to report the measurements using the approach described in this chapter, and many are available within the report generation options in the BASS application. In this introductory chapter, only one will be featured, but more will be shown in later chapters (especially Chapters 6–8). A report called the “Group Proficiency Report” was generated in BASS for a classroom set of students from the MoV data set and is shown in Figure 1.14. In this graph, the MoV scale runs horizontally from left to right (i.e., lowest locations on the left). The bands representing the waypoints are shown as vertical bars in different shades of gray (blue in the online version), with the labels (NL(i) to MoV5) in the top row and the logit estimates of the boundaries between them are indicated in the second row (and also at the bottom). Below that, each individual student is represented in a row, with a black dot showing their estimated location, and an indication of variation around that given by the “wings” on either side of

FIGURE 1.14  A Group Proficiency Report for a set of students on the MoV construct. (The waypoint bands NL(i), NL(ii), MoV1, MoV2&3, MoV4, and MoV5 are shown with band boundaries at −1.30, −0.20, 0.50, 1.69, and 4.10 logits.)

each dot. Looking at the dots, one can see that, for this class, the students range broadly across the three waypoints NL(ii), MoV1, and MoV2&3. There are also two outliers, #13549 and #13555, which are below and above these three core waypoints, respectively (students’ names are avoided for privacy). Thus, a strategy might be envisaged where the teacher would plan activities suitable for students at these three levels of sophistication in understanding variation, and would also plan to talk individually with the two outliers to see what is best for each of them. The exact meaning and interpretation of the wings will be left for Section 5.4 to explore in detail, but here we note that there is a group of students (#13553 to #13530) who are straddling MoV1 and MoV2&3, and this will need to be considered also. These results are also available in a tabular format that can be viewed directly in BASS and/or exported into a grading application. Other reports available in BASS (see Appendix B) are (a) the Individual Proficiency Report (analogous to the Group Proficiency Report, except generated for an individual student—see Section 5.4), (b) a report of each student’s raw (unscored) responses to each item, (c) a report of the score for each student’s responses to each item, and (d) a report on how consistently each student responded, given their estimated location, which may help teachers understand individual responses to the items (see Section 6.2.2).
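
For a sense of how such a report could drive simple automated groupings, here is an illustrative sketch (not a feature of BASS) that assigns an estimated location to one of the waypoint bands using the band boundaries shown in Figure 1.14; the function name and the decision rule at the boundaries are assumptions made for the example.

# An illustrative helper (not a BASS feature) that assigns a student's estimated
# location to a waypoint band, using the band boundaries shown in Figure 1.14.
from bisect import bisect_right

BOUNDARIES = [-1.30, -0.20, 0.50, 1.69, 4.10]          # logits, from Figure 1.14
BANDS = ["NL(i)", "NL(ii)", "MoV1", "MoV2&3", "MoV4", "MoV5"]

def band_for(theta):
    """Return the waypoint band containing the estimated location theta (in logits)."""
    return BANDS[bisect_right(BOUNDARIES, theta)]

# For example, the student discussed above, estimated at 1.0 logits, falls in MoV2&3:
print(band_for(1.0))   # MoV2&3
print(band_for(-1.5))  # NL(i)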

1.8  Using the Four Building Blocks to Develop an Instrument

How can the four building blocks function as a guide for instrument construction?

The account so far, although illustrated using the ADM example, has been quite abstract. The reader should not be alarmed by this, as the next four chapters are devoted, in turn, to each of the four building blocks, and will provide more detail and many examples of each, across a broad range of contexts and subject matters. The purpose of this introductory chapter has been simply to orient the reader to what is to come. Another purpose of this chapter is to get the reader thinking and learning about the practical process of instrument development. If the reader does indeed want to learn to develop instruments, then it should be obvious that he or she should be prepared to read through this section and carry out the exercises and class projects that are described in the chapters that follow. However, even if practical experience in developing instruments is not the aim of the reader, this section, and later sections like it, should still be studied carefully, and the exercises carried out fully. The reason for this is that learning about measurement without developing an instrument leaves the reader in a very incomplete state of knowledge—it is a bit like trying to learn about bike riding, soufflé cooking,


or juggling, by reading about it in a book, or watching a video, without actually trying it out. We can all appreciate that an essential part of the knowing in these situations is the doing, and the same is true of measurement: it is sufficiently complicated, and inherently based on the requirement to balance multiple competing optimality considerations, that a genuine appreciation of how its principles operate can only be gained through practice. The exercises at the end of each chapter are intended to be a path toward this knowledge-in-practice—it can be demanding to actually carry out some of the exercises, and will certainly take more time than just reading the book, but carrying out these exercises will hopefully bring a sense of satisfaction in its own right and enrich the reader’s appreciation of the complexity of measurement.

The four building blocks provide not only a path for inference about a construct but can also be used as a guide to the construction of an instrument to measure that construct. The next four chapters are organized according to a development cycle based on the four building blocks—see Figure 1.15. It starts with defining the idea of the construct as embodied in the construct map (Chapter 2), then moves on to developing tasks and contexts that engage the construct, the items design (Chapter 3). These items generate responses that are then categorized and scored—that is the outcome space (Chapter 4). The statistical model is applied to analyze the scored responses (Chapter 5), and these results can then be used, via the Wright map, to reflect back on the success with which one has measured the construct—bringing one back to the construct map (Chapter 2). In essence, this sequence of building blocks is a cycle, one that may need to be repeated several times. Chapters 6–8 help with this appraisal process by focusing on gathering evidence about how the instrument works: on model fit, reliability evidence, and validity evidence, respectively.

FIGURE 1.15  The instrument development cycle through the four building blocks. (The cycle runs from the Construct Map, to the Items, to the Item Scores, to the Estimates and Interpretations, and back to the Construct Map.)

As the measurer starts down the path of developing the instrument, they will need to gather some resources in order to get started. Even before they start developing the construct map, the topic of Chapter 2, they should initialize two sorts of resources that will provide continuing support throughout the whole exercise: a literature review and a set of informants.

Literature Review: Every new instrument (or, equally, the redevelopment or adaptation of an old instrument) must start with an idea—the kernel of the instrument, the “what” of the “what does it measure?” and the “how” of “how will the measurements be used?” When this is first being considered, it makes a great deal of sense to look broadly to establish a dense background of knowledge about the content and the uses of the instrument. As with any new development, one important step is to investigate (a) the theories behind the construct and (b) what has been done in the past to measure this particular content—that is, what have been the characteristics of the instrumentation that was used. The materials in this latter category may not be available in the usual places that researcher/developers look for the relevant literature. Often, published documents provide only very few details about how previous instruments have been developed, especially any steps that did not work out (this is the measurement equivalent of the “file-drawer problem” in meta-analysis). It may require contacting the authors of the previous instruments to uncover materials such as these: the measurer is encouraged to try that out, as it has been my experience that many, if not most, instrument developers will welcome such contacts. Thus, a literature review is necessary, and should be reasonably close to completion before going too far with other steps (say, alongside Chapter 2, but before commencing the activities discussed in Chapter 3). However, a literature review will necessarily be limited to the insights of those who have previously worked in this area, so other steps will also have to be taken.

Informants: Right at the beginning, the measurer needs to recruit a small set of informants to help with instrument design. This should include (a) some potential respondents, where appropriate, who should be chosen to span the usual range of respondents. Other members would include (b) professionals, teachers/academics, and researchers in the relevant areas, as well as (c) people knowledgeable about measurement in general and/or measurement in the specific area of interest, and (d) people who are knowledgeable and reflective about the area of interest and/or measurement in that area, such as administrators and policymakers. This group (which may change somewhat in nature over the course of the instrument development) will, at this point, be helpful to the measurer by discussing their experiences in the relevant area, by criticizing and expanding on the measurer’s initial ideas, by serving as “guinea pigs” in responding to older instruments in the area, and by responding to initial items and item designs. The information from the informants should overlap that from the literature review but may also contradict it in parts.


1.9 Resources

History of the Four Building Blocks/BEAR Assessment System Approach: For a very influential perspective on the idea of a construct, see the seminal article by Messick (1989) referenced earlier. The conceptual link between the construct and the Wright map was made explicit in two books by Benjamin Wright, which are also seminal for the approach taken in this book: Wright and Stone (1979) and Wright and Masters (1981). The origin of the term “Wright map” is discussed in Wilson (2017). The idea of the construct map was introduced in Wilson and Sloane (2000).

Similar Ideas: The idea of the evidence-centered design approach to assessment is quite parallel—an integrative account is given in Mislevy et al. (2003). A closely related approach is termed “Developmental Assessment” by Geoff Masters and his colleagues at the Australian Council for Educational Research—examples are given in DEETYA (1997) and Masters and Forster (1996). This is also the basis of the historical approach taken by the OECD’s PISA project (OECD, 1999), where the equivalent of the construct map is referred to as the “described variable.” The BAS can be seen as falling into the category of principled assessment design (PAD); this general approach, as well as several other examples, is summarized in Ferrara et al. (2016) and Wilson and Tan (2022).

Aspects and Applications of the Four Building Blocks/BEAR Assessment System Approach: The BEAR Assessment System (Wilson & Sloane, 2000), which is based on the four building blocks, has been used in other contexts besides the ADM assessment example given earlier, which was originally published in Lehrer et al. (2014). Other publications about the ADM context are Schwartz et al. (2017) and Wilson and Lehrer (2021). There are many publications about aspects of the BAS giving examples of construct maps and the BAS across both achievement and attitude domains. A list of them is given in Appendix C.

1.10  Exercises and Activities

1. Explain what your instrument will be used for, and why existing instruments will not suffice.
2. Read about the theoretical background to your construct. Write a summary of the relevant theory (keep it relatively brief, no more than five pages).
3. Investigate previous efforts to develop and use instruments with a similar purpose and ones with related but different purposes. In many areas there are compendia of such efforts—for example, in the areas of psychological and educational testing, there are series like the Mental Measurements Yearbook (Carlson et al., 2021)—similar publications exist in many other areas. Write a summary of the alternatives that are found, outlining the main points, perhaps in a table (again, keep it brief, no more than five pages).


4. Brainstorm possible informants for your instrument construction. Contact several potential informants and discuss your plans with them—secure the agreement of some of them to help you out as you make progress.
5. Try to think through the steps outlined earlier in the context of developing your instrument, and write down notes about your plans, including a draft timetable. Try to predict problems that you might encounter as you carry out these steps.
6. Share your plans and progress with others who are engaged in similar efforts (or who have already done so)—discuss what you and they are succeeding on, and what problems have arisen.
7. Read through Appendix B about the BAS Software (BASS). Make sure you can access the BASS website. Look around on the website and explore the resources and materials there.
8. Log into BASS, and explore the Examples included. Choose one you are interested in and look through the screens under the Construct, Items, and Scoring Guide tabs, and explore the results under the Analysis and Reports tabs.
9. If you haven’t already chosen the MoV Example, repeat #8 for that one, and compare what you find with the tables, figures, and results reported in this chapter.

Notes

1 “BEAR” stands for the Berkeley Evaluation and Assessment Research center located in the Berkeley School of Education at the University of California, Berkeley.
2 A more up-to-date terminology used in the physical sciences involves measurement uncertainty, which includes multiple aspects (Mari et al., 2023). But we will use the more traditional social science labels in this book to avoid confusion (see Chapters 7 and 8).
3 These include unidimensional item response models (Rasch models, 2PL and 3PL IRT models) as well as unidimensional factor analysis models. Note that no items have as yet been introduced, so concepts like monotonicity and independence are not yet relevant to the conceptualization at this point.
4 The notation for the item thresholds is as follows. The location of the threshold for each item is represented by a pair of symbols, “i.k,” where “i” indicates the item number and “k” specifies the item score, so that, for example, “9.2” is the second threshold location for Item 9—that is, the threshold between scores 0–1 and scores 2–4 (4 being the maximum score for that item).

PART II

The Four Building Blocks

2 CONSTRUCT MAPS

But, as Bacon has well pointed out, truth is more likely to come out of error, if this is clear and definite, than out of confusion, and my experience teaches me that it is better to hold a well-understood and intelligible opinion, even if it should turn out to be wrong, than to be content with a muddle-headed mixture of conflicting views, sometimes miscalled impartiality, and often no better than no opinion at all. —Sir William Maddock Bayliss (1915)

2.1  The Construct Map

What are the essential aspects of construct maps that I need to know about?

The construct map is the first building block in the BEAR Assessment System (BAS). It has already been introduced briefly in Chapter 1, and its relationship to the other building blocks was also illustrated—see Figure 2.1. In this chapter, it is the main focus. The idea of a construct that will be described in this chapter is particularly suitable for a visual representation called a construct map. The following are its most important features:

(a) There is a coherent and substantive definition for the content of the construct.
(b) There is an idea that the construct is composed of an underlying continuum—this can be manifested in two ways: in terms of the respondents and/or in terms of item responses.


FIGURE 2.1  The Construct Map, the first building block in the BEAR Assessment System (BAS).

A generic construct map is shown in Figure 2.2—the generic variable being measured is labeled “X” for this figure. The depiction shown here will be used throughout this book, so a few lines will be used to describe its parts before moving on to examine some concrete examples. The arrow running up and down the middle of the map indicates the continuum of the construct, running from “low” to “high.” The left-hand side will indicate qualitatively distinct groups of respondents, each occupying a waypoint, and ranging from those with high “X” to those with low “X.” A respondent construct map would include only the left side. The right-hand side will indicate qualitative differences in item responses, each occupying a waypoint, and ranging from responses that indicate high “X” to those that indicate low “X.” An item response construct map would include only the right side. A full construct map will have both sides represented. The two different aspects of the construct, the respondents and their responses, lead to two different sorts of construct maps:

(a) A respondent construct map, where the respondents are ordered from more to less (on the construct)—and qualitatively may be grouped into an ordered succession of waypoints
(b) An item response construct map, where the item responses are ordered from more to less (on the construct)—and qualitatively may also be grouped into an ordered succession of waypoints

Also:

(c) A full construct map, which consists of both respondent and item locations (which will most often be shortened to just “construct map”)

FIGURE 2.2  A generic construct map in construct “X.” (The left-hand column orders respondents from those with very high “X” down to those with very low “X”; the right-hand column orders responses to items from those that indicate very high “X” down to those that indicate very low “X.”)

Of course, words like “construct” and “map” have many other usages in other contexts, but they will be reserved in this book for just the purposes described earlier. Note that this figure depicts an idea rather than being a technical representation. Indeed, later this idea will be related to a specific technical representation, but for now, just concentrate on the idea. Certain features of the construct map concept are worth pointing out. As before, there are waypoints marked off by the circles on the line running up and down the middle. But now we will consider the line itself—this is crucially important, as this is where the individual respondents and individual items are located. In theory, respondents and items can be located anywhere on the line—some will be on top of waypoints, others will scatter around the waypoints.

1. In general, there is no a priori limit on the density of the potential locations on the construct continuum2 that could be occupied by a particular respondent. This corresponds to the idea that no matter where a respondent is on the continuum, there could be another respondent arbitrarily close just above and/or just below that respondent. An example could be where a respondent was responding to a survey one week, and then responding to it again one week later but was distracted during the second week by a loud leaf-blower outside the room, leading to a small fluctuation in their response. Of course, one might expect that there will be limitations of accuracy in identifying that location, caused by limitations of data, but that is another matter (see Section 5.4).
2. Similarly, there is no a priori limit on the density of the potential locations on the construct continuum that could be occupied by a particular item. For example, this corresponds to the idea that no matter where an item is on the continuum, there could be another (probably similar) item arbitrarily close just above and/or just below that item. An example could be where a single word was replaced within an item with a synonym that had approximately the same reading difficulty. And, of course, issues of estimation and error will be involved here too.
3. The labels of individual items on the construct map are actually labels of item responses. It is important to keep in mind that the locations of the labels are not the locations of items per se but are really the locations of certain types of responses to the items. Thus, effectively the items’ locations will be estimated via the respondents’ reactions to them.
4. In any specific context, just a few waypoints will be identified along the construct map, and these can be expressed in terms of typical respondents or typical item responses for a certain waypoint, or both. Thus, the waypoints can be seen as relating to groups of respondents and classes of item responses.

Examples of constructs that can be mapped in this way are common. In attitude surveys, for example, there is always something that the respondent is agreeing to or liking or some other action denoting an ordering; in educational achievement testing, there is most often an underlying idea of increasing correctness, of sophistication or excellence; in marketing, there are always some products that are more


attractive or satisfying than others; in political science, there are some candidates who are more likely to be voted for than others; in health outcomes research, there are better health outcomes and worse health outcomes. In almost any domain, there are important contexts where the type of construct that can be mapped (as defined earlier) needs to be measured. The next section of this chapter contains numerous examples of construct maps to help convey the range of possibilities.

As should be clear from the earlier description, a construct can be most readily expressed as a construct map where the intended construct has a single underlying continuum—and this implies that, for the intended use of the instrument, the measurer wants to spread the respondents from high to low, or left to right, etc., in some context where that distribution is substantively interpretable.3 This makes good sense as a basis for instrument construction (see also Section 9.1). There are several ways in which the idea of a construct map can exist in the more complex reality of usage—a construct is always an ideal; we use it because it suits our theoretical approach and/or our practical aims. If the theoretical approach is inconsistent with the idea of mapping a construct onto an ordered continuum, then it is hardly sensible to want to use a construct map as the fundamental structural approach. There can be several situations that are not expressly suited for construct map interpretation, though some can be adapted to do so. Examples of such situations are given in Section 2.4.

2.2  Examples of Construct Maps

What are some interesting and instructive examples of construct maps?

The idea of a construct map is very natural in the context of educational testing, such as Example 1 (MoV in Data Modeling) in the previous chapter. It is also just as amenable to use in many other domains. For example, in attitude measurement, one often finds that the underlying idea is one of increasing or decreasing amounts of a human attitudinal attribute, and that attribute (i.e., the construct) might be satisfaction, liking, agreement, etc. In the following sections, multiple examples of construct maps are described, across a wide range of topics in the social sciences and professional practice, as well as beyond. These examples also illustrate some different ways that construct maps can be conceptualized, represented, and developed. Just as Example 1 will be used in several places in the following chapters, several of these Examples will also be returned to later, to help illustrate applications of the BAS. The Examples begin with several that involve single construct maps (Examples 1–4), and then there are two that involve multiple construct maps (Examples 5 and 6). The Examples that follow those illustrate different aspects of how construct maps appear in different situations: one where the construct map was derived after the items were developed (Example 7); one in the context of interviewing respondents (Example 8); one in the context of observations (Example 9); and one in the context of an educational curriculum (Example 10).


2.2.1 Example 1: The Models of Variability (MoV) Construct in the Data Modeling Curriculum

What is an example of a construct map in the area of mathematics?

The MoV construct map has already been discussed in Chapter 1. A representation of the respondent construct map for MoV is shown in Figure 2.3. Thus, it needs no further explanation here. However, to see its place in a larger learning progression, see Section 2.2.5.

FIGURE 2.3  A sketch of the construct map for the MoV construct of the Data Modeling instrument. (Respondents are ordered by increasing sophistication in understanding modeling of variability, from those who identify sources of variability (MoV1) up to those who account for variability among different runs of model simulations to judge adequacy of model (MoV5).)

2.2.2 Example 2: A Social and Emotional Learning Example (RIS: The Researcher Identity Scale)

What is an example of a construct map in the area of socio-emotional learning?

The San Francisco Health Investigators (SFHI) Project (Koo et al., 2021; Wilson et al., 2022) developed the Researcher Identity Scale (RIS), with application focused on students at the high school level. The developers considered researcher identity as “one unified idea made up of four strands: Agency, Community, Fit and Aspiration, and Self” (Koo et al., p. 5).4 The construct map for this construct is shown in Figure 2.4. The hypotheses represented by the construct map can be thought of as having a lowest extreme waypoint (0), which is at the bottom, below the other waypoints—here the student is not aware of what research entails and has no consideration for their possible role(s) in such research. Going up to the next waypoint (1), the student is considered a novice to the idea of research. At waypoint 2, the student is in the process of exploring different aspects of research. Beyond that, at waypoint 3, the student starts to be comfortable with their identity as a researcher. At the highest extreme (4), the student self-identifies as a researcher and integrates this into their larger identity.

FIGURE 2.4  The construct map for the RIS.
Waypoint 4, Integration of Identity: Student identifies as a researcher and integrates this into their larger self (example responses: Agency – “I can do research that benefits people”; Fit and Aspiration – “I plan to get a research related degree in college”).
Waypoint 3, Comfortable with Identity: Student begins to feel comfortable with their identity as a researcher (Agency – “I can discuss research ideas with my peers”; Self – “I am beginning to consider myself a researcher”).
Waypoint 2, Role Exploration: Student explores the different aspects of research (Community – “I am making a contribution to a research group”; Fit and Aspiration – “I would like to do research”).
Waypoint 1, Curious Identity: Student is a newcomer to the concept of research (Community – “I am a member of a research community”; Self – “I can do research tasks with help from others”).
Waypoint 0, Absent: Student is unaware of what research entails and has not considered their own role in research (Self – “I think research is boring”).

Note that Figure 2.4 is formatted in a somewhat different way than the earlier figures showing construct maps. In particular, it is shown in a “table” format that does not so clearly make reference to the idea of an underlying continuum of possible locations. This format is convenient on the printed page, but can lead to confusion, where the reader thinks that the construct map is no more than a sequence of categories.5 A second variant is indicated in the right-hand column, where items rather than item responses are shown. Again, this can lead to confusion, where the waypoints are seen as being associated with items rather than item responses. Nevertheless, the figure does indicate that, at least in broad conception, this was conceptualized as a full construct map by the SFHI project. More information about measurement using the RIS construct is given in Chapters 4 and 8.

2.2.3 Example 3: An Attitude Example (GEB: General Ecological Behavior)

What is an example of a construct map that was developed by first developing the items and then the waypoints?

The General Ecological Behavior (GEB) scale (Kaiser, 1998; Kaiser & Wilson, 2004) is an instrument meant to measure environmental attitude (see, e.g., Kaiser et  al., 2010). It is based on self-reports of past environmentally protective behavior. The full set of items in the GEB is shown in Appendix D. In this case, the items were developed before the waypoints—the GEB is based on the Campbell paradigm for developing and interpreting attitudinal measurements (Kaiser & Wilson, 2019), whereby people disclose their attitude levels—their valuations of the attitude object (e.g., environmental protection) or their commitment to the attitude-implied behavioral goal (e.g., protecting the environment)—in the extent to which they engage in verbal and nonverbal behaviors that involve increasing levels of behavioral costs. (Kaiser & Lange, 2021, emphases added) All behaviors involve costs, and so that is true for behaviors aimed at environmental protection. Thus, the extent of a person’s environmental attitude can be inferred from the environmentally protective activities they engage in (e.g., when people publicly boycott companies with a poor ecological record, buy products in refillable packages, wash dirty clothes without prewashing and/or ride a bicycle, walk or take public transportation to work or school) or engaged in the past. These activities can be regarded as the behavioral means necessary to pursue a specific goal (e.g., protecting the environment). Thus, the more committed the people are to protecting the environment, the more they engage in protecting the environment (Kaiser, 2021). These behavioral costs can be minor and/or common in the community in which the person lives, or they can be large and/or much less common in their community: [P]eople generally favor more convenient and socially accepted over more demanding, socially prohibited, or otherwise costly environmentally protective behaviors. . . . Engagement in a specific behavior involves costs in terms of time, money, effort, courage, inconvenience, etcetera. Such costs have a chance of being endured by people only when these people’s levels of environmental attitude at least match the costs. . . . Consequently, people who engage in a


comparatively demanding environmentally protective behavior (e.g., became members in environmental organizations) reveal higher levels of environmental attitude than people who fail to engage in that behavior. (Kaiser & Lange, 2021)

The GEB is an instrument that has been developed without the a priori creation of a construct map. However, consistent with the Campbell paradigm, one can seek to generate an item ordering in terms of waypoints that is quite consistent with the concept of a construct map. In fact, a construct map can be hypothesized in an a posteriori way in this context. Following a review of the item set, the ordering of the GEB items shown in Figure 2.5 was postulated,6 where only the positively oriented items have been used to make the interpretation easier. These subsets can then be shown as in the construct map in Figure 2.6.7 The options for some items (where appropriate) were just “Yes” or “No”; the options for most items were “Never,” “Seldom,” “Occasionally,” “Often,” and “Always.”8

Environmentally committed
1. I buy domestically grown wooden furniture.
2. I contribute financially to environmental organizations.
3. I drive on freeways at speeds under 100 kph (= 62.5 mph).
4. I am a member of an environmental organization.
5. I boycott companies with an unecological background.

Environmentally active
6. I am a vegetarian.
7. I am a member of a carpool.
8. I buy meat and produce with eco-labels.
9. I buy products in refillable packages.
10. I own solar panels.
13. I have a contract for renewable energy with my energy provider.
15. I buy beverages and other liquids in returnable bottles.
16. I have pointed out unecological behavior to someone.
18. I talk with friends about environmental pollution, climate change, and/or energy consumption.
19. I read about environmental issues.
20. I refrain from owning a car.
22. I own a fuel-efficient automobile (less than 6 L per 100 km).
23. I have looked into the pros and cons of having a private source of solar power.

Engages in moderately pro-environmental activities that involve extra efforts
29. I collect and recycle used paper.
30. I drive in such a way as to keep my fuel consumption as low as possible.
31. I own an energy-efficient dishwasher (efficiency class A+ or better).
32. I buy seasonal produce.
35. In nearby areas (around 30 km; around 20 miles), I use public transportation or ride a bike.
36. In winter, I turn down the heat when I leave my apartment/house for more than 4 h.
37. I wash dirty clothes without prewashing.
39. I bring empty bottles to a recycling bin.

Engages in common and easy pro-environmental activities
45. I shower (rather than take a bath).
47. I wait until I have a full load before doing my laundry.
48. After a picnic, I leave the place as clean as it was originally.
49. I ride a bicycle, walk, or take public transportation to work or school.
50. I reuse my shopping bags.

FIGURE 2.5  The GEB items arrayed into four consecutive sets.


FIGURE 2.6  Sketch of a construct map in general ecological behavior.

According to Florian Kaiser: [W]e can recognize behaviors that are “distinct (and explicit) expressions of commitment to environmental protection” (A), behaviors that represent “active (but only indirectly connected with) commitment to environmental


protection” (B), behaviors that stand for “weak commitment to environmental protection” (D)—what everybody does in Germany to actively protect the environment, and behaviors that we regard as the absolute essentials “(if even these two would be avoided) there would be an absolute lack of commitment to actively protect the environment” (E). (C) is the messy middle ground of behaviors. (F.G. Kaiser, personal communication)

Note that, in this construct map, no waypoint is included for C (the “messy middle”), as, in this formulation, it has not been identified as a specific point, but rather as a location somewhere between waypoints B and D.

The development steps for the GEB were similar to the design process used in developing the Program for International Student Assessment (PISA) tests. There it is referred to as developing “described scales” (OECD, 2015, p. 265 ff.). This multistep process may be briefly summarized as follows:

(a) A committee of curriculum experts in the relevant subject matter develops a set of items that they judge to be relevant to the curriculum area to be tested.
(b) A survey is conducted to try out the items, so that they can be arrayed from empirically easiest to most difficult (a brief code sketch of this step is given at the end of this discussion).
(c) Another committee of experts examines the ordered set of items, splitting them into interpretable classes of items that order the content, and then labels/describes those classes.

As noted earlier, the interesting difference here is in the order of development: in the described scale approach, the items are developed and calibrated first, and the construct is “described” later; whereas in the construct mapping procedure, the construct (map) is established first, and the items and the outcome space are developed based on that. Of course, in either case there are overlaps: for example, (a) the items developed in the described variable approach could be seen as being generated in view of an implicit construct definition (latent in the educational curricula in the PISA case); and (b) the items developed in the construct map case may, via BAS iterations, have an influence on the final version of the construct map.
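
The following is a rough sketch, under simplifying assumptions, of the empirical ordering in step (b): arraying tried-out items from easiest to most difficult so that an expert committee can band them. A real calibration would fit a Rasch-family model (for example, in BASS or comparable software); here the proportion endorsing each item, and its logit, serve only as a crude stand-in, and the response data and item labels are invented for illustration.

# A rough sketch (under simplifying assumptions) of step (b) above: ordering tried-out
# items from empirically easiest to most difficult so that an expert committee can band
# them. A real calibration would fit a Rasch-family model; here the proportion endorsing
# each item, and its logit, serve as a crude stand-in. Data and labels are invented.
import math

# Rows = respondents, columns = items; 1 = endorses / "yes", 0 = does not.
responses = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
item_labels = ["reuse bags", "recycle paper", "public transport", "eco-labels", "boycott"]

def empirical_ordering(responses, labels):
    n = len(responses)
    summary = []
    for j, label in enumerate(labels):
        p = sum(row[j] for row in responses) / n          # proportion endorsing
        p_adj = min(max(p, 1 / (2 * n)), 1 - 1 / (2 * n)) # keep the logit finite
        difficulty = -math.log(p_adj / (1 - p_adj))       # higher = harder to endorse
        summary.append((difficulty, label, p))
    return sorted(summary)                                 # easiest first

for difficulty, label, p in empirical_ordering(responses, item_labels):
    print(f"{label:16s} endorsed by {p:.0%}, rough difficulty {difficulty:+.2f} logits")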

2.2.4 Example 4: A 21st Century Skills Example (LPS Argumentation)

What is an example of a construct map in the area of 21st Century Skills?

Traditional achievement testing in education has been augmented in the last 20 years with a range of new types of achievement constructs called “habits of mind” and “21st Century Skills” (e.g., Griffin et al., 2012). These have proven amenable to description using construct maps, especially following the “learning


progression” concept. The Learning Progressions in Science (LPS) project built on a view of scientific argumentation as a complex competency of reasoning utilized in situations that require scientific content knowledge to construct and/or critique proposed links between claims and evidence (Morell et al., 2017; Osborne et al., 2016). The learning progression (hypothesized as a complex construct map progression) draws on Toulmin’s (1958) model for the structure of practical or informal arguments. This model starts with a claim, which is a “conclusion whose merits we are seeking to establish” (p. 90) and which must then be supported by relevant data or evidence. The relation between the evidence and the claim is provided by a warrant (i.e., reasoning) that forms the substance of the justification for the claim. These warrants can in turn be dependent on (implicit or explicit) assumptions that are referred to as backing. In addition, claims may also be circumscribed by the use of qualifiers that define the limits of validity for the claim.

Figure 2.7 shows the hypothesized learning progression for argumentation. In it, there are three broad waypoints of argumentation differentiated by intrinsic cognitive load, where each waypoint is conceived as being distinguished by having more connections between the claims and the various pieces of evidence. The initial waypoints (shown at the bottom of the figure) are labelled as “Notions” (indicated by “0”) to signify that assessment items relating to this waypoint do not require explicit connections between claim and evidence. At this waypoint, the connections (i.e., the warrants, according to Toulmin) are not specifically required, and thus success is possible by demonstrating identification/critique of an isolated claim, warrant, or evidence without making a logical connection between them. Put differently, these items require zero degrees of coordination. Going higher on the construct map (i.e., higher in Figure 2.7) to the next set of waypoints (indicated by “1”), one will find items that require the construction of relationships between claims and evidence (i.e., warrants). These require only one degree of coordination—that is, in responding, a student would need

FIGURE 2.7  The Argumentation construct map. (Waypoints, from high to low: 3. Counter-critiques and Comparative Arguments; 2. Complete Arguments; 1. Claims and Evidence; 0. Notions. Each waypoint comprises several sublevels described in terms of respondent performances such as constructing or identifying claims, evidence, warrants, complete arguments, counter-critiques, and comparative arguments.)


to make one explicit logical connection between claim and evidence by way of a warrant. Thus, here a student must not only be able to identify a claim or a piece of evidence, but must also know how to construct or critique a relationship between claims and evidence. The highest waypoints of the construct map require two or more degrees of coordination: these items, at the highest waypoint (“3”), involve students in explicating or comparing two or more warrants. More detail is available in Osborne et al. (2016), and concrete examples of how the various progress levels are operationalized with assessment items are included in the Supplementary Materials for that paper.

As before, the presentation format of the construct map has been adapted for this project: (a) the usual orientation (from bottom to top) has been reversed, and (b) the waypoints are formatted as a table. The question of whether this is a respondent construct map or an item construct map is left somewhat ambiguous by the project—the labels of the sublevels are explicitly expressed in terms of students’ cognitions, but the textual descriptions included (which are summarized earlier) are given in terms of item responses. More information about measurement using the LPS Argumentation construct is given in Chapter 3.

2.2.5 Example 5: The Six Constructs in the Data Modeling Curriculum

What is an example of a construct map in a topic that has multiple strands (dimensions)?

In the example used in Chapter 1, the construct map produced by the Data Modeling project for the Models of Variability (MoV) construct (see Figure 2.3) was used to illustrate the central ideas and some of the ways that a construct map can be used in developing an instrument. In this section, the account of the work of the Data Modeling project is broadened to include the rest of the six constructs that it deployed, and deepened by describing a hypothesized learning progression (see Chapter 10) based on the six Data Modeling constructs (plus a seventh, which is based on the substantive topic in which the data modeling takes place). An illustration of multiple construct maps is given in Figure 2.8. In this representation, only three constructs are illustrated, but the idea is readily generalized. The three construct maps are shown as mapping out (in analogy to latitude and longitude) the somewhat less specific ideas about students’ thinking (as illustrated by the succession of “clouds”) that might be envisaged by the researcher/developer (illustrated by the figure in the lower left-hand corner). The Data Modeling learning progression was developed in a series of classroom design studies, first conducted by the designers of the progression (e.g., Lehrer, 2017; Lehrer et al., 2007; Lehrer & Kim, 2009; Lehrer et al., 2011) and


FIGURE 2.8 Three constructs represented as three strands spanning a theory of learning.

subsequently elaborated by teachers who had not participated in the initial iterations of the design (e.g., Tapee et al., 2019). Six constructs were generated that delineate the desired conceptual changes in data modeling thinking and practices, and which could be supported by students’ participation in instructional activities (Lehrer et al., 2014). These constructs were developed as part of the design studies just mentioned—these studies together gave evidence for common patterns of conceptual growth as students learned about data modeling in particular substantive contexts, ranging from repeated measurements in classroom contexts, to manufacturing production (e.g., different methods for making packages of toothpicks), to organismic growth (e.g., measures of plant growth). Conceptual pivots to promote change were structured and instantiated into a curriculum, most especially inducting students into statistical practices of visualizing. The curriculum includes rationales for particular tasks, tools, and activity structures, guides for conducting mathematically productive classroom conversations, and a series of formative assessments that teachers could deploy to support learning.


The six constructs are described in the next few paragraphs. First, recall that Models of Variability (MoV) was already described in Chapter 1, so that will not be repeated here.

Visualizing Data: Two of the six constructs have waypoints that track students’ progress toward the ways of thinking that typically emerge as students begin to learn the practices of visualizing data. The first is Data Display (DaD), which describes conceptions of data that inform how students construct and interpret representations of data. These conceptions range along a dimension that starts with students’ interpreting data through the lens of individual cases and ends with students’ viewing data as statistical distributions. DaD is closely associated with a second construct, Meta-Representational Competence (MRC), which incorporates the waypoints of understanding as students learn to design and adjust data representations to illustrate how the data support the claims that they want to make. A crucial step here is where students begin to consider trade-offs among potential representations with respect to particular claims.

Conceptions of Statistics: A third construct, Conceptions of Statistics (CoS), describes how students’ understandings of statistics change when they have multiple opportunities to invent and critique indicators of characteristics of a distribution, such as its center and spread. Initially, students begin by thinking of statistics as the result of computations, not as indicators of features of a distribution. The invention and critique of measures of distribution is viewed as another conceptual pivot. The upper anchor of this construct entails recognition of statistics as subject to sample-to-sample variation.

Conceptions of Chance: Chance (Cha) describes the progress in students’ understanding about how probability operates to produce distributions of outcomes. Students’ beginning forms of understanding are intuitive and rely on conceptions of agency (e.g., “lucky numbers”). Later they begin to understand the idea of a trial, which means that students must abandon the idea of personal influence on selected outcomes, and which leads the way to a profound perspectival sea change that now frames chance as being associated with a long-term process, a necessity for a frequentist view of probability (Thompson et al., 2007). Further experiences eventually lead to a new kind of distribution, that of a sampling distribution of sample statistics, reflecting the upper anchor of the CoS construct. The CoS and Cha constructs are related in that the upper anchor of CoS relies on conceptions of sample-to-sample variation attributed to chance.

Informal Inference: The sixth and final construct, Informal Inference (InI), describes important changes in students’ reasoning about inference. The term “informal” is intended to convey that the students are not expected to reach conceptions of probability density and related statistical ideas that would be typical of professionals. Rather, students are engaged in generalizing or making predictions beyond the specific data at hand. The beginner waypoints describe types of inferences that are based on personal beliefs and experiences (in many cases here, data do not


play a role, other than perhaps as confirmation of what the student believes). At the upper anchor, students can conceptualize a “sample” as an empirical sample that is just one instance of a possibly infinite collection of samples generated by a long-term, repeated process (Saldanha & Thompson, 2014). Informal inference is then based upon this understanding of sample, which is a keystone of the professional practice of inference (Garfield et al., 2015). More information about measurement using the Data Modeling constructs is given in Chapter 9.

2.2.6 Example 6: A Process Measurement Example—Collaborative Problem-Solving (CPS)

What is an example of a construct map in the area of process measurement?

A new collaborative problem-solving (CPS) process framework (Awwal et  al., 2021) has been developed as part of the ATC21S project (Griffin et  al., 2012; Griffin & Care, 2015; Care et al., 2017). This framework proposed a new view of CPS as a unified construct composed of an interaction of levels of sophistication between collaboration and problem-solving. The waypoints in the first strand of a multistrand framework are shown in Figure 2.9. In this construct, the collaborative problem-solvers will begin their problem-solving efforts by Exploring both the social space and the problem space inherent in the materials that they are given. The lowest waypoint is Focus, where they independently engage with and assess their own resources. At waypoint B, they share with one another what they have found and thus Contribute to the group understanding of the available resources. With this accomplished, at the third waypoint, they each can take advantage of the other’s resources and thus Benefit from them. This is then complemented by questioning about one another’s resources, so that they come to depend on one another. Finally, at the highest waypoint (metacognitive), they jointly examine and discuss the available resources. The full CPS process is seen as being composed of five strands (i.e., the columns in Figure 2.10). Thus, when a group is collaborating proficiently, according to the framework, the collaborative problem-solvers will begin by Exploring (first column, as already described) both the social space and the problem space. They will then define (second column) the problem with respect to their joint resources, thus constructing a shared understanding of the problem. Together, they will then come up with a plan (third column) and implement the plan (fourth column) followed by evaluating their progress and reflecting and monitoring their outcomes, considering alternative hypotheses (fifth column). Within that, the collaborators will carry out these five processes with differing levels of sophistication and success (i.e., the rows of the five columns). In practice, these processes would be iterative—with collaborators returning to previous process steps and correcting errors and omissions.


FIGURE 2.9  Exploring: the first strand in the CPS Process Framework.
A. Focus (Independent): Engages with own resources.
B. Contribute: Gives own resources/information to others; describes own resources to others.
C. Benefit: Takes and uses others’ resources; responds to others’ questions.
D. Depend: Asks others questions; asks others about their resources.
E. Metacognitive: Examines shared resources.

Measurement of a process, such as is described in the CPS Process Framework, demands a form of observation that is focused on the processes of the collaboration. In order to facilitate this, the ATC21S project team developed a set of interactive computerized game-like tasks that afforded group members a range of virtual activities designed to help them reach a successful conclusion. Logfiles of each collaborator’s moves in the games were recorded and used to develop a process-based coding and scoring of their moves (Awwal et al., 2021). More information about the CPS example, including examples of items, the outcome space, and a Wright map, is given in Section 10.2.
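
To give a flavor of what turning logfile events into strand-level codes might involve, here is a purely hypothetical sketch. The real coding and scoring scheme used by the ATC21S team is described in Awwal et al. (2021); nothing below (the event names, the mapping to the Exploring-strand levels, or the “highest level evidenced” rule) should be read as that scheme.

# A purely hypothetical sketch of the kind of process-based coding described above.
# The actual ATC21S coding and scoring scheme is given in Awwal et al. (2021); the
# event names, the mapping to the Exploring-strand levels (A-E), and the scoring rule
# here are all invented for illustration.
EXPLORING_CODE = {                # logged action -> Exploring level (cf. Figure 2.9)
    "uses_own_resource": "A",     # engages with own resources
    "shares_resource": "B",       # gives/describes own resources to others
    "uses_partner_resource": "C", # takes and uses others' resources
    "asks_about_resource": "D",   # asks others about their resources
    "discusses_shared_pool": "E", # jointly examines shared resources
}
LEVEL_SCORE = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}

def score_exploring(logfile_events):
    """Assign each collaborator the highest Exploring level evidenced in their log."""
    best = {}
    for student, action in logfile_events:
        level = EXPLORING_CODE.get(action)
        if level is None:
            continue  # action not relevant to this strand
        if LEVEL_SCORE[level] > LEVEL_SCORE.get(best.get(student), -1):
            best[student] = level
    return best

events = [("S1", "uses_own_resource"), ("S1", "shares_resource"),
          ("S2", "asks_about_resource"), ("S2", "uses_partner_resource")]
print(score_exploring(events))  # {'S1': 'B', 'S2': 'D'}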

FIGURE 2.10  The Complete CPS Process Framework. (Columns: the five process strands, 1. Exploring, 2. Defining, 3. Planning, 4. Implementing, and 5. Evaluating and reflecting. Rows: the collaboration proficiency levels, A. Focus (Independent), B. Contribute, C. Benefit, D. Depend, and E. Metacognitive. Each cell briefly describes the collaborative behavior characteristic of that level within that strand.)

2.2.7 Example 7: A Health Assessment Example (PF-10: Physical Functioning 10)

What is an example of a construct map in the topic area of health sciences?

A health sciences example of a self-report behavioral construct that can be mapped in this way is the Physical Functioning subscale (PF-10; Raczek et al., 1998) of the SF-36 health survey (Ware & Gandek, 1998). The SF-36 instrument is used to assess generic health status, and the PF-10 subscale assesses the physical functioning aspect of that. The items of the PF-10 consist of descriptions of various types of physical activities, to which the respondent may respond that they are "limited a lot," "a little," or "not at all." The ten items in this instrument are given in Table 2.1. An initial construct map for the PF-10, developed using an informal version of the "described scale" procedure discussed earlier (and based on empirical item difficulties from earlier studies—Raczek et al., 1998), is shown in Figure 2.11. In this case, the succession of increasing ease of physical functioning was indicated by the order of the item responses. This sequence ranges from

TABLE 2.1 Items in the PF-10

Item number | Item label | Item
 1          | Bath       | Bathing or dressing yourself
 2          | WalkOne    | Walking one block
 3          | OneStair   | Climbing one flight of stairs
 4          | Lift       | Lifting or carrying groceries
 5          | WalkBlks   | Walking several blocks
 6          | ModAct     | Moderate activities, such as moving a table, pushing a vacuum cleaner, bowling, or playing golf
 7          | Bend       | Bending, kneeling, or stooping
 8          | WalkMile   | Walking more than a mile
 9          | SevStair   | Climbing several flights of stairs
10          | VigAct     | Vigorous activities, such as running, lifting heavy objects, participating in strenuous sports
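For readers who want to work with the PF-10 computationally, the following is a minimal sketch (the variable names are ours) of the item set in Table 2.1 together with its ordered response options; no scoring model is implied here, since scoring is taken up in Chapters 5 through 8.

```python
# A minimal sketch (variable names are ours) of the PF-10 item set from
# Table 2.1 and its ordered response options. No scoring model is implied.
PF10_ITEMS = {
    1: ("Bath", "Bathing or dressing yourself"),
    2: ("WalkOne", "Walking one block"),
    3: ("OneStair", "Climbing one flight of stairs"),
    4: ("Lift", "Lifting or carrying groceries"),
    5: ("WalkBlks", "Walking several blocks"),
    6: ("ModAct", "Moderate activities, such as moving a table, pushing a "
                  "vacuum cleaner, bowling, or playing golf"),
    7: ("Bend", "Bending, kneeling, or stooping"),
    8: ("WalkMile", "Walking more than a mile"),
    9: ("SevStair", "Climbing several flights of stairs"),
    10: ("VigAct", "Vigorous activities, such as running, lifting heavy "
                   "objects, participating in strenuous sports"),
}

# The response options, ordered from most to least limitation.
RESPONSE_OPTIONS = ["Yes, limited a lot", "Yes, limited a little",
                    "No, not limited at all"]

for number, (label, text) in PF10_ITEMS.items():
    print(f"{number:2d} {label:9s} {text}")
```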

FIGURE 2.11 A sketch of the construct map for the Physical Functioning subscale (PF-10) of the SF-36 health survey. Respondents are arrayed on the left side of the map; responses to items are arrayed on the right, ordered in the direction of increasing ease of physical functioning:

"Not limited at all" to Vigorous Activities
"Not limited at all" to Moderate Activities
"Not limited at all" to Easy Activities


very strenuous activities, such as those represented by the label "Vigorous Activities," down to activities that take little physical effort for most people. Note that the order shown here is based on the relative difficulty of self-reporting that the respondent's activities are "not limited at all." Note also that, in this case, the top waypoint ("Vigorous Activities") is associated with just one summarizing item, whereas the other two are associated with several each—this makes sense, as the focus of the survey is on people with some health concerns, who would not be expected to be very involved in those vigorous activities. More information about measurement using the PF-10 construct is given in Chapters 5–8.

2.2.8 Example 8: An Interview Example (CUE: Conceptual Underpinnings of Evolution)

What is an example of a construct map in an area where interviews are the main source of information?

Using a cognitive interview technique, the Conceptual Underpinnings of Evolution (CUE) project investigated elementary students' abilities to reason in biology, specifically focusing on microevolution for second- and third-grade students (i.e., 7–9 years old) (Metz et al., 2019). The curriculum incorporated modules about animals and plants and their natural ecology and diversity. Students were involved in activities that focused on the organisms' habitats, traits and needs, and their structures and functions, including strategically designed thought experiments about environmental stresses. The project developed a learning progression focused on this grade-span, but it also included deeper understandings that come into play at subsequent grade levels. The progression aimed to build students' understandings of the following important concepts: between-species and within-species variation, structure/function, limiting factors, survival advantage, and change over time.

The project assessed students' placement along the learning progression using a structured interview protocol. This was developed from the literature and from prior classroom-based research in these domains, which was also the source of the instructional modules that were developed for this project. Each question in the interview was related to certain waypoints within the learning progression, and quotations from students' responses that exemplified the learning progression levels were systematically documented. The learning progression went through several iterations, and based on both qualitative and quantitative data, a final version was developed (shown in Figure 2.12; Cardace et al., 2021). This learning progression consists of two construct maps: one for "Fit" and one for "Process." The Fit strand focuses on


FIGURE 2.12 The final CUE learning progression. Source: From Metz et al. (2019)

students' explanations of how organisms fit into their environment (labeled as "F" on the left-hand side in the figure). Here, students' explanations advance from exclusively considering organisms' needs (waypoint F2), to observing the affordances of their physical characteristics, to initial explanations about structure–function–environment relationships (F3 and F4); ultimately, they incorporate natural selection as an explanation for the organism's good fit in its environment (F4+P7). The Process strand (labeled as "P" on the right-hand side in the


figure) focuses on the mechanism by which organisms become well-fitted to their environments. For this, it is crucial to identify variability in traits (P3 and P4), and also the survival value that may be attached to those traits (P5A and P5B). As they proceed along the progression, students' explanations incorporate the shifting of distributions between generations (P6) and ultimately include natural selection as the specific mechanism for the attainment of good fit (F4+P7). Both strands are described in more detail in publications about this project (Cardace et al., 2021; Metz et al., 2019).

2.2.9 Example 9: An Observational Instrument—Early Childhood (DRDP)

What is an example of a construct map in a topic area where structured observations are the main source of information?

The California Department of Education (CDE) has developed the Desired Results Developmental Profile⁹ (DRDP¹⁰), an observation-based formative child assessment system used in early care and education programs throughout California. The DRDP is designed to be used across the full developmental continuum for children from birth through kindergarten, and the observations are organized as "measures" across several important domains of child development. For example, at the kindergarten level, there are 55 measures across 11 domains or subdomains. The representation of each measure is designed as a combination of the construct map and the guidelines for the item itself: for example, the preschool measure "Identity of Self in Relation to Others" (SED1) is illustrated in Figure 2.13. This is a view of the scoring guide that early childhood teachers see when they log in to rate their students on this measure. The lowest waypoint is at the left-hand side at the top: Child . . . "Responds in basic ways to others." The highest in this view is on the right at the top: Child . . . "Compares own preferences or feelings to those of others." Each waypoint also includes several exemplars of what a teacher might see from children in their early childhood classroom at that waypoint. These brief examples are supplemented by a video exemplar library structured by waypoints, as well as training materials and online and face-to-face teacher professional development workshops.

Some commentators have criticized the use of observations as the basis for consequential measurement in educational settings. The project has published reports supporting its usage, and the case has been much debated. One evaluator has noted that educators should prefer to invest in training teachers to be better observers and more reliable assessors than to spend those resources training and paying for outside assessors to administer on-demand tasks to young children in unfamiliar contexts that will

FIGURE 2.13 A view of the DRDP measure Identity of Self in Relation to Others (SED1). (https://www.desiredresults.us/sites/default/files/docs/resources/vid_examples/SED1.jpg; Copyright 2013–2019 California Department of Education. All rights reserved.)

The rating form is headed "Developmental Domain: SED — Social and Emotional Development; SED 1: Identity of Self in Relation to Others; Child shows increasing awareness of self as distinct from and also related to others," and it asks the teacher to mark the latest developmental level the child has mastered. The waypoints span four developmental levels (Responding, Exploring, Building, Integrating) and range from "Responds in basic ways to others," through "Uses senses to explore self and others," "Recognizes self and familiar people," "Communicates own name and names of familiar people," "Expresses simple ideas about self and connection to others," "Describes self or others based on physical characteristics," and "Describes own preferences or feelings; and describes the feelings or desires of family members, friends, or other familiar people," up to "Compares own preferences or feelings to those of others." Each waypoint is accompanied by possible examples of child behavior, and the form also lets the teacher indicate that the child is emerging to the next developmental level or that the measure could not be rated due to extended absence.

provide data with the added measurement error inherent in assessing young children from diverse backgrounds (Atkins-Burnett, 2007). In comparison to other construct maps, this one is clearly represented in a different way—this orientation is what the teachers found most useful. Given that this instrument is used for the assessment of every student in California in publicly funded preschools, this represents the largest use of construct maps so far. More information about measurement using the DRDP constructs is given in Chapters 3 and 8. Research reports are available at https://www.desiredresults.us/research-summaries-drdp-2015-domain.

2.2.10 Example 10: The Issues Evidence and You (IEY) Science Assessment

What is an example of a construct map for a science assessment in the topic area of "Evidence and Trade-offs"?

This example is an assessment system built for a middle school science curriculum, Issues, Evidence and You (SEPUP, 1995). The Science Education for Public Understanding Project (SEPUP) at the Lawrence Hall of Science was awarded a grant from the National Science Foundation in 1993 to create year-long issues-oriented science courses for the middle school and junior high grades. In issues-oriented science, students learn science content and procedures, but they are also required to recognize scientific evidence and weigh it against other community concerns with the goal of making informed choices about relevant contemporary issues or problems. The goal of this approach is the development of an understanding of the science and problem-solving approaches related to social issues without promoting an advocacy position.

The course developers were interested in trying new approaches to assessment in the IEY course materials for at least two reasons. First, they wanted to reinforce the problem-solving and decision-making aspects of the course—to teachers and to students. Traditional "fact-based" chapter tests would not reinforce these aspects and, if included as the only form of assessment, could direct the primary focus of instruction away from the course objectives the developers thought were most important. Second, the developers knew that in order to market their end product, they would need to address questions about students' achievement in this new course, and traditional assessment techniques were not likely to demonstrate students' performance in the key objectives (problem-solving and decision-making).

Both the IEY curriculum and its assessment system (which, like the Data Modeling example, uses the BEAR Assessment System (Wilson & Sloane, 2000) as its foundation) are built on four constructs. The Understanding Concepts construct


is the IEY version of the traditional "science content." The Designing and Conducting Investigations construct is the IEY version of the traditional "science process." The Evidence and Trade-offs construct was a relatively new one in science education at the time the curriculum was developed and is composed of the skills and knowledge that would allow one to evaluate, debate, and discuss a scientific report such as an environmental impact statement and make real-world decisions using that information. The Communicating Scientific Information construct is composed of the communication skills that would be necessary as part of that discussion and debate process. The four constructs are seen as four dimensions on which students will make progress during the curriculum and are the target of every instructional activity and assessment in the curriculum. The dimensions are positively related, because they all relate to "science," but are educationally distinct.

The Evidence and Trade-offs (ET) construct was split into two parts (called "elements") to help relate it to the curriculum. An initial idea of the Using Evidence element of the ET construct was built up by considering how a student might increase in sophistication as they progressed through the curriculum. A sketch of the construct map for this case is shown in Figure 2.14: on the right side of the continuum is a description of how the students are responding to the ET items.

2.3 Using Construct Mapping to Help Develop an Instrument

How can the concept of construct mapping help in the development of an instrument?

The central idea in using the construct mapping concept at the initial stage of instrument development is for the measurer to focus on the essential features of what it is that is to be measured—in what way does an individual show more of it, and less of it? It may be expressed as from “higher to lower,” “agree to disagree,” “weaker to stronger,” or “more often to less often”—the particular wording will depend on the context. But the important idea is that there is an order for the waypoints inherent in the construct—and underlying that there is a continuum running from more to less—that is what allows it to be thought of as a construct map. A tactic that can help is the following: (a) Think first of the extremes of that continuum (say “novice” and “expert,” or in the context of an attitude toward something, “loathes” to “loves”). (b) Make the extremes concrete through descriptions. (c) Develop some intermediate waypoints between the two extremes. It will be helpful also to start thinking of typical responses that respondents at each level would give to first drafts of items (more of this in the next chapter).
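As a concrete (and entirely invented) illustration of this tactic, the sketch below records a working draft of a construct map for a hypothetical attitude construct, with the extremes written first and intermediate waypoints filled in between them; none of this comes from the book's examples.

```python
# An invented working draft of a construct map, following the tactic above:
# (a) extremes first, (b) made concrete through descriptions, (c) intermediate
# waypoints added between them. The construct and wording are hypothetical.
construct_map_draft = {
    "construct": "Attitude toward recycling (draft)",
    "waypoints": [  # ordered from lowest to highest
        ("Loathes", "Actively avoids recycling and dismisses reasons for it"),
        ("Indifferent", "Recycles only when it requires no extra effort"),
        ("Favorable", "Recycles routinely and endorses reasons for doing so"),
        ("Loves", "Organizes recycling for others and advocates for it"),
    ],
    # typical responses that respondents at a waypoint might give to draft items
    "typical_responses": {
        "Indifferent": "I'll put a bottle in the bin if it's right there.",
    },
}

for level, (label, description) in enumerate(construct_map_draft["waypoints"]):
    print(level, label, "-", description)
```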


FIGURE 2.14 A sketch of the construct map for the Using Evidence construct. Students are arrayed on the left side of the map; responses to items are arrayed on the right, ordered in the direction of increasing sophistication in using evidence:

Response accomplishes lower level AND goes beyond in some significant way, such as questioning or justifying the source, validity, and/or quantity of evidence.
Response provides major objective reasons AND supports each with relevant and accurate evidence.
Response provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
Response provides only subjective reasons (opinions) for choice and/or uses inaccurate or irrelevant evidence from the activity.
No response; illegible response; response offers no reasons AND no evidence to support choice made.

Before this can be done successfully, however, the measurer will often have to engage in a process of "variables clarification," where the construct to be measured is distinguished from other, closely related, constructs. Reasonably often, the measurer will find that there were several constructs lurking under the original idea—the four building blocks method can still be applied by attempting


to measure them one at a time. One informal tactic for disentangling possibly different constructs is to consider respondents who are "high" on one construct and "low" on the other. If such respondents are common and are interestingly different in a theoretical sense, then it is likely that the constructs are interestingly different. However, similar to the adage "correlation is not causation," the opposite does not hold generally. That is, respondents may tend to be similar on two different constructs, but that does not necessarily imply that the constructs are indeed the same.

The ten Examples discussed earlier, which are based on published cases, tend to discuss only the final resolution of this process. However, the CUE example explicitly addresses one successful effort at variable clarification, especially in the publication by Cardace et al. (2021). Unfortunately, editors in the social science literature tend to seek to eliminate such discussions as being superfluous to "scientific advancement" in the discipline. This is, of course, nonsense—in fact, it is essential that such discussions and investigations be made public, so that social science disciplines can advance in their measurement practices. There are more examples of this in the Examples Archive (Appendix A).

In creating a construct map, the measurer must be clear about whether the construct is defined in terms of who is to be measured, the respondents, or what responses they might give, the item responses. Eventually, both will be needed, but often it makes sense in a specific context to start with one rather than the other. For instance, on the one hand, when there is a developmental theory of how individuals increase on the construct, or a theory of how people array themselves between the extremes of an attitude, then probably the respondent side will be developed first. On the other hand, if the construct is mainly defined by a set of items and the responses to those items, then it will probably be easier to start by ordering the item responses. Ultimately, the process will be iterative, with each side informing the other.

2.4 Examples of Other Construct Structures

There are other theoretical situations that involve structures that are not exactly the same as construct maps, though some can be adapted to make the construct map approach useful. One important issue is that one needs to distinguish constructs that are amenable to the use of construct mapping from constructs that are not. Clearly, any construct that is measured using a single score or a code signifying an order for each person will be a candidate for construct mapping. Other possibilities are described in the following paragraphs.

Latent Classes: One major type of construct that is not straightforwardly seen as a candidate for construct mapping is one where there is no underlying continuum, where, for example, there is assumed to be just a set of discrete unordered


categories. This is seen in areas such as cognitive psychology, where one might assume that there are only a few strategies available for solving a particular problem. Latent class analysis (e.g., Vermunt, 2010) is an approach that posits just such a construct and should be used when the measurer seriously wants to use that as the basis for reporting. An example of this could be a case where a student was engaging in problem-solving regarding three-dimensional shapes, and it had been established that there were two typical approaches: say, one based on a geometrical (Euclidean) interpretation and one based on a topological interpretation (i.e., by using qualitative features of the shapes). Then, the aim of a measurement might be to classify the students into the two different classes according to which problem-solving strategy they used. Here the concepts of "more" and "less" do not apply.

Ordered Partitions: However, this situation might be subsumed into a context that is more like a construct map. For example, suppose that the problem-solving described in the previous paragraph was set in a more complex assessment situation where there was a waypoint at a less sophisticated level (e.g., the student did not even understand the problem), and also a waypoint that was more sophisticated (e.g., where the student solved the problem, but also adapted it to a new situation). Here the construct can be seen as an ordered partition of the set of categories, as is implied in Figure 2.15. In this situation, the partial order itself can be used to simplify the problem so that the two latent classes are co-located on the construct map. In this case, there will be a loss of information, but this simplified construct may prove useful, and the extra complications can be added back in later (Wilson & Adams, 1993). The MoV example introduced in the previous chapter is somewhat

FIGURE 2.15 Illustration of a simple ordered partition.


similar. This can be illustrated as a respondent construct map—see Figure 2.3. This figure makes clear that the waypoints MoV2 and MoV3 are different aspects of students' thinking (i.e., they are given on the left-hand side of the figure), but also that they are seen as being co-located on the construct map.

Multiple Strands (Also Called Multiple Dimensions): But there are also constructs that are more complex than what is involved in a single construct map, yet contain construct maps as a component. Probably the most common one would be a construct with multiple "strands"—for example, the six Data Modeling strands mentioned in Example 1 in Chapter 1. In this sort of situation, where a simple construct map would be inadequate to fully match the underlying constructs, a reasonable approach would be to use the construct map approach consecutively for each of the strands. This may well involve a multidimensional statistical model—see Section 9.3.

Multiple Strands but also a Composite: An interesting added complexity in this multiple-strand case is where the measurements are to be made at two grain sizes—within each of the multiple strands and across all the strands—this multilevel measurement context demands a different type of statistical model than what will be used in this book (although it is a topic in the follow-on volume—see Wilson and Gochyyev (2020)). For other examples of more complex structures, see Section 2.5.

2.5 Resources

Many examples of construct maps are given in the references cited in Appendix C. These have been organized thematically, so it may be helpful for the reader to search for topics similar to the one that they are interested in. However, relatively few of them incorporate both the respondent and item response sides of the continuum, so the reader may need to look beyond what they find in the literature. Several situations that are more complex than the straightforward examples of construct maps shown in this chapter are described and discussed in Chapters 9 and 10. If the reader is investigating such a situation, then they should look through these chapters for similar situations. (And, of course, Section 2.4 will also be a guide here.)

2.6 Exercises and Activities

(Following on from the exercises and activities in Chapter 1)

1. Lay out the different constructs involved in the area you have chosen to work in. Clarify the relationships among them and choose one to start concentrating on.
2. For your chosen construct, write down a brief (one to two sentences) definition of the construct. If necessary, write similar definitions of related constructs to help distinguish among them.


3. Describe the different waypoints of the construct—as noted earlier, start with the extremes and then develop qualitatively distinguishable waypoints in between those extremes. Distinguish between waypoints for the respondents and waypoints in potential item responses. Write down the successive waypoints in terms of both aspects, if possible, at this point in the development of the construct map.
4. Open up the BASS application and name your construct map, enter your definition, and add useful background information and links (be sure to save your work).
5. Take your description of the construct (and any other clarifying material) to a selected subset of your informants and ask them to critique it.
6. Try to think through the steps outlined above in the context of developing your instrument and write down notes about your plans for how the responses may be mapped to the waypoints.
7. Share your plans and progress with others. Discuss what you and they are succeeding in, and what problems have arisen.

Notes

1 The interested reader can also find examples of construct maps within each of the Worked Cases in the Examples Archive (Appendix 1) on the website associated with this book.
2 Note that the word "continuum" is not being used here in the mathematical sense (i.e., as the non-denumerable set of real numbers). So far there are no numbers attached to the construct or its construct map (which would be premature). Instead, the term continuum is being used here to signify the idea that the construct is ordered and dense in the way defined in this paragraph. Later, in Chapter 5, this will be made more concrete in that we will find that observations on the continuum can, at least in theory, be thought of as being as dense as the rational numbers, in the sense that these observations will be indicated by ratios of numbers of observations.
3 Some might express this as the construct being "unidimensional," but that form of expression will be reserved in this book for the situation where the construct has been associated with a quantitative representation, which is not yet the case here (but see Chapter 5).
4 For more information on the latent variable, its components, and its construct map, see Bathia et al. (2020).
5 Note that this type of formatting is also associated with the use of the term "level," which has also been found to be associated with a similar confusion.
6 For a German context.
7 Note that the relative distances between the waypoints in Figure 2.5 have not been set by a technical method, though they are roughly equivalent to the differences that are found using the methods described in Chapter 5.
8 These were collapsed into "No" ("Never," "Seldom," "Occasionally") or "Yes" ("Often" and "Always") (Kaiser & Wilson, 2000).
9 Developed in concert with early childhood assessment experts from WestEd and the Berkeley Evaluation and Assessment Research (BEAR) Center at the University of California, Berkeley.
10 Information about DRDP can be found at https://www.desiredresults.us/; recent technical reports are available at https://www.desiredresults.us/research.

3 THE ITEMS DESIGN

Ut sementem feceris ita metes. —["As you sow so will you reap," Latin proverb]

3.1  The Idea of an Item What is the main purpose of an “item”?

Often the first inkling of an item comes in the form of an idea about a way to reveal a particular characteristic (construct) of a respondent. The inkling can be quite informal: a remark in a conversation, the way a student describes what they understand about something, a question that prompts an argument, a particularly pleasing piece of art, a newspaper article, a patient's or client's symptoms. The specific way in which the measurer prompts an informative response from a respondent is crucial to the value of the resulting measurements. In fact, in many, if not most, cases, the construct itself will not be clearly defined until a relatively large set of items has been created and tried out with respondents. Each new situation brings about the possibility of developing new and different sorts of items or of adapting old ones. Across many fields and across many topic areas within those fields, a rich variety of types of items have been developed to deal with many different constructs and situations. We have already seen two very different formats. In Chapter 1 (in Example 1), a constructed response type of item was closely examined: the Data Modeling MoV "Piano width" item. Many examples of responses to that item were included in Table 1.1. The LPS science argumentation assessment (Chapter 2, Example 4) and the CUE interview (Chapter 2, Example 8) were additional


examples of the constructed response type. In Chapter 2, we branched out and examined selected response items as well: the PF-10 health survey (Chapter 2, Example 7) asked the question "Does your health now limit you in these activities?" with respect to a range of physical activities, but restricted the responses to a selected response among "Yes, limited a lot," "Yes, limited a little," and "No, not limited at all." The General Ecological Behavior (GEB) Scale (Chapter 2, Example 3) was another example of the selected response type, as it is an example of the familiar Likert-style item. These are examples of two ends of a range of item formats that stretch from very open-format constructed responses to very closed-format selected responses. In the following section, we present a range of item types that span across these two and beyond (see Figure 3.6). Many other types of items exist (e.g., see Brookhart & Nitko, 2018, for a large assortment from educational assessment), and the measurer should be aware of both the specific types of items that have previously been used in the specific area in which an instrument is being developed and the item types that have been used in other related contexts.

Probably the most common type of item in the experience of most people is the general constructed response item format that is commonly used in school classrooms and many other settings around the world. The format can be expressed orally, in writing (by hand or typing), or in other forms, such as concrete products, active performances, or actions taken in a digital environment. The length of the response can vary from a single number or word to lengthy essays, complex proofs, interviews, extended performances, or multi-part products. The item can be one that is produced extemporaneously by the measurer (say, the teacher, or other professional) or it can be the result of an extensive developmental process. This format is also used outside of educational settings, including workplaces, social settings, and in the everyday conversational interchanges that we all experience. Typical subforms are the essay, the brief demonstration, the work product, and the short-answer format.

In counterpoint, probably the most common type of item in published instruments is the selected response format. Some may think that this is the most common item format, but that is because they are discounting the numerous everyday situations involving constructed response item formats. The multiple-choice item is familiar to almost every educated person and has had an important role in the educational trajectories of many. The selected response (particularly the Likert-type response) format is also very common in attitude scales, and in surveys and questionnaires used in many situations, such as in health, applied psychology, and public policy settings, in business settings such as employee and consumer ratings, and in governmental settings. The responses are most commonly "Strongly Disagree" to "Strongly Agree," but many other response options are also found (as in the PF-10 and GEB examples). It is somewhat paradoxical that the most


commonly experienced format is not the most commonly published format (at least for educational achievement items). As will become clear to the reader as they advance through the next several chapters, the view developed here is that the constructed response format is the more basic format, and the selected response format can be seen as an adapted version of it.

There is another form of item that many respondents are less well aware of: non-prompted items. These are items where the respondent is not directly or explicitly informed about the specifics of the items, and hence they are not actually reacting to prompts per se. For example, shoppers in a supermarket may be observed and their actions recorded in some way, or users of an online software application may have their computer interactions recorded (keystrokes, of course, but perhaps also their eye-gaze information). In these sorts of situations, the measurer should always inform the respondents that they are being observed and data recorded. But beyond that, the issue noted here is that, even though the respondents may know about that in general, they will usually not be aware of the actual "items" that are involved: that is, the specific observational protocols being used. An example of this is where children in a playgroup situation are being observed, and their behaviors are being coded; the DRDP observations are an instance of this (Chapter 2, Example 9). A second type of case is where students are involved in playing a computer game in a small group; the CPS process measurement (Chapter 2, Example 6) is an instance of this. Note that there are degrees of prompting that may occur here. Teachers may strategically place certain play objects into a child's environment in order to observe certain types of play that require such artifacts. The element of "staging" involved will vary from context to context.

The relationship of the item to its construct is its most fundamental relationship.1 Typically, the item is but one of many (often one from an infinite set) that could be used to measure the construct. Paul Ramsden and his colleagues, writing about the assessment of achievement in high school physics, noted:

Educators are interested in how well students understand speed, distance and time, not in what they know about runners or powerboats or people walking along corridors.2 Paradoxically, however, there is no other way of describing and testing understanding than through such specific examples.
(Ramsden et al., 1993, p. 312; footnote added)

Similarly, consider the health measurement (PF-10) example already described. Here the specific questions that are used are clearly neither necessary for defining the construct nor are they sufficient to encompass all the possible meanings of the concept of physical functioning. Thus, the task of the measurer is to choose a finite set of items that does indeed represent the construct in some reasonable


way. As Ramsden hints, this is not the straightforward task one might think it to be on initial consideration. Sometimes a measurer will feel the temptation to seek the "one true task," the "authentic item," the single observation that will supply the mother lode of evidence about the construct. Unfortunately, this misunderstanding, common among beginning measurers, is founded on a failure to fully consider the need to establish sufficient levels of validity and reliability for the instrument. Where one wishes to represent a wide range of contexts in an instrument, it is better to have more items rather than fewer. This is because (a) the instrument can then sample more of the content of a construct and more of the situations where a student's location on the construct might be displayed (see Chapter 8 for more on this), and (b) it can then generate more bits of information about how a respondent stands with respect to the construct, which will yield greater precision (see Chapter 7 for more on this). Both requirements need to be satisfied within the time and cost limitations imposed on the measuring context.

3.2 The Facets of the Items Design

What are the essential components of an item?

The items design is the second building block in the BEAR Assessment System (BAS), and its relationship to the other building blocks is shown in Figure 3.1. The Items Design is the main focus of this chapter. The relationship of this building block to the next, the Outcome Space, is very deep. In fact, they can be seen as the two parts of the definition of an item, as noted in the first paragraph of this chapter; this point will be returned to in the next chapter.

FIGURE 3.1 The Items Design, the second building block in the BEAR Assessment System (BAS). (The figure shows the four building blocks: Construct Map, Items Design, Outcome Space, and Wright Map.)


One way to understand the items design is to see it as a description of the theoretical population of items, called the item universe (Guttman, 1944), along with a procedure for sampling the specific items to be included in the instrument. As such, the instrument is the result of a series of decisions that the measurer has made regarding how to represent the construct or, equivalently, how to stratify the "space" of items and then sample from those strata. Some of those decisions will be principled ones relating to the fundamental definition of the construct and the research background of the construct. Some will be practical, relating to the constraints of administration and usage. Some will be rather arbitrary, being made to keep the item generation task within reasonable limits.

We can think of items as being like gemstones. They can be viewed from a variety of perspectives and/or they can be seen as having different characteristics. In this book, we will refer to the different characteristics of items as facets. Each facet of an item tells us something important and contributes to its overall quality. For example, item type is one facet and could be any of the following: (i) selected response items, (ii) constructed response items, and (iii) items not readily classifiable into either. Another facet might be the readability of the text in the item—this facet would associate each item with a readability value (e.g., using Flesch's Reading Ease formula—Flesch, 1948), though it may also need to include a subset for items that have no readability value (e.g., perhaps for a spoken item).

The facets interact to make the item easier or more difficult and should be considered carefully and used intentionally. For example, an item meant to evaluate understanding will be impacted significantly by the reading level of the item. The item might appear to test high-level understanding, but only appears that way due to complex language, making the item difficulty a function of the text and not the concept being measured. Without a plan to design and hence control such features within the item set, the measurement designer will be subject to unknown effects from the items themselves. For our purposes here, one can distinguish between two types of facets of the universe of items that are useful in describing the item pool: (a) the construct facet; (b) the rest of the facets.

3.2.1 The Construct Facet

What is the most important facet?

The construct facet is the facet that is used to provide criterion-referenced interpretations along the range of the construct, from high to low; it is essential and common to all items. This facet provides interpretational levels within the construct, and hence it is called the construct facet. For example, the construct facet in the Data Modeling MoV construct is provided in Figure 1.12. Thus, the construct facet is essentially the content of the construct map—where an instrument is


developed using a construct map, the construct facet has already been established by that process. However, beyond the specification of the construct map itself, the important issue of how the categories of the item responses will be related to the construct map still needs to be explicated.

Each item can be designed to generate responses that span a certain number of the waypoints on the construct map. Two is the minimum (otherwise the item would not be useful), but beyond that, any number is possible up to the maximum number of waypoints in the construct. With selected response items such as multiple-choice items, this range is limited by the options that are offered. For example, the observational DRDP item shown in Figure 2.13 has been designed to generate a teacher's characterization of a child's behavior at eight levels: from "Responds in basic ways to others" to "Compares own preferences or feelings to those of others." Thus, this item is polytomous (i.e., it generates responses for more than two waypoints). But in practice many selected response items are dichotomous (i.e., they generate responses for just two waypoints), such as typical multiple-choice items in educational testing, where only one option is correct and the others are incorrect. In attitude scales, this distinction is also common: for some instruments using Likert-style options, one might only ask for "Agree" versus "Disagree," but for others a polytomous choice is offered, such as "Strongly Agree," "Agree," "Disagree," and "Strongly Disagree." Although this choice can seem innocuous at the item design stage, especially for selected response items using traditional option sets, it is in fact quite important and will need special attention when we get to the fourth building block in Chapter 5.

An example of how items match to waypoints was given in Chapter 1 for the Data Modeling example item "Piano Width," where categories of responses were matched to four waypoints in the MoV construct map (see Table 1.1). A second example is from the LPS Argumentation Assessment (Chapter 2, Example 4). The construct map for this construct was given in Figure 2.9, and the relevant detail is shown in Figure 3.2. The prompt for a bundle of LPS items is shown in Figure 3.3: the "Ice-to-Water-Vapor" task. Three of the items developed to accompany this prompt are shown in the left-hand panel of Figure 3.3 (items 4C, 5C, and 6C). These are matched to the 0b, 0d, and 1b waypoints shown in Figure 3.2, respectively, where they identify a claim, some evidence, and a warrant.

Each item in an instrument not only plays a specific role in relation to at least two waypoints, but the complete set of items should also correspond meaningfully to the full range of interest within the construct. If some waypoints are of relatively greater interest (high end versus low end), then the items should be distributed in a way that matches that preference. The organization of the item set into an instrument is usually summarized in a representation called a blueprint, or sometimes an instrument "specification." In the case of the construct facet, this will take on a particularly simple format,

The Items Design  77

Waypoint

Constructing Arguments

1b 1a

Constructing a warrant

0d 0c

Identifying evidence Providing evidence

0b 0a 0 FIGURE 3.2 

FIGURE 3.3 

Critiquing Arguments Identifying a warrant

Identifying a claim Constructing a claim

Respondent Description Student identifies the warrant provided by another person. Student constructs an explicit warrant that links their claim to evidence. Student identifies another person’s evidence. Student supports a claim with a piece of evidence. Student identifies another person’s claim. Student states a relevant claim. No evidence of facility with argumentation.

Detail of the Argumentation construct map.

The prompt for the “Ice-to-Water-Vapor” task.

by noting which items relate to each waypoint. Of course, as other facets (as in Section 3.2.2) are added, this will become more complex, eventually forming a multi-way matrix. The most important part of the items design is in how the responses relate to the waypoints. For every item, all the categories of the item’s responses must relate to specific waypoints on the construct map. Faulty design here will make everything else harder to get right. Not being sure of this at the beginning is a normal state of affairs, but the development steps are designed to help the measurement designer to decide about this—including redesigning construct maps and/or items when the construct facets do not match up well with the categories of item responses.3 The fine-tuning of the item will be an iterative


FIGURE 3.4 The two sets of MoV items: CR and SR.

process involving testing the item, making adjustments, and getting feedback from respondents. The final version will show clearer alignment with the construct and related waypoints.

3.2.2 The Secondary Design Facets

What else do we need to specify about an item?

Having specified the construct component, a major task remains—to decide on all the other characteristics that the set of items will need to have. Here the term “secondary facets” is used, as these facets are used to describe an important design aspect of the item set beyond the primary one of the relations of the items to the construct map. These are the facets that are used to establish classes of items to populate the instrument—they are an essential part of the basis for item generation and classification. Many types of secondary facets are possible, but some are more common than others. For example, in the health assessment (PF-10) example, the items are all self-report measures—this represents a decision to use self-report as the only condition for the Response-type facet for this instrument (and hence to find items that people in the target population can easily respond to), and not to use other possibilities, such as giving the respondents actual physical tasks to carry out. It is worth noting that, even if the measurement designer does not specifically design such a set of facets, there still will be facets represented among the items in the item set, such as item difficulty, to name just one.
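One simple way to keep such decisions explicit is to write the facets down as a specification. The sketch below is illustrative only: the facet names and conditions are ours, loosely modeled on the PF-10 decisions just described, and do not reproduce any actual items design.

```python
# An illustrative facet specification (names and conditions are ours), loosely
# modeled on the PF-10 decisions described in the text.
items_design = {
    # The construct facet: the waypoints of the construct map (Section 3.2.1).
    "construct_facet": ["Easy activities", "Moderate activities",
                        "Vigorous activities"],
    # Secondary facets: each facet lists the conditions admitted to the pool.
    "secondary_facets": {
        "response_type": ["self-report"],      # performance tasks excluded by design
        "item_format": ["selected response"],  # e.g., three ordered options
        "language": ["English"],               # an often-unspoken specification
    },
}

# Each item written for the instrument can then be tagged with one condition
# from each facet, which makes gaps in the design easy to spot.
example_item_tags = {
    "WalkBlks": {"response_type": "self-report",
                 "item_format": "selected response",
                 "language": "English"},
}
print(items_design["secondary_facets"]["response_type"])
```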


Looking back to the "Ice-to-Water-Vapor" items shown in Figures 3.3 and 3.4, one can see an illustration of two conditions for the item format facet. As noted earlier, the left-hand panel contains three constructed response items, and the right-hand panel contains three selected response items directly related to those constructed response items. The selected response items are examples of the "select and shift" style of items common in technology-enhanced items (TEIs). In each case, the respondent must select a sentence from Anna's statement and shift it into the correct bin. As noted earlier, the distinction between selected response and constructed response is a fundamental one in items design.

In a broader sense, the distinction between the Response-type facet conditions (SR, CR) was not being used just to sample across types of student response—it was part of the data collection design, where the (almost identical) SR and CR items were part of a study of the empirical differences between written and selected response modes for students. In addition, the Response-type facet had another function within the LPS project—the selected response items were intended to be used in summative testing of students (at the end of the teaching unit, the semester, etc.), whereas the constructed response questions were intended for formative use within active teaching (i.e., as a context/prompt for classroom discussions, for "in-class quizzes" when the teacher wanted to see what the students were thinking, etc.). This is an example of the complexity of the concept of an "item facet": often it is seen as a frame for item sampling, but it can also be a planning tool for deployment of different versions of an instrument for different purposes. In the next section, we will also see this facet used as an item development strategy.

Content Facets: The most common type of design facet that one finds for items is the content facet. Content facets refer to aspects of the subject matter of the instrument. In one sense, the construct map facet is an example of such. More common aspects of the content would be the curriculum content categories in an academic test, or the job skills in a work placement inventory: all depend heavily on the specific topics and contexts of the instrument's usage. One common way to express such content facets in educational applications is through a subject-by-process matrix. An example is given in Figure 3.5, where the rows identify patient conditions, while the columns are important immunology curriculum concepts. The checkmarks show which pairings of patient condition and immunology concept are represented by an item in the test. In the design of this instrument, the developers decided to include just one item in the instrument for each such relevant pairing. This provides a formula for the relative coverage of each topic in the instrument—that is, for this instrument, each patient condition is covered by as many items as there are immunology concepts associated with it in the figure. Equivalently, each immunology concept is covered by as many patient conditions as it is associated with in the figure. Other facets may be used in the columns or rows of such a matrix, such as the waypoints in a construct map, the cognitive levels from Bloom's Taxonomy, etc. (see other examples in Section 4.3).
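The same kind of blueprint can be kept as a small matrix and checked mechanically. In the sketch below, the row and column labels are a subset of those in Figure 3.5, but the pairings marked are invented for illustration; they are not the actual plan of the immunology test.

```python
# A sketch of a subject-by-process blueprint as a matrix. The labels are a
# subset of those in Figure 3.5; the pairings below are invented and are not
# the actual immunology test plan.
topics = ["Structure of the immune system", "Humoral immunity",
          "Cell-mediated immunity"]
conditions = ["DiGeorge syndrome", "asthma", "asplenia"]

# blueprint[condition][topic] == 1 means one item is written for that pairing.
blueprint = {
    "DiGeorge syndrome": {"Structure of the immune system": 1,
                          "Cell-mediated immunity": 1},
    "asthma": {"Humoral immunity": 1},
    "asplenia": {"Structure of the immune system": 1},
}

def coverage(blueprint, topics, conditions):
    """Row and column totals: items per patient condition and per topic."""
    row_totals = {c: sum(blueprint.get(c, {}).values()) for c in conditions}
    col_totals = {t: sum(blueprint.get(c, {}).get(t, 0) for c in conditions)
                  for t in topics}
    return row_totals, col_totals

rows, cols = coverage(blueprint, topics, conditions)
print(rows)                               # items per patient condition
print(cols)                               # items per immunology topic
print("total items:", sum(rows.values()))
```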


FIGURE 3.5 A subject-by-process blueprint for an immunology test.
Source: From Raymond and Grande (2019)

The blueprint is a matrix with patient conditions as rows (x-linked agammaglobulinemia, adenosine deaminase deficiency, DiGeorge syndrome, acquired immune deficiency syndrome, Chediak-Higashi syndrome, leukocyte adhesion deficiency, acute cellular rejection, poison ivy, hereditary angioedema, complement C8 deficiency, systemic lupus erythematosus, Goodpasture's syndrome, post-infectious glomerulonephritis, asplenia, and asthma) and immunology topics as columns (structure of the immune system, the innate immune system, humoral immunity, cell-mediated immunity, hypersensitivity reactions, transplantation, and systemic disorders affecting immune function). Checkmarks in the cells indicate which condition-by-topic pairings are represented by an item, with row and column totals giving the number of questions per condition and per topic (25 questions in total).

The plan for arraying items across the facets can seem to be quite a minor decision and relatively harmless in its implications. But it is an important design consideration, influencing many aspects of the instrument's performance.

Other Types of Facets: There are many other potential facets for the items design, and typically, design decisions will be made by the measurer to include some


and not others. Sometimes these decisions will be made on the basis of practical constraints on the instrument usage. For example, such considerations were partly responsible for the design of the PF-10 (Chapter 2, Example 7) when, early on in the SF-36 development process, it was deemed too time-consuming to have patients carry out actual physical functioning tasks. Sometimes such decisions are made on the basis of historical precedents (also partly responsible for the PF-10 design—it is based on research on an earlier, larger instrument). And sometimes they are made on a practical basis, because the realized item pool must have a finite set of facets, while the potential pool has an infinite set of facets.

Also note that, although these are couched as decisions about the instrument, they are not entirely neutral to the idea of the construct itself. While the underlying PF-10 construct might be thought of as encompassing many different manifestations of physical functioning, the decision to use only a self-report facet restricts the actual interpretation of the instrument (a) away from items that could look beyond self-report, such as performance tasks, and (b) to items that are easy to self-report. Recall that the purpose of the item is to help determine the person's location along the construct of interest.

Any list of facets used as a blueprint for a specific instrument design plan must necessarily have somewhat arbitrary limitations in the degree of detail in the specifications—for example, none of the specifications mentioned so far include "they must be in English," yet this is indeed a feature of all of the mentioned instruments. One of the most important ideas behind the items design is to decrease this incidence of unspoken specifications by explicitly adopting a description of the item pool quite early in instrument development. This initial items design will likely be modified during the instrument development process, but that does not diminish the importance of having an items design from very early in the instrument development. The generation of at least a tentative items design should be one of the first steps (if not the first step) in item generation. Items constructed before a tentative items design is developed should primarily be seen as part of the process of developing the items design itself. Generally speaking, it is going to be easier to develop items from your design than to go backward and figure out a design based on a set of items. The items design can (and probably will) be revised, but having one in the first place makes the resulting item set much more likely to be coherent.

3.3 Different Types of Item Responses

How open-ended versus closed form should an item be?

One important way that different item formats can be characterized is by their different amounts of pre-specification—that is, by the degree to which the results from the use of the instrument are developed before the instrument is administered to a


respondent. For example, open-ended essays are less pre-specified than true/false items. Of course, when more is pre-specified beforehand, less has to be done after the response has been made. Contrariwise, when there is little pre-specified (i.e., little is fixed before the response is made), then more has to occur afterward in order to match the response categories to the construct map. This idea will be used as the basis for the matrix-like classification of item types shown in Figure 3.6, which summarizes the whole story.

This typology is not only a way to classify the response style of items that exist in research and practice. Its real strength lies in its nature as a guiding principle for the item development process. I would argue that every instrument should go through a set of developmental stages that will approximate the columns in Figure 3.6 through to the desired level. Instrument development efforts that seek to skip some of these stages run the risk of having to make more or less arbitrary decisions about item design at some point in the development process. For example, deciding to create a Likert-type attitude scale without first investigating the responses that people would choose to make to open-ended prompts will leave the instrument with no defense against the criticism that the pre-chosen Likert-style response format has distorted the measurement. The same sort of criticism holds for traditional multiple-choice achievement test items.

Item Format | Intent to measure construct | Description of item components: General | Description of item components: Specific | Specific items (no score guide) | Score guide | Responses
Participant Observations | X | After | After | After | After | After
Topics Guide (a) General | Before | X | After | After | After | After
Topics Guide (b) Specific | Before | Before | X | After | After | After
Open-ended | Before | Before | Before | X | After | After
Scoring Guide | Before | Before | Before | Before | X | After
Fixed Response | Before | Before | Before | Before | Before | X

Note: "X" marks the column which encompasses the main activity at each level of Item Format.

FIGURE 3.6 Levels of pre-specification for different item formats.
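To make the logic of Figure 3.6 concrete, here is a minimal sketch (in Python; the names STAGES, ITEM_FORMATS, and pre_specification are invented for this illustration and are not part of the BASS software) that encodes the matrix and reports, for a chosen item format, which stages are fixed before administration and which still have to be carried out afterward.

```python
# Illustrative encoding of the Figure 3.6 matrix; not part of any measurement package.

STAGES = [
    "intent to measure construct",
    "description of item components (general)",
    "description of item components (specific)",
    "specific items (no score guide)",
    "score guide",
    "responses",
]

# For each item format, the index of the stage marked "X" (the main activity).
ITEM_FORMATS = {
    "participant observations": 0,
    "topics guide (general)": 1,
    "topics guide (specific)": 2,
    "open-ended": 3,
    "scoring guide": 4,
    "fixed response": 5,
}

def pre_specification(item_format):
    """Split the stages into those fixed before administration and those left for after."""
    x = ITEM_FORMATS[item_format]
    return STAGES[:x], STAGES[x + 1:]

before, after = pre_specification("open-ended")
print("Fixed before administration:", before)
print("Still to be done afterward: ", after)
```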


3.3.1  Participant Observation What is the least structured form of item?

The item format with the lowest possible level of pre-specification would be one where the measurer had not yet formulated any of the item format issues discussed earlier, or even, perhaps, the nature of the construct itself, the ultimate aim of the instrument. What remains is the intent to observe. This type of very diffuse instrumentation is exemplified by the participant observation technique (e.g., Emerson et al., 2007; Ball, 1985; Spradley, 2016) common in qualitative studies. This is closely related to the phenomenological interview technique or "informal conversational interview," as described by Patton (1980):

[T]he researcher has no presuppositions about what of importance may be learned by talking to people. . . . The phenomenological interviewer wants to maintain maximum flexibility to be able to pursue information in whatever direction appears to be appropriate, depending on the information that emerges from observing a particular setting or from talking to one or more individuals in that setting. (pp. 198–199)

Not only is it the case that the measurer (i.e., in this case usually called the "participant observer") may not know a priori the full purpose of the observation, but also "the persons being talked with may not even realize they are being interviewed"4 (Patton, 1980, p. 198). There are also participant observations where the timing and nature of the observation has been planned in advance—for example, observing what a respondent does at a particular point in a process.

The degree of pre-specification of the participant observation item format is located in the first row of the matrix in Figure 3.6, where "X" marks the column which encompasses the main activity at each level of Item Format. This matrix emphasizes the progressive increase in pre-specification as one moves from this initial participant observation level to constructed response formats.

Some may balk at considering a technique like participant observation as an example of an "instrument" and including it in a book on measurement. But this and the technique described in the next paragraph are included here because (a) many of the techniques described in these chapters are applicable to the results of such observations, (b) these techniques can be very useful during an instrument development (more on this at the end of this section), and (c) the techniques mark a useful starting point in thinking about the level of pre-specification of types of item formats.

3.3.2 Specifying (Just) the Topics What if you know what topics you want to ask about, but no more than that?

The next level of pre-specification occurs when the aims of the instrument are indeed pre-established—in the terms introduced earlier, one can call this the topics guide format (i.e., in the second and third rows of the matrix). Patton (1980), in the context of interviewing, labels this the "interview guide" approach—the guide consists of a set of issues that are to be explored with each respondent before interviewing begins.

The issues in the outline need not be taken in any particular order and the actual wording of questions to elicit responses about those issues is not determined in advance. The interview guide simply serves as a basic checklist during the interview to make sure that there is common information that should be obtained from each person interviewed. (p. 198)

For Topics Guide, two levels of specificity can be distinguished. At the more general level of specificity (second row of the matrix), the definition of the construct and the topics are only specified to a summary level—this is called the general topics guide approach. Presumably, the full specification of these will occur after observations have been made. At a greater degree of specificity (third row of the matrix), the intended complete set, including the construct definition and the full set of topics, is available before administration—hence, this is the specific topics guide approach. The distinction between these two levels is a matter of degree—one could have a very vague summary and, alternatively, there could be a more detailed summary that was nevertheless incomplete.

3.3.3 Constructed Response Items What is the most common form of item?

The next level of pre-specification is the constructed response level (rows 4 and 5 of the matrix). This includes the very common open-ended test and interview instruments such as those mentioned at the beginning of this chapter—used by teachers and in informal settings the world over. Here the items are determined before the administration of the instrument and are administered under standard conditions, including (usually) a predetermined order. In the context of interviewing, Patton (1980) has labeled this the “standardized open-ended interview.” As for the previous level of item format, there are two discernible levels within this category. At the first level (fourth row of the matrix), only the prompt is predetermined, and the response categories are yet to be developed. Most tests that teachers make themselves and use in their classrooms are at this level. At the second level (fifth row of the matrix), the categories that the responses will be divided into are predetermined—call this the scoring guide level. Examples of the constructed response item format (also known as open-ended items) have been given in Chapter 1 (Figure 1.6) and in this chapter (Figure 3.4, left panel, and Figure 3.7).


The interviewer explores the student's thinking about whether the plants in one environment could live in the other environment and why or why not, using the photos as illustration and referent information. The interviewer asks:

Do you think these plants—that live in the rainforest—could survive in the desert too?
If S answers no: Why not? Any other reasons?
If the student answers yes, then the interviewer asks: Why? Any other reasons? Are there any places in the world where these plants couldn't live? Is there any other reason they couldn't live there?

Following this line of questioning, the interviewer probes the students' thinking about the reverse situation from desert to rainforest using the same question structure. The last part of the item asked the student about how plants survive in the desert, essentially to explain the fit between organism and this relatively extreme environment:

I've got another question for you about these desert plants. How come they can survive here in the desert where there is so little water?

FIGURE 3.7 Exploring student responses to the Rain Forest/Desert Task in the CUE interview.

An interesting case of open-ended items is the structured interview questions used in the CUE assessments. Recall that these are investigating students' knowledge about natural selection. In this interview, students were presented with a task illustrated with a visual stimulus (see Figure 3.7). Each task related to one of the strands of the Construct Map (given in Figure 2.12). For example, the "Rainforest/Desert Task," which relates to the Fit construct, begins with the interviewer showing the student two large photographs: one is of the deserts and one is of rain forests (see Figure 3.7). Children's responses to this task reflected a broad range of sophistication, reflecting each level within the fit dimension of the construct map (Metz et al., 2019). The relevant codes, developed after many hours of coding and recoding effort by the project for this type of task, are shown in Figure 3.8—these were augmented by extensive coding notes, and all interviews were (at least) double-coded to check for consistency.

3.3.4 Selected Response Items What is the most structured form of item?

The final level of specificity is the standardized fixed-response format typified by the commonly occurring multiple-choice and Likert-style items (sixth and last row of the matrix). Here the respondent chooses rather than generates a response to


Live Where They Belong
Meeting Needs
Limiting factor X
Differences in Organisms' Structures
Survival/risk value of trait (for the individual)
Those individuals with advantageous trait more likely to have offspring
Change in relative frequency (as in will be more of XXX)
Inheritance
Change in relative frequency from original generation to offspring generation
Accumulation of changes over many generations
Result of natural selection as organisms adapted to where they live
Other: Answer to the question from student perspective but falls outside natural selection concept
No Response/I Don't Know
Uncodeable: 1. Silence, I don't know, and Umm; 2. Incomprehensible; 3. Inaudible

FIGURE 3.8 Coding elements developed by the CUE project.

Q. What is the capital city of Belgium?
A. Amsterdam
B. Brussels
C. Ghent
D. Lille

FIGURE 3.9 An example of a multiple-choice test item that would be a candidate for ordered multiple-choice scoring.

the item. As mentioned earlier, this is probably the most widely used item form in published instruments. Multiple-choice items are ubiquitous in educational assessment, and all should be familiar with them already, but, just in case, a very modest example is shown in Figure 3.9. Under some circumstances, it can be interesting, even enlightening, to consider alternative ways of scoring outcome categories. For example, in the case of multiple-choice items, there are sometimes distractors that are found to be chosen by “better” examinees than some other distractors (in the sense that the examinees obtained higher scores on the instrument as a whole, or on some other relevant indicator). When this difference is large enough, and when there is a way to interpret those differences with respect to the construct definition, then it may make sense to try scoring these distractors to reflect partial success. For example, consider the multiple-choice test item in Figure 3.9: A standard scoring scheme would be: A, C, or D = 0; B = 1. Among these distractors, it would seem reasonable to think that it would be possible to assign a response C to a higher score than A or D, because Ghent is also in Belgium, and the other two cities are not. Thus, an alternative hypothetical scoring scheme would be A or D = 0; C = 1; B = 2. A similar analysis could be applied to any other outcome space where the score levels are themselves meaningful. This can be informed by the analysis described in Section 8.4.5.
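The two scoring schemes just described can be written down directly. The sketch below is only an illustration: the response vector is made up, and the partial credit for Ghent is the hypothetical alternative discussed above, which would only be defensible if the options could be mapped onto waypoints of the construct.

```python
# Hypothetical scoring of the Figure 3.9 item; the response data are invented.
standard_scoring = {"A": 0, "B": 1, "C": 0, "D": 0}   # correct/incorrect only
ordered_scoring  = {"A": 0, "B": 2, "C": 1, "D": 0}   # partial credit for Ghent

responses = ["B", "C", "A", "B", "D", "C"]

standard_scores = [standard_scoring[r] for r in responses]   # [1, 0, 0, 1, 0, 0]
ordered_scores  = [ordered_scoring[r]  for r in responses]   # [2, 1, 0, 2, 0, 1]
```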


This possibility can be built into multiple-choice items right at the design stage when a construct map has been used to design the instrument. By developing options that each relate to different waypoints, and ensuring that there are more than two waypoints involved (say, three or four), the options can be ordered with respect to the construct map, resulting in an ordered multiple-choice (OMC) item (Briggs et al., 2006). This development technique offers a way to improve the interpretability of outcomes from traditional-looking multiple-choice instruments.

The GEB and PF-10 instruments described in previous chapters are examples (Examples 3 and 7, respectively) of items with Likert-style response options. This item format is equally as common in social science settings (and beyond) as multiple-choice items are in educational testing, so I will not give further examples. However, there is a related response format that has been found to give better results, the Guttman-style item (Wilson et al., 2022). In this format, the options are ordered according to the underlying construct, and when the instrument is being developed according to a construct map, the most convenient way to do this is to design each option to match successive waypoints. This is exactly what was done in the Researcher Identity Scale (RIS) example (Example 2—see the construct map in Figure 2.4), where an initial set of Likert-style items was used to create a set of Guttman-style items. An example is shown in Figure 3.10. Initially, each of these options was the stem of a Likert-style item (with choices from Strongly Agree to Strongly Disagree)—changing the response options in this way focuses the respondent on the construct under measurement, using context-specific words to describe the relevant waypoints, and away from merely reporting their agreement/disagreement level, which is both vague and very variable across respondents.

The aforementioned description of this sequence of formats has focused on what has to happen before the item is administered. But, in fact, the sequence of columns in the matrix (Figure 3.6) also lays out what has to happen after the item is administered if, indeed, the responses are to become the input to a measurement. Essentially, in each row, when one looks to the right of the "X" (the point at which that item format is administered), the steps remaining are indicated by the remaining columns. Thus, texts that started as responses in a participant observation context (row one of the matrix) would still have to go through the

G4. Which statement best describes you?
(a) I don't consider myself a part of a research community.
(b) I am beginning to feel like a part of a research community.
(c) I am a small part of a research community.
(d) I am a part of a research community.
(e) I am an important part of a research community.

FIGURE 3.10 An example item from the RIS-G (Guttman response format items).


multiple stages of qualitative analysis, including the derivation of general and specific categories for the contents of the texts (i.e., columns 2 and 3 of the matrix), then the development of coding and scoring guides for the texts (columns 3 and 4 of the matrix), and the actual coding of them (column 5). This same logic applies to each of the remaining rows, until row 6, where there is no further need for recoding, etc.—at this point, each respondent has coded their own response.5

3.3.5 Steps in Item Development How do we move from more initial ideas to more standardized approaches?

As the motivation to create a new instrument is almost certainly that the measurement designer wants to go beyond what was done in the past, it is important that the measurer bring new sources of information to the development, beyond what will be learnt from a literature review. One important source of information can be found through exactly the sort of participant observation approach that has been described in the previous section. The measurer should find situations where people who would be typical respondents to the planned instrument could be observed and interviewed in the informal mode of participant observation. That might include informal conversational interviews, collections of products, recordings of performances, etc. Information from these processes is used to develop a richer and deeper background for the "theory of the construct" that the measurer needs to establish the construct (i.e., the waypoints of the construct map), and the contextual practices that are necessary to develop the secondary facets of the instrument. The set of Informants described in Section 1.9 would be of help in this process, some as participants and some as observers.

At this level of specificity, and at each of the levels noted subsequently, item development should include a thoughtful consideration of the range of participants/respondents and how their differences may alter how items are understood. This should include consideration of respondents who are likely to be at different waypoints of the construct, all the standard demographic groups (gender, ethnic-racial identity, class, etc.), as well as specific groupings relevant to the construct under measurement, such as, in the case of, say, the scientific argumentation construct discussed earlier, students with different amounts of experience in science, and different levels of interest in science.

Following the initial idea-building and background-filling work of the literature review and the participant observations, the measurer should try an initial stab at the items design topics guide. This is difficult to do in a vacuum of context, so, at the same time, it is necessary to also develop some initial drafts of items. This is even true if the plan is to leave the instrument at the topics guide level, as it is essential to try out the guides in practice (i.e., that means actually doing some interviews, etc.). The development of the construct through the idea of


construct map has already been discussed in Chapter 2. The development of the other components will require insights from the participant observation to know what to focus on, and how to express the questions appropriately—some similar information may be gleaned from the literature review, although usually such developmental information is not reported in refereed journals.6

The decision of whether to stop developing the topics guide at a summary level, or whether to go on to the finer-grained specific topics guide, will depend on a number of issues, such as the amount of training that the measurer will devote to the administrators of the instrument, and the amount of time and effort that can be devoted to the analysis. But, if the aim is for the finer level, then inevitably the coarser level will be a step along the way.

Going on to a constructed response format will require either the generation of a set of items or the development of a method for automatically generating them in a standardized way. The latter is rather rare and quite specialized, so it will not be addressed here (but see Williamson et al., 2006). Item development is a skill that is partly science, partly engineering, and partly art. At this point, the "items" are essentially the prompts for the respondents to respond to. The science lies in the construction and/or discovery of theoretically sound and useful constructs, the engineering lies in the development of sound specifications of the item facets, and the art lies in making it work in context. Every context is unique, and understanding the way that items relate to the construct in terms of the design facets (i.e., the construct map [waypoints] and secondary facets) is crucial to successful development of the set of prompts. If the aim is to develop fixed response items, then a further step is needed. This step is discussed in the next chapter.

When items are organized into instruments, there are also issues of instrument format to consider. An important dimension of instrument design is the uniformity of the formats within the instrument. An instrument can consist entirely of a single item format, such as is typical in many standardized achievement tests, where all are usually multiple-choice items, and in many surveys, where Likert-type items are mostly used (though sometimes with different response categories for different sections of the survey). However, more complex mixtures of formats are also used. For example, the portfolio is an instrument format common in the expressive and performance arts, and also in some professional areas. This will consist of a sample of work that is relevant to the purpose of the portfolio, and so may consist of responses to items of many sorts and may even be structured in a variety of ways more or less freely by the respondent according to the rules that are laid down. Tests may also be composed of mixed types—multiple-choice items as well as essays, say, or performance tasks of various sorts. Surveys and questionnaires may also be composed of different formats: true–false items, Likert-style items, and short-answer items. Interviews may consist of open-ended questions as well as forced choice sections. Care must be taken in devising complex designs


such as these—attention must be given to the time allowed for the different sections, the extent of coverage of each, and the ways in which the items of different formats are related to the construct map(s).

3.4 Building-in Fairness through Design How fair can we make our items?

The question of whether the items in an instrument are fair to respondents should be addressed at the design stage. According to the Merriam-Webster Dictionary, fair means: "marked by impartiality and honesty: free from self-interest, prejudice, or favoritism." At this point in instrument development, in the process of developing items, it is a formative question. However, this issue will be revisited as a part of the evaluation of the fairness evidence for an instrument in Section 8.7 (focusing on DIF). The essential question is whether there are important influences (besides the underlying construct that the instrument is intended to measure) on a respondent's reactions to specific items in an instrument, or indeed the whole set of items. For example, consider the many well-established instances where this issue arises in educational testing, as noted by a U.S. National Research Council (NRC) report:

[I]n a written science assessment [or for any other subject matter] with open-ended responses, is writing a target skill or an ancillary skill? Is the assessment designed to make inferences about science knowledge, about written expression of science knowledge, or about written expression of science knowledge in English? The answers to these questions can assist with decisions about accommodations, such as whether to provide a scribe to write answers or to provide a translator to translate answers into English. If mathematics is required to complete the assessment tasks, is mathematics computation a target skill or an ancillary skill? Is the desired inference about knowing the correct equation to use or about performing the calculations? (Here, the answers can guide decisions about use of a calculator.) (NRC, 2006)

3.4.1 What Do We Mean by Fairness Here? How is fairness defined in measurement terms?

These are questions that can be asked for each respondent, but usually fairness questions are couched in terms of how different groups of respondents will tend to respond to the items. These groups may differ from construct to construct and from situation to situation but, in a given context, will generally correspond to typical demographic groups. Staying within the educational testing context, these groups could be defined so as to include, say, respondents of different genders, from different ethnic and racial groups, with different language statuses (such as respondents whose native language is not the language of the instrument), respondents with learning challenges (such as cognitive disabilities, learning disabilities, or hearing impairments, etc.), or respondents with non-standard citizen status (such as recent migrants, or visa-holders, etc.).

In addressing this issue in terms of the construct map and the development of items that are sensitive to the construct, typically, the investigation of such effects on demographic groups would be structured by assuming that there is (a) a reference group (i.e., the group with which the other groups will be compared), and (b) one or more focal groups (i.e., groups for whom the fairness is in question). Then, the measurer will need to consider the following three successively more complicated types of difference in the way that the construct map and the items work between the focal and the reference groups. These possibilities are as follows:

(i) In terms of the measurements themselves, the respondents in the focal group are essentially on the same construct map as the respondents in the reference group, but they tend (on average) to be lower or higher than those from the reference group (this is termed "differential impact").
(ii) The respondents in the focal group are essentially on the same construct map, but some items (or perhaps a class of items) in the instrument behave differently for respondents in the focal group compared to their reference group peers who have a matching level on the construct map. This phenomenon has the generic label "differential item functioning" or DIF7 (see Gamerman et al., 2016; Kamata & Vaughn, 2004; and Section 8.7.1).
(iii) The respondents in the focal group are on a different construct from those in the reference group.8

Case (i) is the least complicated of the three—we might say that there is measurement-fairness here. But it is important to distinguish between the fairness of the measurement process itself (i.e., the instrument fairly maps from the student responses to the eventual outcome) and real-world fairness (i.e., there is no unfairness in the ways that the individual respondents have gotten to this point of being measured). Examples of real-world unfairness might be where some individuals have had more resources devoted to them in the forms of better nutrition, more education, or medical care. It might be naive to assume, when no DIF is found, that no unfairness has occurred, especially if there is a history of concerns over fairness in the specific contexts where the instrument is typically used. To establish that case (i) does indeed hold, the remaining two possibilities need to be examined, and eliminated as important effects.
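The statistical side of checking case (ii) is taken up in Section 8.7.1; as a rough preview only, the sketch below implements the widely used Mantel-Haenszel index for a single dichotomous item, with respondents matched on their total (or rest) scores. The function and variable names, the 0/1 item coding, and the "ref"/"focal" labels are assumptions made for this illustration; it is not the procedure built into BASS.

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(matching_scores, groups, item_correct):
    """Mantel-Haenszel common odds ratio and ETS delta for one dichotomous item.

    matching_scores: total (or rest) scores used to stratify respondents
    groups: "ref" or "focal" for each respondent
    item_correct: 1 if the respondent answered the studied item correctly, else 0
    Odds ratios far from 1 (delta far from 0) flag possible DIF at this level of analysis.
    """
    strata = defaultdict(lambda: [[0, 0], [0, 0]])   # score level -> 2x2 table
    for s, g, y in zip(matching_scores, groups, item_correct):
        row = 0 if g == "ref" else 1
        col = 0 if y == 1 else 1
        strata[s][row][col] += 1

    num = den = 0.0
    for (a, b), (c, d) in strata.values():           # a,b: ref right/wrong; c,d: focal
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    alpha = num / den if den > 0 else float("nan")   # common odds ratio
    delta = -2.35 * math.log(alpha) if alpha > 0 else float("nan")  # ETS delta scale
    return alpha, delta
```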


With regard to case (ii), measurement researchers have investigated differential item functioning with respect to many different focal groups and constructs. For example, effects have been found with respect to construct-irrelevant language factors in items that may prevent English learners (ELs) from demonstrating what they know and can do, most notably in large-scale summative assessments in mathematics (e.g., Daro et al., 2019; Mahoney, 2008), but consistent findings have been rare. In a second example, writing about the same situation, Lee & Randall (2011) have reviewed studies of DIF in mathematics assessments between ELs and students who are not ELs, focusing on whether the complexity of the language used in the items (i.e., "language load") influences DIF. They found that there were no consistent effects across the eight studies they located. They did find in their own additional study that a large proportion of the items from the mathematics content area "Data Analysis, Statistics, and Probability" showed DIF against ELs, but that the items in this area did not have much evidence of language complexity, and they concluded by speculating that the DIF effects were more likely due to differential educational exposure to these topics in their classrooms than to item format effects.

Case (iii) is the most complex, but relatively little research exists investigating this possibility. Looking again at the case of ELs learning mathematics, this is at least partly due to the paucity of clearly established models of how ELs typically learn and develop mathematical competence. But emerging research suggests that often ELs do not follow the typical learning progressions—this should not be too surprising, as our understanding of these learning progressions is based primarily on research and models for non-ELs (Sato et al., 2012). One would expect that, given individual differences in educational history, sociocultural background, and literacy and fluency in their native or home language, ELs would interact with academic content differently compared to non-ELs (e.g., Solano-Flores & Trumbull, 2003). Yet a further layer of complexity is the concern that ELs are not a uniform group—there are substantial differences among ELs in terms of the cognitive, cultural, and linguistic resources, as well as the prior educational backgrounds, that they bring to assessment tasks (Abedi, 2004).

On the other hand, some evidence suggests that learning progressions can be appropriate tools for examining learning and development in students from different cultural and linguistic backgrounds. In the area of early language and literacy, the literature agrees that despite opportunity gaps that lead to average differences between groups, ELs have learning trajectories similar to those of their monolingual English-speaking peers in areas of vocabulary development, grammar, phonemics, and writing (Hammer et al., 2014). An empirical analysis, incorporating a learning progressions approach, found that groups of young children in public early care and education programs who spoke different languages at home had markedly similar trajectories of early language and literacy development (Sussman et al., in press).


The implication of these complexities for measurement is that measurers need to take a proactive role during the design of items to investigate how the items interact with the subgroups in the intended population of respondents. In particular, for ELs, measurers should investigate national and cultural variations in how ELs access, engage, and respond to assessment tasks, which generic learning progressions may miss (Mislevy & Duran, 2014). Measurers should also consider ELs' educational trajectories, focusing on their experiences in content domains as well as in language domains. While the examples have centered on ELs, the related implication for measurement is that research specifically aimed at developing alternative construct maps for ELs and other focal groups is needed.

3.4.2 Universal Design When is the best time to be fair?

A guiding principle known as universal design, initially developed in design professions such as architecture, can be adapted into the measurement context. The idea is that products, say buildings—or measurements—should be designed so that a maximal number of people can use them without the need for further modification. In other words, they should be designed to eliminate unnecessary obstacles to access, and unnecessary limitations on people's success in using the products. In the case of measurement practice, this can have several sorts of manifestations. Generally, the measurement developer needs to find or develop items that are fair (e.g., no DIF), and/or that can be modified/accommodated9 to compensate for DIF effects for affected groups. This leads to the question of what constitutes a fair modification/accommodation. For example, if a test is not of a speeded variety (i.e., the time limitation is not a part of the construct), then offering extra time to those who need it because of specific known difficulties would not provide those respondents with an unfair advantage over the rest of the respondents.10

To this end, Thompson et al. (2002) have proposed seven categories of recommendations for examination of items and instruments to enhance their universal design features:

(a) Studying an inclusive sample of the respondent population
(b) Developing precisely defined constructs
(c) Designing accessible, non-biased items
(d) Designing items that are amenable to focal group accommodations
(e) Writing simple, clear, and intuitive instructions about the procedures
(f) Designing for maximum readability and comprehensibility
(g) Designing for maximum legibility


Examples of each of these are given in Johnstone et al. (2006). In particular, given that the reader is reading this book, the advice to develop "precisely defined constructs" should already be at the top of the measurer's list. However, it is important, in light of the earlier discussion, that the measurer be open to the possibility that the constructs may be structured differently between reference and focal groups, and investigate it thoroughly. Apart from engaging in a focused literature search for evidence of this, the measurer will also find that the cognitive lab methods described in Section 4.4 can be useful in giving clues, assuming, of course, that the focal groups have been included among the sampled respondents, which corresponds to the preceding recommendation (a). A useful framework for universal design in a measurement context has been given by Robert Dolan and his colleagues (Dolan et al., 2013). Nathaniel Brown has prepared a summary of such issues (Brown, 2020) that is helpful in exploring the multifaceted nature of the design issues.

3.5 Resources

Chapters 1–3 have provided the general resources necessary to create an items design and generate items using the creativity, insight, and hard work of the measurement developer. In terms of specific resources, there is far too wide a range of potential types of constructs, areas of application, and item formats to even attempt to list particular sources here. Nevertheless, the Exercises and Activities at the end of Chapters 1 and 2 should have directed the reader toward a better understanding of practices within their area of interest, including the background, current and past practices, and the relevant range of item designs. Within the area of educational achievement testing, there are several very useful resources for types of items and methods to develop them: Brookhart & Nitko (2018), Haladyna (1996, 1999), Osterlind (1998), and Roid & Haladyna (1982). Many if not most other areas will have similar resources, and your informants should be able to put you onto them. I recommend you consult with professors, colleagues, and experts in your area of interest for more specific direction and support. Such informants, especially those who have experience in instrument development, can help explain specific issues that may arise, provide insight during the item development process, and help critique your overall process.

3.6 Exercises and Activities

(Following on from the exercises and activities in Chapters 1 and 2)

1. Generate lots of types of items and several examples of each type. Be prolific.
2. Write down your initial items design based on the preceding activities (including the activities listed for Chapters 1 and 2)—devote most of your attention to the construct facet, but listen to your informants about what should be the most important secondary facets.
3. Give these draft items a thorough professional review at an "item panel" meeting, where key informants and others you recruit constructively critique the items generated so far (see Appendix F).
4. Write down your interim Items Design based on the preceding activities, including predicted numbers of items you will (still) need to generate.
5. Following the initial round of item generation and item paneling described in (1)–(4), second or third rounds may be needed, and may also involve a return to reconsider the construct definition or the definition of the other facets of the items design.
6. Enter your surviving items into the BASS Item Bank for the construct maps you are working on.
7. Think through the steps outlined above in the context of developing your instrument and write down notes about your plans.
8. Share your plans and progress with your informants and others—discuss what you are succeeding in, and what problems have arisen.

Notes

1 Although an item may also be related to several different constructs, especially if the outcome space (see Chapter 4) is different for each.
2 That is, these are typical situations involved in high-school physics items.
3 Note that the possibility of unclassifiable and missing response categories is being set aside here.
4 Of course, not informing the respondents that they are part of a research study may well be a tactic that needs an institutional review board (IRB) review.
5 Of course, the measurer may want to carry out further analysis and recoding after that.
6 But indeed, most refereed journals do not insist on such information. Lamentably, often they recommend deletion of such information to save on page length.
7 When this occurs for the aggregate of the whole test, this is termed "differential test functioning."
8 This would be termed a lack of factorial invariance across these groups.
9 Broadly, a modification to an item is a change in the item itself, while an accommodation is a change in the circumstances under which the item is administered, such as, say, the provision of a glossary.
10 Further information on this topic is available from the National Center on Educational Outcomes (NCEO) (http://education.umn.edu/NCEO).

4 THE OUTCOME SPACE

You see, but you do not observe. —Sir Arthur Conan Doyle (1892)

4.1  The Qualities of an Outcome Space What is an outcome space?

The outcome space is the third building block in the BEAR Assessment System (BAS). It has already been briefly introduced in Chapter 1, and its relationship to the other building blocks was illustrated there too—see Figure 4.1. In this chapter, it is the main focus.

The term "outcome space" was introduced by Ference Marton (1981) for a set of outcome categories developed from a detailed ("phenomenographic") analysis of students' responses to standardized open-ended items such as the LPS item discussed in the previous chapter.1 In much of his writing, Marton describes the development of a set of outcome categories as a process of "discovering" the qualitatively different ways in which students respond to a task. In this book, the term outcome space is adopted and applied in a broader sense to any set of qualitatively described categories for recording and/or judging how respondents have responded to items.

Several examples of outcome spaces have already been shown in earlier examples. The LPS Argumentation construct map in Figure 2.7 (Example 4) summarizes how to categorize the responses to the LPS items attached to the Argumentation construct—this is a fairly typical outcome space for an open-ended item. The outcome spaces for fixed-response items look different—they



FIGURE 4.1 The four building blocks in the BEAR Assessment System (BAS).

are simply the fixed responses themselves. For example, the outcome space for an evaluation item in the PF-10 survey (Example 7) is as follows: "limited a lot," "limited a little," or "not at all." Although these two types of outcome space look quite different, it is important to see that they are connected in a deep way—in both cases, the response categories are designed to map back to the waypoints of the construct map. Thus, if two sets of items, some of which were constructed response and some selected response, related to the same construct map, then, despite how different they looked, they all would have the common feature that their responses could be mapped to the waypoints of that construct map. As noted earlier, this connection leads to a good way to develop a fixed set of responses for selected response items: first, construct the open-ended outcome space, and second, use some of the sample responses in the categories as a way to generate representative fixed choices for selection. Of course, many considerations must be borne in mind while making those choices.

Inherent in the idea of categorization is an understanding that the categories that define the outcome space are qualitatively distinct. All measures are based, at some point, on qualitative distinctions. Even fixed-response formats such as multiple-choice test items and Likert-style survey questions rely upon a qualitative understanding of what constitutes different levels of response (more or less correct, or more or less agreeable, as the case may be). Rasch (1977, p. 68) pointed out that this principle goes far beyond measurement in the social sciences: "That


science should require observations to be measurable quantities is a mistake of course; even in physics, observations may be qualitative—as in the last analysis they always are."

The remainder of this section contains a description of the important qualities of a sound and useful outcome space. These qualities include well-defined, finite and exhaustive, ordered, context-specific, and research-based, as detailed subsequently. The outcome space for a construct map will apply to all items that are designed to give evidence about the construct. For specific items, there are often specific details about the item responses that will pertain only to that or similar items. This also needs to be specified: this more specific item-focused information is referred to in this book as a scoring guide.

4.1.1 Well-defined Categories How should the categories in an outcome space be defined?

The categories that make up the outcome space must be well-defined. For our purposes, this will need to include not only (a) a general definition of what is being measured by that item (i.e., in the approach described in this book, a description of the construct map), but also (b) relevant background material, (c) examples of items, item responses, and their categorization, as well as (d) a training procedure for judging constructed response items. The LPS Example displays all except the last of these characteristics: Figure 2.9 summarizes the Argumentation construct map including descriptions of different levels of response; Figures 3.3 and 3.4 show an example item; and the paper cited in the description in Chapter 2 (Osborne et al., 2016) gives a background discussion to the construct map, including references to the relevant literature. Construct Mapping: What is not shown in the LPS materials is a training program to achieve high inter-rater agreement in the types of responses that fall into different categories, which will in turn support the usefulness for the results. To achieve high levels of agreement, it is necessary to go beyond written materials; some sort of training is usually required. One such method that is consistent with the BAS approach is called “construct mapping” (Draney  & Wilson, 2010; Draney  & Wilson, 2011). In the context of education, this method has been found to be particularly helpful for teachers, who can bring their professional experiences to help in the judgment process, but who also have found the process to enhance their professional development. In this technique, teachers choose examples of item responses from their own students or others, and then circulate the responses beforehand to other members of the moderation group. All the members of the group categorize the responses using the scoring guides and other material available to them. They then come together to “moderate”


those categorizations at a consensus-building meeting. The aim of the meeting is for the group to compare their categorizations, discuss them until they come to a consensus about the scores, and to discuss the instructional implications of knowing what categories the students have been categorized into. This process may be repeated several times with different sets of responses to achieve higher levels of initial agreement, and to track teachers' improvement over time. In line with the iterative nature of design, the outcome space may be modified from the original by this process.

One way to check that the outcome space contains sufficiently interpretable detail is to have different teams of judges use the materials to categorize a set of responses. The agreement between the two sets of judgments provides an index of how successful the definition of the outcome space has been (although, of course, standards of success may vary). Marton (1986) gives a useful distinction between developing an outcome space and using one. In comparing the work of the measurer to that of a botanist classifying species of plants, he notes:

[W]hile there is no reason to expect that two persons working independently will construct the same taxonomy, the important question is whether a category can be found or recognized by others once it has been described. It must be possible to reach a high degree of agreement concerning the presence or absence of categories if other researchers are to be able to use them. (Marton, 1986, p. 35)

4.1.2 Research-based Categories What backing is needed for the categories in an outcome space?

The construction of an outcome space should be part of the process of developing an item and, hence should be informed by research aimed at (a) establishing the construct to be measured and (b) identifying and understanding the variety of responses respondents give to that task. In the domain of measuring achievement, a National Research Council committee concluded: A model of cognition and learning should serve as the cornerstone of the assessment design process. This model should be based on the best available understanding of how students represent knowledge and develop competence in the domain. . . . This model may be fine-grained and very elaborate or more coarsely grained, depending on the purpose of the assessment, but it should always be based on empirical studies of learners in a domain. Ideally, the model will also provide a developmental perspective, showing typical ways in which learners progress toward competence. (NRC, 2001, pp. 2–5)


Thus, in the achievement context, a research-based model of cognition and learning should be the foundation for the definition of the construct, and hence also for the design of the outcome space and the development of items. In other areas, similar advice pertains—in psychological scales, health questionnaires, and in marketing surveys—there should be a research-based construct to tie all of the development efforts together. There is a range of formality and depth that one can expect of the research behind such "research-based" outcome spaces. For example, the LPS Argumentation construct is based on a close reading of the relevant literature (Osborne et al., 2016), as are the Data Modeling constructs (Lehrer et al., 2014). The research basis for the PF-10 is documented in Ware and Gandek (1998), although the construct is not explicitly established there. For each of the rest of the Examples, there is a basis in the relevant research literature for the construct map, although (of course) some literatures have more depth than others.

4.1.3 Context-specific Categories How generic should the categories in an outcome space be?

In the measurement of a construct, the outcome space must always be specific to that construct and the contexts in which it is to be used. Sometimes it is possible to confuse the context-specific nature of an outcome space with the generality of the scores that are derived from that. For example, a multiple choice item will have distractors that are only meaningful (and scoreable) in the context of that item, but the usual scores of the item (“correct”/“incorrect” or “1”/“0”) are interpretable more broadly as indicating “correctness.” This can lead to a certain problem in developing achievement items, which I call the “correctness fallacy”—that is, the view (perhaps an unconscious view) that the categorization of the responses to items is simply according to whether the student supplied a “correct” answer to it. The problem with this approach is that the “correctness” of a response may not fully comprehend the complexity of what is asked for in the relevant construct map. For example, in the “Ice-to-Water-Vapor” Task in Figure 3.3, note how a student could be asked which student, Anna or Evan, is correct. The response to this could indeed be judged as correct or not, but nevertheless, the judgment would have little information regarding Argumentation—what is needed to pry out of this context is to proceed to the next part of the task, as exemplified in Figure 3.4, where the prompts are used to disassemble the “correctness” into aspects relevant to the Argumentation construct map. Even when categories are labeled in the same way from context to context, their use inevitably requires a reinterpretation in each new context. The set of categories for the LPS tasks, for example, was developed from an analysis of students’ answers to the set of tasks used in the pilot and subsequent years of the assessment development project. The general scoring guide used for the LPS


Argumentation construct needs to be supplemented by an item scoring guide, including a specific set of exemplars for each specific task (as shown in Table 1.1 for the Data Modeling MoV Piano Width item).

4.1.4 Finite and Exhaustive Categories How many categories are needed in an outcome space?

The responses that the measurer obtains to an open-ended item will generally be a sample from a very large population of possible responses. Consider a single essay prompt—something like the classic “What did you do over the summer vacation?” Suppose that there is a restriction to the length of the essay of, say, five pages. Think of how many possible different essays could be written in response to that prompt. It is indeed a very large number (although, because there is only a finite number of words in English, there is [for a finite-length essay] a finite upper limit that could be estimated). Multiply this by the number of different possible prompts (again, very large, but finite) and then again by all the different possible sorts of administrative conditions (it can be hard to say what the numerical limit is here, perhaps infinite), and you end up with an even bigger number. The role of the outcome space is to bring order and sense to this extremely large and potentially unruly bunch of possible responses. One prime characteristic is that the outcome space should consist of only a finite number of categories. For example, the LPS scoring guide categorizes all Argumentation item responses into 13 categories, as shown in Figure 2.7. The PF-10 outcome space is just three categories: “Yes, limited a lot,” “Yes, limited a little,” and “No, not limited at all.” The outcome space, to be fully useful, must also be exhaustive: there must be a category for every possible response. Note that some potential responses may not be delimited in the construct map. First, under broad circumstances, there may be responses that indicate the following: (a) That there was no opportunity for a particular respondent to respond (e.g., this can occur due to the measurer’s data collection design, where the items are, say, distributed across a number of forms, and certain items are included on one of the forms). (b) That the respondent was prevented from completing all of the items by matters not related to the construct or the purpose of the measurement (such as an internet interruption). For such circumstances, the categorization of the responses to a construct map level would be misleading, and so a categorization into “missing” or an equivalent would be best. The implications for this “missing” category need to be borne in mind for the analysis of the resulting data.


Second, there will often be responses that do not conform with the expected range. In the constructed response items, one can get responses like "tests stink" or "I vote for Mickey Mouse," etc. Although such responses should not be ignored, as they sometimes contain information that can be interpreted in a larger context and may even be quite important in that larger context, they will usually not inform the measurer about the respondent's location on a specific construct.

In fixed-response item formats like the PF-10 scale, the finiteness and exhaustiveness of the response categorization is seemingly forced by the format, but one can still find instances where the respondent has endorsed, say, two of the options for a single item. In situations like these, the choice of a "missing" category may seem automatic, but there are circumstances where that may be misleading. In educational achievement testing, for example, it may be more consistent with an underlying construct (i.e., because such responses do not reflect "achievement") to categorize them at the lowest waypoint, as was indicated for the Argumentation construct map in Figure 2.7. Whatever policy is developed, it has to be sensitive to both the underlying construct and the circumstances of the measurement.

4.1.5 Ordered Categories Must the categories in an outcome space be related to one another?

For an outcome space to be useful in defining a construct that is to be shaped into a construct map, the categories must be capable of being ordered in some way. Some categories must represent lower levels on the construct, while some must represent higher ones. In traditional fixed-response item formats like the multiple-choice test item and the true-false survey question, the responses are ordered into just two levels—in the case of true–false questions, (obviously) into “true” and “false”; in the case of multiple-choice items, into the correct category, for choosing the correct option, and into the false category for choosing one of the false options. In Likert-type survey questions, the order is implicit in the nature of the choices: the options “Strongly Agree,” “Agree,” “Disagree,” and “Strongly Disagree” give an ordering for the responses, etc. A scoring guide for an open-ended item needs to do the same thing—the scores shown in Table 1.1 for the MoV item give four ordered categories (scored 0–3, respectively). This ordering needs to be supported by both the theory behind the construct and empirical evidence—the theory behind the outcome space should be the same as that behind the construct itself. Empirical evidence can be used to support the ordering of an outcome space—and is an essential part of both pilot and field investigations of an instrument (see Section 8.4 for examples of this). The ordering of the categories does not need to be complete. An ordered partition (i.e., where several categories may have the same rank in the ordering such as is shown for the scoring guide for Argumentation in Figure 3.2) can still be used to provide useful information (Wilson & Adams, 1995).
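One simple empirical screen for this ordering, ahead of the model-based checks of Section 8.4, is to compare the average rest score (the respondent's total score on the remaining items) across the categories of a single item: if the intended ordering holds, the averages should rise from the lowest category to the highest. The sketch below is a minimal illustration with invented function and variable names and made-up data.

```python
from statistics import mean

def mean_rest_score_by_category(item_scores, total_scores):
    """Average rest score (total minus this item) for each observed category score."""
    rest = [t - s for t, s in zip(total_scores, item_scores)]
    by_category = {}
    for s, r in zip(item_scores, rest):
        by_category.setdefault(s, []).append(r)
    return {cat: mean(vals) for cat, vals in sorted(by_category.items())}

# Example with invented data: category scores 0-2 on one item, plus total scores.
print(mean_rest_score_by_category([0, 1, 2, 1, 0, 2], [3, 6, 9, 5, 2, 8]))
# {0: 2.5, 1: 4.5, 2: 6.5} -- rising means offer rough support for the ordering
```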


4.2  Scoring the Outcome Space (the Scoring Guide) How can the observations in the categories of an outcome space be scored?

Most often, the set of categories that come directly out of an outcome space is not yet sufficient as a basis for measurement. One more step is needed—the provision of a scoring guide. The scoring guide organizes the ordered categories as waypoints along the construct map: the categories must be related back to the responses side of the generating construct map. This can be seen simply as providing numerical values for the ordered levels of the outcome space (i.e., scoring of the item response categories), but the deeper meaning of this pertains to the relationship back to the construct map from Chapter 1. In many cases, this process is seen as integral to the definition of the categories, and that is indeed a good thing, as it means that the categorization and scoring work in concert with one another. Nevertheless, it is important to be able to distinguish the two processes, at least in theory, because (a) the measurer must be able to justify each step in the process of developing the instrument, and (b) sometimes the possibility of having different scoring schemes is useful in understanding and exploring the construct.

In many circumstances, especially where the measurer is using an established item format, the question of what scoring procedure to use has been established by long-term practice and is regularly not considered as an issue to be examined. For example, with multiple-choice test items, it is standard procedure to score the correct option as 1 and the incorrect ones as 0. Thus, when the correct option is indeed an example of a particular waypoint (and the incorrect ones are all associated with waypoints below it), then the 1 and 0 scoring makes sense. Of course, the instrument developer needs to be sure that there is no ambiguity in the mapping of the distractors to the waypoints. Likert-style response questions in surveys and questionnaires are usually scored according to the number of response categories allowed—if there are four categories like "Strongly Agree," "Agree," "Disagree," and "Strongly Disagree," then these are scored as 0, 1, 2, and 3, respectively (or, sometimes, 1, 2, 3, and 4). When a Likert-style response set is used with an attitude or behavioral construct map, there can be difficulties in interpreting how "Disagree" and "Agree" are mapped onto the waypoints. See the discussion about Guttman-style items in Section 4.3.3. With option sets that have a negative valence with the construct, the scoring will generally be reversed to be 3, 2, 1, and 0, respectively.

With open-ended items, the outcome categories must be ordered into qualitatively distinct, ordinal categories, such as was done in the LPS example. Just as for Likert-style items, it makes sense to think of each of these ordinal levels as being scored by successive integers, just as they are in Figure 3.4 (left panel), where the successive ordered categories are scored thus:

104  The Outcome Space

A full critique, or comparison of two arguments = 3, One complete warrant, or a counterargument = 2, One claim or piece of evidence = 1, No evidence = 0 No opportunity to respond = missing. This can be augmented where there are finer gradations available—one way to represent this is by using “+” and “−” for (a) responses palpably above a waypoint, but not yet up to the next waypoint, and (b) responses palpably below a waypoint, but not yet down to the next waypoint, respectively. Note that these may be waypoints in the making. Another way is to increase the number of scores to incorporate the extra categories. The category of “no opportunity” is scored as “missing” above. Under some circumstances, say, where the student was not administered an achievement item because it was deemed too difficult on an a priori basis, then it would make sense to score this “missing” consistently with that logic as a “0.” However, if the student was not administered the item for reasons that related to some reason unrelated to that student’s measure on the construct, say, that they were ill that day, then it would make sense to maintain the “missing” and interpret it as indicating “missing data.” 4.3  General Approaches to Constructing an Outcome Space Are there general strategies that can be used to develop categories in an outcome space?

The construction of an outcome space will depend heavily on the specific context, both theoretical and practical, in which the measurer is developing the instrument. It should begin with the definition of the construct, and then proceed to the definition of the descriptive components of the items design and will also require the initial development of some example items. Two general schema are described subsequently that have been developed for this purpose, focusing on the cognitive domain: (a) phenomenography (Marton, 1981), which was mentioned earlier, and the SOLO Taxonomy (Biggs & Collis, 1982). At the end of this section, a third method, applicable also to non-cognitive contexts, and derived from the work of Louis Guttman, is described. 4.3.1 Phenomenography How can qualitative categories be useful in developing an outcome space?

Phenomenography provides a method of constructing an outcome space for a cognitive task based on a detailed analysis of a set of student responses (Bowden & Green,

The Outcome Space  105

2005). Phenomenographic analysis has its origins in the work of Ference Marton (1981) who describes it as “a research method for mapping the qualitatively different ways in which people experience, conceptualize, perceive, and understand various aspects of, and phenomena in, the world around them” (Marton, 1986, p. 31). Phenomenographic analysis usually involves the presentation of an open-ended task, question, or problem designed to elicit information about an individual’s understanding of a particular phenomenon. Most commonly, tasks are attempted in relatively unstructured interviews during which students are encouraged to explain their approach to the task or conception of the problem. Researchers have applied phenomenographic analysis to such topics as physics education (Ornek, 2008); teacher conceptions of success (Carbone et al., 2007); blended learning (Bliuc et  al., 2012); teaching (Gao et  al., 2002); nursing research (Sjöström  & Dahlgren, 2002); and speed, distance, and time (Ramsden et al., 1993). A significant finding of these studies is that students’ responses typically reflect a limited number of qualitatively different ways of thinking about a phenomenon, concept, or principle (Marton, 1988). An analysis of responses to the question in Figure 4.2, for example, revealed just a few different ways of thinking about the relationship between light and seeing. The main result of phenomenographic analysis is a set of categories describing the qualitatively different kinds of responses students give, forming the outcome space, which Dahlgren (1984) describes an outcome space as a “kind of analytic map”: It is an empirical concept which is not the product of logical or deductive analysis, but instead results from intensive examination of empirical data. Equally important, the outcome space is content-specific: the set of descriptive

FIGURE 4.2 

An open-ended question in physics.

Source: From Marton (1983)

106  The Outcome Space

categories arrived at has not been determined a priori, but depends on the specific content of the [item]. (p. 26) The data analyzed in studies of this kind are often, but not always, transcripts of interviews. In the analysis of students’ responses, an attempt is made to identify the key features of each student’s response to the assigned task. The procedure can be quite complex, involving up to seven steps (Sjöström & Dahlgren, 2002). A search is made for statements that are particularly revealing of a student’s way of thinking about the phenomenon under discussion. These revealing statements, with details of the contexts in which they were made, are excerpted from the transcripts and assembled into a pool of quotes for the next step in the analysis. The focus of the analysis then shifts to the pool of quotes. Students’ statements are read and assembled into groups. Borderline statements are examined in an attempt to clarify differences between the emerging groups. Of particular importance in this process is the study of contrasts. Bringing the quotes together develops the meaning of the category, and at the same time the evolving meaning of the category determines which quotes should be included and which should not. This means, of course, a tedious, time-consuming iterative procedure with repeated changes in the quotes brought together and in the exact meaning of each group of quotes. (Marton, 1988, p. 198) The result of the analysis is a grouping of quotes reflecting different kinds of understandings. These groupings become the outcome categories, which are then described and illustrated using sampled student quotes. Outcome categories are “usually presented in terms of some hierarchy: There is a best conception, and sometimes the other conceptions can be ordered along an evaluative dimension” (Marton, 1988, p. 195). For Ramsden et al. (1993), it is the construction of hierarchically ordered, increasingly complex levels of understanding, and the attempt to describe the logical relations among these levels that most clearly distinguishes phenomenography from other qualitative research methods. The link to the idea of a construct map should be clear. Consider now the outcome space in Figure 4.3 based on an investigation of students’ understandings of the relationship between light and seeing (see the item shown in Figure 4.2). The concept of light as a physical entity that spreads in space and has an existence independent of its source and effects is an important notion in physics and is essential to understanding the relationship between light and seeing. Andersson and Kärrqvist (1981) found that very few ninth-grade students in Swedish comprehensive schools understood these basic properties of light. They observe that authors of science textbooks take for granted an understanding of light and move rapidly to topics such as lenses and systems of lenses that rely on students’ understanding of these

The Outcome Space  107

(e) The object reflects light and when the light reaches the eyes we see the object. (d) There are beams going back and forth between the eyes and the object. The eyes send out beams which hit the object, return and tell the eyes about it. (c) There are beams coming out from the eyes. When they hit the object we see (cf. Euclid's concept of “beam of sight”). (b) There is a picture going from the object to the eyes. When it reaches the eyes, we see (cf. the concept of “eidola” of the atomists in ancient Greece). (a) The link between eyes and object is “taken for granted.” It is not problematic: "you can simply see." The necessity of light may be pointed out and an explanation of what happens within the system of sight may be given. FIGURE 4.3 

A phenomenographic outcome space.

foundational ideas about light. And teachers similarly assume an understanding of the fundamental properties of light: “Teachers probably do not systematically teach this fundamental understanding, which is so much a part of a teacher’s way of thinking that they neither think about how fundamental it is, nor recognize that it can be problematic for students” (Andersson & Kärrqvist, 1981, p. 82). To investigate students’ understandings of light and sight more closely, 558 students from the last four grades of the Swedish comprehensive school were given the question in Figure 4.2 and follow-up interviews were conducted with 21 of these students (Marton, 1983). On the basis of students’ written and verbal explanations, five different ways of thinking about light and sight were identified. These are summarized in the five categories in Figure 4.3. Reading from the bottom of Figure 4.3 up, it can be seen that some students give responses to this task that demonstrate no understanding of the passage of light between the object and the eye: according to these students, we simply “see” (a). Other students describe the passage of “pictures” from objects to the eye (b); the passage of “beams” from the eye to the object with the eyes directing and focusing these beams in much the same way as a flashlight directs a beam (c); the passage of beams to the object and their reflection back to the eye (d); and the reflection of light from objects to the eye (e). Each of these responses suggests a qualitatively different understanding. The highest level of understanding is reflected in category (e); the lowest in category (a). Marton (1983) does not say whether he considers the five categories to constitute a hierarchy of five levels of understanding. His main purpose is to illustrate

108  The Outcome Space

the process of constructing a set of outcome categories. Certainly, categories (b), (c), and (d) reflect qualitatively different responses at one or more intermediate levels of understanding between categories (a) and (e). Note that in this sample, no student in the sixth grade, and only 11% of students in the ninth grade gave responses judged as being in category (e). 4.3.2  The SOLO Taxonomy What is a general strategy for incorporating cognitive structures into an outcome space?

The SOLO (Structure Of the Learning Outcome) Taxonomy is a general theoretical assessment development framework that may be used to construct an outcome space for a task related to cognition. The taxonomy, which is shown in Figure 4.4, was originally developed by John Biggs and Kevin Collis (1982) to provide a frame of reference for judging and classifying students’ responses from elementary to higher education (Biggs, 2011). The SOLO Taxonomy is based on Biggs and Collis’ initial observation that attempts to allocate students to Piagetian stages and to then use these allocations to predict students’ responses to tasks invariably results in unexpected observations (i.e., “inconsistent” performances of individuals from task to task). The solution for Biggs and Collis is to shift the focus from a hierarchy of very broad developmental stages to a hierarchy of observable outcome categories within a narrow range regarding a specific topic—in our terms, a construct: “The difficulty, from a practical point of view, can be resolved simply by shifting the label from the student to his response to a particular task” (1982, p. 22). Thus, the SOLO levels “describe a particular performance at a particular time and are not meant as labels to tag students” (1982, p. 23).

An extended abstract response is one that not only includes all relevant pieces of information but extends the response to integrate relevant pieces of information not in the stimulus. A relational response integrates all relevant pieces of information from the stimulus. A multistructural response is one that responds to several relevant pieces of information from the stimulus. A unistructural response is one that responds to only one relevant piece of information from the stimulus. A pre-structural response is one that consists only of irrelevant information.

FIGURE 4.4 

The SOLO Taxonomy.

The Outcome Space  109

The SOLO Taxonomy has been applied in the context of many instructional and measurement areas in education, including topics such as science curricula (Brabrand & Dahl, 2009), inquiry-based learning (Damopolii et al., 2020), high school chemistry (Claesgens et al., 2009), mathematical functions (Wilmot et al., 2011), middle school number sense and algebra (Junpeng et al., 2020), and middle school science (Wilson & Sloane, 2000). The Stonehenge example detailed in Figures  4.5 and 4.6 illustrates the construction of an outcome space by defining categories to match the levels of the SOLO framework. In this example, five categories corresponding to the five levels of the SOLO Taxonomy—prestructural, unistructural, multistructural, relational, and extended abstract—have been developed for a task requiring students

FIGURE 4.5 

A SOLO task in the area of History.

Source: Photograph of Stonehenge downloaded from: https://publicdomainpictures.net/pictures/100000/velka/stonehenge-england-1410116251sfi.jpg

110  The Outcome Space

4 Extended Abstract e.g., Stonehenge is one of the many monuments from the past about which there are a number of theories.

It may have been a fort but the evidence suggests it was more likely to have been a temple. Archeologists think that there were three different periods in its construction, so it seems unlikely to have been a fort. The circular design and the blue stones from Wales make it seem reasonable that Stonehenge was built as a place of worship. It has been suggested that it was for the worship of the sun god because at a certain time of the year the sun shines along a path to the altar stone. There is a theory that its construction has astrological significance or that the outside ring of pits was used to record time. There are many explanations about Stonehenge but nobody really knows.

This response reveals the student's ability to hold the result unclosed while he considers evidence from both points of view. The student has introduced information from outside the data and the structure of his response reveals his ability to reason deductively. 3 Relational e.g., think it would be a temple because it has a round formation with an altar at the top end. I think

it was used for worship of the sun god. There was no roof on it so that the sun shines right into the temple. There is a lot of hard work and labor in it for a god and the fact that they brought the blue stone from Wales. Anyway, it Is unlikely they build a fort in the middle ofa plain.

This is a more thoughtful response than the ones below; it incorporates most of the data, considers the alternatives, and interrelates the facts. 2 Multistructural e.g., It might have been a fort because would it looks like it would stand up to it. They used to

build castles out of stones in those days. It looks like you could defend it too. It is more likely that Stonehenge was a temple because it looks like a kind of design all in circles and they have gone to a lot of trouble.

e.g.,

These students have chosen an answer to the question (i.e., they have required a closed result) by considering a few features that stand out for them in the data, and have treated those features as independent and unrelated. They have not weighed the pros and cons of each alternative and come to balanced conclusion on the probabilities. 1 Unistructural e.g., It looks more like a temple because they are all in circles. e.g., It could have been a fort because some of those big stones have been pushed over. These students have focused on one aspect of the data and have used it to support their answer to the question. 0 Prestructural e.g., A temple because people live in it. e.g., It can be a fort or a temple because those big stones have fallen over. The first response shows a lack of understanding of the material presented and of the implication of the question. The student is vaguely aware of "temple," "people," and "living," and he uses these disconnected data from the story, picture, and questions to form his response. In the second response, the pupil has focused on an irrelevant aspect of the picture.

FIGURE 4.6 

SOLO scoring guide for the history task.

Source: From Biggs and Collis (1982, pp. 47–49)

to interpret historical data about Stonehenge (Biggs & Collis, 1982, pp. 47–49). The History task in Figure 4.5 was constructed to assess students’ abilities to develop plausible interpretations from incomplete data. Students aged between 7.5 and 15 years of age were given the passage in Figure 4.5 and asked to give in writing their thoughts about whether Stonehenge might have been a fort rather than

The Outcome Space  111

a temple. The detailed SOLO scoring guide for this item is shown in Figure 4.6 (note the inclusion of exemplars here). This example raises the interesting question of how useful theoretical frameworks of this kind might be in general. Certainly, Biggs and Collis have demonstrated the possibility of applying the SOLO Taxonomy to a wide variety of tasks and learning areas and other researchers have observed SOLO-like structures in empirical data. Dahlgren (1984, 29–30), however, believes that the great strength of the SOLO taxonomy—its generality of application—is also its weakness. Differences in outcome which are bound up with the specific content of a particular task may remain unaccounted for. In some of our analyses, qualitative differences in outcome similar to those represented in the SOLO taxonomy can be observed, and yet differences dependent on the specific content are repeatedly found. Nevertheless, the SOLO Taxonomy has been used in many assessment contexts as a way to get started. An example of such an adaptation was made for the Using Evidence construct map for the Issues Evidence and You curriculum (Chapter 2, Example 10; Wilson & Sloane, 2000) shown in Figure 4.7, which began with a SOLO hierarchy as its outcome space, but eventually morphed to the structure shown. For example, in Figure 4.7: Waypoint I is clearly a pre-structural response, but Waypoint II is a special unistructural response consisting only of subjective reasons and/or inaccurate or irrelevant evidence; Waypoint III is similar to a multistructural response, but is characterized by incompleteness; Waypoint IV is a traditional relational response, and is the standard schoolbook “correct answer” while Waypoint V adds some of the usual “extras” of extended abstract. Similar adaptations were made for all of the IEY constructs, which were adapted from the SOLO structure based on the evidence from student responses to the items. This may be the greatest strength of the SOLO Taxonomy—its usefulness as a starting place for the analysis of responses. In subsequent work using the SOLO Taxonomy, several other useful levels have been developed. A problem in applying the Taxonomy was found—the multistructural level tends to be quite a bit larger than the other levels—effectively, there are lots of ways to be partially correct. In order to improve the diagnostic uses of the levels, several intermediate levels within the multistructural one have been developed by the Berkeley Evaluation and Assessment Research (BEAR) Center, and hence the new generic outcome space is called the SOLO-B Taxonomy. Figure 4.8 gives the revised taxonomy.

112  The Outcome Space

FIGURE 4.7 A sketch

of the construct map for the Using Evidence construct of the IEY curriculum.

4.3.3  Guttman Items What is a useful alternative to Likert-style items?

The two general approaches described earlier relate most effectively to the cognitive domain—there are also general approaches in the attitudinal and behavioral domains. The most common general approach to the creation of outcome spaces in areas such as attitude and behavior surveys has been the Likert style of item. The most generic form of this is the provision of a stimulus statement (sometimes called a “stem”), and a set of standard options that the respondent must choose from.

The Outcome Space  113

FIGURE 4.8 

The SOLO-B Taxonomy.

Possibly the most common set of options is “Strongly Agree,” “Agree,” “Disagree,” and “Strongly Disagree,” sometimes with a middle “neutral” option. The set of options may be adapted to match the context: for example, the PF-10 Health Outcomes survey uses this approach (see Section 2.2.7). Although this is a very popular approach, largely I suspect, because it is relatively easy to come up with many items when all that is needed is a new stem for each one, there is certain dissatisfaction with the way that the response options relate to the construct. After laying out the potential limitations of this approach, an alternative is proposed. The problem here is that there is very little to guide a respondent in judging what is the difference between, say, “Strongly Disagree” and “Agree.” Indeed, individual respondents may well have radically different ideas about these distinctions (Carifio & Perla, 2007). This problem is greatly aggravated when the options offered are not even words, but numerals or letters, such as “1,” “2,” “3,” “4,” and “5”—in this sort of array, the respondents do not even get a hint as to what it is that they are supposed to be making distinctions between! The Likert response format has been criticized frequently over the almost 100  years since Likert wrote his foundational paper (Likert, 1932–1933), as one might expect for anything that is so widely used. Among those criticisms are (a) some respondents have a tendency to respond on only one response side or the other (i.e., the positive side or the negative side), (b) some have a tendency to not choose extremes or choose mainly extremes, (c) that some respondents confuse an “equally balanced” response (e.g., between Agree and Disagree) with a “don’t know/does not apply” response (DeVellis, 2017), and (d) that, under some circumstances, it has been found to be better to collapse the alternatives into just two categories (Kaiser & Wilson, 2000). A particularly disturbing criticism is found in a

114  The Outcome Space

paper by Andrew Maul (2013), where he throws into question even the assumption that the stems (i.e., the questions or statements) are needed for Likert-style items?! For psychometricians, probably the most common criticism is that the use of integers for recording the responses (or, even, as noted earlier, as the options themselves) gives the impression that the options are “placed’ at equal intervals (e.g., Carafio & Perla, 2007; Jamieson, 2005; Ubersax, 2006). This then gives the measurer a false confidence that there is (at least) interval-level measurement status for the resulting data, and hence one can proceed with confidence to employ statistical procedures that assume this (e.g., linear regression, factor analysis). There is also a literature on the robustness of such statistical analyses against this violation of assumptions, dating back even to Likert (1932–1933) himself, but with others making similar points over the years (cf. for example, Glass et al., 1972; Labovitz, 1967; Traylor, 1983). An alternative is to build into each set of options meaningful statements that give the respondent some context in which to make the desired distinctions. The aim here is to try and make the relationship between each item and the overall scale interpretable. This approach was formalized by Guttman (1944), who created his scalogram technique (also known as Guttman scaling): If a person endorses a more extreme statement, he should endorse all less extreme statements if the statements are to be considered a [Guttman] scale. . . . We shall call a set of items of common content a scale if a person with a higher rank than another person is just as high or higher on every item than the other person. (Guttman, 1950, p. 62) To illustrate this idea, suppose there are two dichotomous attitude items that form a Guttman scale, as described by Guttman. If Item B is more extreme than Item A, which, in our terms, would mean that Item B was higher on the construct map than Item A (see Figure 4.9), then the only possible Likert-style responses in the format (Item A, Item B) would be the following: (a) (Disagree, Disagree) (b) (Agree, Disagree) (c) (Agree, Agree) That is, a respondent could agree or disagree with both, or could agree with the less extreme Item A and disagree with the more extreme Item B or agree with both. But the response (d) (Disagree, Agree)

FIGURE 4.9 

Sketch of a “Guttman scale.”

The Outcome Space  115

FIGURE 4.10 

Guttman’s example items.

Source: From Guttman (1944, p. 145)

would be disallowed. Consider now Figure 4.9, which is sketch of the (very minimal) meaning that one might have of a Guttman scale—note how it is represented as a continuum consistent with the term “scale,” but that otherwise it is very minimal (which matches the “minimal meaning”). It can be interpreted thus: if a respondent is below Item A, then they will disagree with both the items; if they are in between A and B, then they will agree with A but not B; and, if they are beyond B, they will agree with both. But there is no point where they would agree with B but not A. One can see that Guttman’s ideas are quite consistent with the ideas behind the construct map, at least as far as the sketch goes. Four items developed by Guttman using this approach are shown in Figure 4.10. These items were used in a study of American soldiers returning from the Second World War (Guttman, 1944)—the items were designed to be part of a survey to gauge the intentions of returned soldiers to study (rather than work) after being discharged from their military service. The logic of the questions is as follows: if I turn down a good job and go back to school regardless of help, then I will certainly make the same decision for a poor job or no job. Some of the questions have more than two categories, which make them somewhat more complicated to interpret as Guttman items, but nevertheless, they can still be thought of in the same way. Note how for these items, (a) the questions are ordered according to the construct (i.e., their intentions to study); and

116  The Outcome Space

(b) the options to be selected within each item are   (i)  (ii) (iii) (iv)

clear and distinct choices for the respondents, related in content to the construct and also to the question, also ordered in terms of the construct (i.e., their intentions to study), and not necessarily the same from question to question.

It is these features that we will concentrate on here. In the next paragraphs, this idea, dubbed as “Guttman-style” items by Wilson (in press), will be explored in terms of our Example 2 (The Researcher Identity Scale RIS). As noted in Chapter  2, the RIS Project developed the Researcher Identity Scale (RIS) construct map in Figure 2.4 (Example 2). Following the typical attitude survey development steps, the developers made items following the Likert response format approach—see Figure 4.11 for some examples. Altogether, they developed 45 Likert response items, with 6 response categories for each item, as shown in Figure 4.11 (i.e., Strongly Disagree, Disagree, Slightly Disagree, Slightly Agree, Agree, and Strongly Agree). The SFHI researchers then decided to transform the Likert-style items to Guttman-style items. The trick of doing this is that the stems of the Likert-style items (i.e., the first column in Figure 4.11) become the options in the Guttmanstyle items. This means that each Guttman-style item may correspond to several Likert-style items (Wilson et al., 2022). To do this, the stems of the Likert-style items (i.e., the first column in Figure 4.11) were grouped together based on the researcher’s judgment of the similarity of their content and their match to the RIS construct map waypoints to create Guttman-style sets of ordered response options (see Figure 4.12). For not every case were there Likert stems that matched the complete RIS set of waypoints, so the researchers had to create some new options to fill the gaps. The resulting Guttmanstyle response options were placed in order based on the following: (a) The theoretical levels of the construct map that they were intended to map to (b) Empirical evidence of how students responded to the items in earlier rounds of testing

I am a member of a research community

O

O

O

O

O

O

I am a part of a group of researchers.

O

O

O

O

O

O

I am an important part of a group of researchers

O

O

O

O

O

O

FIGURE 4.11 Some

Likert-style items initially developed for the Researcher Identity Scale (RIS).

The Outcome Space  117

G4. Which statement best describes you? (a) I don’t consider myself a part of a research community. (b) I am beginning to feel like a part of a research community. (c) I am a small part of a research community. (d) I am a part of a research community. (e) I am an important part of a research community. FIGURE 4.12 

An example item from the RIS-G (Guttman response style item).

To see an illustration of this, compare the Likert-style items shown in Figure 4.11 with options (c), (d), and (e) for the Guttman response format item in Figure 4.12. This example shows, indeed, the matched set of these three Likert response format items with one Guttman response format item. As there were not any matching items for the two lower levels among the Likert items, two more options were developed for the Guttman item—options (a) and (b) in Figure 4.12. The project developed 12 Guttman response format items and collected validity evidence for the use of the instrument (Morell et al., 2021) based on 21 of the Likert-style items. Eleven of the 12 have at least one Likert-style option that the Guttman options were designed to match to, with 21 matching levels in all, out of a total possible 60 across all 12 Guttman response format items (so, approximately two-thirds of the Guttman options were new). Details of the matching of the 21 Likert-style items with the 12 Guttman-style items are given in Appendix G. Comparing the Likert-style item set to the Guttman-style item set, the SFHI researchers found the following: (a) The Guttman-style set gave more interpretable results (see details of this in Section 8.2.1). (b) The reliabilities were approximately the same, even though there were fewer Guttman-style items (45 versus 12), resulting in an equivalency of approximately 3.75 Likert-style items to each Guttman-style item (Wilson et  al., 2022). (c) Respondents tended to be slower in responding to each Guttman-style item, though the equivalence in (b) balances that out (Wilson et al., 2022). Of course, one does not have to develop a set of Likert-style items first in order to get to the Guttman-style items—that was the case in the SFHI project and was chosen to share here as it is a useful account of how to change the numerous Likert-style instruments into Guttman-style instruments. However, in fact the account shows the relationship quite clearly—the Gutman-style items can be thought of as sets of Likert-style items, where the vague Likert-style options have been swapped out for the more concrete and interpretable Guttman-style stems.

118  The Outcome Space

4.4 A Unique Feature of Human Measurement: Listening to the Respondents What can respondents tell us about the items?

A crucial step in the process of developing an instrument, and one unique to the measurement of human beings, is that the measurer can ask the respondents what they are thinking about when responding to the items. In Chapter 8, evaluative use of this sort of information is presented as a major tool in gathering evidence for the validity of the instrument and its use. In this section, formative use of this sort of information is seen as a tool for improving the instrument, and in particular, the items of the instrument. There are two major types of investigations of response processes: the “think-aloud” and the “exit interview.” Other types of investigation may involve reaction time studies, eye movement studies, and various treatment studies, where, for example, the respondents are given certain sorts of information before they are asked to respond. In the think-aloud style of investigation, also called “cognitive labs” (Willis, 1999), students are asked to talk aloud about what they are thinking and feeling while they are actually responding to the item. What the respondent says is recorded, often what they do is being videotaped, and other characteristics may be recorded, such as having their eye movements tracked. A professional should be at hand to prompt such self-reports and also to ask clarifying questions if necessary. Typically, respondents need a certain amount of familiarization to know what it is that the researcher is interested in and to feel comfortable with the procedure. The results can provide insights ranging from the very plain—such as “the respondent was not thinking about the desired topic when responding”—to the very detailed, including evidence about particular cognitive and metacognitive strategies that they are employing. Of course, the scientific value of self-reported information should be questioned (e.g., Brener et al., 2003)—that is, there is no guarantee that the respondent can observe their own thoughts. However, in this circumstance, the evidence from the think-aloud should be considered more like forensic evidence in a criminal investigation—not directly indicating what has really happened (mentally) but giving the measurer clues that can be linked to reasonable hypotheses about what is going on in the hidden mental world of the respondent. A sample think-aloud protocol is provided in Appendix H which was developed as a part of the LPS project (Example 4; Morell et  al., 2017). Note the preamble to the whole process, designed to set the respondent at ease, and provide them with disclosure of the aims of the exercise.2 Then the interviewer demonstrates what they want the respondent to do using a similar context (i.e., describing how the student would get to their school office), for the respondent to become familiar with the procedures. Other familiarization procedures are also

The Outcome Space  119

commonly used, such as conducting a “dry run” with an item similar to those in the actual study set. The exercise then proceeds through each item to be investigated. In this case, the cognitive lab was being recorded on a verbal recording device, but it may also be video-taped, or recorded via a video-conferencing system (although care must be taken with these to make sure that the respondent is not identified in the stored data). One additional good idea is to have a silent observer present during the entire activity to assist in the interpretation and evaluation of the results with the interviewer. Four types of information are usually produced from these procedures: (a) Process records that show what a student does as they solve the item (i.e., verbal and/or video recordings) (b) Products by the respondent (i.e., both the students’ responses to the item and students’ jottings, etc., while responding to the item) (c) Introspective reports from the respondent (i.e., respondents’ comments as they attempt to respond) (d) Retrospective information (i.e., respondents’ comments after they have completed the item). Regarding the fourth source of information, the prompts for this were not included in the protocol, but the researchers noted that the interviewers asked questions based on events that arose during the think-aloud protocol. For example, we may have asked a process question (“How did you solve that?”) when the student did not adequately verbalize. Or we may have asked a design question (“Was there anything that confused you?”) when a student spent several minutes on a sub-section of an item. (Johnstone, et al., 2006, p. 7) The data from these data sources are then coded to record the relevant aspects of the information, and to allow accumulation and comparison across respondents. Each development project will need to create its own version of a thinkaloud coding sheet according to the information they wish to glean from the think-aloud process. Some of the relevant pieces of information that will likely be valuable to record are as follows: (i) How accessible the questions were for the respondents. (ii) Whether the items appeared to be biased for (or against) certain subgroups (this, of course, depends on their being representatives of these groups among the sampled respondents, as noted earlier).

120  The Outcome Space

(iii) Whether the respondents found the items to be simple and clear, and whether the item had intuitive instructions and procedures. (iv) Whether the text of the item was readable and comprehensible. (v) Whether the presentation of the items was legible. (vi) Whether anything was left out. Of course, there will be other issues that arise in each different situation, but this is a useful starting point for thinking about this issue. An example of an early version of an item from the LPS project is shown in Figure 4.13 (Morell et al., 2021). Figure 4.14 shows information gathered during a think-aloud session based on this item. The item relates to a different construct map used in the LPS project than the one shown in Chapter  2, for Ecology3 (Dozier et  al., 2021). Some appreciation of the typical range of responses one gets from a think-aloud exercise can be gained by reading through the various comments recorded in Figure 4.14—they range from helpful corrections for poor wordings, to insights into how students are misunderstanding the ideas involved, to insights into how students are misunderstanding the representation of the ideas in the item figure, to misalignment between item vocabulary and student reading level. In the revisions to this item, the project: (a) revised the preamble to read “The picture below shows the changes in an ecosystem over time after a big fire. Scientists call this ecological succession” and

FIGURE 4.13 

An early version of a LPS Ecology item.

The Outcome Space  121

Extract from Interviewer Notes for “Succession” 102, 106: Repetitive statement: “The picture below shows the process of ecological succession after a big fire, which occurs after an ecosystem experiences a disturbance (like a big fire).” 201: Student seemed to focus only on the most obvious change (bare forest to smaller trees to taller trees), without examining the figure more closely for other “smaller” changes. Question design may not facilitate close/detailed examination of the figure. Maybe we can ask something like “Several patterns of change can be observed in the above figure. Describe some of these patterns.” 202: The word “hardwood” might be unfamiliar to students. Student also said that it was confusing why the vegetation would keep changing from one type of plant to another, especially how pine would turn into hardwood (i.e., student does not understand what a succession is). He thought that the pine trees might have died out (because of a drought) and then hardwood trees grew over the land. But student knew that the grids were showing the same location. 202: Student did not seem to notice the smaller patterns/changes besides the change in plant growth (bare to smaller trees to bigger trees), despite multiple prompting. 110: Student thinks the plants over the years are the same plants just growing bigger (less bushy to more bushy). These words were confusing in the Succession task 102, 105 Ecological succession 104 Succession 106 Disturbance, ecological (but the student said they weren’t that hard) FIGURE 4.14 

Example notes from a think-aloud session.

(b) modified the representation to make clearer that the panels in the figure were views (snapshots) of the same place over time (i.e., by adding an element that is common to all of them). The exit interview is similar in aim to the think-aloud but is timed to occur after the respondent has completed their item responses. It may be conducted after each item, or after the instrument is completed, depending on whether the measurer judges that the delay will or will not interfere with the respondent’s memory. The types of information gained will be similar to those from the think-aloud, though generally it will not be so detailed. This is not always a disadvantage, as sometimes it is the respondent’s reflections which are desired. Thus, it may be the case that a data collection strategy that involves both think-alouds and exit-interviews will be the best. Information from respondents can be used at several points along the instrument development process as detailed in this book. Reflections on what the respondents say can lead to wholesale changes in the idea of the construct, revisions of the construct facet, the secondary facets, specific items and item types, and changes in the outcome space and/or scoring guides (this last to be described

122  The Outcome Space

in the next chapter). It is difficult to overemphasize the importance of including procedures for tapping into the insights of the respondents in the instrument development process. A  counterexample is useful here—in cognitive testing of babies and young infants, the measurer cannot gain insights in this way, and that has required the development of a whole range of specialized techniques to make up for such a lack. There are ethical considerations that should be considered when the objects of your measurement are human beings. The measurer is obliged to make sure that the items do not offend or elicit personal information that might be detrimental to the respondent, ask them to carry out unlawful or harmful procedures, or unduly distress them. The steps described in the previous section will prove very informative of such matters, and the measurer should heed any information supplied by the respondents supply and make subsequent revisions of the items based on these insights. But simply noting such comments is not sufficient. There should be prompts that are explicitly aimed at addressing these ethical issues, as the respondents may think that such comments are “not wanted” in the thinkaloud and exit interview processes. For example, to investigate whether items are offensive to potential respondents, it is useful to assemble a group of people who are seen as representing a broad range of potential respondents (these groups have various titles, such as a “community review panel”). The specific community demographic categories that should be represented in the group will vary depending on the instrument and its audience, but likely demographic variables would be age, gender, ethnicracial identity, socioeconomic status, sexual orientation, language status, etc., as well as specific groups relevant to the construct under measurement. This group is then asked to examine each item individually and the entire set of items as a whole to bring to your attention any concerns regarding issues mentioned in the previous paragraphs or any other reasons that they feel are important. Of course, it is up to the measurer to decide what to do with such recommendations, but they should have a justification for not following any such suggestions. 4.5  When Humans Become a Part of the Item: The Rater How can raters be a part of measurement?

In the majority of the different item formats described in Section 3.3.3, there will be a requirement for the responses to be judged, or rated, into categories relating to the waypoints in the construct map. Even for the selected response format, it was recommended that the development process include a constructed response initial phase. It may be possible in some cases to use machine learning to either assist or replace the human element in large-scale measurement situations, but most situations still need an initial phase that will require human rating to gather

The Outcome Space  123

a training sample for the machine learning to work. Thus, an essential element of many instrument development efforts will require the use of human raters of the responses to the items, and this requirement needs to be considered at the design stage. The preceding chapter, on the Items Design, describes important aspects about how to create items, and these aspects will also have to be considered when designing items that will require ratings of the responses—for example, the prompts in the items will need to be designed to generate enough response materials for the raters to make their judgments. In designing items, it is important to be aware that open-ended items are not without their drawbacks. They can be expensive and time-consuming to take, code and score, and inevitably introduce a certain amount of subjectivity into the scoring process. This subjectivity is inherent in the need for the raters to make judgments about the open-ended responses. Guidelines for the judgments cannot encompass all possible contingencies and therefore the rater must make judgment with a certain degree of consistency. But counter to this flaw, it is the judgment that offers the possibility of a broader and deeper interpretation of the responses and hence the measurements. Failures to judge the responses in appropriate categories may be due to several factors related to the rater, such as fatigue, failure to understand and correctly apply the guidelines, distractions due to matters such as poor expression by the respondent, and distractions due to recent judgments about preceding responses. A traditional classification of the types of problematic patterns that raters tend to exhibit is described in the next three paragraphs (e.g., Saal et al., 1980). Rater severity or leniency is a consistent tendency by the rater to judge a response into a category that is lower or higher, respectively, than is appropriate. Detection of this pattern is relatively straightforward when the construct has been designed as a construct map (as opposed to more traditional item-development approaches) as the successive qualitative categories implicit in the waypoint definitions give useful reference points for an observer (and even the self-observer). A halo effect may reveal itself in three different ways—they all involve how the rating of one response can affect the rating of another. The first may happen in the circumstance that a single response is judged to be judged against several subscales. The problem is that the judgment of one of the subscales may influence the judgment of another. A typical case is where the rater makes an overall determination across the whole set of subscales rather than attending to each of the subscales separately. The second type of halo effect arises when the rater forms an impression based on the person’s previous responses rather than scoring each response on its own merit. The third type of halo effect occurs between respondents—the response from an earlier respondent may influence the judging of a response from a later respondent. Restriction in range is a problematic pattern where the rater tends to judge the responses into only a subset of the full range of scorable categories. There

124  The Outcome Space

are several different forms of this: (a) central tendency is where the rater tends to avoid extreme categories (i.e., the judgments tend to be toward the middle of the range); (b) extreme tendency is the opposite, where the rater tends to avoid middle categories (i.e., the judgments tend to be toward the extremes of the range); and, of course, (c) severity or leniency could be seen as a tendency to restrict the range of the categories to low or high end of the range, respectively. The causes of these problematic patterns may differ as well. A rater may adopt a rating strategy that looks like central tendency because they adopt a “least harm” tactic—staying in the middle of the range reduces the possibility of grossly misscoring any respondent (which means that a discrepancy index used to check up on raters will not be very sensitive). However, a restriction of range, for example, to the low end, may be due to a failure on the part of the rater to see distinctions between the different levels. There are several design strategies that can be taken to avoid problematic patterns of judgment. One typical strategy is to provide extensive training so that the rater more fully understands the intentions of the measurer. Carrying out the ratings in the context of a construct map provides an excellent basis for such training—the waypoints make the definition of the construct itself much clearer, and, in turn, the exemplars4 make the interpretation of the waypoints much clearer also. This training should include (a) opportunities for raters to score sample responses with established ratings, and (b) scrutiny of the results of the ratings, as well as (c) repetitions of the training at appropriate intervals. There are many models of delivery of such training, but one useful approach is to have three foci: (a) provide general background—for example, in an online module; (b) small groups to work through judgment scenarios; and (c) individual coaching to address questions and areas of weakness. A second strategy is to have double or triple readings of a sample of responses. Differences between category judgments from different raters can then be considered in several ways to improve the ratings. Differences can be investigated by (a) discussion and mutual agreement among the set of raters, (b) comparing the individual ratings with the frequency pattern of rating categories across the set of raters, or (c) comparing each rating to that of the most expert of the raters. Of course, with this strategy, the lighter the sampling of responses, the less reliable this method will be in detecting raters with problematic patterns. A third strategy uses auxiliary information from another source of information to check for consistency with the rater’s judgment about a response. A rater would be considered severe if they tended to give scores that were lower than would be expected from other sources of information. If the rater tended to give higher scores, then that would be considered leniency. In Wilson and Case (2000), for example, where the instrument consisted of achievement items that were both selected response and constructed response, student’s responses to the set of selected response items were used to provide auxiliary information about a

The Outcome Space  125

student’s location on the construct in comparison to the rater’s judgments for the constructed response items. In a second example (Shin et al., 2019), where all of the items were constructed response, a machine-learning algorithm was used to provide auxiliary information about the ratings of responses. For an instrument that requires raters, the measurement developer needs to be aware of the considerations mentioned earlier, but there are many others besides that are dependent on the nature of the construct being measured and the contexts for those measurements. It is beyond the scope of this book to discuss all of the many such complexities, and the interested reader should look for support from the relevant literature. For example, a useful and principled description of these complexities in the context of educational performance assessments (e.g., written essays and teacher portfolios) is given in Engelhard & Wind (2017). 4.6 Resources

The development of an outcome space is a complex and demanding exercise. The scoring of outcome spaces is an interesting topic by itself—for studies of the effects of applying different scores to an outcome space, see Wright and Masters (1981) and Wilson (1992). Probably the largest single collection of accounts of outcome space examples is contained in the volume on phenomenography by Marton et  al. (1984), but also see the later collection by Bowden and Green (2005). The seminal reference on the SOLO Taxonomy is Biggs and Collis (1982); extensive information on using the taxonomy in educational settings is given in Biggs and Moore (1993) and Biggs (2011). The Guttman-style item is a relatively new concept, although based on an old idea—see Wilson et al. (2022) for the only complete account so far, although it is presaged in the first edition of this volume (Wilson, 2005). 4.7  Exercises and Activities

(Following on from the exercises and activities in Chapters 1–3) 1. For some of your items, carry out a phenomenographic study, as described in Section 4.3.1 to develop an outcome space. 2. After developing your outcome space, write it up as a scoring guide (e.g., Table 1.1) for your items, and incorporate this information into your construct map. 3. Log into BASS and enter the information about waypoints, exemplars, etc. 4. Carry out an Item Pilot Investigation as described in Appendix I. The analyses for the data resulting from this investigation will be described in Chapters 6–8. 5. Make sure that the data from your Pilot Investigation is entered into BASS. This will be automatic if you used BASS to collect the data. If you used

126  The Outcome Space

another way to collect the data (e.g., pencil and paper or another type of assessment deployment software), then use the “Upload” options to load it into BASS. 6. Try to think through the steps outlined above in the context of developing your instrument and write down notes about your plans and accomplishments. 7. Share your plans and progress with others. Discuss what you and they are succeeding in, and what problems have arisen. Notes 1 Note that the Data Modeling outcome space in Figure 1.9 and Appendix D could also have been used here. 2  The scope of this, of course, must be agreed to by your institutional IRB. 3 The details of that construct map are not needed to appreciate the observations about this example of think-aloud notes, but in case the reader is interested, one can look up that information via the Examples Archive in Appendix A. 4  That is, examples of typical responses at each waypoint.

5 THE WRIGHT MAP

[Warren] Buffett found it “extraordinary” that academics studied such things. They studied what was measurable, rather than what was meaningful. —Roger Lowenstein (2008)

5.1  Combining Two Approaches to Measurement How can the score-focused and the item-focused approaches be reconciled?

If you ask a person who is a non-professional in measurement: “What is the relation between what we want to measure and the item responses?” their answers will usually be spoken from one of two different viewpoints. One viewpoint focuses on the items. For example, in the context of the PF-10, they might say “If a respondent says that their vigorous activities are ‘limited a lot,’ then that means they have less physical functioning,” or “If someone can’t walk one block then they are clearly in poor health.”

DOI: 10.4324/9781003286929-7

128  The Wright Map

A second point of view will consider ways of summarizing the respondents’ responses. For example, they might say “If someone answers ‘limited a lot’ to most of the questions, then they have poor physical capabilities,” or “A person who scores high on the survey is in good physical health.” Usually, in this latter case, the idea of a “score” is the same as what people became accustomed to when they were students in school, where the individual item scores are added to give a total (which might then be presented as a percentage instead, in which case, the total score is divided by the maximum score to give the percentage). These two types of answers are indicative of two different approaches to measurement frequently expressed by novice measurers. Specifically, the first approach focuses on the items, and their relationship to the construct. I call this the itemfocused approach. The second approach focuses on the respondents’ scores, and their relationship to the construct. I call this the score-focused approach. The two different points of view have different histories1—a very brief sketch of each is given subsequently. The Item-focused Approach: Parts of the history of the item-focused approach have already been described in the preceding chapters. By this point, it should be clear that the item-focused approach has been the driving force behind the first three building blocks—the construct map, the items design, and the outcome space. The item-focused approach was made formal by Guttman (1944, 1950) in his Guttman Scaling technique as was described in Section 4.3.3. However, the story does not end there. While Guttman’s logic leads to a straightforward relationship between the two sides of the construct map, as shown in Section 4.3.3, the use of Guttman scales has been found to be severely compromised by the problem that in practice there are almost always large numbers of response patterns in the data that do not conform to the strict Guttman requirements. For example, here is what Irene Kofsky had to say, drawing on extensive experience with using the Guttman scale approach in the area of child development psychology: [T]he scalogram model may not be the most accurate picture of development, since it is based on the assumption that an individual can be placed on a continuum at a point that discriminates the exact [emphasis added] skills he has mastered from those he has never been able to perform. . . . A better way of describing individual growth sequences might employ probability statements


For example, here is what Kofsky had to say, drawing on extensive experience with using the Guttman scale approach in the area of child development psychology:

[T]he scalogram model may not be the most accurate picture of development, since it is based on the assumption that an individual can be placed on a continuum at a point that discriminates the exact [emphasis added] skills he has mastered from those he has never been able to perform. . . . A better way of describing individual growth sequences might employ probability statements about the likelihood of mastering one task once another has been or is in the process of being mastered.
(Kofsky, 1966, pp. 202–203)

Thus, to successfully integrate the two aspects of the construct map, the issue of response patterns that are not strictly in the Guttman format must be addressed, and Kofsky gives us a hint—what about using probability as a "smoothing agent"?

The Score-focused Approach: The intuitive foundation of the score-focused approach is what might be called intuitive test theory (Braun & Mislevy, 2004, p. 6), where there is an understanding that there needs to be some sort of aggregation of information across the items, but the means of aggregation is either left vague or assumed, on the basis of historical precedent, to be the summation of item scores. This simple score theory is more like a folk theory, but it nevertheless exerts a powerful influence on intuitive interpretations.

The simple score-focused approach has been formalized as classical test theory (CTT—also known as true score theory), and intuitive test theory has been subsumed into it. The statistical aspects of this approach have their foundation in earlier statistical ideas from astronomical data analysis and were laid out by Francis Edgeworth (1888, 1892) and Charles Spearman (1904, 1907) in a series of papers. They borrowed a perspective from the fledgling statistical science of the time and posited that an observed score on the instrument, X, was composed of the sum of a "true score" T and an "error" E:

X = T + E.  (5.1)

One way to think of T is as part of a thought experiment where it would be the long-term average score that the respondent would get over many re-takings of the instrument, assuming the respondent could be "brainwashed" to forget all the preceding ones.2 The "error" is not seen as something inherently wrong (as implied by the term itself!), but simply as what is left over after taking out the true score. Thus, the error is what is not represented by T—in this approach it is "noise." Formally, it is assumed that (a) the error is normally distributed with a mean of zero, and (b) different errors are independent of one another, as well as of the true score, T (e.g., Finch & French, 2019, pp. 29–42; Nunnally & Bernstein, 1994, pp. 215–247).

This was found to explain a phenomenon that had been observed over many empirical studies: some sets of items seemed to give more consistent results than other sets of items; specifically, larger sets of items (e.g., longer tests) tend to give greater consistency (as embodied in the Spearman–Brown Prophecy Formula; Brown, 1910; Spearman, 1910).
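To get a feel for Equation 5.1 and for the Spearman–Brown phenomenon, the following minimal simulation sketch may help (written in Python; the sample size and variances are arbitrary values chosen for illustration and are not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_respondents = 5000
true_sd, error_sd = 1.0, 1.0        # equal variances imply a reliability of 0.5 for one form

# Classical test theory: observed score X = true score T + error E
T = rng.normal(0.0, true_sd, n_respondents)
X1 = T + rng.normal(0.0, error_sd, n_respondents)   # one form of the instrument
X2 = T + rng.normal(0.0, error_sd, n_respondents)   # a parallel form with independent error

# Reliability estimated as the correlation between two parallel forms
rho_single = np.corrcoef(X1, X2)[0, 1]

# "Instruments" twice as long: sum each form with another parallel form; errors partly cancel
Y1 = X1 + (T + rng.normal(0.0, error_sd, n_respondents))
Y2 = X2 + (T + rng.normal(0.0, error_sd, n_respondents))
rho_double = np.corrcoef(Y1, Y2)[0, 1]

def spearman_brown(rho, k):
    """Predicted reliability of an instrument k times as long as one with reliability rho."""
    return k * rho / (1 + (k - 1) * rho)

print(f"single-length reliability:  {rho_single:.2f}")   # about 0.50
print(f"double-length reliability:  {rho_double:.2f}")   # about 0.67
print(f"Spearman-Brown prediction:  {spearman_brown(rho_single, 2):.2f}")
```

Note that the simulated observed scores contain no items at all; that is exactly the feature of the classical approach taken up in the discussion below.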


The explanation that Spearman found for the phenomenon has to do with what is called the reliability coefficient, which is essentially the relation between two forms of the instrument constructed to be equivalent (this is further discussed in Chapter 7). The introduction of an error term, E, also allows for a quantification of inconsistency in the observed scores.3

However, these advantages of the score-focused approach come at a price: the items themselves are absent from the model. Look at Equation 5.1: there are no items present in this equation! If we think of CTT as a model for the scoring, then it is a model that ignores the fundamental architecture of the instrument as being composed of items. Moreover, the use of the raw score as the framing for the true score, T, means that every time we make a change in the item set used (adding or deleting an item, for instance), there is a new scale for the true score! This can be worked around technically by equating, but that imposes a heavy burden, including gathering large data sets for each new item set.

One might ask, given this obvious limitation, why is the use of the CTT approach so common? There are two main answers to this question. The first is that, in the century following the pioneering work described in the previous paragraphs, measurement experts have developed numerous "work-arounds" that essentially add extra assumptions and data manipulations into the CTT framework. These allow, among other extensions, ways to add item-wise perspectives, such as was initially proposed by Spearman and Brown. The shortcoming of this approach is that just about every such extension makes its own set of extra assumptions, so that real applications, where several extensions will likely be employed, end up with a mess of assumptions that are very seldom considered in the application.

The second, and more problematic, answer is that we in the social sciences have accustomed ourselves to using a methodological approach that does not make explicit the connections between the content of the instruments (as usually expressed in a traditional instrument blueprint) and the statistical model used to analyze the resulting data. The connection is lost when the item responses are summed up into the raw scores—there is no ready way to track them back empirically. This is aided and abetted by the Spearman–Brown formula itself, which tells us that we can get higher reliabilities by adding more items on the same general topic; hence the measurement developer using the CTT approach can usually attain a "high enough" reliability just by adding more generic items, without having to worry about the items' detailed relationships with the underlying construct. In contrast, the item-focused approach has been the driving force behind the first three building blocks. Hence, if we adhered to the framing provided by the classical test theory approach, the efforts that have been expended on the first three building blocks might be in vain.

In summary, each of the approaches can be seen to have its virtues:

(a) Guttman scaling (the item-focused approach) focuses attention on the meaningfulness of the results from the instrument, through the meaningful way that the items might match to the construct map—that is, its validity


(b) Classical test theory (the score-focused approach) models the statistical nature of the scores and focuses attention on the consistency of the results from the instrument—that is, what we will define subsequently as its reliability.

There has been a long history of attempts to reconcile these two approaches. One notable early approach is that of Louis Thurstone (1925). Thurstone clearly saw the need to have a measurement model that combined the virtues of both, and sketched out an early solution, which is illustrated in Figure 5.1. In this figure, the curves show the cumulative percentage of students who got each item correct in a test, plotted against the chronological ages of the students (in years). To see this, select any item represented by one of the curves, and notice how the percentage of students getting it correct increases with age. Now select a chronological age, and notice how the percentage of students answering correctly differs by item: easier items at that age (e.g., #6 and #11) will have a higher percentage, and more difficult ones (e.g., #60 and #65) will have a lower percentage.

The ordering of the curves (i.e., the curves for the different items) in Figure 5.1 is essentially the ordering that Guttman was looking for, but with one exception—Thurstone was using chronological age as a stand-in for the respondent's location on the construct map (i.e., as a stand-in for the construct itself). Note, though, that the fact that they are curves rather than vertical lines (which would correspond to the sort of abrupt transitions that Guttman envisioned) reflects a probabilistic way of thinking about the relationship between the score and success on this construct, and this can be taken as a response to Kofsky's suggestion about using a probability-based approach.

FIGURE 5.1 Thurstone's graph of student success on specific items from a test versus chronological age. Source: From Thurstone (1925, p. 444).


Unfortunately, this early reconciliation remained an isolated inspired moment for many years.

Thurstone also went beyond this to outline a further requirement for a measurement model:

A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement.
(Thurstone, 1928, p. 547)

This too was an important contribution—demanding that the scale must function similarly regardless of the sample being measured. This observation was generalized by Georg Rasch (1961), who added a similar requirement for the items:

The comparison between two stimuli [items] should be independent of which particular individuals [respondents] were instrumental for the comparison. . . . Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison.
(Rasch, 1961, pp. 331–332)

He referred to these as requirements for specific objectivity and made this the fundamental principle of his approach to measurement.

The BEAR Assessment System (BAS) approach adopted in this book is intended as a reconciliation of these two basic historical tendencies. The statistical formulation for the Rasch model is founded on the work of Georg Rasch (1960/1980), who was the first to point out the important qualities of the model that bears his name, the Rasch (statistical) model (which is described in the next section). The practical usefulness of this model for measuring was extended by Benjamin Wright (1968, 1977) and Gerhard Fischer (see Fischer & Molenaar (1995) for a thorough summary of Fischer's contributions).4

The focus of this book is on developing an introductory understanding of the purpose and mechanics of a measurement model. With that goal in mind, construct modeling has been chosen as a good starting point, and the Rasch statistical model has been chosen due to how well it supports the implementation of the construct map idea.


Note that it is not intended that the measurer will learn all that is needed to appreciate the wide-ranging debates concerning the respective models and estimation procedures by merely reading this book. This book is only an introduction, and the responsible measurer will need to go further (see suggestions in Chapters 9 and 10).

5.2  The Wright Map

What is the conceptual tool that can help with the reconciliation?

The Wright map is the fourth building block in the BEAR Assessment System (BAS). It has already been briefly introduced in Chapter 1, and its relationship to the other building blocks was illustrated there too—see Figure 5.2. In this chapter, it is the main focus.

In metrological terms, calibration is defined as follows:

An empirical and informational process performed on a measuring instrument for establishing a functional relation between the local scale of the instrument and a public scale.
(Mari et al., 2023, p. 299)

In the construct-mapping context that is developed here, the "local scale of the instrument" is the vector of item scores for the respondents, and the "public scale" is the publicly interpretable version of the logit scale, which was illustrated in the MoV example in Figure 1.11. This process occurs in two steps: (a) the item scores over a sample of respondents are used to estimate the respondent and item parameters on a scale using a statistical model, and then (b) the correspondence between the locations of the items on that scale and the waypoints of the construct map is used to establish public references for the scale (Mari et al., 2023).

FIGURE 5.2 The four building blocks in the BEAR Assessment System (BAS).


The representational device that is used to articulate these two aspects of the process is the Wright map. The following sections will (a) introduce the statistical model that is used to provide the underlying scale for the process, (b) explain the representation of the scale via the Wright map, and (c) elucidate how these can be connected back to the construct map, thus informing the scale with public references. As mentioned in the previous section, the Rasch model will be used as the statistical model for this purpose.

5.2.1  The Rasch Model

What is the first step to the Wright map?

The first step of the Wright map building block is to gather a sample of responses to the items that comprise the instrument. At the initial iteration through the BEAR Assessment System (BAS), the instrument should be seen as tentative, with the possibilities for modifying items, deleting items, and adding items still to be explored. Nevertheless, the data that is gathered is seen as the best available for the current purposes, so it will be subjected to a statistical analysis to try to establish the basis for a measurement scale.

The Rasch model relates specifically to dichotomous items, that is, items scored into just two categories. In attitude instruments, this might be "Agree" versus "Disagree"; in achievement testing, this might be "Right" versus "Wrong"; or, more broadly, in surveys, it might be "Yes" versus "No"—of course, there are many other dichotomous labels for the categories. In this chapter, without loss of generality, we will assume that they are scored as "1" and "0," and that the scores for item stems that are negatively related to the construct map have been reverse-coded. There are many situations where dichotomous coding is not suitable, and the Rasch model can be extended to deal with them; this will be explored in the next chapter.

In the Rasch model, the probability of the item response for item i, Xi, is modeled as a function of the respondent location θ (Greek "theta") and the item location δi (Greek "delta"), where the location is conceptualized as being along the common scale of ability (or attitude, etc., for respondents) and difficulty (for items). Thus, the concern about the limitations of total raw scores mentioned earlier is avoided. In achievement and ability applications, the respondent location will usually be termed the "respondent ability" and the item location will be termed the item "difficulty." In attitude applications, these terms are not appropriate, so terms such as "attitude towards . . . (something)" or "propensity to endorse . . ." (for respondent location) and "item scale value" or "difficulty to agree with . . ." (for item response location) are sometimes used.


In order to be neutral across areas of application, the terms used here in this section are "respondent location" and "item location"—this is also helpful in reminding the reader that these parameters will have certain graphical interpretations in terms of the construct map and, eventually, the Wright map.

To make this more specific, suppose that the item has been scored dichotomously as "1" or "0" ("Right"/"Wrong," "Agree"/"Disagree," etc.). That is, Xi = 1 or 0. The logic of the Rasch model is that the respondent has a certain "amount" of the construct, indicated by θ, and that an item also has a certain "amount" of the construct, indicated by δi. However, the two amounts work in opposite directions—hence the difference between the respondent and the item, θ − δi, is what counts: one can imagine that the respondent's amount, θ, must be compared with the item's amount, δi, in order to find the probability of a "1" response (as opposed to a "0" response).

We can consider three situations (see Figure 5.3). In panel (a), when the amounts (θ and δi) are equal (e.g., at the same point on the Wright map in Figure 5.3), responses of "0" and "1" have the same probability—hence the probability of a response of "1" is 0.50. For instance, the respondent is equally likely to agree or disagree with the item for an attitude question; or, for an achievement question, they are equally likely to get it right as to get it wrong.

FIGURE 5.3 Representation of three possible relationships between respondent location and the location of an item. In (a) the item location (δi) is the same as the respondent location (θ), meaning the respondent's ability is equal to the item's difficulty; in (b) the item location is lower than the respondent location, meaning the respondent's ability is greater than the item's difficulty; in (c) the item location is higher than the respondent location, meaning the respondent's ability is less than the item's difficulty.


In panel (b), when the respondent has more of the construct than the item has (i.e., θ > δi), the probability of a "1" is greater than 0.50. Here the respondent is more likely to agree (for an attitude question) or to get it right (for an achievement question). In panel (c), when the item has more of the construct than the respondent has (i.e., θ < δi), the probability of a "1" is less than 0.50. Here the respondent is more likely to disagree (for an attitude question) or to get it wrong (for an achievement question).

To reiterate, in the context of achievement testing, for these three examples we would say that the "ability" of the respondent is (a) equal to, (b) greater than, or (c) less than the "difficulty" of the item. In the context of attitude measurement, we would say that (a) the respondent and the statement are equally positive, (b) the respondent is more positive than the item, and (c) the respondent is more negative than the item. Similar expressions would be appropriate in other contexts.

Note that these three situations depicted in Figure 5.3,

(a) θ = δi,  (b) θ > δi,  and  (c) θ < δi,

correspond to the relationships

(a) θ − δi = 0,  (b) θ − δi > 0,  and  (c) θ − δi < 0,

respectively. This allows one to think of the respondent and item locations as points on a line, where the difference between them is what matters. It is just one step beyond that to say that the probability of a particular response is a function of the distance between the respondent and item locations. In the specific case of the Rasch model, the probability of the response Xi = 1 for a respondent with location θ and an item with difficulty δi is

Probability(Xi = 1 | θ, δi) = f(θ − δi),  (5.2)

where f is a function that will be defined in the next few paragraphs, and θ and δi are included on the left-hand side to emphasize that the probability depends on both (the notation "|" indicates that the probability is conditional on the values of θ and δi). Graphically, we can picture the relationship between location and probability as in Figure 5.4: the respondent locations, θ, are plotted on the vertical axis, and the probability of the response "1" is given on the horizontal axis.


FIGURE 5.4 Relationship between respondent location (θ) and probability of a response of "1" for an item with difficulty 1.0.
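The particular shape of the curve in Figure 5.4 comes from the function f that the Rasch model uses, which is defined in the next few paragraphs. Anticipating that definition, the short sketch below, a Python illustration rather than anything from the book's software, computes the probability of a "1" for an item located at 1.0, as in Figure 5.4, at a few respondent locations, assuming the standard logistic form of the Rasch model:

```python
import math

def rasch_probability(theta, delta):
    """Probability of a response of '1' under the Rasch model: a logistic
    function of the difference between respondent location (theta) and
    item location (delta)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

delta_i = 1.0   # the item location assumed in Figure 5.4

# The three situations of Figure 5.3: theta equal to, above, and below delta
for theta in (1.0, 2.5, -0.5):
    p = rasch_probability(theta, delta_i)
    print(f"theta = {theta:+.1f}, delta = {delta_i:+.1f} -> P(X = 1) = {p:.2f}")
# Prints 0.50 when theta equals delta, about 0.82 when theta is 1.5 units above it,
# and about 0.18 when theta is 1.5 units below it.
```

Plotting such probabilities over the full range of θ values traces out the curve shown in Figure 5.4.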


To make it concrete, it is assumed in Figure 5.4 that the item location for Item i is δi = 1.0. Thus, at θ = 1.0, the respondent and item locations are the same, and the probability is 0.50 (check it in the figure yourself). As the respondent location moves above 1.0, that is, for θ > 1.0, the probability increases above 0.50; as the respondent location moves below 1.0, that is, for θ < 1.0, the probability decreases below 0.50. At the extremes, the relationship gets closer and closer to the limits of probability: As the respondent location moves way above 1.0, that is, for θ >> 1.0, the probability increases to approach 1.0; and as the respondent location moves way below 1.0, that is, for θ