Advanced Methods in Automatic Item Generation: Theoretical Foundations and Practical Applications
ISBN 9780367902933, 9780367458324, 9781003025634

Advanced Methods in Automatic Item Generation is an up-to-date survey of the growing research on automatic item generation (AIG) in today’s technology-enhanced educational measurement sector.


English · 233 [246] pages · 2021




Table of contents:
Cover
Half Title
Title Page
Copyright Page
Table of Contents
Preface
A Word of Thanks
Chapter 1: Introduction: The Changing Context of Educational Testing
The Problem of Scaling Item Development
Automatic Item Generation: An Augmented Intelligence Approach to Item Development
Benefits of Using AIG for Item Development
Purpose of This Book
References
Section 1: Basic Concepts Required for Generating Constructed- and Selected-Response Items
Chapter 2: Cognitive Model Development: Cognitive Models and Item Generation
Benefits of Using Cognitive Models For AIG
Developing Cognitive Models for AIG
A Word of Caution When Creating Cognitive Models
Two Types of Cognitive Models for AIG
Logical Structures Cognitive Model
Key Features Cognitive Model
References
Chapter 3: Item Model Development: Template-Based AIG Using Item Modelling
Layers in Item Models
Item Generation With 1-Layer Models
n-Layer Item Models
Item Generation With n-Layer Models
Two Important Insights Involving Cognitive and Item Modelling in AIG
Non-template AIG: A Review of the State of the Art
Is It Preferable to Use a Template for AIG?
Note
References
Chapter 4: Item Generation: Approaches for Generating Test Items
The Importance of Constraint Coding
Logical Constraint Coding Using Bitmasking
Demonstration of Item Assembly Using the Logical Constraints Approach
Logical Structures Cognitive Model
Key Features Cognitive Model
Item Assembly Using the Logical Constraints Approach
References
Chapter 5: Distractor Generation: The Importance of the Selected-Response Item in Educational Testing
The Contribution of Distractors in the Selected-Response Item Format
Traditional Approach for Writing Distractors
Methods for Distractor Generation
Distractor Generation With Rationales
Distractor Pool Method With Random Selection
Systematic Distractor Generation
Note
References
Chapter 6: Putting It All Together to Generate Test Items: Overview
Mathematics Example Using the Logical Structures Model
Cognitive Model Development
Item Model Development
Item Generation Using Constraint Coding
Systematic Distractor Generation
A Sample of Generated Math Items
Medical Example Using Key Features
Cognitive Model Development
Item Model Development
Item Generation Using Constraint Coding
Systematic Distractor Generation
A Sample of Generated Medical Items
Chapter 7: Methods for Validating Generated Items: A Focus on Model-Level Outcomes
Substantive Methods for Evaluating AIG Models
Cognitive and Item Model Review Using a Validation Table
Distractor Model Review Using a Validation Table
Substantive Model Review Using a Rating Scale
Substantive Methods for Evaluating AIG Items
AIG versus Traditional Item Review: Item Quality
AIG versus Traditional Item Review: Predictive Accuracy
Statistical Methods for Evaluating AIG Items
Statistical Analyses of the Correct Option
Statistical Analyses of the Incorrect Options
Cosine Similarity Index (CSI)
The Key to Validating Generated Items
References
Section 2: Advanced Topics in AIG
Chapter 8: Content Coding: Challenges Inherent to Managing Generated Items in a Bank
Managing Generated Items With Metadata
Content Coding for Item Generation
Assembling Content Codes in Item Generation
Content Coding Examples
Logical Structures Mathematics Model
Key Features Medical Model
References
Chapter 9: Generating Alternative Item Types Using Auxiliary Information: Expanding the Expression of Generated Items
Generating Items With Symbols
Generating Items With Images
Generating Items With Shapes
Challenges With Generating Items Using Auxiliary Information
References
Chapter 10: Rationale Generation: Creating Rationales as Part of the Generation Process
Methods for Generating Rationales
Correct Option
Correct Option With Rationale
Correct Option With Distractor Rationale
A Cautionary Note on Generating Solutions and Rationales
Benefits and Drawbacks of Rationale Generation
References
Chapter 11: Multilingual Item Generation: Beyond Monolingual Item Development
Challenges With Writing Items in Different Languages
Methods for Generating Multilingual Test Items
Language-Dependent Item Modelling
Successive-Language Item Modelling
Simultaneous-Language Item Modelling
Example of Multilingual Item Generation
Validation of Generated Multilingual Test Items
References
Chapter 12: Conclusions and Future Directions
Is AIG an Art or Science?
Is It “Automatic” or “Automated” Item Generation?
How Do We Define the Word “Item” in AIG?
How Do You Generate Items?
What Is an Item Model?
How Do You Ensure That the Generated Items Are Diverse?
How Should Generated Items Be Scored?
How Do You Organize Large Numbers of Generated Items?
What Does the Future Hold for Item Development?
References
Author Index
Subject Index


Advanced Methods in Automatic Item Generation

Advanced Methods in Automatic Item Generation is an up-to-date survey of the growing research on automatic item generation (AIG) in today’s technology-enhanced educational measurement sector. As test administration procedures increasingly integrate digital media and Internet use, assessment stakeholders—from graduate students to scholars to industry professionals—have numerous opportunities to study and create different types of tests and test items. This comprehensive analysis offers thorough coverage of the theoretical foundations and concepts that define AIG, as well as the practical considerations required to produce and apply large numbers of useful test items.

Mark J. Gierl is Professor of Educational Psychology at the University of Alberta, Canada. He holds the Tier 1 Canada Research Chair in Educational Measurement.

Hollis Lai is Associate Professor of Dentistry in the Faculty of Medicine and Dentistry at the University of Alberta, Canada.

Vasily Tanygin is a full-stack software developer who has over a decade of experience creating AIG and educational assessment technologies. He graduated with a specialist degree in software systems development from Mari State Technical University, Russia.

Advanced Methods in Automatic Item Generation

Mark J. Gierl, Hollis Lai, and Vasily Tanygin

First published 2021 by Routledge 52 Vanderbilt Avenue, New York, NY 10017 and by Routledge Park Square, Milton Park, Abingdon, Oxon, OX14 4RN Routledge is an imprint of the Taylor & Francis Group, an informa business © 2021 Taylor & Francis The right of Mark J. Gierl, Hollis Lai, and Vasily Tanygin to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers. Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging-in-Publication Data Names: Gierl, Mark J., author. | Lai, Hollis, author. | Tanygin, Vasily, author. Title: Advanced methods in automatic item generation / Mark J. Gierl, Hollis Lai, and Vasily Tanygin. Identifiers: LCCN 2020050227 (print) | LCCN 2020050228 (ebook) | ISBN 9780367902933 (hardback) | ISBN 9780367458324 (paperback) | ISBN 9781003025634 (ebook) Subjects: LCSH: Educational tests and measurements--Mathematical models. | Educational Psychology. Classification: LCC LB1131 .G46 2021 (print) | LCC LB1131 (ebook) | DDC 370.15--dc23 LC record available at https://lccn.loc.gov/2020050227 LC ebook record available at https://lccn.loc.gov/2020050228 ISBN: 978-0-367-90293-3 (hbk) ISBN: 978-0-367-45832-4 (pbk) ISBN: 978-1-003-02563-4 (ebk) Typeset in Optima by SPi Global, India

Contents

Preface  vii
A Word of Thanks  x
1 Introduction: The Changing Context of Educational Testing  1
Section 1: Basic Concepts Required for Generating Constructed- and Selected-Response Items  21
2 Cognitive Model Development: Cognitive Models and Item Generation  23
3 Item Model Development: Template-Based AIG Using Item Modelling  42
4 Item Generation: Approaches for Generating Test Items  66
5 Distractor Generation: The Importance of the Selected-Response Item in Educational Testing  81
6 Putting It All Together to Generate Test Items: Overview  101
7 Methods for Validating Generated Items: A Focus on Model-Level Outcomes  120
Section 2: Advanced Topics in AIG  145
8 Content Coding: Challenges Inherent to Managing Generated Items in a Bank  147
9 Generating Alternative Item Types Using Auxiliary Information: Expanding the Expression of Generated Items  159
10 Rationale Generation: Creating Rationales as Part of the Generation Process  175
11 Multilingual Item Generation: Beyond Monolingual Item Development  186
12 Conclusions and Future Directions  206
Author Index  225
Subject Index  230

Preface

Our interest in automatic item generation (AIG) began in 2004. Like many educational testing researchers during that time, at the University of Alberta, we were interested in computer-based testing (CBT), and we wanted to contribute to the transition from paper to computer. Twenty years ago, creating a computer-based test was a challenging but feasible task. We had a strong theoretical foundation with item response theory; we had a clear understanding of the CBT architecture; we had the Internet. These resources allowed us to build and deploy functional CBT systems. The only missing link that was clearly needed, but unavailable, was an abundant supply of test items. Items were scarce—we felt lucky when our bank contained 200 items. But researchers and practitioners were optimistic that strategies could be put into place that would allow us to scale the traditional subject matter expert item development process in order to produce larger numbers of new items for our CBT systems. Fast forward to 2007. Our CBT research program evolved, and we were now focused on cognitive diagnostic assessment (CDA) within a CBT framework. A CDA is a computer-based test designed to measure the knowledge and skills required to solve items in a particular content area for the purpose of providing examinees with detailed feedback. One of the defining characteristics of a CDA is that the knowledge and skills are specified at a fine-grain size to magnify the cognitive processes that underlie test performance. Hence, CDAs required even larger numbers of items than most CBTs. This large supply of items was non-existent, but, again, there was plenty of optimism that the traditional item development approach could be used to produce the items we needed. It was not vii

uncommon to see published articles in the educational testing literature where one paragraph at the beginning of the manuscript contained a statement such as “to begin, assume you have a bank containing large numbers of items”, followed by 20 pages describing a sophisticated algorithm or a test design that could be used to administer these items. We were confident that modern testing methods could be powered using traditional item development practices. Looking back on much of the research that was conducted in the first decade of the 2000s, it is clear that item development was always a task for tomorrow or the task undertaken by a different research group. A tremendous amount of time and effort was spent on developing methods, procedures, and applications that could be used to administer test items in a CBT or CDA while little effort was spent on creating methods or refining procedures to produce the items that were needed to feed these sophisticated new methods with content. But by 2009, the realities of item development began to set in, and we concluded that the content so desperately needed for our sophisticated new educational testing methods did not exist and to produce this content using the traditional standards for practice would require more time, effort, and money than we could ever afford. Hence, our research program evolved yet again to address this challenge. Technology-enhanced item development became our new research area, and creating methods to generate items became our new research focus. Luckily, two important events occurred as we began our AIG research program that provided us with purpose and direction. First, we created an exemplary team of researchers. Hollis Lai became a PhD student at the University of Alberta in 2008. His background in psychology and computer science served as the catalyst for his interest in AIG. Hollis was an outstanding PhD student. He won three major scholarships and completed three psychometric internships during his doctoral studies. He also conducted a great deal of research on AIG and completed his dissertation on this topic in 2013. Vasily Tanygin joined our research team in 2012 as a developer and eventually served as the primary architect for our AIG software and applications. Vasily was also an excellent student who received his specialist degree with honors in software development from Mari State Technical University in 2008. In Russia, he served as a software developer working on diverse projects using web, desktop, and mobile platforms. In Canada, Vasily worked as our AIG developer in addition viii

to working as the software developer and system administrator in the Learning Assessment Centre at the University of Alberta. Second, Dr. Tom Haladyna and I co-edited the 2013 book Automatic Item Generation: Theory and Practice, which provided a comprehensive and up-to-date survey of the theory and practice of AIG. We documented the significant progress that has been made in both the theory and application since the publication of the only book on this topic—Item Generation for Test Development, edited by Sidney Irvine and Patrick Kyllonen in 2002. More importantly, the authors in our edited book highlighted areas where knowledge was lacking and where significant problems remain unaddressed. Our AIG research program focused on these problems. The current book documents our solutions to many of these problems.

A Word of Thanks

Our research program on AIG is funded by the Social Sciences and Humanities Research Council of Canada (SSHRC). The majority of the funds from SSHRC are used to hire students who work as research assistants during their graduate studies and allow us to present our research results. We thank SSHRC for their generous support. The outcome from an AIG project, if successful, is large numbers of test items. In the beginning, the items we generated were simply deleted. But as our research program matured and the quality of our generated items improved, testing organizations became more interested in working with us and using our generated items in their operational testing programs. This led to a period where we formed partnerships with different testing organizations. Since 2010, we have worked closely with government agencies, testing companies, and academic publishers to implement item generation principles produced from our research into their test development practices. Our collaborators included Donna Matovinovic, David Carmody, Marten Roorda, Jim Hogan, Teresa Hall, Richard Patz, Seth Stone, Renata Taylor-Majeau, Andrew Spielman, Ellen Byrne, David Waldschmidt, Marita MacMahon-Ball, Veronica Vele, Andrew Wiley, Aaron Douglas, Kristen Huff, Keith Boughton, Wim van der Linden, Glenn Milewski, Kim Brunnert, Barbara Schreiner, Judy Siefert, Ian Bowmer, Krista Breithaupt, Andre De Champlain, Claire Touchie, Debra Pugh, Andre-Phillip Boulais, Alexa Fotheringham, Tanya Rivard, Kimberly Swygert, Jean D’Angelo, Mike Jodoin, Brian Clauser, Craig Sherburne, Joanna Preston, Tom Haladyna, and Richard Luecht. Each of these collaborations also allowed us to work with and learn from many different subject matter experts. By x

our count, we have worked with more than 60 subject matter experts on AIG projects in the last 13 years. We thank all our collaborators and the subject matter experts for their insights and contributions. Our research is conducted by faculty and students who are affiliated with the Centre for Research in Applied Measurement and Evaluation—which is part of the Measurement, Evaluation, and Data Science Program—in the Department of Educational Psychology at the University of Alberta. Many faculty and former students have participated in and contributed to our AIG research, including Okan Bulut, Geoffrey Bostick, Kevin Eva, Andrea Gotzmann, Jiawen Zhou, Syed Fahad Latifi, Xinxin Zhang, Gregor Damnik, Cecilia Alves, Karen Fung, Qi Guo, Simon Turner, Curtis Budden, Stephanie Varga, Kaja Matovinovic, Andrew Turnbull, Gregory Chao, Tara Leslie, Gautam Puhan, Adele Tan, Bihua Xiang, Larry Beauchamp, Fern Snart, and Jennifer Tupper. Our two current PhD students—Jinnie Shin and Tahereh Firoozi—continue to work on AIG problems, and they provided invaluable feedback on an earlier version of this book. Jinnie Shin and Jungok Hwang conducted the translation for the Korean multilingual example presented in Chapter 11, which, in our opinion, turned out superbly. To our colleagues and students, we thank you for your outstanding work. Finally, we would like to thank our families for their support as we worked days, evenings, weekends, and, sometimes, holidays solving AIG problems. Mark Gierl would like to thank Jeannie, Markus, and Elizabeth. Hollis Lai would like to thank Xian. Vasily Tanygin would like to thank Anna and Polina. We appreciate you, and we could not have completed this work without you.

1 Introduction: The Changing Context of Educational Testing

The field of educational testing is in the midst of dramatic changes. These changes can be characterized, first and foremost, by how exams are administered. Test administration marks a significant and noteworthy paradigm shift. Because the printing, scoring, and reporting of paper-based tests require tremendous time, effort, and expense, it is neither feasible nor desirable to administer tests in this format. Moreover, as the demand for more frequent testing continues to escalate, the cost of administering paper-based tests will also continue to increase. The obvious solution for cutting some of the administration, scoring, and reporting costs is to migrate to a computer-based testing (CBT) system (Drasgow & Olson-Buchanan, 1999; Mills, Potenza, Fremer, & Ward, 2002; Parshall, Spray, Kalohn, & Davey, 2002; Ras & Joosten-Ten Brinke, 2015; Susanti, Tokunaga, & Nishikawa, 2020; Ziles, West, Herman, & Bretl, 2019). CBT offers important economic benefits for test delivery because it eliminates the need for paper-based production, distribution, and scoring. In addition, CBT can be used to support teaching and promote learning. For instance, computers permit testing on demand, thereby allowing students to take the exam on a more frequent and flexible schedule. CBTs are created in one central electronic location, but they can be deployed to students locally, nationally, or internationally. Items on CBTs can be scored immediately, thereby providing students with instant feedback while, at the same time, reducing the time teachers would normally spend on marking tests (Bartram & Hambleton, 2006; Drasgow, 2016; van der Linden & Glas, 2010; Wainer, Dorans, Eignor, Flaugher, Green, Mislevy, Steinberg,

& Thissen, 2000). Because of these important benefits, the wholesale transition from paper-to CBT is now underway. Adopting CBT will have a cascading effect that changes other aspects of educational testing, such as why we test and how many students we test. As the importance of technology in society continues to increase, countries require a skilled workforce that can make new products, provide new services, and create new industries. The ability to create these products, services, and industries will be determined, in part, by the effectiveness of our educational programs. Students must acquire the knowledge and skills required to think, reason, solve complex problems, communicate, and collaborate in a world that is increasingly shaped by knowledge services, information, and communication technologies (e.g., Ananiadou & Claro, 2009; Auld & Morris, 2019; OECD, 2018; Binkley Erstad, Herman, Raizen, Ripley, Miller-Ricci, & Rumble, 2012; Chu, Reynolds, Notari, & Lee, 2017; Darling-Hammond, 2014; Griffin & Care, 2015). Educational testing has an important role to play in helping students acquire these foundational skills and competencies. The 1990s marked a noteworthy shift, during which the objectives of testing were broadened to still include the historically important focus on summative outcomes, but a new focus on why we test was also added to include procedures that yield explicit evidence to help teachers monitor their instruction and to help students improve how and what they learn. That is, researchers and practitioners began to focus on formative assessment (see, for example, Black & Wiliam, 1998; Sadler, 1989). Formative assessment is a process used during instruction to produce feedback required to adjust teaching and improve learning so that students can better achieve the intended outcomes of instruction. Feedback has maximum value when it yields specific information in a timely manner that can direct instructional decisions intended to help each student acquire different types of knowledge and skills more effectively. Outcomes from empirical research consistently demonstrate that formative feedback can produce noteworthy student achievement gains (Bennett, 2011; Black & Wiliam, 1998, 2010; Hattie & Timperley, 2007; Kluger & DeNisi, 1996; Shute, 2008). As a result, our educational tests, once developed exclusively for the purposes of accountability and outcomes-based summative testing, are now expected to also provide teachers and students with timely, detailed feedback to support teaching and learning 2

(Drasgow, Luecht, & Bennett, 2006; Ferrara, Lai, Reilly, & Nichols, 2017; Nicol & Macfarlane-Dick, 2006; Nichols, Kobrin, Lai, & Koepfler, 2017; Pellegrino & Quellmalz, 2010). With enhanced delivery systems and a broader mandate for why we evaluate students, educational testing now appeals to a global audience, and, therefore, it also affects how many students are tested (Grégoire & Hambleton, 2009; Hambleton, Merenda, & Spielberger, 2005; International Test Commission Guidelines for Translation and Adapting Tests, 2017). As a case in point, the world’s most popular and visible educational achievement test—the Programme for International Student Assessment (PISA) developed, administered, and analyzed by the Organisation for Economic Cooperation and Development (OECD)—is computerized. The OECD (2019a, p. 1) asserted, Computers and computer technology are part of our everyday lives and it is appropriate and inevitable that PISA has progressed to a computer-based delivery mode. Over the past decades, digital technologies have fundamentally transformed the ways we read and manage information. Digital technologies are also transforming teaching and learning, and how schools assess students. OECD member countries initiated PISA in 1997 as a way to measure the knowledge, skills, and competencies of 15-year-olds in the core content areas of mathematics, reading, and science. To cover a broad range of content, a sophisticated test design was used in which examinees wrote different combinations of items. The outcome of this design is a basic knowledge and skill profile for a typical 15-year-old within each country. To accommodate the linguistic diversity among member countries, exams were created, validated, and administered in 47 languages. The results from these tests are intended to allow educators and policy makers to compare the performance of students from around the world and to guide future educational policies and practices. While the first five cycles were paper based, 54 of the 72 (81%) participating countries in PISA 2015 took the first computer-based version. The number of countries which opted for CBT increased to 89% (70 of the 79 participating countries) for PISA 2018 in keeping with the OECD view that CBT has become part of the educational experience for most students (OECD,  2019b). In short, 3

testing is now a global enterprise that includes large numbers of students immersed in different educational systems located around the world.

The Problem of Scaling Item Development

As educational testing undergoes a noteworthy period of transition, CBT is replacing paper-based testing, thereby creating the foundation for the widespread use of technology-based systems. Computer delivery systems are being used to implement test designs that permit educators to collect information that supports both formative and summative inferences as students acquire 21st-century skills. These developments are unfolding on a global stage, which means that educational testing is being used to expand our assessment practices to accommodate students who speak different languages as they become educated in diverse cultures, geographic regions, and economic systems. But these transitions are also accompanied by formidable new challenges, particularly in the area of item development. Educators must have access to large numbers of diverse, multilingual, high-quality test items to implement CBT given that the items are used to produce tests that serve multiple purposes and cater to large numbers of students who speak many different languages. Hence, thousands or, possibly, millions of new items are needed to develop the banks necessary for CBT so that testing can be conducted in these different testing conditions and across these diverse educational environments. A bank is a repository of test items, which includes both the individual items and data about their characteristics. These banks must be developed initially from scratch and then replenished constantly to ensure that examinees receive a continuous supply of new items during each test administration. Test items, as they are currently created, are time-consuming and expensive to develop because each individual item is written by a subject matter expert (SME; also called test developer, content specialist, or item writer). Hence, item development can easily be identified as one of the most important problems that must be solved before we can migrate to a modern testing system capable of different purposes, like formative and summative assessment, and suitable for a large and diverse population composed of students from different cultural and linguistic groups.

As of today, these large, content-specific, multilingual item banks are not available. Moreover, the means by which large numbers of new items can be quickly developed to satisfy these complex banking requirements is unclear (Karthikeyan, O’Connor, & Hu, 2019). The traditional approach used to create the content for these banks relies on a method in which the SME creates items individually. Under the best condition, traditional item development is an iterative process where highly trained groups of SMEs use their experiences and expertise to produce new items. Then, after these new items are created, they are edited, reviewed, and revised by another group of highly trained SMEs until they meet the appropriate standard of quality (Lane, Raymond, Haladyna, & Downing, 2016). Under what is likely the more common condition—particularly in classroom assessment at both the K–12 and post-secondary education levels—traditional item development is a solitary process where one SME with limited training uses her or his experiences to produce new test items, and these items, in turn, are administered to examinees with little, if any, additional review or refinement. In both conditions, the SME bears significant responsibility for identifying, organizing, and evaluating the content required for this complex and creative process. Item development is also a subjective practice because an item is an expression of the SME’s understanding of knowledge and skill within a specific content area. This expression is distinctive for each SME and, as a result, each item is unique. For this reason, traditional item development has often been described as an “art” because it relies on the knowledge, experience, and insight of the SME to produce unique test items (Schmeiser & Welch, 2006). But the traditional approach to item development has two noteworthy limitations. First, item development is time-consuming and expensive because it relies on the item as the unit of analysis (Drasgow et al., 2006). Each item in the process is unique, and, therefore, each item must be individually written and, under the best condition, edited, reviewed, and revised. Many different components of item quality can be identified. Item quality can focus on content. For example, is the content in the item appropriate for measuring specific outcomes on the test? Item quality can focus on logic. For example, is the logic in the item appropriate for measuring the knowledge and skills required by examinees to solve problems in a specific domain? Item quality can also focus on presentation. For example, is the item presented as a task that is grammatically and linguistically accurate? Because each element in an item is unique, each component of 5

item quality must be reviewed and, if necessary, revised. This view of an item where every element is unique, both within and across items, was highlighted by Drasgow et al. (2006, p. 473) when they stated, The demand for large numbers of items is challenging to satisfy because the traditional approach to test development uses the item as the fundamental unit of currency. That is, each item is individually hand-crafted—written, reviewed, revised, edited, entered into a computer, and calibrated—as if no other like it had ever been created before. In high-stakes testing situations, writing and reviewing are conducted by highly trained SMEs using a comprehensive development and evaluation process. As a result, the traditional approach to item development is expensive. Rudner (2010) estimated that the cost of developing one operational item for a high-stakes test using the traditional approach ranged from US$1,500 to $2,500. Second, the traditional approach to item development is challenging to scale efficiently and economically. The scalability of the traditional approach is linked, again, to the item as the unit of analysis. When one item is required, one item is written by the SME because each item is unique. When 100 items are required, 100 items must be written by the SMEs. Hence, large numbers of SMEs who can write unique items are needed to scale the process. Using a traditional approach can result in an increase in item production when large numbers of SMEs are available. But item development is a time-consuming process due to the human effort needed to create large numbers of new items. As a result, it is challenging to meet the content demands of modern testing systems using the traditional approach because it is not easily scaled.
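To make the cost argument concrete, the back-of-the-envelope arithmetic below applies Rudner's (2010) per-item estimate of US$1,500 to $2,500 to a hypothetical 1,000-item bank. The bank size is chosen only for illustration and is not a figure reported in the text; this is a rough sketch, not a formal cost model.

```python
# Illustrative cost of building an item bank with traditional, hand-crafted
# item development, using Rudner's (2010) estimate of US$1,500-$2,500 per item.
COST_PER_ITEM_LOW = 1_500
COST_PER_ITEM_HIGH = 2_500


def bank_cost(n_items: int) -> tuple[int, int]:
    """Return the (low, high) estimated development cost in US dollars for n_items."""
    return n_items * COST_PER_ITEM_LOW, n_items * COST_PER_ITEM_HIGH


low, high = bank_cost(1_000)  # hypothetical 1,000-item bank
print(f"Estimated cost: US${low:,} to US${high:,}")
# Estimated cost: US$1,500,000 to US$2,500,000
```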

Automatic Item Generation: An Augmented Intelligence Approach to Item Development

Researchers and practitioners require an efficient and cost-effective method for item development. The solution will not be found using the traditional approach because of its two inherent limitations. Consequently,

an alternative is needed. This alternative is required to support our modern test delivery and design initiatives, which will be used to evaluate large numbers of students who are educated in different educational systems and who speak different languages. One approach that can help address the growing need to produce large numbers of new test items in an efficient and economical manner is with the use of automatic item generation (AIG; Gierl & Haladyna, 2013; Irvine & Kyllonen, 2002). AIG is the process of using models to generate items using computer technology. It can be considered a form of augmented intelligence (Zheng et al., 2017). Augmented intelligence is an area within artificial intelligence that deals with how computer systems emulate and extend human cognitive abilities, thereby helping to improve human task performance. The interaction between a computer system and a human is required for the computer system to produce an output or solution. Augmented intelligence combines the strength of modern computing using computational analysis and data storage with the human capacity for judgment to solve complex unstructured problems. Augmented intelligence can, therefore, be characterized as any process or system that improves the human capacity for solving complex problems by relying on a partnership between a machine and a human (Pan, 2016). AIG can be distinguished from traditional item development in two important ways. The first distinction relates to the definition of what constitutes an item. Our experience working with SMEs and other testing specialists has demonstrated that many different definitions and conceptualizations surround the word “item”. If we turn to the educational testing literature, it is surprising to discover that the term “item” is rarely, if ever, defined. When a definition is offered, it tends to be a “black box”, meaning that an input and output are presented with no description of the internal mechanism for transforming the input to the output. For example, Ebel (1951, p. 185), in his chapter titled “Writing the Test Item” in the first edition of the famous handbook Educational Measurement, offered an early description where an item was merely referred to as “a scoring unit”. Osterlind (2010), also noting the infrequency with which the term “item” was defined in the literature, offered this definition: A test item in an examination of mental attributes is a unit of measurement with a stimulus and a prescriptive form for answering; 7

and, it is intended to yield a response from an examinee from which performance in some psychological construct (such as ability, predisposition, or trait) may be inferred. (p. 3) One of the most recent definitions is provided in the latest edition of the Standards for Educational and Psychological Testing (2014). The Standards—prepared by the Joint Committee of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education—serves the most comprehensive statement on best practices in educational and psychological testing that is currently available. The Standards define the term “item” as “a statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task” (p. 220). The authors of the Standards also direct the reader to the term “prompt”, which is described as “the question, stimulus, or instruction that elicits a test taker’s response” (p. 222). The definitions from Osterlind and the Standards share common characteristics. An item contains an input in the form of a statement, question, exercise, task, stimulus, or instruction that produces an output which is the examinee’s response or performance. In addition, the output of the examinee is in a prescriptive form that is typically either selected or constructed. But no description of the internal workings or characteristics of the item is included. Haladyna and Rodriguez (2013), in their popular text Developing and Validating Test Items, offer a different take on the term by stating that “a test item is a device for obtaining information about a test taker’s domain of knowledge and skills or a domain of tasks that define a construct” (p. 3). They claim that one of the most important distinctions for this term is whether the item is formatted as selected or constructed. To overcome the limitations of these definitions, we offer a definition of the term “item” that will be used to guide AIG in our book. An item is an explicit set of properties that include the parameters, constraints, and instructions used to elicit a response from the examinee. Our definition specifies the contents in the black box, thereby overcoming the limitations in previous definitions by describing the input as a set of parameters, constraints, and instructions. In addition, we assert that the input be represented in a way that it can be replicated and evaluated. Replication is an important requirement because it means 8

that the properties of the item are so explicit, detailed, and clear that it can be independently reproduced. Evaluation is an important requirement because it means that the properties used to produce the item for addressing a specific purpose can be scrutinized. Our definition does not include a format requirement, and it does not specify the representation for the parameters, constraints, and instructions. The second distinction relates to the workflow required to create an item. Item development is one piece within the much larger test development puzzle. Lane et al. (2016), for example, described 12 components of test development in their introductory chapter to the Handbook of Test Development (2nd edition). AIG occurs in the item writing and review subcomponent, which are within the item development component, where item development is component 4 of 12. While not explicitly stated, item writing and review in Lane et al. are synonymous with the traditional approach, as described earlier in this chapter. The traditional approach relies on a method where the SME creates each item individually using an iterative process with highly trained groups of SMEs who produce new items, as well as review and revise existing items until the items all meet specific standards of quality. Traditional item development relies heavily on the SMEs to identify, organize, and evaluate content using their knowledge, experience, and expertise. AIG, by way of comparison, uses an augmented intelligence workflow that combines the expertise of the SME with the power of modern computing to produce test items. AIG is characterized as a three-step process in which models are first created by SMEs, a template for the content is then specified by the SMEs, and, finally, the content is placed in the template using computer-based assembly. AIG can, therefore, be characterized as an augmented intelligence approach because large numbers of new items can be manufactured using the coordinated inputs created by humans with outputs produced from computers. Gierl and Lai (2013) described a three-step workflow for generating test items. This workflow differs from the traditional approach to item development because it requires the coordinated efforts of humans and computers to create items. In step 1, the SME identifies the content that will be used to produce new items. This content is identified using a framework that highlights the knowledge, skills, and abilities required to solve problems in a specific domain. Gierl, Lai, and Turner (2012) called this framework a cognitive model for AIG. A cognitive model is used as 9

the first step to highlight the knowledge, skills, and abilities required by examinees to solve a problem in a specific domain. This model also organizes the ­cognitive- and content-specific information into a coherent whole, thereby presenting a succinct representation of how examinees think about and solve problems. With the content identified in step 1, it must then be positioned within an item model in step 2. An item model (LaDuca, Staples, Templeton, & Holzman, 1986) is like a mould, template, or rendering of the assessment task that specifies which parts and which content in the task can be manipulated to create new test items. The parts include the stem, the options, and the auxiliary information. The stem contains the content or question the examinee is required to answer. The options include a set of alternative answers with one correct option and one or more incorrect options. The stem and correct option are generated for a constructed-response item. The stem, correct option, and incorrect options are generated for the selected-response item. Auxiliary information includes any material, such as graphs, tables, figures, or multimedia exhibits that supplement the content presented in the stem and/or options. The content specified in the cognitive model highlights the knowledge, skills, and abilities required to solve problems in a specific domain. The item model in step 2 provides a template for the parts of an assessment task that can be manipulated using the cognitive model so that new items can be created. After the SME identifies the content in the cognitive model and creates the item model, the outcomes from steps 1 and 2 are combined to produce new items in step 3. This step focuses on item assembly using the instructions specified in the cognitive model. Assembly can be conducted manually by asking the SME to place the content from step 1 into the model created for step 2 (e.g., Pugh, De Champlain, Gierl, Lai, & Touchie, 2016). But a more efficient way to conduct the assembly step is with a computer-based assembly system because it is a complex combinatorial task. Different types of software have been written to assemble test items. For instance, Singley and Bennett (2002) introduced the Math Test Creation Assistant to generate items involving linear systems of equations. Higgins (2007) used Item Distiller as a tool that could be used to generate sentence-based test items. Gierl, Zhou, and Alves (2008) described software called IGOR (Item GeneratOR) designed to assemble test items by placing different combinations of elements specified in the cognitive 10

model into the item model. While the details for test assembly may differ across these programs, the task remains the same: Combine the content from the cognitive models into specific parts of an item model to create new test items subject to rules and content constraints which serve as the instructions for the assembly task. Taken together, this three-step process serves as a workflow (see Figure 1.1) that can be used to systematically generate new items from a model of thinking, reasoning, and problem solving. It requires three steps where the data in each step is transformed from one state to another. We consider this workflow to be an item production system. The system is used to create a product that is consistent with our definition of “item”. Step 1, the content required for item generation is identified. The content is specified as a cognitive model. Step 2, the content is positioned in the item model. The content is extracted from the cognitive model and placed as individual values in an item model. Step 3, the instructions for assembling the content are implemented. The individual values from the item model as specified in the cognitive model are assembled using rules to create new test items. Because each step is explicit, the input and outcome from the system can be replicated and evaluated. The importance of this workflow in the item development component described by Lane et al. (2016) is also quite clear: It can be used to generate hundreds or thousands of new test items.
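The assembly step described above can be sketched in a few lines of code. The fragment below is a minimal illustration of template-based assembly under assumed content, not the IGOR, Item Distiller, or Math Test Creation Assistant software cited in the text: it fills the variable parts of a hypothetical one-layer item model with elements drawn from a cognitive model and applies a simple constraint rule to discard invalid combinations. All element names and the constraint itself are invented for the example.

```python
from itertools import product

# A hypothetical 1-layer item model: a stem template with two variable parts.
STEM_TEMPLATE = "A patient presents with {symptom}. Which {drug_class} is most appropriate?"

# Content elements that would be specified in a (hypothetical) cognitive model.
ELEMENTS = {
    "symptom": ["fever after surgery", "acute chest pain", "persistent cough"],
    "drug_class": ["antibiotic", "analgesic", "antitussive"],
}

# A simple constraint standing in for the assembly instructions: only these
# symptom/drug pairings are treated as clinically sensible combinations.
ALLOWED_PAIRS = {
    ("fever after surgery", "antibiotic"),
    ("acute chest pain", "analgesic"),
    ("persistent cough", "antitussive"),
}


def generate_items():
    """Yield every stem produced by filling the item model with element
    combinations that satisfy the constraint rule."""
    for symptom, drug_class in product(ELEMENTS["symptom"], ELEMENTS["drug_class"]):
        if (symptom, drug_class) in ALLOWED_PAIRS:
            yield STEM_TEMPLATE.format(symptom=symptom, drug_class=drug_class)


for item in generate_items():
    print(item)
```

In practice, the element lists and constraints are far larger, which is why the combinatorial assembly task is handled by software rather than by the SME placing content manually.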

Figure 1.1 Workflow and data transformation required to generate items

Benefits of Using AIG for Item Development

AIG has at least five important benefits that help address the current need to produce large numbers of diverse, multilingual, high-quality test items in an efficient and economical manner. First, AIG permits the SME to create a single cognitive model that, in turn, yields many test items using the workflow presented in Figure 1.1. The ability to transform content from the initial state of a cognitive model into the final state of a test item is
made possible by item modelling. Item modelling, therefore, provides the foundation for AIG. An item model is a template that highlights the parts of an assessment task that can be manipulated to produce new items. An item model can be developed to yield many test items from one cognitive model. Item models can also be written in different languages to permit multilingual AIG. Second, AIG can lead to more cost-effective content production because the item model is continually re-used to yield many test items compared with developing each item individually. In the process, costly yet common errors in item writing (e.g., including or excluding words, phrases, or expressions, along with spelling, grammatical, punctuation, capitalization, typeface, and formatting problems) can be avoided because only specific parts in the stem and options are manipulated when producing large numbers of items (Schmeiser & Welch, 2006). The item model serves as a template for which the SME manipulates only specific, well-defined, parts. The remaining parts of the assessment task are not altered during the item development process, thereby avoiding potential errors that often arise during item writing. The view of an item model as a template with both fixed and variable parts contrasts with the traditional view where every single part of the test items is unique, both within and across items. Third, AIG treats the item model as the fundamental unit of analysis where a single model is used to generate many items compared with a traditional approach where the item is treated as the unit of analysis (Drasgow et al., 2006). Hence, AIG is a scalable process because one item model can generate many test items. With a traditional item development approach, the item is the unit of analysis where each item is created individually. If, for instance, an SME working in the medical education context intends to have 12,480 items for her bank, then she would require 10 item models (Gierl et al., 2012, for example, generated 1,248 medical surgery items from 1 cognitive model). If a particularly ambitious SME aspired to have a very large inventory with over a half-million items, then she would require approximately 400 item models (i.e., if each item model generated, on average, 1,248 medical items, then 401 item models could be used to generate 500,448 items). Creating 400 item models within a year would be a significant but viable item development goal (i.e., about 33 models a month). By way of contrast, writing 500,448 individual items within a 12

year would be a monumental and likely impossible item development goal (i.e., about 41,700 items a month). Because of this unit of analysis shift, the cost per item will decrease because SMEs are producing models that yield multiple items rather than producing single unique items (Kosh, Simpson, Bickel, Kellogg, & Sanford-Moore, 2019). Item models can be re-used, particularly when only a small number of the generated items are used on a specific test form, which, again, could yield economic benefits. Item models can also be adapted for different languages to produce items that can be used in different countries and cultures. Fourth, AIG is a flexible approach to content production. Knowledge is fluid and dynamic (Nakakoji & Wilson, 2020; OECD, 2018). In the health sciences, for example, the creation of new drugs, the development of new clinical interventions, and the identification of new standards for best practice means that test content in the health sciences is constantly changing (Karthikeyan et al., 2019; Norman, Eva, Brooks, & Hamstra, 2006; Royal, Hedgpeth, Jeon, & Colford, 2017). These changes are difficult to accommodate in a flexible manner when the item is the unit of analysis because each part of the item is fixed. For example, if the standard of best practice shifted to conclude that a certain antibiotic is no longer effective for managing a specific presentation of fever after surgery, then all items directly or indirectly related to antibiotic treatment, fever, and surgery would need to be identified and modified or deleted from an item bank in order to reflect the most recent standard of best practice for antibiotic treatment. But when the model is the unit of analysis with fixed and variable parts, knowledge can be easily and readily updated to accommodate changes by modifying, updating, or eliminating content in the model. Even the task of identifying items that must be updated is made more manageable because only the small pool of item models rather than the large number of test items needs to be scrutinized and then updated to accommodate for the required changes. Fifth, AIG can be used to enhance test security (Wollack & Fremer, 2013). Security benefits can be implemented by decreasing the item exposure rate through the use of larger numbers of items. In other words, when item volume increases, item exposure decreases because a large bank of operational items is available. Security benefits can also be found in the item assembly step of the AIG workflow because the content in an item model is constantly manipulated and, hence, varied. 13

This ability to modulate content makes it challenging for examinees to memorize and reproduce items because of the size, depth, and diversity of the bank.
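The scaling arithmetic in the third benefit can be checked directly. The short script below assumes, as in the text, that one item model yields on average 1,248 generated items (the per-model yield reported by Gierl et al., 2012) and that production is spread over a 12-month period; it is a quick verification of the figures quoted above, not part of the authors' workflow.

```python
# Verify the unit-of-analysis arithmetic: item models versus individual items,
# assuming 1,248 generated items per item model (Gierl et al., 2012).
ITEMS_PER_MODEL = 1_248

small_bank = 10 * ITEMS_PER_MODEL           # 12,480 items from 10 item models
target_bank = 500_448                       # the large inventory cited in the text
models_needed = target_bank // ITEMS_PER_MODEL   # 401 item models
models_per_month = models_needed / 12            # about 33 models a month
items_per_month = target_bank / 12               # about 41,700 items a month

print(f"Items from 10 models: {small_bank:,}")
print(f"Item models needed for {target_bank:,} items: {models_needed}")
print(f"Models to build per month: {models_per_month:.0f}")
print(f"Individual items to write per month: {items_per_month:,.0f}")
```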

Purpose of This Book

The purpose of this book is to describe and illustrate a practical method for generating test items. Different methods can be used, but in this book, we will focus on the logic required for generating items using an item modelling approach. By item modelling, we mean methods that use templates to guide item generation. Our book is intended for two types of readers. The first audience is researchers from a broad range of disciplines who are interested in understanding the theory and the current applications of AIG. The second audience is practitioners and, in particular, SMEs who are interested in adopting our AIG methodology. Taken together, our presentation of the theory and practice of AIG will allow researchers and practitioners to understand, evaluate, and implement our AIG methodology. Ten different topics related to template-based AIG will be covered in this book. These topics will be presented in two major sections. The first section focuses on the basic concepts related to generating constructed- and selected-response items. Constructed-response items present examinees with a prompt in the stem of the item. Examinees are then expected to construct or create their own responses. Selected-response items present examinees with a prompt in the stem of the item, as well as a list of possible responses. Examinees are expected to select the best response from the list of alternatives. Chapter 2 focuses on cognitive model development, which is appropriate for both the constructed- and selected-response item format. A cognitive model is a formal structured representation of how examinees think about and solve tasks on tests. We will begin by providing a definition and description of cognitive modelling and then highlight why these models are needed to generate test items. Chapter 3 addresses the topic of item model development. Item modelling is also required for generating constructed- and selected-response item formats. Item models are needed to structure the content that has been specified in the cognitive model. Chapter 4 provides an overview of the item generation process. Item generation relies on constraint coding in which the content from the

cognitive model is placed into the template specified in the item model using specific instructions and assembly rules. The rules we present are applicable for generating constructed- and selected-response item formats. Chapter 5 contains a summary of distractor generation. Distractors are the incorrect options needed for creating the selected-response (e.g., multiple-choice) item format. We will describe how to create distractors for AIG. Chapter 6 is framed as a succinct guide that summarizes the information presented in Chapter 2 to Chapter 5 into a practical description of how to generate test items. Chapter 7, which is the last chapter in the first section of our book, presents different methods for evaluating the quality of the generated items. These methods focus on the substantive evaluation of the content in the cognitive and item models, as well as the quality of the content in the generated items. It also includes the statistical evaluation of the generated items using item analysis for quantifying item difficulty, discrimination, and similarity. The second section focuses on advanced topics in item generation. Chapter 8 highlights the importance of content coding. Once the items are generated, they must be organized. Coding is used to tag the content for the generated items so that they can be structured and accessed for different purposes. Having described an AIG method applicable for the most common item formats of constructed and selected response, Chapter 9 focuses on alternative item formats. In this chapter, we describe three alternative models that can be used to generate content in different item formats. The examples in this chapter also help demonstrate how AIG can be used to go beyond the standard constructed- and selected-­response item formats. Chapter 10 addresses the topic of rationale generation. AIG is a method for generating new test items, but it can also be used to create the corresponding rationale or solution for each of these items. Three methods for rationale generation are introduced and illustrated in this chapter to support formative feedback systems. Chapter 11 covers the topic of multilingual AIG. We present and illustrate a method that can be used for generating items in two or more languages. Multilingual AIG serves as a generalization of the three-step method described in Chapters  2 to 4. Finally, in Chapter 12, we summarize some of the key issues raised throughout the book by structuring this chapter in a question-and-answer format, and we describe the topics that we anticipate will shape the future of research and practice in AIG. 15

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Ananiadou, K., & Claro, M. (2009). 21st century skills and competences for new millennium learners in OECD countries. OECD Education Working Papers, 41, OECD Publishing. doi:10.1787/218525261154
Auld, E., & Morris, P. (2019). The OECD and IELS: Redefining early childhood education for the 21st century. Policy Futures in Education, 17, 11–26.
Bartram, D., & Hambleton, R. K. (2006). Computer-Based Testing and the Internet: Issues and Advances. New York: Wiley.
Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education: Principles, Policy, & Practice, 18, 5–25.
Binkley, M., Erstad, O., Herman, J., Raizen, S., Ripley, M., Miller-Ricci, M., & Rumble, M. (2012). Defining twenty-first century skills. In P. Griffin, B. McGaw, & E. Care (Eds.), Assessment and Teaching of 21st Century Skills (pp. 17–66). New York: Springer.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy, & Practice, 5, 7–74.
Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 92, 81–90.
Chu, S., Reynolds, R., Notari, M., & Lee, C. (2017). 21st Century Skills Development through Inquiry-Based Learning. New York: Springer.
Darling-Hammond, L. (2014). Next Generation Assessment: Moving Beyond the Bubble Test to Support 21st Century Learning. San Francisco, CA: Jossey-Bass.
Drasgow, F. (2016). Technology and Testing: Improving Educational and Psychological Measurement. New York: Routledge.
Drasgow, F., & Olson-Buchanan, J. B. (1999). Innovations in Computerized Assessment. Mahwah, NJ: Erlbaum.
Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 471–516). Washington, DC: American Council on Education.
Ebel, R. L. (1951). Writing the test item. In E. F. Lindquist (Ed.), Educational Measurement (1st ed., pp. 185–249). Washington, DC: American Council on Education.
Ferrara, S., Lai, E., Reilly, A., & Nichols, P. D. (2017). Principled approaches to assessment design, development, and implementation. In A. Rupp & J. Leighton (Eds.), The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications (pp. 41–74). New York: Wiley.
Gierl, M. J., & Haladyna, T. (2013). Automatic Item Generation: Theory and Practice. New York: Routledge.
Gierl, M. J., & Lai, H. (2013). Using automated processes to generate test items. Educational Measurement: Issues and Practice, 32, 36–50.
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved from http://www.jtla.org
Gierl, M. J., Lai, H., & Turner, S. (2012). Using automatic item generation to create multiple-choice items for assessments in medical education. Medical Education, 46, 757–765.
Grégoire, J., & Hambleton, R. K. (2009). Advances in test adaptation research [Special issue]. International Journal of Testing, 9, 73–166.
Griffin, P., & Care, E. (2015). Assessment and Teaching of 21st Century Skills: Methods and Approach. New York: Springer.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and Validating Test Items. New York: Routledge.
Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (2005). Adapting Educational and Psychological Tests for Cross-Cultural Assessment. Mahwah, NJ: Erlbaum.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77, 81–112.
Higgins, D. (2007). Item Distiller: Text Retrieval for Computer-Assisted Test Item Creation (Research Memorandum RM-07-05). Princeton, NJ: Educational Testing Service.
International Test Commission. (2017). The ITC Guidelines for Translating and Adapting Tests (2nd ed.). www.InTestCom.org
Irvine, S. H., & Kyllonen, P. C. (2002). Item Generation for Test Development. Hillsdale, NJ: Erlbaum.
Karthikeyan, S., O'Connor, E., & Hu, W. (2019). Barriers and facilitators to writing quality items for medical school assessments: A scoping review. BMC Medical Education, 19, 1–11.
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119, 254–284.
Kosh, A. E., Simpson, M. A., Bickel, L., Kellogg, M., & Sanford-Moore, E. (2019). A cost-benefit analysis of automatic item generation. Educational Measurement: Issues and Practice, 38, 48–53.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modelling procedures for constructing content-equivalent multiple-choice questions. Medical Education, 20, 53–56.
Lane, S., Raymond, M., & Haladyna, T. M. (2016). Test development process. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 3–18). New York, NY: Routledge.
Mills, C. N., Potenza, M. T., Fremer, J. J., & Ward, W. C. (2002). Computer-Based Testing: Building the Foundation for Future Assessments. Mahwah, NJ: Erlbaum.
Nakakoji, Y., & Wilson, R. (2020). Interdisciplinary learning in mathematics and science: Transfer of learning for 21st century problem solving at university. 1–22.
Nichols, P. D., Kobrin, J. L., Lai, E., & Koepfler, J. (2017). The role of theories of learning and cognition in assessment design and development. In A. Rupp & J. Leighton (Eds.), The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications (pp. 15–40). New York: Wiley.
Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Studies in Higher Education, 31, 199–218.
Norman, G., Eva, K., Brooks, L., & Hamstra, S. (2006). Expertise in medicine and surgery. In K. A. Ericsson, N. Charness, P. J. Feltovich, & R. R. Hoffman (Eds.), The Cambridge Handbook of Expertise and Expert Performance (pp. 339–353). Cambridge: Cambridge University Press.
OECD. (2018). Future of Education and Skills 2030: Conceptual Learning Framework. A Literature Summary for Research on the Transfer of Learning. Paris: OECD Conference Centre.
OECD. (2019a). PISA FAQ. http://www.oecd.org/pisa/pisafaq/
OECD. (2019b). PISA 2018 Technical Report. Paris: PISA, OECD Publishing.
Osterlind, S. J. (2010). Modern Measurement: Theory, Principles, and Applications of Mental Appraisal (2nd ed.). Boston, MA: Pearson.
Pan, Y. (2016). Heading toward artificial intelligence 2.0. Engineering, 2, 409–413.
Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical Considerations in Computer-Based Testing. New York: Springer.
Pellegrino, J. W., & Quellmalz, E. S. (2010). Perspectives on the integration of technology and assessment. Journal of Research on Technology in Education, 43, 119–134.
Pugh, D., De Champlain, A., Gierl, M. J., Lai, H., & Touchie, C. (2016). Using cognitive models to develop quality multiple-choice questions. Medical Teacher, 38, 838–843.
Ras, E., & Joosten-ten Brinke, D. (2015). Computer Assisted Assessment: Research into E-assessment. New York: Springer.
Royal, K. D., Hedgpeth, M., Jeon, T., & Colford, C. M. (2017). Automated item generation: The future of medical education assessment? European Medical Journal, 2, 88–93.
Rudner, L. (2010). Implementing the graduate management admission test computerized adaptive test. In W. van der Linden & C. Glas (Eds.), Elements of Adaptive Testing (pp. 151–165). New York, NY: Springer.
Sadler, R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119–144.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 307–353). Westport, CT: National Council on Measurement in Education and American Council on Education.
Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78, 153–189.
Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item Generation for Test Development (pp. 361–384). Mahwah, NJ: Erlbaum.
Susanti, Y., Tokunaga, T., & Nishikawa, H. (2020). Integrating automatic question generation with computerized adaptive testing. Research and Practice in Technology Enhanced Learning, 15, 1–22.
van der Linden, W. J., & Glas, C. A. W. (2010). Elements of Adaptive Testing. New York: Springer.
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Erlbaum.
Wollack, J. A., & Fremer, J. J. (2013). Handbook of Test Security. New York: Routledge.
Zheng, N., Liu, Z., Ren, P., Ma, Y., Chen, S., Yu, S., Xue, J., Chen, B., & Wang, F. (2017). Hybrid-augmented intelligence: Collaboration and cognition. Frontiers of Information Technology & Electronic Engineering, 18, 153–179.
Zilles, C., West, M., Herman, G. L., & Bretl, T. W. (2019). Every university should have a computer-based testing facility. In Proceedings of the 11th International Conference on Computer Supported Education (CSEDU 2019), Heraklion, Crete, Greece, 414–420.

Section 1: Basic Concepts Required for Generating Constructed- and Selected-Response Items


Chapter 2
Cognitive Model Development: Cognitive Models and Item Generation

Cognitive models have always been used to create test items—the only question is whether these models were specified implicitly or explicitly during the development process. Item development occurs when an SME creates a task that can be solved using different types of knowledge, skills, and competencies within a particular content area (Keehner, Gorin, Feng, & Katz, 2017; Leighton & Gierl, 2007). The concepts, assumptions, and logic used by SMEs to create and solve this content-specific task are based on their mental representation or mental model (Genter & Stevens, 1983, 2014; Johnson-Laird, 1983). In some cases, the SME mental representation may also include assumptions about how examinees are expected to solve the task (Gierl, 1997). A  cognitive model in educational testing is considered to be implicit when an item is created without documenting the process used or without capturing any type of formal representation of the concepts, assumptions, or logic used to create the item. In this case, the cognitive model that describes how a content expert and/or an examinee solves problems resides in the mind of the SME. Items developed with implicit cognitive models are not replicable. Alternatively, the concepts, assumptions, and logic used by SMEs to describe how content-specific tasks are created and solved can be made explicit by documenting the process and by using a formal representation. A cognitive model in educational testing can be defined as a description of human problem solving on standardized educational tasks that helps characterize the knowledge and skills examinees at different levels of learning have acquired in order to facilitate the 23


explanation and prediction of examinee test performance (Leighton & Gierl, 2007). This model organizes the cognitive- and content-specific information so that the SME has a formal, structured representation of how to create and solve tasks on tests. Items developed with a cognitive model are replicable because the information used to produce the item is clearly specified in a model using content that is explicit and detailed. Because the content is explicit and detailed, it can be evaluated to determine if the item is addressing a specific and intended outcome on the test. The purpose of AIG is not to produce one unique item—as with traditional item development—but many diverse items. As a result, a cognitive model for AIG provides the important first step in an item generation workflow because it contains the specifications required to create large numbers of diverse items (Gierl, Lai, & Turner, 2012). These specifications can include the content, parameters, constraints, and/or instructions that will be used to control the behaviour of the model during item generation. In addition to task creation, the model can be used to describe test performance in cognitive terms by identifying the knowledge and skills required to elicit a correct response from the examinee, which, in turn, can be used to make inferences about how examinees are expected to solve tasks generated by the system. In other words, by specifying the content, parameters, constraints, and/or instructions used to generate items and by identifying these variables in cognitive terms using specific knowledge and skills, we can describe the knowledge and skills that examinees are expected to use when solving the generated items. This modelling approach can also be used to explain why examinees select specific incorrect options when ­selected-response items are generated. In short, a cognitive model for AIG is an explicit representation of the task requirements created by the SME which is used to generate items and to describe how examinees are expected to solve the generated items. Cognitive models for AIG are not readily available because the task and cognitive requirements are specific, numerous, and, often, unique to each testing situation. As a result, these models must be created by the SME, often from scratch, for each test. Because of the important role these cognitive models play in the item generation and the test score validation process, they should also be thoroughly evaluated. 24


Benefits of Using Cognitive Models For AIG

There are three benefits of using a cognitive model for AIG. First, the cognitive model identifies and organizes the parameters, constraints, and/or instructions required to control the item generation process. The variables are described using cognitive- and content-specific information within a formal, structured representation (Embretson & Gorin, 2001). The SME must identify the content and the conditions required to generate the items. This content is then used by the computer-based assembly algorithms described in step 3 of the AIG workflow to produce new items. Therefore, one practical purpose of the cognitive model is to specify the cognitive and content features that must be manipulated to produce new items. As the number of features in the cognitive model increases, the number of generated items will increase. As the source of the features varies, the types of generated items will vary. Hence, both quantitative and qualitative characteristics can be manipulated in the cognitive model to affect generative capacity.

Second, the cognitive model can be used to make inferences about how examinees are expected to think about and solve items because it provides a structured representation of the content, parameters, constraints, and/or instructions used to create the task. By identifying the knowledge and skills required to generate new items, this cognitive description can also be used to account for how examinees are expected to select the correct and incorrect options from the generated items produced using specific types of knowledge and skills (see Leighton & Gierl, 2011, for a review). Hence, the cognitive model can be considered a construct representation that guides not only item generation but also test interpretation (Embretson, 1983, 1999, 2017). Test scores anchored to cognitive models should be more interpretable because performance can be described using a specific set of knowledge and skills in a well-defined content area: the model is used to produce items that directly measure those content-specific knowledge and skills.

Third, the cognitive model is an explicit and formal representation of task-specific problem solving. Therefore, the model can be evaluated to ensure that the generated items yield information that addresses the intended purpose of the test. In traditional item development, the SME is responsible for identifying, organizing, and evaluating the content


required to create test items. The traditional approach relies on human judgement acquired through extensive training and practical experiences. Traditional item development is also a subjective practice because an item is an expression of the SME’s understanding of knowledge and skill within a specific content area. This expression can be characterized as an implicit mental representation distinct for each SME and, therefore, unique for each handwritten item. Because the cognitive representation is implicit, distinct, and unique for each item, it is challenging to replicate and evaluate. Alternatively, the content, parameters, constraints, and/or instructions in a cognitive model for AIG are explicit and structured using cognitive- and content-specific information in a formal representation. Cognitive model development for AIG also relies on human judgement acquired through extensive training and practical experiences. But because the model is explicit, it can be replicated by SMEs. Because the model is explicit, it can also be scrutinized by SMEs and, if need be, modified to address inadequacies. Once evaluated, the model serves as a generalized expression for how tasks can be generated, as well as how examinees are expected to solve these tasks. This expression can be used immediately for creating content or archived for future use.

Developing Cognitive Models for AIG AIG cognitive model development occurs in different stages. In the first stage, the SME begins by identifying and describing, in general terms, the knowledge, content, and reasoning skills required to create different types of assessment tasks. The purpose of this stage is to delineate the testing domain where item generation will occur. The content needed to create these descriptions is often found in the test specifications. These specifications are typically represented in a two-way matrix, where one dimension represents content areas and/or learning outcomes, and the other dimension represents cognitive skills (Haladyna & Rodriguez, 2013; Lane, Raymond, & Haladyna, 2016; Perie & Huff, 2016; Schmeiser & Welch, 2006). The structure of the topics is organized by the content that belongs to each topic. Mathematics, for example, is often structured in a hierarchy of topics that range from simple to complex. 26


The  topics in mathematics also require the integration of simple concepts (e.g., simple addition and subtraction) to produce more complex concepts (e.g., application of addition and subtraction in factoring) as the topics increase in complexity (e.g., Gierl, 2007; Gierl, Alves, TaylorMajeau, 2010). The most widely used taxonomy for identifying these skills is Bloom’s Taxonomy of Educational Objectives: Cognitive Domain (Bloom, Englehart, Furst, Hill, & Krathwohl, 1956) or some variation (e.g., Anderson & Krathwohl, 2001). After the general content and cognitive domain of interest for the test have been identified and described, the second stage is focused on model development. The purpose of this stage is to create a working cognitive model. Items that represent the domain of interest described in stage 1 are first identified. These items are called parents. The parent is used to determine the underlying structure for a typical item in the domain of interest, thereby providing a point-of-reference for describing the content and the skills. Parent items provide a prototypical structure that will be modelled, as well as providing an exemplar for the types of items that can be generated. While a parent is not required, it helps expedite the process because it provides a context that can be used to scaffold the model. Cognitive models are created in an inductive manner in which the SME reviews a parent item and then identifies and describes the information that could be used to generate new items based on the characteristics of the parent. Part of the challenge with cognitive modelling in stage 2 stems from the need for specificity. The SME must be very specific about the content and parameters that will be manipulated. The content and parameters, in turn, will affect the knowledge and skills examinees use to solve problems that could affect the interpretations and inferences made about examinees’ performance. The SME must discern what content should be included and what content is beyond the scope of the model. A well-defined outcome in stage 1 will also aid in this process. Another part of the challenge in stage 2 stems from the need to make explicit the logic and instructions that will be used to operationalize the model. The SME must specify the rules required to combine the content needed to produce new test items, as well as the rules required to produce the correct answer for the generated items. These rules are created using constraints applied to the content in each part of the model. 27


Without the use of rules and constraints, all of the content in each part of the model would be combined to create new items. However, many of these combinations do not produce items that are sensible or useful. Constraints, therefore, serve as descriptions of conditions that must be applied during the assembly task so that meaningful items are generated. When the goal of item generation is to produce large banks containing many similar items, a relatively small number of constraints are used in the cognitive model. Alternatively, when the goal is to produce comparatively smaller banks of diverse items, a relatively large number of constraints are needed in the cognitive model. Hence, constraint coding is the method by which AIG captures the specific decisions, judgements, and rules used by SMEs to produce diverse, high-quality, content-specific test items. These rules are first described in the cognitive model for AIG. Because constraint coding involves the specification and application of detailed rules to different types and large amounts of content within a model, designing the stage 2 cognitive model is challenging. After a cognitive model for AIG has been created, the third stage is focused on model evaluation. One of the important benefits of using a formal representation, such as a cognitive model for AIG, is that it can be evaluated. We place a premium on cognitive model evaluation because the accuracy of this model is directly related to the quality of the generated items. Cognitive models can also be archived and used to generate items in the future. Therefore, it is important to ensure that the models are accurate. The purpose of this stage is to evaluate and, when needed, modify the cognitive model created in stage 2. In other words, the stage 3 evaluation is intended to provide the SME with feedback on the stage 2 model. In our experience, cognitive model evaluation is best conducted using a two-member SME team where one SME develops the model, and a second SME provides feedback. To identify the content and specify the constraints in the cognitive model, one SME is required to explicitly describe the knowledge, content, and reasoning skills required to generate items in a specific content area for a well-defined purpose. This cognitive model, which is produced in stage 2, is then used as the basis for the review in stage 3. Model evaluation is an important and consequential activity for AIG. If correctly specified, the generated items will reflect the correct combination of the content and logic outlined in the 28


cognitive model. As a result, the SME can begin to anticipate the quality of the generated items by reviewing the content and the logic in the cognitive model. Alternatively, if the cognitive model contains content and logic that is inaccurate, the resulting generated items based on the cognitive model could also be inaccurate. Problems related to the content and/or logic in the model should be identified and modified in this stage because they will reflect errors introduced in the earlier model development stages. After a cognitive model for AIG has been evaluated by a second SME, stage 4 is focused on a more comprehensive level of evaluation provided by an independent group of SMEs. While stage 3 evaluation is focused on feedback and model modification, the stage 4 evaluation is focused on operational implementation and use. Schmeiser and Welch (2006) described item development as a standardized process that required iterative refinements (see also Haladyna & Rodriguez, 2013; Lane et al., 2016). Cognitive model development could also be considered a standardized process that requires iterative refinements. Iterative refinement occurs when the two SMEs in stage 3 work together to evaluate and revise the cognitive model. Iterative refinement can also occur in stage 4 when a second review is conducted independently from the first by SMEs who did not develop or evaluate the cognitive model. The review task could be standardized using the outcomes from a rating scale where specific judgements on the quality of the cognitive model are collected. One way to conduct this review is by using a rating rubric. The rubric focuses on different components of cognitive model quality, such as content, logic, and representation (Gierl & Lai, 2016a). Content refers to the model’s capabilities for generating test items. Logic refers to the accuracy of the knowledge and skills required to produce the correct options. Representation refers to the adequacy of the model to reflect and cover the testing domain of interest. These three components are important qualities for any AIG cognitive model and thus should be evaluated. In sum, stage 4 allows an independent evaluator to provide the original AIG model developer with judgements on the quality of the model that go above and beyond the feedback that was provided by the second SME in stage 3. The goal of this stage is to identify and document cognitive models for AIG that are considered to be error-free and ready for operational use. 29
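Where the stage 4 review is standardized, the three rubric components named above (content, logic, and representation; Gierl & Lai, 2016a) can be recorded in a simple structure. The sketch below is a minimal illustration only: the numeric scale, the decision threshold, and the class and field names are assumptions introduced here for clarity, not a rubric published for operational use.

```python
from dataclasses import dataclass

@dataclass
class CognitiveModelReview:
    """One independent SME rating from a stage 4 review (hypothetical rubric)."""
    model_id: str
    content: int          # generative capability of the model (assumed 1-4 scale)
    logic: int            # accuracy of the rules that produce the correct option
    representation: int   # coverage of the intended testing domain
    comments: str = ""

    def ready_for_operational_use(self, threshold: int = 3) -> bool:
        # Assumed decision rule: every component must meet the threshold.
        return min(self.content, self.logic, self.representation) >= threshold

review = CognitiveModelReview("range-ratio-v1", content=4, logic=4, representation=3)
print(review.ready_for_operational_use())   # True
```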


A Word of Caution When Creating Cognitive Models

Creating cognitive models for AIG is challenging. As a result, it is a skill that SMEs can only acquire over time. We noted in Chapter 1 that the SME is responsible for the creative activities associated with identifying, organizing, and evaluating the content and cognitive skills needed to generate test items. Many different tasks are required to produce high-quality cognitive models. For example, SMEs identify the knowledge and skills required to solve different types of tasks; they organize this information into a cognitive model targeted to a specific content area; they create the detailed instructions required to coordinate the content within each model; they are expected to design many different cognitive models, often from scratch; they are also expected to evaluate and provide detailed feedback on models created by other SMEs. These responsibilities require judgement and expertise. In our experience, cognitive model development also requires a lot of practice.

Two Types of Cognitive Models for AIG

Two types of cognitive models are commonly used for AIG: the logical structures and the key features cognitive models (Gierl & Lai, 2017). These models differ in how they organize information for item generation.

Logical Structures Cognitive Model

The first type of cognitive model is called logical structures. This model is used when a specific concept or a set of closely related concepts is operationalized as part of the generative process. The logical structures cognitive model is most suitable for measuring the examinees' ability to use a concept across a variety of different content representations. The concept is often used to implement a formula, algorithm, and/or logical outcome. The defining characteristic of this cognitive modelling approach is that the content for the item can vary, but the concept remains fixed across the generated items. To illustrate this model, we present a simple math

Table 2.1  Parent Item Related to Ratio, Proportion, and Rate

Yesterday, a veterinarian treated 2 birds, 3 cats, 6 dogs. What was the ratio of the number of cats treated to the total number of animals treated by the veterinarian?
(A) 1 to 4
(B) 1 to 6
(C) 1 to 13
(D) 3 to 8
(E) 3 to 11*
* correct option

example using the three-step AIG method. This example will be used throughout the book to illustrate principles and applications using logical structures. The example, presented in Table 2.1, is adapted from a parent item used to measure concepts related to ratio, proportion, and rate. It was selected as a straightforward example so that readers can focus on the logic of our method without being overburdened by the content within the model. Figure 2.1 contains a cognitive model based on the parent item in Table 2.1 that can be used to solve word problems that measure range and ratio. Because this task is straightforward, the associated model is relatively simple. To organize information with the cognitive model, content is presented in three different panels. These panels structure and organize the information within the model. The panel structure is helpful for the SME when creating the model. The top panel identifies the general problem and its associated scenarios. The SME begins by identifying the general problem specific to the parent item. The middle panel specifies the relevant sources of information. Sources of information can be specific to a particular problem or generic, thereby applying to many problems. The bottom panel highlights the salient features. Each feature also specifies two nested components. The first nested component for a feature is the element. Elements contain content specific to each feature that can be manipulated for item generation. The content in each element is stored as values. These values can be denoted either as string values, which are non-numeric content, or integer values, which are numeric content. The second nested component for a feature is the constraint. Constraints serve as restrictions that must be applied to the elements during the assembly task to ensure that content in the elements are combined in a meaningful 31


Figure 2.1  A logical structures cognitive model for range and ratio

way so that useful items can be generated. Alternatively, constraints can be described as the problem-solving logic which serves as the instructions required to assemble the content in the elements. This logic serves as the cognitive component of the problem-solving task. A generic representation which outlines the required components for any cognitive model— not just those models described in our book—is provided in Figure 2.2. 32


Figure 2.2  A generalized structure for a cognitive model for AIG
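As a minimal sketch of the generalized structure in Figure 2.2, the code below shows one possible way to represent the three panels of a cognitive model. The class and field names are assumptions introduced for illustration, and the instance at the bottom anticipates the range-and-ratio model described in the next paragraph; this is not the assembly software discussed later in the book.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Value = Union[str, int]   # elements hold string values or integer values

@dataclass
class Feature:
    """Bottom panel: a feature with its elements (values) and constraints."""
    name: str
    elements: List[Value]
    constraints: List[Callable[[Dict[str, Value]], bool]] = field(default_factory=list)

@dataclass
class SourceOfInformation:
    """Middle panel: a source of information that groups related features."""
    name: str
    features: List[Feature]

@dataclass
class CognitiveModel:
    """Top panel: the problem and its scenarios, plus the lower panels."""
    problem: str
    scenarios: List[str]
    sources: List[SourceOfInformation]

# The simple Figure 2.1 model: one source (range) with three integer
# features, each taking the values 2-8, and no constraints.
range_ratio_model = CognitiveModel(
    problem="Range and ratio word problem",
    scenarios=["Presentation of a value among a set of values in a ratio"],
    sources=[SourceOfInformation(
        name="Range",
        features=[Feature(name=f"I{i}", elements=list(range(2, 9))) for i in (1, 2, 3)],
    )],
)
print(len(range_ratio_model.sources[0].features))   # 3
```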

The cognitive model in Figure 2.1 focuses on a range and ratio word problem. The specific scenario for this problem is based on a set of values in a ratio. Because this model is simple, there is only one source of information: the range. Each source of information contains at least one feature. In our example, the element for the range source of information is an integer value (I1 to I3). The range for the integers is identical, 2–8 in increments of 1. Each feature also contains constraints. There are no constraints for the integer elements in our example, meaning that values 33


Figure 2.3  A logical structures cognitive model with four different scenarios for range and ratio

in the range 2–8 can be used to generate test items. Models can quickly expand and become more complex. For example, Figure 2.3 contains a cognitive model with the same problem but an expanded list of scenarios for range and ratio word problems. Table 2.2 contains four different stems that are represented in the cognitive model presented in Figure 2.3. The general problem is still focused on range and ratio, but it includes four 34

Table 2.2  Four Different Stems for Range and Ratio

Yesterday, a veterinarian treated 2 birds, 3 cats, 6 dogs. What was the ratio of the number of cats treated to the total number of animals treated by the veterinarian?
(A) 1 to 4
(B) 1 to 6
(C) 1 to 13
(D) 3 to 8
(E) 3 to 11*

different scenarios: recognition of a value in a given set of values, sum of selected values among a set of values, sum of all values among a set of values, and presentation of a value among a set of values in a ratio. For each of the problems and scenarios presented in Figure 2.3, the same source of the information (range) and features list (integers) is used for our example. Large cognitive models which are common in operational AIG applications often address a single problem with four to seven different scenarios (top panel in Figures 2.3 and 2.4) and five to seven different sources of information (middle panel). The sources of information, in turn, typically contain 8–12 different features (bottom panel). This type of large cognitive model contains many variables capable of generating millions of items prior to the application of the constraints. Then when the constraints outlined in the cognitive models are applied, the majority of the generated items are eliminated because they result in infeasible combinations that would produce meaningless or inaccurate items. The combinations that remain are meaningful items produced by assembling specific combinations of elements within a feature for well-defined sources of information that can be used by the SME to measure different scenarios for a specific type of problem. Every aspect of the model—problem, scenario, sources of information, features, elements, constraints—is created by the SME. Examples of logical structures of cognitive models for AIG can be found in the content areas of science (Gierl & Lai, 2017; Gierl, Latifi, Lai, Matovinovic, & Boughton, 2016), mathematics (Gierl & Lai, 2016b; Gierl, Lai, Hogan, & Matovinovic, 2015), and classroom testing (Gierl, Bulut, & Zhang, 2018a). 35
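Because the range-and-ratio model places no constraints on its integer elements, the generative step reduces to enumerating every combination of element values and computing the keyed answer from the fixed concept. The following sketch is illustrative only (the assembly rules themselves are the subject of Chapter 4); it shows that even this deliberately simple model yields 7 × 7 × 7 = 343 candidate keys before any item model or constraint coding is applied.

```python
from itertools import product

# Elements I1-I3 from the range-and-ratio cognitive model: integers 2-8 by 1.
ELEMENT_RANGE = range(2, 9)

def keyed_combinations():
    """Enumerate all element combinations and the keyed ratio for each one."""
    for i1, i2, i3 in product(ELEMENT_RANGE, repeat=3):
        # The concept stays fixed: cats (I2) to the total number of animals.
        yield (i1, i2, i3), f"{i2} to {i1 + i2 + i3}"

combos = list(keyed_combinations())
print(len(combos))   # 343 combinations, since no constraints remove any
print(combos[0])     # ((2, 2, 2), '2 to 6')
```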


Figure 2.4  A key features cognitive model for cold versus flu

Key Features Cognitive Model

The second type of cognitive model is called key features. The key features cognitive model is used when the attributes or features of a task are systematically combined to produce meaningful outcomes across the item feature set. The use of constraints ensures that the relationships among the features yield meaningful items. The key features model is most suitable for measuring the examinees' ability to assemble and apply key features within a domain, as well as to solve problems using these key features. The defining characteristic of this modelling approach is that the content for the item can vary, and the key concept varies across the generated items due to the meaningful combination of features. In contrast to the logical structures model, which focuses on representing algorithmic problem solving, key features models are organized based on the definition of and relationship between the features used to generate items. The systematic combination of permissible features is defined by the constraints specified in the feature panel (i.e., bottom panel) of the cognitive model. Table 2.3 contains a parent item for diagnosing either the cold or the flu. As with the logical structures model, we present a relatively simple example that will be used throughout the book to illustrate principles and applications using key features. This example was selected because it is straightforward and does not require extensive medical knowledge while still demonstrating the logic of the key features model. Table 2.4 provides a list of cold and flu symptoms. The variables on this list serve as the key features that can be used to differentiate cold and flu symptoms. Figure 2.4 contains a cognitive model based on the parent item in Table 2.3 using the symptoms in Table 2.4. The top panel identifies the problem and its associated scenarios. The SME began by identifying the general medical problem (i.e., respiratory illness) specific to the parent test item. Two types of scenarios associated with this problem are the common cold and seasonal flu. The middle panel specifies the sources of information related to the problem.

Table 2.3  Parent Item Related to Diagnosis of Common Cold and Seasonal Flu

A 22-year-old female sees her doctor and reports that she's been experiencing a mild cough and slight body aches that have developed over a few days. Upon examination, she presents with an oral temperature of 37°C. What is the most likely diagnosis?
(A) Hay fever
(B) Ear infection
(C) Common cold*
(D) Acute sinusitis
(E) Seasonal influenza
* correct option

Table 2.4  Symptoms of Common Cold and Seasonal Flu

Common Cold                        Seasonal Flu
Fever Is Rare                      Fever
Mild Cough, Chest Discomfort       Severe Cough, Chest Discomfort
Mild Body Aches and Pains          Severe Body Aches and Pains
Tiredness                          Bedridden
Mild Headache                      Severe Headache
Sore Throat                        Sore Throat
Stuffy, Runny Nose                 Stuffy, Runny Nose

Adapted from Public Health Agency of Canada, "Cold or Flu: Know the Difference".

In this simple example, two sources are identified: history and examination. These sources of information are used to describe the content that will be manipulated in the generated items. The bottom panel highlights the salient features, which include the elements and constraints, for each source of information. For the example in Figure 2.4, five features (i.e., age, cough type, body aches, onset, temperature) were identified for the history and examination sources of information. Recall, each feature specifies two nested components: the elements and constraints. For instance, the cough type feature contains three values: mild, hacking, severe. The model is constrained so that mild and hacking coughs are associated with a cold, while a severe cough is associated with the flu. In other words, the SME specified that the features mild and hacking coughs can only be paired with a cold. The feature of severe cough, by comparison, can only be paired with the flu in our example. These requirements are presented in the constraint section of the feature panel. Mild and hacking are coded as CC, while severe is coded as SF. The key features model is characterized by its reliance on specific constraints in the features panel to produce generated items that contain logical combinations of values. Hence, Figure 2.4 serves as a cognitive model for AIG because it specifies and coordinates the knowledge, skills, and content required to generate items that, in turn, can be used to evaluate examinees' reasoning and problem-solving skills to diagnose respiratory symptoms associated with a cold and the flu. Examples of key features cognitive models for AIG can be found in the content areas of abdominal injury (Gierl, Lai, & Zhang, 2018b), infection and pregnancy (Gierl & Lai, 2016b), chest trauma (Gierl & Lai, 2018), hernia (Gierl & Lai, 2013), post-operative fever (Gierl, Lai, & Turner, 2012), jaundice (Gierl & Lai, 2017), and dentistry (Lai, Gierl, Byrne, Spielman, & Waldschmidt, 2016).
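The CC/SF coding described above can be expressed as a small constraint table: each constrained feature value carries the diagnosis it is allowed to support, and only combinations whose codes agree are kept. The sketch below is a minimal reading of Figure 2.4, not the authors' software; the codes for cough type follow the text, while the codes for body aches, onset, and temperature are plausible assumptions added for illustration, and age is omitted because it is unconstrained.

```python
from itertools import product

# Constrained feature values tagged with the diagnosis they may support.
# Cough-type codes come from the text; the remaining codes are assumptions.
FEATURES = {
    "cough type":  {"mild": "CC", "hacking": "CC", "severe": "SF"},
    "body aches":  {"slight body aches": "CC", "slight body pains": "CC",
                    "severe body aches": "SF", "severe body pains": "SF"},
    "onset":       {"over a few days": "CC", "within 3 to 6 hours": "SF", "suddenly": "SF"},
    "temperature": {"37°C": "CC", "37.8°C": "CC", "39°C": "SF", "39.5°C": "SF"},
}
DIAGNOSIS = {"CC": "Common cold", "SF": "Seasonal influenza"}

def feasible_combinations():
    """Yield only value combinations whose codes point to a single diagnosis."""
    names = list(FEATURES)
    for values in product(*(FEATURES[n] for n in names)):
        codes = {FEATURES[n][v] for n, v in zip(names, values)}
        if len(codes) == 1:                       # constraint: all codes agree
            yield dict(zip(names, values)), DIAGNOSIS[codes.pop()]

feasible = list(feasible_combinations())
print(len(feasible))      # 16 meaningful combinations out of 144 possible
print(feasible[0][1])     # Common cold
```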

References

Anderson, L. W., & Krathwohl, D. (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. New York: Longman.
Bloom, B., Engelhart, M., Furst, E., Hill, W., & Krathwohl, D. (1956). Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain. New York: Longmans, Green.
Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64, 407–433.
Embretson, S. E. (2017). An integrated framework for construct validity. In A. Rupp & J. Leighton (Eds.), The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications (pp. 102–123). New York: Wiley.
Embretson, S., & Gorin, J. (2001). Improving construct validity with cognitive psychology principles. Journal of Educational Measurement, 38, 343–368.
Gentner, D., & Stevens, A. (1983). Mental Models. Hillsdale, NJ: Erlbaum.
Gentner, D., & Stevens, A. L. (2014). Mental Models. New York: Psychology Press.
Gierl, M. J. (1997). Comparing the cognitive representations of test developers and students on a mathematics achievement test using Bloom's taxonomy. Journal of Educational Research, 91, 26–32.
Gierl, M. J. (2007). Making diagnostic inferences about cognitive attributes using the rule space model and Attribute Hierarchy Method. Journal of Educational Measurement, 44, 325–340.
Gierl, M. J., & Lai, H. (2013). Evaluating the quality of medical multiple-choice items created with automated generation processes. Medical Education, 47, 726–733.
Gierl, M. J., & Lai, H. (2016a). A process for reviewing and evaluating generated test items. Educational Measurement: Issues and Practice, 35, 6–20.
Gierl, M. J., & Lai, H. (2016b). Automatic item generation. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 410–429). New York: Routledge.
Gierl, M. J., & Lai, H. (2017). The role of cognitive models in automatic item generation. In A. Rupp & J. Leighton (Eds.), The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications (pp. 124–145). New York: Wiley.
Gierl, M. J., & Lai, H. (2018). Using automatic item generation to create solutions and rationales for computerized formative testing. Applied Psychological Measurement, 42, 42–57.
Gierl, M. J., Alves, C., & Taylor-Majeau, R. (2010). Using the Attribute Hierarchy Method to make diagnostic inferences about examinees' skills in mathematics: An operational implementation of cognitive diagnostic assessment. International Journal of Testing, 10, 318–341.
Gierl, M. J., Lai, H., & Turner, S. (2012). Using automatic item generation to create multiple-choice items for assessments in medical education. Medical Education, 46, 757–765.
Gierl, M. J., Lai, H., Hogan, J., & Matovinovic, D. (2015). A method for generating test items that are aligned to the Common Core State Standards. Journal of Applied Testing Technology, 16, 1–18.
Gierl, M. J., Latifi, F., Lai, H., Matovinovic, D., & Boughton, K. (2016). Using automated processes to generate items to measure K-12 science skills. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Handbook of Research on Computational Tools for Real-World Skill Development (pp. 590–610). Hershey, PA: IGI Global.
Gierl, M. J., Bulut, O., & Zhang, X. (2018a). Using computerized formative testing to support personalized learning in higher education: An application of two assessment technologies. In R. Zheng (Ed.), Digital Technologies and Instructional Design for Personalized Learning (pp. 99–119). Hershey, PA: IGI Global.
Gierl, M. J., Lai, H., & Zhang, X. (2018b). Automatic item generation. In M. Khosrow-Pour (Ed.), Encyclopedia of Information Science and Technology (4th ed., pp. 2369–2379). Hershey, PA: IGI Global.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and Validating Test Items. New York: Routledge.
Johnson-Laird, P. N. (1983). Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge, MA: Harvard University Press.
Keehner, M., Gorin, J. S., Feng, G., & Katz, I. R. (2017). Developing and validating cognitive models in assessment. In A. Rupp & J. Leighton (Eds.), The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications (pp. 75–101). New York: Wiley.
Lai, H., Gierl, M. J., Byrne, B. E., Spielman, A., & Waldschmidt, D. (2016). Three modelling applications to promote automatic item generation for examinations in dentistry. Journal of Dental Education, 80, 339–347.
Lane, S., Raymond, M., & Haladyna, T. M. (2016). Test development process. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 3–18). New York, NY: Routledge.
Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees' thinking processes. Educational Measurement: Issues and Practice, 26, 3–16.
Leighton, J. P., & Gierl, M. J. (2011). The Learning Sciences in Educational Assessment: The Role of Cognitive Models. Cambridge, MA: Cambridge University Press.
Perie, M., & Huff, K. (2016). Determining content and cognitive demands for achievement tests. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 119–143). New York: Routledge.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 307–353). Westport, CT: National Council on Measurement in Education and American Council on Education.

Chapter 3
Item Model Development: Template-Based AIG Using Item Modelling

With the content identified in step 1 as outlined in the previous chapter, the next step is to position this content in an item model. An item model is a template of the assessment task that specifies the parts and content in the task that will be manipulated to create new test items. Item models provide the foundation for template-based AIG. Item models (LaDuca, Staples, Templeton, & Holzman, 1986; see also Bejar, 1996, 2002; Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2003) have been described using different terms, including schemas (Singley & Bennett, 2002), blueprints (Embretson, 2002), templates (Mislevy & Riconscente, 2006), forms (Hively, Patterson, & Page, 1968), frames (Minsky, 1974), and shells (Haladyna & Shindoll, 1989). Item models are created by the SME. These models are cast as templates that specify where the content from the cognitive modelling step should be placed to create test items. While the cognitive model focuses on the organization of content in the knowledge domain, the item model focuses on the organization of how this content will be presented in a test item. In other words, item models are created by the SME to provide a specific structure that will be used to generate items. Item models identify the parts of an assessment task that can be manipulated for item generation. These parts include the stem, the options, and any other auxiliary information that will be presented with the item. The stem contains context, content, and/or questions the examinee is required to answer. The options include a set of alternative answers with one correct option and one or more incorrect options or distracters. Auxiliary information includes any additional content, in either the stem or option, required to generate an item, including text, images, tables, 42


graphs, diagrams, audio, and/or video. The stem and correct option are generated for the constructed-response item format. The stem, correct option, and incorrect options are generated for the ­selected-response (e.g., multiple-choice) item format. The stem and options can be further divided into elements. Elements were first introduced in Chapter 2. Elements contain values for each feature in the bottom panel of the cognitive model that can be manipulated for item generation. Values are denoted as strings, which are non-numeric content or integers which are numeric content.

Layers in Item Models Item models contain layers of information. Item models in AIG are often specified as either 1 or n-layer (Lai, 2013; see also Gierl & Lai, 2013). The goal of item generation using the 1-layer item model is to produce new items by manipulating a small number of elements at a single layer. The simplicity of this model makes it a popular choice for AIG because it is relatively easy to implement. We use element as the unit of analysis in our description because it is the most specific variable in the cognitive model that is manipulated to produce new items. Often, the starting point for 1-layer modelling is to return to the parent item used to create the cognitive model for AIG in Chapter 2. The parent item for the logical structures math cognitive model presented in Chapter 2 is shown at the top of Table 3.1. The parent highlights the underlying structure of the item, thereby providing a point of reference for creating alternative items. Then an item model is created from the parent by identifying elements that can be manipulated to produce new items. The item model identifies each part of the model required for item generation. In this example, the item model contains the stem, which identifies each feature in a square bracket (i.e., Yesterday, a veterinarian treated [I1] birds, [I2] cats, [I3] dogs. What was the ratio of the number of cats treated to the total number of animals treated by the veterinarian?) and the elements of the features (i.e., three integer v­ alues—I1, I2, I3—beginning with the value 2 and ending with the value 8 using increments of 1). For item models based on the logical structures cognitive model, the solution for the correct option is presented as the key (i.e., [I2] to \ [[I1] + [I2] + [I3] \]). 43

Table 3.1  1-Layer Logical Structures Mathematics Item Model

Parent Item:
Yesterday, a veterinarian treated 2 birds, 3 cats, 6 dogs. What was the ratio of the number of cats treated to the total number of animals treated by the veterinarian?
(A) 1 to 4
(B) 1 to 6
(C) 1 to 13
(D) 3 to 8
(E) 3 to 11

Item Model:
Stem:     Yesterday, a veterinarian treated [I1] birds, [I2] cats, [I3] dogs. What was the ratio of the number of cats treated to the total number of animals treated by the veterinarian?
Element:  [I1] Range: 2 to 8 by 1
          [I2] Range: 2 to 8 by 1
          [I3] Range: 2 to 8 by 1
Key:      [I2] to \[ [I1] + [I2] + [I3] \]
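Table 3.1 can be treated as a literal template: the bracketed elements in the stem are replaced with values drawn from their ranges, and the key is computed from the same values. The snippet below is a simplified illustration of that substitution step, with the escape characters around the key reduced to plain arithmetic; it is not the generation software referenced elsewhere in the book.

```python
STEM = ("Yesterday, a veterinarian treated [I1] birds, [I2] cats, [I3] dogs. "
        "What was the ratio of the number of cats treated to the total number "
        "of animals treated by the veterinarian?")

def render_item(i1: int, i2: int, i3: int) -> dict:
    """Fill the 1-layer item model with one set of element values."""
    stem = STEM
    for placeholder, value in {"[I1]": i1, "[I2]": i2, "[I3]": i3}.items():
        stem = stem.replace(placeholder, str(value))
    key = f"{i2} to {i1 + i2 + i3}"    # [I2] to [I1] + [I2] + [I3]
    return {"stem": stem, "key": key}

print(render_item(2, 3, 6)["key"])     # "3 to 11", matching the parent item
```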

Item models based on the key features cognitive model are formatted in the same manner, with one important difference: the key. The parent item for the key features medical cognitive model presented in Chapter 2 is shown at the top of Table 3.2. In this example, the item model contains the stem which identifies each feature in a square bracket (i.e., A [Age]-year-old female sees her doctor and reports that she’s been experiencing a [Cough Type] cough and [Body Aches] that have developed [Onset]. Upon examination, she presents with an oral temperature of [Temperature]. What is the most likely diagnosis) and the elements of the features (i.e., age with eight integer values [18 to 30 in increments of 1], cough type with three string values [mild, hacking, severe], body aches with four string values (slight body aches, slight body pains, severe body aches, severe body pains), onset with three string values [over a few days, within three to six hours, suddenly], temperature with four integer values [37°C, 37.8°C, 39°C, 39.5°C]). The correct option is presented as the key (i.e., common cold and seasonal flu). Notice, however, that the solution for the key is not presented in this model. The key helps differentiate the logical structures and key features cognitive models. Recall that the 44


Table 3.2  1-Layer Key Features Medical Item Model

Parent Item:
A 22-year-old female sees her doctor and reports that she's been experiencing a mild cough and slight body aches that have developed over a few days. Upon examination, she presents with an oral temperature of 37°C. What is the most likely diagnosis?
1: Hay Fever
2: Ear Infection
3: Common cold
4: Acute Sinusitis
5: Seasonal Influenza

Item Model:
Stem:     A [Age]-year-old female sees her doctor and reports that she's been experiencing a [Cough Type] cough and [Body Aches] that have developed [Onset]. Upon examination, she presents with an oral temperature of [Temperature]. What is the most likely diagnosis?
Element:  Age: 18–30, by 1
          Cough Type: 1. mild, 2. hacking, 3. severe
          Body Aches: 1. slight body aches, 2. slight body pains, 3. severe body aches, 4. severe body pains
          Onset: 1. over a few days, 2. within 3–6 hours, 3. suddenly
          Temperature: 1. 37°C, 2. 37.8°C, 3. 39°C, 4. 39.5°C
Key:      Common cold; Seasonal flu

logical structures cognitive model is used to measure the examinees’ ability to apply a concept with different types of content (i.e., values in the ­elements). The concept is often used to implement a formula, algorithm, and/or logical outcome. This concept is presented as the key in the item model. The defining characteristic of this modelling approach is that the content for the item can vary, but the concept remains fixed across the generated items. The key features cognitive model, on the other hand, is most suitable for measuring the examinees’ ability to assemble and apply key features. The defining characteristic of this modelling approach is that the content (i.e., values in the elements) can vary, as with the logical structures model, but also that the key concept varies across the generated items 45


due to the meaningful combination of features, which is unlike the logical structures model. The systematic combination of permissible features is defined by the constraints specified in the feature panel of the cognitive model. Hence, the constraints needed to uniquely combine different values to produce a correct response—and there can be many combinations for one correct response—must be described in the cognitive model. The main disadvantage of using a 1-layer item model for AIG is that relatively few elements can be manipulated. The manipulations are limited because the number of potential elements in a 1-layer item model is relatively small (i.e., the number of elements is fixed to the total number of words in the stem). By restricting the element manipulations to a small number, the generated items may have the undesirable quality of appearing too similar to one another. In our experience, generated items from 1-layer models are referred to pejoratively by many SMEs as “clones”. Cloning, in a biological sense, refers to any process in which a population of identical units is derived from the same ancestral line. Cloning occurs in item modelling if we consider it to be a process where specific content (e.g., nuclear DNA) in a parent item (e.g., currently or previously existing animal) is manipulated to generate a new item (e.g., new animal). Through this process, instances are created that are very similar to the parent because the information is purposefully transferred from the parent to the offspring. Clones are perceived by SMEs to be generated items that are easy to produce. Clones are often seen as simplistic products from an overly simple item development process. Most importantly, clones are believed to be easily recognized by test preparation companies, which limits their usefulness in operational testing programs because items can then be studied before the test administration. In short, items generated from 1-layer models are viewed by many SMEs as easily produced, overly simplistic, and clearly detectable. As a result, SMEs are often skeptical of items thought to be clones produced from the 1-layer item model.

Item Generation With 1-Layer Models

One early attempt to address the problem of generating cloned items was described by Gierl, Zhou, and Alves (2008). They developed a taxonomy of 1-layer item model types. The purpose of this taxonomy was to


provide SMEs with design guidelines for creating item models that yield different and varied types of generated items. Gierl et al.'s (2008) strategy for producing diversity was to systematically combine and manipulate those elements in the stem and options typically used for item model development. Their taxonomy included three parts: the stem, options, and auxiliary information. The stem is the section of the model used to formulate context, content, and/or questions. It contains four categories, as shown in Figure 3.1. Independent indicates that the n_i elements in the stem are independent or unrelated to one another. That is, a change in one element will have no effect on the other stem elements in the item model. Dependent indicates that the n_d element(s) in the stem are dependent or directly related to one another. Mixed independent/dependent include

Figure 3.1  Categories in the item model

Table 3.3  Plausible Stem-by-Option Combinations in the Item Model Taxonomy

                         Stem
Options                  Independent   Dependent   Mixed   Fixed
Randomly Selected        Yes           Yes         Yes     Yes
Constrained              Yes           Yes         Yes     No
Fixed                    Yes           Yes         Yes     No

both independent and dependent elements in the stem. Fixed represents a constant stem format with no variation or change. The options contain the alternatives for the item model when the selected-response format is used. The options contain three categories. Randomly selected options refer to the manner in which the incorrect options or distractors are selected from their corresponding content pools. The distractors are selected randomly. Constrained options mean that the correct option and the distractors are generated according to specific constraints, such as formulas or calculations. Fixed options occur when both the correct option and distractors are invariant or unchanged in the item model. By crossing the stem and options categories, a matrix of item model types can be produced. The stem-by-options matrix is presented in Table  3.3. Ten functional combinations are designated with the word “yes”. The two remaining combinations are labelled as “no” because a model with a fixed stem and constrained options is an infeasible item type, and a model with a fixed stem and fixed options produces a single selected-response item type (e.g., a single multiple-choice item). This taxonomy is useful because it provides strategies for designing diverse 1-layer item models by outlining their structure, function, similarities, and differences. It can also be used to ensure that SMEs do not design item models where the same elements are constantly manipulated or where the same item model structure is frequently used. Gierl et al. (2008) provided 20 different examples of item models created by using different stem-by-option combinations in the taxonomy to illustrate how generated items would vary in content areas such as mathematics, science, social studies, language arts, and architecture. Using the Gierl et al. (2008) taxonomy, the Table 3.1 mathematics item model would be described as an independent stem with a constrained correct option. The Table 3.2 medical item model would be described as 48


a mixed stem with constrained options. The age integer values in the stem are independent because they can assume any value while the cough type, body aches, onset, and temperature string values will depend on the combination of content in the item model. As a result, both independent and dependent elements are included in this example, making it mixed. The correct options are constrained by the combination of integer and string values specified in the stem, regardless of the age element.
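Table 3.3 can also be read as a small lookup that a model-design tool might use to flag infeasible stem-by-option pairings before any generation is attempted. The fragment below is a trivial illustration: the taxonomy itself is Gierl et al.'s (2008), but the code, function name, and lower-case labels are assumptions made here for convenience.

```python
# Feasible ("Yes") stem-by-option combinations from Table 3.3.
FEASIBLE = {
    ("independent", "randomly selected"), ("independent", "constrained"), ("independent", "fixed"),
    ("dependent", "randomly selected"),   ("dependent", "constrained"),   ("dependent", "fixed"),
    ("mixed", "randomly selected"),       ("mixed", "constrained"),       ("mixed", "fixed"),
    ("fixed", "randomly selected"),
}

def is_feasible(stem_type: str, option_type: str) -> bool:
    """Check a proposed item model design against the taxonomy matrix."""
    return (stem_type.lower(), option_type.lower()) in FEASIBLE

print(is_feasible("independent", "constrained"))   # True: the Table 3.1 math model
print(is_feasible("fixed", "constrained"))         # False: an infeasible item type
```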

n-Layer Item Models

A generalized version of the 1-layer item model requires multiple layers, or n-layering. Multiple-layer models are used to generate content that varies at two or more levels in the model, where n denotes a value of two or more. For example, a 3-layer item model contains three layers of information. The goal of AIG using the n-layer item model is to produce items by manipulating a relatively large number of elements at two or more layers in the model. Much like 1-layer item modelling, the starting point for the n-layer model is a parent item. But unlike the 1-layer model, where the manipulations are constrained to a linear set of generative operations using a small number of elements at a single level, the n-layer model permits a nonlinear set of generative operations using elements at multiple levels. As a result, the generative capacity of the n-layer model is high. The concept of n-layer item generation is adapted from the literature on the syntactic structures of language (Higgins, Futagi, & Deane, 2005; Jurafsky & Martin, 2009; McCarthy & Boonthum-Denecke, 2012). Language is often structured hierarchically, meaning that content is often embedded in layers. This hierarchical organization can also be used as a guiding principle to generate large numbers of meaningful test items. The use of an n-layer item model is, therefore, a flexible template-based method for expressing different hierarchical or layered structures, thereby permitting the development of many different but feasible combinations of embedded elements. The n-layer structure can be described as a model with multiple layers of elements, where each element can be varied simultaneously at different layers to produce different items. In the computational linguistics literature, our n-layer structure could be characterized as a generalized form of the template-based natural language generation described by Reiter (1995) and Reiter and Dale (1997). The main disadvantage of n-layer item modelling is that the structures are more complex to create and evaluate. Hence n-layering requires much more experience and practice than 1-layer modelling in order to ensure the cognitive model content is correctly specified and constrained so that meaningful items are generated.

Item Generation With n-Layer Models

A comparison of the 1- and n-layer item models is presented in Figure 3.2. For this example, the 1-layer model can provide a maximum of four different values for element A. Conversely, the n-layer model can provide up to 64 different values using the same four values for elements C and D embedded within element B. In the n-layer example, element C has four values (Value 1 to Value 4), element D has four values (Value 1 to Value 4), and element B has four values, each a different configuration that embeds element C and/or element D. Hence the n-layer model produces 4 (element B) x 4 (element C) x 4 (element D) = 64 different values.
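The 4 x 4 x 4 = 64 figure can be verified directly, because the maximum capacity is simply the product of each element's range. The sketch below uses placeholder values standing in for the elements in Figure 3.2, and the nesting of C and D within B is flattened for simplicity; it is illustrative only.

```python
from math import prod

# Placeholder element pools standing in for Figure 3.2.
one_layer = {"A": ["Value 1", "Value 2", "Value 3", "Value 4"]}

n_layer = {
    "B": ["config 1", "config 2", "config 3", "config 4"],  # each embeds C and/or D
    "C": ["Value 1", "Value 2", "Value 3", "Value 4"],
    "D": ["Value 1", "Value 2", "Value 3", "Value 4"],
}

def generative_capacity(model: dict) -> int:
    """Maximum number of distinct items: the product of each element's range."""
    return prod(len(values) for values in model.values())

print(generative_capacity(one_layer))  # 4
print(generative_capacity(n_layer))    # 4 * 4 * 4 = 64
```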

Figure 3.2  A comparison of the elements in a 1-layer and n-layer item model

Because the maximum generative capacity of an item model is the product of the ranges in each element, the use of an n-layer item model will always increase the number of items that can be generated relative to the 1-layer structure (Lai & Gierl, 2013; Lai, Gierl, & Alves, 2010). The most important benefit of the n-layer item model structure is that more elements can be manipulated, resulting in generated items that appear to be different from one another. This characteristic of n-layer modelling will become important again in Chapter 11, where we present multilingual item generation: language is a layer in the item model, which allows us to use one model to generate items in two or more languages. But for now, n-layer item modelling is important because it can be used to address the problem of cloning that concerns many SMEs.

A 2-layer variant of the logical structures math cognitive model is presented in Figure 3.3. This cognitive model serves as a generalization of the example presented in Chapter 2, Figure 2.1. The item model for the 2-layer logical structures mathematics model is provided in Table 3.4. In addition to manipulating the integer elements I1, I2, and I3, two new elements (occupation, task) are introduced, where the layer 2 elements are embedded in the layer 1 elements to facilitate the generative process. Three different occupation values are included: veterinarian, mechanic, and doctor. For each occupation, three task values are specified. The tasks for the mechanic include fixing the tires, fan belt, and brakes. The tasks for the doctor include diagnosing colds, fevers, and infections. The tasks for the veterinarian include treating birds, cats, and dogs. That is, by embedding elements within elements, different occupations and tasks can be combined with the integer values initially presented in the 1-layer model to generate more diverse and heterogeneous items. In our example, the generated items from the same n-layer model could even be considered different items because of the diversity introduced by the layered content.

A 2-layer variant of the key features medical cognitive model is presented in Figure 3.4. This cognitive model serves as a generalization of the example presented in Chapter 2, Figure 2.4. The item model for the 2-layer key features medical model is provided in Table 3.5. The age, cough type, body aches, onset, and temperature elements from the 1-layer example in Table 3.2 can now be embedded with the test findings elements within a question prompt to facilitate the generative process. The test findings include a swab element with the values nasal and throat, as well as a result element with the values positive and negative.

Figure 3.3  A 2-layer logical structures cognitive model for range and ratio

Table 3.4  2-Layer Logical Structures Mathematics Item Model

Item Model: Stem
  [OCCUPATION STATEMENT], [TASKS INVOLVED]. What is the [PROBLEM]?

Elements: Layer 1
  OCCUPATION STATEMENT:
    Last week a mechanic fixed
    The doctor diagnosed
    Yesterday a veterinarian treated
  TASKS INVOLVED:
    [I1] [JobA] and [I2] [JobB]. (when I3 is 0)
    [I1] [JobA], [I2] [JobB] and [I3] [JobC]
    [I1] patients with [JobA], [I2] patients with [JobB] and [I3] patients with [JobC]
  PROBLEM:
    number of [JobA] the [Occupation] [Task] today?
    number of [JobA] and [JobB] the [Occupation] [Task] today?
    total number of [Jobs] the [Occupation] [Task] today?
    ratio of the number of [JobA] [Task] to the total number of [Jobs] [Task] by the [Occupation]?

Layer 2: Range related
  [I1] Range: 2 to 8 by 1
  [I2] Range: 2 to 8 by 1
  [I3] Range: 0 to 8 by 1
  [IKey] Range: [I1], [I2], [I3]

Layer 2: Occupation related
  [Occupation]: mechanic, doctor, veterinarian
  [JobA]: the tires, colds, birds
  [JobB]: the fan belt, fevers, cats
  [JobC]: the brakes, infections, dogs
  [Task]: fixed, diagnosed, treated
  [Jobs]: cars, patients, animals

Key
  [IKey] to [ [I1] + [I2] + [I3] ]
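To make the layering in Table 3.4 concrete, the sketch below assembles a single item by hand for the doctor occupation and the "total number" problem. It is a simplified illustration of template filling, not the authors' generation software, and it hard-codes one combination of element values.

```python
# A simplified fill of the Table 3.4 item model for one combination (doctor).
occupation = {
    "statement": "The doctor diagnosed",
    "jobs": ["colds", "fevers", "infections"],
    "unit": "patients",
}
i1, i2, i3 = 3, 4, 2  # integer elements within their 2-8 / 0-8 ranges

tasks = (f"{i1} {occupation['unit']} with {occupation['jobs'][0]}, "
         f"{i2} {occupation['unit']} with {occupation['jobs'][1]} and "
         f"{i3} {occupation['unit']} with {occupation['jobs'][2]}")
problem = f"total number of {occupation['unit']} the doctor diagnosed today"

stem = f"{occupation['statement']} {tasks}. What is the {problem}?"
key = i1 + i2 + i3

print(stem)
# The doctor diagnosed 3 patients with colds, 4 patients with fevers and
# 2 patients with infections. What is the total number of patients the
# doctor diagnosed today?
print(key)  # 9
```

A full generator would iterate over every occupation, every problem type, and every permissible integer combination, which is exactly where the constraint coding described in Chapter 4 becomes necessary.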

The question prompt includes management (i.e., What is the best next step?) and diagnosis (i.e., What is the most likely diagnosis?). By embedding the elements from the 1-layer model with the two new elements introduced for the 2-layer model (i.e., test findings and question prompt), two entirely different types of items (i.e., What is the best next step? What is the most likely diagnosis?) can be generated using the same model.
Figure 3.4  A 2-layer key features cognitive model for cold versus flu

Table 3.5  2-Layer Key Features Medical Item Model

Item Model: Stem
  [SITUATION] [TEST FINDINGS] [QUESTION PROMPT]

Elements: Layer 1
  SITUATION: A [Age]-year-old female sees her doctor and reports that she's been experiencing a [Cough Type] cough and [Body Aches] that have developed [Onset]. Upon examination, she presents with an oral temperature of [Temperature].
  TEST FINDINGS: A [Swab] swab produces a [Result] for a viral infection.
  QUESTION PROMPT: What is the best next step? What is the most likely diagnosis?

Layer 2
  Age: 18 to 30, by 1
  Cough Type: (1) mild, (2) hacking, (3) severe
  Body Aches: (1) slight body aches, (2) slight body pains, (3) severe body aches, (4) severe body pains
  Onset: (1) over a few days, (2) within 3–6 hours, (3) suddenly
  Temperature: (1) 37°C, (2) 37.8°C, (3) 39°C, (4) 39.5°C
  Swab: nasal, throat
  Result: positive, negative

Key
  Common cold, seasonal flu, send the patient home to rest, prescribe an oral antiviral medication
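The layer-2 test findings and question prompt elements in Table 3.5 are what allow one model to yield both diagnosis and management items. A compact way to see this is to enumerate just those elements; the sketch below is our own illustration and ignores the layer-1 situation text.

```python
from itertools import product

# Layer-2 elements taken from Table 3.5.
swabs = ["nasal", "throat"]
results = ["positive", "negative"]
prompts = ["What is the best next step?", "What is the most likely diagnosis?"]

for swab, result, prompt in product(swabs, results, prompts):
    findings = f"A {swab} swab produces a {result} for a viral infection."
    print(f"{findings} {prompt}")
# 2 x 2 x 2 = 8 stem endings; half pose a management task, half a diagnosis task.
```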

Two Important Insights Involving Cognitive and Item Modelling in AIG

AIG is capable of producing an impressive number of test items. In Chapter 1, we claimed that the three-step AIG method could be used to generate hundreds or thousands of new test items using a single cognitive model. This statement is accurate. However, this claim is sometimes interpreted to also mean that one cognitive model can be used to produce the content needed for an entire test. This interpretation is inaccurate. The cognitive model will faithfully produce the content that is specified in the top panel. Recall that the top panel includes the problem and scenarios. For example, the problem and scenarios in the medical cognitive model are respiratory illness and the common cold and seasonal flu, respectively (see Chapter 2, Figure 2.4). This modelling structure means that regardless of whether the three-step AIG process is used to generate 500 or 5,000 items, these items will all be related to the problem of respiratory illness and measure the two scenarios of the common cold and seasonal flu. If the purpose of a test were to measure a single problem with a small number of scenarios, then only one cognitive model would be needed to produce the content for such a test. But in reality, this type of test is never created. Instead, the test specifications we described in step one for developing a cognitive model in Chapter 2 are used to identify many different content areas. These content areas must be carefully aligned with the problem and scenarios in our cognitive models to produce the content for the test. Hence, in most operational testing situations, many different cognitive models will be required to implement the objectives described in the test specifications because the specifications tend to be multifaceted and complex. We advise that when developers or users want a quick summary of the generated content, they simply review the problem and scenarios panel in the cognitive model. This panel provides a succinct and accurate snapshot of the general content domain that will be represented by all of the generated items from a single cognitive model.

The topic of item diversity serves as a closely related second insight. Again, we return to the claim that AIG is capable of generating an impressively large number of items. While the content represented by these items is outlined in the problem and scenarios panel of the cognitive model, the diversity that can be captured with these items is dictated by the item modelling approach. A 1-layer item model produces new items by manipulating a small number of elements at a single level in the model. This type of item model is a good starting point for a novice AIG developer because it is relatively simple. Using our medical example, the cognitive model with a respiratory illness problem that includes the common cold and seasonal flu scenarios can be used to generate diagnosis (i.e., What is the most likely diagnosis?) items using a 1-layer item model. Regardless of whether these cognitive and item models generate 500 or 5,000 items, all of the generated items will be related to the examinee's ability to diagnose the cold or flu under the general problem area of respiratory illness. N-layer item modelling helps diversify the generation process. An n-layer item model produces new items by manipulating a relatively large number of elements
at two or more layers in the model. This type of modelling is appropriate for the experienced SME because it is more complex than 1-layer modelling. But the complexity offers the benefit of generating more diverse items. Using our medical example, a single n-layer item model was used to generate both diagnosis (i.e., What is the most likely diagnosis?) and management (i.e., What is the best next step?) items: two entirely different items. Because the layering process is unlimited, other types of items (e.g., treatment) could also be included in this model. Therefore, we advise that developers or users review the layering approach when they want to understand and anticipate the kind of diversity that an item model is capable of producing. N-layer models are capable of generating diverse items, but the content for these items will always remain within the domain defined by the problem and scenarios panel in the cognitive model.

Non-template AIG: A Review of the State of the Art

It is important to recognize that AIG can be conducted in many different ways.¹ Our book is focused on template-based AIG using item modelling, but non-template AIG approaches also exist. Now that we have described template-based AIG, we provide a brief summary of non-template AIG, as described in three recent studies. We consider these three studies to represent the state of the art for non-template AIG. We also explain why the template-based approach is preferred for operational testing applications, at least at this point in the history of AIG.

Non-template AIG can be guided by the syntactic, semantic, or sequential structure of a text. Non-template AIG, which relies heavily on natural language processing (NLP) techniques and knowledge bases, can be used to directly generate statements, questions, and options from inputs such as texts, databases, and corpora of existing information. With this approach, templates are not required for generating content.

The first commonly used non-template AIG approach is syntax based. This approach operates at the syntax level, where language representations and transformations are defined using syntax. Syntax is a description of how words are combined in sentences, where the syntactic structure of a sentence conveys meaning. The typical syntax-based approach requires
tagging parts of speech, defining syntactic categories (e.g., verbs, noun phrases), and constructing syntax trees. Danon and Last (2017) described a syntax-based approach to automatically generate factual questions in the content area of cybersecurity. Syntax-based question generation extracted key statements from a text and then transformed their syntactic structure to directly produce factual questions, which served as the generated items (Heilman, 2011). To build on the work first described in Heilman's (2011) dissertation research, Danon and Last introduced a new language processing technique that provides richer contextual information to the questions in order to improve question-generation quality. The Danon and Last system started by training word embeddings with Word2vec on over one million cyber-related documents to represent the syntactic relationships expressed in the corpus. Then a set of sentences was selected from the corpus as input to the generation system. Initial question-answer pairs were generated by identifying possible answer phrases and creating suitable questions for each answer type. Finally, the quality of the generated questions from the previous step was improved by adding extra contextual information to the stem. Item generation was conducted in the content area of cybersecurity. An evaluation corpus composed of 69 sentences was created using 30 articles on cybersecurity. The syntax-based AIG system was provided with a statement such as "Polymorphic virus infects files with an encrypted copy of itself" in order to generate questions such as "What infects files with an encrypted copy of itself?" One hundred and forty-eight questions were generated. The generated questions were then evaluated by two cybersecurity SMEs, using question quality from Heilman's (2011) study as the baseline. The evaluation was based on the fluency, clarity, and semantic correctness of the questions. The results indicated that 68% of the questions initially generated by Heilman's (2011) system could be modified using the current system. Among the modified questions generated by Danon and Last, 42% were identified as acceptable by the SMEs.
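As a rough, self-contained illustration of the syntax-level idea (this is not Danon and Last's system, and it omits their word embeddings and contextual enrichment), a dependency parse can be used to replace the subject of a declarative sentence with "What". The sketch assumes the spaCy library and its small English model are installed.

```python
# A minimal sketch of rule-based, syntax-level question generation:
# replace the subject of a declarative sentence with "What".
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def subject_to_what_question(sentence: str) -> str:
    doc = nlp(sentence)
    for token in doc:
        if token.dep_ == "nsubj":
            # Keep everything outside the subject's subtree, prepend "What".
            subject_span = {t.i for t in token.subtree}
            rest = "".join(
                t.text_with_ws for t in doc if t.i not in subject_span
            ).strip().rstrip(".")
            return f"What {rest}?"
    return sentence  # no subject found; leave the statement unchanged

print(subject_to_what_question(
    "Polymorphic virus infects files with an encrypted copy of itself."
))
# -> What infects files with an encrypted copy of itself?
```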

The second commonly used non-template AIG approach is semantic based. This approach generates questions at the semantic level. Semantics focuses on the meaning of a sentence, which is often expressed using a structured combination of words. Semantic-related techniques include finding synonyms (e.g., Gütl, Lankmayr, Weinhofer, & Höfler, 2011), processing word sense disambiguation (e.g., Susanti, Iida, & Tokunaga, 2015), and translating words and sentences from one language to another. Flor and Riordan (2018) described a semantic-based AIG system that generates factual questions using semantic role labelling. Semantic role labelling is the process of assigning particular labels to different parts of a sentence in order to represent their semantic roles. The system used the information gathered from role labelling with rule-based models to create constituent (or wh-) questions and yes/no questions. Two steps were required. To begin, Flor and Riordan used open language processing tools to analyze the grammatical structure of the sentence parts required for assigning particular semantic roles. SENNA, for instance, is a popular tool for identifying generalizable core arguments in a sentence and providing specific labels, such as agent (or subject), patient (or object), location, and time of the event. Then the system directly generated constituent questions from the labelled sentences, focusing on a focal label (e.g., agent) to select the most appropriate question type (e.g., what, who, where). To prevent the generation of erroneous questions, rule-based decisions were used. For example, the system sub-classified question types based on prepositions (e.g., on, for, in) and provided do-support for certain cases. Generating yes/no questions followed a similar framework by providing do-support to the original statement. Item generation was conducted in the content area of education. A corpus of 171 sentences was created using 5 educational expository texts. The semantic-based AIG system was provided with a statement such as "Peter called on Monday" or "Peter called for six hours". In this example, the semantic role labelling identified several focal points in the sentence, such as "Peter" as the agent and "on Monday" and "for six hours" as the time information. Using this information, the system could generate constituent questions such as "When did Peter call?" and "How long did Peter call?" For the yes/no question type, a statement such as "The particles from the Sun also carry an electric charge" could be used to generate a question such as "Do the particles from the Sun carry an electric charge?"
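The role-to-question mapping can be sketched without committing to any particular semantic role labelling tool by hand-labelling one sentence. The role names and rules below are simplified assumptions for illustration, not Flor and Riordan's rule set.

```python
# Hand-labelled semantic roles for "Peter called on Monday"; a real system
# would obtain these labels automatically from an SRL tool such as SENNA.
labelled = {"predicate": "call", "agent": "Peter", "time": "on Monday"}

# Simplified rules: the focal role determines the wh-word of the question.
WH_BY_ROLE = {"agent": "Who", "time": "When", "location": "Where"}

def constituent_question(roles: dict, focal: str) -> str:
    wh = WH_BY_ROLE[focal]
    if focal == "agent":
        # Naive past-tense inflection ("call" -> "called"), for this example only.
        return f"{wh} {roles['predicate']}ed {roles.get('time', '')}".strip() + "?"
    # Do-support when the focal role is not the agent.
    return f"{wh} did {roles['agent']} {roles['predicate']}?"

print(constituent_question(labelled, "time"))   # When did Peter call?
print(constituent_question(labelled, "agent"))  # Who called on Monday?
```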

Eight hundred and ninety constituent and yes/no questions were generated. Question quality was evaluated, first, by generating comparable questions with the neural network AIG system described by Du and Cardie (2017), which served as the baseline, and, second, by asking two linguistic SMEs to compare the neural network and the semantic-based generated questions on their relevance, grammaticality, and semantic soundness. Both the neural network and the semantic-based AIG systems were required to generate questions using the 171-sentence corpus. The results indicated that the constituent questions (i.e., 10.20 out of 14, where a higher score indicates a better result) and yes/no questions (i.e., 11.41) generated from the semantic-based AIG system were more relevant, grammatical, and semantically sound than the questions generated from the neural network (i.e., 8.01).

The third commonly used non-template AIG approach is sequence based. A sequence describes a set of letters or words that follow one another in a specific order to produce meaning. Hence, sequence-based AIG focuses on mapping content to sequential numeric vectors and then using neural networks to predict the sequence of letters and words in order to create new content. Von Davier (2018) described a sequence-based AIG approach for generating personality statements. His item generation system used a language modelling approach. Language modelling captures sequential information in a text using letter and word probabilities. Language models are commonly used in NLP tasks that require predicting the future occurrence of letters and words from word history. Von Davier used a deep learning neural network algorithm to train a language model on a corpus created from existing personality items. Then the trained language modelling system, based on a recurrent neural network algorithm, was used to predict the most probable next character given a sequence of input characters. Each character was passed into a single neural network cell to inform and predict the most probable next character and, eventually, form a complete sentence. Because of the configuration of his system, the output sequences of characters reflected the language structures of the input sequences. Item generation was conducted in the content area of personality testing. A corpus of 3,320 personality statements was used to identify probabilistic sequential dependencies among the words. The sequence-based AIG system was provided with the existing personality statements from the corpus to generate 24 new personality items. The evaluation of the generated items focused on comparing their statistical qualities with those of existing personality items. The 24 generated personality items were combined with 17 existing personality items to produce an inventory that was administered to 277 participants. Then exploratory factor analysis was conducted to identify the factor structure of the items. The generated items were not distinguishable from the existing items in the factor analysis. As an example, generated items such as "I usually control others" and "I prefer to be the perfect leader" loaded on the "extraversion" factor,
along with the existing extraversion personality items. Similarly, generated items such as "I often do not know what happens to me" and "I rarely consider my actions" loaded on the "neuroticism" factor, along with the existing neuroticism personality items.
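Von Davier's system relied on a recurrent neural network trained on his 3,320-statement corpus, which we cannot reproduce here. As a much lighter stand-in that conveys the same sequence-based idea, predicting the next character from the characters that precede it, the sketch below trains a character bigram model on a few invented placeholder statements.

```python
import random
from collections import defaultdict

# Invented placeholder statements (not items from von Davier's corpus).
corpus = [
    "I like to take charge of group projects.",
    "I like to plan my work carefully.",
    "I often worry about small mistakes.",
]

# Train a character bigram model: counts of the next character given the current one.
counts = defaultdict(lambda: defaultdict(int))
for text in corpus:
    for current, nxt in zip(text, text[1:]):
        counts[current][nxt] += 1

def generate(seed: str = "I", max_len: int = 60) -> str:
    """Sample one character at a time, weighted by the bigram counts."""
    out = seed
    while len(out) < max_len and counts[out[-1]]:
        chars, weights = zip(*counts[out[-1]].items())
        out += random.choices(chars, weights=weights)[0]
        if out.endswith("."):
            break
    return out

random.seed(0)
print(generate())  # a short character sequence echoing the style of the corpus
```

A recurrent network plays the same role as the bigram table but conditions each prediction on a much longer history of the sequence, which is what allows it to produce complete, well-formed statements rather than short character runs.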

Is It Preferable to Use a Template for AIG?

The three non-template AIG approaches based on the syntactic, semantic, and sequential features of language have the advantage of generating items directly. No template is required, which means that the second step in our three-step AIG method could be eliminated. However, as our review of the literature reveals, generating items without a template requires a significant amount of training, instruction, and specific prior information that is currently very difficult to acquire in the form of a corpus or knowledge database of some kind. Non-template AIG requires a corpus of data, and the data in the corpus must be directly related to the purpose of the test because this information will be used to generate the test items. When such content is available, it is structured using NLP methods defined at the syntactic, semantic, or sequential level to generate simple items. Even when a corpus was available, the results in all of the studies we reviewed were limited: each AIG system was only able to generate simple one-sentence test items (e.g., "Do the particles from the Sun carry an electric charge?"). The quality of these single-sentence generated items also varied across the three studies.

The outcomes from our research program, along with our experiences working with diverse groups of testing practitioners, have convinced us that, for now, a template-based approach is the most appropriate for generating test items because it is the most flexible. Operational item generation must be flexible in order to permit SMEs to produce items in diverse content areas using different item formats. Item development is used to create content for tests that are administered at different levels in the education system (e.g., K–12, post-secondary), for a wide range of purposes (e.g., formative assessment, summative assessment, licensure testing, certification exams), in many different subject areas (e.g., mathematics, language arts, science, medicine, nursing, abstract reasoning, dentistry, law, business), with a range of item formats (e.g., selected response,
constructed response), and item types (e.g., multiple response, hot spot, drag and drop). Non-template AIG approaches that require corpora and knowledge bases have limited value in modern testing programs because rarely, if ever, are corpora available to provide the content needed to generate items across these diverse and specialized areas of practice. Item modelling, by way of contrast, is intended to provide researchers and practitioners with an approach for generating items that can be used at different levels in the education system for a broad range of purposes using different item formats and types, because corpora of highly specific and relevant data are not required. Instead, the data are provided by the SME using the cognitive and item modelling steps.

Operational item generation must also be flexible in order to permit SMEs to produce items that reflect the current state of the art in item development. Test items on exams today are complex because they are designed to measure a range of sophisticated 21st-century knowledge and skills. These items are created by SMEs who use their judgement, expertise, and experience to create tasks that are sophisticated and challenging for examinees. A modern testing program also needs thousands of these kinds of items because CBT, test design, and globalization are expanding the role and frequency of assessment in most education systems. Non-template AIG approaches that produce generated items constrained to simple one-sentence questions or phrases have limited value in operational programs because tests today rarely contain these kinds of items. Item modelling, by way of contrast, is intended to help SMEs address the demands of working in a modern testing program where large numbers of diverse and complex items must be created that reflect the current standards of practice.

Above all, operational item generation must be flexible in order to permit SMEs to produce high-quality items because quality is of paramount importance. Item development is viewed as a standardized process that requires iterative refinement because it must yield items that meet a high standard of quality (Lane, Raymond, & Haladyna, 2016; Schmeiser & Welch, 2006). Well-established item writing practices and conventions guide content development, where quality is achieved, in part, using guidelines (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Ganji & Esfandiari, 2020; Haladyna & Rodriguez, 2013; International Test Commission, 2017). Guidelines provide a
summary of best practices, common mistakes, and general expectations that help ensure that the SMEs have a shared understanding of their tasks and their responsibilities. Standardization helps control for the potentially diverse outcomes that can be produced when different SMEs perform the same item development task. Iterative refinement in the form of structured and systematic item review yields detailed feedback on different standards of item quality that, in turn, can be used to revise and improve the original item. Non-template AIG approaches that produce items with varying levels of quality have limited value in operational testing programs. Item modelling, by way of contrast, is intended to complement the item development conventions that currently guide educational testing in order to produce complex items of consistently high quality.

In short, template-based AIG is the most flexible approach that can be used to produce large numbers of complex, diverse, and high-quality items quickly and economically today. The non-template AIG approaches presented by Danon and Last (2017), Flor and Riordan (2018), and von Davier (2018) hold tremendous promise for the future. But to meet the current challenges, template-based AIG is the preferred approach for generating items because it offers SMEs a great deal of control and flexibility.

Note

1. A comprehensive history of AIG was presented by Haladyna (2013), which covered the period from 1950, with the development of Louis Guttman's facet theory, to the early 2000s. Gierl and Lai (2016) provided an update on Haladyna's summary by describing the key developments in both AIG theory and practice from 2005 to 2015.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Bejar, I. I. (1996). Generative response modelling: Leveraging the computer as a test delivery medium (ETS Research Report No. 96-13). Princeton, NJ: Educational Testing Service.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item Generation for Test Development (pp. 199–217). Hillsdale, NJ: Lawrence Erlbaum.
Bejar, I. I., Lawless, R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and Assessment, 2(3). Retrieved from http://www.jtla.org
Danon, G., & Last, M. (2017). A syntactic approach to domain-specific automatic question generation. arXiv:1712.09827 [cs.CL].
Du, X., & Cardie, C. (2017). Identifying where to focus in reading comprehension for neural question generation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (Volume 1: Long Papers) (pp. 1342–1352). Vancouver, CA: Association for Computational Linguistics.
Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds.), Item Generation for Test Development (pp. 219–250). Mahwah, NJ: Lawrence Erlbaum.
Flor, M., & Riordan, B. (2018, June). A semantic role-based approach to open-domain automatic question generation. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 254–263). New Orleans, LA.
Ganji, M., & Esfandiari, R. (2020). Attitudes of language teachers toward multiple-choice item writing guidelines: An exploratory factor analysis. Journal of Modern Research in English Language Studies, 7, 115–140.
Gierl, M. J., & Lai, H. (2013). Using automated processes to generate test items. Educational Measurement: Issues and Practice, 32, 36–50.
Gierl, M. J., & Lai, H. (2016). Automatic item generation. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 410–429). New York, NY: Routledge.
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved from http://www.jtla.org
Gütl, C., Lankmayr, K., Weinhofer, J., & Höfler, M. (2011). Enhanced Automatic Question Creator – EAQC: Concept, development and evaluation of an automatic test item creation tool to foster modern e-education. Electronic Journal of e-Learning, 9, 23–38.
Haladyna, T. (2013). Automatic item generation: A historical perspective. In M. J. Gierl & T. Haladyna (Eds.), Automatic Item Generation: Theory and Practice (pp. 13–25). New York, NY: Routledge.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and Validating Test Items. New York, NY: Routledge.
Haladyna, T., & Shindoll, R. (1989). Item shells: A method for writing effective multiple-choice test items. Evaluation and the Health Professions, 12, 97–106.
Heilman, M. (2011). Automatic factual question generation from text (Doctoral dissertation). Carnegie Mellon University. Available from ProQuest Dissertations and Theses database.
Higgins, D., Futagi, Y., & Deane, P. (2005). Multilingual generalization of the Model Creator software for math item generation (Research Report No. RR-05-02). Princeton, NJ: Educational Testing Service.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A "universe-defined" system of arithmetic achievement tests. Journal of Educational Measurement, 5, 275–290.
International Test Commission. (2017). The ITC Guidelines for Translating and Adapting Tests (2nd ed.). [www.InTestCom.org]
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Upper Saddle River, NJ: Pearson.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modelling procedures for constructing content-equivalent multiple-choice questions. Medical Education, 20, 53–56.
Lai, H. (2013). Developing a framework and demonstrating a systematic process for generating medical test items (Doctoral dissertation). Retrieved from doi:10.7939/R3C93H
Lai, H., & Gierl, M. J. (2013). Generating items under the assessment engineering framework. In M. J. Gierl & T. Haladyna (Eds.), Automatic Item Generation: Theory and Practice (pp. 77–101). New York, NY: Routledge.
Lai, H., Gierl, M. J., & Alves, C. (2010, April). Using item templates and automated item generation principles for assessment engineering. In R. M. Luecht (Chair), Application of Assessment Engineering to Multidimensional Diagnostic Testing in an Educational Setting. Symposium conducted at the annual meeting of the National Council on Measurement in Education, Denver, CO.
Lane, S., Raymond, M., & Haladyna, T. (2016). Test development process. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 3–18). New York, NY: Routledge.
McCarthy, P. M., & Boonthum-Denecke, C. (2012). Applied Natural Language Processing: Identification, Investigation, and Resolution. Hershey, PA: IGI Global.
Minsky, M. (1974). A framework for representing knowledge (Memo No. 306). Cambridge, MA: MIT-AI Laboratory.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing & T. Haladyna (Eds.), Handbook of Test Development (pp. 61–90). Mahwah, NJ: Lawrence Erlbaum.
Reiter, E. (1995). NLG vs. templates. arXiv:cmp-lg/9504013.
Reiter, E., & Dale, R. (1997). Building applied natural-language generation systems. Natural Language Engineering, 3, 57–87.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 307–353). Westport, CT: National Council on Measurement in Education and American Council on Education.
Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item Generation for Test Development (pp. 361–384). Mahwah, NJ: Lawrence Erlbaum.
Susanti, Y., Iida, R., & Tokunaga, T. (2015, May). Automatic generation of English vocabulary tests. In Proceedings of CSEDU (1) (pp. 77–87).
von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83, 847–857.

4

Item Generation: Approaches for Generating Test Items

Different approaches can be used to generate items. We adopt an item modelling approach, informed by a cognitive model, to leverage the benefits of binary computing to produce large numbers of test items in a controlled and flexible manner, as described in the previous chapter. But it is important to remember that different approaches can be used. We begin by reviewing three general approaches for generating items. Then we present our preferred approach by describing how plausible items are assembled with the use of logical constraints. This approach relies on a technique called bitmasking. Finally, we use bitmasking with the content from the mathematics and medical models developed in the previous chapters to demonstrate how logical constraints can be used to generate items.

The methods used to generate items can be categorized into three distinct areas: an instruction-based, an ontological-based, and a logical constraints approach. With the instruction-based approach (e.g., Arendasy & Sommer, 2012; Embretson & Kingston, 2018; Geerlings, Glas, & van der Linden, 2011; Geerlings, van der Linden, & Glas, 2012; Higgins, Futagi, & Deane, 2005; Singley & Bennett, 2002), a specific set of programming instructions is created to generate a specific set of items. This approach provides flexibility of expression for the SME to generate items in different content areas using different item formats. However, a limitation of this approach is the need to program each item model. The generation process requires the SME to provide instructions to the computer programmer for each item model, where the programmer's task is to express
the problem in a generation format. This approach creates an extended workflow because the SME must express the problem in a language and format that the programmer can interpret and then implement. It also shifts the required item development time from SMEs to programmers. An instruction-based approach is a viable way to generate items, but it is challenging to scale the item development process because each model must be programmed individually.

With the emergence of NLP techniques, non-template-based approaches can be used to generate items, as we described and illustrated in Chapter 3 (e.g., Danon & Last, 2017; Flor & Riordan, 2018; Gütl, Lankmayr, Weinhofer, & Höfler, 2011; von Davier, 2018; see also Leo, Kurdi, Matentzoglu, Parsi, Sattler, Forge, Donato, & Dowling, 2019; Mitkov & Ha, 2003). These ontological-based approaches generate test items by drawing on information that can be described in a corpus or knowledge base. As a result, they can be used to generate items without the use of templates and without intervention from the SME. These approaches generate items based on representations described at the syntax, semantic, and sequence levels, meaning that novel items that may not have been considered by the SME can be generated from the structure of knowledge that exists in the corpus. While important developments in non-template-based AIG continue to emerge, this approach relies on the existence of corpora or knowledge bases that can reliably represent concepts and topics in the specific area of interest to the SME. Currently, knowledge bases that cover specific content areas and contain the depth of information needed to describe the complex relationships suitable for the kinds of items used in modern testing programs are limited, at best.

A logical constraints approach provides a straightforward way of generating items with the use of iterations. Using this approach, the generated content is specified as elements ("element" is defined for a cognitive model in Chapter 2 and an item model in Chapter 3), where each element contains all possible values to be displayed and substituted in each generated item. The presentation of the elements is organized in a cloze test format, where all values of an element can be displayed. Then all combinations of the element values are iteratively assembled. The total number of items that can be generated
is a product of the maximum number of values in each element (Lai, Gierl, & Alves, 2010). But not all combinations of the values will produce meaningful test items. To prevent implausible combinations, the constraints defined by the SME in the cognitive and item models are used to limit the generated outcomes to those combinations that are deemed to be meaningful. With the use of constraints, the generation process can be described as an iterator that permutes through all combinations of elements and, in the process, eliminates combinations (i.e., meaningless generated items) that do not meet the constraint requirements. The logical constraints approach is flexible because it does not require specific computer programming for every item model, as the instruction-based approaches do. It can also be used to produce items using small amounts of content provided by the SME, thereby eliminating the need for the corpora or knowledge bases required by the ontological-based approaches.
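This permute-and-filter logic can be pictured in a few lines of code. The elements and constraints below are generic placeholders rather than a model from this book; the point is only that every combination is visited and any combination failing a constraint is discarded.

```python
from itertools import product

# Generic placeholder elements; each key is an element, each list its values.
elements = {
    "A": [1, 2, 3],
    "B": [1, 2, 3],
}

# Constraints are Boolean predicates over a candidate combination of values.
constraints = [
    lambda item: item["A"] != item["B"],      # e.g., A may not equal B
    lambda item: item["A"] + item["B"] <= 4,  # e.g., the sum may not exceed 4
]

def generate(elements, constraints):
    """Permute all element values and keep only the combinations that satisfy
    every constraint (the rest would be meaningless generated items)."""
    names = list(elements)
    for values in product(*(elements[n] for n in names)):
        candidate = dict(zip(names, values))
        if all(check(candidate) for check in constraints):
            yield candidate

print(list(generate(elements, constraints)))
# [{'A': 1, 'B': 2}, {'A': 1, 'B': 3}, {'A': 2, 'B': 1}, {'A': 3, 'B': 1}]
```

Later in this chapter, the same idea is expressed more compactly using bitmasking.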

The Importance of Constraint Coding

With the logical constraints approach for item generation, restrictions or constraints provide the ultimate control over how items should be generated. Because the generation approach must consider all possible permutations of elements that can be used to produce items, constraints are created to stop implausible combinations of values from being used in the assembly process. Constraints also allow the SME to implement the knowledge structures described in the cognitive and item models. Initially, constraints are expressed in the form of Boolean logic statements. Boolean logic statements provide a method of expressing the conditions that must be met in order to assemble a specific set of values within the elements into a test item. Values can be compared and queried using operators such as equals, does not equal, greater than, less than, OR, and AND. For example, given two elements (A and B), each with three values (A1, A2, A3 and B1, B2, B3), if an item is only plausible when A1 is presented with B1 or B2, then a constraint specifying that A1 cannot appear with the value B3 has to be defined. A combination of these operators can also be used together. To control for logical constraints in our mathematics example, values can be constrained with numerical
operators, such as the sum of A and B cannot be greater than ten, which would be expressed as A + B <= 10. A combination of these operators can also be expressed as (A > 0) AND ((A + B) <= 10), which is only satisfied when A is greater than 0 AND A