Computers and Their Impact on State Assessments: Recent History and Predictions for the Future

A volume in The MARCES Book Series
Robert W. Lissitz, Series Editor


Computers and Their Impact on State Assessments: Recent History and Predictions for the Future

edited by

Robert W. Lissitz University of Maryland

Hong Jiao University of Maryland

INFORMATION AGE PUBLISHING, INC. Charlotte, NC • www.infoagepub.com

Library of Congress Cataloging-in-Publication Data

Computers and their impact on state assessments : recent history and predictions for the future / edited by Robert W. Lissitz, University of Maryland, Hong Jiao, University of Maryland.
pages cm. -- (The MARCES book series)
Includes bibliographical references.
ISBN 978-1-61735-725-1 (pbk.) -- ISBN 978-1-61735-726-8 (hardcover) -- ISBN 978-1-61735-727-5 (ebook)
1. Educational tests and measurements--Computer programs. 2. Educational tests and measurements--Data processing. I. Lissitz, Robert W. II. Jiao, Hong.
LB3060.5.C65 2012
371.26--dc23
2011048230

Copyright © 2012 Information Age Publishing Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the publisher. Printed in the United States of America

Contents

Preface
1. Computer-Based Testing in K–12 State Assessments (Hong Jiao and Robert W. Lissitz)

Part I: Computer-Based Testing and Its Implementation in State Assessments
2. History, Current Practice, Perspectives, and What the Future Holds for Computer-Based Assessment in K–12 Education (John Poggio and Linette McJunkin)
3. A State Perspective on Enhancing Assessment and Accountability Systems through Systematic Implementation of Technology (Vincent Dean and Joseph Martineau)
4. What States Need to Consider in Transitioning to Computer-Based Assessments from the Viewpoint of a Contractor (Walter D. Way and Robert K. Kirkpatrick)
5. Operational CBT Implementation Issues: Making It Happen (Richard M. Luecht)

Part II: Technical and Psychometric Challenges and Innovations
6. Creating Innovative Assessment Items and Test Forms (Kathleen Scalise)
7. The Conceptual and Scientific Basis for Automated Scoring of Performance Items (David M. Williamson)
8. Making Computerized Adaptive Testing Diagnostic Tools for Schools (Hua-Hua Chang)
9. Applying Computer-Based Assessment Using Diagnostic Modeling to Benchmark Tests (Terry Ackerman, Robert Henson, Ric Luecht, John Willse, and Jonathan Templin)
10. Turning the Page: How Smarter Testing, Vertical Scales, and Understanding of Student Engagement May Improve Our Tests (G. Gage Kingsbury and Steven L. Wise)

Part III: Predictions for the Future
11. Implications of the Digital Ocean on Current and Future Assessment (Kristen E. DiCerbo and John T. Behrens)

About the Editors
About the Contributors

Preface

Computer-based testing (CBT) has been in existence for a long time. Since its adoption, it has been used for different purposes in different settings, such as selection in the military, admission to higher education, classification in licensure and certification, and grade promotion and diagnosis of strengths and weaknesses in K–12 education. Recent advances in computer technology have made the computer an indispensable part of learning, teaching, and assessment. To align with the technology-bound format of learning and teaching, assessments integrated with computer technology are an increasingly accepted format in K–12 state applications.

The advantages of computerized testing over traditional paper-and-pencil tests (PPT) are well documented in the literature (e.g., Bennett, 2001). These include easy accessibility; faster scoring and reporting; and the use of multimedia technology and innovative item formats that reflect real-life situations, producing more authentic assessments. In addition to the logistical convenience and efficiency, the flexible and innovative item formats expand assessment to knowledge and skills that cannot realistically be evaluated within a PPT format. Ultimately, a comprehensive assessment system can be built within a technological framework.

The recently initiated federal Race to the Top (RTTT) assessment program promotes the development of a new generation of assessments for providing timely feedback to "enhance instruction, accelerate learning, and provide accurate information on how our students and schools are performing" (Forgione, 2011). The federal assessment requirements entail multiple facets.
These include summative evaluation of students' achievement status, formative feedback on students' strengths and weaknesses to facilitate instruction, and measurement of students' growth trajectories as accomplished and as desired. All student-level data can later be used to evaluate teachers' and schools' performance. The assessment requirements of the RTTT program cannot be fulfilled by a single end-of-course or end-of-year assessment; rather, they require the deliberate construction of an assessment system with sound reliability and validity and clearly articulated goals. The RTTT assessment requirements demand the use of computer-based testing to obtain diagnostic/prescriptive feedback in a timely and efficient fashion. It simply cannot be done by conventional paper-and-pencil testing alone in a cost-effective way.

CBT is widely used now and its use is growing, perhaps exponentially, in K–12 assessment programs. Many applications have started out with simple linear computer administrations that mirror traditional paper-and-pencil tests, but there are many more sophisticated incarnations of CBT, including computerized adaptive tests at the item level and multistage adaptive tests, to fulfill federal assessment requirements. A quick survey indicates that at least 44 states or districts are already using CAT. Oregon has even been approved by the federal government to use CAT for AYP purposes. Hawaii, Delaware, and other states are in different stages of implementing CAT assessment models, and most seem to be moving to more and more sophisticated systems.

The State of Maryland has a long-term commitment to state assessment and is beginning to bring computer-based systems into that arena. In an effort to see what is out there and to explore options and innovations that will help push this effort forward, the Maryland Assessment Research Center for Education Success (MARCES) organized a conference on "Computers and Their Impact on State Assessment: Recent History and Predictions for the Future." A number of the primary contributors to this field were invited to present at the conference and to submit chapters for this book. We are pleased that they agreed to do so.

The conference was organized around the expectations for CBT as it will impact the schools. As we organized the conference, we were seeking answers to some of the following questions, which we consider critical to successful state computer-based testing:

1. What is the history of our field in using computers in assessment, and where is it likely to go?
2. What is going on in the states with computer-based assessments?
3. Do teachers and administrators accept this sort of testing?
4. What problems arise in getting a state ready to implement CBT?
5. How does one construct a linear CBT or CAT system?
6. Any special issues with creating item banks for CBT?
7. Any CBT test delivery issues?
8. Is security an issue?
9. Any new item types and formats?
10. What are the scoring challenges?
11. Any progress on automated scoring of complex item responses?
12. Any new psychometric models?
13. How do the psychometric models compare?
14. Are there special issues with linking and equating in adaptive testing?
15. What about the location of items in a CAT: are the parameters stable?
16. How are the data being used? Do the data make any difference besides their contribution to the employment of psychometricians and IT personnel?
17. Can we report formative AND summative results from the same data?
18. What are some of the best ways to report data, and how do they compare in terms of impact on the learner, teacher, and administrator?
19. Does any of this relate to teacher and principal assessment?
20. Do the psychometric models provide growth information?
21. What about value-added applications?
22. Does anyone do on-demand testing of students and, if so, are there any special issues?
23. What is the benchmark for public education compared to the business world and the military world?

These are just a few of the questions that confronted us and for which we hoped to receive insights from our distinguished speakers. In short, the construction of a CBT or CAT system in a state is not a simple project. The design of such a system needs to be planned ahead of time, considering multiple facets. We hope this book will provide a comprehensive view of the facets to be considered in constructing such a significant system for a state assessment program. In addition, we believe that the chapters based on the conference presentations provide examples of how such a system should be created and the outcomes that can be anticipated. We would like to thank the State of Maryland for its continued support of the Maryland Assessment Research Center for Education Success.

References

Bennett, R. E. (2001). How the Internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5), 1–23.

Forgione, P. D. (2011). Innovative opportunities and measurement challenges in through-course summative assessments: State RTTT assessment consortia development plans. Presentation at the annual meeting of the National Council on Measurement in Education.


Chapter 1

Computer-Based Testing in K–12 State Assessments

Hong Jiao and Robert W. Lissitz
University of Maryland

The use or purpose of an assessment determines its design and construction. Assessment is never just for the sake of assessment; assessment outcomes are information collected for decision making. At the classroom level, assessment results are used to provide diagnostic information and inform instruction. At the school level, assessment outcomes are frequently used for teacher evaluation and assignment. At the state and federal levels, assessment outcomes often serve as indicators for state and federal fund allocation.

Traditional assessment has depended largely on the interplay between content experts and psychometric experts to produce high-quality assessment outcomes. As computer technology advances and becomes an indispensable part of students' daily learning and instruction, assessment should be matched with the use of computer technology as well. Computer-based assessments, just as with traditional testing, are the responsibility of content, psychometric, and technology experts.

Computer-based testing (CBT) is increasingly being adopted in K–12 state assessments because of its advantages over paper-and-pencil tests (PPT). CBT provides faster score reporting, more flexible test administration, enhanced test security, and better alignment with student learning and daily activities, and it allows the use of more innovative item types to assess higher-order thinking skills. This book contains many chapters testifying to these advantages.

CBT can be delivered in different modes: linear or adaptive. In general, the linear form of CBT is most often an e-version converted from a linear PPT, created as an e-test in which a fixed form is used for every examinee, or assembled as multiple parallel forms generated on the fly without adapting item selection to the current estimates of students' abilities. In other words, CBT can simply be an electronic version of a PPT. On the other hand, computerized adaptive testing (CAT) can build a different test form targeted to each individual student. The adaptation can be made at the individual item level, which is the conventional CAT, or at the item group level, which is often referred to as a multistage test.

This book focuses on the practices and challenges related to the use of computer-based testing in K–12 state assessment programs. This introduction starts with a review of the different modes of CBT, their potential uses in K–12 state assessments, and associated issues. Following that is an introduction to the book chapters, which offer potential solutions to some of the issues raised as well as illustrations of successful implementations. The last section summarizes recommendations for using CBT in K–12 state assessments based on the contributions of the book chapters.

Computer-Based Testing Algorithms

Computer-Based Linear Tests

The No Child Left Behind (NCLB) Act of 2001 stipulated that students in grades three through eight be tested annually in reading and math, as well as in science once in each grade span. High school exit examinations are graduation requirements for high school students. This act dramatically increased the testing load for K–12 students and the workload for state assessment personnel. Many states, such as Mississippi, Virginia, and Texas, explored using computer-based tests for fast score reporting and for retesting on graduation tests.

The common practice then was to create a linear CBT with the same items as the PPT from which it was converted. Most often, the CBT even had the same item presentation sequence. Compared with PPT, linear CBT saves money on printing, test paper delivery, and item scoring, and it can increase testing frequency as needed. In testing settings other than K–12, a linear CBT may randomize the item presentation sequence or the option presentation sequence. However, in the K–12 setting, the item and option presentation sequences are often kept the same as in the PPT form to preserve comparability between the two administration modes, so that the same raw score-to-scale score conversion table and norm table can be applied. Most often, scores from the linear CBT and the PPT are treated as equivalent, provided that a comparability study supports this practice.

Unlike CAT, students taking a linear CBT still need to answer items that are too easy or too difficult for them, just as they do in PPTs. Figure 1.1 (Wang & Jiao, 2005) shows a linear CBT with the same items administered to examinees with abilities of 1 and –3.5. The majority of the items are too easy for the student with an ability of 1, while the majority of the items are too difficult for the student with an ability of –3.5. Test security can also be an issue due to the limited number of available test forms.

Figure 1.1  Administration of a linear CBT (the same items for two examinees with different true abilities of 1 and –3.5 represented by the two dark lines).

The linear-on-the-fly test (LOFT) is a CBT algorithm in which multiple parallel test forms are automatically assembled during test administration. This testing algorithm was developed to reduce item exposure and improve test security by assembling a unique parallel form for each examinee that conforms to the same test specification in terms of content and statistical characteristics (Luecht, 2005). In practice, however, a high item overlap rate and exposure control can be potential issues (Lin, 2010). Further, LOFTs have different measurement precision for students with different abilities, as no adaptation is made to examinees' ability estimates. The burden of test form construction is dramatically increased without a corresponding gain in measurement precision when the test is administered without proper and accurate prior knowledge of the test taker's true ability.
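To make on-the-fly form assembly concrete, here is a minimal sketch in Python; the item bank, the strand labels, the blueprint counts, and the function name assemble_loft_form are all hypothetical and not taken from this chapter. Each examinee receives a unique fixed-length form satisfying a simple content blueprint; an operational LOFT engine would also match statistical targets and enforce the exposure and overlap constraints noted above.

```python
import random

# Hypothetical item bank: each item has an id, a content strand, and a difficulty (b).
item_bank = [{"id": i, "strand": random.choice(["algebra", "geometry", "data"]),
              "b": random.uniform(-2, 2)} for i in range(300)]

# Hypothetical blueprint: number of items required from each content strand.
blueprint = {"algebra": 10, "geometry": 8, "data": 7}

def assemble_loft_form(bank, blueprint, seed=None):
    """Assemble one parallel form on the fly by sampling the required number of
    items per content strand. Real LOFT assembly also matches statistical targets
    (e.g., test information) and enforces item-exposure and overlap constraints."""
    rng = random.Random(seed)
    form = []
    for strand, n_needed in blueprint.items():
        candidates = [item for item in bank if item["strand"] == strand]
        form.extend(rng.sample(candidates, n_needed))
    rng.shuffle(form)
    return form

# Each examinee gets a unique form conforming to the same blueprint.
form_for_examinee_1 = assemble_loft_form(item_bank, blueprint, seed=1)
form_for_examinee_2 = assemble_loft_form(item_bank, blueprint, seed=2)
```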

Computer Adaptive Tests (Adaptive at the Individual Item Level)

CAT differs from linear CBT in that item administration in CAT adapts to the current estimate of an examinee's ability. The adaptation can be at the item level or at the item group level; the former is known as CAT, while the latter is known as multistage CAT. The implementation of a CAT involves multiple steps. Wainer (1990) stated the key questions in administering a CAT as (1) how do we choose an item to start the test, (2) how do we choose the next item to be administered after we have seen the examinee's response to the current one, and (3) how do we know when to stop. In general, the steps for a CAT administration (Parshall, Spray, Kalohn, & Davey, 2002) can be summarized as follows.

Step 1. Start the CAT. Use one of the initial item selection methods to select a starting item. The "best guess" method administers a first item of medium difficulty. The "use of prior information" method makes use of other test scores or information to obtain an estimate of examinee ability and then chooses an item with the most appropriate difficulty. The "start easy" method begins the test with relatively easy items to give the examinee time to warm up and then moves the difficulty up to match the student's estimated ability.

Step 2. Estimate ability. Based on the examinee's responses to the items administered so far, estimate the examinee's latent ability using maximum likelihood estimation or one of the Bayesian estimation methods: the Bayes mean (expected a posteriori, EAP) or the Bayes mode (maximum a posteriori, MAP) (Baker, 1992).

Step 3. Select the next item. Once the examinee's latent ability has been estimated from his or her responses to the items administered, the next item selected is the one that best satisfies three requirements: (1) maximizing test efficiency, (2) content balancing, and (3) item exposure rate control.

Step 4. Terminate the CAT. The CAT stops when the test meets a preset criterion, such as a certain level of measurement precision for the examinee's latent ability, or a preset test length. When test length is the stopping criterion, examinees are measured with the same number of items but with different levels of measurement precision. On the other hand, when a precision level is the stopping criterion, all examinees are measured with the same measurement accuracy but with different test lengths.

Graphically, a CAT administration can be represented as in Figure 1.2 (Wang & Jiao, 2005).

Figure 1.2  Administration of a CAT for a student with the true ability of –1.
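As an illustration of Steps 1 through 4 above, the sketch below is a minimal item-level CAT loop, assuming a 2PL item bank with simulated parameters, maximum-information item selection, and EAP ability estimation on a quadrature grid. The function run_cat, the simulated bank, the prior, and the stopping values are all illustrative rather than taken from this chapter, and an operational CAT would add the content balancing and exposure control mentioned in Step 3.

```python
import numpy as np

# Hypothetical 2PL item bank: discrimination (a) and difficulty (b) parameters.
rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, size=200)
b = rng.uniform(-3.0, 3.0, size=200)

theta_grid = np.linspace(-4, 4, 161)        # quadrature grid for EAP
prior = np.exp(-0.5 * theta_grid**2)        # standard normal prior (unnormalized)

def p_correct(theta, a_i, b_i):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))

def item_information(theta, a_i, b_i):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a_i, b_i)
    return a_i**2 * p * (1.0 - p)

def run_cat(true_theta, max_items=30, se_target=0.30):
    administered = []
    theta_hat, se = 0.0, np.inf              # "best guess" start at medium difficulty
    likelihood = np.ones_like(theta_grid)
    while len(administered) < max_items and se > se_target:
        # Step 3: select the unused item with maximum information at the current estimate.
        info = item_information(theta_hat, a, b)
        info[administered] = -np.inf
        j = int(np.argmax(info))
        # Simulate the examinee's 0/1 response under the true ability.
        x = int(rng.random() < p_correct(true_theta, a[j], b[j]))
        administered.append(j)
        # Step 2: update the EAP estimate and its posterior standard deviation.
        p = p_correct(theta_grid, a[j], b[j])
        likelihood *= p**x * (1 - p)**(1 - x)
        posterior = likelihood * prior
        posterior /= posterior.sum()
        theta_hat = float(np.sum(theta_grid * posterior))
        se = float(np.sqrt(np.sum((theta_grid - theta_hat)**2 * posterior)))
    # Step 4: stop at the precision target or the maximum test length.
    return theta_hat, se, len(administered)

print(run_cat(true_theta=-1.0))              # e.g., an estimate near -1 after 15-25 items
```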


The main advantage of CAT over linear CBT is that it provides more accurate latent trait estimates with fewer items than traditional linear tests (Chang, 2004). Figure 1.3 (Wang & Jiao, 2005) illustrates the differences in measurement precision between a linear test and a CAT. It is possible to achieve high measurement precision (low measurement error) along the whole ability range in a CAT. In linear tests, however, the fidelity and bandwidth contradiction inherent in the test design cannot be resolved: a linear test can achieve either high fidelity but a small bandwidth, as in line (a) of Figure 1.3, or a large bandwidth but low fidelity, as illustrated in line (b) of Figure 1.4 (Wang & Jiao, 2005). Via the adaptive selection of items based on the current ability estimate, a CAT can maintain high fidelity with a large bandwidth, as illustrated in Figure 1.5 (Wang & Jiao, 2005).

Figure 1.3  Standard error of measurement in a linear CBT and a CAT.

Figure 1.4  Two possibilities for a linear CBT in terms of fidelity and bandwidth.

Figure 1.5  Fidelity and bandwidth comparison between a linear CBT and CAT.

CATs can reduce testing time by more than 50% while maintaining the same level of reliability, thereby reducing the influence of fatigue on students' test performance by not wasting time on items that are too easy or too hard. Conventional tests are most informative and accurate for average students, while CATs maximize measurement precision for most examinees, even those with a wide range of abilities. Compared with linear CBT, CAT can provide scores with the same measurement precision for examinees of different abilities. Self-paced administration offers flexibility to each student. Test security may be increased since no hardcopy test booklets are distributed, although computer security is obviously of critical importance. Data collection and score reporting are easier and faster in CAT.

On the other side of the coin, a CAT test delivery has unavoidable limitations. One concern is that some examinees may perform much worse on a CAT than on traditional tests because of differences in test questions, test scoring, test conditions, and student groups (Kolen, 1999–2000). Every item is selected to target the current estimate of an examinee's ability, so every question looks like a challenge to each examinee. This may adversely impact test-taking mentality.

Item review is not allowed in conventional CATs, since examinees might exploit item review to achieve a higher score than they should obtain. The lack of item review may affect students' motivation and result in test-taking strategies different from those used in linear tests. Another important concern is test security and item pool usage. Since an adaptive algorithm tends to select the items with the most information at the current estimate of an examinee's ability, and the majority of examinees' abilities fall in the center of the ability scale, the same items may be heavily used. Thus, certain items tend to be selected much more often than others, making item exposure rates quite uneven (Chang, 2004). Exposure control algorithms can be used to prevent overuse of good items, but exposure conditional on ability is not controlled. Reductions in testing time and increases in measurement precision depend on the breadth of the construct domain and the depth of the item pool (Paek, 2005). In addition, the item parameter estimates used later for scoring examinees in the operational test are based on field test administrations. To obtain accurate estimates of item parameters, a relatively large sample size of 1,000 to 2,000 is needed, which is hard to reach for locally developed tests (Grist, 1989). Item security can also be a problem unless brand-new items are continuously introduced into the item bank.

Applications of CAT can be found in licensure and certification tests such as the National Council Licensure Examinations (NCLEX) for Practical Nurses (PN) and for Registered Nurses (RN) under the National Council of State Boards of Nursing (NCSBN); in admission tests such as the Graduate Management Admission Test (GMAT) and the Graduate Record Examination (GRE); and in personnel selection tests such as the Armed Services Vocational Aptitude Battery (ASVAB). In recent decades, CAT has been increasingly adopted in K–12 state assessments. The Northwest Evaluation Association (NWEA) developed CATs known as the Measures of Academic Progress (MAP), and Renaissance Learning has CAT products such as STAR Reading, Math, and Early Literacy. The number of districts and states using CAT for different assessment purposes has been increasing. For NCLB purposes, Oregon is currently the only state that has received approval from the U.S. Department of Education to use CAT for AYP designation and to meet NCLB assessment requirements (Harmon, 2010). Under the influence of the Race to the Top program, which promotes the use of technology in state assessments, other states, such as Delaware and Hawaii, are moving to join the trend.

Multistage Computerized Adaptive Tests (Adaptive at the Item Group Level)

Multistage tests (MSTs) come in several forms and under several names, including computerized mastery tests, computerized adaptive testlets, computer-adaptive sequential testing, multiple form structures, and bundled, preconstructed multistage adaptive testing (Hendrickson, 2007; Keng, 2008; Luecht, 2005). They all involve clustering items into pre-assembled modules. Modules can be constructed in multiple ways; one common practice is to pre-construct each module as a miniature of the test, conforming to the test specification but with varying module difficulty. A multistage test requires the construction of multiple panels. One panel consists of multiple stages, and each stage is made up of one or more modules. Three stages are commonly used or researched. A graphical representation of a three-stage CAT is displayed in Figure 1.6 (Lu, 2009). Stage 1 contains one module of moderate difficulty (M). Stage 2 consists of several modules with different levels of difficulty: easy (E), moderate (M), and hard (H). The same difficulty structure can be maintained in stage 3. After the administration of the stage 1 module, the examinee's ability is estimated. Based on that estimate, a module from stage 2 is selected according to a pre-determined selection rule, and the examinee is routed to that module. This is repeated in stage 3. If stage 3 is the last stage, the examinee's responses to all items in all modules from all stages are used for the final ability estimate.

The increased popularity of MSTs stems from the practical concerns about CATs. While maintaining some desired features of CATs, MSTs address several CAT shortcomings.

Figure 1.6  A multistage CAT with three stages (stage 1: module 1M; stage 2: modules 2E, 2M, 2H; stage 3: modules 3E, 3M, 3H).

First, MSTs have more administrative control over content quality (i.e., content specifications). Automated test assembly (ATA) algorithms are usually used to build the modules so that statistical and content constraints are satisfied. It is possible for content experts and developers to review the specifications of the test blueprint within one module for quality assurance purposes, and context effects may also be reduced through expert review. Second, MSTs allow students to review items within a module without compromising the integrity of the test. Third, problems with sparse operational data matrices for subsequent item recalibration are mitigated, since items are administered in bundles (Stark & Chernyshenko, 2006). Finally, the adaptation points are generally fewer in MSTs than in CATs, leading to a more efficient test administration with faster scoring and routing and easier data management and computer processing.

The MST design generally does not achieve the same level of measurement precision as a testlet-based, item-level CAT; in other words, to achieve the same measurement precision, more items are needed in MSTs. When constructing the modules and panels of an MST, the underlying proficiency distribution needs to be carefully established, for instance by a thorough investigation of the testing population prior to the initial and subsequent administrations (Hendrickson, 2007; Keng, 2008).

MSTs have been used operationally in large-scale testing programs including the Law School Admission Test (LSAT), the Test of English as a Foreign Language (TOEFL), the National Council of Architectural Registration Boards (NCARB) examination, the U.S. Medical Licensing Examination (USMLE), and the Uniform CPA (Certified Public Accountant) Examination (Hendrickson, 2007; Keng, 2008; Luecht, Brumfield, & Breithaupt, 2006). No state is currently using MST for its state assessments. However, this CBT algorithm has potential for criterion- or standards-based state assessments, where content is closely related to performance levels and students' growth, and content specification requirements can be scrutinized before test administration.
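As a concrete illustration of the panel structure in Figure 1.6, the sketch below routes an examinee through a 1-3-3 multistage design based on a provisional ability estimate after each stage. The dictionary panel, the function names route and administer_mst, and the routing cut points of ±0.5 are hypothetical and chosen only for illustration; they are not taken from an operational program.

```python
# Hypothetical three-stage (1-3-3) MST panel mirroring Figure 1.6.
panel = {
    1: {"M": "module_1M"},
    2: {"E": "module_2E", "M": "module_2M", "H": "module_2H"},
    3: {"E": "module_3E", "M": "module_3M", "H": "module_3H"},
}

# Hypothetical routing rule: provisional theta below -0.5 goes to the easy module,
# above +0.5 to the hard module, otherwise to the moderate module.
def route(theta_hat: float) -> str:
    if theta_hat < -0.5:
        return "E"
    if theta_hat > 0.5:
        return "H"
    return "M"

def administer_mst(score_module):
    """score_module(module_name) is assumed to administer a module and return an
    updated provisional ability estimate based on all responses so far."""
    path = ["module_1M"]
    theta_hat = score_module("module_1M")      # everyone starts at the moderate module
    for stage in (2, 3):
        next_module = panel[stage][route(theta_hat)]
        path.append(next_module)
        theta_hat = score_module(next_module)  # re-estimate after each stage
    return path, theta_hat                      # final estimate uses items from all stages

# Example with a dummy scorer whose provisional estimates rise across stages.
estimates = iter([0.1, 0.7, 0.9])               # hypothetical provisional estimates
path, final_theta = administer_mst(lambda module: next(estimates))
print(path, final_theta)   # ['module_1M', 'module_2M', 'module_3H'] 0.9
```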

Computerized Classification Tests

When grouping students into different performance levels is a goal of a CAT, the CAT results are used essentially for classification. Literature exists on CAT algorithms for classification purposes, such as the CAT with point estimates for classification (Kingsbury & Weiss, 1983), the sequential probability ratio test (SPRT; Wald, 1947) based computerized classification test (Eggen, 1999; Eggen & Straetmans, 2000; Ferguson, 1969; Reckase, 1983; Spray, 1993), and the computerized mastery test (CMT) using Bayesian decision theory (Lewis & Sheehan, 1990; Sheehan & Lewis, 1992). Though none of these CAT algorithms for classification has been used in K–12 state assessment, they have the potential to fulfill some, if not all, of the requirements for state assessments.

A computerized classification test (CCT) based on the point estimate of latent ability differs from conventional CAT in that a confidence interval around the current theta (θ) estimate is compared with the cut score to make a classification decision (Kingsbury & Weiss, 1983). Such a CCT stops when the confidence interval around the estimate of the examinee's θ does not contain θc, the cut score on the latent ability scale. If the lower boundary of the confidence interval is larger than θc, the examinee is classified as a master. If the upper boundary of the confidence interval is smaller than θc, the examinee is classified as a non-master. If the confidence interval contains θc, the test continues by administering another item selected at the current estimate of the examinee's latent ability. Items are selected to maximize the measurement precision of the current theta estimate. This classification algorithm is currently used in the NCLEX examinations for PNs and RNs to issue licenses to qualified nurses.

Another computerized classification test algorithm is based on the sequential probability ratio test, which decides between two hypotheses related to pass or fail decisions. Reckase (1983) proposed the use of the SPRT in IRT-based CAT to make classification decisions via statistical testing rather than statistical estimation. The two statistical hypotheses are the null hypothesis, H0: θ = θ0, and the alternative hypothesis, H1: θ = θ1, where θ is the latent ability of the examinee, θ0 is the greatest lower bound of the minimal competency needed to classify an examinee as a non-master, and θ1 is the least upper bound of the minimal competency needed to classify an examinee as a master. The interval between these two boundaries is the indifference region. When conducting a sequential probability ratio test, two nominal classification error rates need to be specified: the Type I (α) and Type II (β) error rates. The upper boundary, A, and the lower boundary, B, for the likelihood ratio are determined by these two error rates:

A ≤ (1 − β)/α  and  B ≥ β/(1 − α).
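As a quick numerical check on the boundaries above (the error rates here are illustrative only, not values from the chapter), setting α = β = .05 gives

\[
A \le \frac{1-\beta}{\alpha} = \frac{.95}{.05} = 19,
\qquad
B \ge \frac{\beta}{1-\alpha} = \frac{.05}{.95} \approx .053,
\]

so, under the decision rule described next, the test classifies an examinee as a master once the likelihood ratio reaches 19 and as a non-master once it falls to about .053.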

Item information for each item at the cut score point is calculated based on the chosen IRT model, and the item with the largest amount of information at the cut score can be selected first. If the likelihood ratio of the item responses at the upper bound of the indifference region to that at the lower bound is greater than or equal to the upper boundary, A, the examinee is classified as a master and the test stops. If the likelihood ratio is less than or equal to the lower boundary, B, the examinee is classified as a non-master and the test stops. If the likelihood ratio lies between the two boundaries, no classification decision can be made, and the test continues by selecting the next item in the pool with the largest amount of item information at the cut score point. When the time or test length limit has been reached and a decision has not yet been made, some method is adopted to force a decision. A common practice is to compare the distance between the logarithm of the likelihood ratio and the logarithms of the two boundaries: if the log likelihood ratio is closer to the logarithm of the upper boundary than to the logarithm of the lower boundary, the examinee is classified as a master; if the reverse is true, the examinee is classified as a non-master. In general, a CCT using the SPRT is more efficient than a conventional computerized adaptive test at achieving the same level of classification accuracy (Ferguson, 1969; Spray & Reckase, 1996).

A 2011 NCME session titled "Computerized Adaptive Tests for Classification: Algorithms and Applications" was devoted to the discussion of updated CAT algorithms for classification purposes. A CAT system has been developed to serve the dual functions of classification and growth measurement to fulfill the requirements placed on state assessment programs by federal policy (Kingsbury, 2011). A more recent development in SPRT-based CCT uses a composite hypothesis instead of a point hypothesis structure for multiple-category classification decisions (Thompson & Ro, 2011). Further talks addressed a testlet-based computerized mastery test based on Bayesian decision theory (Smith & Lewis, 2011) and an item selection algorithm that uses CAT to measure latent ability and latent classes based on profiles of attribute mastery (Chang, 2011). The decision-theoretic approach to classification was reviewed with some potential improvements, and the techniques of using influence diagrams and Bayesian networks in CAT for classification were illustrated (Almond, 2011). In addition, a new algorithm for latent class identification in CAT was demonstrated in which theta estimation and latent group membership can be estimated concurrently using a mixture Rasch model based computerized classification test (Jiao, Macready, Liu, & Cho, 2011). All of the presented CAT algorithms for classification purposes may be applied to performance level categorization in state assessments.
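A minimal sketch of the SPRT classification logic described above is given below, assuming a Rasch (1PL) item model; the function sprt_classify, the item bank, and the values of θ0, θ1, α, and β are illustrative rather than taken from an operational program. Items are selected for their information at the cut score (difficulty near the cut), and the forced decision at the maximum test length follows the distance rule described in this section.

```python
import math
import random

# Illustrative SPRT settings (not from an operational program).
theta_0, theta_1 = -0.25, 0.25      # bounds of the indifference region around the cut score
alpha, beta = 0.05, 0.05            # nominal Type I and Type II error rates
log_A = math.log((1 - beta) / alpha)    # upper decision boundary (log scale)
log_B = math.log(beta / (1 - alpha))    # lower decision boundary (log scale)
max_items = 40

def p_rasch(theta, b):
    """Rasch model probability of a correct response to an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def sprt_classify(get_response, item_difficulties):
    """Administer items near the cut score until the log likelihood ratio
    log[L(theta_1)/L(theta_0)] crosses a boundary or the length limit is reached."""
    log_lr = 0.0
    # Items most informative at the cut score have difficulties closest to it.
    ordered = sorted(item_difficulties, key=lambda d: abs(d - (theta_0 + theta_1) / 2))
    for n, b in enumerate(ordered[:max_items], start=1):
        x = get_response(b)                       # observed 0/1 response
        p1, p0 = p_rasch(theta_1, b), p_rasch(theta_0, b)
        log_lr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if log_lr >= log_A:
            return "master", n
        if log_lr <= log_B:
            return "non-master", n
    # Forced decision: classify toward whichever boundary the log likelihood ratio is closer to.
    return ("master" if abs(log_A - log_lr) < abs(log_lr - log_B) else "non-master"), max_items

# Example: simulate an examinee with true ability 0.8 on a bank of 200 random items.
true_theta = 0.8
bank = [random.uniform(-2, 2) for _ in range(200)]
decision = sprt_classify(lambda b: int(random.random() < p_rasch(true_theta, b)), bank)
print(decision)   # e.g., ('master', 12)
```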

Current Challenges Related to CBT in K–12 State Assessments

Large-scale state assessments have long been used to provide summative information about where a student stands in terms of content standards or in comparison with other students. The recent Race to the Top (RTTT) program, like many previous federal education policies, aims at improving student achievement and closing the achievement gap between subgroups of the student population. The more specific requirements related to these goals can be summarized as adopting challenging common core standards across states to keep students competitive nationwide at the same level, accurately measuring students' achievement at different grades and at different time points against high content and performance standards, tracking student achievement growth within and across grades to predict their success and readiness for college and career, and using technology-based assessment methods. One requirement that distinguishes the RTTT program from previous federal programs is that it promotes the exploration and adoption of CBT, especially CAT, in state tests because of its high efficiency and accuracy in estimating ability along a continuum.

CBT and CAT have been in existence for a long time (see the review in Chapter 2) and have been used to fulfill the NCLB assessment requirements. Though an increasing number of states have experience using linear CBT in their state assessment programs in some content areas at some grades, only a few states have experience implementing a CAT in their large-scale state assessment programs. Common content and performance standards and better measures of growth based on a computer testing format will be the major components of the next generation of state assessments (Phillips, 2009). Other challenges associated with the new assessment requirements will emerge. These include how to make linear CBT or CAT more feasible for building formative, interim, and summative assessment into one coherent computer-based testing system, so that continued testing throughout a semester can provide diagnostic information as well as summative data about students' performance at the end of their course work. Constructing tests based on more challenging common content and performance standards among the states and administering the tests on computer is not a simple task for the stakeholders involved in the test development process. Better measurement of growth in large-scale CBT (either linear or adaptive) is a relatively newly tapped area. In addition to the challenges related to developing common core computer-based tests for formative, interim, and summative information, practical and computer technology challenges need to be considered as well.

To sum up, these challenges include cost and logistical considerations from state and test vendor perspectives; test development, including item development by content experts and computer experts; the test delivery algorithm and the measurement theory for psychometric analysis of the item response data collected; and a test delivery system based on computer technology.

The literature has addressed some of the challenges listed above and may provide some advice for dealing with these issues. For example, Luecht (2005) pointed out that CBT may offer obvious advantages in terms of psychometric properties, such as better measurement accuracy and efficiency of testing programs, but the real costs of developing item pools, administering tests, and redesigning systems and procedures for CBT need to be carefully considered, given that most CBT programs have experienced increased testing fees charged to students and large financial investments in software systems and in the transition to CBT, which may not be recovered for many years.

Current literature focuses more on the transition from PPT to CBT (McCall, 2010). Way (2010) proposed a two-year strategy for CAT transition in the K–12 setting: multiple fixed forms can be administered in the first year, and in the second year a CAT pool can be established by combining all online fixed forms. Recently, three states, Idaho, Utah, and Oregon, shared their experiences in transitioning from PPT to CBT. They focused on the issues of statewide implementation of a CBT, policy and technology challenges, technical limitations, political will, cooperation with vendors, program design, infrastructure establishment, and potential problems (Alpert, 2010; Cook, 2010; Quinn, 2010). Suggestions were made that planning is critical in the transition, that advantages and disadvantages need to be balanced, that the technical readiness of schools has to be developed, and that maintaining a dual program is not cost-effective but might be a reality during the transition period (Carling, 2010).

A large body of literature focuses on the comparability of PPT and CBT or CAT. Recent years have seen increasing research addressing the comparability of CBT and PPT in K–12 settings (Bennett et al., 2008; Lottridge, Nicewander, Schulz, & Mitzel, 2008; Way, Davis, & Strain-Seymour, 2008). The results have not been consistent, probably because of variation in measurement settings, statistical sampling, the wide spectrum of CBT administration conditions, students' computer experience (Kingston, 2009), different test-taking strategies, academic subjects, item types, and examinee groups (Paek, 2005). Two meta-analyses on the equivalence of CBT and PPT were recently conducted on K–12 mathematics and reading tests (Wang, Jiao, Young, Brook, & Olson, 2007, 2008). For both mathematics and reading tests, administration mode was found to have no statistically significant effect on students' achievement.

Kingston (2009) synthesized the results of 81 studies of multiple-choice items from 1997 to 2007 in K–12 settings, ranging from elementary and middle school to high school and covering English, language arts, mathematics, reading, science, and social studies. He found that grade had no effect on comparability, but subject did: English language arts and social studies showed a small advantage for CBT, while mathematics showed a small advantage for PPT. Moreover, Kingston (2009) summarized other issues investigated in comparability studies, such as different student subgroups, students' computer experience, socioeconomic status, gender, the impact of monitor quality, network quality, test speededness, and student preference for a particular mode. The NCLB Federal Peer Review Guidance (4.4) and the AERA, APA, and NCME Joint Standards (4.10) can be used as applicable standards in establishing comparability. These studies suggested that comparability claims can likely be supported using evidence related to design, administration, analyses of item statistics and scores by test, group, ability, and region, and classification consistency (Domaleski, 2010).

Related to the psychometric properties, several decisions need to be made (Way, Davis, & Fitzpatrick, 2006). These include the type of CBT algorithm (linear vs. adaptive, adaptation at the item or item group level, ability point estimation vs. classification), the item response theory (IRT) model, content balancing, item exposure and test overlap rates, the width and depth of the item pool, the scoring of constructed-response items, the use of a common stimulus (reading passages, science scenarios), how to conduct quality control for CBT, field testing items in CBT programs, and the life span of items in the item pool. In terms of measurement models, the choice makes a great difference: a Rasch model based CAT can support a reduction of 20% in test length compared to a conventional test, while a three-parameter IRT model based CAT can support a reduction of 50%. However, the 3PL model based CAT may select some highly discriminating items very often, causing unbalanced use of items in the pool (Way, 2010).

The literature has addressed some of the issues related to computer-based testing in K–12 state assessments. (Some other technical issues, such as item exposure control measures and content balancing, are not reviewed here; interested readers are referred to Chapter 8 for a detailed review of these two topics.) With the intention of highlighting the key issues related to the new assessment challenges in the K–12 setting, a conference was organized by the Maryland Assessment Research Center for Education Success (MARCES) with a focus on computer-based testing and its impact on state assessment. All book chapters are based on the conference presentations by their respective authors.

Contributions of the Book Chapters

This book organizes the chapters based on the conference presentations into three sections. Section one includes Chapters 2 through 5. It reviews history, presents the perspectives of states and test vendors on K–12 computer-based tests, and shows operational considerations in implementing CBT in the K–12 setting. Section two, consisting of Chapters 6 through 10, addresses the technical issues related to the use of innovative items, scoring such items, extracting diagnostic information from a CBT, constructing a vertical scale, and using an adaptive item selection method to increase student engagement in the testing process. Section three (Chapter 11) presents a successful example of integrating learning and assessment and the use of simulation- and game-based assessment formats in the field of Cisco Networking. We felt this example is so important that it stands by itself. We hoped to have a comparable example from military testing, but the presenter was unable to provide a chapter. An e-copy of the presentation slides is available at http://marces.org/conference/cba/MARCES%20Segall%20Moreno.ppt. More detailed information about each chapter follows.

Section one starts in Chapter 2 with a review of the history and current practice of computer-based assessment in K–12 education and predictions for the future of CBT. By reviewing the history and relevant literature, the authors summarize the advantages of computerized testing in a very comprehensive manner; in addition to elaborating the gains from using CBT, they list the disadvantages as well. To guarantee successful implementation of CBT in schools, training, teamwork, and the involvement of technology staff are especially recommended. Last, CBT items from different testing programs are demonstrated.

A state perspective in Chapter 3 illustrates the need for a comprehensive system integrating technology, assessment, and professional development of teachers. It reports the results from a survey about current state implementation of technology in assessments. Subsequently, the authors present an integrated system of computer technology, assessment, instruction, and teacher professional development. Potential impediments to the success of the system are summarized and elaborated in terms of infrastructure, security, funding, sustainability, local control, and building an appetite for online systems. Finally, some recommendations are made for future directions.

From a test vendor's perspective, Chapter 4 addresses the challenges of using computer-based assessments for multiple purposes and what can be done by both test vendors and the state to best prepare for computer-based tests. The services provided by test vendors related to CBT are elaborated in terms of project management, content development, test administration, scoring and reporting, psychometric and research support, and dealing with multiple contractors. Considerations related to transition strategies, measurement issues, and operational issues are recommended for states preparing for CBT operations.

Chapter 5 discusses CBT operational implementation issues. Eight systems that make up a CBT enterprise are summarized: the item development and banking system, the test assembly and publishing system, the registration and scheduling system, the test delivery system, the post-administration psychometric processing system, the results management and transmittal system, the score reporting and communication system, and, finally, the quality control and assurance system. The chapter can be used as a blueprint for CBT development and as a quality control procedure for CBT implementation.

Section two addresses the technical and psychometric challenges in meeting the new assessment requirements in CBT. Chapter 6 introduces different types of innovative items for assessment purposes, ranging from formative classroom assessment to summative large-scale assessment, and discusses the different challenges of using innovative items in each context. The author highlights the psychometric challenges related to test form construction, to equating when more complex measurement models are applied to item response data, and to score reporting when multiple dimensions are to be measured with innovative items.

Chapter 7 emphasizes the conceptual and scientific basis for scoring innovative items in CBT, reviews currently available automated scoring systems, and offers some comments on the science of scoring and the future of automated scoring. The idea that scoring should be considered from the initial stage of test design is highlighted.

Chapter 8 reviews diagnostic measurement models and the technical details related to item selection algorithms, content balancing, and item exposure control techniques in cognitive diagnostic model based CAT. The implementation of a CAT for diagnostic purposes is illustrated using a real example from China with a Q-matrix constructed before test development. Diagnostic CAT is further expanded to a multistage CAT for cognitive diagnosis purposes in Chapter 9, which presents an ongoing study applying a loglinear diagnostic model to an Algebra II benchmark test to construct a multistage diagnostic CAT. The chapter reports the creation of the Q-matrix with the help of content experts, pilot testing, standard setting, administration, and reporting of results, as well as the next steps to convert this process to a multistage computerized test.

Chapter 10 discusses the creation of a common scale based on vertical scaling and reports on the stability of a vertical scale to determine whether the measurement of student achievement levels and achievement growth between grades and across years is valid based on the common scale. Student engagement in the testing process is discussed, and a new item selection method is suggested for CAT to motivate students to fully engage in the testing process based on the student engagement data collected.

The last chapter (Chapter 11) starts by describing the four-process delivery model under the framework of evidence-centered design and then introduces an automatically scored, simulation-based assessment and a game-based assessment. It highlights the idea of assessing students in ways more closely tied to daily learning activities and real-world tasks.

Conclusion

CBT is an inevitable trend in K–12 state assessment programs. The construction of CBT is a system of systems (Luecht, 2010). Many decisions need to be made and issues addressed before a quality CBT system can be constructed. These decisions and key issues include the following:

1. Dual modes versus a single mode. If dual modes exist concurrently, comparability issues need to be addressed.
2. Intended use of the CBT system. The expected outcomes of the test should be clarified, including whether the test is used to obtain point estimates of students' abilities, to classify students into different performance or proficiency categories, to model growth, to provide diagnostic information, or to serve as a comprehensive system of interim, formative, and summative assessments.
3. Linear versus adaptive. A decision is needed between linear CBT and CAT algorithms.
4. Adaptive at the item level versus the item group level. If a CAT algorithm is adopted, a decision has to be made about whether adaptation occurs at the item level or the item group level, namely a choice between item-level CAT and multistage CAT.
5. Item format. Aligned with the knowledge, abilities, and skills to be assessed, proper item formats should be chosen: whether only multiple-choice items will be used or some constructed-response items are needed, and whether innovative items are necessary to properly assess the content standards.
6. Scoring. If constructed-response or innovative items are used, a choice has to be made between human scoring and automated scoring in terms of necessity, quality, and implementation.
7. Measurement model. An item response theory model has to be chosen from a variety of options depending on the characteristics of the test design and the item response data. These include one-parameter, two-parameter, three-parameter, and four-parameter logistic IRT models as well as polytomous IRT models. A selection between a unidimensional and a multidimensional IRT model needs to be made, and a decision about the use of diagnostic measurement models is needed as well.

8. Item pool construction. Once the use of the test is determined, a proper item pool needs to be constructed to serve the measurement purposes. For example, if CAT is adopted, the process of developing an item pool of adequate size and high psychometric quality should be planned; if classification is the testing purpose, the item pool should be constructed with more items around the cut score points.
9. Test delivery in CAT. As there are multiple options at each stage of CAT delivery, decisions must be made about how to start, how to select the next item, and when to stop.
10. Score reporting. Working together with the test users, clarification is needed regarding the information to be included in score reports for individual students, teachers, schools, districts, and the state. A decision has to be made about reporting a single total score, subscores, diagnostic profiles, or all of these.
11. Modeling growth in CBT/CAT. A proper growth model needs to be chosen before the test is implemented. If a common scale is needed for modeling growth, the approach to constructing it, such as vertical scaling, should be well planned starting at the stage of test form construction or CAT implementation.
12. Equating, linking, and vertical scaling in CBT/CAT. These technical details need to be specified clearly before test administration.

The construction of a CBT system for K–12 state assessments is a complex project, and the design of such a system needs deliberate planning. The chapters in this book provide the perspectives of states and test vendors on making it happen. The new challenges of adopting computer-based tests in K–12 state assessments are particularly addressed in relation to the use of innovative items and the scoring of those items, providing diagnostic feedback to facilitate learning and instruction, constructing vertical scales to track growth, and building smarter tests in a CBT/CAT system. The chapters address the most challenging issues in implementing CBT, regardless of delivery format, in K–12 state assessments. They do not intend to provide conclusions or perfect solutions to the issues addressed. Rather, they intend to provide some of the guidance needed to succeed, as well as food for thought, and to lead the field further down the path of using assessment results to promote student learning and achievement.

References Almond, R. (2011, April). Utilities and quasi-utilities for classification. Paper presented at the Annual Meeting of the National Council on measurement in Education. New Orleans, LA. Alpert, T. (2010, June). Oregon’s online assessment. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI. Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker, Inc. Bennett, R. E., Braswell, J., Oranje, A., Sandene, B., Kaplan, B., & Yan, F. (2008). Does it matter if I take my mathematics test on computer? A second empirical study of mode effects in NAEP. Journal of Technology, Learning, and Assessment, 6(9). Retrieved from http://www.jtla.org Carling, D. (2010, June). Using computers for testing—it’s not just plug and play. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI. Chang, H. H. (2004). Understanding computerized adaptive testing: From Robbins-Monro to Lord and beyond. In David Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 117–133). Thousand Oaks, CA: Sage Publications. Chang, H. (2011, April). Making computerized adaptive testing a diagnostic tool. Paper presented at the Annual Meeting of the National Council on measurement in Education. New Orleans, LA. Cook, S. (2010, June). Idaho’s online testing: Boots on the ground report. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI. Domaleski, C. (2010, June). Addressing NCLB/Peer review. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI. Eggen, T. J. H. M. (1999). Item selection in adaptive testing with the sequential probability ratio test. Applied Psychological Measurement, 23, 249–261. Eggen, T. J. H. M, & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60, 713–734. Ferguson, R. L. (1969). The development, implementation, and evaluation of a computerassisted branched test for a program of individually prescribed instruction. Unpublished doctoral dissertation, University of Pittsburgh. Grist, S. (1989). Computerized adaptive tests. ERIC Digest No. 107. Harmon, D. J. (2010, June). Multiple perspectives on computer adaptive testing for K–12 assessments: Policy implication from the federal perspective. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI. Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26(2), 44–52. Jiao, H., Macready, G., Liu, J., & Cho, Y. (2011, April). A mixture Rasch model based computerized classification test. Paper presented at the Annual Meeting of the National Council on measurement in Education. New Orleans, LA.

Keng, L. (2008). A comparison of the performance of testlet-based computer adaptive tests and multistage tests. Unpublished doctoral dissertation, University of Texas, Austin.

Kingsbury, G. G. (2011, April). Adaptive testing for state accountability: Creating accurate proficiency levels and measuring student growth. Paper presented at the Annual Meeting of the National Council on Measurement in Education. New Orleans, LA.

Kingsbury, G. G., & Weiss, D. J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237–254). New York: Academic Press.

Kingston, N. M. (2009). Comparability of computer- and paper-administered multiple-choice tests for K–12 populations: A synthesis. Applied Measurement in Education, 22, 22–37.

Kolen, M. J. (1999–2000). Threats to score comparability with applications to performance assessments and computerized adaptive tests. Educational Assessment, 6, 73–96.

Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367–386.

Lin, C.-J. (2010). Controlling test overlap rate in automated assembly of multiple equivalent test forms. Journal of Technology, Learning, and Assessment, 8(3). Retrieved from http://www.jtla.org

Lottridge, S., Nicewander, A., Schulz, M., & Mitzel, H. (2008). Comparability of paper-based and computer-based tests: A review of the methodology. Monterey, CA: Pacific Metrics Research.

Lu, R. (2009). Impacts of local item dependence of testlet items with the multistage tests for pass-fail decisions. Unpublished doctoral dissertation, University of Maryland, College Park, MD.

Luecht, R. (2005). Some useful cost-benefit criteria for evaluating computer-based test delivery models and systems. Journal of Applied Testing Technology. Retrieved from http://www.testpublishers.org/Documents/JATT2005_rev_Criteria4CBT_RMLuecht_Apr2005.pdf

Luecht, R. (2010, October). Operational CBT implementation issues: Making it happen. Presentation at the Tenth Annual Maryland Assessment Conference, College Park, MD.

Luecht, R., Brumfield, T., & Breithaupt, K. (2006). A testlet assembly design for adaptive multistage tests. Applied Measurement in Education, 19(3), 189–202.

McCall, M. (2010, April). Validity issues in computerized adaptive testing. Paper presented at the Annual Meeting of the National Conference on Student Assessment. New Orleans, LA.

Paek, P. (2005). Recent trends in comparability studies. San Antonio, TX: Pearson Education, Inc.

Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical considerations in computer-based testing. New York: Springer-Verlag.

Phillips, G. W. (2009). Race to the Top Assessment program: A new generation of comparable state assessment. Denver, CO: United States Department of Education Public Hearings.

Quinn, J. (2010, June). Computer based testing in Utah. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI.

Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237–254). New York: Academic Press.

Sheehan, K., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement, 16, 65–76.

Smith, R., & Lewis, C. (2011, April). Computerized mastery testing: Using Bayesian sequential analysis to make multiple classification decisions. Paper presented at the Annual Meeting of the National Council on Measurement in Education. New Orleans, LA.

Spray, J. A. (1993). Multiple-category classification using a sequential probability ratio test (Research Report 93-7). Iowa City, IA: ACT, Inc.

Spray, J. A., & Reckase, M. D. (1996). Comparison of SPRT and sequential Bayes procedures for classifying examinees into two categories using a computerized test. Journal of Educational and Behavioral Statistics, 21, 405–414.

Stark, S., & Chernyshenko, O. S. (2006). Multistage testing: Widely or narrowly applicable? Applied Measurement in Education, 19(3), 257–260.

Thompson, N., & Ro, S. (2011, April). Likelihood ratio-based computerized classification testing. Paper presented at the Annual Meeting of the National Council on Measurement in Education. New Orleans, LA.

Wainer, H. (1990). Introduction and history. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 1–20). Hillsdale, NJ: Lawrence Erlbaum Associates.

Wald, A. (1947). Sequential analysis. New York, NY: Wiley.

Wang, S., & Jiao, H. (2005, June). Cost-benefits of using computerized adaptive tests for large-scale state assessment. Paper presented in the session “Technical and Policy Issues in Using Computerized Adaptive Tests in State Assessments: Promises and Perils” at the Annual Meeting of the National Conference on Large-Scale Assessment. San Antonio, TX.

Wang, S., Jiao, H., Young, M. J., Brooks, T. E., & Olson, J. (2007). A meta-analysis of testing mode effects in grade K–12 mathematics tests. Educational and Psychological Measurement, 67, 219–238.

Wang, S., Jiao, H., Young, M. J., Brooks, T. E., & Olson, J. (2008). Comparability of computer-based and paper-and-pencil testing in K–12 reading assessments: A meta-analysis of testing mode effects. Educational and Psychological Measurement, 68, 5–24.

Way, D. (2010, June). Some perspectives on CAT for K–12 assessments. Paper presented at the Annual Meeting of the National Conference on Student Assessment. Detroit, MI.

Way, W. D., Davis, L. L., & Fitzpatrick, S. (2006). Practical questions in introducing computerized adaptive testing for K–12 assessments. San Antonio, TX: Pearson Education, Inc.

Way, W. D., Davis, L. L., & Strain-Seymour, E. (2008). The validity case for assessing direct writing by computer. San Antonio, TX: Pearson Education, Inc.


Part I Computer-Based Testing and Its Implementation in State Assessments


Chapter 2

History, Current Practice, Perspectives and what the Future Holds for Computer Based Assessment in K–12 Education

John Poggio and Linette McJunkin
University of Kansas

The history of computerized testing in education spans five decades, which is a diminutive presence when viewed along the historic timeline of education; however, the many technological advancements that have occurred within the period have and will continue to change the face of educational assessment dramatically. The passing of the No Child Left Behind Act in 2002 shifted the focus of education onto assessment, therein almost certainly ensuring the growing reliance on technology and the benefits associated with the improvements within the field. The advantages of computerized testing, including immediate score reporting, cost reductions, and the promise of increased test security, have extended the appeal of computer-



delivered tests in today’s high-stakes testing environment (McIntire & Miller, 2007; Parshall, Spray, Kalohn, & Davey, 2002; Wise & Plake, 1990).

Several developments within computerized testing need defining prior to expanding on the historical progression of the technological advancements. Computerized testing and computer-based tests, or CBT, are interchangeable terms; however, the types of computerized tests are not. Consequently, delineation of the most common test types is appropriate to avoid confusion and uncertainty later in the text. The third edition of Educational Measurement (Linn, 1989) offered a chapter centered on presenting the origination of computerized educational measurement. In that chapter, the authors noted the four generations of computerized testing, beginning with the simplest, the computerized test, proceeding through computer adaptive tests, and extending the discussion into continuous measurement and intelligent measurement (Bunderson, Inouye, & Olsen, 1989).

The simplest form of CBT is the delivery of a paper-and-pencil (PPT) test via computer and is typically referred to as a linear or fixed form test (Computer Assisted Testing, 2010; Patelis, 2000; Poggio, Glasnapp, Yang, & Poggio, 2005). This mode of delivery involves simply transferring the test items onto a computer screen sequentially, beginning with the first question, then the second, and so on (Computer Assisted Testing, 2010; Impara & Foster, 2006; McIntire & Miller, 2007; Patelis, 2000). As indicated by the terminology, the order of test items is fixed, but the format allows examinees to omit, review, and revise answers to items (Patelis, 2000). This pattern is similar to PPT and therefore is familiar to students, making explanation and implementation straightforward. Utilizing this delivery format has several benefits, including standardization of instructions and test item delivery, immediate scoring, cost reduction from eliminating printing and mailing charges, possible reduction in test-taking time, as well as the inclusion of audio and video modifications (Choi & Tinkler, 2002; Computer Assisted Testing, 2010). On the other hand, if identical form administration is a concern, this model of test delivery could introduce some undesirable security issues (Patelis, 2000). In many school situations, an insufficient number of computers makes it necessary for students to take their tests at different times, thus extending the testing window, which also leads to test security issues.

Linear-on-the-fly tests (LOFT) minimize this concern by administering unique forms that have been assembled at the opening of each test session to adhere to predetermined content and psychometric specifications (Patelis, 2000). Using test assembly algorithms, multiple forms can be created from one item pool, typically allowing for item overlap (Drasgow, Luecht, & Bennett, 2006). Additionally, to ensure adequate content coverage, item selection strategies can be set to draw a stratified random sample (Impara & Foster, 2006). Like linear tests, LOFT allow examinees to omit, review and


revise responses, while extending the advantages of reducing item exposure through randomization of test item ordering or selection (Patelis, 2000). Computer-adaptive tests, or CAT, are tests that adapt to the student based on responses during the testing process (Computer Assisted Testing, 2010). A CAT operates from a bank of items that are based on item properties (i.e., difficulty, guessing likelihood, and discrimination) in relation to examinee ability through the application of Item Response Theory (IRT) models. Although the summarization of Item Response Theory (IRT) in a few sentences is certainly not sufficient for understanding the theory, it is reasonable to do so, as it will allow for clarification of how the theory promoted the development of computerized adaptive testing. When applying an IRT model to an adaptive test, the examinee is presented with an item that is designed to have a difficulty (p value) of .50 for that particular examinee, in other words, the test taker has a 50–50 chance of successfully answering the item (McIntire & Miller, 2007; Thorndike, 2005). Since IRT models employ items that have a calibrated (that is, numerical analysis) position on an underlying trait scale, the test continually presents items based on the examinee’s response (Bunderson et al., 1989). Items that are too easy or too difficult provide no information relative to that individual, consequently, presenting items that are close to the examinee’s ability level and item response allows the test to essentially “zero-in” on the examinee’s true ability level (Patelis, 2000; Thorndike, 2005). There are several benefits with this capability, including that the process requires fewer items than traditional PPT, therefore reducing testing time while addressing overexposure of items (McIntire & Miller, 2007; Patelis, 2000; Thorndike, 2005; Williamson, Bejar, & Mislevy, 2006). Similarly, there is an increase in test security since it is unlikely that two individuals will take the same collection of test items, and thus they will not see an identical test, primarily because the pattern of right and wrong examinee responses and the consideration of item exposure produces the test (Computer Assisted Testing, 2010; Patelis, 2000; Thorndike, 2005). We mention without elaboration that CAT, and in most respects fixed form CBTs, also augments control of item displays, increases display capabilities, improves security, reduces measurement error of examinees’ scores, allows for the construction of tests and creation of items by computer, and are used to maximize the test information function (Bunderson et al., 1989; Drasgow et al., 2006). In addition to the aforementioned features, specialized CBTs have also flourished in education, particularly in the area of medical training. Branched or Response-Contingent tests are those reliant upon examinee responses, which allow examiners to present hypothetical patient scenarios to students in order to measure problem-solving skills or sequentially dependent response patterns (Computer Assisted Testing, 2010). Sequential tests are used generally to make classification decisions, such as graduation,


and are another form of CBT that seems to have found a specialized niche (Computer Assisted Testing, 2010). By presenting items that have been ranked in order of precision necessary to successfully complete, these tests are administered until classification can be made and the test ends. These tests are different from CAT because, though the tests are individualized, the items are delivered based on an ordered pattern designed to measure a continuous trait, whereas CAT items are delivered based on the examinee trait level (Computer Assisted Testing, 2010). Another testing mode that is becoming common in specialized areas is the “testlet” design (Wainer, 2010). These tests utilize fixed selections of items that are grouped using content specifications, are typically ordered based on item difficulty, and are administered to students in fixed blocks (Patelis, 2000; Wainer & Kiely, 1987; Wainer, Bradlow & Wang, 2007). Testlet items allow developers to present blocks based on subject matter, difficulty, or two- and multi-stage testing which involves delivering two or more blocks of items (Patelis, 2000). A reading comprehension passage with a handful of questions is a typical example of a “testlet.” The benefits of these “mini-tests” are increased efficiency, evaluation of content by experts to review individual, pre-constructed testlets, and the ability of examinees to skip, review, and change answers within a test stage (Drasgow et al., 2006). Having defined the modes of delivery in computerized tests, now our historical evaluation can be less restricted by technical jargon and is consequently more focused on the noteworthy technological innovations and the realization of computerized testing in education within the last fifty years. The timeline begins by examining the role of computers in educational testing prior to 1960, progressing through the decades to current practices, ending with possible future applications. A Look Back . . . The first notable invention that affected educational assessment prior to 1960 occurred in a high school science classroom when Reynold B. Johnson provided a glimpse of the possibilities associated with mechanical and technological engineering in education with his development of the first workable test-scoring machine (Wainer, 2010). His design was extended in 1934, when Professor Benjamin Wood at Columbia collaborated with IBM to create a mechanical test-scoring machine (Wainer, 2010). This development has been touted as the single most important development in testing (Brennan, 2006). The first operational computer, MARK 1, was developed in 1944 at Harvard, followed by ENIAC at the University of Pennsylvania in 1946 (Molnar, 1997). These computers were used primarily in engineering, mathematics, and the sciences (Molnar, 1997). Throughout the 1950s, the


predominant function of computers within educational testing remained with the optical scanner and test scoring (Computer Assisted Testing, 2010). While taking traditional multiple-choice PPT, examinees would record responses on a scannable answer sheet, after which, the answer sheets were fed into the optical scanner, which would scan the examinee responses, indicating correct and incorrect responses, and score the test (Cantillon, Irish, & Sales, 2004). The machine sufficiently reduced the cost and time required for grading, stimulated the use of large-scale selected response testing and has been noted as the technological development that solidified the use of multiple-choice items within educational testing (Brennan, 2006; Computer Assisted Testing, 2010; Wainer, 2010). The launching of the Soviet satellite Sputnik in 1957 forced the topic of educational reform as the United States entered the “golden age” of education (Molner, 1997). This drive to advance education and technology within instruction and assessment provided the platform needed for the expansion of computers in education. By 1959, the first large-scale project evaluating the use of computers in education, PLATO, was implemented at the University of Illinois (Molnar, 1997). The development of mainframes in the 1960s broadened the utilization of computers within testing, allowing computers to be used for test score interpretation and test data analysis (Computer Assisted Testing, 2010). In 1963, John Kemeny and Thomas Kurtz instituted the concept of timesharing, which allowed for several students to interact with the computer, developing the easy-to-use computer language BASIC, which rapidly filtered through a variety of subjects and essentially all education levels (Molnar, 1997). Late in the decade, mainframe computers became more accessible within education settings with the addition of multiple terminals, advancements in display information and the ability to accept examinee responses entered via keyboard (Computer Assisted Testing, 2010). Dial-up modems functioning at “blazing” speeds of 10 to 30 transmitted characters per second connected multiple terminals as rudimentary time-sharing software was used to extract examinee responses at each terminal and then transmit that information (Computer Assisted Testing, 2010). Late in the 1960s, the National Science Foundation recognized the emerging technological advancements and instituted the development of 30 regional computing networks, which served more than 300 universities (Molnar, 1997). Though these progressions appear basic when compared to the frequent developments in today’s technological world, these advances opened the door furthering computerized testing in education. Until the 1970s, computers were utilized behind the scenes, predominantly to score tests and process reports (Wainer, 2010). The 1970s brought the development of the minicomputer, which made available hardware that allowed computerized testing to expand once again (Computer Assisted


Testing, 2010). The minicomputer, smaller than mainframe computers, which would fill a room, were nonetheless much larger than the later developed microcomputers (Computer Assisted Testing, 2010). Minicomputers allowed single users to access the installed testing hardware and software, in turn allowing for the standardization of the testing process (Computer Assisted Testing, 2010). Within the test development arena, Frederick Lord (1980) introduced the theoretical structure of a mass-administered, yet individually tailored test (initially termed “flexilevel” tests) by utilizing IRT models to create CAT. The United States military was instrumental in supporting extensive theoretical research efforts on the potential of adaptive tests. Leaders in test development and administration, the military had already administered the Army General Classification Test (AGCT) to more than nine million enlistees, later developing the Armed Forces Qualification Test (AFQT) and finally the Armed Services Vocational Aptitude Battery (ASVAB) (Wainer et al., 2007). Though the desire for adaptability was high, the need for inexpensive, high-powered computing would not be fulfilled until the late 1980s (Wainer, 2010). And in Closer Proximity . . . The fifteen years spanning from 1975 through the 1980s brought a flurry of activity as computers began to be used more frequently in education, and computer administration of tests became more widespread (Wainer, 2010). The emergence of the microcomputer, paired with cost efficient multiprocessors, increased operational research and implementation (Bunderson et al., 1989). The introduction of the microcomputer in 1975 catapulted the personal computer age (Molnar, 1997). Prior to this development, educational systems were reliant on timesharing systems, as earlier computers remained expensive. The supercomputer was launched during this same timeframe, as well as utilization of high-bandwidth communication networks to connect computers, allowing for global access to knowledge and information worldwide (Molnar, 1997). The Commission on Excellence in Education fueled that progression by acknowledging the inundation of computers into everyday activity and conveyed an expectation of computer familiarity by highlighting the lack of computer knowledge within the American student population (Gardner, 1983). With that, large-scale computer assessment projects began to emerge (Bunderson, et al. 1989). Within the initial years of the 1980s, the Army began to complete the development of the computerized adaptive ASVAB (Green, Bock, Humphreys, Linn, & Reckase, 1982). The NSF presented itself at the center of technological advancements in education and research by establishing five supercomputer centers in 1984 for computer communication (Molnar, 1997). By


1985, the NSF had built a national network, NSFNET, providing computer systems to all colleges and university for research and educational purposes, linking over 15,000 networks and 100,000 computers, serving over one million users across the globe (Molnar, 1997). By 1985, Educational Testing Service (ETS) had announced plans to become fully committed to evaluating computerized testing and had solidified that commitment by implementing operational systems (Ward, 1986). Once computer advancements began to take hold at the federal government level and the business sector, the possibilities of development began to filter down to the state level as well. Individual states began to evaluate the effectiveness of computerized assessment as test delivery options. One early example took place in California as the state chose to include a computerized prototype of their Comprehensive Assessment while developing a state-wide assessment system (Olsen, Inouye, Hansen, Slawson, & Maynes, 1984). The emerging transition to computerized testing did not occur without reservations being expressed. It was common as computerized assessment grew in attention to find claims of CBT being impersonal, an experience that would frighten students, along with claims of computer anxiety and the lack of familiarity that would produce non-equivalence to PPT at best. However, as now, research, study and evaluation typically found no statistical difference between PPT and CBT. By the 1980s, computerized tests introduced a new level of standardization not possible with the traditional PPT. The testing format allowed for precision and control of instruction delivery and item displays including timing, as well as audio and visual components (Bunderson et al., 1989). Test security was increased as the new testing mode eliminated the need for paper copies of tests and answer keys and also offered password and security protection to prevent access to item banks and testing materials (Bunderson et al., 1989). To aid in the implementation and utilization of computerized tests, several guidelines were published. In 1986, the American Psychological Association recognized the need for such an outline and announced the Guidelines for Computer-based Psychological Tests and Interpretations. In 1995, the American Council on Education published the Guidelines for Computerized Adaptive Test Development and Use in Education, followed by joint publication of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). In 2002, the Guidelines for Computer-based Testing was introduced (Association of Test Publishers, 2002), ending with the most recent publication, the International Guidelines on Computer-based and Internet-delivered Testing (International Test Commission, 2005). These guidelines collaboratively emphasize score equivalency and stress the importance of proficiency


and understanding needed to ensure the appropriate use of computerized testing (Wang, Jiao, Young, Brooks, & Olson, 2008).

Turning back to test development, a major development of the 1990s was the establishment and utilization of the internet, which furthered computerized testing by establishing the ability to deliver tests via the worldwide web (Computer Assisted Testing, 2010). Though the development has sparked debate regarding the standardization and validity of some web-based tests, as well as discussions involving the differences between PPT and web-based tests, it appears certain that the internet will continue to play a dominant and vital role in computerized testing into the future. The 1990s also saw the evaluation and implementation of CBT within several large testing programs. By October 1992, ETS had introduced the computerized version of the GRE (Fairtest, 2007). Currently, the computer-based version of the test is the exclusive mode of delivery in the United States, Canada, and several other countries, with the paper-based version of the test only being offered in parts of the world where computer-based testing is not available (ETS, 2011). In 1999, the National Center for Educational Statistics commissioned the Technology-Based Assessment project, designed to gather information from 2000–2003 regarding the effectiveness of implementing computer-based testing as a test delivery option (NCES, 2006). That same year, CBT was implemented as the delivery mode of the US medical licensure exam (Cantillon et al., 2004).

Arriving at the Present

Within the last decade, technological advancements have once again changed the face of computerized testing by increasing flexibility, complexity, and interactivity, and by allowing for the introduction of multimedia and constructed response item components, not to mention the growing development of automated scoring of constructed response items (Williamson et al., 2006). In the 21st century, computerized testing has seen a steady increase in statewide assessments. States such as Idaho, Kansas, Oregon, and Virginia have been joined by others, including Indiana, Nebraska, North Carolina, Washington, Iowa, Kentucky, and South Carolina, in implementing computerized delivery of assessments (Poggio et al., 2005). Today, it is common for a state to offer its assessment to schools both in an online version and as the traditional PPT. The advancements and availability of high-powered, low-cost desktop computing have presented test developers with the opportunity to individualize tests and incorporate new types of questions that were previously not possible (Wainer, 2010). With advancements in computerized testing, ETS announced plans to evaluate a computerized version of the SAT (Fairtest, 2007). Beginning in August of 2011, ETS will


change the GRE format by presenting the test in a format that allows the examinee to edit or change answers, skip answers within a section, and allows for the use of an on-screen calculator (ETS, 2011). The test will also utilize new item types within the verbal and quantitative reasoning sections. The new items incorporate open-ended questions as well as the traditional multiple-choice items, questions requiring multiple correct answers, and questions that require students to highlight sentences within a reading passage to answer the question (ETS). In addition to utilizing the adaptive format for the verbal and quantitative sections, the inclusion of the new item types creates a novel method for obtaining individualized information. As the cost of computers continues to decline and complete societal pervasiveness of the technology increases, the costs associated with printing and maintaining PPT continue to increase, in a sense forcing PPT to the periphery of testing, to be used only when absolutely necessary (Brennan, 2006). However, based on the economic factors associated with transferring testing systems from traditional formats to computerized delivery systems, the likely cost efficiency of optical scanning will ensure the survival of PPT for some time, probably well into this decade. That being said, there are benefits associated with CBT that continue to make this delivery mode very desirable. In the next section, we address what we see as the real strengths of computerized assessment in K–12 education going forward. Advantages of CBT going forward Below we have briefly detailed what appear to be many of the advantages associated with CBT during this period of large scale movement to online testing in education. For each we have also pointed to necessary research studies that would help to confirm or advance our thinking and therefore school practice. We note that specifying advantages carries with it a responsibility to understand the risks that could be inherent; we caution the reader as well. Immediacy of Results Among the most notable features, and already mentioned in this chapter, associated with computerized testing is the immediacy of results. In most systems as soon as a student finishes taking the test, item data and scores are ready for display. Delays and slowness in reporting results can be a concern of the past. So quick is the capability to return performance results that results can be shared with examinees progressing from item to item if desired. How best to capitalize on this feature, and avoid any un-


necessary and unwitting consequences, deserves attention of the research community. Cost Efficiency Moving tests to computer delivery will signal a reduction in costs. While upon first consideration this may not seem obvious, production of test booklets and answer sheets are extremely large expenditure items that can be set aside. And many of the related expenses of testing, notably repetitive (or standardized) tasks over time (booklet creation, shipping, scanning, etc.), are largely eliminated. Deserving of serious consideration is the development of a sound and dependable public domain alternative that educators and classroom teachers can use whenever desired. Exploration and efforts along this route merit national attention and resolution. Accuracy in Data Recording and Thus, Data Collection Capturing information from students on answer sheets is among the biggest negatives of the testing process—and something too infrequently discussed. A casual and informal review of information gathered on answer sheets over a number of years revealed the considerable number of errors that students make. Answers to questions and the information provided, such as gender, race, age, and even name are frequently mis-coded and mis-marked by students. Computer based assessment all but eliminates this source of error in the measurement process. When the error does persist, the key to elimination of this source of CBT error is review over time as well as the necessary correction of computer code when found. Student Motivation A number of researchers and field practitioners have commented on heightened level of student motivation when assessments are taken online, and research data in the form of student self-reports at all grades confirm this claim. This in turn translates into higher performance. Whether such student motivation will be seen over time is not clear; consequently the question remains, will motivation for the task be maintained over time? Though future speculation will vary, at this time students appear more attentive, comfortable, and responsive to computer rather than paper and pencil assessment. As noted before, studies of this aspect deserve consideration coupled with strategies that support greater testing motivation.


Adaptive Testing Alternative

Moving traditional tests to computer administration has opened the door to computer adaptive testing and its alternatives (i.e., testlet testing structures). This approach has the advantage of reducing traditional test length by half, and in some instances by as much as 75% of the original paper and pencil length. In this methodology, while we have a considerable reduction in test length, there is relatively little loss in test properties such as validity and reliability, not to mention the assessments may be fairer (all students experience the same degree of challenge). However, and not to be too Pollyanna-ish, adaptive testing has important preconditions that must be attended to. Most notably, a viable (sufficient data for investigative psychometric analyses), strong (supporting item parameter studies) and large (targeted item pool extending beyond 300 to 350 items at a minimum) item bank is required. Another precondition is a well-defined and understood construct dimensionality.

Reduced Administrator and Instructor Effort

As the tests will be administered online, the traditional activities by administrators, staff and teachers of receiving, unpacking, securing, counting, sorting, and distributing test booklets and answer sheets are eliminated. This management effort requires considerable commitment of time and attention but is essentially eliminated when we move to computerized testing. One drawback at the start of CBT in some schools will be the need for multiple testing sessions when there is an insufficient number of computers to handle all examinees at the same time. Again, the consequences of delaying testing for some should be evaluated.

Meeting the Needs of Special Populations

Technological advances used with computerized testing signal the dawn of a new era in testing for students with special learning needs and other populations (i.e., non-native speakers). Largely and all too often unstandardized accommodations of the past can now be transformed into common and fixed procedures. For example, increasing text size with the click of a button or via/during the test registration process, transforming text to speech, zooming, colorizing objects, English/native language translations, ASL avatars for the hearing impaired, presentation of one item at a time to the student, and other alterations can be implemented. We are at the start of this exciting time in testing, and attention to the needs and capacities of


these student populations deserve our constant and frequent research attention. Data we have collected signals these students are more motivated given CBT accommodations, and they perform stronger than their peers who received PPT. The potential of computerized testing technologies merits intense study and exploration. Ready Support for Data Based Decision Making The speed with which data can become available to educators allows rapid analysis and evaluation of performance. Whereas in the past it could take months for data to be returned to administrators and teachers, today’s computerized tests allow educators to review results and make instructional and evaluative decisions on behalf of students while the information is fresh and current. This feature clearly signals a need for a renewed effort at score reporting for individual student diagnosis and the study of groups and classes, buttressed by what the technology can support and leverage. Communication with parents and caregivers can be brought to the front as a common expectation, indeed requirement, given the speed, transparency and directness of such communication of results. Not only does the psychology of test result reporting come into play, but the sociology as well. “How are we doing” deserves reconsideration and study beyond print media tables and charts. Store Information, Ready Access to Results, and Create of Archives As students take a test on a computer, the storage and retrieval of all or any information is efficient and speedy. In the past reliance on answer sheets meant data cleanup, scanning and processing, scoring and check up, and eventual readying of information for storage and retrieval. But with computer-based testing, the storage of information is part of the process, and access to that information is efficient and accurate. It is important to note that care must be taken to ensure information is protected and secure at all times and during sharing/transfer. Ability to Modify Tests as Necessary In the past, when an error was discovered in a test booklet, the publisher or educator had little recourse but to stand by and watch the consequence of the error occur time after time during administration. In today’s com-


puterized tests when an error is found it can be corrected immediately. In this way, test scores are improved for many students. How best to handle the effect of mid-testing correction deserves attention and proper resolve (for example, correct the error as soon as practical, but do not count the item until adjustments can be made). Improved Security . . . Some Risks As tests are delivered online, traditional handling of test materials is eliminated. Breaches of test security are fewer. There are threats related to hacking that are not avoided. However, up to this time there have been few reports of security issues because of computerized administration. We must remain attentive and vigilant. Increased Opportunity for Collection of Supplemental Information or Data Computerized testing allows for and supports the collection of additional information. Students can be quickly and easily surveyed (even during or before testing if desired, as screens can be controlled), or additional data can be gathered either with all students or samples of students and items (i.e., multiple matrix sampling designs). Research and evaluation can have an active place in assessment with planning and research designs. Equal ( . . . Stronger?) Performance Research of the past 25 years is clear and certain: paper and pencil testing yields scores equivalent to computer-based testing (Kingston, 2009; Wang, Jiao, Young, Brooks, & Olson, 2007, 2008). We should continue to study these effects. Research advises it has not been an issue. We advise that we study the engine a time or two . . . or three, but then generalize. The real effect to watch is, with student gaining CBT experience and becoming “better” (that is, more proficient test takers), will CBT scores become stronger? One final remark: with the plethora of computerized versus paper and pencil comparability studies now available, the requirement for such basic studies is justifiably at an end. Understand we have concluded that exploring this fundamental question has in our expectation moved to more exacting group difference, test difference questions. As PPT was never, nor should it be, the “gold standard,” it is time to move on.
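One way to read the supplemental-data point above is through the lens of multiple matrix sampling, in which each student answers only a portion of a larger item or survey pool so that the whole pool is covered across the group. The sketch below is a minimal illustration under simple assumptions (a pool cut into equal fixed blocks that are rotated across students); the function name, block sizes, and rotation rule are placeholders rather than any specific operational design, which would typically balance blocks for content and position.

```python
import random

def assign_matrix_forms(student_ids, item_ids, blocks_per_form=2, block_size=10, seed=1):
    """Rotate fixed item blocks across students so every item is sampled,
    but no single student answers the full pool."""
    rng = random.Random(seed)
    items = list(item_ids)
    rng.shuffle(items)
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    forms = {}
    for k, sid in enumerate(student_ids):
        chosen = [(k + j) % len(blocks) for j in range(blocks_per_form)]
        forms[sid] = [item for b in chosen for item in blocks[b]]
    return forms

# Example: 60 supplemental items, 100 students, 20 items per student.
forms = assign_matrix_forms(range(100), range(60))
print(len(forms[0]), sorted(forms[0])[:5])
```

Aggregated across students, responses collected this way can support group-level estimates for the full pool while keeping the burden on any one examinee small.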


Implementation Barriers Following from the observations above, presenting this section as “Disadvantages” would be wide of the mark. We have crossed over to a new set of procedures that rely on technology to deliver and conduct tests and testing. Our sense is to comment on barriers we will face during the transition. We offer the following in no particular order.

Local Resistance to Change Change can come very hard or very slow for some. This new technology brings with it a demand for new skills and understandings that must be mastered. In addition, some will see no need to change from the way that testing has been done. Careful, thoughtful steps must be planned to assist individuals in making the transition. Without assistance and help, change will be very slow for a few. Individuals must be brought into the new technology and made to feel comfortable and be supported. Workshops and hands-on experience that build assurances are essential for these persons.

Local Capacity A school or school district may not have the wherewithal to move to online testing. Equipment may not be current or sufficient, or local staff may not have the expertise to venture into this arena. The inability to move forward with the innovation will in its own way limit adoption and understanding. Fear of this unknown needs leadership’s deliberate attention.

Risk—Not Changing Horses . . . These are high stakes tests for everyone—student, teacher, administrator, parent—and failure may not be tolerated. Careers are on the line and shifting to a new mode of testing may not be seen as wise. “Why change now?” will be the plea. As commented above, awareness and sensitivity are important considerations; plans must be prepared with these factors in mind. Remember, research study after research study has shown in the main that there will be no differences observed in performance, and students will prefer this approach to assessment.


Minimize or Eliminate Obstacles Perceived (and real) impediments to implementation of computerized testing will add to frustration and dissatisfaction. When systems are new to the teacher and testing is pressing forward with the threat of accountability, tolerance for the unexpected will be very low. This frustration may be limited by creating systems that work, ensure teachers and testing officials understand the system, and the system needs to be forgiving. Communication is Essential Frequent, brief and effective communication among all involved is essential. Share information about the program and its activities. Constant communication is necessary; weekly newsletters, announcements, sharing and postings, email and the web are our friends! Build a community of knowledgeable and involved educators (and parents, as this is new to them too), and share the successes and failures encountered. Transparency is essential. Cost Redux Previously, we commented that computerized assessment in the schools should result in lower costs. Let us be fair and alert: Running “Dual Programs” (both PPT and CBT) during the transition period (4 to 5 years) is very, very expensive—about twice the cost initially! Yet, support of the infrastructure to do online testing must not be shortsighted. It must be planned. What was once spent on paper, ink, response forms, scanning, the attention of staff, storage and mailing/shipping, will now be spent on servers, web designers and programmers. Costs can be saved, but initially, it will be more expensive to do testing on both fronts. Illustrations On the pages that follow, we have selected a handful of actual CBT web pages to illustrate what is being offered in education. The images were acquired by the authors in spring/summer 2010 from the publishers’ web pages as demonstration samples. In offering these few views of test items only (and often incomplete views), we run the great risk of misrepresenting what is available from a distributor, or failing to show a particular group or field of educational testing, or not portraying accurately what is there, or failing to show what actually happens when viewed live, and so on. We


strongly encourage and advise the reader to contact a publisher (and again, we apologize that we have not illustrated all below) and ask for electronic access to review what they have, can do, and have in the works. Though clearly not K–12 CBT examples, the illustrations begin with three examples (mathematics, reading and writing) of computerized testing from the Graduate Record Examinations (GRE) produced in the late 1980s. It is interesting to compare what was available then and what is done testing K–12 students today (Figures 2.1 through 2.17). The illustrations are presented without comment, and no evaluation is intended or to be inferred. Later chapters in this book provide additional examples and illustrations by these same authors and publishers. Computer-Based Assessment: Future Attentions and Attractions As we prepare to launch toward new designs for assessment, supported and complemented by electronically delivered and captured assessments, we should be optimistic. Danger on the roadway ahead seems most unlikely; opportunities would appear to be defined only by our limitations. It is a new age for student achievement assessment. The other major domains for edu-

Figure 2.1  GRE Quantitative (ETS, circa 1989).


Figure 2.2  GRE VERBAL (ETS, circa 1989).

Figure 2.3  GRE WRITING (ETS, circa 1989).


Figure 2.4  CTB Acuity Mathematics.

Figure 2.5  CTB Acuity Reading.


Figure 2.6  NWEA Mathematics.

Figure 2.7  ITS Reading.


Figure 2.8  AIR Mathematics.

Figure 2.9  AIR Reading.


Figure 2.10  Pacific Metric Mathematics.

Figure 2.11  Pacific Metric Reading.


Figure 2.12  Pearson Mathematics.

Figure 2.13  Pearson Reading.


Figure 2.14  CAL Mathematics, Selected Response with tools illustrated.

Figure 2.15  CAL Mathematics, Constructed Response.


Figure 2.16  CAL Reading Selected Response.

Figure 2.17  CAL Reading, Constructed Response.


cation assessment (i.e., aptitude, ability, performance, and non-cognitive assessment) need to be nurtured at this time and along similar paths. School, program and student evaluation should not be limited by assessment methodology. Reliance on methods that are biased or fail to have a consistent or sufficient foundation will only limit us. So, one call for future attention is to expand our understanding and knowledge of effects of computerized testing to include the breadth and depth of measurements from domains beyond achievement testing. Many facets of assessment can be presumed to fall into line with achievement testing outcomes, but certainly not all. Will ability testing return to its roots? Will allowing the determination of setting a baseline before proceeding with the measurement be better supported by adaptive testing strategies? Will affective measures find it necessary to reform norms and even norm groups? Will test time or time to first response per item once again emerge as a viable input or intervening factor as it was early in the last century? Will today’s new electronic data capture method instill a reliance on retrospective methodology? This list could be endless. The point is, the thinking from the past alone should not set tomorrow’s agenda for research and development in education assessment. Not only should we stand prepared to reinvest in the underlying measurement processes, but the audiences of our attentions merit thoughtful, deliberate consideration. In earlier sections of this chapter we cited differentiation for student populations. We must not skip or skirt this necessary, constant attention; who is doing well or differently as we shift assessment platforms, and why, begins a litany of known caveats that deserve repeating and renewed investigation. It would also serve us well to reconsider the role of educators in this process; can they (teachers and administrators) better assist, focus and help to improve the assessment process and activities to strengthen the validity of the student’s appraisal(s)? Such considerations can only strengthen our efforts. Peer-to-peer designs during student assessment could emerge utilizing electronic formats relying on simulations, scenarios, scaffolding, and external and controlled guidance when standardized as offerings. Last, we must not overlook the role of parents in this assessment process. Sharing of information is an obvious route; however, participation when possible could offer true advantages when planned and expected. The age old design of the student taking the test in isolation and then awaiting results and action could benefit from careful review and scrutiny. This next point we bring falls well short of a sage prediction, as it is on everyone’s radar: capturing examinee constructed response item responses and adaptive/testlet testing are already upon us. The age of our student population and their readiness, and the content/subject areas served deserve exploration of course. But, associated advances in automated scoring (i.e., artificial intelligence scoring) and automated item generation are not far off. Service based designs (allowing for both human and machine scor-


ing and creation of items in partnership) will offer promise, credibility and assurances for now. The chapter by David Williamson (Chapter 7) later in this text will highlight and inform regarding automated scoring capacity and features, and work by Embretson and Poggio (2012) will point to automated item development and construction. As these technologies grow, expand and improve, we can look forward to shared and combined responsibility in testing. Continuing this view toward the future, we would be remiss were we not to highlight and indeed underscore the potential changing structure of assessment. We begin with a glimpse of the past; in the timeframe of 1900 an instructor stands before the class and carefully, patiently reads aloud test questions to the students, and they respond. Come forward to mid-century, 1950s and the surrounding time, the instructor is supported by copy equipment, test booklets are distributed with machine scoreable answer sheets, and students respond. View the possibility of today’s computerized test; items appear on a screen and the student reads but is also supported by a human voice or synthesize speech to hear the text, thus removing the confounding element of reading comprehension. It is interesting today that when faced with this very doable potential, many educators reject this accommodation. We have trained people too well! In a phrase, we must re-evaluate the purpose and goal of the assessment and allow for student choices that support demonstration of the essential, targeted learned achievement. We should allow and encourage spoken text questions as a standardized feature, when appropriate, for all students to reduce reading as a confounding variable when we measure certain traits and skills. Clearly, some would object to a spoken reading comprehension test passage, but what of the questions that go along to query comprehension? What of a new area to evaluate, now more measurable in terms of validity and reliability through computer standardization, that of listening comprehension? Our goal here is not to settle this issue, but rather point to how the technology empowers change in assessment. Alert, ever learning and attentive with open minds are the necessary criteria for computerized assessments as we go forward. The final point may be the most difficult challenge, and yet it is persuasive and we think essential and necessary. Electronic technology must and should support new designs and expectations for testing on use, the testing/measurement/psychometric specialists. At prior times, testing came at the end of the education experience of learning. Testing stood apart from instruction by definition and design. Today, electronic learning, or eLearning can and should exist in combination with electronic testing, or eTesting. Michael Scriven conceived of formative evaluation (1967), and his message has been repeated and strengthened by others such as Bloom, Hastings and Madaus (1971), Sadler (1989) and Black and Wiliam (1998). Hallmark to each of these treatises is the assertion that assessment must strengthen


learning by joining with the process of instruction. With computerized testing we are on the cusp of this blending of assessment with instruction (Poggio & Meyen, 2009). Selecting or creating test questions that reflect what has been taught (alignment utilizing item properties, i.e., distractors or incorrect responses) and features allowing computers to retrain, reinforce or rehearse specific learned expectations is a reality that must be utilized to the fullest extent possible. Assessment through CBT and in partnership with instruction has an eventful and exciting future.

References

American Council on Education. (1995). Guidelines for computer-adaptive test development and use in education. Washington, DC: Author.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

American Psychological Association. (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.

Association of Test Publishers. (2002). ATP computer-based testing guidelines. Retrieved from http://www.testpublishers.org

Black, P. J., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy and Practice, 5, 7–73.

Bloom, B. S., Hastings, T., & Madaus, G. F. (1971). Handbook of formative and summative evaluation. New York: McGraw-Hill Co.

Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement. New York, NY: Macmillan.

Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of computerized educational measurement. In R. L. Linn (Ed.), Educational measurement. New York, NY: Macmillan.

Cantillon, P., Irish, B., & Sales, D. (2004). Using computers for assessment in medicine. British Medical Journal, 329(7466), 606.

Choi, S., & Tinkler, T. (2002, April). Evaluating comparability of paper-and-pencil and computer-based assessment in a K–12 setting. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Computer Assisted Testing. (2010, May 4). Retrieved from http://pagerankstudio.com/Blog/2010/05/computer-assisted-testing

Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–516). Westport, CT: Praeger.

Embretson, S., & Poggio, J. (2012). An evaluation of select psychometric models to monitor growth in student achievement. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.

ETS. (2011). About the GRE. Retrieved from http://www.ets.org/gre/general/about

Fairtest. (2007). Computerized testing: More questions than answers. Retrieved from http://www.fairtest.org/computerized-testing-more-questions-answers

Gardner, D. P. (1983). A nation at risk: The imperative for educational reform. Washington, DC: National Commission on Excellence in Education.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1982). Evaluation plan for the computerized adaptive vocational aptitude battery (Research Report No. 82-1). Baltimore, MD: Johns Hopkins University.

Impara, J., & Foster, D. (2006). Item and test development strategies to minimize test fraud. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 91–114). Mahwah, NJ: Erlbaum.

International Test Commission. (2005). International guidelines on computer-based and internet delivered testing. Retrieved from http://www.intestcom.org

Kingston, N. M. (2009). Comparability of computer- and paper-administered multiple-choice tests for K–12 populations: A synthesis. Applied Measurement in Education, 22(1), 22–37.

Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York: American Council on Education/Macmillan.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Molnar, A. (1997). Computers in education: A brief history. THE Journal (Technological Horizons in Education), 24(11). Retrieved from http://thejournal.com/Articles/1997/06/01/Computers-in-Education-A-Brief-History.aspx?Page=1

McIntire, S. A., & Miller, L. A. (2007). Foundations of psychological testing: A practical approach. Thousand Oaks, CA: Sage.

National Center for Educational Statistics. (2006). Technology-based assessment project. Retrieved from http://nces.ed.gov/nationsreportcard/studies/tbaproject.asp

Olsen, J. B., Inouye, D., Hansen, E. G., Slawson, D. A., & Maynes, D. M. (1984). The development and pilot testing of a comprehensive assessment system. Provo, UT: WICAT Education Institute.

Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical considerations in computer-based testing. New York, NY: Springer-Verlag.

Patelis, T. (2000). An overview of computer-based testing (RN-09). New York, NY: The College Board.

Poggio, J., Glasnapp, D., Yang, X., & Poggio, A. (2005). A comparative evaluation of score results from computerized and paper & pencil testing in a large scale state assessment program. Journal of Technology, Learning, and Assessment, 3, 4–30.

Poggio, J., & Meyen, E. (2009, April). Blending assessment with instruction. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119–140.

Scriven, M. (1967). The methodology of evaluation. In R. E. Stake (Ed.), Curriculum evaluation (American Educational Research Association Monograph Series on Evaluation, No. 1). Chicago: Rand McNally.

Thorndike, R. M. (2005). Measurement and evaluation in psychology and education (7th ed.). Upper Saddle River, NJ: Pearson.
Wainer, H. (2010). Computerized adaptive testing: A primer (2nd ed.). New York, NY: Routledge.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. Cambridge, England: Cambridge University Press.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201.
Wang, S., Jiao, H., Young, M. J., Brooks, T. E., & Olson, J. (2007). A meta-analysis of testing mode effects in grade K–12 mathematics tests. Educational and Psychological Measurement, 67, 219–238.
Wang, S., Jiao, H., Young, M. J., Brooks, T. E., & Olson, J. (2008). Comparability of computer-based and paper-and-pencil testing in K–12 assessment: A meta-analysis of testing mode effects. Educational and Psychological Measurement, 68, 5–24.
Ward, W. C., Kline, R. G., & Flaugher, J. (1986, August). College Board computerized placement tests: Validation of an adaptive test of basic skills (Research report). Retrieved from http://www.eric.ed.gov/PDFS/ED278677.pdf
Williamson, D. M., Mislevy, R. J., & Bejar, I. I. (2006). Automated scoring of complex tasks in computer-based testing. Mahwah, NJ: Erlbaum.
Wise, S. L., & Plake, B. S. (1990). Computer-based testing in higher education. Measurement and Evaluation in Counseling and Development, 23(1), 3–10.


Chapter 3

A State Perspective on Enhancing Assessment and Accountability Systems through Systematic Implementation of Technology

Vincent Dean and Joseph Martineau
Michigan Department of Education

Introduction

States vary widely in terms of how far they have gone in shifting from paper and pencil based large-scale assessments to utilizing technology as the platform for test delivery. A few states have been quite aggressive in this regard and have successfully met the numerous content, technical, and policy challenges required for U.S. Education Department (USED) approval as having met the requirements for standards and assessment systems (USED, 2009) under the Elementary and Secondary Education Act (ESEA) as reauthorized



by the No Child Left Behind Act of 2001 (USED, 2002). Other states, however, have barely begun the process of exploring this shift in technology. Despite this unevenness and the relative infancy of implementation, there is strong agreement among state education agencies that computer based assessment is the best way to solve a number of problems that include dealing with unprecedented fiscal challenges, connecting with a K–12 population of digital natives, and providing instructionally relevant results in a manner and timeframe that is more useful to educators and policy makers. The purpose of this chapter is to describe some of the challenges and potential benefits of computer based assessment from the perspective of a state education agency, and also to articulate how technology can become an essential component of a balanced assessment and accountability system.

Current State and Federal Requirements

States are required to adhere to an extensive set of requirements under the current version of the ESEA. While there is broad agreement that the essential elements for high quality assessment are appropriate for ensuring that things such as content alignment and technical adequacy are thoughtfully considered, there has been significant concern about the peer-review process utilized by the U.S. Education Department (USED, 2009). Not surprisingly, states have submitted both consistent and divergent types of evidence that have met the various requirements at different times, to different panels of reviewers, and have received varying feedback on where they are strong or deficient. While not perfect, the peer-review process has helped drive many conversations among states, vendors, and other stakeholders to determine how best to meet these elements. On top of the requirements enforced by the USED, each state has unique legislation, policy, and contexts within which to develop and administer its large-scale assessment system. Michigan, for example, has legislation requiring that a recognized college-entrance examination be administered as part of its high school assessment. Because computer-administered assessment is not required under current federal requirements, in at least one state the legislature has been the impetus for moving in this direction (Utah, 2004). As described in the next section, federal support for this trend has been increasing.

National Reform and State Reform Efforts and Changing Stakes

Federal funds provided to states under the American Recovery and Reinvestment Act—or ARRA (U.S. Government Printing Office, 2009)—through


programs such as the State Fiscal Stabilization Fund—or SFSF (USED, 2010a)—and Race to the Top—or RTTT (USED, 2010c)—encouraged enhanced utilization of technology. Because it was optional for states to participate in these programs and receive funds, state agencies perceived that the federal government was including elements it would like to see as part of its vision for moving education forward. These included enhanced data and reporting systems, assessments that could be administered with very rapid turnaround of results, and less traditional delivery of instruction such as online courses. Specific components that USED desired to see in the next generation of assessment systems can be found in the application instructions for the RTTT assessment competition (USED, 2010b). The consortia of states that applied for these funds were required to assure that they would construct and administer via computer an operational assessment by the 2014–2015 school year. These states were also expected to address measurement of higher-order thinking skills and describe how they would utilize technology to return results to educators much more quickly than had been possible with paper and pencil tests. Another aspect of the reform movement is the unprecedented increase in the stakes for educators, states, and, most of all, students. As state education agencies attempt to advocate for and move their stakeholders towards adopting reforms with tremendous implications, there is a corresponding need to utilize technology. Computerized testing is a key link in this chain. Many states, as they worked to create a policy landscape capable of attracting RTTT funds, passed legislation that tied test scores to teachers and other educators such as administrators. This link was made in order to facilitate reforms such as merit pay, school accreditation, and making value-added judgments about how individual teachers were impacting student achievement. Another example of this federal preference can be found in the requirements for states and local districts that desire funds under the School Improvement Grant program, also funded under ARRA. The impact of these grant funds was expected in large part to be measured by substantive gains in student achievement.

Assessments Based on the Common Core State Standards

A significant portion of funds made available for states under the RTTT competition was set aside to develop high quality assessment systems to measure the Common Core State Standards, or CCSS (USED, 2010b). In order to be eligible for these resources, states were required to come together and apply for them as consortia. As the dust settled during the second round of


RTTT, two consortia emerged, applied, and were eventually funded in approximately equal measure. These two entities are the SMARTER1/Balanced Assessment Consortium (SBAC, 2011) and the Partnership for Assessment of Readiness for College and Careers (PARCC, 2010). Each consortium comprises a number of governing states (those that have committed to working with only one entity but have full voting privileges on all aspects of the project) and advisory states (those that may participate in both groups but have limited authority). Figure 3.1 below displays the current consortia membership and status (governing or advisory) as of February 2011. Both consortia are committed to having operational assessments developed and ready for administration in the 2014–2015 academic year. One funding condition specified that these operational assessments must utilize technology to significantly improve administration, scoring, and reporting. The primary administration platform will be computers or other digital devices, items will be scored electronically as much as possible (e.g., Artificial Intelligence (AI) engine scoring of constructed response items), and the results will be provided to educators, parents, and policy makers in a timeframe conducive to informing an individual student's immediate instructional needs. While there are many similarities between the two consortia with

Figure 3.1  RTTT consortia membership map.


regard to outcomes and goals, a significant difference lies in how assertive each is being in terms of maximizing the potential benefits of technology. The PARCC consortium will be using Computer Based Testing (CBT) and SBAC will develop a Computer Adaptive Testing (CAT) platform. CBT has the advantage of being more “tried and true” but may require students to participate in longer assessments. CAT can result in more efficient administration and improved measurement precision but requires that a substantially larger item pool be developed (a minimal item-selection sketch appears at the end of this section). In addition, both consortia will be including significantly more diverse item types (e.g., multiple-choice, constructed response, performance tasks, technology-enhanced items, etc.) than any state currently utilizes for high stakes tests. How a greater number of item types combines with automated scoring and computer-delivered testing creates a significant number of research questions that each organization must explore prior to operational administration. In addition to their base funding, the PARCC consortium received a supplemental award to pursue research and development to improve technology-enhanced items and AI scoring engines. SBAC received supplemental resources to facilitate improved access to these computerized assessments for English language learners by developing a significant number of versions in other languages. In support of maintaining these two systems, both consortia are planning to make the technology they develop as consistent with open-source platforms as possible.

Current State Implementation of Technology in Assessment

As was mentioned in the introduction, many states have made significant progress in moving from traditional paper-based assessments to computers. In order to inform the development of this chapter, Michigan conducted an informal survey of all fifty state assessment directors and the District of Columbia in October of 2010. All but one state submitted a response to the survey; responses for the one state that did not respond were gleaned from a review of its state assessment website. Of the 51 entities, 44 currently have Computer-Based Testing (CBT) initiatives, which in a few states include the use of adaptive engines. This includes:

• 26 states currently administer at least some version of their large-scale general population assessments online
• 15 states have formal plans to begin (or expand) online administration of large-scale general population assessments
• 12 states currently administer special populations' (e.g., alternate or English language proficiency) assessments online


• Three states have formal plans to begin (or expand) online administration of special populations' assessments

While these results show that states are indeed moving toward technology-enhanced assessments, it should be noted that very few states have undertaken large-scale implementation, and no state has implemented technology-enhanced assessment across the board for all student populations and all assessments. Additional questions were designed to gauge state adoption of CBT options such as Artificial Intelligence (AI) scoring of constructed response items, implementation of interim (e.g., benchmark or unit) assessments, and Computer Adaptive Testing (CAT). The results of these questions showed:

• Seven states currently use Artificial Intelligence (AI) scoring of constructed response items
• Four states currently use Computer Adaptive Testing (CAT) technology for general population assessment, with one more moving in that direction soon
• No states currently use CAT technology for special populations' assessment
• 10 states offer online interim/benchmark assessments
• 10 states offer online item banks accessible to teachers for creating “formative”/interim/benchmark assessments tailored to unique curricular units
• 16 states offer End of Course (EOC) tests online, or are implementing online EOC in the near future

Again, this shows that small numbers of states have implemented these portions of technology-enhanced assessment. Other information yielded by the survey shows that in some areas, only a very few states have made much progress. For example, six states offered computer based testing (CBT) options on general population assessments as an accommodation for special populations (e.g., students with disabilities). Only four states reported piloting and administration of innovative item types (e.g., flash-based modules providing mathematical tools such as protractors, rulers, compasses). Finally, six states reported substantial failure of their large-scale online program, resulting in cessation of computer based testing. Going forward, some reported having recovered and moving back online, while others reported no immediate plans to return to online testing. This finding carries tremendous weight in the atmosphere of a nearly universal move to online testing among states. The care with which these initiatives move online is of utmost importance if they are to result in useful, usable systems for the states in the consortia.
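To make the contrast between fixed-form CBT and adaptive CAT concrete, the following minimal sketch (in Python) shows the core of an adaptive item-selection loop under a Rasch model: after each response the ability estimate is updated, and the next item is chosen to maximize statistical information at the current estimate. This is an illustration under simplifying assumptions (a dichotomous item bank, a coarse grid-based ability estimate, and a fixed-length stopping rule); the names run_cat, item_bank, and answer_fn are hypothetical and do not describe any state's or consortium's operational engine.

    import math

    def prob_correct(theta, b):
        # Rasch model: probability of a correct response at ability theta, item difficulty b
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def item_information(theta, b):
        # Fisher information of a Rasch item at ability theta
        p = prob_correct(theta, b)
        return p * (1.0 - p)

    def estimate_theta(responses):
        # Crude maximum-likelihood ability estimate over a coarse grid from -4.0 to 4.0
        grid = [x / 10.0 for x in range(-40, 41)]
        def log_lik(theta):
            total = 0.0
            for b, u in responses:
                p = prob_correct(theta, b)
                total += math.log(p) if u == 1 else math.log(1.0 - p)
            return total
        return max(grid, key=log_lik)

    def run_cat(item_bank, answer_fn, max_items=20):
        # item_bank: dict of item_id -> difficulty; answer_fn(item_id) returns 1 or 0
        available = dict(item_bank)
        responses = []
        theta = 0.0                      # start in the middle of the scale
        for _ in range(max_items):
            if not available:
                break
            # pick the unused item with maximum information at the current estimate
            item_id = max(available, key=lambda i: item_information(theta, available[i]))
            b = available.pop(item_id)
            u = answer_fn(item_id)       # scored response: 1 correct, 0 incorrect
            responses.append((b, u))
            if len({u for _, u in responses}) > 1:
                theta = estimate_theta(responses)   # ML update once responses are mixed
            else:
                theta += 0.7 if u == 1 else -0.7    # simple step rule early on
        return theta, responses

Operational CAT engines layer content-balancing constraints, exposure control, and more refined estimation on top of this basic select-administer-update cycle, which is one reason the item-pool demands noted above are so much larger than for a fixed form.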


Technology-Enhanced Assessment Benefits

The potential benefits from an assessment platform that is technology based are clear and enticing to educators at all levels. It is exceedingly tempting to imagine that we can engage students in a medium that (1) captures their interest and attention in a manner superior to paper and pencil tests, (2) yields near-immediate student level and aggregate reports to guide instruction, (3) facilitates transparency in regard to how assessment results play out in accountability metrics, (4) permits improved measurement of students with unique learning needs, and (5) does all of the above with superior cost-efficiency. Below, each of these elements is explored in more detail.

Appropriate for Digital Natives

One component of the college and career readiness conversation deals with the importance of so-called “soft skills” (e.g., interpersonal communication, creative thinking, ability to complete standardized forms, etc.) that traditionally have not been measured well on large-scale mathematics and reading tests. Navigating an online or other computer-based environment is one of these skills that incorporating technology into assessment can help measure in new ways. Not only have today's students grown up with unprecedented access to technology, many have been utilizing increasingly sophisticated devices at younger ages. Proficiency with technology is unquestionably an essential skill for college and career readiness, and incorporating such things as technology-enhanced items into the next generation of assessments provides the opportunity to measure this construct. Effective application of technology can expand both the nature of the content presented and the knowledge, skills, and processes that can be measured (Quellmalz & Moody, 2004). As states explore avenues where student learning happens in locations other than brick and mortar school buildings, online assessment administration becomes critical. For example, Michigan has a few districts approved to provide a large percentage of high school student credit through online courses, and has two K–12 virtual schools with approximately 400 students each who receive 100 percent of their instruction online. As this expands, it is anticipated that eventually many high school students will complete their graduation requirements without ever setting foot in a school building. As they will have access to all their courses online, it is important that they are assessed in the same milieu.


Reporting at the Speed of Light(ish)

As the results from assessment systems are used for increasingly high stakes, it is essential that results be delivered into the hands of educators and policy-makers as soon as possible. While it is possible that individual student data can be provided almost immediately after the student takes a test, it is unlikely that aggregate data files (e.g., school, district, and state level reports) will be instantly available. Because of the stakes, it is essential that the data undergo effective and thorough quality-control and assurance processes that, while often automated, do take some time. That being said, results from paper-based assessments do not arrive until weeks or possibly months after the test has been administered, and technology can cut this timeframe down to days or even less, depending on the level of results aggregation. This may be complicated as the next-generation assessment systems currently being developed increase in complexity over their paper-based predecessors. For example, both SBAC and PARCC are developing summative assessments that include a wide variety of item types (e.g., multiple-choice, constructed response, technology-enhanced, and performance task items). A key element in maximizing reporting speed is the use of Artificial Intelligence (AI) scoring for extended written responses, technology-enhanced items, and performance tasks whenever possible. Both consortia expect to utilize AI scoring engines for constructed response items in their proposed systems, and the results will be used for high stakes purposes. To assure stakeholders that the scoring is done reliably, human read-behind of at least a portion of the student responses is important to include (Council of Chief State School Officers & Association of Test Publishers, 2010). This may have an impact on reporting speed, depending on how many human scorers are needed to verify the AI engine's work. While not completely perfected for all objective scoring purposes, the AI engines currently available have demonstrated sufficient reliability and consistency with human scorers to justify their inclusion in several state assessment systems. Each generation of these automated scoring programs brings significant improvement, and it is expected that, through research and development by organizations such as PARCC and others, even more robust AI engines will be available by the time the two large consortia roll out their new tests in the 2014–2015 school year. It is certainly possible, however, that even by then these next generation systems will include items that elicit student responses that AI engines are not yet ready to score reliably. In those instances, the consortia will need to plan on developing a large network of trained hand-scorers, likely through the use of a vendor, that can receive, score, and submit results for incorporation in the overall data files very quickly. This will be a logistical challenge and expensive due


to the sheer numbers of potential student responses to be scored, but most states have experience with this type of model.

Transparency Appropriate for the Stakes

Thurlow, Elliott, and Ysseldyke (2003) noted that common challenges to transparency include current state data systems and the reports provided to stakeholders from statewide testing. The need for a transparent system has never been greater than in the present. As wide an array of stakeholders as possible needs the opportunity to understand, and to participate as needed in revising, the next generation of assessment and accountability mechanisms. As the stakes continue to rise while states wrestle with complex issues such as tying student test scores to teachers, merit pay, value-added accountability, and so on, there should be frequent opportunities to challenge and revisit how these ideas are implemented. Like any effort that is initially policy-based, state education agencies must take responsibility for providing those on the receiving end of accountability metrics, and the general public, with the chance to guide revisions of the system based on research that will be conducted following implementation. Failing to incorporate these types of mechanisms will result in assessment and accountability systems that mirror the inflexible and unyielding ones of the past, the only change being in how far and fast they can reach due to the incorporation of technology.

Improved Measurement Precision for Special Populations

One of the areas of greatest potential for incorporating technology into large-scale assessment programs lies in deeper and richer assessment of students who have difficulty accessing general assessments. Thompson, Thurlow, and Moore (2003) noted that computer-based tests have the capacity to incorporate features designed to enhance accessibility for students with disabilities and English language learners. As noted above, several states already administer online or technology-enhanced versions of their assessment to some of these students in an effort to really get at what they know and are able to do. The complex potential manifestations of English language acquisition and the numerous recognized disability categories often confound, or at least severely restrict, what inferences can be made about achievement from standardized paper-based tests. In order to combat this challenge, states have adopted wide ranges of accommodations that students may utilize and have created one or more


alternate assessments. These alternate assessments are designed to permit students to demonstrate what they have learned on instruments appropriate for their level of functioning. Typically, these are students with disabilities so severe that they cannot appropriately access the state's general assessment, no matter how many accommodations they might be provided. Computerized assessments often allow students to control some of the presentation features (e.g., text size, highlighting tools, text to speech, etc.), which both facilitates access to the items and promotes independence. There are some students, however, whose disabilities make it likely that they will not be able to access computers or other devices. These students, typically those with severe cognitive impairments, are required to be included in statewide assessment programs in order to ensure that they are being exposed to as much academic content as possible. This is one part of important efforts to include these students in the general curriculum wherever possible and raise the quality of instruction and other services they receive. While a small number of these students may be unable to access computers, technology can be included in the next-generation assessment systems in a number of ways to facilitate enhancements in their instruction. Some state alternate assessments are comprised of performance tasks or rating scales. In both instances, the results are generated based on observing the students engage in some activity. Once the observations are complete, teachers, and in some cases students, could enter the scores into an online system and realize the same reporting speed benefits as those whose students participate in the general assessment. Another possibility is to have the tasks and observations recorded with digital video that is subsequently sent into a central repository for scoring by other educators. Teachers receiving feedback on how to modify and improve instruction from other teachers based on video evidence is an intriguing opportunity to enhance the educational programs for students with a variety of needs. The state of Idaho has recently implemented an alternate assessment of this type (Idaho State Department of Education, 2010). As the two assessment consortia develop their online engines, there is a wealth of opportunities to enhance access for students with disabilities and English language learners. In particular, the computer-adaptive approach of SBAC, which seeks to make the test adaptive not just by item difficulty but also by item type, holds immense promise for increased measurement precision in populations of highly diverse learners.

Positive Budget Impact

Moving to online or at least computer-based assessment will likely take a significant amount of stress off the world's forests. The sheer volume of paper involved in traditional assessments is quite impressive, matched by the


equally impressive amount of resources states have allocated to print, ship, and scan it. In addition to the cost savings that states may potentially realize by shifting away from paper, not having to rely on paying individual humans to hand-score thousands of student responses is another potential boon of technology. The AI scoring possibilities described above have the potential to save a great deal of money while simultaneously permitting a resurgence in constructed response items. Many states have been forced to eliminate constructed response items on most of their large-scale programs due to being unable to sustain hand-scoring costs.

Improved Test Security

The more high-stakes the system, the more likely security breaches become. Several states, such as Georgia, have recently experienced a significant number of schools impacted by cheating or other means of tampering with state assessment administration or scoring. Online assessment permits a higher degree of control over user-role access, limiting users to the elements of the system appropriate to their station. In addition, technology can be used throughout the scoring process to automate ways of detecting cheating. For example, it is possible to build erasure analysis algorithms into automated scoring systems. Taking advantage of programs like this has the potential to improve confidence in the data without delaying reports. Development of these systems must carefully take into account the push and pull between transparency and usefulness (e.g., open-source access) and the need to assure stakeholders that assessment results are valid and reliable.

Integrating Technology with Balanced Assessment and Accountability Systems

Taking advantage of the benefits described above is essential to states having educational systems that prepare students for college and the workforce in a globally competitive environment. Figure 3.2 below depicts one view of the type of balanced assessment and accountability system that states have been unable to build to date (Martineau & Dean, 2010). This model is necessarily somewhat complex as it attempts to account for the needs of educators essential for effective implementation in addition to the multiple outcomes that are required for the high-stakes systems recently being developed and implemented. For the purposes of this chapter, we briefly touch on each of the bands found in the model (Professional Development, Content & Process Standards, etc.) and how each could benefit from maximizing the use of technology.
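As one illustration of the kind of automated detection mentioned under Improved Test Security, the sketch below (Python) flags groups whose count of wrong-to-right answer changes (the online analogue of erasures) is improbably high relative to a statewide baseline rate. The normal approximation, the minimum-count rule, and the z-score threshold are illustrative assumptions rather than an operational forensic model, and the example numbers are invented; real programs use richer evidence and pair any statistical flag with human investigation before drawing conclusions.

    import math

    def flag_answer_changes(groups, baseline_rate, z_threshold=4.0):
        # groups: dict of group_id -> (wrong_to_right_changes, total_answer_changes)
        # baseline_rate: statewide proportion of answer changes that are wrong-to-right
        flagged = []
        for group_id, (wr, total) in groups.items():
            if total < 30:
                continue                      # too little data for the approximation
            expected = baseline_rate * total
            sd = math.sqrt(total * baseline_rate * (1.0 - baseline_rate))
            z = (wr - expected) / sd
            if z >= z_threshold:
                flagged.append((group_id, round(z, 2)))
        # most extreme groups first
        return sorted(flagged, key=lambda pair: pair[1], reverse=True)

    # Invented example: classroom "A-101" shows 88 wrong-to-right changes out of
    # 120 total changes against a 45 percent statewide baseline and would be flagged.
    suspect = flag_answer_changes({"A-101": (88, 120), "B-202": (51, 110)}, baseline_rate=0.45)

Because a check like this can run as part of routine automated scoring, it supports the point made above that security monitoring need not delay reporting.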

Figure 3.2  A balanced assessment & accountability system. Used by permission from Martineau & Dean (2010).



Professional Development

An essential starting point with this model is recognizing that professional development forms the footings of a balanced system. State agency assessment experts will continue to struggle with helping the general public understand how these assessments, and the results they yield, should be used. Other authors have argued that the results from standardized instruments are inappropriate to be utilized for measuring the effectiveness of teachers (Popham, 2000). We strongly recommend that all institutions of higher education involved with training teachers and administrators build thoughtful assessment and data literacy coursework into all of their programs. There is a cycle at play here that must be addressed, or educators will be ill-prepared to challenge the actions of policy-makers and politicians who succumb to the temptation to use these instruments in ways that are contrary to their initial purpose. As the available assessments and data from them improve, and the same tests are used across larger numbers of districts and states, the temptation to use them for comparing education at all levels will rise. In a nutshell, better tests lead to higher-stakes use; this reveals inadequacies in the tests, which creates demand for better tests, and so on. It is easy to find examples of this cycle since NCLB, with rapid acceleration of this conversation brought about by RTTT. This can only be tempered by giving educators a requisite amount of assessment and data literacy. Not only is this an issue of equity if test scores increasingly become part of their evaluations or compensation, it is the only way that they will be able to use the results of the powerful new assessment systems on the horizon to positively impact student achievement. All the technological enhancements in the world applied to assessments will be meaningless if the end users are functionally illiterate with regard to purpose, application, and relevance. There are a number of specific instances where the use of technology can help remediate this challenge. The development of high-quality online training programs that contain well thought out interactive features is critical to cost-efficient scaling up of balanced assessment literacy. Educators also need opportunities to share ideas and thoughts about system implementation through the use of social networking and live coaching. This can be accomplished through the use of electronic (graphic, audio, video) capture for distance streaming of materials, plans, and instructional practice vignettes over high speed networks. This would facilitate discussion regarding instructional practice among end users of the system and other stakeholders.


Content & Process Standards

We define the second critical layer of a balanced system as content and process standards. This involves starting with a limited set of high school exit standards based on college and career readiness. Following articulation of these exit standards, K–12 content/process standards in a logical progression to college and career readiness can be fleshed out. Based on these learning progressions, instructional materials and other aspects of a full-fledged curriculum (e.g., targeted instructional units, decisions about pedagogy, etc.) can be put together. After these decisions have been expressed, technology can play an essential role in delivery, consistency, and comparability. Educators should have access to an online clearinghouse of materials that contains lesson plans, suggested materials, and video vignettes of high quality instruction for each unit or lesson. This clearinghouse should be constructed via a flexible platform that enables users to submit items to the repository in a variety of accessible formats. Finally, the clearinghouse should have the capability to capture and clearly display user-moderated ratings of the quality and utility of each item. For example, viewers of a video-captured lesson plan should be able to assign a rating based on how successful it was when attempted in the viewer's specific context. A key task to accomplish before moving deeply into assessment practices is to classify content standards in three ways: timing, task type, and setting. In terms of timing, we identify three ways in which content standards should be measured: (1) on-demand in a timed fashion (e.g., fluency standards in math or reading); (2) on-demand, but untimed (e.g., standards that can be measured in a short time frame but do not require fluency); and (3) using methods that allow for feedback from educators as students go through multiple steps of a complex task. In terms of task type, we identify four classes of response types for individual content/process standards, including selected response (e.g., multiple choice items), short constructed response (e.g., short answer items), extended constructed response (e.g., essays, showing work, building graphs), and performance events (e.g., complex tasks measuring higher order skills in an integrated fashion). Finally, in terms of setting, we identify two classes of content/process standards: those that should be the province of classroom testing only, and those that are appropriate for secure assessment. Based on the classifications described above, several types of assessment must be developed in order to fully measure each aspect of a rich system of content and process standards.
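The timing, task-type, and setting classification just described lends itself to a simple machine-readable encoding, which becomes useful once item banks and test blueprints need to reference the same categories. The sketch below (Python) is a hypothetical illustration: the category names mirror the text, but the ContentStandard structure and the example standard are invented for this example.

    from dataclasses import dataclass
    from enum import Enum

    class Timing(Enum):
        TIMED_ON_DEMAND = "on-demand, timed (e.g., fluency standards)"
        UNTIMED_ON_DEMAND = "on-demand, untimed"
        MULTI_STEP_WITH_FEEDBACK = "extended task with educator feedback across steps"

    class TaskType(Enum):
        SELECTED_RESPONSE = "selected response (e.g., multiple choice)"
        SHORT_CONSTRUCTED = "short constructed response"
        EXTENDED_CONSTRUCTED = "extended constructed response"
        PERFORMANCE_EVENT = "performance event"

    class Setting(Enum):
        CLASSROOM_ONLY = "classroom assessment only"
        SECURE = "appropriate for secure assessment"

    @dataclass
    class ContentStandard:
        code: str
        description: str
        timing: Timing
        task_type: TaskType
        setting: Setting

    # Invented example standard, classified as untimed on-demand, short constructed
    # response, and eligible for secure assessment.
    example = ContentStandard(
        code="MATH.8.EE.X",
        description="Generate equivalent expressions and justify the equivalence.",
        timing=Timing.UNTIMED_ON_DEMAND,
        task_type=TaskType.SHORT_CONSTRUCTED,
        setting=Setting.SECURE,
    )

Tagging each standard with these three attributes is what would allow an assessment system to route standards automatically to classroom-only tools, secure adaptive pools, or portfolio submission.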


Classroom Formative & Summative Assessment

While formative and summative assessment practices are separate disciplines, we combine them here for the purpose of making a distinction about the level of implementation. There is a need for both formative and summative assessment practices at the classroom level, and for them to augment what can be reliably accomplished through large-scale administration. Popham (2008) modestly revised a definition of formative assessment developed by the Council of Chief State School Officers' State Collaborative on Assessment and Student Standards, a group dedicated to promoting the use of formative processes, as follows:

Formative assessment is a planned process in which assessment-elicited evidence of students' status is used by teachers to adjust their ongoing instructional procedures or by students to adjust their current learning tactics. (p. 6)

The implementation of formative assessment practices such as teacher-student feedback loops can be enhanced by technological aids such as the variety of commercially available response devices (e.g., clickers, tablet computers, phones). Students need the ability to provide rapid responses to teacher queries over online systems and remote responses to formative queries (e.g., in rural areas and virtual schools). Educators also need the capability to select and develop summative classroom assessments at will. For example, a teacher should be able to create on-demand, micro-benchmark (e.g., small unit) assessments. This necessitates the construction of a non-secure item bank, ideally populated with content developed by other educators. It should be customizable to fit specific lesson plans or curricular documents, yield instant reporting for diagnostic/instructional intervention purposes, and inform targeted professional development in real time. Developing and implementing a component like this has the potential to be a very powerful tool for teachers. However, it is essential that this level of the system remain purely formative in nature. This means that the results must not be used for large-scale accountability purposes; they belong entirely to the teachers and principals and are specific to their unique context.

Secure Adaptive Interim & Summative Assessment

With high-quality classroom assessment practices in place, secure, large-scale assessments can be developed to support accountability metrics. Note that in the proposed system, all secure assessments are given in an adaptive manner where possible. We define an adaptive environment as in Sands,


Waters, and McBride (1997), where large pools of test items are available with wide-ranging levels of difficulty so that each student's achievement is measured by items tailored to the student's individual level of achievement. While each individual student gets a different set of test items, each student can still receive a set of items that measures achievement on the same content standards as every other student. Such tailored testing has the significant advantage that fewer items can generally be used to obtain a high degree of precision, and each student's achievement can generally be measured with comparable degrees of precision. However, adaptive testing requires the development of massive numbers of items to support each intended purpose. For example, multiple item pools are required for summative and interim/benchmark tests. There are three types of large-scale assessments that we believe should be supported in a balanced system.

Repeatable, on-demand, customizable, online, unit assessments. The Race to the Top Assessment competition description called these “through-course” assessments. These are unit assessments targeted toward individual “packages” of content standards being taught in a specific unit within a course. These through-course assessments have several purposes:

• They provide an advance look at a student's trajectory toward end-of-year proficiency as students finish units throughout the year.
• They also provide students with multiple opportunities to demonstrate proficiency, rather than a single test at the end of the year.
• Because there would be multiple opportunities for students to demonstrate proficiency throughout the year, the use of these results for high-stakes accountability purposes is more defensible.
• They provide opportunities for mid-course corrections in instructional practices (like those used in Response to Intervention).
• They provide information that can be useful in designing differentiated instruction.
• They provide an opportunity for the end-of-year assessments to be eliminated for many students, provided they pass all unit assessments.

In this system, we anticipate moving beyond traditional CAT/CBT by including AI scoring of constructed response items, technology-enhanced items, performance tasks and events (through simulations), and even gaming-type items.

End of year summative assessments. This is the traditional end of year summative assessment. It is important to retain this type of assessment to assure that students who pass all through-course unit assessments are also able to pass the traditional end of year assessment. This assessment is also


necessary for those students who do not pass all through-course assessments, to demonstrate proficiency at the end of the full year of instruction. Three groups of students must take the end of year summative assessments, first so that those assessments can be adequately developed and then so that they can be adequately maintained. They are:

1. An initial scaling and calibration group during development to assure that the scales are appropriately constructed.
2. Ongoing randomly selected validation groups during maintenance (to validate that students proficient on all required unit tests retain proficiency at the end of the year).
3. Students who do not achieve proficiency on all required unit tests.

End of year summative assessments give students a final opportunity to demonstrate overall proficiency if proficiency was in question on any single unit assessment. The use of through-course assessments should allow for the elimination of a single end-of-year test for many students.

Portfolio development and submission. Because not every content standard can be measured using test items or tasks that can be scored electronically, it is important to retain the option for submission of portfolio-type materials for scoring as well. Like the Idaho alternate assessment system (Idaho State Department of Education, 2010), these could include scannable materials, electronic documents, and/or audio/video of student performance submitted via a secure online portfolio repository. Because these are unlikely to be scorable entirely using AI, and it is still important to have rapid turnaround, these should leverage technical infrastructure to be scored on a distributed online scoring system that prevents teachers from scoring their own students' portfolios (e.g., Idaho's alternate assessment portfolio scoring site). Besides being able to measure higher order skills using portfolio submissions, these can also be scored both for final product and for development over time, which may indeed be reflections of different content and/or process standards.

Accountability

Each of the summative assessment types described in the preceding sections can contribute valuable information to accountability models. Some scholars have addressed this by creating growth models that measure progress toward a standard (Thum, 2003). In such models, it is imperative that students who are not on track to achieve the end-game policy objective have a growth measure. However, when measuring growth toward meeting an ultimate expectation, it is not necessary to have a measure of


growth for students who are already on track to meeting a rigorous expectation. Student growth in regard to performance on standardized achievement tests is one common aspect of value-added accountability models. However, no model of this type currently incorporates data from the full complement of assessment types described above. As the stakes have risen sharply for individual teachers and principals, valid concerns have emerged about inferences drawn from existing value-added models. We suggest that only through the systematic implementation of a balanced assessment and accountability system, built on the foundation of robust professional development, does the opportunity for strong multidimensional reporting exist to support valid value-added models.

Impediments to Building Technologically Robust Assessment Systems

Infrastructure

The fiscal challenges to redesigning and deploying technology-rich assessment systems are not to be underestimated. The initial investment for developing systems that will only work on technology platforms with reasonable technical specifications is only half of the equation. In addition, the opportunity for states to stop their current programs and completely shift funds to the new, superior systems has not been provided. States are required, for the sake of continuity, to continue administering their existing assessment programs. As noted above, these remain paper based in the majority of states. This creates a period of time where dual systems exist in various stages of construction, implementation, and phasing out. While necessary to some degree, this is a very expensive undertaking that states struggle to maintain. Complicating this is the wide disparity in regard to technological infrastructure in local districts. Access to high-speed internet, the computer-to-student ratio, and other calculations must be factored into the total expenditure for bringing technology to bear in assessment. Once the development is complete, there exist substantial costs to support sustainability. For example, the funds provided to SBAC and PARCC under the Race to the Top competition are specifically to be used only for developing their respective assessment systems, not operational administration. States will be responsible for picking up the tab on administration, scoring, reporting, and how the assessments end up being utilized in each state accountability system. This perpetual investment in administration is expected to be comparable to what states currently pay, but yield much richer, instructionally relevant information. As most states struggle over the next few years to deal with major budget challenges, each education agency


will be asked to justify continuing to expend significant resources on assessment. Combating this will require that the consortia-developed assessments be highly successful starting with their first administration and reporting cycle. This includes adequate resources being spent on public relations and professional development activities to ensure that the general public and end users such as educators are aware of the benefits and enhancements over prior systems. The conversation about ongoing costs includes how states will set aside resources for local districts to deal with recurring hardware, software, and maintenance costs as technology becomes dated and in need of replacement. To this point, concern has been voiced about the rapid speed with which technology is changing. This includes new devices that have implications for how assessment items are displayed, compatibility across system platforms, and other hardware and software variability (National Research Council, 2010). The two assessment consortia, for example, must plan to ensure that the assessments they develop are not technologically obsolete before they become operational.

Local Control

Beyond the substantial funding and level of effort required, the type of comprehensive system described above can only be implemented and sustained with local buy-in. No single state (let alone district) could afford the cost of creating a system with so many interlocking parts, several of which have no well-accepted precedent. The only opportunity to bring adequate resources to bear on this problem is through collaborative efforts like the assessment consortia formed as part of Race to the Top. However, there are inherent risks with these types of entities that must not be ignored. First, consortia, or any organization formed to complete a massive and multi-faceted purpose, can tend towards self-perpetuation over time. Resisting this tendency is essential to ensuring that the work continues to meet the needs of the original members and stays true to the group's fundamental purpose. Second, consortia cannot ignore reasonable needs for flexibility. Every local entity has a unique context, and any system should include components with options for implementation. This can be tricky, as there is a balance to be struck between appropriate flexibility and enough standardization to support comparability. Finally, consortia must monitor member contributions and create opportunities for stakeholders to become invested in the processes and outcomes. Maximizing member involvement and investment is essential to creating a sense of ownership in the system's end users.


Recommendations for the Future

As the next generation of assessments takes form, conversations about how to maximize their utility will continue to evolve. While a great deal of important research will be accomplished by the two large state-led consortia (SBAC and PARCC), there will be more to do after these assessments are deployed and results become available. It is essential that the potential of these next-generation systems be realized. This requires stakeholders to buy in on a number of fronts. Educators and the general public must be presented with a careful cost-benefit analysis that presents a convincing argument. They must be presented with significant improvements in receiving results, connections between families and the school, and improved instructional practices. In addition to these overall constructs, inroads to the more specific areas outlined below must be made.

Research on Psychometrics and Administration

While technological advancements make it more feasible to develop innovative item types that can be scored both objectively and subjectively, it is important that two ideas remain at the fore: (1) new item types are experimental and should not be used for high-stakes purposes until validated, and (2) every attempt to create item types that can be scored objectively should be made before resorting to developing new types of subjectively scored items, to assure that adequate reliability and cost-effectiveness are maintained. As new item types are developed and piloted, it is essential that adequate time and resources be allocated to understanding how they function psychometrically and statistically. As states seek to further their understanding of student learning by incorporating new and more item types than has been the case to date, the temptation to draw inferences before robust validation will be great. There is much yet to be understood about how administering multiple, complex, and new types of items plays out in the context of technology-enhanced assessment. Thorough research must be completed prior to using results from such systems to make judgments about the efficacy of individual teachers and schools. This research agenda must focus on item and test form equating, the comparability of accommodated and alternate versions, and how best to set achievement standards on assessments with unprecedented complexity and consequences.

State and Local Capacity Building

Any new assessment and accountability system that utilizes technology to enhance its components has the potential to make us data-rich and


analysis-poor. If stakeholders and end users are ill-equipped to properly analyze and incorporate assessment and accountability data into curricular and instructional contexts, the impact on student achievement will be modest at best. Building state and local education agency capacity for appropriate analysis based on common definitions is critical. One example of this can be found in the national debate around definitions of 21st century college-ready and career-ready skills. Many questions abound about whether these skills are the same, similar, or quite different. If states define them differently, then serious questions about comparability and application will permeate this country's education system.

Sustainability

While there is broad consensus that these types of systems are critical to reforming U.S. education, state education agencies have significant concerns that these massive high stakes structures will not be sustainable in the current national budget climate. While the large financial incentives that came along with the American Recovery and Reinvestment Act (ARRA), a portion of which formed RTTT, provided the impetus for many of these initiatives to take root, sustainability must become a priority at the federal level, within states, and across states. To maximize cross-state focus, we recommend continued significant funding of initiatives through ESEA reauthorization, Enhanced Assessment Grants, and other competitive/formula funding opportunities. The scoring of competitive applications should be weighted toward robust development of integrated systems across all aspects of assessment and accountability, and toward significant and rigorous research, development, and evaluation of the validity and impact (intended and unintended consequences) of system implementation. Formula funding should stipulate collaboration in system development designed to guarantee a continued focus on students with the greatest needs (e.g., economically disadvantaged students and students with disabilities). As many state education agencies do not have robust standing resources for writing competitive grants, we encourage strong partnerships with institutions of higher education and philanthropic organizations with a track record of success and innovation in education.

Conclusion

We believe that the education system in this country is at a crossroads. The significant gap between the resources needed to build and sustain superior assessment and accountability systems that provide better opportunities for all students and current state, federal, and local fiscal realities can only be


bridged through the creative and effective use of technology. This is the only way we see not just to develop, but also to scale up, the level of professional development, transparency, and rigor that will truly result in a transformative series of events that makes education in this country globally competitive once more. Additionally, we believe it is essential that states (and consortia of states) address the integration of technology into a comprehensive system of assessment and accountability. Many of the inroads technology has made into state systems have been made in a piecemeal fashion. Without a rich, complete plan for integration of technology into a comprehensive structure, it is unlikely that the system will be able to function coherently. We provide a basis for thorough integration into a comprehensive system. We hope that such rich and far-reaching integration will result in a system with longevity that truly serves the needs of students by providing educators with the tools to succeed.

Note

1. The SMARTER acronym stands for Summative Multi-state Assessment Resources for Teachers and Educational Researchers.

References

Council of Chief State School Officers & Association of Test Publishers. (2010). Operational best practices for statewide large-scale assessment programs. Washington, DC: Author.
Idaho State Department of Education. (2010). Idaho standards achievement test—alternate (ISAT-Alt) portfolio manual. Boise, ID: Author.
Martineau, J. A., & Dean, V. J. (2010). Making assessment relevant to students, teachers, and schools. In V. Shute & B. J. Becker (Eds.), Innovative assessment for the 21st century: Supporting educational needs (pp. 139–166). New York, NY: Springer-Verlag.
National Research Council. (2010). State assessment systems: Exploring best practices and innovations. Washington, DC: The National Academies Press.
PARCC. (2010). Partnership for Assessment of Readiness for College and Careers. Retrieved March 8, 2011, from Florida Department of Education web site: http://www.fldoe.org/parcc/
Popham, W. J. (2000). Testing! Testing!: What every parent should know about school tests. Needham Heights, MA: Allyn & Bacon.
Popham, W. J. (2008). Transformative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.
Quellmalz, E., & Moody, M. (2004). Models for multi-level state science assessment systems. Washington, DC: National Research Council.

Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
SBAC. (2011, February 25). SMARTER/Balanced assessment consortium. Retrieved March 8, 2011, from Washington Office of the Superintendent of Public Instruction: http://www.k12.wa.us/smarter/
Thompson, S., Thurlow, M. L., & Moore, M. (2003). Using computer-based tests with students with disabilities. Minneapolis, MN: University of Minnesota.
Thum, Y. M. (2003). Measuring progress toward a goal: Estimating teacher productivity using a multivariate multilevel model for value-added analysis. Sociological Methods & Research, 33(2), 153–207.
Thurlow, M. L., Elliott, J. L., & Ysseldyke, J. E. (2003). Testing students with disabilities: Practical strategies for complying with district and state requirements. Thousand Oaks, CA: Corwin Press.
U.S. Government Printing Office. (2009, February 17). Public Law 111–5: American Recovery and Reinvestment Act of 2009. Retrieved March 8, 2011, from U.S. Government Printing Office web site: http://www.gpo.gov/fdsys/pkg/PLAW111publ5/content-detail.html
USED. (2002). No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425.
USED. (2009, January 12). Standards and assessments peer review guidance: Information and examples for meeting requirements of the No Child Left Behind Act of 2001. Retrieved February 21, 2011, from USED web site: www2.ed.gov/policy/elsec/guid/saaprguidance.doc
USED. (2010a, March 14). State fiscal stabilization fund. Retrieved March 8, 2011, from USED web site: http://www2.ed.gov/policy/gen/leg/recovery/factsheet/stabilization-fund.html
USED. (2010b, May 8). Race to the top assessment program: Application for new grants. Retrieved February 21, 2011, from USED web site: http://www2.ed.gov/programs/racetothetop-assessment/comprehensive-assessment-systems-app.doc
USED. (2010c, September 9). Race to the top fund. Retrieved March 8, 2011, from USED web site: http://www2.ed.gov/programs/racetothetop/index.html
Utah. (2004). Title 53A (State System of Public Education), Chapter 1 (Administration of Public Education at the State Level), Section 708 (Grants for online delivery of UPASS tests). Retrieved February 21, 2011, from http://le.utah.gov/dtForms/code.html


Chapter 4

What States Need to Consider in Transitioning to Computer-Based Assessments from the Viewpoint of a Contractor

Walter D. Way
Robert K. Kirkpatrick
Pearson

Introduction

Applications of computer-based testing have been commonplace in a variety of settings for some time, including the military (Sands, Waters & McBride, 1997), college admissions (Mills & Steffen, 2000), and licensure and certification (Luecht, Brumfield, & Breithaupt, 2006; Melnick & Clauser, 2006). In large-scale K–12 state testing programs in the U.S., the move to computer-based testing has lagged somewhat compared to these other fields, but it is beginning to pick up momentum. A survey carried out in 2010 and reported in Chapter 3 of this volume (Martineau & Dean, 2012) indicated that, of the 50 U.S. states and Washington, DC,
44 have computer-based testing initiatives that include an operational computer-based test, a pilot computer-based test, and/or plans to implement computer-based testing in the near future. In some states, computer-based testing initiatives are extensive. For example, Alpert (2010) noted that virtually all of Oregon's NCLB mathematics tests in 2008–2009 were administered online using computerized adaptive testing. In a legislative briefing, the Virginia State Department of Education (2009) reported administering 1.67 million online tests in their assessment program in 2009. Peyton (2008) reported that two-thirds of survey respondents in Kansas schools indicated over 90 percent of students in their schools were taking the Kansas Computerized Assessments in 2006 as opposed to paper versions of the state tests.

Although these extensive online testing applications are encouraging, they have not come easily. Even though the majority of their students have tested online for six years or more, Oregon still struggles today with scheduling students to test by computer (Owen, 2010). Virginia's successes are in large part due to the statewide investment, reported at more than $524,772,000, over the nine-year Web-Based Standards of Learning (SOL) Technology initiative (Virginia State Department of Education, 2009). In most states, the infrastructure needed for extensive online testing applications is unevenly available across districts and schools. Even when the needed infrastructure is available, the transition from traditional paper-and-pencil testing to computer-based assessment is challenging because it requires schools and state departments of education to develop and implement new processes and approaches to support computer-based assessment.

Nearly all large-scale state assessment programs rely on services from testing contractors to develop, administer, score, and/or report results for their assessment programs. In general, testing contractors have developed the capability to support computer-based assessments and have experience with helping their clients transition to computer-based assessments. This chapter addresses what states need to do to transition to computer-based assessments from the viewpoint of a contractor. The issues and advice offered here reflect the experiences of a single contractor, but we are reasonably confident that our positions are generally consistent with those of other organizations providing computer-based assessment services to state departments of education.

The chapter is divided into four major sections. The first section discusses the recent educational reforms in the U.S. that have led to the Common Core assessments. The Common Core assessments are expected to have a major impact on state assessment programs and to accelerate a comprehensive transition to computer-based assessments. The second section considers the various contractor services related to computer-based assessment
and how these services differ from those that support traditional paper-and-pencil assessment programs. The third section discusses state considerations associated with computer-based assessments, including transition strategies, measurement issues, and operational issues. The final section addresses interoperability and computer-based assessments, a consideration that is gaining increasing attention within the large-scale assessment community.

Computer-Based Assessments and Educational Reform

Recent educational reform movements are providing incentives for states to transition their assessment programs to computer delivery. Under the American Recovery and Reinvestment Act of 2009, the president and Congress invested unprecedented resources into the improvement of K–16 education in the United States. As part of that investment, the $4.35 billion Race to the Top Fund focused on a state-by-state competition for educational reform. Race to the Top also included $350 million in competitive grants that were awarded in 2010 to two state consortia to design comprehensive new assessment systems to accelerate the transformation of public schools: the Partnership for Assessment of Readiness for College and Careers (PARCC)1 and the SMARTER Balanced Assessment Consortium (SBAC).2 In their competition submissions, both consortia stressed the use of technology generally and computer delivery in particular as a basis for summative tests. For example, in its submission PARCC said it would "administer a streamlined computer-based assessment with innovative item types," and the SBAC submission indicated that the summative component of its proposed system "would be administered as a computer-adaptive assessment." All but five states have joined one or both consortia with the understanding that the Common Core assessments will be delivered by computer.

However, the Common Core assessments will not be implemented until 2015 at the earliest. States are therefore preparing for a transition both by considering how their state standards intersect with the Common Core State Standards and by considering how they can gain additional experience with computer-based assessments. Beyond the impending Common Core assessments, states have compelling reasons for moving toward online testing. For example, computer technology can provide faster turnaround of scores and can link assessment results directly to instructional programs. Online testing offers increased student engagement, opportunities for cost and time savings, enhanced security, and greater equity across student populations. It creates opportunities to more fully capture student performance and to track student growth.


To this end, the contractors that develop and deliver state assessments are working with their state clients to find opportunities to implement new online assessments, as well as to enhance existing computer-based assessments by introducing innovative items and other technology-based enhancements. From a contractor's perspective, the next generation of assessments is ready to take off.

Contractor Services Related to Computer-Based Testing

Many contractors have experience with developing and delivering large-scale state assessments on computer. Each contractor is likely to have its own procedures and delivery platform. In most cases, contractor responsibilities for computer-based testing are based on a formal business agreement between the state department of education and the contractor. This agreement is often the result of a formal procurement process through which the state specifies the scope of work that is required and multiple contractors bid on the work. Contractors provide specialized services in the areas of program management, content development, test administration, scoring and reporting, as well as psychometric and research support. As with paper-and-pencil testing, the sophistication of the services needed to support a computer-based testing program must match the complexity and goals of the program. Well-defined and well-executed services are instrumental in delivering a successful computer-based test.

Program Management

Computer-based testing programs differ from traditional paper-based programs in their logistical needs. For example, the schedule for development of paper-based programs is driven by the date when test materials are required to be physically at the testing location, but the schedule for development of computer-based testing programs is driven by the date when the system must be available to test takers. These dates can differ by as little as a day or as much as several weeks, with deadlines for making a computer assessment available typically coming later than the deadlines for paper form availability. Understanding the differences in the logistical needs of paper-based and computer-based testing programs, and the implications of those differences, is one place where contractor program management expertise is valuable.


Programs that are transitioning over time from paper to computer also have unique needs. In these cases schedules for paper and computer test development must be commingled so that both modes are available at the right time. This is a complex process that requires coordination across several test development disciplines and may include frequent problem-solving sessions. Multiple versions of test ancillary documents (e.g., directions for administration) may need to be created, and information technology systems may need a high degree of coordination in order to score and report results. Experienced contractor program management will know where logistical gaps occur and will be able to assist in resolving those gaps or any other problems that might occur.

Content Development

Contractor test development staff typically have expertise in developing and presenting content on computer and an awareness of the unique requirements that computer and paper presentations place on content creation. This experience may also include working with new technology-enhanced item types such as graphing or simulation, the development of which requires a unique combination of content and software (Strain-Seymour, Way & Dolan, 2009). The guidelines for using specific online development tools and the capabilities of online systems are continuously evolving. Contractor test development staff will be most familiar with the needs of their particular systems and the general features or capabilities that transcend systems.

In our experience the pervasiveness of differences in how content is presented on paper and online can be surprising to someone new to computer-based testing. For example, many items composed for paper delivery must be completely recomposed using different fonts or graphics for online systems. Experienced test development staff are trained on where these differences exist. Contractor test development staff may have experience with several customers and contractor-owned products. Their work on new presentation approaches has resulted in first-hand knowledge of strategies that work as well as those that do not. The contractor's test development team can help their clients avoid learning by trial and error when doing so is not necessary.

Test Administration

Three issues related to test administration for which contractors can be of assistance are: 1) system readiness, 2) administration designs, and 3) problem resolution.


For open systems, such as those that are browser-based, we recommend a system certification process be used whereby the end-point user (often a school) must meet the system requirements provided by the contractor and complete a systems check before the first day of testing. The certification process can vary from simple to complex. In simple cases the information technology staff at the end-point may be required to verify, perhaps in writing, that their computer infrastructure meets minimal requirements specified by the contractor. In sophisticated cases the contractor may perform an on-site inspection of the local computer infrastructure that includes bandwidth and load testing. In load testing the contractor passes large amounts of data between the contractor system and the school in order to simulate the strain that would occur on the system if every computer at the school were being used for testing at the same time. Contractors can also execute the load testing remotely in collaboration with the local school staff. There are many options in between the simple and sophisticated solutions. Owners of high-stakes testing programs typically opt for strategies that are on the sophisticated side, as administration failures can have damaging consequences. Contractor program management can assist in determining the level of certification needed based on the stakes of the project.

The number of computing resources continues to be a limiting factor for computer-based testing administrations. In large-scale testing situations, such as statewide testing, it may take several weeks to complete testing. As a result, online administrations can be perceived to have increased risk associated with test security. Simple administration designs for traditional (non-adaptive) tests include:

• Dual mode (computer and paper), where both modes receive the same form
• Dual mode, where the paper form is a different parallel form from the computer-administered form
• Computer only, where only one form is used
• Computer only, where multiple parallel forms are used

In practice, relatively sophisticated designs are common, including multiple-form designs where forms are retired at specific intervals or after certain levels of exposure are reached, as well as adaptive testing, which uniquely targets test administration to each student. In many cases paper forms continue to be used for accommodations, such as Braille, even with a project that is intended to be computer-only. A useful approach here is to have a unique form for accommodations in order to minimize exposure. This may also allow the accommodated form to be reused, thus reducing the cost of developing accommodated materials. Multi-form designs can be costly to implement, and this should be considered along with the risk of exposure for the project.


Figure 4.1 provides an example of a moderately sophisticated three-week computer-based administration design. In this design three pairs of forms are offered each week in order to provide exposure control. Forms A and B are previously administered forms that have established psychometric properties and scales. Scores on these forms can be reported more rapidly than for the remaining forms. Form A has been converted to paper in order to produce accommodated versions that are administered through the entire three-week window. Forms J, K, P, and Q are new forms that will be post-equated3 after Weeks Two and Three, respectively. This design might be used in cases where early testers have different score reporting timelines than late testers. Such cases are common for high school graduation tests, where early testers may be composed exclusively of students who did not pass in a previous administration and are retaking the test. Scores for these students are often needed more quickly in order to make more immediate educational decisions. In this design, costs and schedule are strategically managed by offering one paper version of a form used for accommodations. After the new forms are post-equated, one may be chosen to convert to paper for accommodations in a future administration.

The design in Figure 4.1 can be modified to meet the specific needs of the project. Useful variations include adding forms, removing or suspending forms once a threshold for exposure has been met, and randomized administration within or by groups of students. The form designs may utilize pre-equated forms so that scores can be reported immediately and/or to avoid issues related to the representativeness of the students that happen to test within a certain time period.

Figure 4.1  Hypothetical computer-based administration design.
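
As a purely hypothetical illustration of how a rotation like the one sketched in Figure 4.1 might be represented in delivery software, the following Python sketch assigns each student a form based on testing week and accommodation status. The form labels, the random within-week assignment rule, and all function and variable names are assumptions made for the example; they are not features of any particular contractor platform.

```python
import random

# Hypothetical illustration only: a three-week schedule loosely patterned on
# Figure 4.1, with two forms offered per week and Form A reserved for
# paper-based accommodations across the whole window.
SCHEDULE = {
    1: ["A", "B"],   # previously administered forms; scores reportable quickly
    2: ["J", "K"],   # new forms, post-equated after Week Two
    3: ["P", "Q"],   # new forms, post-equated after Week Three
}
ACCOMMODATED_PAPER_FORM = "A"  # available throughout the window


def assign_form(week: int, needs_paper_accommodation: bool) -> str:
    """Return a form label for one student testing in the given week."""
    if needs_paper_accommodation:
        return ACCOMMODATED_PAPER_FORM
    # Random assignment within the week's pair spreads exposure evenly;
    # an operational system might instead balance on running counts.
    return random.choice(SCHEDULE[week])


if __name__ == "__main__":
    print([assign_form(week=2, needs_paper_accommodation=False) for _ in range(5)])
    print(assign_form(week=3, needs_paper_accommodation=True))
```

A real system would also record which form each student received, so that suspended or retired forms could be excluded from later weeks without disturbing scores already reported.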


Different contractor software systems may be structured to implement different administration designs. Contractor psychometric and program management staff are best informed on the capabilities of their systems and are an excellent source of advice on the strategies that work best for the various needs of the project. Moving beyond traditional test form designs opens the door to a much wider array of possible test administration designs. Computer adaptive testing and randomized testlet designs offer more control over the exposure of content and provide capabilities that are not available in traditional testing. A number of state testing program contractors have expertise in delivering innovative test designs, such as computer adaptive testing, that can be used if a sufficiently large and representative bank of calibrated items is available.

Paper-and-pencil testing is part of our collective cultural experience in the United States. Paper-based group and individual testing in schools has been going on for decades. Most educators participated in standardized testing when they were children themselves and have personal experience with scannable documents (e.g., the bubble sheet), number two pencils, and traditional item formats. Problems encountered during paper-based test administration seldom surprise the educators responsible for test administration and are usually resolved through standard practices. Problems impacting a single student, such as a torn document, can often be responded to by replacing materials from the stack of spares. When problems impacting all students are encountered, for example if an item is misprinted in a very noticeable way, special instructions can be distributed to proctors on an emergency basis instructing students on how to bypass the item, and the contractor can subsequently exclude the misprinted item from scoring. Because we have used these resolutions to administration problems for decades, we trust that the solutions are effective.

Computer-based testing brings a whole new series of problems, and necessary solutions, that are not yet part of our collective testing experience. As an example we provide two simple cases that might just as easily happen in paper-based testing.

1. What should be done if the system is not available? In paper-based testing, when the test is not available the solution would be to rearrange the schedule until the test is available. But in computer-based testing the reason the test is not available may be difficult to determine, leading to repeated attempts to gain access to the system while test-takers wait.
2. What should be done if there is a disruption to testing, such as a fire drill? In paper-based testing, the administration might be moved to another location after the situation is resolved, such as the lunchroom. In computer-based testing the lab may not be available again until after the testing window is over, and the school may be forced to schedule an administration outside of normal school hours.

Compounding such testing disruptions is the fact that the various computer-based testing systems used by contractors are likely to be more disparate from one another than the different processes for returning paper test materials. Systems are also expected to evolve over time, and the way we interact with computers is almost certain to change far more rapidly than the mature paper processes of the past. There are many new issues to face in computer-based testing, and building a common awareness of how to resolve these issues may take years—just as it has for paper-based testing. There are contributing factors that make resolving these issues more challenging than those that are faced with paper-based testing. Test proctors may have little experience with the computer systems themselves, internet connections are sometimes interrupted, and local software might be incompatible with the testing application. These kinds of issues lead to challenges in getting, and keeping, the testing system up and running that must be overcome. Because the contractor has faced these issues before, they have the experience and expertise to help schools and state departments of education resolve them.

Computer-based testing also provides an opportunity to correct problems in ways that are not feasible in paper-based testing. For example, if after a few days of testing an item is discovered to be erroneous, the item can be corrected, replaced, or simply removed for future test takers. In paper-based testing such a problem is rarely corrected in the field and is usually addressed by removing the item from scoring. We note that on-demand corrections also bring about new issues to resolve. In the example provided, if the faulty item is corrected, the scores reported before the correction is implemented are based on a different number of items than the scores reported after the correction is implemented. This can create problems for interpreting and aggregating scores of students who were administered the test before and after implementation of the correction. Programs that continue to be dual-mode could be further complicated by the correction. The contractor will have experience with these kinds of situations and will be prepared to offer advice on the kinds of problem resolutions that are most effective with their systems.

Scoring/Reporting

Scoring and reporting services are among the traditional competencies of contractors, and the services provided for computer-based testing are likely to parallel in many ways the services provided for paper-based testing. On the other hand, computer-based testing has brought on new and exciting opportunities for scoring and reporting.


Many contractors are able to provide services that could not be provided just a few years ago. Automated scoring of essays and related complex written responses has been available for some time and has received increased attention within the assessment literature in recent years (Dikli, 2006; Phillips, 2007). Clearly, automated scoring is becoming more attractive as applications of computer-based assessment increase in number. In addition, new item formats are being researched and used more frequently (Scalise & Gifford, 2006; Zenisky & Sireci, 2002). The techniques for scoring these item formats range from traditional key-scoring to complex algorithmic approaches. New item formats may allow students to engage in multiple interaction steps leading to large amounts of data. System performance may be negatively impacted when large amounts of data are passed over network connections from testing sites to a central scoring system. As a result, developers may embed the scoring logic or algorithms in the software for certain items. While the need for this approach is understandable, it is also important for contractors to preserve the capability of their content and systems to return the raw data of student interactions in case changes need to be considered in the future or it becomes necessary to audit the results obtained and reported in the field.

Many testing programs are presenting reports on computer today, and contractors generally offer both paper and computer-based reporting regardless of the mode used for testing. Online reporting can range from downloadable images of the paper reports to user-selectable aggregations and dynamic presentations. Many contractors have staff that are highly skilled in data presentation and can offer expert advice on reporting in either mode. These consultation services are particularly important when test developers want to use both modes for reporting, as optimal presentations may be different for each mode.

Psychometric & Research Support

Contractors provide specialized psychometric services for computer-based testing in several areas. The most commonly requested services include consultation on test design, research on innovative items, research on scoring automation, and issues of mode comparability. Much of the research in these areas continues to be relatively new territory. Due to their experience with multiple contracts and proprietary systems, contractor psychometric staff may have insight about practical measurement issues that is unique and invaluable.

Computer-based test designs range from traditional fixed-length tests to dynamic designs that control content exposure or adapt to estimates of student achievement. In designing complex computer adaptive tests, most contractors utilize computer simulations to test and document the performance that is expected of the adaptive tests once they are administered operationally. Simulations of adaptive tests have been utilized by psychometricians for many years and are well documented in the research literature (cf. Eignor, Stocking, Way & Steffen, 1993).
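
To give a sense of what such a simulation involves, the sketch below runs a bare-bones adaptive administration against simulated examinees under a Rasch (1PL) model, selecting at each step the unused item whose difficulty is closest to the current ability estimate. The pool size, the grid-search ability estimate, the fixed test length, and all names are assumptions chosen for brevity; operational CAT simulations add content constraints, exposure control, and more defensible scoring.

```python
import math
import random

random.seed(1)

# Assumed item pool: 200 Rasch items with difficulties spread over (-3, 3).
POOL = [{"id": i, "b": random.uniform(-3, 3)} for i in range(200)]


def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_theta(responses, items, lo=-4.0, hi=4.0):
    """Crude maximum-likelihood estimate of ability via a bounded grid search
    (operational systems use Newton-Raphson, EAP, or similar)."""
    grid = [lo + i * 0.05 for i in range(int((hi - lo) / 0.05) + 1)]

    def loglik(t):
        return sum(
            math.log(p_correct(t, it["b"])) if u == 1 else math.log(1 - p_correct(t, it["b"]))
            for u, it in zip(responses, items)
        )

    return max(grid, key=loglik)


def simulate_cat(true_theta: float, test_length: int = 20) -> float:
    """Administer one simulated adaptive test and return the final estimate."""
    available = list(POOL)
    administered, responses = [], []
    theta_hat = 0.0  # provisional starting estimate
    for _ in range(test_length):
        # Select the unused item whose difficulty is closest to the estimate
        # (maximum information under the Rasch model).
        item = min(available, key=lambda it: abs(it["b"] - theta_hat))
        available.remove(item)
        # Simulate the examinee's response from the true ability.
        u = 1 if random.random() < p_correct(true_theta, item["b"]) else 0
        administered.append(item)
        responses.append(u)
        theta_hat = estimate_theta(responses, administered)
    return theta_hat


if __name__ == "__main__":
    for true_theta in (-1.5, 0.0, 1.5):
        estimates = [simulate_cat(true_theta) for _ in range(50)]
        bias = sum(estimates) / len(estimates) - true_theta
        print(f"true={true_theta:+.1f}  mean bias={bias:+.3f}")
```

Summaries such as the bias estimate printed here (along with standard errors, pool usage, and exposure rates in fuller studies) are the kind of documentation simulations are used to produce before an adaptive test goes live.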


When scores obtained from paper-based and computer-based testing are intended to be used interchangeably, professional standards require that empirical comparability of the modes be established (APA, 1986; AERA, APA, NCME, 1999, Standard 4.10). Research on state assessments has shown that mode effects are negligible when a computer version of a test is created to mirror the paper version (e.g., Poggio, Glasnapp, Yang & Poggio, 2006). But new approaches to content creation and test design are not intended to result in computer and paper presentations that are exactly the same. Items that are created to leverage the multimedia capabilities of computerized presentation provide a meaningfully different student experience than printed tests. Contractor psychometric staff have experience in these areas and are available to provide empirically based advice on solutions for areas well studied and to conduct applied research for issues that are novel.

Challenges with Multiple Contractors

A state department of education may work with more than one contractor to accomplish its goals. A common approach is for one contractor to develop test content while another contractor is responsible for the computer-based delivery, scoring and reporting of the test. As technologies develop, new products and processes may result in the need for new business strategies. One strategy we see as likely is the addition of a third contractor who is responsible for automated scoring of complex tasks—such as essays. Establishing clear roles and responsibilities for each contractor is the first step towards a successful project when more than one contractor is involved. Because different contractors have different systems and procedures, some challenges can be expected. We discuss this issue in our section on interoperability. When multiple contractors are involved, we recommend the following actions:

1. Map out the complete test development, administration, scoring and reporting process.
2. Identify all touch-points between test developers and contractors, and between contractors and other contractors.
3. Identify a single responsible party for each activity in the process. If more than one party is responsible, break the activity into multiple steps so that one owner can be identified for each step.
4. Develop a project management strategy that addresses the scheduling and oversight of the handoffs across parties.
5. Develop specifications for all handoffs among parties.

Finally, we recommend that a practice execution of the end-to-end process be successfully completed before operational testing begins. Even the best planned projects may have unforeseen issues appear during the production phase. By conducting a practice session, the number of unexpected issues arising from cross-party handoffs can be reduced or eliminated.

Transition Strategies

State testing programs have a diverse set of computer-based testing issues to address. States face issues associated with transitioning from paper to computer-based testing, measurement strategies, and operational considerations such as infrastructure, policy, and, in some cases, credibility. Transitioning from paper to computer-based testing can be a challenging task and is one of the places where a contractor can be particularly useful. The tasks and issues requiring attention are somewhat different depending on whether the program is migrating from paper to computer, or whether a new assessment is being added to the program that will be exclusively computer-based. Many states will find it difficult, if not impossible, to move to 100% testing on computer and may elect to allow schools to opt for paper testing while the necessary infrastructure is put in place.

Full Transition

Full transitions require a tremendous amount of coordination and planning and may take several years to accomplish. In most cases online infrastructure must be developed in concert with the test development activities. To make a full transition, a state department of education will need support from its legislature and from local schools. Even schools with modern computing infrastructures may face challenges in participating in required statewide computer-based testing. During the transition period a clear set of policy goals and the paced roll-out of the online program should be published widely. The first year or two of the transition may be more manageable if only a few tests are migrated to computer. For example, a state may start by migrating a high school end-of-course test, then gain experience and feedback from the field in order to inform the second year of transition. There are many components of state testing programs that may require a longer transition period and may benefit by transitioning later in the schedule due to continuous improvements in technology.


Components such as accommodated forms, certain subjects, and testing very young students might transition last. One useful approach that we have observed is for states to have a formal process that allows schools to opt out of computer-based testing for a period of time while infrastructure is put in place. When transitioning high stakes programs such as graduation or grade promotion tests, parents may be allowed to request that their child continue to be tested using a paper form.

One challenge faced during the transition period is the coordination of processing and scoring so that final score reports for both paper and computer modes can be released simultaneously. This may be a major issue for programs where post-equating is used and the same form appears in two modes. If large volumes of paper-based testing persist, the state may choose to maintain a traditional paper-based reporting schedule even though computer-administered scores may be available earlier. If volumes of paper-based testing are small, such as when paper is used only for accommodations, a strategy sometimes used is to have schools key-enter the paper-based responses into the online system so that no paper processing is required. This can speed up score reporting considerably. In such cases the answer documents may be returned to the contractor for score auditing purposes.

During the transition period there will be a plethora of policy issues that the state must consider. For example, are schools allowed to choose to administer some tests in one mode and other tests in the other mode? Contractors can assist state departments in identifying the policy areas to consider and in identifying alternative solutions.

Partial Transition

Partial transitions are more common today. A typical strategy is to launch new additions to the program on computer. For example, in the past decade several states adopted computer-based testing as the primary, or exclusive, mode for high school end-of-course exams. While planning for partial transitions generally requires the same considerations as those for a full transition, the strategies are enacted on a smaller scale. States and school districts might find this approach easier to implement. In addition, a partial transition might be useful in paving the way for a broader transition several years in the future. Contractors are able to provide advice and options for partial transitions that fit the specific needs of the state.


Measurement Issues

Although more and more state programs are moving to computer-based testing, the most common approach is to use the computer as a data collection device for traditional forms of testing rather than to create innovative tests or innovative measures. In recent years new approaches to content presentation are beginning to take root, with more state projects turning to innovative item formats. In some cases concern over comparability of paper and computer modes has led to replicating the items administered on paper forms in computer presentation.

Item Formats

The multiple-choice item continues to be the dominant item format in state testing, even when presented on computer. When tests will be used for both paper and computer there are often efforts to make the items appear exactly the same in both modes, including cases where the computer presentation is kept black-and-white, or the paper presentation is made such that no scrolling will be needed when the item is placed on computer. The contractor should be familiar with item design features that lead to mode differences and will assist with decisions about the style elements that should be considered to avoid comparability problems.

Statewide writing assessments present some unique considerations in moving to computer-based delivery. Some studies have indicated higher performance on essays entered using the computer compared with handwritten essays (Russell & Haney, 1997; Russell & Plati, 2001). Other studies have suggested the opposite finding, that is, lower performance for computer-based essays compared with handwritten essays (Bridgeman & Cooper, 1998; Way & Fitzpatrick, 2006). These mixed findings may have to do with the keyboarding skills and experiences of the students involved in the studies; that is, students tend to perform better on writing tasks that are administered in the format they use in the classroom.

Technology-enhanced traditional item formats are most easily introduced in cases where only the computer mode will be used. These item formats have various names, such as drag-and-drop and hot spot, but the task conducted by the student resembles what is accomplished in a multiple-choice or matching item format. Regardless of whether traditional or technology-enhanced traditional item formats are used, test developers may want to leverage the multimedia capability of computers. Video and audio are becoming more commonplace, as are color figures and photographs. If tests using multimedia artifacts are used in a dual-mode setting, the state needs to carefully consider the impact of doing so on mode comparability. In some cases a paper presentation of the task may not be possible, and items measuring alternative tasks may need to be provided for the paper form.


New item formats are being conceptualized and researched by many organizations, with contractors providing services ranging from theorizing the item formats themselves to writing the software that delivers these formats. In some cases these new formats are still in the experimental phase, where scoring and measurement models are still being developed. In other cases the properties of the items are better known. For state assessment programs we believe it is important for the state to evaluate the utility of these items from both psychometric and policy points of view. While these new item formats are generally perceived as more engaging and rigorous for students, they are also more expensive to develop and implement. We also have concern about the interoperability of these item types on different contractor systems and the lifespan of the items themselves. Although multiple-choice items typically do not introduce any innovation in computer-based tests, they are also far less likely to be dependent on a particular system or testing platform.

Models for Delivery

As with item format, the most common model for delivering computer-based tests continues to be via a traditional linear test. However, computerized administrations offer alternative delivery models that can be very useful in fulfilling the goals of the program. New models include randomized testlet, multi-level, and adaptive testing. All of these models can enhance test security and perhaps extend the use of the item pool.

In the randomized testlet model, testlets are, in essence, mini-tests containing just a few items that are assembled according to specifications in a similar manner as full-length tests. However, a testlet is not intended to be administered by itself. Ideally, the testlets are built to the same psychometric and content properties. In a randomized testlet design the blueprint indicates the number and type of testlets that should be administered instead of the number of items. Students are randomly assigned testlets to meet the blueprint. A multi-level test is similar to the randomized testlet model, except that a decision is made about student ability after each testlet is administered. The next testlet administered is targeted at the student's ability. Finally, adaptive tests, or fully-adaptive tests, update the decision about student ability after each item is administered.

Contractors have different capabilities with respect to these new models for delivery. The particular vendor's system is designed to implement its vision of these models, and different algorithms for randomizing or adapting to student ability are utilized. The contractor can provide the best advice on how to develop content and test designs for using these models.
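
To illustrate the difference between these delivery models in the simplest possible terms, the following sketch randomly assigns testlets to meet a blueprint and then shows a multi-level variant that routes to easier or harder testlets based on performance so far. The blueprint categories, testlet labels, and routing rule are invented for the example and are not drawn from any vendor's system.

```python
import random

random.seed(7)

# Assumed testlet pool, keyed by content category; the labels are arbitrary.
# Each testlet also carries a rough difficulty tag used only by the
# multi-level routing rule below.
TESTLETS = {
    "algebra":  [("ALG-easy", "easy"), ("ALG-med", "medium"), ("ALG-hard", "hard")],
    "geometry": [("GEO-easy", "easy"), ("GEO-med", "medium"), ("GEO-hard", "hard")],
}
# Blueprint: how many testlets of each category a student must receive.
BLUEPRINT = {"algebra": 1, "geometry": 1}


def randomized_testlet_form():
    """Randomized testlet model: randomly pick testlets to satisfy the blueprint."""
    return [random.choice(TESTLETS[cat])[0]
            for cat, count in BLUEPRINT.items() for _ in range(count)]


def multilevel_form(proportion_correct_so_far: float):
    """Multi-level model: route to an easier or harder testlet at the next stage,
    using a crude cut on the running proportion correct."""
    if proportion_correct_so_far < 0.4:
        target = "easy"
    elif proportion_correct_so_far < 0.7:
        target = "medium"
    else:
        target = "hard"
    return [next(name for name, diff in TESTLETS[cat] if diff == target)
            for cat in BLUEPRINT]


if __name__ == "__main__":
    print("randomized:", randomized_testlet_form())
    print("multi-level (student at 80% correct):", multilevel_form(0.8))
```

A fully adaptive test would replace the coarse proportion-correct cut with an item-by-item ability estimate, as in the CAT simulation sketch shown earlier in the chapter.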


Comparability

Much of the evidence on mode comparability shows that the differences are small (cf. Kingston, 2009; Wang, Jiao, Brooks, Young, & Olson, 2007, 2008). However, the different systems used by contractors do provide different presentations and student experiences, and research has suggested that mode comparability can be affected by the testing interface used (Pommerich, 2004). With emerging computer-based technologies such as the development of new item formats that may not be producible in paper mode, and automated scoring procedures that require key-entered responses, mode comparability may continue to be a challenging issue for measurement experts for years to come.

Most states that have explored comparability issues have done so from the perspective of transitioning their programs from paper to computer. In the future, testing programs are certain to be developed primarily for administration on computer, with paper-based tests only offered for accommodations or other special cases. Such an approach will confound mode comparability with the comparability of accommodations, and disentangling the issues in a definitive manner may be difficult and expensive to research. Other sources of non-comparability between paper and computer modes include user familiarity with the mode being used and differences between the primary mode of instruction and the mode of testing. It is conceivable that at some point in the future students will use computer media for their learning more frequently than they use paper media. In such cases, mode comparability may be about investigating whether students are disadvantaged by testing in the more traditional paper modality. For all these reasons, generalizations drawn from comparability studies may be point-in-time inferences and subject to re-evaluation over time.

The costs of implementing a dual-mode program are higher than those of implementing a single-mode program. As states transition to computer-based testing and begin to reap the associated benefits, they will also face the ever-present pressure to effectively manage costs. This ultimately makes mode comparability a public policy issue. We have worked with states and their technical advisors in deploying computer-based programs from various comparability perspectives, and have taken from this a few guidelines for when a comparability study may be needed to provide defensibility for a program.

1. The test is used for high stakes decisions and the student originally took the test in a mode in which they will no longer be allowed to take it. For example, a graduation test where the first administration is exclusively on paper, but retake administrations for students not passing are planned to be exclusively online.
2. Testing very young students.
3. The computer-based test contains content that cannot be replicated in the paper mode.

In our experience there are many challenges to designing and executing a conventional comparability study, and these challenges can often limit the extent of the inferences that can be drawn from the research results. Experimental designs, either between groups or within groups, usually require special data collections outside of normal testing windows, and in many cases the scores do not count for students or schools, introducing history and motivation effects in the outcomes. Within-groups designs require counterbalancing to be strong, and such designs are hampered by memory or history effects if only one version of the form is available and by motivation or practice effects even when two forms are available. Because of the difficulty involved in getting schools to willingly participate, samples may not be representative. In addition, control variables that may be important in mode research, such as computer usage, may not be readily available.

Because of the practical difficulties in implementing strong experimental designs in studies of mode comparability, many states adopt quasi-experimental approaches. In these studies, statistical techniques are employed to control for any pre-existing differences between the groups. Methods such as analysis of covariance (Davis, Strain-Seymour, Lin, & Kong, 2008), multilevel regression models (Sykes, Ito, & Ilangakoon, 2007), matched samples comparability analyses (Glasnapp, Poggio, Carvajal, & Poggio, 2009; Way, Davis, & Fitzpatrick, 2006; Way, Um, Lin, & McClarty, 2007), and propensity score matching (Puhan, Boughton, & Kim, 2007; Yu, Livingston, Larkin, & Bonett, 2004) have been used to evaluate comparability in quasi-experimental designs. These designs tend to be less intrusive on schools and students, although their limitations can constrain the inferences made from the study results.

As dual-mode programs become more commonplace, procedures used for differential item functioning might be used to compare computer and paper mode group performance as part of item development activities. Items performing differently by mode can be evaluated and excluded from the item pool if construct-irrelevant features are identified. This information can be utilized to inform future item development. Using the same test development logic as used with cultural/gender bias, it can be argued that if a test is assembled using only items free of "group mode effects," the test itself will be free of these effects. Theoretically, this could eliminate the need to perform a form-level comparability study.
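
For readers who want a sense of what a mode-DIF screen of this kind involves, the sketch below computes a Mantel-Haenszel common odds ratio for a single item, comparing computer and paper groups within total-score strata. The record layout and the tiny fabricated data set are our own simplifications; an operational analysis would add significance tests and established classification rules, and the procedure shown is not that of any particular contractor.

```python
from collections import defaultdict

# Sketch of a Mantel-Haenszel-style check for "mode DIF" on one item,
# stratifying on total score. The record layout is an assumption made for
# illustration; operational DIF analyses add significance tests and
# established classification rules (e.g., the ETS A/B/C categories).


def mh_odds_ratio(records):
    """records: iterable of (total_score, mode, correct) with mode in
    {"computer", "paper"} and correct in {0, 1}. Returns the MH common
    odds ratio comparing the computer group to the paper group."""
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for total, mode, correct in records:
        cell = strata[total]
        if mode == "computer":
            cell["A" if correct else "B"] += 1   # reference group counts
        else:
            cell["C" if correct else "D"] += 1   # focal group counts
    num = den = 0.0
    for cell in strata.values():
        n = cell["A"] + cell["B"] + cell["C"] + cell["D"]
        if n == 0:
            continue
        num += cell["A"] * cell["D"] / n
        den += cell["B"] * cell["C"] / n
    return num / den if den > 0 else float("nan")


if __name__ == "__main__":
    # Fabricated example records: (total score, mode, item correct).
    data = [(10, "computer", 1), (10, "paper", 1), (10, "computer", 0),
            (10, "paper", 0), (12, "computer", 1), (12, "paper", 0),
            (12, "computer", 1), (12, "paper", 1)]
    print(f"MH odds ratio (computer vs. paper): {mh_odds_ratio(data):.2f}")
```

An odds ratio near 1.0 suggests the item behaves similarly in the two modes after conditioning on overall performance; values far from 1.0 would prompt content review of the kind described above.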


Because contractors are motivated from a business perspective to continually improve their computerized testing systems, we believe contractors have a unique and important role in developing innovative and cost-effective comparability research strategies. We also believe that some degree of comparability across contractor systems is needed by the industry in order to promote collaboration among contractors. Poggio and McJunkin (2012) provide some comparisons across contractor platforms in this volume (Chapter 2).

Scoring Models

Computer-based testing is bringing several exciting changes to scoring strategies. As discussed by Williamson (2012) in this volume (Chapter 7), automated scoring of essays has great potential and is being used by several contractors today. Of course, traditional machine-based objective scoring will remain prevalent. New item types may require more sophisticated scoring approaches, such as complex tables, algorithms, or a combination of the two. Contractors can provide several different scoring services, which in some cases may be proprietary. In developing content, we have found that three steps in scoring need to be considered:

1. Collection and representation of the student response
2. Evaluation of that response
3. Assignment of a score to the evaluated response

As we noted in an earlier section, it is possible and sometimes necessary (e.g., with adaptive testing) to embed the logic for scoring computer-delivered items at the time they are administered. With items that are more complex than multiple-choice, it seems advantageous to use embedded scoring as long as system performance is not impacted by doing so. However, if embedded scoring is used for high-stakes purposes, we strongly recommend that the raw, evaluated, and scored responses be returned to the central scoring system so that an audit trail can be established. Scoring systems should have the capability of overriding the score awarded by the item in the event that it is discovered to be wrong.
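
A minimal sketch of what these three steps, and the recommended audit trail, might look like for a simple drag-and-drop item appears below. The item definition, the partial-credit scoring rule, and the record returned to the central system are invented for illustration and do not reflect any contractor's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoredResponse:
    """Record returned to the central system: raw, evaluated, and scored,
    so that results can be audited or re-scored if the key changes."""
    item_id: str
    raw_response: Dict[str, str]                           # step 1: collection
    evaluation: List[bool] = field(default_factory=list)   # step 2: evaluation
    score: int = 0                                          # step 3: score assignment
    overridden: bool = False  # central system may override the score later


# Hypothetical drag-and-drop item: each label must land in the correct region.
ANSWER_KEY = {"label_1": "region_A", "label_2": "region_C", "label_3": "region_B"}


def score_drag_and_drop(item_id: str, raw_response: Dict[str, str]) -> ScoredResponse:
    record = ScoredResponse(item_id=item_id, raw_response=raw_response)
    # Step 2: evaluate each placement against the key.
    record.evaluation = [raw_response.get(label) == region
                         for label, region in ANSWER_KEY.items()]
    # Step 3: simple partial-credit rule, one point per correct placement.
    record.score = sum(record.evaluation)
    return record


if __name__ == "__main__":
    student_placements = {"label_1": "region_A", "label_2": "region_B", "label_3": "region_B"}
    result = score_drag_and_drop("DND-0042", student_placements)
    print(result.score, result.evaluation)  # -> 2 [True, False, True]
```

Returning the raw placements alongside the evaluation and score is what makes later auditing or re-scoring possible without going back to the school.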


Operational Issues

One of the biggest challenges faced by state departments of education in rolling out computer-based testing programs is infrastructure. It has long been recognized that schools must have enough computers and enough networking bandwidth to implement computer-based testing. In our view, successful implementation of a computer-based testing program also requires a collaborative relationship between the contractor and the local information technology staff. Tests that are media-rich or make use of new item formats require significant network bandwidth and speed. Configuring local computer and network systems to allow the testing software to work correctly requires sufficient technical know-how. In some cases a portal must be opened in the local firewall to allow the testing software to send data back to the scoring system. Automatic updates for operating systems and other locally used software can create compatibility problems with testing systems and may need to be turned off during the testing window. When problems launching the test are encountered, it can be difficult to diagnose the nature of the problem. While the contractor will have expert helpdesk staff and carefully developed help materials that can assist schools with setting up their systems and diagnosing problems, schools are responsible for ensuring that their system capabilities meet the requirements published by the contractor. Successfully implementing these systems is a partnership.

Training of school staff falls in two categories: 1) system configuration, and 2) administration management. In system configuration the contractor will train local information technologists in the technical requirements of the contractor's system. Topics such as system set-up, local caching, network capability, and minimum hardware requirements should be covered. The contractor may provide white papers or onsite consultation to help the system set-up go more smoothly. A trial run can be very helpful for training, and if load testing is conducted at the same time it can verify that the complete system is ready to go before the test administration begins.

Administration management, or set-up, is typically conducted by school administrative staff or teachers. This involves enrolling students in the system, assigning students to tests, proctoring the administrations, and closing out sessions when testing is completed. Contractors will provide multiple levels of training on administration management. Typically, a test administration manual covers the steps needed to complete these activities. In large-scale state testing, contractors may hold sessions at the state department of education or regional locations where testing coordinators can receive face-to-face training. Contractor helpdesk staff are available when unexpected events occur, such as if test sessions fail to launch or error messages appear on the screen. The first couple of administrations are likely to require more training and support than later administrations.

State departments and schools need to establish policies and procedures for monitoring computer labs during testing, including pausing the administration for bathroom breaks, resuming students who have session disruptions, allowable materials during testing, and others. In many cases these policies or procedures will be similar to those used in paper-based testing, but in other cases they may not be. Contractors are able to provide advice on the kinds of policies and procedures that are unique to computer-based testing, and how to implement them.


One operational area that is often quite different for computer and paper administrations is test security. In computer-based administrations the testing window is often longer due to a limited number of computers available for use. Students may also be able to see the test content and responses of their peers more easily on a computer screen than they can with a paper test book and answer sheet. In some cases, the novelty of computer-based testing may make the test content more memorable, as with innovative or media-rich items. These differences may bring forth a need for stronger on-site test security methods, such as special configuration of the computer lab for testing, using more test forms, or leveraging the capabilities of the computer to control exposure using different testing models such as randomized testlet designs or computer adaptive testing. Contractors are able to provide advice to state departments about test security and may have unique products or strategies to help in this area.

Interoperability and Computer-Based Assessments

The technology environment in which assessment solutions operate is evolving rapidly. As the supporting technology becomes more sophisticated, assessment interoperability standards will become more and more necessary for independent testing components to work together. For example, two assessment contractors building test items in accordance with an assessment interoperability standard could exchange these items with minimal manual intervention. As another example, if the standard is sufficiently comprehensive and sophisticated, then two systems should be able to exchange entire item banks and associated test definitions in order to deliver, score, and report with similar—if not identical—outcomes. An even more sophisticated example would be when one student's performance results from a test correlated to a set of curriculum standards are usable by a different system to direct the student to targeted instructional materials correlated to those same curriculum standards. Without assessment interoperability standards, such collaboration between separate players and systems would at best require manual intervention and at worst would not be possible.

States should therefore strive to become knowledgeable about interoperability as they look to transition their assessments online. Furthermore, states should challenge those contractors working with them to address interoperability in their proposal responses and statements of work. To date, proprietary solutions have flourished with an emphasis on differentiating vendor-specific features, products, processes, and systems, rather than on standardizing data models across vendors.


With interoperability playing a larger role in the assessment business model, compliance with standards on the part of assessment contractors will follow. However, the realization of mature interoperability of assessment content and systems will ultimately depend on the quality of the interoperability specifications and the ability of standards to evolve with innovations in how assessments are developed, administered, and supported by underlying technologies.

A variety of interoperability standards are currently used to represent student data, test content/structure, and assessment results. The most commonly used standards include the Sharable Content Object Reference Model (SCORM), the Schools Interoperability Framework (SIF), the Question and Test Interoperability (QTI) specification, and the standards produced by the Postsecondary Electronic Standards Council (PESC). These standards each have a somewhat different emphasis. For example, SIF and PESC focus on student data and assessment reporting. In contrast, SCORM and QTI primarily address educational content. Recently, a project funded by a U.S. Department of Education Enhanced Assessment Grant has resulted in a new specification, called the Accessible Portable Item Profile (APIP). This project uses current QTI and Access for All specifications to develop an integrated set of tags, along with descriptions of expected behaviors, which can be applied to standardize the interoperability and accessibility of test items.

The range of available specifications related to assessment interoperability presents complexities for states to consider. In analyzing barriers to widespread interoperability, it is important for states to recognize the evolutionary nature of today's assessment standards. Although these concerns should not discourage states from challenging their contractors to demonstrate a commitment to the interoperability of proposed products and services, they do suggest that an inflexible insistence on full conformance to a particular standard may not be in the state's best interest in introducing or enhancing technology within its assessment programs.
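
As a rough illustration of what item-level interoperability buys in practice, the sketch below parses a single multiple-choice item written in simplified, QTI-style XML and recovers the information a receiving system would need to deliver and key-score it. The markup is loosely modeled on QTI 2.x element names but is abbreviated and has not been checked for conformance; treat it as schematic rather than as a valid QTI instance.

```python
import xml.etree.ElementTree as ET

# Simplified, QTI-style markup (schematic only; not a conformant QTI item).
ITEM_XML = """
<assessmentItem identifier="ITEM-001" title="Capital of France">
  <responseDeclaration identifier="RESPONSE" cardinality="single" baseType="identifier">
    <correctResponse><value>C</value></correctResponse>
  </responseDeclaration>
  <itemBody>
    <choiceInteraction responseIdentifier="RESPONSE" maxChoices="1">
      <prompt>Which city is the capital of France?</prompt>
      <simpleChoice identifier="A">Lyon</simpleChoice>
      <simpleChoice identifier="B">Marseille</simpleChoice>
      <simpleChoice identifier="C">Paris</simpleChoice>
    </choiceInteraction>
  </itemBody>
</assessmentItem>
"""


def parse_item(xml_text: str) -> dict:
    """Extract what a receiving delivery system needs: prompt, choices, key."""
    root = ET.fromstring(xml_text)
    key = root.find("./responseDeclaration/correctResponse/value").text
    interaction = root.find("./itemBody/choiceInteraction")
    return {
        "item_id": root.get("identifier"),
        "prompt": interaction.find("prompt").text,
        "choices": {c.get("identifier"): c.text
                    for c in interaction.findall("simpleChoice")},
        "key": key,
    }


if __name__ == "__main__":
    item = parse_item(ITEM_XML)
    print(item["item_id"], "->", item["key"], item["choices"][item["key"]])
```

The point of a shared specification is precisely that a second contractor's system could run a parser like this against the first contractor's item bank and deliver the content without manual rework.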


Summary and Conclusions

In this chapter, we have outlined a number of considerations that states should address when transitioning to computer-based assessments. Clearly, most states have already started down this path, and those that have not yet done so are likely to be developing plans in the near future. The advantages of moving to the computer are obvious, and in some areas, such as writing, the move may soon be required if assessments are to reflect instruction and student practice in a valid manner (Way, Davis & Strain-Seymour, 2008). The case for moving state assessments online is further strengthened by the testing reforms that were stimulated by the American Recovery and Reinvestment Act of 2009 and that will play out over the next five years.

At the core of this chapter were the sections that addressed the various contractor services related to computer-based assessment and the specific considerations that are most deserving of attention from states as they plan and carry out transitions of their testing programs to online delivery. The discussion covered both measurement and operational issues that we have encountered in working with states during such transitions. We also addressed the important topic of assessment interoperability standards, which will play a pivotal role in the evolution of technology-based assessment and instruction over the coming years.

It seems fitting to us that in closing a chapter addressing what states need to consider in transitioning to computer-based assessments, we should recognize the transitional nature of this topic. Most assessment experts acknowledge that we are on the cusp of a technology-fueled revolution that will radically change the way that learning occurs and how we assess what learners know. Although the term "paradigm shift" is often used too frivolously, we believe that the application of technology to learning, instruction, and assessment over the next 10 years is going to bring just that. We therefore predict, and even hope, that the considerations raised in this chapter will be largely viewed as out-of-date and irrelevant in the not-too-distant future.

Notes

1. http://www.achieve.org/parcc
2. http://www.k12.wa.us/smarter/
3. Post-equating refers to the practice of equating scores on a new test to those from a previous test version using the data collected from administering the new test. Post-equating contrasts with pre-equating, where the equating conversions for the new form are established before it is administered.

References

Alpert, T. (2010, April). A coherent approach to adaptive assessment. Paper presented at the National Academies, Board on Testing and Assessment and The National Academy of Education Workshop, Best Practices for State Assessment Systems, Washington, DC. Retrieved from http://www7.nationalacademies.org/bota/Best_Practices_Workshop_2_Agenda.html
American Educational Research Association (AERA), American Psychological Association (APA), and the National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: AERA.

American Psychological Association Committee on Professional Standards and Committee on Psychological Tests and Assessments (APA). (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.
Bridgeman, B., & Cooper, P. (1998, April). Comparability of scores on word-processed and handwritten essays on the Graduate Management Admission Test. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Davis, L. L., Strain-Seymour, E., Lin, C., & Kong, X. (2008, March). Evaluating the comparability between online and paper assessments of essay writing in the Texas Assessment of Knowledge and Skills. Presentation at the Annual Conference of the Association of Test Publishers, Dallas, TX.
Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment, 5(1). Retrieved from http://www.jtla.org
Eignor, D. R., Stocking, M. L., Way, W. D., & Steffen, M. (1993). Case studies in adaptive test design through simulations (ETS Research Report RR 93–56). Princeton, NJ: Educational Testing Service.
Glasnapp, D., Poggio, J., Carvajal, J., & Poggio, A. (2009, April). More evidence: Computer vs. paper and pencil delivered test comparability. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Kingston, N. M. (2009). Comparability of computer- and paper-administered multiple-choice tests for K–12 populations: A synthesis. Applied Measurement in Education, 22(1), 22–37.
Luecht, R. M., Brumfield, T., & Breithaupt, K. (2006). A testlet assembly design for the uniform CPA examination. Applied Measurement in Education, 19(3), 189–202.
Martineau, J. A., & Dean, V. J. (2012). A state perspective on enhancing assessment and accountability systems through systematic integration of computer technology. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 55–77). Charlotte, NC: Information Age.
Melnick, D. E., & Clauser, B. E. (2006). Computer-based testing for professional licensing and certification of health professionals. In D. Bartram & R. K. Hambleton (Eds.), Computer-based testing and the internet: Issues and advances (pp. 163–185). West Sussex, England: Wiley.
Mills, C., & Steffen, M. (2000). The GRE computer adaptive test: Operational issues. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 75–99). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Owen, W. (2010, July 23). Oregon school computer labs overwhelmed by demands on students. OregonLive.com. Retrieved from http://www.oregonlive.com/education/index.ssf/2010/07/oregon_school_computer_labs_ov.html
Peyton, V. (2008). Quality and utility of the Kansas computerized assessment system from the perspective of the Kansas educator. Retrieved from http://www.cete.us/research/reports/pdfs/peyton2008_utility.pdf

Phillips, S. M. (2007). Automated essay scoring: A literature review (SAEE research series #30). Kelowna, BC: Society for the Advancement of Excellence in Education.
Poggio, J., Glasnapp, D., Yang, X., & Poggio, A. (2006). A comparative evaluation of score results from computerized and paper & pencil mathematics testing in a large scale state assessment program. Journal of Technology, Learning, and Assessment, 3(6). Retrieved from http://www.jtla.org
Poggio, J., & McJunkin, L. (2012). History, current practice, perspectives, and predictions for the future of computer-based assessment in K–12 education. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 25–53). Charlotte, NC: Information Age.
Pommerich, M. (2004). Developing computerized versions of paper-and-pencil tests: Mode effects for passage-based tests. Journal of Technology, Learning, and Assessment, 2(6). Retrieved from http://www.jtla.org
Puhan, G., Boughton, K., & Kim, S. (2007). Examining differences in examinee performance in paper and pencil and computerized testing. Journal of Technology, Learning, and Assessment, 6(3). Retrieved from http://www.jtla.org
Russell, M., & Plati, T. (2001). Effects of computer versus paper administration of a state-mandated writing assessment. TCRecord. Retrieved from http://www.tcrecord.org/Content.asp?ContentID=10709
Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing student performance on tests conducted via computer and via paper-and-pencil. Education Policy Analysis Archives, 5(3). Retrieved from http://epaa.asu.edu/epaa/v5n3.html
Sands, W. A., Waters, B. K., & McBride, J. R. (Eds.). (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
Scalise, K., & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. Journal of Technology, Learning, and Assessment, 4(6). Retrieved from http://jtla.org
Strain-Seymour, E., Way, W. D., & Dolan, R. P. (2009). Strategies and processes for developing innovative items in large-scale assessments. Iowa City, IA: Pearson. Retrieved from http://www.pearsonassessments.com/hai/images/tmrs/StrategiesandProcessesforDevelopingInnovativeItems.pdf
Sykes, R. C., Ito, K., & Ilangakoon, C. (2007, April). Evaluating the mode of administration of algebra and algebra readiness tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Virginia State Department of Education. (2009). Statewide web-based standards of learning technology initiative. Retrieved from http://www.doe.virginia.gov/support/technology/sol_technology_initiative/annual_reports/2009_annual_report.pdf
Wang, S., Jiao, H., Brooks, T., Young, M., & Olson, J. (2007). Comparability of computer-based and paper-and-pencil testing in K–12 mathematics assessments. Educational and Psychological Measurement, 67(2), 219–238.
Wang, S., Jiao, H., Brooks, T., Young, M., & Olson, J. (2008). Comparability of computer-based and paper-and-pencil testing in K–12 reading assessments: A meta-analysis of testing mode effects. Educational and Psychological Measurement, 68(1), 5–24.

Way, W. D., & Fitzpatrick, S. (2006, April). Essay responses in online and paper administrations of the Texas Assessment of Knowledge and Skills. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Way, W. D., Davis, L. L., & Fitzpatrick, S. (2006, April). Score comparability of online and paper administrations of the Texas Assessment of Knowledge and Skills. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Way, W. D., Davis, L. L., & Strain-Seymour, E. (2008). The validity case for assessing direct writing by computer. Iowa City, IA: Pearson. Retrieved from http://www.pearsonassessments.com/NR/rdonlyres/CAF6FF48-F518-4C68-AF2F2959F902307E/0/TheValidityCaseforOnlineWritingAssessments.pdf?WT.mc_id=TMRS_The_Validity_Case_for_Assessing_Direct
Way, W. D., Um, K., Lin, C., & McClarty, K. L. (2007, April). An evaluation of a matched samples method for assessing the comparability of online and paper test performance. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Williamson, D. (2012). The scientific and conceptual basis for scoring innovative and performance items. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 157–193). Charlotte, NC: Information Age.
Yu, L., Livingston, S. A., Larkin, K. C., & Bonett, J. (2004). Investigating differences in examinee performance between computer-based and handwritten essays (RR-04-18). Princeton, NJ: Educational Testing Service.
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15(4), 337–362.


Chapter 5

Operational CBT Implementation Issues
Making It Happen1

Richard M. Luecht
University of North Carolina at Greensboro

Introduction

There are at least two ways to conceptualize moving to computer-based testing (CBT). One way is to focus on the publicized promises of CBT—that is, potentially new and flashy item types, computer-adaptive item selection, 24/7 testing access, immediate score reporting, and so on. This conceptualization of CBT often leads to the somewhat naïve assumption that all one needs is an item bank, a test assembly process to generate CBT test forms, computers to administer the tests, and scoring algorithms to process the data. The second conceptualization recognizes the reality of planning and implementing an overhaul, or possibly a completely new design, of an entire testing enterprise, with careful and early consideration of as many details and complexities as possible (e.g., Luecht, 2005a, 2005b; Drasgow, Luecht, & Bennett, 2006; Sands, Waters, & McBride, 1997). Consistent with the second conceptualization, this chapter attempts to portray the complex realities of CBT and the many systems and subsystems that make up the testing enterprise.



Moving to CBT requires a serious commitment of significant financial, human, and technical resources. CBT does provide some benefits, such as new item types and quicker, lower-cost transmission of data and results. CBT also eliminates the need for paper test booklets and answer sheets, and tends to offer better security and control over the test materials from both logistics and data management perspectives. However, CBT also introduces some new challenges. Some of the more serious challenges are due to diminished testing capacities at test centers. Consider that hundreds of thousands of examinees can be accommodated for paper-and-pencil testing when university lecture halls/labs, school auditoriums, conference center space, and hotel meeting rooms are co-opted as testing centers on a small number of fixed testing dates each year. This test administration model has worked successfully and cost effectively for decades. Achieving that same capacity on a small number of fixed test dates each year is not possible using commercial CBT centers. There are enormous scheduling complexities when thousands of examinees want to test at approximately the same time of the year, and seating capacity at commercial testing centers is limited. Furthermore, using auxiliary, temporary sites such as high-school, college, or university computer labs, or setting up temporary CBT sites in rented facilities, is technically and logistically difficult to pull off. The Internet has offered the option of online CBT; however, security remains a concern2 and the capacity issue only goes away if test takers are allowed to use their own computers, notebooks, or personal digital devices/smart phones as testing stations. If proctored CBT test administration sites are required, capacity will always be an issue. The primary impact of limited CBT capacity is that the testing events must be spread over a longer period of time. Some testing organizations have opted to use fixed testing windows (e.g., several fixed days or weeks of continuous testing). Others allow on-demand, continuous testing to take place throughout the year. When testing is carried out over an extended period of time, item exposure becomes a serious problem. Examinee collaboration networks can fairly quickly acquire large segments of the item pool if the entire pool is exposed for even a few days. Using non-overlapping test forms for every day of testing (i.e., no items shared across test forms) may solve the exposure issue but raises two additional problems: (i) the requirement for an even larger item pool and (ii) poorer quality linkages among the test forms that could seriously impact subsequent equating/calibration practices. Since the capacity issue is not likely to go away anytime soon, the best solution appears to be to use large item banks to generate many reasonably unique test forms, expose items to no more examinees than is absolutely necessary, and hope that the necessary linkages in the data hold up well for psychometric equating and calibration purposes.
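To make the exposure concern concrete, a testing program can monitor item exposure directly from its administration records. The minimal sketch below, using entirely hypothetical form contents and assignments, computes the proportion of examinees who have seen each item across the forms administered during a testing window; items exceeding a preset exposure ceiling would become candidates for rotation or retirement.

```python
from collections import Counter

def exposure_rates(form_assignments, form_contents):
    """Proportion of examinees exposed to each item.

    form_assignments: dict mapping examinee_id -> form_id administered
    form_contents:    dict mapping form_id -> list of item_ids on that form
    Returns a dict mapping item_id -> exposure rate (0.0 to 1.0).
    """
    n_examinees = len(form_assignments)
    seen = Counter()
    for form_id in form_assignments.values():
        for item_id in form_contents[form_id]:
            seen[item_id] += 1
    return {item: count / n_examinees for item, count in seen.items()}

# Hypothetical example: two overlapping forms used during one testing window.
forms = {"F1": ["i01", "i02", "i03"], "F2": ["i02", "i04", "i05"]}
takers = {"e1": "F1", "e2": "F1", "e3": "F2", "e4": "F2"}
print(exposure_rates(takers, forms))  # the shared item i02 is exposed to all examinees
```

The same kind of tally, run continuously, also reveals how quickly the linking items that support equating accumulate exposure relative to the rest of the pool.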


Other challenges are specific to CBT; for example, managing security issues (e.g., examinee identification and authentication, implementing secure data transmission channels and data encryption procedures, and thwarting attempts by highly determined examinee collaboration networks to steal large portions of the active item banks by strategic memorization); designing effective accounting, quality control, and quality assurance procedures for managing and reconciling massive amounts of test data; and carrying out ongoing psychometric analyses for calibrations, scoring, and equating with sparse data. In actuality, CBT is often more complex and certainly more costly than paper-and-pencil testing, unless one wants to believe the hype and naïve promises. A fundamental theme in this chapter is that handling these and other challenges requires a commitment of substantial resources, as well as adopting a strong systems design approach that goes far beyond applying simple ad hoc modifications or repairs to legacy procedures and systems developed for paper-and-pencil testing. Rather, it is essential to recognize that any CBT enterprise is actually an integrated system of systems (Drasgow, Luecht, & Bennett, 2006). Accordingly, a formal system architecture needs to be engineered for CBT from the ground up—an architecture that is robust, efficient, extensible, and change-friendly, without requiring massive re-design or re-development if technologies change or new technologies emerge.

An Overview of CBT Systems

The term CBT enterprise is used here to denote a system of computer and digital processing hardware, software, and human systems and procedures that carry out the end-to-end operations of testing for a particular testing program, from item and test design through final scoring and communication with stakeholders. A system is a collection of components and procedures that accomplish something. For example, a computer delivery system can be characterized as a network of interconnected PCs or a variety of other configurations with login and test delivery software to allow examinees to take the test and to store their results. A data management system is likewise a collection of interconnected data sources (tables, databases, etc.) and tools for manipulating the data. A system of systems is merely an extension of that definition to an enterprise. The systems discussed in this chapter are characterized by three facets: (i) one or more repositories; (ii) software procedures, tools, and applications; and (iii) human-led procedures. Repositories are databases, files, and other data sources that hold digital information. Repositories can be formal or informal, structured or unstructured.3 Software procedures can range from data management applications to advanced statistical tools for analyzing


and scoring test data. Most CBT systems are person-machine systems that include interactions between the software applications, data, and humans, including computer operators, database administrators, and information technology specialists, test developers, external test committees comprised of subject matter experts, statisticians, psychometricians, quality control experts, test proctors and operations personnel, and the examinees. More often than not, the systems and procedures developed for paperand-pencil testing will fail to scale as part of a CBT enterprise because of inadequate technical quality, excessive operational costs or resources when use/demand is increased by an order of magnitude, or inability to integrate those systems with other legacy systems in a seamless way. The implication is that CBT often requires designing a new enterprise. There are substantial costs and development complexities that need to be carefully planned using iterative design and development, and ensuring that every part of the system fully integrates with the rest of the enterprise. The functional demands (requirements) of the CBT enterprise are relatively straight-forward: (a) to maximize the accuracy, utility and fairness of the test scores produced; (b) to minimize costs; (c) to move possibly more complex data more quickly, more securely and more accurately from item generation through final scoring; and (d) to provide immediate responsiveness and automation where possible to improve efficiency, reduce costs and minimize errors. Achieving those functional requirements is not trivial. Existing systems— often developed for paper-and-pencil testing with limited numbers of forms per year—are usually not scalable for CBT because they are too cumbersome and they depend heavily on human involvement in all processes and procedures. Therefore, even these straightforward functional demands for CBT are difficult to achieve given cost and time constraints, human errors, and scalability concerns, especially if the end-goal is 99.999999% in terms of both accuracy and systems reliability. This section highlights eight basic systems that comprise most operational CBT enterprises: (i) item development and banking; (ii) test assembly and publishing; (iii) examinee eligibility, registration/fee collection, and scheduling; (iv) test delivery; (v) results data management and transmittal; (vi) psychometric equating/calibration, scaling and post-examination processing, item analysis, key validation and test analysis; (vii) final scoring, reporting and communication; and (viii) quality control and assurance. Figure 5.1 depicts these eight component systems. Each system, represented by a rectangular shape, contains component databases, processes and procedures denoted by the ellipses. Two important points need to be made regarding the eight systems in Figure 5.1. First, the components shown in the figure are only meant to be illustra-


Figure 5.1  A CBT enterprise: A system of eight systems.


tive, not comprehensively inclusive of all parts of a functional system. For example, an item development and banking system includes more component databases, processes, and procedures than the six indicated in the figure. Many of these additional system elements are described further on, specific to each system. Second, the systems themselves may be directly or indirectly interconnected in various ways. For example, the test assembly and composition system would use standardized item pool extractions (queries) to retrieve data for automated test assembly. Similarly, item data from the content management, keys and rubrics data management, item statistics, and resource libraries would be extracted and included as part of the published resource file sent to test centers. In turn, the test resource files would be used by the test delivery software. The eight systems are also summarized in Table 5.1. The key data repositories and primary functions for each system are listed in that table.

Table 5.1  Summary of Eight Systems in a CBT Enterprise

Item Development and Banking
  Repositories: Content management, items, item sets, exhibit libraries, item statistics
  Primary functions: Manage item content, rendering forms, statistics, and usage

Test Assembly and Publishing
  Repositories: Content management, item statistics, constraints, objective functions, test form, resource files
  Primary functions: Construct test forms (real-time or otherwise)

Registration and Scheduling
  Repositories: Examinee identification, demographics, eligibility, scheduled exams
  Primary functions: Determine examinee eligibility, collect fees, schedule exams

Test Delivery
  Repositories: Examinee identification, assigned test forms, irregularity reports
  Primary functions: Administer the tests to examinees

Results Management and Transmittal
  Repositories: Results files with examinee response and timing results, test forms, item lists
  Primary functions: Transmit, store, verify, and reconcile raw examinee results

Post-Administration Psychometric Processing
  Repositories: Queries, extract templates, results files, intermediate flat files
  Primary functions: Item analysis, calibration/equating, scaling, score computations

Score Reporting and Communication
  Repositories: Examinees, score transcripts, report forms, correspondence forms
  Primary functions: Prepare final score tables, reports, and exports

Quality Control and Assurance
  Repositories: Version controls, reconciliations, root cause analyses, sign-offs, lock-downs
  Primary functions: Audit, problem-solve, and ensure quality of all processes and data

The Item Development and Banking System

An item development and banking system for CBT is a collection of software authoring tools, databases, digital libraries, and data management utilities that manage the data associated with all items, item sets/groups


of problem-linked items, complex performance exercises (CPEs), test sections, and test forms. This system usually stores four types of information about test items, item sets, and CPEs: (1) rendering information used to display each item or CPE, including information required for examinee interactivity (e.g., software interface control settings, timing, calculators, spell-checkers, highlighters); (2) content and other attributes used in assembling test forms (e.g., topics, content codes, cognitive classifications, task model references, linguistic indices); (3) statistical item, item-set, or CPE data; and (4) operational data about the items (reuse history, exposure rates and controls, and equating/scoring status). In addition, the system must store item authoring templates and palettes, graphics libraries, text passages, and other digital exhibits. Beyond merely storing the items, the item development and banking system needs to provide easy-to-use, secure, possibly online access and tools for the item designers, item writers, and test development editors. Not all CBT items are identical. While that is certainly not a very profound statement, it at least helps to make the point that designing and implementing an item development and banking system involves far more than designing a simple database table or two for multiple-choice items with item stems, several distractor options to be assigned to radio-button control labels, and an answer key. Modern CBT item types include many varieties of selected-response item types (e.g., hot-spot items), interactive drag-and-drop items that require the examinee to select and drag items with a mouse, text editing using input boxes or word processing applications, and many other variations of highly interactive simulations and CPEs. The item development and banking system needs to be flexibly designed to accommodate all current, as well as possible future, item types, psychometric calibration models, and scoring evaluators. Designing appropriate, flexible, and extensible data structures for an item banking system is an essential step in implementation. The structures need to provide efficient storage, but also facilitate data retrieval and data extracts. It is important that the designers understand the need for flexible, on-demand data retrieval in multiple export formats. One of the most frustrating experiences for end users is to have database management personnel repeatedly ask them to justify why they need data in a particular format for external analyses every time the end users make a request. If a database is well-designed, queries and formatted extracts should be seamless and not require weeks of database programmer intervention to create various auxiliary data files for end users. The item bank, in particular, needs to be easily accessible to authorized users while ensuring change-tracking, versioning, and lock-down capabilities. In addition, changes to the data structures may be required as new item types or additional item design elements are considered (e.g., so-called technology enhancements). A reasonable item


content management system should provide a structured way to store and retrieve the following types of unique information for items, item sets, and other CPEs:

• Text of the actual item and response instructions to the examinee
• Response capturing controls and labels for buttons, check boxes, text areas, etc. (e.g., radio buttons to be checked for multiple-choice items)
• Exhibit references to an external library or data source (pictures, reference text, graphics)
• Auxiliary response tools (calculators, spell checkers, etc.)
• Answer keys, rubrics, and scoring evaluator rules, objects, and agents
• User-defined content and cognitive coding, including readability, word counts, psycholinguistic indices
• Usage statistics
• Statistical information including classical item statistics, distractor analysis results, item response theory parameter estimates, differential item functioning statistics, and associated standard errors of estimate
• Problem-linked item sets (e.g., reading passages, graphics with associated text) and interactive components including scrolling, highlighting/marking tools
• Word counts and content indicators for item set objects or CPEs

Most higher-quality content management systems are built using robust data management software with flexible data structures such as XML.4 Having a sound data management system is essential for CBT. In addition, moving to CBT requires careful planning of the data structures by competent data-systems designers who are extremely knowledgeable and skilled in object-oriented and relational database systems design. One of the most serious errors in planning a system is to limit the data structures to only support existing data specifications (item types, test types, etc.). For example, even exclusively multiple-choice testing programs may consider using other item types or even using item sets at some point in the future. If the system is not open-ended in terms of its data structures design, the entire content management system may need to be overhauled when those capabilities are eventually needed. Not being able to provide certain CBT capabilities because of poor initial planning is ultimately more costly than investing in a robust architecture that allows for expansion of data types and structures in the future. Figure 5.2 provides a graphical depiction of two sample repositories, one for item content and rendering information and the other for statistical information.
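To make the preceding list concrete, the sketch below shows one way a single multiple-choice item record might be represented in a content management repository. The structure and every field name are purely illustrative; they are not drawn from any particular banking product or interoperability specification.

```python
# A minimal sketch of one item record. Field names are hypothetical examples of
# the rendering, content, scoring, statistical, and usage attributes listed above.
item_record = {
    "item_id": "MC-004217",
    "version": 3,                      # a revised item would receive a new version/identifier
    "rendering": {
        "stem": "Which value of x satisfies 2x + 3 = 11?",
        "response_controls": {"type": "radio", "options": ["2", "3", "4", "5"]},
        "exhibits": [],                # references into an external exhibit library
        "tools": ["calculator"],       # auxiliary response tools enabled for this item
    },
    "content": {
        "content_codes": ["ALG.1.2"],
        "cognitive_level": "application",
        "word_count": 9,
    },
    "scoring": {"key": "4", "max_points": 1},
    "statistics": {"p_value": 0.62, "point_biserial": 0.41, "irt_b": -0.35},
    "usage": {"times_administered": 2, "last_form": "F2011-17"},
}
```

Keeping the rendering, content, scoring, statistical, and usage information as separable components of one record is what allows the same item to feed test assembly, delivery, and psychometric processing without duplicating data.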


Figure 5.2  Content management and item statistics repositories.

In Figure 5.2, the test developers would likely deal primarily with the content management repository; the psychometricians would more likely deal with the statistical repository. The tri-footed relations (lines with three prongs) among the item data objects (rectangular boxes) denote one-to-many relationships between the data. For example, a particular exhibit such as a graphic or text passage can be used with multiple item sets. Today, licensing commercially available item development and banking software systems is feasible and cost-effective, with many vendors now providing fairly powerful systems and flexible customization to meet a variety of different business needs. Whether an organization chooses to purchase a licensed database system or to develop a customized system in-house, obtaining a robust and reliable system as soon as possible in the CBT implementation process is essential. Taking shortcuts to meet short-term financial goals and sacrificing quality by licensing subpar software or trying to cobble together a system from legacy software is courting system failure when moving to CBT. The item development and banking system is the foundation of the entire enterprise, and testing programs need to invest in a very solid foundation, including hiring or contracting with competent information


technology experts to maintain the system. When these technical resources involve outside contractors, it is also important for testing agencies and organizations to contractually maintain ownership of their data and the actual data structures, as well as rights relating to data exports and external usage. Although legal contract issues are far outside the scope of this chapter, those issues are extremely important to emphasize regarding any discussion of CBT implementation. Test Assembly and Publishing Test assembly and publishing make use of the item development and banking system to select items for test forms and then publish the test forms in a manner that allows them to be administered by the test delivery system. One of the most important accomplishments in the testing industry insofar as making CBT feasible has been the development of automated test assembly (ATA) tools as an integral part of the test assembly and publishing system. For paper and pencil testing, a relatively small number of test forms can usually be constructed over several days or weeks by a joint effort of test development editors, psychometric staff and subject matter experts (SMEs). The degree of parallelism achieved depends on many factors, including the quality of the item pool as well as level of agreements among the SMEs as to what “content-parallel” means. However, through largely brute force human efforts, the test forms are built and then sent on for publishing and printing. In contrast, CBT typically requires numerous test forms, sometimes even thousands of test forms. For example, certain types of tests—notably computer-adaptive testing (CAT) and linear-on-the-fly testing (LOFT) (see Folk & Smith, 2002)—assemble a customized test form for every examinee in real-time. Whether test forms are pre-constructed or assembled in real-time, the number of test forms needed for CBT makes using ATA almost mandatory. Computer-assisted ATA allows test developers to build numerous content and statistically parallel test forms in seconds using sophisticated item selection methods (Luecht, 1998, 2000; Luecht & Hirsch, 1992; Swanson & Stocking, 1993; van der Linden, 1998, 2005; van der Linden & Adema, 1998). ATA implementations can range from using database sorting and counting algorithms to rather sophisticated mathematical optimization heuristics and algorithms borrowed and modified from computer network design, airline scheduling, transportation logistics, and manufacturing applications (e.g., minimizing travel distances, maximizing productivity, or minimizing costs). The four basic components of any ATA system are: (1) an item pool or bank as the data source for the test items or assessment tasks to be selected; (2) a system of content and other tangible constraints that must be met for content validity, timing, and other properties of a test; (3) an objective func-


tion that usually determines the statistical or psychometric characteristics of the test form; and (4) test assembly software for selecting the items to satisfy the objective function, subject to the constraints and limitations of the item pool. The ATA software solves a basic problem that can be generally framed as a mathematical optimization model:

    optimize    f′x
    subject to  Q′x ≤ a
                R′x = b
                x ∈ {0, 1}

where f is an item-pool-length vector of some function of statistical item characteristics—for example, item response theory item information—Q and R are matrices of item attributes (e.g., content indicators), and x is a matrix of binary indicators denoting which items (rows) are selected for which test forms (columns). Even CAT and LOFT can be shown to be relatively simple implementations of real-time ATA—that is, performing the test assembly while the examinee is taking the test. Van der Linden (2005) provides a thorough introduction to ATA terminology and algorithms. The item pool, constraints, and objective functions usually interface with linear or mixed integer programming software, or customized ATA software applications using heuristics, to create item lists attached to specific test-level data objects, depending on the type of test design used. For example, a complete test form can be viewed as a data object containing a list of 60 item identifiers. A test section can also be a data object. Each test object has a set of constraints and an objective function to be met. The ATA software application selects from the item pool a collection of test items or assessment tasks that meet the constraints and that satisfy the objective function. However, because the structure of the data in the item pool is typically not amenable to direct use with ATA software, a series of data conversion steps is required to format the assembly problem in a way that the ATA software can understand. Once the test assembly process runs, additional data conversions are needed to translate the item selections into usable item lists or tables that can be forwarded to the test publishing software. ATA is not the same as test publishing. ATA merely selects items for one or more test data objects.5 Test publishing then takes the converted item lists and, for CBT, usually marries those item selections to item text, rendering templates, answer keys, relevant item statistics, and so on for every test data object. For example, if 20 test forms are constructed using ATA, there will be 20 item lists. All of the data needed for CBT delivery must then be extracted from the item development and banking databases. Most test delivery software uses proprietary data structures and formats to store all of this data for each test form.
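To make the optimization model described above concrete, the sketch below assembles a single fixed-length form by maximizing total item information at a target ability point, subject to a length constraint and simple content quotas. It uses the open-source PuLP solver and a small hypothetical item pool; the pool, constraint values, and variable names are illustrative only, and operational ATA systems handle far larger pools, many simultaneous forms, and many more constraint types.

```python
import pulp

# Hypothetical 8-item pool: (item_id, information at the target theta, content area)
pool = [
    ("i1", 0.52, "algebra"), ("i2", 0.35, "algebra"), ("i3", 0.61, "geometry"),
    ("i4", 0.28, "geometry"), ("i5", 0.47, "number"),  ("i6", 0.44, "number"),
    ("i7", 0.39, "algebra"), ("i8", 0.55, "geometry"),
]
TEST_LENGTH = 4
CONTENT_MIN = {"algebra": 1, "geometry": 1, "number": 1}  # simple content constraints

prob = pulp.LpProblem("form_assembly", pulp.LpMaximize)
x = {item_id: pulp.LpVariable(f"x_{item_id}", cat="Binary") for item_id, _, _ in pool}

# Objective: maximize summed item information (the f'x term in the model above).
prob += pulp.lpSum(info * x[item_id] for item_id, info, _ in pool)

# Constraint: fixed test length.
prob += pulp.lpSum(x.values()) == TEST_LENGTH

# Constraints: minimum coverage for each content area (the Q'x / R'x terms).
for area, minimum in CONTENT_MIN.items():
    prob += pulp.lpSum(x[item_id] for item_id, _, a in pool if a == area) >= minimum

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [item_id for item_id in x if x[item_id].value() == 1]
print("Selected items:", selected)
```

The same formulation extends to the simultaneous assembly of many parallel forms by indexing x over both items and forms, which is where the heuristics and mixed integer programming software mentioned above become essential.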


The complete package of test forms is typically called a resource file. Testing instructions, timing requirements, graphics, pictures, and other data associated with each test object and every item are also stored in the resource file. It should be obvious that all testing locations administering the same test forms should employ the same resource file. The resource file is usually encrypted and uploaded to local file servers or stored on servers that can be securely accessed over the Internet. More recent advances in cloud computing make it possible to store one or more resource files across multiple storage locations—essentially creating a virtual database. The resource file serves as the primary data source for the test delivery system, containing all of the data needed to administer and score every active test form, including experimental pretest items. At this point, it may be useful to pause and reflect on how much data management is involved in CBT implementation before a single test question is administered to a single examinee. A very complex data management system is required to store the item and test data, and that system needs to interface with the test assembly and composition system to eventually generate the resource files needed by the test delivery system. Careful, robust integration and plenty of user training are essential for seamless operation.

Registration and Scheduling System

Registering examinees, collecting fees, and scheduling their examinations are all essential components of CBT. Lower-stakes examinations—for example, low-stakes certification examinations or self-administered Internet examinations—allow almost anybody to sit for the test, and there is no apparent need to keep careful records as to the identity of every test taker or their testing history. Test takers are essentially assigned a customer number unique to every testing event, fees may be collected, and the examinee is provided with an authorization to test (immediately or at some future date). Higher-stakes examinations often need to establish the identity and eligibility of the test taker at the time of registration, as well as authorization to test. Therefore, examinees may have to prequalify with a particular jurisdiction and be issued a unique, secure authorization code or number. This and other relevant information is stored in a secure database that is accessed when the examinee attempts to schedule his or her examination. In addition to the examinee's identity and testing authorization(s), contact information, demographic information, and testing history—including any history of irregularities and correspondence—may be stored in a complete digital transcript. If the examination program has a policy that restricts retest attempts within a particular time frame, the registration system will prohibit excess testing attempts during that period. Security of the data and the actual transactions—especially if financial information and other personal information is exchanged—is essential.


When fees are collected, multiple levels of encryption and authentication are often needed, even for lower-stakes examinations. Unless testing is truly available on-demand for walk-in or log-in candidates, each examinee must schedule his or her test, given available testing seats and time slots. The scheduling system therefore needs to be a rather sophisticated application that attempts to find suitable locations, dates, and time slots, reconciling the testing organization's test administration policies with applicant preferences. If capacity is limited, the examinee may need to select less desirable alternative testing times or dates. Some scheduling systems allow block registration and scoring of candidates. This is a convenient capability that allows schools or entire classes to be scheduled as intact units.

Test Delivery System

The test delivery system is probably the most conspicuous system in the enterprise. This is the system that examinees interact with when logging in and taking the actual test. A test delivery system is actually a very complicated and, one hopes, integrated set of operational procedures and software applications. Formally, the test delivery system performs nine basic functions: (1) decrypting and restructuring the resource file; (2) logging in, verifying, and authenticating the test taker; (3) selecting the items to administer (e.g., fixed sequence, random, or heuristic-based—like an adaptive test); (4) populating a navigation control and enforcing authorized navigation through the test by the examinee; (5) rendering the test items and running scripts to add animations, interactivity, and so on to the items; (6) capturing responses; (7) executing timing controls (e.g., enforcing section time-outs) and providing pacing help to the examinee; (8) scoring responses in real time—which may be needed for adaptive testing as well as for final scoring, if a score report is immediately provided to the examinee; and (9) encrypting results and transmitting them back to a central storage repository. Commercial test drivers vary in their capabilities to perform these operations and also tend to differ in their ability to scale these procedures to different types of tests and test units (e.g., item selection and timing controls at the level of modules containing multiple items, item sets, or CPEs). The test delivery system needs to deal with both examinee-level data and test materials (i.e., the resource file—see Test Assembly and Publishing). The test delivery system retrieves data from the resource file to present practice questions, instructions, and other information to the candidate; implements test navigation functions that control how the examinee moves through the test, including presentation, review, and sequencing rules; possibly provides embedded scoring and item/test unit selection (e.g., for adaptive testing); controls timing and pacing of the examination; presents different and possibly complex item types and allows the examinees to re-


spond to those items; activates reference and ancillary look-up materials where appropriate; provides access to calculators and auxiliary tools; stores the examinees' results; and ensures that the examination terminates normally. Proctors and login facilities are also part of the test delivery system. Test delivery systems vary in terms of the types of testing facilities used, the type of connectivity supported, the variety of item types utilized, the size of testing units administered, and the types of test delivery models employed. Modern CBT facilities can be dedicated commercial test centers, temporary facilities set up expressly for larger-scale testing events, online classrooms or computer laboratories, or direct PC connections via the Internet. Most of the connectivity for modern CBT test delivery systems is handled via the Internet, with performance directly related to the capabilities of the connectivity channels used. Connectivity refers to the speed of transmission and the bandwidth (i.e., the amount of digital information that can be transmitted). Speed and bandwidth can affect the way item or test unit selections and navigation in the test are handled, the retrieval and display of high-definition images, audio, or video, the extent of interactivity possible, and even scoring throughput. Using encrypted (i.e., secure), high-speed data transmissions over the Internet, many testing organizations are able to directly link central processing servers to any testing site or personal computing device that has Internet connectivity with sufficient bandwidth and speed. Conversely, without sufficiently fast connectivity or ample bandwidth, the test delivery system may suffer serious performance degradation. The item types in use also have an impact on the test delivery system. As computerized performance exercises (CPEs) and other technology-enhanced items gain popularity, the test delivery system must incorporate more complex rendering formats, more interactive components and response-capturing mechanisms, and more intricate scoring protocols. Ultimately, all of this added complexity results in more complicated real-time data management, the need for more expansive functionality in the user interface, more complex scoring algorithms, and more sophisticated error handling. The final difference in test delivery systems involves the type of test delivery model used. There are four broad classes of test delivery models: (1) computerized fixed-length tests (CFT); (2) linear-on-the-fly tests (LOFT); (3) item-level computer-adaptive tests (CAT); and (4) modular or testlet-based multistage (MST) panels. These four types of test delivery models also have many subtle variations. For example, item-level CAT can range from relatively simple item selection heuristics that attempt to maximize the measurement precision of each examinee's score, to stratified sampling strategies integrated into the adaptive item selection, to real-time shadow tests that employ sophisticated ATA optimization algorithms allowing complex constraints to be incorporated into the adaptive algorithm. Similarly, MST examinations can range from adaptively administered mod-


ules or testlets selected in real time to carefully pre-constructed modular tests that ensure exact measurement properties, very accurate control over exposure risks, and precise content balance. The four classes of CBT test delivery models can be distinguished by the degree of adaptation used, the size and type of test units used, and where the actual item selections and test assembly take place for building each examinee’s test form. For example, CAT and computer-adaptive MST models adaptively select each item or test unit to match each examinee’s proficiency score with incremental accuracy. In contrast, CFT, LOFT, and certain variations of mastery MSTs build test forms without any adaptation of the test difficulty to the examinee’s proficiency. Instead, these delivery models may use a common set of statistical targets for every test form. The test administration unit sizes also differ across test delivery models. A single test item may be the fundamental unit of test administration (i.e., one item = one test administration object). However, a delivery unit can also be a set of items assigned to a reading passage, or a particular problem scenario can also be packaged to present as a single unit or module. In fact, any cluster of items can be pre-constructed and packaged as a unique test administration unit. The terms module or testlet have been used to describe these discrete test administration units. There is no optimal test administration unit size. Intermediate test administration units such as testlets or modules are generally easier to handle in terms of data management and quality control, and examinees seem to prefer them if for no other reason than being able to review and change answers before submitting the unit. However, some amount of flexibility and mobility is always sacrificed through consolidation of the items into units, especially when adaptation is desired. Test delivery models also differ with respect to where the actual test assembly takes place. CFTs and MSTs are typically pre-constructed using ATA. This provides a strong measure of quality control over the statistical properties of every test form, the content balance, and other test features that test development staff and subject-matter experts wish to control. LOFT and CAT are constructed in real-time or immediately before the examinee begins taking the test. Both the LOFT and CAT models provide greater flexibility in building test forms and (at least theoretically) make it impossible for examinees to predict which items they are likely to see—a definite test security advantage. However, quality control over every test form is almost impossible to implement with LOFT and CAT. Instead, strong quality assurance measures must be used. Table 5.2 provides a comparative summary of the four classes of test delivery models. The trade-offs among different types of test delivery models largely depend on the choices of costs and benefits, some of which may be indirect and even intangible (e.g., perceptions of fairness, trust, or integrity by the test users). One thing is clear. A CBT delivery system that restricts the test ad-

Table 5.2  Comparative Summary of Four Classes of CBT Delivery Models

Model    Adaptation    Admin. Units    Test Assembly of Forms
CFT      No            Items           Preconstructed Forms
LOFT     No            Items           Real-Time
CAT      Yes           Items           Real-Time
MST      Yes           Modules         Preconstructed Forms

ministration unit size to a single item or a fixed size testlet or module may be overly restrictive in terms of future capabilities and expansion. For example, a rather typical, linear test form may be broken down into two or more sections, and each section may contain item sets, item groups, or discrete items. The examinee would be administered the test form and then presented with each test section. An extensible test delivery system should be capable of supporting multiple fixed, linear test configurations but should also provide possible future support for computer-adaptive tests, randomized linear-onthe-fly tests, and computer-adaptive multistage testing. Results Management and Transmittal System The results management and transmittal system is activated upon completion of each testing event. Some testing programs immediately score each examinee’s test and present him or her with a score report. Other examination programs require a buffer period so that every examinee’s identity, the integrity of the data, and his or her responses and results can be verified and authenticated, before issuing a score report. Many testing organizations create results files that may contain each examinee’s identifying information, test form(s) taken, sequence of items, responses, and timing on each item, or even a complete transcript of everything the examinee did. Survey data collected after the test is usually included in the results files as well. The results need to be linked back to the examinee data in the registration and scheduling system. Figure 5.3 presents some data linkage possibilities across the two systems. Post-Administration Psychometric Processing System Post-administration processing typically involves multiple statistical analyses, often requiring powerful statistical or psychometric software applications and professional psychometricians and statisticians to determine the appropriate analyses and to interpret the results. The post-administration and psychometric processing system usually offers the following five classes of procedures: (i) item analyses, distractor analyses and key validation analyses to check answer keys and rubrics; (ii) item calibration using an appro-


Figure 5.3  Examinee repositories as part of the registration and scheduling system and the results management and transmittal system.

priate item response theory (IRT) model; (iii) test analysis to determine reliability and/or measurement information, as well as sources of extraneous variance; (iv) psychometric equating analyses to link results across time and establish equivalent interpretations of the scores; and (v) scaling steps necessary to compute score tables, scale scores, or final decision look-up tables. If a testing program has been around for a long time, the processing system may be a collection of commercial and home-grown software applications—many of which may have been developed on mainframe computers for paper-and-pencil testing. These types of legacy systems are often inadequate for the scope of analyses required for CBT. Some simple examples can help make this point. Consider that, for fixed-length paper-and-pencil tests, every examinee is assigned one of several possible test forms. If the test forms are constructed to be reasonably parallel in terms of average item difficulty, an item analysis is relatively simple to carry out simultaneously for all of the test forms. Statistics such as item difficulties (means or p-values) and item-total score correlations are simple to compute. Now consider an adaptive test where the items are selected to tailor the difficulty of each test form to a particular examinee's proficiency. Because of the adaptive item selection, the total scores for every examinee are confounded with sometimes dramatic differences in test difficulty, rendering the entire item analysis somewhat useless.
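To illustrate the contrast, the sketch below computes the classical statistics mentioned above—p-values and item-total correlations—for a small matrix of hypothetical scored responses from a single fixed form. The data are invented for illustration; the point is that these computations are only meaningful when all examinees take comparable forms.

```python
import numpy as np

# Hypothetical scored responses (rows = examinees, columns = items; 1 = correct).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])

p_values = scores.mean(axis=0)              # classical item difficulty (proportion correct)
total = scores.sum(axis=1)                  # raw total score per examinee
item_total_r = np.array([
    np.corrcoef(scores[:, j], total)[0, 1]  # item-total correlation for item j
    for j in range(scores.shape[1])
])

print("p-values:", np.round(p_values, 2))
print("item-total correlations:", np.round(item_total_r, 2))
```

With adaptive data, the raw total score is no longer comparable across examinees who saw forms of different difficulty, so these simple indices lose their interpretability and IRT-based statistics must be used instead.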


Another example involves IRT calibrations. Many CBT programs employ large numbers of test forms for security reasons. Software that may once have been fully adequate to calibrate 100 to 200 items must now simultaneously handle perhaps thousands or even tens of thousands of items in a single IRT calibration. Although commercial IRT software certainly exists for handling tens of thousands of items, legacy systems probably do not handle problems of that size because the software designers never envisioned a move to CBT. The post-administration psychometric processing system also requires a great deal of data coercion to force the data to conform to formats supported by the analysis software. For example, Figure 5.4 presents three data views of examinee responses on a multiple-choice test. Figure 5.4a represents a partial examinee record containing the raw data from a CBT results file (see Results Management and Transmittal System).

Figure 5.4  Results data in three views (a) Results; (b) Raw data flat file; (c) Scored data flat file.


Figure 5.4b displays the raw responses in a more conventional flat-file format that might be submitted to an item analysis program. Figure 5.4c displays the same responses in a flat file, this time scored as "1" for correct or "0" for incorrect responses. Each of these files serves a different purpose (raw storage of results, item and distractor analysis, IRT calibration). The term flat file refers to an implicit file format that typically lists items or variables (i.e., data fields) in one or more columns and examinee records as rows, with one or more records per examinee. A fixed-format flat file places each item or variable in a specific set of columns. A delimited flat file would separate the data in each record by commas, tab characters, or some other character. Most statistical analysis software—and most psychometric item analysis and calibration software—assumes that the data will be provided in a flat-file format. Therefore, the data need to be reformatted for each data view and analysis purpose. Data extractions often require full collaboration between the end users—in this case statisticians or psychometricians—and the database management team. The database management team needs to approach extracts as a standardized query design process that is repeated for every exam processing cycle, rather than as a one-time, unnecessary request. Formally speaking, an extract begins with a structured query of a data source such as a table of examinees. Based on the data returned by the database management software in response to the query, a data view is prepared that reformats the query results using a set of restructuring functions that produce one or more data sets (or record sets). The result of a standardized extract should be a formatted file structure that serves a particular purpose. Multiple data views should be generated for different uses (e.g., test assembly, item analysis, calibration, scoring). In this context, standardization implies that the query and generation of data views from each extract are well-structured and reusable. Furthermore, new instances of an extract can be generated by manipulating the properties of a particular data view to change the outputs (e.g., changing or restricting data types, presentation formats). Figure 5.5 shows an example of six data tables that might be included in a standardized extract from the results management and transmittal system, the item development and banking system, and the test assembly and publishing system in order to carry out item analyses and key validation, IRT calibrations, or equating analyses for some number of CBT forms. (Note that, with only minor modifications, this same type of extraction would be reasonable for almost any of the test delivery models described earlier.) The databases being queried as part of an extract may store the data in very deeply structured, normalized tables. These tables must be queried, joined using established or computed relations, and restructured to generate the types of analysis files required for psychometric processing.
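As one illustration of such a standardized extract, the sketch below joins raw responses to an answer-key table and writes a delimited, scored flat file in a fixed item order, using a simple convention of 1 = correct, 0 = incorrect, and 9 = not administered. All table, field, and file names are hypothetical, and in practice the inputs would come from database queries rather than in-memory dictionaries.

```python
import csv

def build_scored_flat_file(results, answer_keys, item_order, out_path):
    """Write one scored record per examinee (1 = correct, 0 = incorrect,
    9 = item not administered or not matched).

    results:     dict examinee_id -> {item_id: raw_response}
    answer_keys: dict item_id -> keyed response
    item_order:  master item list defining the fixed column layout
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["examinee_id"] + item_order)
        for examinee_id, responses in results.items():
            row = [examinee_id]
            for item_id in item_order:
                if item_id not in responses:
                    row.append(9)
                else:
                    row.append(1 if responses[item_id] == answer_keys[item_id] else 0)
            writer.writerow(row)

# Hypothetical extract results assembled from the queries described above.
keys = {"i01": "B", "i02": "D", "i03": "A"}
raw = {"e001": {"i01": "B", "i03": "C"},
       "e002": {"i01": "A", "i02": "D", "i03": "A"}}
build_scored_flat_file(raw, keys, ["i01", "i02", "i03"], "scored_responses.csv")
```

Because the query, the join to the key table, and the output layout are fixed, the same extract can be rerun unchanged for every processing cycle—the kind of standardization argued for here.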


Figure 5.5  Extraction of linked data tables for item analysis, key validation, IRT calibration and/or equating.

Figure 5.6 graphically shows an example of the results of a series of data extracts, joins, and restructuring steps that might create analysis files for an IRT calibration. The items are queried and then sorted (top of figure). At the bottom of the figure, the examinees are similarly queried (e.g., all first-taker examinees who completed the test between two dates). The examinee query then extends to the results file to retrieve the item response data for each of those examinees. The item identifiers are then matched to the sorted item list at the top of the figure. Those item-matched responses are then formatted with one examinee record per row as a scored response file, where "1" denotes a correct response, "0" denotes an incorrect response, and "9" is a missing (unmatched or not administered) item response. The formatted data file in the center of Figure 5.6 would then be submitted to the IRT calibration software for processing. The point in discussing standardized extracts is that merely getting the data into the results management and transmittal system is not enough. Highly competent psychometricians and statisticians can be entirely stymied by limited access to the data needed for their analyses—whether due to a lack of understanding about queries and the data structures, or due to some fundamental miscommunication between them and the database administrators. Standardized extracts meeting all of the routine analysis needs of the psychometricians and statisticians should be implemented as quickly as feasible.

Score Reporting and Communication System

The score reporting and communication system provides score reports to test takers, transcripts to authorized recipients, and extract files to other or-


Figure 5.6  Standardized data extractions for analysis: queries and reformatting of the data.

Score Reporting and Communication System
The score reporting and communication system provides score reports to test takers, transcripts to authorized recipients, and extract files to other organizations. The latter types of extracts containing examinee scores are common in certification and licensure, where a certification or licensing agency or board receives the final scores and issues the official score reports. Although the score reporting and communication system may seem relatively straightforward, it should receive as much quality control support and verification work as any other system. Sending the wrong score file out to a client is disastrous and can be avoided by putting appropriate verification and quality control steps in place.
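One simple, automatable verification step before a score file is transmitted is to reconcile the file against the authoritative results database and attach a checksum. The sketch below is illustrative only; the file layout and field names are assumptions, not those of any particular score reporting system.

```python
"""A minimal sketch of a pre-transmission check on an outgoing score file:
reconcile its examinee IDs against the results database and return a checksum
the recipient can re-verify. Paths and field names are hypothetical."""
import csv
import hashlib

def verify_score_file(path, expected_ids):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    file_ids = {row["exam_id"] for row in rows}
    if file_ids != set(expected_ids):
        raise ValueError(f"ID mismatch: missing={set(expected_ids) - file_ids}, "
                         f"extra={file_ids - set(expected_ids)}")
    if len(rows) != len(file_ids):
        raise ValueError("Duplicate examinee records in score file")
    # The checksum travels with the file so the receiving agency can confirm
    # that what it loads is exactly what was verified and sent.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```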


Quality Control and Assurance System
The quality control (QC) and quality assurance (QA) system has the primary responsibility for maintaining version control over application software and data structural modifications, signing off on procedural variations, determining root causes and solutions for problems, carrying out routine tests and audits of all processes and procedures, and verifying all results. Whether these functions are performed by a separate QC/QA group or integrated into the other CBT systems, they are absolutely essential. QC procedures tend to emphasize the testing and evaluation of system components as well as discovering the root causes of defective outcomes. In contrast, QA is more proactive and focuses on stabilizing or improving processes to preclude defects. Both QC and QA are ongoing. Unfortunately, QC and QA staff are often only recognized when problems occur; however, their roles in the ongoing efforts to eliminate data errors and other defective outcomes are an essential part of any CBT enterprise.
Some Recommendations for Sound CBT Systems Implementation
The discussion of the eight systems in a CBT enterprise should make it clear that there are numerous repositories to be designed, many software applications to be written and tested, and an extensive array of human-led procedures to implement and document. The apparent complexity of the system is real, but manageable through proper planning. A realistic view of CBT implementation merely needs to recognize the serious resources and pragmatic timelines that may be needed. The reward can be a highly efficient, cost-effective system that provides better and fairer tests in less time and with fewer human resources required, once the fully integrated enterprise is operational. But that reward can only be realized if each of the systems is built using the best components and staffed with the best people. Where possible, procedures should be automated or at least provided in computer-assisted modes. Robust components should be locked down with strong version controls and usage policies, and operators at all levels should be discouraged from engaging in customizations and manual operations. Where manual operations are required (e.g., determining key changes, interpreting equating analyses), all results should be independently verified by two or more competent individuals. The need for audits and multiple verification steps cannot be stressed enough. In general, most errors and costs occur whenever humans are directly involved in test development, administration, and processing of test results.



Building Robust Repositories
Object-oriented principles such as encapsulation, flexibility, extensibility and scalability should become concrete goals when designing every data repository. In addition to flexible and extensible data structures, all of the systems in the CBT enterprise need to firmly adhere to two basic tenets of database management: (a) single source and (b) object uniqueness. The single-source principle is relatively straightforward. It implies that there is a master database or repository that holds a single “official” version of every data object (item text, answer keys, etc.). Any changes to the data object should be made to this master version and forward propagated to every use instance of the object. The most common application of this principle occurs with changes to a multiple-choice answer key due to a simple typo in the database resulting in a mis-key. Since mis-keys are often caught late in the examination processing cycle—after results are processed for some number of examinees—it is often tempting to make the key changes to intermediate analysis files. A better approach is to make the change to the master database and then regenerate all of the intermediate files using established data extraction scripts or code. Two versions of the same data object in different files are a quality control fiasco waiting to happen. The object uniqueness principle is fundamental for data integrity. Any data object—for example, the text of a test item, the content codes for that item, or the answer key—is considered a unique entity by the data management system. If we create multiple instances of those entities, we must assign them unique identifiers. For example, if one item is revised after being administered on a test form—perhaps because of a minor flaw in wording—the newer version of the item becomes a new entity (item) and should be assigned a unique identifier. Different versions sharing the same object identifier should never be allowed. Careful version controls, modification policies enforced by the data management software, and “lock-down” of each data object are critical for implementing this principle in practice.
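The object uniqueness principle can be enforced directly in the data structures used by item banking software. The following sketch, with hypothetical class and field names, shows one way to make banked items immutable and to give every revision a new identifier with a provenance link back to its parent version.

```python
"""A sketch of the object-uniqueness principle: a revision never overwrites a
banked item; it becomes a new, uniquely identified entity. Names are invented."""
import uuid
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)              # frozen: a banked object cannot be mutated in place
class Item:
    item_id: str
    stem: str
    key: str
    version: int = 1
    parent_id: Optional[str] = None  # provenance link back to the prior version

def revise_item(original, **changes):
    """Return a new Item with a new unique identifier; the original is untouched."""
    return replace(
        original,
        item_id=uuid.uuid4().hex,
        version=original.version + 1,
        parent_id=original.item_id,
        **changes,
    )

banked = Item(item_id="ITM-0001", stem="Which value is prime?", key="C")
revised = revise_item(banked, stem="Which of the following values is prime?")
assert revised.item_id != banked.item_id and banked.stem.startswith("Which value")
```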


Building Robust Procedures
Although manual procedures are occasionally needed to meet particular needs, every procedure—whether computerized or human-led—should be documented with a “lock down” step that precludes customization or modification. When modifications are made, a prescribed set of test procedures should be employed to ensure that proper and stable results are obtained. The revised procedures should then be “locked down.” Version control of every procedure is essential—especially for the most routine procedures. Since most procedures in CBT are aimed at processing data, it is also essential to set a goal of 100 percent audit of all data. Many of the audit procedures can be automated and included as part of the overall process. One important data verification or audit procedure is called data reconciliation. Reconciliation refers to bringing things into balance or harmony. In accounting, the principle is usually associated with balancing accounts—for example, balancing a checking account to ensure that the bank’s records reconcile to an externally maintained balance. Reconciliation can be applied to almost every part of the CBT enterprise. The assignment of items to test forms should never be only approximately accounted for. Answer key changes should be summarized and verified in the master database as well as for every live (i.e., active) instance of an answer key or rubric file. Turning to results, examinee data should never be “lost within the system.” In short, every response and every examinee record should be confirmed to ensure that there are no discrepancies or corruptions of the data. This ability to carry out a 100 percent audit of every aspect of the testing event is one of the capabilities that clearly distinguishes a well-designed CBT enterprise from paper-and-pencil testing. That is, physically accounting for every piece of paper and certainly every response cannot be achieved with test booklets and answer sheets/booklets. With digital records, it is not only feasible, but should be standard operating procedure for every testing organization. Eligible examinee records should be reconciled to scheduled examinee events. Scheduled testing events must be reconciled to complete and partial examinee records received from testing sites, and data record content should be reconciled to known test forms. Corrupted, duplicate, missing, and partial records all must be accounted for as part of the reconciliation, with appropriate resolution implemented according to established operating policies. Modern CBT requires enormous amounts of data to be moved, usually on a near-continuous basis. For example, 10,000 examinees taking a 50-item computer-based test will generate 500,000 response records (item answers, response times, etc.). Despite the tremendous improvements in data encryption, transmission, and database management technologies over the past two decades, there is always some potential for errors related to data distortion and corruption, broken or faulty data links, or general programming faults in the data management system(s). Quality control and assurance aimed at eliminating virtually all errors should be the ultimate goal in implementing a robust CBT enterprise. Although perfection—that is, completely error-free data—is unlikely in practice, numerous quality control and quality assurance procedures are necessary at different points in time to either reduce the likelihood of data errors (prevention) or at least to identify errors when they occur (detection).
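A basic reconciliation check of received examinee records against scheduled testing events can be expressed in a few lines. The sketch below, using invented identifiers, simply surfaces missing, unexpected, and duplicate records for resolution under the operating policies described above.

```python
"""A minimal reconciliation sketch: compare received examinee records against
scheduled testing events and surface every discrepancy. IDs are invented."""
from collections import Counter

scheduled = {"E001", "E002", "E003", "E004"}
received = ["E001", "E002", "E002", "E005"]   # E002 duplicated, E005 never scheduled

counts = Counter(received)
report = {
    "missing":    sorted(scheduled - set(counts)),   # scheduled but never received
    "unexpected": sorted(set(counts) - scheduled),   # received but never scheduled
    "duplicates": sorted(k for k, n in counts.items() if n > 1),
}

for category, ids in report.items():
    if ids:
        # In production these would block scoring until resolved under policy.
        print(f"{category}: {ids}")
```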


In virtually any database management situation, structure reduces error! If more structure can be imposed on the data, fewer errors are likely because preventative measures are easier to implement. And when errors do occur, it is easier to detect them in highly structured data than in less-structured data.
Notes
1. Based on an invited paper presentation at the Tenth Annual Maryland Assessment Conference: Computers and Their Impact on State Assessment: Recent History and Predictions for the Future, October 18–19, College Park, MD.
2. Some businesses are now offering online proctoring capabilities, using real-time digital camera feeds and other monitoring technologies. The feasibility of these services for high-stakes assessments is questionable.
3. Obviously, formal, structured data repositories are often more reliable and efficient from a data management perspective than unstructured, informally created repositories.
4. XML refers to the Extensible Markup Language, a text-based data format for representing almost any data structures and relations among the data. The XML 1.0 Specification was produced by the World Wide Web Consortium and is consistent with other hierarchical data specifications. XML is highly portable across operating systems and computing platforms.
5. As used throughout this section, the phrase test data objects refers to a reading passage and associated items (i.e., an item set), a module or testlet used in a multistage test, a test section, or an entire test form. Conceptualizing test data objects in a hierarchical manner provides a flexible way to represent even very complex item-to-test configurations.



Part II Technical and Psychometric Challenges and Innovations


Chapter 6

Creating Innovative Assessment Items and Test Forms
Kathleen Scalise

For decades, educators have anticipated that serious games, virtual simulations and other interactive computer activities could one day supply credible assessment evidence to inform teaching and learning (de Freitas, 2006; Gredler, 1996; Stenhouse, 1986). That day may be here or soon arriving. Products now available in several subject matter areas illustrate that progress has been made (Clark, Nelson, Sengupta, & D’Angelo, 2009; Shute et al., 2010; Shute, Ventura, Bauer, & Zapata-Rivera, 2009; Williams, Ma, Feist, Richard, & Prejean, 2007). Plans on the horizon by a number of developers suggest promising trends to come. Products, plans and development pathways for such innovative assessments are the topic of this book chapter. Importantly, the alignment of advances in both information technology and measurement technology is making the promise of problem-based assessments such as games and simulations with formal measurement properties more of a reality (Levy & Mislevy, 2004; Shute et al., 2010; Wilson, 2003). To make information interpretable and establish a strong evidence case for the quality of measures requires engineering formal measurement characteristics into innovative assessments, including technology products (Scalise, 2010b; Scalise, Madhyastha, Minstrell, & Wilson, 2010; Shute et al., 2010).



In this chapter, we will take up several examples of innovative assessments to illustrate chapter topics. Some of the examples may not look like assessment questions and tasks, and some may not be immediately recognizable as assessments at all. Therefore a new definition of assessment is introduced here. The suggestion is that whenever a software product involves three important elements, it is engaged in an assessment practice. The three elements can be summarized in a single phrase: collecting evidence designed to make an inference (Scalise & Gifford, 2008). Collecting evidence is the first element. If information or observations of some kind are collected through the technology, then this can be a part of assessment. The observation could take many forms and may involve the respondent in a wide range of response types, from speech, language, and video to writing, drawing or calculating, to fully enacting a physical performance or creating a product. The second element suggested here is “designed”—in other words, for this discussion of innovative assessments, the conversation is specifically limited to assessments which have been designed to elicit information. This includes knowing what the goals and objectives of the assessment are, and intentionally constructing in the design process a means by which observations and inferences can be made to provide information on the goals. Finally, the third element describes the purpose of the observation and the design: to make an inference. Inferences in assessment may be anything from generating a score or proficiency estimate, reporting a diagnostic profile, making a course placement or other instructional decision, providing work products from which teachers or other persons will draw some conclusions, or even feeding back information to the computer in embedded assessments to adapt instructional content on the fly. All of these involve using the observations and design of the assessments to come to some sort of conclusions about the respondent. Note that in the development of educational technology products, it is not uncommon for developers who are engaged in all three of these practices not to realize or not to acknowledge that their products are involved in educational assessment. This is important to note because when information is used about a student and inferences are made about how he or she is learning or should learn, the evidence should have the characteristics of good-quality evidence (Kennedy et al., 2007). One caution among many caveats for work in this area: it is important to remember that advances in both information technology and measurement technology have been necessary, and that these two sets of technologies describe distinctly different tools that meet distinctly different needs in assessment engineering (Wilson, 2003). Throughout this book chapter, therefore, the two terms will be highlighted separately to distinguish the aspect of technology under discussion.


The plethora of media and tools now available through information technology has lowered the cost and difficulty of creating and delivering innovative computer-based assessments and capturing data. Advances in measurement technology have made it increasingly possible to build and model coherent assessments within more complex contexts, and to interpret the results. Both areas are burgeoning with new possibilities every year. Together they may carry assessment to new levels. It should be noted too that while this chapter focuses on innovative examples in educational assessment, similar conversations also are taking place in such assessment areas as licensure, accreditation, and vocational and occupational assessment. In order to help supply what is needed for measurement technology in such assessments, developers should be able to supply adequate technical reports on their products regarding the assessment practices embedded within the software, and standards for this field should follow acceptable evidentiary practices. The examples in this paper are summarized in Table 6.1. They are drawn either from current products or those in development. The table is not meant to be a comprehensive list but simply to suggest a few contexts for discussion. The contexts shown are: OnPar Simulations (assistive technology), WestEd SimScientist (classroom-based assessment tools), U.S. NAEP Technology-based Assessments (simulation tasks for large scale assessments), Machinima Genetic Counseling Scenario (situational judgment task), River City (a multi-user virtual environment), Cisco’s new Aspire game (21st century skills), Intel’s Tabletop Simulations (webcam behavioral analysis), and Eye Tracking in reading assessments (biometrics). The first three examples involve digital simulations, the next three involve digital games, and the final two explore other innovations that are becoming apparent in the field such as behavioral observations, biometrics, information foraging, crowd sourcing and collaborative assessment (Wilson et al., 2010b).

Table 6.1  Some Examples of Innovative Assessments, Including Simulation, Gaming and Other Formats

Example                                    Focus
OnPar Simulations                          Assistive Technology
WestEd SimScientist                        Classroom-based Assessment Simulations
U.S. NAEP Technology-based Assessment      Simulation for Large Scale Assessments
Machinima Genetic Counseling Scenario      Second Life Situational Judgment Task
River City                                 Multi-user Virtual Environment or MUVE
Cisco’s Aspire Game                        21st Century Skills for Career Assessments
Intel’s Tabletop Simulations               Webcam Behavioral Assessment
Eye Tracking in reading assessments        Biometrics in Innovative Assessments


For purposes of this chapter, simulation and gaming will be defined using the Clark Gaming Commissioned Paper from the U.S. National Academy of Sciences (Clark et al., 2009):
• Digital Simulations: “Computational models of real or hypothesized situations or phenomena that allow users to explore the implications of manipulating or modifying parameters within the models” (p. 4).
• Digital Games: “digital models that allow users to make choices that affect the states of those models” having “an overarching set of explicit goals with accompanying systems for measuring progress” and including “subjective opportunities for play and engagement” (p. 26).
Most of the assessments in the examples come from the perspective of encouraging students to learn while being assessed. Measures are interactive, and content and skills are often integrated. There may be a collaborative element to the simulation, scenario or game, either through prerecorded digital characters or with live virtual teams of students. The following Figures 6.1 through 6.8 are included for commentary and scholarship in the field of assessment. For instance, in the Machinima Genetic Counseling Scenario shown in Figure 6.1, high school students become “virtual interns” and learn by working on cases involving virtual patients who present with various concerns, much like real genetic counselors (Svihla et al., 2009).

Figure 6.1  These situational judgment tasks created in tools such as Second Life assess student knowledge as counselors and in other role play situations. (http://staff.washington.edu/djgawel/ncavideo/) Source: Machinima Genetic Counseling Scenario


Each case can be considered similar to a test form and involves a simulated meeting among the student intern, a virtual mentor and clients seeking genetic counseling. Each meeting is recorded in advance and then presented to subsequent students as video clips. Sickle-cell disease was one topic, with cases created by filming avatars, or animated characters, in the “Second Life” virtual world. The assessment explores skills and knowledge in inheritance, evolution, gene-environment interactions, protein structure-function, political policy, and bioethics. What quickly becomes clear in the examples to be described is that through technological innovations, the computer-based platform offers the potential not only for enhanced summative or judgment-related scoring but for high quality formative assessment. This can closely match instructional activities and goals, make meaningful contributions to the classroom, and perhaps offer instructive comparisons with large-scale or summative tests. Formative assessment has many meanings in many contexts. Here it is defined as assessment able to contribute evidence to “form” or shape the instructional experience. The examples, including demonstrations in most cases, can be seen in more depth at the websites listed in Table 6.1. To briefly describe the rest, the OnPar example in Figure 6.2 shows simulation assessments that include assistive technology, to reduce language load and provide multiple ways to access information. In Figure 6.3, WestEd simulations are shown that are intended to be used in the classroom and provide robust assessment information on students. Figure 6.4 is a large-scale example of simulations used in the U.S. NAEP technology assessments. Here examinees adjust variables and examine effects in various simulations and associated graphs and displays. In Figure 6.5 the River City MUVE (multi-user virtual environment) takes on a historic context for the game and includes collaborative space and tools for students to work together. Figure 6.6 shows a new product from Cisco, previewed for assessment audiences in 2010. In a number of products for its training academy, Cisco uses practices of extensive domain analysis and job task analysis to identify the goals and objectives, or constructs of interest to measure, as well as to establish good alignment between instruction and assessment. Aspire offers a game interface and 21st century assessments that model real-life job aspects. These include such features as constant interruptions of the “job” assessment by email requests from clients and competing tasks that need to be completed. Finally, Figure 6.7 reverses the technology equation by having students work hands-on in physical situations, with the technology observing and recording as a form of innovative assessment. Figure 6.7 shows a table top simulation from Intel during which elementary age students work on learning projects and are watched by the technology through overhead webcams, to generate an assessment of their performance.


Figure 6.2  OnPar simulations are assessments that include assistive technology, to reduce language load and provide multiple ways to access information. Source: ONPAR Assessment for English Language Learners.

Figure 6.3  WestEd simulations are intended to be used in the classroom and provide robust assessment information on students. Source: SimScientists WestEd.


Figure 6.4  The new U.S. NAEP technology assessments include simulations in which examinees adjust variables and examine effects in various simulations and associated graphs and displays. Source: ED.gov, Technology-Based Assessment Using a Hot-Air Balloon Simulation, http://www.ed.gov/technology/draft-netp-2010/techbased-assessment

Figure 6.5  This MUVE (multi-user virtual environment) takes on an historic context for the game, and includes collaborative space and tools for students to work together. Source: The River City Project, http://muve.gse.harvard.edu/rivercityproject/curriculum.htm


Figure 6.6  A new product from Cisco in 2010 offers a game interface and assessments for their career training academy, with assessment engineering based on extensive job task and domain analysis. Source: Cisco Systems, Inc., Passport21.

Another example of reversing the technology direction can be found in reading assessments with eye-tracking headgear. These biometric assessments are a technology-based innovation that provides data on reading patterns and focal points. Diagnostic profiles from such biometric approaches may reveal where student reading approaches can be improved and are indicating that there are substantially different patterns among striving readers. As the digital divide lessens, it would seem that technologies such as these should be poised to take advantage of new frontiers for innovation in assessment, bringing forward rich new assessment tasks and potentially powerful scoring, reporting and real-time feedback mechanisms for use by teachers and students. Indeed, some researchers and practitioners have described it as inevitable that engaging computer-based activities will come to be used extensively for data collection and reporting purposes in educational assessment.


Figure 6.7  A look ahead: This table top simulation from Intel reverses the technology equation—students work on physical objects in the classroom but are watched by the technology through overhead webcams, to generate an assessment of their performance. Source: Intel Corporation, Projector-Based Augmented Reality Interface (Simulated).

This is where measurement technology enters the picture. As will be seen throughout the products examined, the complexity of untangling formal measurement information from such complex settings is new territory for the field of educational measurement. The difficulties of creating credible assessments and the lack of numerous working examples contribute to the hesitancy of schools and policy settings to incorporate such information into the assessment mix. However, the vast array of new measurement techniques and the advances in areas such as multidimensional modeling are providing substantial new toolkits.
What is meant by innovative assessment?
We have encountered some surprises in working with research and practice teams deploying innovative assessments. First, although it is anticipated from the research literature that interactive assessments are likely to be more engaging for students than some traditional assessment formats, we have found them to be remarkably interesting to students. Teachers in our Formative Assessment Delivery project, for which some summary assessment results are shown in Table 6.2, report that they have never before had students ask for additional homework, but they do so in order to do more of the interactive assessments outside of class.


Teachers also describe students who are absent and miss taking an assessment, and who remind the teacher on subsequent days and request to take “their assessment” when they return. In our own work, we are using evidence-centered design practices through the BEAR Assessment System (Berkeley Evaluation and Assessment Research) to help establish the coherence of assessments (Wilson, 2005). This approach provides tools to establish and align the goals and objectives of measurement with (i) the observations made through the innovative task formats, (ii) the outcome space or scoring approach established for interpretability, and (iii) the use of measurement models to aggregate data and help build good evidence properties for the new kinds of assessment. Demonstration tasks showing pathways to development and including a resource kit to be used by developers are being generated in the ATC21S project and should be of interest for developers of serious games, simulations and other interactive learning materials designed to include embedded assessments that offer good evidence of learning (Wilson et al., 2010a, 2010b). Table 6.2 shows some results of using this approach to evidence-centered design to create some interactive assessments. We have also found in these research field trials that it is often hard to get the students to stop taking the interactive assessments after the time allotted is complete, and to end their game or simulation play. This has been true across a variety of formats and degrees of sophistication of the assessments. It is not unusual for students to request to return to the queue and have the opportunity for another “turn” at the assessments. This has been the case even when field trials have been conducted in highly engaging environments, such as science museums where children have the choice of a wide variety of interesting activities.

Table 6.2  Some Analysis Results for Our Beta Trials Activity Including Natural-user Interface Objects for Assessment, March 2010

Activity 1 Validity Study                                      Result
Sample Size                                                    574 respondents
Mean time to completion of assessment                          9 minutes, 9 secs.
Number of items total in instrument                            22
Percent auto scoring                                           91% (20 of 22)
Percent open-ended (hand-scored with rubric)                   9% (2 of 22)
Missing data proportion                                        1.4%
Model                                                          PCM
Cronbach’s alpha                                               .79
MLE person separation reliability                              .81
Expected a-posteriori/person variance reliability (EAP/PV)     .80
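For readers less familiar with the internal-consistency index reported in Table 6.2, the sketch below shows how Cronbach’s alpha is computed from a scored response matrix. The tiny 0/1 matrix is purely illustrative; it is not the beta-trial data.

```python
"""A sketch of the Cronbach's alpha computation from a scored response matrix."""
import numpy as np

scored = np.array([   # rows = respondents, columns = items
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
])

k = scored.shape[1]
item_variances = scored.var(axis=0, ddof=1)       # variance of each item score
total_variance = scored.sum(axis=1).var(ddof=1)   # variance of the total scores
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 3))
```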


While this would perhaps not be surprising with elaborate or sophisticated games or simulations that, for instance, matched what is available to children in home game play, we have noted such levels of engagement even in very simple assessment games and interactive scenarios, such as the single-screen assessment objects that will be introduced in the Intermediate Constraint Taxonomy table shown in Figure 6.8. One question this raises in our work is the degree to which not only the gaming and interactivity itself is interesting, but whether there is a role being played by the zone of proximal development. In other words, the assessments are intentionally designed to be in a suitable range of skills and knowledge to support offering robust measurement evidence, with item information tending to be maximized when students are about equally likely to achieve or not achieve the task. Some of the assessments we have used even adapt to this with embedded measures. In addition to providing robust measurement evidence, this has the additional effect of aligning student thinking in places where they may be actively constructing knowledge. This is in strong contrast, for instance, to usual classroom-based assessments, which we have found in our work tend to strongly favor students being able to achieve or be successful on the task, often yielding 80% mastery rates or higher. Could this alignment of maximizing measurement information with assessments that are more challenging but still within reach be especially engaging to student thinking processes, and what is the cognitive science of engagement when students are actively constructing knowledge in such zones of proximal development? Innovative assessments of the type described in this paper may yield some exciting findings in this area. However, one caution is that in a literature survey of more than 100 research studies on educational simulations for grades 6–12 (Scalise, Timms, Clark, & Moorjani, 2009), we found that though the study authors seemed to appreciate the value of engaging and active simulations for the teaching context, when it came to the assessments, the simulations were not used. Software products often tended to leave the simulation context and launch more traditional question-asking strategies such as multiple-choice worksheets and short answer typing for the assessment component, with high mastery ranges in which students were expected to generally know and be able to report back the answer. Would it be possible to retain more of the simulation, gaming and interactivity in the assessments themselves, across products? Part of reaching such goals for product developers is a better understanding of transforming assessments as they are moved from paper-and-pencil formats, rather than simply migrating them across to electronic platforms with little change.


A taxonomy for innovative item formats
Before further exploring the game, simulation and other innovative assessments in the examples, this chapter will step back and discuss some of the research literature on assessment formats. This can help developers move to transforming rather than simply migrating their assessments to technology platforms, and also better connect technologists with research in the measurement field. One of the first questions often asked regarding innovative tasks and test forms is what is meant by innovative? Here it is helpful to connect some of the new work being done in assessment development today with the measurement literature. This will allow a definition of innovation to be proposed later in the paper, while also showing how what is going on today in interactive assessments can be mapped back to the traditional educational measurement literature. To do this, a taxonomy of item formats (Scalise, 2010a; Scalise & Gifford, 2006) is shown here in Figure 6.8, called the Intermediate Constraint Taxonomy for E-Learning Assessment Questions and Tasks (IC Taxonomy). Organized along the degree of constraint on the respondent’s options for answering or interacting with the assessment item or task, the taxonomy describes a set of iconic item types termed “intermediate constraint” items. These item types have responses that fall somewhere between fully constrained responses (i.e., the conventional multiple-choice question), which can be far too limiting to tap much of the potential of new information technologies, and fully constructed responses (i.e., the traditional essay), which can be a challenge for computers to meaningfully analyze even with today’s sophisticated tools. The 28 example types discussed in this paper are based on seven ordered categories involving successively decreasing response constraints, from fully selected to fully constructed. Each category of constraint includes four iconic examples. Literature references for the taxonomy are available online for each cell of the table, as well as working examples of each type and operational code that has been released in open source (http://pages.uoregon.edu/kscalise/taxonomy/taxonomy.html). Literature references illustrating the interactions were drawn from a review of 44 papers and book chapters on item types and item designs—many of them classic references regarding particular item types—with the intent of consolidating considerations of item constraint for use in e-learning assessment designs. Organizing schemes for the degree of constraint and other measurement aspects regarding items can be helpful (Bennett, 1993). One potential limitation for realizing the benefits of computer-based assessment in both instructional assessment and large-scale testing comes in designing questions and tasks with which computers can effectively interface (i.e., for scoring and score reporting purposes) while still gathering meaningful measurement evidence.


Figure 6.8  Intermediate Constraint Taxonomy for E-Learning Assessment Questions and Tasks organizes types of task format innovations that can take place in computer-based assessments. Source: Kathleen Scalise, University of Oregon.

The Intermediate Constraint Taxonomy highlights numerous “intermediate constraint” formats that can readily lend themselves to computer interpretation. The columns for the IC Taxonomy were based on the item format facet of Bennett’s “Multi-faceted Organization Scheme” (Bennett, 1993), with the addition of one column not represented in the original scheme, as summarized in Figure 6.9. In this way, the IC Taxonomy can be considered a modified Bennett framework. Drawing on the concept of what might be called a constraint dimension, the Intermediate Constraint Taxonomy for E-Learning Assessment Questions and Tasks features a variety of innovations in the stimulus and/or response of the observation. IC types may be useful, for instance, with automated scoring in computer-based testing (CBT). IC items and task designs are beginning to be used in CBT, with response outcomes that are promising for computers to readily and reliably score, while at the same time offering more freedom for the improvement of assessment design and the utilization of computer-mediated functionality. The taxonomy of constraint types described here includes some characteristics, previous uses, strengths and weaknesses of each type, and we present examples of each type in figures in this chapter.


Figure 6.9  Randy Bennett’s Multi-faceted Organization Scheme shows the item format facet of task development on the front face of the cube. Source: Kathleen Scalise, University of Oregon.

Intermediate constraint tasks can be used alone for complex assessments or readily composited together, bundled, and treated with bundle (testlet) measurement models (Scalise, 2004). At one end of the spectrum, the most constrained selected response items require an examinee to select one choice from among a few alternatives, represented by the conventional multiple-choice item. At the other end of the spectrum, examinees are required to generate and present a physical performance under real or simulated conditions. Five intermediary classes fall between these two extremes in the taxonomy and are classified as selection/identification, reordering/rearrangement, substitution/correction, completion, and construction types. Note that all item types in the item taxonomy can include new response actions and media inclusion. Thus, by combining intermediate constraint types and varying the response and media inclusion, e-learning instructional designers can create a vast array of innovative assessment approaches and could arguably match assessment needs and evidence for many instructional design objectives.


Media inclusion, simulations, within-item interactivity, and data-rich problem solving in which access to rich resources such as books and references is made available online are all innovations that can be incorporated in many of the item types discussed below. To better understand the IC Taxonomy and how it organizes potential format innovations, consider the first column. Here, a variety of fully selected item formats are shown, of the multiple-choice variety. At the top of the column, in Cell 1A, one of the least complex and most restricted formats is shown—a two-choice item type with only True and False allowed as the answer choices. The next cell, Alternate Choice, slightly enlarges the selection possibilities, by retaining the restriction to only two answers but relaxing the labeling of the answers to include any two choices that the item writer wishes to include. Continuing down the column, the classic multiple-choice item type is shown in 1C, allowing any small number of answers, but in text-only formats. Finally, Cell 1D enlarges this to retain the selection from amongst a small set of answers, but in which the prompt or answers can include new media such as pictures, sounds and animations. Similarly, going across the columns, the degree of construction of the response becomes more complex, for instance with categorizing formats (3B), limited figural drawing that introduces a figure and asks the respondent to make changes to it (4C), calls for completion of a matrix format (5D), and, for instance, fully figural drawing that does not begin with a composed figure but only building elements (6B). To see how this matrix can be applied, consider the screenshot of the animated figure below (Figure 6.10). The small clown animation includes a “magic wand” tool. When the tool passes over the figure in various ways, the figure performs actions a clown might make, such as releasing an item from his hat, making leaping motions or snatching objects from the air. Such an animated figure might be used, for instance, in a measure of reading comprehension. Using the clown action figure, the respondent could indicate what he or she believed took place in the reading passage. This is an example of Cell 6A, open-ended or “uncued” multiple choice. In this item type, while the respondent is still selecting a response, the selection is made from a large number of options. Theoretically, if the animated figure were enabled to make the full range of such a figure’s actions, this could be described as encompassing the entire “outcome space,” or possible ideas a child might have about what a clown could have done. Many IC types, such as limited figural drawing and animated role playing to capture assessment information, will be seen entering into the development of games and simulations later in this chapter.
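One way a development or assessment engineering team might operationalize the taxonomy is to tag each item or task with its constraint category and cell code so that the classification travels with the item through banking, scoring, and analysis. The sketch below is hypothetical; only the seven-category ordering and the cell codes (such as 6A for uncued multiple choice) come from the taxonomy itself, while the class and field names are invented for illustration.

```python
"""A hypothetical sketch of tagging tasks with their IC Taxonomy location."""
from dataclasses import dataclass
from enum import IntEnum

class Constraint(IntEnum):   # 1 = most constrained response ... 7 = least constrained
    FULLY_SELECTED = 1
    SELECTION_IDENTIFICATION = 2
    REORDERING_REARRANGEMENT = 3
    SUBSTITUTION_CORRECTION = 4
    COMPLETION = 5
    CONSTRUCTION = 6
    FULLY_CONSTRUCTED = 7

@dataclass
class AssessmentTask:
    task_id: str
    cell: str                # e.g., "6A" = open-ended or "uncued" multiple choice
    constraint: Constraint
    uses_new_media: bool
    auto_scorable: bool

clown_task = AssessmentTask(
    task_id="READ-CLOWN-01",
    cell="6A",
    constraint=Constraint.CONSTRUCTION,
    uses_new_media=True,
    auto_scorable=True,
)
print(clown_task)
```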


However it is organized, it should be noted that in different contexts, assessment developers are tending to find themselves starting out in different parts of the IC table. This is because when migrating and transforming paper-and-pencil assessments to technology-based formats, or creating new assessments, traditional approaches differ across regions and assessment purposes. In the U.S., for instance, innovation in educational assessment today is often moving from traditional assessments in the fully selected leftmost column of the table, to the right toward more constructed types to increase the range of what can be represented in assessments, and measure aspects of constructs that have been difficult to measure in the past. In the UK, however, traditional assessments are much more likely to fall into the right-most columns of the table. Their challenge in taking advantage of computer-based formats often is to move to the left in the table, to take advantage of some of the automated data collection, scoring and real-time reporting capabilities of new computer-based assessments. Therefore it can be seen that what is a new innovation in any particular assessment context, whether computer-based or otherwise, depends on what has been happening in that particular context. The needs and purposes for change should be shaping and nurturing evolution in the area. Thus we finally arrive at proposing a definition for an innovative item type:

Figure 6.10  Interacting with animated figures in role play such as for this carnival clown is an example of Cell 6A in the IC Taxonomy. Source: Karina Scalise, Harvard University.


It should be defined in context, with properties that include being a useful or potentially useful form distal to, or different from, current or standard assessment design practices in the context.
Simulation, Game or Other Task Surround as the “Test Form”
An interesting element to note about the IC Taxonomy is that there are no question or task examples shown in Column 7, which represents the most fully constructed and complex activities. This column would include most games used for assessment and many simulations, if they involve some degree of complexity, as well as numerous other activities as shown in the table, including teaching demonstrations, full virtual laboratory experiments, and projects in project-based learning. The reason they are not shown is that here the taxonomy begins to transition from questions and small tasks—sometimes called items and item bundles in formal measurement—to larger activities that usually consist of a series of tasks and interactions. How might it be helpful to think of task surrounds such as games and simulations as test forms? An important part of incorporating advances in measurement technology into innovative assessments is being able to identify measurement principles that relate to the new approaches. These may be used to build a better evidence case and establish the credibility of the inference process. For assembly of larger instruments from a collection of questions and tasks, numerous assessment engineering techniques have been developed to help create, assemble and evaluate test forms. Some of these approaches seem as if they could be readily extended to the new formats, and lend new measurement tools to technologists. If the individual opportunities for collecting assessment information are conceived of according to format as in the Intermediate Constraint Taxonomy and considered items or sets of items, then the task surround of the overall game or larger simulation becomes in a sense the “test form.” In the world of computer-based innovative assessments, Column 7 of the IC Taxonomy then could be considered the transition to a type of “test form,” or assembly of IC opportunities for students to show what they know and are able to do. In this way, the gaming and simulation examples for this chapter, shown in Table 6.1, are all examples of activities that fall into Column 7 of the IC Taxonomy. If it is a large game or an extensive simulation, or if the game or simulation adapts to offer different experiences to different users, this might be thought of as containing multiple forms or alternate forms, or as having the “forms” constructed on the fly from an adaptive item “pool,” this last approach mapping to computer-adaptive testing.


The promise of assembling innovative formats and using them for assessments in simulations and games has long been evident. Computers and electronic technology today offer myriad ways to enrich educational assessment both in the classroom and in large-scale testing situations. With dynamic visuals, sound and user interactivity, as well as adaptivity to individual test-takers and near real-time score reporting, computer-based assessment vastly expands testing possibilities beyond the limitations of traditional paper-and-pencil tests. The possibility of rich problem-based settings and complex interactive formats extends greatly through simulations, gaming, crowd sourcing, collaboration opportunities and other types of new tasks.
Back to the Simulation and Gaming Examples
With these ideas of taxonomies of formats, and of test forms that can be task surrounds built from taxonomical types, we now return to the simulation, gaming, behavioral and biometric examples of assessment described earlier in the chapter. Careful examination of the interactions and data collection opportunities in most of the game and simulation examples reveals that most, if not all, of the discrete data collection observations that result from students operating on a technological interface can be described as some item format falling into the modified Bennett framework—or in other words, either representing an actual cell in the IC Taxonomy or possible to classify as an additional row falling under one of the modified Bennett column headings. For instance, in the WestEd SimScientist example of connecting fish in a food web, the drawing of the arrows can be seen as an example of limited figural drawing, or adjusting a figure that has already primarily been presented. One innovation it adds over the example in the table is substituting a dynamic graphic, or animation, for the static picture shown in the example. The NAEP technology-based assessments and balloon example capitalize on the uncued or open-ended multiple-choice item type, with slider bars allowing the simulation to represent the full outcome space. The situational judgment tasks use Second Life to generate a series of game-like scenarios and thus provide new media in what can be considered the supporting prompt material, followed by a small set of selected response choices, and thus are an example of multiple choice with new media. In such examples as these, the modified Bennett types of individual interactions are designed to work together and may cross numerous types, in the end forming a set of observations designed to make inferences about the respondents. In this way they can be conceived of together, in the task surround of the game or simulation, as a type of test form. Assessment engineering practices such as evidence-centered design, including the UC Berkeley BEAR Assessment System approaches used in some of our work, can be employed to design coherence into the assessment materials and align the constructs and observations with the evidence and inferences, as will be discussed below in the ATC21S scenario examples.


One challenge in thinking of engineering innovative educational assessments arises when the technology is being used not by the learners but to observe the learners. Examples are shown in the webcam behavioral video segment and the eye-tracking biometrics example. How should the resulting products be interpreted for assessment information? These types of assessments are only recently coming into play, and there is very sparse research regarding their use. However, often such work products are analyzed with automated scoring and artificial intelligence algorithms. Here, coding of individual indicators often becomes key to interpretation, such as momentary time sampling of eye track and focus relevancy. This can render the interpretation format as uncued or open-ended multiple choice, since the entire outcome space—all possible focal points—is offered, from which the student “selects” during the reading process. Much of such work is new and may offer some promising avenues for use in innovative assessments. The approaches being used in the field for automated scoring thus deserve considerable attention in any discussion on deploying innovative formats in educational assessment, but since this volume includes a chapter on the topic, we will not explore it further here.
Assessment Engineering and Promising Development Pathways
Interest in the operational use of innovative formats for educational assessment is undergoing rapid change. In the U.S., for example, both of the Race to the Top consortia for state assessment selected for funding by the U.S. federal government in 2010 include extensive plans for innovative item types in their test forms and item pools, while the 2010 U.S. National Educational Technology Plan (NETP) calls for extensive use of innovation in educational assessment, including promoting the use of numerous games, simulations and other scenarios. This 2010 NETP plan, “Transforming American Education: Learning Powered by Technology,” includes a number of goals and recommendations for the future of U.S. educational assessment. It stresses providing timely and actionable feedback about student learning in the classroom. Data are to be used directly to improve achievement and instructional practices for individual students, rather than primarily as a post-intervention accountability process.


The intent is to serve not only students and teachers but also a variety of other educational stakeholders, such as school administrators and the state, for continuous improvement. Goals of the plan emphasize the need for much richer and more complex tasks, building capacity for use and research on assessments in the schools, and revising policies and regulations to protect privacy while enabling effective data collection and use. The U.S. NETP correctly points out that many e-learning technology products exist that can begin to “make visible sequences of actions taken by learners in simulated environments,” and help “to model complex reasoning tasks.” Making some actions visible is a good step along the way toward assessment but does not necessarily rise to the quality of evidence needed to justify fair, valid and reliable measurement. What are good processes, then, for developing credible assessments using innovative formats? As Messick (1989) has suggested, a validity argument in assessment consists not only in showing that the evidence collected does support the intended inferences, but also in showing that plausible rival inferences are less warranted. This is where the specifications of the tasks are crucial, particularly in the context of 21st century skills, which often must necessarily involve complex performances. The collection of tasks presented to students must be designed and assembled in such a way that plausible rival interpretations—such as the possibility that success was due to familiarity with the particular context rather than the underlying skill—are less warranted than the intended inferences. How do or can simulations, games and other assessments build a case for credible evidence, especially if the information is to be used to inform large-scale assessment, or yield classroom data that influences decision-making about students? The field of educational measurement is grappling with this question in a variety of ways. One perspective describes educational assessment as having been in an era of sparse data availability—a data desert—and now entering a time of much more prolific data opportunity, or so-called “data density”—a data jungle (Behrens & DiCerbo, 2010). From this perspective, some practices of educational measurement were built to cull a reasonably strong evidentiary signal from relatively few and limited data collection opportunities. Now with much greater availability of data, the field can begin to expand. The goal is to incorporate into practice not only a much wider array of innovative item formats, but also to employ many advances in new measurement technologies. This could mean making operational approaches and models that previously rarely left the research arena but now could serve the field more broadly. It should be noted, however, to carry the metaphor one step forward, that while a desert is a complex ecosystem and requires adaptation for survival, so does a jungle. Interpreting patterns from large and complex datasets requires much coherence in assessment practices, including use of robust tools and well-researched techniques.


This is especially true when the inferences to be made are expected to pick up an array of subtle, diagnostic patterns, be used to establish reliable trends over time, and provide the ability to capitalize on near real-time feedback. These are ambitious assessment challenges and require attention to good evidentiary practices in order to realize the potential of the approaches.

References

Behrens, J., & DiCerbo, K. (2010, October). What can we learn from the application of computer based assessment to the industry. Tenth Annual Maryland Assessment Conference, Computers and Their Impact on State Assessment: Recent History and Predictions for the Future. Retrieved from http://marces.org/conference/cba/agenda.htm
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 1–27). Hillsdale, NJ: Lawrence Erlbaum Associates.
Clark, D., Nelson, B., Sengupta, P., & D’Angelo, C. (2009). Rethinking science learning through digital games and simulations: Genres, examples, and evidence. An NAS commissioned paper. Retrieved from http://www7.nationalacademies.org/bose/Clark_Gaming_CommissionedPaper.pdf
de Freitas, S. I. (2006). Using games and simulations for supporting learning. Learning, Media and Technology, 31(4), 342–358.
Gredler, M. E. (1996). Educational games and simulations: A technology in search of a research paradigm. In D. H. Jonassen (Ed.), Handbook of research for educational communications and technology (pp. 521–539). New York: MacMillan.
Kennedy, C. A., Bernbaum, D. J., Timms, M. J., Harrell, S. V., Burmester, K., Scalise, K., et al. (2007, April). A framework for designing and evaluating interactive e-learning products. Paper presented at the 2007 AERA Annual Meeting: The World of Educational Quality, Chicago.
Levy, R., & Mislevy, R. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4(4), 333–369.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.
Scalise, K. (2004, June). A new approach to computer adaptive assessment with IRT construct-modeled item bundles (testlets): An application of the BEAR assessment system. Paper presented at the 2004 International Meeting of the Psychometric Society, Monterey, CA.
Scalise, K. (2010a, May). Innovative item types: New results on intermediate constraint questions and tasks for computer-based testing using NUI objects. Session on Innovative Considerations in Computer Adaptive Testing, paper presented at the National Council on Measurement in Education Annual Conference, Denver, CO.

154    K. SCALISE National Council on Measurement in Education Annual Conference, Denver, CO. Scalise, K. (2010b, May). The influence and impact of technology on educational measurement. Invited Symposium, National Council on Measurement in Education (NCME), Denver, CO. Scalise, K., & Gifford, B. R. (2006). Computer-based assessment in e-learning: A framework for constructing “intermediate constraint” questions and tasks for technology platforms. Journal of Teaching, Learning and Assessment, 4(6). Retrieved from http://www.jtla.org Scalise, K., & Gifford, B. R. (2008, March). Innovative item types: Intermediate constraint questions and tasks for computer-based testing. Paper presented at the National Council on Measurement in Education (NCME), Session on “Building Adaptive and Other Computer-Based Tests,” New York, NY. Scalise, K., Madhyastha, T., Minstrell, J., & Wilson, M. (2010). Improving assessment evidence in e-learning products: Some solutions for reliability. International Journal of Learning Technology, Special Issue: Assessment in e-Learning, 5(2), 191–208. Scalise, K., Timms, M., Clark, L., & Moorjani, A. (2009, April). Student learning in science simulations: What makes a difference. Paper presented at the Conversation, Argumentation, and Engagement and Science Learning, American Educational Research Association. Shute, V., Maskduki, I., Donmez, O., Kim, Y. J., Dennen, V. P., Jeong, A. C., et al. (2010). Modeling, assessing, and supporting key competencies within game environments. In D. Ifenthaler, P. Pirnay-Dummer & N. M. Seel (Eds.), Computer-based diagnostics and systematic analysis of knowledge (pp. 281–309). New York, NY: Springer-Verlag. Shute, V., Ventura, M., Bauer, M., & Zapata-Rivera, D. (2009). Melding the power of serious games and embedded assessment to monitor and foster learning: flow and grow. In U. Ritterfeld, M. Cody, & P. Vorderer (Eds.), Serious games: Mechanisms and effects (pp. 295–321). Mahwah, NJ: Routledge, Taylor and Francis. Stenhouse, D. (1986). Conceptual change in science education: Paradigms and language-games. Science Education, 70(4), 413–425. Svihla, V., Vye, N., Brown, M., Phillips, R., Gawel, D., & Bransford, J. (2009). Interactive learning assessments for the 21st century. Education Canada, 49(3). Retrieved from academia.edu.documents.s3.amazonaws.com/ . . . /Svihla_ 2009_Summer_EdCan.pdf Williams, D., Ma, Y., Feist, S., Richard, C. E., & Prejean, L. (2007). The design of an analogical encoding tool for game-based virtual learning environments. British Journal of Educational Technology, 38(3), 429–437. Wilson, M. (2003). The technologies of assessment. Invited Presentation at the AEL National Research Symposium, Toward a National Research Agenda for Improving the Intelligence of Assessment Through Technology. Chicago. Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum Assoc. Wilson, M., Bejar, I., Scalise, K., Templin, J., Wiliam, D., & Torres Irribarra, D. (2010a, May). 21st-century measurement for 21st-century skills. Paper presented at the American Educational Research Association Annual Meeting, Denver, CO.

Creating Innovative Assessment Items and Test Forms    155 Wilson, M., Bejar, I., Scalise, K., Templin, J., Wiliam, D., & Torres Irribarra, D. (2010b, January). Assessment and teaching of 21st century skills: Perspectives on methodological issues. White Paper presented at the Learning and Technology World Forum 2010, London.


Chapter 7

The Conceptual and Scientific Basis for Automated Scoring of Performance Items David M. Williamson1

Introduction

There is a tension in the design of large-scale state assessments between high construct fidelity of tasks and efficiency in design, delivery, scoring, and reporting. For some domains, performance-oriented tasks in which students construct their responses rather than select them from presented options are considered more faithful representations of the construct. However, the relative efficiencies, in both time and cost, for development, administration, and scoring of multiple-choice items are so substantial that relatively few constructed-response items tend to be used in large-scale assessment. The increasing quality of automated scoring systems, which use computer algorithms to score constructed responses, and the growing variety of item types they are capable of scoring facilitate the use of constructed-response items with scoring efficiencies more like those of their multiple-choice counterparts. This chapter is intended as an orientation for practitioners to some opportunities and issues related to automated scoring for constructed-response items, whether innovative (such as computer simulations) or traditional (such as essays or mathematical equations), and is organized into three sections. The first is the conceptual basis for scoring innovative tasks, followed by a general overview of automated scoring methods that are commercially available, and concluding with some comments on the science of scoring.

The Conceptual Basis for Scoring

In design and scoring of performance items it is tempting to begin with a realistic task and then determine how to score it. However, a more effective approach is to define behaviors that are indicative of important distinctions in ability (the scoring) and then devise circumstances (tasks) that tend to elicit these patterns of behavior. That is, rather than thinking about scoring realistic tasks, think about how to design tasks around the parts of the construct for which you want to have observable distinctions, ultimately encoded in the scoring. This echoes Bennett and Bejar's (1998) point that automated scoring of performance tasks is not just about scoring, but that effective scoring is embedded in the context of an assessment design. More innovative assessments have a greater need for a rigorous approach to design to ensure that innovations are in the service of assessment rather than the reverse. In the absence of strong design there is greater risk that innovations may offer no advantage in construct representation or efficiency and simply entail greater cost and effort or even introduce construct-irrelevant variance.

Perhaps surprisingly, automated scoring may place more initial demands on assessment design than human scoring. Programming and/or calibrating automated scoring demands explicit representation of the relevant elements of the response and how they should be valued. This determination should be made prior to or in concert with the design of the tasks. With human scoring it is possible to design tasks with vaguely defined scoring rubrics and rely on the training process for human graders to define the rubric through illustrative examples from examinee submissions rather than explicit definition. Such a practice could result in uncertainty about the construct represented by scores or ambiguous criteria leading to idiosyncratic scoring and poor inter-rater agreement. The risks of such a practice are well known in human scoring, and protocols for best practice (e.g., Baldwin, Fowles, & Livingston, 2008) target the reduction or elimination of such risks. The use of automated scoring may encourage the explicit consideration of scoring criteria that might otherwise be deferred, thus improving the design.

The first step in successful scoring is good design, and the more novel and/or performance-oriented the construct, the more value there is in formalized design methodologies.


These include Assessment Engineering (Luecht, 2007), Berkeley Evaluation and Assessment Research (BEAR) (Wilson & Sloane, 2000), and Evidence Centered Design (Mislevy, Steinberg, & Almond, 2003; Mislevy & Haertel, 2006). Such methods reduce the tendency to move too quickly into task development by prioritizing the evaluation of the construct and linking innovations in task design to the explicit needs of the construct rather than the implicit understanding of the item writers. Effective design privileges the concept of evidence, with assessments collecting sufficient evidence to support conclusions about examinees and both assessment and task scoring incorporating answers to three primary questions of design:

• Evidence of what? Understanding the construct and the knowledge, skills and abilities (KSAs) that constitute performance in the domain is fundamental to assessment design. Similarly, effective scoring of tasks requires explicit representation of the KSAs targeted by each task.
• What evidence is relevant, and of this, achievable? Construct definitions include distinctions among levels of ability in the domain. These construct-level definitions of evidence are the basis for defining similar evidence that can be obtained under the more constrained circumstances of assessment tasks. These evidential distinctions drive the design of tasks that allow examinees to demonstrate such distinctions.
• How do we transform data into evidence, and subsequent action? The stored performances of examinees on tasks are data that, through a combination of logical and statistical mechanisms, are used as evidence of ability. Over the course of an assessment the evidence should be sufficiently compelling to facilitate decision making on the part of score users.

These questions are not restricted to formalized assessment design, but are a core part of the philosophy of educational assessment. For example, the assessment triangle of Pellegrino, Chudowsky, and Glaser (2001) in Figure 7.1 represents educational assessment as a relationship among observation, interpretation, and cognition. Cognitive structures of students allow them to produce observable responses to assessment tasks as representations of their abilities, which are, in turn, interpreted in context to make inferences about the cognition of students. In this cycle the concept of evidence is the implicit core of the triangle. The models of cognition in the domain are based on research using methods (such as cognitive task analyses) that elicit aspects of cognitive processing and/or representations of examinees. This constitutes the evidential basis for cognitive models of performance.


Figure 7.1  Evidence as the core of the assessment triangle.

These models of cognition, in turn, drive the design of tasks that elicit observations interpreted as evidence of distinctions in cognitive ability. The treatment of evidence as the foundation of assessment, and by extension as the driver for scoring, is the basis of Evidence Centered Design (ECD). What follows is a brief orientation to ECD concepts and terminology as a basis for further discussion of scoring concepts for this chapter, with a more complete representation available in Mislevy, Steinberg, and Almond (2003) or Mislevy and Haertel (2006). At the core of ECD are three types of models, with the construction and interaction between them representing the fundamental logic of the assessment design, and by extension a portion of the validity argument. These models are:

• Proficiency Model. The Proficiency Model represents the configuration of examinee knowledge, skills and abilities that represent the construct of interest to be measured.
• Evidence Models. Evidence Models represent the logical and statistical relationships between observations that are made about examinee performance and inferences made about Proficiency Model variables.
• Task Models. Task Models specify the characteristics, both substantive and situational, of assessment tasks that are appropriate for the construct of interest.

These models correspond directly to the three primary questions of evidence referenced above, with the Proficiency Model answering the question “evidence of what?”, the Task Models representing what evidence is achievable, and the Evidence Models specifying the transformation of data into evidence. The specification of these models, and the relationships among them, constitute the rationale for the design of the assessment. The


Figure 7.2  Basic models of evidence centered design.

composition and interaction of these models is illustrated in Figure 7.2, and described briefly below. The leftmost portion of the figure is the Proficiency Model and includes two primary elements, stars and circles, representing different aspects of the model. Stars represent reporting goals, which may be driven by a prospective score report representing examples of what would be presented in a score report, about which examinee(s), and to whom the report would be conveyed, including anticipated actions that the recipient might take on the basis of the report. Multiple stars represent multiple goals, such as individual score reports being provided to teachers, students and parents for their use in educational practice and support as well as aggregate reports provided to school and district administrators to inform policy decisions. In both cases the reports may include summary scores for classification purposes as well as performance feedback highlighting areas for further practice. Circles connected by arrows represent KSAs constituting the construct of interest, with the figure showing a multivariate model (five variables) with hierarchical relationships. Naturally, the number and structure of these KSAs would depend on the assessment and can range from simple univariate models to rich construct elaborations typical of intelligent tutoring systems (see examples later in this chapter). The rightmost portion of the figure are Task Models specifying the design of assessment tasks providing evidence needed to support score reporting. Each of the many task models defines a range of potential tasks that might be generated to target different kinds of evidence or to provide for multiple tasks related to as single type of evidence. Task model concepts are related to automatic item generation (Alves, Gierl, & Lai, 2010; Bejar, 2009), with both concerned with design structures and the impact of variations on construct representation and prediction of statistical performance of items (Enright & Sheehan, 2002). The square in the upper rightmost portion of the Task Models box represents the presentation material an examinee interacts with to understand and complete the task. This encompasses static material such as prompt text and/or dynamic material such as interactive components in a simula-


tion environment, as well as the mechanism for the examinee to respond to the task, whether through selection among multiple choices or through constructed responses. The shapes in the vertical rectangle at the left portion of the Task Model represent saved data from examinee interaction, with elements within the rectangle representing different kinds of saved elements. Some of these might be directly related to the needs for scoring (such as the particular response to a multiple-choice question), while others may be used for other purposes or simply saved for future reference. These might include such commonly saved data as elapsed time on the task, views of supplemental information or instructions, changes to answers, or a multitude of other forms of information that can be tracked and retained in a computer-delivered assessment, especially in interactive simulation-based tasks. The lower right-hand section of the Task Model shows a set of three features represented as though they are drop-down menus in a computer interface to represent the idea that many tasks can be produced from a single Task Model based on the particular options selected. For example, a Task Model for addition might allow for items requiring addition of two single digit integers to form a single digit answer, or having one or more of the integers as double digit. Flexibility could be expanded by allowing more options (from the “drop down” menu of features) in numbers to be added to include fractions, mixed numbers, negative numbers, or decimals. It could also be broadened further to include other operators, such as subtraction, with such a change shifting the Task Model from one of addressing fundamental addition to one addressing basic mathematical operations. These options allow Task Models to be designed in a hierarchical manner, with some being very specific and fine-grained with relatively limited variability of tasks (with more predictable statistical performance and conceptual relevance) and others higher up in the hierarchy being more general in nature and allowing for greater range and numbers of items to be produced. This is true even for multiple-choice items when the range of response options and stimulus material is broadened, such as through word problems and/or through “show your work” responses to mathematics items. Some features and the values they can take on may influence the nature of the evidence collected (e.g., addition vs. subtraction) and/or the difficulty of the item. Others may change the appearance of tasks without changing the nature of the construct or difficulty, such as by replacing one type of fruit for another in the typical “How many apples does Maria have?” kind of task. Some features clearly impact construct and/or difficulty, but the literature also contains examples in which seemingly innocuous changes appear to have some unanticipated effects on item performance, so decision-making about grain-size and characteristics of Task Models warrants attention in design. A more extensive discussion of Task Models and related concepts


can be found in Mislevy, Steinberg, and Almond (2002) and Riconscente, Mislevy, and Hamel (2005). Evidence Models, in the center of Figure 7.2, are central to the concept of Evidence Centered Design and provide the conceptual logic and corresponding statistical mechanisms for making inferences about ability on the basis of task performance. In Figure 7.2 this role is represented as two separate subcomponents: evidence rules and the statistical model. Evidence rules represent the mechanisms by which data are extracted and represented as observable variables, indicated as squares in the Figure 7.2. For multiple-choice items these rules determine whether the examinee response matches the predefined key and represents the outcome as a dichotomous right/wrong variable. In constructed-response tasks or when factors such as response times are used in scoring, these rules can be complex and require research to define. The statistical model updates estimates of Proficiency Model variables on the basis of the observables from evidence rules and can range from familiar number right or item response theory theta estimation procedures to less well-known methods such as Bayesian networks or neural networks. The models of Evidence Centered Design constitute a rationale for the assessment design and correspond to key parts of a validity argument as put forth by Messick (1994). Figure 7.3 illustrates the connections among the assessment triangle, Messick’s views on validity, and ECD as they constitute a common structure for addressing the three primary questions of design. The ECD design process is typically highly iterative but begins with the leftmost portion of Figure 7.2, the Proficiency Model, with the specifica-

Figure 7.3  Intersection of ECD, the assessment triangle, and validity.


tion of claims, which drive proficiency variables and relationships, which identifies evidential needs of the assessment, which in turn establishes elements that must be present in task models to provide required evidence and terminating at the rightmost portion of the figure with implemented tasks. The end result of the design process is a “chain of reasoning” for the assessment design, as illustrated in Figure 7.4. While the design process tends to flow in iterative steps from left to the right in Figures 7.2 and 7.4, the scoring is in the reverse order as illustrated in Figure 7.5, with scoring consisting of two stages: evidence identification and evidence accumulation. The evidence identification process transforms task performance data into scored elements called observables, represented by squares. The evidence accumulation process applies statistical model(s) to update estimates of proficiency based on the task performance. Further discussion of this process of using design hypotheses to transform data into evidence of ability can be found in Mislevy, Steinberg, Almond, and Lukas, 2006.

Figure 7.4  Illustration of a portion of the chain of reasoning from evidence centered design.

Figure 7.5  The Scoring process with evidence centered design.
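To make the two stages concrete, here is a minimal sketch, in Python, of how evidence identification and evidence accumulation might be kept separate in code. It is illustrative only and not drawn from this chapter or any operational ECD system: the evidence rule, the observable names, the assumed item difficulties, and the discrete grid standing in for the proficiency variable are all invented for the example.

```python
import numpy as np

# --- Evidence identification: apply evidence rules to raw task data ---------
def identify_evidence(work_product):
    """Turn a raw task log (hypothetical fields) into scored observables."""
    return {
        "correct": int(work_product["response"] == work_product["key"]),
        "efficient": int(work_product["actions_used"] <= work_product["actions_allowed"]),
    }

# --- Evidence accumulation: update a proficiency estimate from observables --
THETA_GRID = np.linspace(-3, 3, 61)        # discrete grid standing in for theta
prior = np.exp(-0.5 * THETA_GRID ** 2)     # standard normal prior, unnormalized
prior /= prior.sum()

def p_positive(theta, difficulty):
    """Rasch-type probability of observing a 1 on a dichotomous observable."""
    return 1.0 / (1.0 + np.exp(-(theta - difficulty)))

def accumulate(posterior, observable, difficulty):
    """Bayes update of the discrete posterior for one dichotomous observable."""
    p = p_positive(THETA_GRID, difficulty)
    likelihood = p if observable == 1 else 1.0 - p
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# One simulated task; both observables are assumed to load on the same theta.
log = {"response": "B", "key": "B", "actions_used": 4, "actions_allowed": 6}
observables = identify_evidence(log)
posterior = prior.copy()
for name, difficulty in [("correct", 0.0), ("efficient", 0.5)]:
    posterior = accumulate(posterior, observables[name], difficulty)

print("EAP estimate of theta:", round(float((THETA_GRID * posterior).sum()), 3))
```

In an operational program the evidence rules would be far richer and the accumulation step would use a calibrated measurement model, but the division of labor between the two stages is the point of the sketch.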


Regardless of which methodology is used in design, the better the design, the easier it is to implement appropriate scoring for innovative items. The following provides some examples of different design structures and the implications for the demands on scoring. Figure 7.6 illustrates a simple design in which the left-hand panel shows the item-level relationship between a single proficiency of interest (theta), represented by a circle, and a single item, represented by a square, with the direct connection showing how the item response provides one piece of information updating the estimate of proficiency. The right-hand panel shows the test level perspective, with multiple items constituting a form of the test and all items providing a single piece of information used to update a single proficiency estimate. This univariate model with single linkage between items and proficiency estimates are common and can be scored by a multitude of methods from number right to IRT, including adaptive models. The left-hand panel of Figure 7.7 represents a more complicated model in which multiple scored observables from a single item link to a single proficiency variable. The three scored observables are not entirely independent, as performance may be related to proficiency but may also be related to the fact that all three observables come from a common task context. A common example is when multiple reading comprehension items are based on a single reading passage, but performance may be influenced by prior familiarity with the topic of the reading passage and/or engagement with the content rather than from reading comprehension alone. Other types of such dependency occur in simulation-based tasks in which the context effects from an interactive environment (and potential navigation issues within that environment) may be stronger than what is traditionally observed for reading comprehension items. Further complications arise when there is an explicit dependency between task observables, such as when a simulation-based assessment requires an examinee to locate some information in a web search (X1), evaluate the appropriateness and accuracy of the information source (X2), and then apply the information to solve

Figure 7.6  Example of univariate with single linkage.


Figure 7.7  Example of univariate with conditional dependence.

a problem (X3). In this example there is an explicit dependency such that X3 cannot be solved without first solving both X1 and X2 successfully and the model may require explicit representation of the dependency between observables rather than a generalized context effect. The corresponding right-hand panel represents the test level relationship between proficiency and the observables from multiple tasks with conditional dependence among observables within a task. The scoring is challenged in whether and how to model conditional dependence among observables and the consequences for accuracy of ability estimates. Testlet response theory (Wainer & Kiely, 1987; Wainer et al., 2006) is one approach to scoring items with conditional dependence. Another design with scoring complications is illustrated in Figure 7.8, where the left-hand panel is identical to the simple model of Figure 7.6, but the right-hand panel shows that three separate proficiencies are being measured. The one-to-one relationship between observables and proficiency variables avoids complications that would arise from more complex associations. The primary complication of this model is the potential for induced dependencies between the proficiency variables, represented by the lines between them, and the corresponding question of whether these can be ignored or must be explicitly modeled in the scoring. For example, if ∅1 and ∅2 were reading and vocabulary, there may be a stronger motivation to formalize the relationship between them than if they were reading and mathematics. This situation is not uncommon and simple sets of multivariate measures like those pictured may be less challenging to score than some variations on Figure 7.7 with dependencies among observables, especially if such dependent observables inform multiple proficiencies.
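Returning briefly to the conditional dependence case above: one common way to write such a model is the two-parameter testlet formulation from the testlet response theory literature cited earlier. The particular parameterization below is a standard one and is not reproduced from this chapter; it adds a person-by-task effect that absorbs the shared context:

\[
P\bigl(X_{ij} = 1 \mid \theta_j, \gamma_{j d(i)}\bigr) = \frac{\exp\!\left[a_i\bigl(\theta_j - b_i - \gamma_{j d(i)}\bigr)\right]}{1 + \exp\!\left[a_i\bigl(\theta_j - b_i - \gamma_{j d(i)}\bigr)\right]},
\]

where d(i) indexes the task (testlet) containing observable i and \(\gamma_{j d(i)}\) is examinee j's task-specific effect. Setting all \(\gamma\) terms to zero recovers the usual two-parameter model with locally independent observables, so the estimated variance of the \(\gamma\) effects is one gauge of how much the shared context matters.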


An alternative to the implicit relationships among proficiency variables in Figure 7.8 is an explicit representation through the tree-like structure of Figure 7.9. In this example the test level model is now explicitly structured so that ∅1 and ∅2 are more fine-grained representations of a more general ability represented by ∅3. For example, if ∅1 and ∅2 are reading and mathematics abilities then ∅3 might be defined as an integrated problem solving ability and the tasks X3,1 and X3,2 might be word problems that require both reading and mathematical ability to solve. While this approach addresses some challenges, it also induces scoring challenges for ∅3 in combining direct evidence from task performance with indirect evidence from ∅1 and ∅2. A design like Figure 7.10 illustrates additional scoring challenges, beginning with the left-hand panel for item level scoring showing multiple ob-

Figure 7.8  Example of multivariate with single linkage.

Figure 7.9  Example of multivariate with tree structure.


Figure 7.10  Example of multivariate with conditional dependence and multiple linkages.

servables from a single task informing multiple proficiencies. There is still one-to-one linkage between the observable and the proficiencies, but the context effect of the common task is no longer isolated to a single proficiency and so must somehow be shared among multiple proficiencies. A further complication is illustrated in the right-hand panel for test level modeling in which there are not only multiple instances of context effects relating to multiple proficiencies, but also single observables informing multiple proficiencies (e.g., X4 informing ∅3 and ∅4). An example is when a mathematics word problem informs both reading and mathematics proficiency variables rather than an integrated problem solving proficiency. The combination of context effects and multiple linkages to proficiency variables is illustrated in task Ti, which represents a scoring challenge. Perhaps the most complex design is one in which all of the scoring challenges discussed above are combined in a single model: multivariate proficiency model with a tree-like structure, multiple observables that have direct or induced dependencies from context effects, and multiple linkages between observables and proficiency variables. A situation very close to this, but without the conditional dependence among observables, is illustrated in Figure 7.11. Note that this figure differs from convention of earlier examples in that observables in boxes do not represent conditionally dependent observables and are grouped solely to simplify the number of lines between observables and proficiency variables. A subset of Figure 7.11, for observables 9, 14 and 16, is provided as Figure 7.12 to show what the more explicit representation would be if observables connected to the same proficiency variables were not grouped in Figure 7.11. Due to the scoring challenges for such complex models, they are uncommon in assessment but are not atypical of belief


Figure 7.11  Example of complex multivariate structure with multiple linkages.

Figure 7.12  Expanded view of a subset of observables from Figure 7.11.

about performance in domains of practice. As a case in point, the model of Figure 7.11 is not hypothetical but is the model of Mislevy (1995) for performance on mixed-number subtraction items from Tatsuoka (1987, 1990). The examples in Figures 7.6 through 7.11 illustrate progressively increasing degrees of complexity in design and related scoring. While the examples and discussion have referenced test-level complexities of scoring, the same considerations and issues hold true at the item level, particularly when scoring non-traditional items or simulations that have multiple observables or complex relationships among the observables and the proficiencies. That is, inno-


vative items, especially performance-based items, may have many observables that can be scored within a single performance task, and these may relate to multiple proficiencies that influence performance, both through single-linkage relationships and multiple-linkage relationships to multiple proficiencies. The models of relationships for such tasks look much more like the test-level panels of Figures 7.6 through 7.11 than the item-level panels.

The design of tasks, especially innovative performance-oriented tasks, has substantial implications for the options, constraints, and challenges of scoring. This section focuses on the conceptual basis for scoring with the belief that good scoring begins with good design that defines proficiencies to support score use, specifies relevant evidence for making distinctions among ability levels, and structures tasks that elicit targeted distinctions in behavior. A formal design methodology encourages good scoring by representing and moderating the relationship between design and scoring, with multiple examples provided to illustrate the interplay between design decisions and scoring implications. The next section moves into some further considerations about commercial and custom automated scoring that may help satisfy the needs of an assessment design.

Automated Scoring Methods

This section provides some considerations in applying automated scoring systems for assessment designs that call for constructed-response tasks. These include the value and challenges of automated scoring, some currently available commercial systems for automated scoring of common tasks, and the construction of customized scoring for innovative tasks.

Automated scoring offers the potential for the construct fidelity of constructed-response tasks, which educators may consider richer and more highly valued than multiple choice, with the efficiency and consistency of multiple choice. Potential strengths of automated scoring include scoring at a level of detail, precision, and rigor that human graders have difficulty maintaining, having complete consistency and objectivity, and being fully transparent and tractable in the justification of scores. This latter aspect of automated scoring not only facilitates explanation of scores, but also the examination, criticism, and refinement of the scoring algorithms. Automated scoring is also highly efficient and allows for faster score reporting, lower costs of scoring, and the elimination of the effort and cost associated with scheduling and coordination of human raters. These advantages permit the use of constructed-response items where the costs and challenges of human scoring previously made use of such items infeasible. Automated scoring may also allow for performance feedback that is more extensive and explicit than could be provided by human raters in large-scale assessment.


Automated scoring is not without challenges in the same areas of quality, efficiency and construct representation that constitute advantages. There are still a number of aspects of scoring that humans can do better, sometimes dramatically so, compared to the current state of the art automated scoring systems. Further, the consistency of automated scoring can be a liability when some element of scoring is wrong, leading to potential systematic bias in the scoring based on the inadequacy of some part of the scoring algorithm. Since many such automated scoring systems are based on anticipated response patterns and modeling of aggregate data, they may not handle some unusual responses well, particularly those that represent unanticipated but appropriate responses. Use of automated scoring is inexpensive, but the cost of development can be high and take long periods of time. Once completed, the way in which automated scoring systems score responses may be somewhat different from what human raters actually do, although the question of what human graders actually do when scoring is less well known than would be desired (Bejar, Williamson, & Mislevy, 2006). Finally, the collective “résumés” of human raters’ background and experiences in the domain may convey a certain confidence in human scores even without a tractable rationale for individual scores, including allowances for disagreement among qualified professionals, while automated scoring has no such inherent credentials beyond the rationale of the scoring algorithm and any research to support the quality of scores. Automated scoring systems can be classified as either response-type or custom systems. The response-type systems are more readily commercialized because they are generalizable in scoring a type of response regardless of the population, assessment or domain and include such examples as automated scoring of essays, of mathematical equations, or of spoken fluency. Considerations in the use of such systems for assessment of common core standards is provided in Williamson et al. (2010). By contrast, custom systems are characterized by their specificity to a particular domain, and often to a particular task or small set of tasks, due to being programmed with custom-coded decision rules for scoring responses to a particular type of task. Examples include scoring algorithms for computerized simulations of troubleshooting computer networks or operating flight controls. The following provides an overview of commercially available response-type systems followed by some comments on the development of custom scoring systems and references to examples of successful systems in the literature. The most widely available and frequently used response-type system is for automated scoring of essays (Shermis & Burstein, 2003). Originally developed more than 45 years ago (Page, 1966), there are now no fewer than 12 commercially available automated scoring systems for essays. Of these the four most widely known are Intelligent Essay AssessorTM (Landauer, Laham, & Foltz, 2003), e-rater® (Burstein, 2003; Attali & Burstein, 2006), IntelliMet-


ricTM (Elliot, 2003; Rudner, Garcia, & Welch, 2006), and Project Essay Grade (Page, 1966, 1968, 2003). Each of these systems has a number of commonalities as well as some differences in how they approach the scoring of essays. While a complete representation of their similarities and distinctions is beyond the scope of this chapter, this section provides a general comparison. The four automated scoring systems for essays have in common that they rely on a variety of scoring features that are computer-identifiable and relevant to the construct of writing. Some features may be direct aligned with the construct of writing, such as spelling errors, proper punctuation, or proper use of collocations, while others are proxies that may be relevant to the construct of writing but do not quite represent elements “worth teaching to,” such as average length of words used in an essay. The major systems use many such features that have been independently developed such that the particular features and how they are computed and used within each system varies substantially. For example, both Intelligent Essay Assessor (IEA) and e-rater have features designed to measure the content of an essay, but the former uses latent semantic analysis as the method to summarize content representation in an essay, while e-rater uses a conceptually similar but statistically different methodology called content vector analysis. As such, it does take some effort on the part of the consumer, and some transparency on the part of the vendor to understand the difference and similarities among these scoring systems. All four automated scoring systems use statistical methods to aggregate features into summary scores, though they differ in which methods they employ and how they are implemented. Several use variations of multiple regression, while IntelliMetric uses classification techniques related to neural networks, with all four systems calibrating these statistical models on sets of human-generated scores. These calibrations have generally targeted scoring traditional academic essays, written under timed conditions, primarily for fluency and general aspects of content in the essay as a whole rather than for explicit accuracy of particular content statements in the essay. As such, they are typically used for essay prompts that are more like “What did you do on vacation last summer” than “Name and describe the function of each structure in a cell, then identify the unique structures of plant and animal cells and the function they serve.” The latter prompt is likely to be seeking a degree of specificity and accuracy of information that is more difficult for automated scoring systems to correctly score. There are several products that use these automated scoring systems as part of educational, practice and placement decisions. These include the WriteToLearnTM system, driven by IEA, the MyAccess! TM System based on IntelliMetric scoring, and the CriterionTM service using the e-rater scoring system. Each of these tools uses both the summary scoring capability of the automated scoring systems and the scores on individual and aggregated sets


of features to provide performance-specific feedback on their writing to the students and teachers who are working with them in the classroom. In more consequential assessments, the first large-scale admissions test to use automated scoring of essays was the GMAT® (Graduate Management Admission Test), used for admissions to graduate business schools, which began use of e-rater in 1999 and then IntelliMetric as part of an overall vendor transition in 2006. The e-rater system was deployed for the GRE® (Graduate Record Examination), used for admissions to graduate schools, in 2008 followed by the TOEFL® (Test of English as a Foreign Language), a test of English proficiency used in admissions decisions for international students to English language educational institutions. The TOEFL program began using e-rater to score the independent essay, in which the examinee responds to a traditional prompt type in 2009 and the integrated essay, calling for integration of information from spoken and reading sources, in 2010. The PTE (Pearson Test of English), similar in purpose to TOEFL, has used IEA to score essays since 2009 (Pearson, 2009). The traditional model for scoring essays in large scale admissions tests is to use two human raters and to adjudicate cases in which they disagree beyond a reasonable threshold. Similarly, in most of the uses of automated scores for admissions testing cited above the automated score is used as one of the two scores and a human score the other, with any discrepancies beyond an agreement threshold subjected to adjudication from other human raters. The exception is the PTE, which uses only IEA automated scores. The successful use of automated scoring of essays in multiple assessments that have consequences for the examinee attests to the strengths of these systems, including the ability to score academic essays administered under timed conditions and that emphasize fluency as the basis for scoring. Multiple empirical studies demonstrate the correspondence between the automated scores and human scores for such essays, showing that automated scores tend to agree as often or even more often with a single human rater than two human raters do with each other. A further strength of such systems is the ability to provide performance feedback related to fluency quickly and effectively, thus providing for learning and practice tools described above. However, the general state-of-the-art of automated scoring is such that weaknesses of these systems persist, despite their proven utility for certain applications. Such limitations include little or no ability to evaluate the explicit accuracy of content or to identify and represent such elements of writing as audience, rhetorical style or the quality of creative or literary writing. The extent to which automated scoring systems are able to identify and “understand” content is notably simplistic compared to their human counterparts and therefore works best at the aggregate concept level rather than at the level of explicit detail with an essay. Other limitations include the potential vulnerability to score manipulation, which depends on the


features used in the scoring and how they are aggregated to a summary score. It is still unclear how much might be gained by an examinee with low writing ability but substantial insight into the scoring algorithm who purposely manipulates the response to take advantage of the algorithm, and how different this might be from similar impact for human scores. Also, despite numerous scoring features these systems still do not detect every type of writing error, nor do they classify the errors they are designed to detect with 100% accuracy. The scoring features also tend to focus on identifying errors rather than recognition of strengths, except as defined by absence of error. Finally, there are also some indications of unexplained differences in agreement between human and automated scores based on demographic variables that have yet to be explored to fully understand whether this may be strength or weakness of automated scoring methods (Bridgeman, Trapani & Attali, in press). The state-of-the-art for automated scoring of essays has advanced substantially in the past 45 years, and these advances, paired with successful implementations in a variety of learning and admissions contexts, form the basis for some enthusiasm about their use. However, this enthusiasm must also be tempered by an appropriate sense of caution and critical evaluation given the current limitations of such systems. Further, all automated scoring systems for essays rely on computer-entered text and cannot score handwritten responses, even if digitally imaged. Although automated scoring systems for essays do not score for the explicit correctness of content, other automated scoring systems are designed to score short text responses for the correctness of information, regardless of the writing quality. An example of such an item in the biology content area is provided as Figure 7.13. These systems include Automark (Mitchell, Russell, Broomhead, & Aldridge, 2002), c-rater (Leacock & Chodorow, 2003; Sukkarieh & Bolge, 2008), and Oxford-UCLES (Sukkarieh, Pulman,

Figure 7.13  Example item automatically scored for correctness of content.


& Raikes, 2003; Sukkarieh & Pulman, 2005). Of these, only c-rater is known to have been deployed commercially, with this application being for state graduation assessment.

A fully correct answer for the item in Figure 7.13 would include any two of the following concepts, however expressed in the box provided for free-text entry of the response:

• Sweating
• Increased breathing rate
• Decreased digestion
• Increased circulation rate
• Dilation of blood vessels in skin
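As a toy illustration of the matching problem discussed below, the following Python sketch checks a free-text response against simple paraphrase patterns for each concept and applies the two-point rubric described in the next paragraph. The patterns are invented for the example; operational systems such as c-rater rely on much deeper natural language processing and synonym handling rather than keyword matching.

```python
import re

# Hypothetical paraphrase patterns for each scoreable concept (illustrative only)
CONCEPT_PATTERNS = {
    "sweating": [r"\bsweat"],
    "breathing": [r"breath\w* (rate )?(goes up|increas)", r"faster breathing"],
    "digestion": [r"digest\w* (slows|decreas|shuts)"],
    "circulation": [r"(heart|circulat\w*|blood) .*(faster|increas|pump)"],
    "vasodilation": [r"dilat", r"flush", r"turn(s|ing)? red", r"blood .*skin"],
}

def score_response(text, max_points=2):
    """Count distinct concepts matched in the response, capped at the max points."""
    text = text.lower()
    matched = {
        concept
        for concept, patterns in CONCEPT_PATTERNS.items()
        if any(re.search(p, text) for p in patterns)
    }
    return min(len(matched), max_points), sorted(matched)

print(score_response("Your heart pumps blood faster and you start to sweat."))
# (2, ['circulation', 'sweating'])
```

Even this toy version makes the design point visible: someone has to decide, in advance, which paraphrases (such as "more blood pumping") deserve credit.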

The challenge for automated scoring is to recognize and appropriately classify the various ways in which these concepts might be correctly expressed. For example, dilation of blood vessels in skin might be expressed as “increased blood flow,” as “becoming flushed,” as “turning red,” or even “more blood pumping.” The scoring system must be able to identify each of potentially many correct answers, as well as whether to give credit for certain borderline responses (such as “more blood pumping”). Automated scoring systems use a combination of scoring keys for expected response patterns, a synonym engine, and other natural language processing systems designed to determine equivalent expressions to score responses. The example item is worth two points if two or more concepts are identified, one point if only one concept is identified, and zero points for failure to identify at least one concept, although other scoring models are also possible. Automated scoring of correct responses is a complement to automated scoring of essays, with essay scoring systems emphasizing fluency more than content and these emphasizing correct content rather than expression. Compared with automated scoring of essays, there is relatively little published about automated scoring for correct answers, but from what is available and from direct experience there are some known strengths and limitations. Strengths stem from research demonstrating that the agreement with human scores can be on par with that of independent human graders. Also, experience suggests that targeting the use of automated content scoring necessitates an emphasis on principles of good item design, particularly the anticipation of the range of responses that might result and how to classify them as correct or incorrect, that may be reduced when using human scoring. One limitation is that despite practicing good item design, the automated scoring may not match the rate of agreement between independent human raters for a particular item, due in part to the difficulty in predicting how many variations of a correct answer might be submitted by students. Further, unlike the scoring of essays, in which reasonable experts may dis-


agree, there is a higher rate of agreement between human raters and therefore a higher baseline standard for automated scoring systems to meet to be comparable with human scoring. Another difference from automated scoring of essays is that distributions of differences between human and automated scores tend to be more directional rather than equally distributed, so that there might be systematically lower or higher scores from the automated scoring than from the human scores. Finally, the model building efforts for each item can be more labor intensive and additional controls on item production and rubric generation can be a challenge for test developers accustomed to permitting ambiguity in the scoring rubric to be addressed during grader training. Automated scoring of mathematical responses, including equations, graphs and plots, geometric figures and of course numeric responses, is another generalizable method that is commercially available. As there are many such systems, the interested reader is referred to Steinhaus (2008) for a more complete inventory of available systems. However, two of these systems exemplify current uses: Maple TATM for classroom learning, practice, and placement and m-raterTM (Singley & Bennett, 1998) for assessment, with past application in state assessment for graduation. There are strengths and limitations of automated scoring of mathematical responses, just as there are in the other systems. One strength is that the empirical nature of the domain contributes to performance that typically exceeds that of human raters while being flexible enough to compute the mathematical equivalence of unanticipated representations, such as unusual forms of an equation that are mathematically accurate. They can also provide partial credit scoring and performance feedback indicating correct/incorrect aspects of the response. Limitations of automated scoring for mathematical responses include challenges with the interface used to complete the responses. Examinees may be challenged by the interface for equation editors if they are not already familiar with them, as well as in drawing geometric figures. Tasks requiring the examinee to show their work are also difficult because of interface differences from paper and pencil. Finally, these systems are still largely unable to score responses that mix text and equations. Automated scoring systems for spoken responses are another type of commercially available system and can be classified as systems designed to score predictable responses and those designed to score unpredictable responses. Predictable responses include read-aloud prompts in which the examinee reads pre-existing texts as well as prompts that are less explicitly predictable, but still induce a relatively narrow range of words in a response, such as describing a picture or providing directions between two known points. In the latter examples the precise wording is not known in advance but the general range of words that could be used to successfully complete


the task are highly predictable. Unpredictable responses are those in which there is very little predictability in what the examinee might say, such as “If you could go anywhere in the world where would it be and why?” There are multiple commercial systems for predictable responses, and many of these are deployed as practice or learning systems rather than assessment systems, such as the Rosetta Stone® and TELL ME MORE® commercial products for learning a foreign language. Relatively few are designed for assessment, but two examples include VersantTM (Bernstein, De Jong, Pisoni, & Townshend, 2000) and Carnegie Speech AssessmentTM. The latter is the basis for several assessments of English language for industrial sectors including healthcare, aviation and government employment. The Versant system is the basis for scoring the PhonePass SET-10 assessment of English language proficiency for employment as well as the spoken responses for the Pearson Test of English used for admissions to English language higher education institutions. The SpeechRater (Zechner, Higgins, Xi & Williamson, 2009) system is designed to score unpredictable speech and is used to score spoken responses for the TOEFL Practice Online, a practice and preparation assessment for TOEFL. Strengths and limitations of automated scoring of spoken responses include notable strengths in the accuracy of scoring predictable speech, particularly for native speakers. For non-native speakers the accuracy of scoring predictable speech is also high enough to be comparable with human scores. However, for unpredictable speech the automated speech recognition (ASR) for non-native speakers is notably lower than for native speakers, resulting in score agreement between automated and human scores that is lower than typical agreement among human raters. However, the basic word recognition accuracy of ASR for non-native speakers continues to improve and may improve performance for unpredictable speech substantially in the near term. The calibration of scoring models for both predictable and unpredictable speech is data intensive, requiring large data sets of speech samples and often presenting challenges in securing a sufficient range of speech samples from certain native languages or accents to ensure high-quality models for non-native speakers. Finally, the ability to score content in unpredictable spoken responses is limited due to the ASR quality for non-native speakers. Some assessment designs may not lend themselves to using or adapting off-the-shelf systems like those above, calling instead for development of customized automated scoring. This section provides a (oversimplified) perspective on major steps in building custom automated scoring, followed by some references to examples of successful practice. Discussion presumes a design consistent with earlier sections as a prerequisite for good scoring, so that innovations in item development will feed the design, rather than the reverse. Such a design would specify relevant features of automated


scoring from the evidence identification process, which would then be subjected to reviews and empirical evaluation to support appropriateness and accuracy of features. These features are aggregated into one or more scores for the task using any of a multitude of statistical methods ranging from the mundane, such as number right and weighted counts, to the relatively exotic, such as Kohonen networks and support vector machines. The following represent a few such methods that have been used in prior efforts:

• Number-right
• Kohonen networks
• Item response theory (univariate and multivariate)
• Weighted counts
• Cluster analysis
• Regression
• Factor analytic methods
• Neural networks
• Rule-based classification
• Support vector machines
• Classification and regression trees
• Rule-space
• Bayesian networks
• Latent class models
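To ground the more familiar end of this list, here is a minimal sketch of regression-based aggregation: automatically extracted features are calibrated against human scores, and the fitted weights are then applied to new responses. This is the general approach several of the essay engines described earlier take; the feature names and numbers below are purely illustrative.

```python
import numpy as np

# Hypothetical feature matrix: rows are calibration essays, columns are
# automatically extracted features (e.g., grammar errors, word variety, length).
features = np.array([
    [2.0, 0.61, 180.0],
    [7.0, 0.42, 95.0],
    [1.0, 0.70, 240.0],
    [4.0, 0.55, 150.0],
    [9.0, 0.35, 60.0],
])
human_scores = np.array([5.0, 2.0, 6.0, 4.0, 1.0])   # scores from trained raters

# Fit weights by ordinary least squares (intercept added as a column of ones).
X = np.column_stack([np.ones(len(features)), features])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

def automated_score(feature_vector):
    """Apply the calibrated weights to a new essay's feature vector."""
    return float(weights @ np.concatenate([[1.0], feature_vector]))

print(round(automated_score(np.array([3.0, 0.58, 170.0])), 2))
```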

While there are many ways to determine scores from features, the methods must match the intent of the design. The reader interested in an extended discussion of the stages of item scoring, from features to scores, for innovative items may refer to Braun, Bejar & Williamson (2006). While custom automated scoring systems from prior efforts may be domain-specific and not readily generalizable, the development procedures and methods used for scoring may serve as models for other efforts. The following provides examples of successful custom automated scoring representing a range of goals, constraints and domains of interest, including scoring graphical responses for architectural designs, interactions with simulated patients by physicians, solving applied problems in accounting, assessing information problem solving skills in digital environments, and troubleshooting computer networks. In the domain of architectural design, the Architect Registration Examination (ARE), an assessment developed by the National Council of Architectural Registration Boards (NCARB) and used by states for the licensure of architects, uses computer aided design (CAD) software for examinees to construct architectural designs that meet certain professional requirements. The assessment consists of multiple independent design tasks that require examinees to design such elements as parking lots, building interi-


ors, site layouts, and other architectural plans that meet professional standards. The scoring of responses is accomplished through a combination of algorithmically computed features and a rule-based aggregation of these features into summary pass/fail decisions. Algorithms are designed to score elements of a solution such as water flow across a site, presence of natural light, or the slope of a roof, and are only generalizable to other tasks in the assessment if those tasks also score those elements of the design. A distinguishing characteristic of the ARE is the use of isomorphs: tasks designed to tap the same KSAs using the same scoring algorithms, but appearing to be different tasks. This approach was also adopted for the Graduate Record Examination® as variants of essay prompts (Bridgeman, Trapani, & Bivens-Tatum, 2009). The ARE also differentially weights features, with some representing such notable health, safety, and welfare issues (e.g., violation of fire safety standards by having no means of secondary egress for major parts of the structure if the primary entrance/exit is ablaze) that unsatisfactory feature performance results in an unsatisfactory task evaluation regardless of the quality of other features. Feature and task scores also incorporate a range of scores between the satisfactory and unsatisfactory range representing uncertainty in whether the response was satisfactory (borderline response). Feature scores are combined into task scores using combination rules that differentially weight features, and these are in turn combined using rules that differentially weight tasks to determine the section score. Braun, Bejar, and Williamson (2006) provide a more in-depth description of scoring and Williamson, Bejar, and Hone (1999) provide an evaluation of the quality of automated scoring, while Sinharay, Johnson, and Williamson (2003) summarize calibrations of task families based on isomorphic variations.

Another potential model for development of custom automated scoring is in the domain of medical licensure, in which the National Board of Medical Examiners (NBME) uses simulated patient tasks as part of the United States Medical Licensing Examination™ required for physician licensure in the United States (Margolis & Clauser, 2006). The examinee is presented with a simulated patient presenting symptoms of a medical problem and can interact with the patient to diagnose and treat the condition. A variety of actions are possible, including questioning, medical tests, monitoring, and treatment. The scoring algorithms for features consider multiple components, including the medical necessity of procedures, the procedures ordered by the examinee and the extent to which those procedures are invasive and/or carry risks, the types of patient monitoring and observation undertaken, and the timeframes in which procedures and monitoring take place. A regression-based method uses these features to determine the score for the case, though rule-based scoring was also evaluated as a competing scoring model.
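A sketch of the kind of rule-based aggregation just described, with differentially weighted features, a borderline range, and a "fatal" health-and-safety rule of the sort used in the architecture example, is given below. The feature names, weights, and cut points are invented for illustration and do not reproduce any program's actual rules.

```python
# Feature evaluations use a three-level scale to allow for borderline responses.
SAT, BORDERLINE, UNSAT = 2, 1, 0

# Hypothetical features for one design task: (name, weight, critical?)
FEATURES = [
    ("secondary_egress", 3.0, True),    # health/safety: failing this is fatal
    ("site_drainage",    2.0, False),
    ("natural_light",    1.0, False),
    ("roof_slope",       1.0, False),
]

def score_task(evaluations, sat_cut=0.75, unsat_cut=0.50):
    """Combine weighted feature evaluations into a task-level judgment."""
    # Rule 1: an unsatisfactory critical feature makes the task unsatisfactory.
    for name, _, critical in FEATURES:
        if critical and evaluations[name] == UNSAT:
            return "unsatisfactory"
    # Rule 2: otherwise compare the weighted proportion of credit to cut points.
    total = sum(weight * SAT for _, weight, _ in FEATURES)
    earned = sum(weight * evaluations[name] for name, weight, _ in FEATURES)
    ratio = earned / total
    if ratio >= sat_cut:
        return "satisfactory"
    return "borderline" if ratio >= unsat_cut else "unsatisfactory"

print(score_task({"secondary_egress": SAT, "site_drainage": BORDERLINE,
                  "natural_light": SAT, "roof_slope": UNSAT}))
# prints "borderline" for this hypothetical response
```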


A third example from certification and licensure testing is in the domain of accounting, with the Uniform CPA Examination®, developed and administered by the American Institute of Certified Public Accountants® (AICPA), requiring examinees to apply accounting knowledge and skills in simulations that ask them to solve naturalistic accounting problems (DeVore, 2002). Tasks call for examinees to use knowledge and reference materials to identify and address accounting issues from documents and spreadsheets of a fictional company. The algorithmic scoring of features is primarily logic- and rule-based, with a set of aggregation rules applied to determine how well the overall task was performed and to issue a summary score.

The iSkills™ assessment measures the ability to identify and solve information problems in digital environments to help ensure that higher education students can appropriately and responsibly use digital technologies to solve information problems. The assessment consists of simulations requiring students to perform such tasks as recognizing information needs, finding sources of information to solve problems, evaluating the credibility of information sources, and appropriately using and citing information sources. Features are algorithmically scored based on task outcomes as well as the sequences of actions that students undertake to solve problems, and these feature scores are aggregated into task scores using number-right scoring (Katz & Smith-Macklin, 2007).

A final example combining ECD and scoring of innovative simulation-based tasks is the NetPASS system for assessing computer networking skill in a learning environment. NetPASS was developed by the Cisco Learning Institute for the Cisco Networking Academy Program to help develop the skills of computer networking engineers. The assessment uses simulated networks and diagram tools to assess network design, implementation, and network troubleshooting. Feature scores are based on multiple components, some of which are driven by statistical natural language processing of network commands, some by rule-based methods, and some through number-right counts. The scoring for overall task performance is based on Bayesian networks. Several sources provide more information on the goals, design, implementation, and scoring of the assessment, including Behrens, Mislevy, Bauer, Williamson, and Levy (2004), Williamson, Bauer, Steinberg, Mislevy, and Behrens (2004), DeMark and Behrens (2004), and Levy and Mislevy (2004), all of which appear in a special issue of the International Journal of Testing devoted to the NetPASS assessment.

This section provides an overview of automated scoring methods illustrating the variety of tools available for scoring computer-delivered tasks, whether by commercial vendors or custom designed. It continues the theme that scoring tools should be chosen based on the needs specified in the design. Where such needs are present, there are a number of commercially available automated scoring systems designed to score several different kinds of responses, though these should be evaluated for their fit with the design.


Innovative items might incorporate aspects of commercial scoring systems or may require customized scoring, but such innovations are best driven by good design, and innovative items do not necessarily require innovative scoring. When custom tasks and automated scoring are targeted, there are a number of prior efforts that may serve as models for new efforts.

The Science of Scoring

The previous sections emphasize the design process and some options for automated scoring to satisfy the design. This section builds on them with a discussion of the scientific basis of scoring, followed by some observations on pitfalls to avoid in scientific inquiry into scoring. In order to be considered science, the following principles must drive our efforts:

1. A theory about the natural world is capable of explaining and predicting phenomena.
2. The theory must be subject to support or refutation through empirical testing of hypotheses.
3. Theories must be modified or abandoned in favor of competing theories based on the outcomes of this empirical testing of hypotheses.

In applying these principles to scoring as a scientific endeavor, they might take a somewhat different form. First, the design of an innovative task and its corresponding scoring constitutes a theory of performance in the domain, which, if it is a good theory, must be capable of both explaining and predicting phenomena. On the basis of this theory, experimentation in the form of pilot testing and other administrations allows for support or falsification of hypotheses regarding how examinees with certain knowledge or ability would behave, thus distinguishing them from examinees of different knowledge or ability. Item designs and/or scoring should be modified or abandoned as a result, leading to better items and scoring through iterative stages of hypothesis generation, evaluation, and modification.

The first portion of this scientific perspective on scoring rests on the idea of scoring as part of a coherent theory of assessment for a construct of interest. This is the premise of the first section of this chapter, with assessment design specifying a theory of performance consistent with a domain of interest and targeted decision-making. A central question for scoring systems is the construct representation of the scoring within the design. That is, when dealing with an automated scoring system, what does the score mean? This comprises the theory-as-explanation portion of good theory. How complete is the representation of the construct embodied in the score?


To what extent are the portions of the construct represented in the score direct measures as opposed to proxies? How does the score differentially value or weight different elements of the construct of interest to determine the score? Of course, these same questions can be posed for human scoring as well, as there is still considerable uncertainty regarding exactly what human raters attend to under operational human scoring involving judgment (e.g., essays), particularly given halo effects, central tendency effects, sequential dependencies, attentiveness and other construct-irrelevant variations that occur both within and between human raters. Whenever it can be expected that reasonable experts can disagree on a score to assign, then the actual construct represented by a score becomes less certain. In automated scoring there are compounded challenges in representing precisely what the score means. On the one hand, the transparency of automated scoring is a substantial advantage; we always know exactly what the score means since it is always the logical and/or statistical aggregation of individual elements of scoring. However, a deeper dive into what these elements measure and how confident we might be that they are measured precisely can either reinforce or undermine this confidence. At the same time, automated scoring systems are often built by calibrating the features on human scores, so that the configuration of feature weights is statistically optimized to best match a human scoring criterion. Given the uncertainty of exactly what the human criterion represents, this makes the interpretation of automated scores somewhat more challenging. The contrast of human and automated scores means comparing the very explicit, precise and perhaps somewhat unflattering honesty of automated scores to more mysterious, amorphous and inconsistent human scoring relying on sophistication of processing and broad experience. As a theory of scoring the question of what a score means, whether automated or human, can be more rich and worthy of investigation than suggested by casual consideration. As a result of these complexities, it is not always trivial to define meaningful hypotheses about the meaning of scores to present for support or falsification. There are a number of potential hypotheses about scoring as predictive theory of domain performance as this aspect of scoring is most often referenced in the literature. One of these is whether scoring distinguishes among examinees of differing ability, typically evaluated through examination of score distributions, item difficulty, unused distracters (though distracters may be a very loose term in simulation-based assessment), biserial correlations, expectations of monotonic increasing relationships between ability and performance, and a variety of other common tools used in psychometric item analysis. Another is the prediction of relationships between automated scores and human scores, which might be comparisons of agreement rates based on percent agreement, kappa, various models of weighted kappa (such as quadratic or unit weighted), and correlations. Also relevant


are distributional measures such as means and standard deviations, and evaluations of subgroup differences in the measures above. A third type of prediction relates scores to external measures and to predicted patterns of convergent and divergent associations with such measures, which can also be contrasted with human scores for comparison. Finally, there is the prediction of how scores would change as a result of changing from human scores to automated scores, or some combination thereof, including predicted differences in the relationships discussed above. Together, these represent only a sample of the empirical measures available to support or refute hypotheses about performance represented by automated scoring algorithms. Williamson, Xi, and Breyer (in press) provide an elaborated discussion of the evaluation of automated scoring of essays, including some proposed criteria for operational use.

Empirical evaluations of hypotheses about behavior embodied as scoring algorithms are the basis for modifying or abandoning aspects of automated scoring in order to improve the quality of the scoring, the items and, ultimately, the assessment as a theory of domain performance. Changes to scoring can target the feature level or the task score level, with feature-level manipulations including collapsing feature score categories that do not discriminate among ability levels (e.g., from 5 to 3 categories) and modifying scoring algorithms so that responses originally hypothesized to be indicative of lower ability are designated as indicative of higher ability, or vice versa. This is similar to the idea of a mis-keyed item and is done only if the empirical investigation provided some better understanding of performance in the domain, resulting in a revised theory supporting the change. Other feature score changes include changes to the task to alter understanding of and/or performance on the task. These are just a few ways feature scores can be changed on the basis of better understanding from empirical investigations. Similar changes can be made at the level of task scores built from aggregate features.

In addition to changes to features and task scores, empirical investigation can also induce policy changes. For example, it is common practice in consequential assessment for essays to receive scores from two human raters, with the final score being the average of the two unless they disagree by more than a specified agreement threshold, in which case adjudication is conducted. When automated essay scoring is used, common practice is for one of the two human raters to be replaced by an automated score and for similar procedures to be followed as when two human scores are used, but the selection of an agreement threshold may differ based on empirical results. Some organizations take a conservative approach in which human and automated scores that differ by more than .5 (an exact agreement threshold, rounding normally) are considered discrepant, while others use a threshold of 1.5, which, when rounded normally, mirrors common practice in which adjacent scores from human raters are considered to be in agreement.


The choice of agreement threshold should be driven by empirical investigations similar to those described above, as should other policy decisions, including when to use automated scores without human scoring, when to provide performance feedback based on automated scores, and what kinds of instructional interventions might be implied by the scores.

Emphasis on empirical rigor in the development and evaluation of automated scoring helps avoid pitfalls in scientific support for score use, which can take a number of forms, some of which are provided here based on prior experience with opportunities for erroneous conclusions or for claims about performance of automated scoring that exceed what is warranted by the data. These areas are coarsely classified as shallow empiricism, confounding outcomes and process, validity by design alone, and predisposition for human scores.

Of these pitfalls, shallow empiricism, in which an initial set of empirical results appears promising but conceals concerns that are evident only upon further analysis, is most frequently encountered. The most obvious example is reliance on percent agreement as a measure of association between automated and human scores, which is readily understandable by lay audiences but is influenced by the number of points on the score scale and the base distribution of human scores. Scales with fewer score points, such as a 0–3 scale, would be expected to have higher agreement rates due purely to chance than scales with more score points, such as a 0–6 point scale. Even for scales with more score points, percent agreement may be influenced by the distribution of human scores. For example, if there is a 6-point rating scale and human scores tend to be normally distributed around the score point of 4 with a standard deviation of 1, experienced human raters would be aware of this and could be expected to treat this like a Bayesian prior in scoring, such that even without seeing a response they would tend toward scores of 3, 4 or 5 rather than a 1, 2 or 6. The stronger the tendency of human raters to use a narrow range of scores, the more apt they are to agree in their ratings. This kind of shallow empiricism can be countered with measures such as kappa and weighted kappa that take score scale and distribution into account. Another kind of shallow empiricism stems from analyses of aggregated data only, which can mask areas of concern about subsets of data. Examples include multiple cases where the association between human and automated essay scores for a pool of prompts appeared very high in the aggregate, but at the prompt level some prompts had substandard performance and were dropped from the pool. Another example is for demographic subgroups, where the aggregate association between human and automated scores is high but evaluation by subgroup reveals groups with notable unexplained differences (Bridgeman, Trapani, & Attali, in press).
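The contrast between percent agreement and chance-corrected indices is easy to see in a small computation. The sketch below is a generic illustration (the score vectors are made up, not data from any program discussed here): it computes exact agreement, Cohen's kappa, and quadratically weighted kappa for paired human and automated scores on a 0–5 scale.

    import numpy as np

    def agreement_stats(human, machine, n_categories):
        """Exact agreement, Cohen's kappa, and quadratic-weighted kappa for two raters."""
        human, machine = np.asarray(human), np.asarray(machine)
        observed = np.zeros((n_categories, n_categories))
        for h, m in zip(human, machine):
            observed[h, m] += 1
        observed /= len(human)                                           # observed proportions
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))  # chance proportions
        idx = np.arange(n_categories)
        w = (idx[:, None] - idx[None, :]) ** 2                           # quadratic disagreement weights
        exact = np.trace(observed)
        kappa = (np.trace(observed) - np.trace(expected)) / (1 - np.trace(expected))
        wkappa = 1 - (w * observed).sum() / (w * expected).sum()
        return exact, kappa, wkappa

    # Hypothetical scores on a 0-5 scale; both raters cluster on 3s and 4s.
    human = [3, 4, 4, 3, 4, 3, 5, 4, 3, 4]
    machine = [3, 4, 3, 3, 4, 4, 4, 4, 3, 4]
    print(agreement_stats(human, machine, n_categories=6))

For these invented vectors, exact agreement is 0.70 even though Cohen's kappa is only about 0.44, because both raters concentrate on scores of 3 and 4; this is precisely the chance inflation that the corrected indices are meant to expose.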


A third form of shallow empiricism is overgeneralization, which occurs when the empirical support for performance of automated scoring in one population is assumed to transfer to another population. This occurs most frequently when automated scoring models for essays are developed for one institution and are then assumed to hold for other institutions. Due to differences in student populations, including writing ability and styles, the scoring models can and do differ when calibrated by institution, even for institutions of similar educational mission (e.g., different community colleges). The difference in mean scores between models not calibrated for an institution and those that are can be as much as 1-point on a 5-point scale. Another type of pitfall is the confounding of outcomes and process, in which the similarity between automated and human scores is taken as evidence that automated scoring uses the same scoring process as human graders. While the automated scoring process is completely tractable, true understanding of what human graders do is elusive despite a number of different approaches to the study of human ratings, including cognitive task analyses, think-aloud protocols, domain analyses and retrospective recall studies. Even with these approaches we still do not know what human raters consistently value in scoring and how those values shift between experimental and operational conditions and the circumstances (fatigue, sequence effects, etc.) of operational human scoring, not to mention differences in values among individual raters. Experts in writing education, for example, could be expected to argue that automated scores do not fully represent the construct of writing that is valued in education despite agreement rates between automated and human scores that exceed the rates of agreement between two independent human raters. A third type of automated scoring pitfall is validity by design alone, in which a strong design process is the sole support for the quality of automated scoring without corresponding empirical evidence. Despite previous emphasis on strong design, this is not a sufficient condition to ensure a successful scoring model, and it is not uncommon for the best intentions of good test designers, working with subject matter experts, to be refuted or at least moderated by empirical results. The potential for discrepancy between the expectations of experts and performance of examinees is why designs must be considered initial theories rather than finalized models of performance. This potential is greater as task designs move away from traditional tasks and toward more interactive, experimental and integrated tasks. In short, design alone does not make a valid assessment. A final pitfall in the science of scoring is an unquestioned predisposition for human scores, in which human scores are treated as a gold standard that automated scoring is expected to perfectly reflect. While human scores represent a relevant criterion for both development and evaluation of automated scores, they are also problematic. Automated scoring research


regularly engages in a certain degree of self-conflicting argumentation by citing the potential advantages of automated scoring over human scores in consistency of scores, transparency of criteria, and other elements that promise improvements, while simultaneously calibrating empirical models and evaluating performance against human scores to demonstrate the quality of scoring. On the one hand, automated scores may be better in some ways than human scoring (and thus expected to occasionally diverge); on the other hand, the explicit expectation of automated scores is that they agree with human scores. While the goal of agreement with human scores is a sound criterion for the early stages of automated scoring development, as the quality of automated scoring matures we may well start to question whether automated scoring should at some point start to diverge from human scores, particularly human scores that may be generated under operational conditions that might not be consistently optimal representations of good scoring practice by individual raters. The question then becomes when automated scoring should agree and when it should disagree with human scores, and what criteria would be used to determine this expected pattern of agreement and disagreement. In short, there comes a time in the development of automated scoring in which the human scores also need to be questioned along with the automated scores rather than taken as an unquestioned standard of performance (Williamson, Bejar & Hone, 1999).

This section has emphasized scoring, particularly automated scoring, as a science that pairs design innovations with scientific empiricism. In this conceptualization of the science of scoring, the item designs, with their associated scoring methods, constitute testable hypotheses about a theory of domain proficiency to be empirically tested to support or falsify the hypotheses embodied in task design and scoring. The designer can then modify or abandon unsupported theories for better models of performance and therefore better item types and scoring.

The Future of Automated Scoring

The editors explicitly requested consideration of what the current state-of-the-art implies about the future of assessment. In considering a prediction of the future of automated scoring, I confess some trepidation about longer-term prognosticating, mostly because the passage of time tends to make such predictions look a bit silly in hindsight. After all, today we routinely use handheld devices with capabilities dramatically beyond the imaginations of most science fiction writers of a generation ago, while cohorts from my father's generation note that they still do not have the flying car they had come to expect from their youthful reading. So, in an effort to steer a middle course between the dramatic predictions of a somewhat distant but


revolutionary future and the mundane but likely accurate predictions for relatively near term innovations, I offer the following. The field will be transformed by what will become possible within the next decade. The current state-of-the-art for word accuracy rates for nonnative speakers of English, at near 80% accuracy, is well beyond the 50% accuracy rate of four to five years ago and this will continue to progress and enhance the capacity of automated scoring systems for spoken responses. In addition, the state-of-the-art of natural language processing (NLP) systems to represent and classify responses for the content of the response, including explicit factual statements for common facts, relationships among entities, and interactions of opinion and fact in argumentation, will expand beyond the primitive models currently employed in automated scoring systems and be more like what would be expected from human evaluators. Together, the combination of stronger content representation systems from the NLP field and the improved quality of speech recognition and scoring systems will allow for scoring of spoken responses for the content of the response that is well beyond what is represented today. In combination with continuing unprecedented growth in delivery technologies and potential formats for item design, these advances will allow the field to fundamentally question what constitutes an assessment and whether the traditional model of distinguishing a test from a learning experience will still hold in the same way as it has in the past. These initial predictions capitalize on work that is currently underway and assume that current trends will continue, more or less, into the near future. However, the most engaging ideas about the future of assessment come not from the logical outcomes of current research and development, but from unpredicted innovations that force a nonlinear disruption in the field of assessment and fundamentally change how we think about scoring. While predicting a nonlinear disruption is somewhat contrary to the definition, there are certain fields of work that seem to have greater promise for such substantial change in assessment, two of which are offered here as fertile soil for potentially revolutionary ideas in the practice of assessment. One idea is that academic assessment may not be restricted to a single event or set of events but instead be based on a combination of discrete test-like events distributed over time as well as passive monitoring of academic work conducted over the course of an academic period (such as a semester, or an academic year). This possibility is based on the confluence of several trends, one of which is technological and based on a future in which handheld digital devices (digital notepads and similar devices) displace textbooks and worksheets as a primary means of delivering and interacting with academic content. Digital educational content would allow for the collection of substantial data on how students interact with their educational material, which would allow for use of this data as part of as-


sessment through data mining or related methods designed to sort through large data sets for patterns indicative of characteristics of interest (such as indicators of ability). If such techniques can be used to leverage this data as passive assessment to be combined with traditional assessments, we may be able to develop better estimates of ability with less frequent formal assessment. This would also depend on the success of current research on distributed assessment models, in which results are aggregated over time for a summary score, taking into account patterns of learning, forgetting or other temporal issues. If successful, there could be a future in which ability is estimated from assessment and learning activities on digital devices distributed over an extended academic period. Another area of potentially radical change is in assessment tasks. If the aforementioned advances in automated speech recognition, automated representation of content, facts, opinions and related concepts come to pass, this would allow for assessment tasks that start to move away from the concept of prompt-response and incorporate ideas from intelligent tutoring systems. Assessment tasks would engage a student much like a teacher might probe a student for understanding, by having an initial question posed to the student, listening to the response, and following up with subsequent probing questions based on the student’s responses until the teacher is satisfied that they understand the limits of student knowledge. The difference would be that instead of an expert human evaluator, the evaluation system could be based on automated systems for speech recognition, content representation, and speech generation to produce an assessment task that is more dialog based and probing, yet still uniform and standardized for students. Obviously, the success of such a task type would rely on joint research from the fields of NLP, speech recognition, intelligent tutoring systems and psychometrics to determine just how interactive such a task could be and still satisfy psychometric expectations, but past work with adaptive testing with multiple choice items provides a model for how different assessment experiences might be placed on a common measurement scale in such a future. Despite the enthusiasm for what might be possible in the future of assessment and automated scoring, a sure prediction for the future is that regardless of our technological and methodological advances, we will be frustrated by what is still not possible and/or practical. Regardless of the advances in automated scoring systems, there will still be domains for which human scoring will be notably different and more highly valued than automated scoring. Advances in the quality and comprehensiveness of current automated scoring systems will not eliminate criticism, but will instead shift criticism to areas for which automated scoring will still be less than ideal (for example, from aspects of writing for communication to writing for artistic merit, such as poetry). While statistical methods that


are used in automated scoring systems may advance, they will not eliminate problems of statistical inference or the need for good design and evaluation. Advances in delivery formats and technologies will not alter the fundamental need for assurances regarding individual work and test security that will inhibit full use of these technologies. Finally, the challenges of both time and money will continue to thwart the best intentions of researchers, and there will always be a gap between what is possible to assess and what is feasible to assess given practical constraints of assessment time and funding for development and delivery.

Perhaps the most effective way to think about what the future of automated scoring may hold is to reflect on the more distant past. Toward that end, the following notes the origins of automated scoring, albeit initially for multiple choice items rather than constructed response:

    The International Business Machine (I.B.M.) Scorer (1938) uses a carefully printed sheet . . . upon which the person marks all his answers with a special pencil. The sheet is printed with small parallel lines showing where the pencil marks should be placed to indicate true items, false items, or multiple-choices. To score this sheet, it is inserted in the machine, a lever is moved, and the total score is read from a dial. The scoring is accomplished by electrical contacts with the pencil marks. . . . Corrections for guessing can be obtained by setting a dial on the machine. By this method, 300 true–false items can be scored simultaneously. The sheets can be run through the machine as quickly as the operator can insert them and write down the scores. The operator needs little special training beyond that for clerical work. (Greene, 1941, p. 134)

Since that time the automated scoring of multiple choice items and true false items has changed dramatically and progressed beyond the processing of hand-marked sheets to delivery and scoring by computer and computerized adaptive testing. Further, the challenges of automated scoring have now shifted from those of scoring multiple choice items to scoring of constructed responses, with automated scoring systems for essays now deployed in consequential assessments. While the scoring systems themselves have changed and are targeting ever more sophisticated kinds of tasks, the fundamental challenges in how to ensure that the scoring systems are accurate and efficient, while also supporting rather than inhibiting the educational practices they are intended to reinforce, remain very similar to these origins. If we, as a field, are diligent, conscientious and fortunate, perhaps we can look back on the state-of-the-art of automated scoring today with the same perspective on how the field has solved our current challenges and built upon them to address more ambitious challenges of the future.

190    D. M. WILLIAMSON

Conclusion

This chapter offers a perspective on automated scoring of performance tasks that emphasizes the conceptual basis for scoring innovative tasks, an overview of currently available automated scoring systems, and some comments on the science of scoring. It is hoped that this discussion will help provide practitioners responsible for large-scale assessment development and/or selection with a high-level perspective on the design and appropriate use of automated scoring algorithms, as well as a perspective on the potential future of assessment as such capabilities continue to develop over time.

Note

1. The author thanks Elia Mavronikolas, Susan J. Miller, Janet Stumper, and Waverely VanWinkle for their assistance in the production of this chapter, and Isaac Bejar and Robert Mislevy for their advice on a draft version.

References

Alves, C. B., Gierl, M. J., & Lai, H. (2010, May). Using automated item generation to promote principled test design and development. Paper presented at the Annual Meeting of the American Educational Research Association, Denver, CO.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3).
Baldwin, D., Fowles, M., & Livingston, S. (2008). Guidelines for constructed-responses and other performance assessments (No. RR-07-02/TOEFL iBT 02). Princeton, NJ: Educational Testing Service. (Downloadable from http://www.ets.org/Media/About_ETS/pdf/8561_ConstructedResponse_guidelines.pdf)
Behrens, J. T., Mislevy, R. J., Bauer, M. I., Williamson, D. M., & Levy, R. (2004). Introduction to evidence centered design and lessons learned from its application in a global e-learning program. International Journal of Testing, 4(4), 295–301.
Bejar, I. I. (2009). Recent developments and prospects in item generation. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 201–226). Washington, DC: American Psychological Association Books.
Bejar, I. I., Williamson, D. M., & Mislevy, R. J. (2006). Human scoring. In D. M. Williamson, R. J. Mislevy & I. I. Bejar (Eds.), Automated scoring for complex constructed response tasks in computer based testing (pp. 49–81). Mahwah, NJ: Lawrence Erlbaum Associates.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–17.
Braun, H., Bejar, I. I., & Williamson, D. M. (2006). Rule-based methods for automatic scoring: Application in a licensing context. In D. M. Williamson, R. J. Mislevy & I. I. Bejar (Eds.), Automated scoring for complex constructed response tasks in computer based testing (pp. 83–122). Mahwah, NJ: Lawrence Erlbaum Associates.

Bridgeman, B., Trapani, C., & Attali, Y. (in press). Comparison of human and machine scoring essays: Differences by gender, ethnicity, and country. Applied Measurement in Education.
Bridgeman, B., Trapani, C., & Bivens-Tatum, J. (2009). Comparability of essay question variants. Princeton, NJ: Educational Testing Service.
Bernstein, J., De Jong, J., Pisoni, D., & Townshend, B. (2000). Two experiments on automatic scoring of spoken language proficiency. In P. Delcloque (Ed.), Proceedings of InSTIL2000 (Integrating Speech Tech. in Learning) (pp. 57–61). Dundee, Scotland: University of Abertay.
Burstein, J. (2003). The e-rater® scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 113–121). Hillsdale, NJ: Lawrence Erlbaum Associates.
DeMark, S. F., & Behrens, J. T. (2004). Using statistical natural language processing for understanding complex responses to free-response tasks. International Journal of Testing, 4(4), 371–390.
DeVore, R. (2002, April). Considerations in the development of accounting simulations. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans.
Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 67–81). Mahwah, NJ: Lawrence Erlbaum Associates.
Enright, M. K., & Sheehan, K. S. (2002). Item generation for test development. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 129–157). Mahwah, NJ: Lawrence Erlbaum Associates.
Greene, E. B. (1941). Measurements of human behavior. New York: The Odyssey Press.
Katz, I. R., & Smith-Macklin, A. (2007). Information and communication technology (ICT) literacy: Integration and assessment in higher education. Journal of Systemics, Cybernetics, and Informatics, 5(4), 50–55.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–112). Hillsdale, NJ: Lawrence Erlbaum Associates.
Leacock, C., & Chodorow, M. (2003). C-rater: Scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405.
Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
Luecht, R. M. (2007, April). Assessment engineering in language testing: From data models and templates to psychometrics. Invited paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago.
Margolis, M. J., & Clauser, B. E. (2006). A regression-based procedure for automated scoring of a complex medical performance assessment. In D. M. Williamson, R. J. Mislevy & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer based testing (pp. 123–167). Hillsdale, NJ: Lawrence Erlbaum Associates.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Mislevy, R. J. (1995). Probability-based inference in cognitive diagnosis. In P. Nichols, S. Chipman & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 43–71). Hillsdale, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). On the roles of task model variables in assessment design. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 97–128). Mahwah, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–67.
Mislevy, R. J., Steinberg, L. S., Almond, R. G., & Lukas, J. F. (2006). Concepts, terminology and basic models of evidence-centered design. In D. M. Williamson, R. J. Mislevy & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer based testing (pp. 15–48). Hillsdale, NJ: Lawrence Erlbaum Associates.
Mitchell, T., Russell, T., Broomhead, P., & Aldridge, N. (2002). Towards robust computerized marking of free-text responses. In Proceedings of the Sixth International Computer Assisted Assessment Conference (pp. 233–249). Loughborough, UK: Loughborough University.
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238–243.
Page, E. B. (1968). The use of the computer in analyzing student essays. International Review of Education, 14(2), 210–225.
Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43–54). Hillsdale, NJ: Lawrence Erlbaum Associates.
Pearson. (2009, March). PTE Academic automated scoring. Retrieved April 3, 2009 from http://www.pearsonpte.com/SiteCollectionDocuments/AutomatedScoringUS.pdf
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know. Washington, DC: National Academy Press.
Riconscente, M. M., Mislevy, R. J., & Hamel, L. (2005). An introduction to PADI task templates (Technical Report 3). Menlo Park, CA: SRI International.
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4). Retrieved from http://www.vantagelearning.com/docs/articles/Intel_MA_JTLA_200603.pdf
Shermis, M. D., & Burstein, J. C. (2003). Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence Erlbaum Associates.
Singley, M. K., & Bennett, R. E. (1998). Validation and extension of the mathematical expression response type: Applications of schema theory to automatic scoring and item generation in mathematics (GRE Board Professional Report No. 93-24P). Princeton, NJ: Educational Testing Service.

Sinharay, S., Johnson, M. S., & Williamson, D. M. (2003). Calibrating item families and summarizing the results using family expected response functions. Journal of Educational and Behavioral Statistics, 28(4), 295–313.
Steinhaus, S. (2008, July). Comparison of mathematical programs for data analysis. Retrieved August 13, 2010 from http://www.scientificweb.com/ncrunch/
Sukkarieh, J. Z., & Bolge, E. (2008). Leveraging c-rater's automated scoring capability for providing instructional feedback for short constructed responses. Lecture Notes in Computer Science, 5091, 779–783.
Sukkarieh, J. Z., & Pulman, S. G. (2005). Information extraction and machine learning: Auto-marking short free text responses to science questions. In Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED) (pp. 629–637). Amsterdam, The Netherlands: IOS Press.
Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003, October). Auto-marking: Using computational linguistics to score short, free text responses. Paper presented at the 29th annual conference of the International Association for Educational Assessment (IAEA), Manchester, UK.
Tatsuoka, K. K. (1987). Validation of cognitive sensitivity of item response curves. Journal of Educational Measurement, 24, 233–245.
Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Lawrence Erlbaum Associates.
Wainer, H., Brown, L. M., Bradlow, E. T., Wang, X., Skorupski, W. P., Boulet, J., & Mislevy, R. J. (2006). An application of testlet response theory in the scoring of a complex certification exam. In D. M. Williamson, R. J. Mislevy & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer based testing. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181–208.
Williamson, D. M., Bauer, M. I., Steinberg, L. S., Mislevy, R. J., & Behrens, J. T. (2004). Design rationale for a complex performance assessment. International Journal of Testing, 4(4), 303–332.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). 'Mental model' comparison of automated and human scoring. Journal of Educational Measurement, 36(2), 158–184.
Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., Rubin, D. P., Way, W. D., & Sweeney, K. (2010, July). Automated scoring for the assessment of common core standards. Retrieved from http://www.ets.org/s/commonassessments/pdf/AutomatedScoringAssessCommonCoreStandards.pdf
Williamson, D. M., Xi, X., & Breyer, F. J. (in press). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10), 883–895.


Chapter 8

Making Computerized Adaptive Testing Diagnostic Tools for Schools

Hua-Hua Chang

Introduction

Computerized Adaptive Testing (CAT) is a method of administering tests that adapts to the examinee's trait level, and it has become popular in many high-stakes educational testing programs. A CAT differs profoundly from a paper-and-pencil (P&P) test. In the former, different test takers are given different sets of items according to each examinee's ability level. In the latter, all examinees are given an identical set of items. The major advantage of CAT is that it provides more efficient latent trait estimates (θ) with fewer items than are required in P&P tests (e.g., Wainer et al., 1990; Weiss, 1982). Examples of large-scale CATs include the Graduate Management Admission Test (GMAT), the National Council Licensure Examination (NCLEX), and the Armed Services Vocational Aptitude Battery (ASVAB). The implementation of CATs has led to many advantages, such as new question formats, new types of skills that can be measured, easier and faster data analysis, and immediate score reporting.
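As a minimal sketch of the adaptive logic described above, the following illustrates a generic maximum-information CAT under the two-parameter logistic model with a simple grid-based ability update. The item bank, parameter values, and test length are invented for illustration and do not describe any of the operational programs named here.

    import numpy as np

    def p_correct(theta, a, b):
        """Two-parameter logistic (2PL) probability of a correct response."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def information(theta, a, b):
        p = p_correct(theta, a, b)
        return a ** 2 * p * (1 - p)           # Fisher information of each item at theta

    rng = np.random.default_rng(0)
    a = rng.uniform(0.8, 2.0, 200)            # hypothetical discrimination parameters
    b = rng.normal(0.0, 1.0, 200)             # hypothetical difficulty parameters
    true_theta = 0.5                          # simulated examinee

    grid = np.linspace(-4, 4, 161)            # ability grid for the posterior
    logpost = -0.5 * grid ** 2                # standard normal prior, so the first estimate is 0
    used = np.zeros(a.size, dtype=bool)

    for _ in range(20):                       # fixed-length 20-item adaptive test
        theta_hat = grid[np.argmax(logpost)]
        info = information(theta_hat, a, b)
        info[used] = -np.inf                  # never readminister an item
        j = int(np.argmax(info))              # most informative remaining item
        used[j] = True
        x = rng.random() < p_correct(true_theta, a[j], b[j])   # simulated response
        p = p_correct(grid, a[j], b[j])
        logpost += np.log(p if x else 1 - p)  # update the ability estimate
    print(round(grid[np.argmax(logpost)], 2))

Each cycle re-estimates θ and then administers the unused item that is most informative at that estimate, which is why a CAT can reach a stable estimate with fewer items than a fixed form.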



Furthermore, CAT has the capability of administering a test to small groups of examinees at frequent time intervals, which is referred to as continuous testing. This provides examinees with the flexibility of scheduling the test.

Though CAT in K–12 applications is still quite limited, it has a promising future in the K–12 context. According to Way (2006), educators and policy makers in many states are particularly excited about the potential for efficient measurement of student achievement through innovative test delivery models, and using CAT to deliver standards-based assessments is becoming increasingly attractive. CAT improves the security of testing materials, since different examinees are given different sets of items from a large item bank, and CAT's potential to provide diagnostic information to parents, teachers, and students is also increasingly relevant as educators turn to assessments as critical sources of information for directing additional instruction to the areas needed most by individual students (e.g., see Cheng, 2009; Kingsbury, 2009; McGlohen & Chang, 2008). According to Quellmalz and Pellegrino (2009), more than 27 states currently have operational or pilot versions of online tests for their statewide or end-of-course assessments, including Oregon (a pioneer of online statewide assessment), North Carolina, Utah, Idaho, Kansas, Wyoming, and Maryland. The landscape of educational assessment is changing rapidly with the growth of computer-administered tests. Moreover, the new federal grant program entitled "Race to the Top" (RTTT) puts emphasis on developing state-wide longitudinal data warehouses for monitoring student growth and learning so that teachers can provide highly targeted and effective instruction in order to prepare the next generation of students for success in college and the workforce (U.S. Department of Education, 2009). Clearly, the RTTT emphasis on technologically based instructional improvement systems opens the door to increasing use of CAT and cognitive diagnosis in the K–12 context.

The RTTT program emphasizes both accountability and instructional improvement. Thus, providing diagnostic information to promote instructional improvement becomes an important goal of next-generation assessments. CAT in K–12 applications will demonstrate the advantages that it has already exhibited in regular educational testing. However, the ability to provide cognitive diagnostic information for a particular domain is a much greater challenge. Note that most current CAT systems were originally developed for large-scale, high-stakes admissions and licensing exams, in which accurately estimating the total true score is the major concern for the design of item selection algorithms. In K–12 assessment, on the other hand, in addition to the total score, teachers are also interested in getting instructional feedback from their students' test results.


The utility of the testing can be enhanced if it also provides students and their teachers with useful diagnostic information in addition to the single "overall" score. Ideally, an exam would be able not only to meet the rigorous psychometric standards of current large-scale assessments, but also to provide specific diagnostic information regarding individual examinees' educational needs. In fact, the desire to provide this additional diagnostic information has recently become a requirement: the No Child Left Behind Act of 2001 mandates that such feedback be provided to parents, teachers, and students as soon as is practical.

This chapter introduces a variety of psychometric methods that can be utilized to assemble CAT systems as diagnostic tools that K–12 schools can use to classify students' mastery levels for a given set of cognitive skills that students need to succeed. Several issues are discussed: for example, how to select a cost-effective design of hardware and network that schools can afford; how to incorporate the function of cognitive diagnosis into an item selection algorithm; and how to gain more efficient control over non-psychometric constraints such as content balance, item exposure control, and so on. Furthermore, information and discussion are provided with regard to the psychometric underpinnings of two CAT designs: one is the regular computerized adaptive testing (CAT) that has been used for more than three decades, and the other is the newly emerged cognitive diagnostic CAT (CD-CAT). Finally, some promising results from large-scale implementations of CD-CAT in China are reported to illustrate the applicability of the proposed methods in K–12 settings.

Building Cost-effective CD-CAT Systems for Schools

The hardware and network infrastructure of a CD-CAT system should be essentially the same as that of its predecessor, CAT. With proper refinements, most current CAT systems can be readily applied in schools to help teachers classify students' mastery levels of the cognitive skills currently being taught. Unfortunately, most currently available CAT systems are proprietary and can be operated and managed only by testing companies or commercial testing sites. Schools usually have no access to resources and activities such as system management and item bank maintenance. In order to make CAT a learning tool that provides feedback for classroom teaching, the new systems should have the possibility of being owned and operated by schools themselves at each grade level. Current CAT systems typically require specialized test administration software and dedicated administration workstations, which would be too cumbersome and expensive for most schools and districts. Instead, a new, feasible in-school CAT system should have a turnkey server application that can be installed easily on an existing laptop or desktop machine, and should provide test administration through a common, web-based Internet browser application.


The cutting-edge Browser/Server (B/S) architecture allows schools to implement the CAT with little to no additional cost using their current computer labs and networks. The B/S architecture uses commonly available web-browsing software on the client side and a simple server that can be fitted onto a regular PC or laptop connected to the school's existing network of PCs and Macs. See Figure 8.1 for a demonstration. The B/S architecture presents a cost-effective and user-friendly alternative to the more traditional Client/Server design, since it does not require specialized client software, extensive additional hardware, or detailed knowledge of the network environment.

Since its invention in 1990, the World Wide Web (WWW) has exploded from a rudimentary network of hyperlinked text to a ubiquitous, media-rich platform for distributed, real-time computer applications. In the new "Web 2.0" environment, applications and data reside not on individual users' computers, but on servers, which deliver programs on demand to users through the Internet via web browsers. This new technology has been successfully adopted by many businesses, with services such as SharePoint and Google's office productivity suite (including Gmail, Google Calendar, and document authoring applications). Web browsers and the associated software for media-rich web applications have become standard on most computers (or can be downloaded freely and installed easily), allowing rapid and inexpensive deployment of browser-based applications to a wide audience.

Figure 8.1  A B/S-based CAT system can be built on a common PC or laptop and connected to the school's existing PCs via the Internet or an intranet.


Since the application resides on the server, browser-based applications free users from software installation and maintenance, and application administrators have only a single system to maintain. For modest applications (a single school or small district), the server software is compact enough to be installed on a standard laptop or desktop user workstation. For large-scale applications (a large district or state-wide program), the Web 2.0 industry has developed fully enough that a number of relatively inexpensive application hosting services (such as Amazon Web Services) exist to free administrators from the burden of maintaining actual hardware, allowing them to focus on developing high-quality software and content. See Figure 8.2 for an example of a test frame design for a Web-delivered, multimedia-rich individualized assessment. The picture in Figure 8.3 was taken at a large scale cognitive diagnostic assessment in Dalian, China, in January 2011. About 15,000 students participated in the web-browser-delivered assessment.

In contrast, many current CAT systems were developed under the older client/server architecture, in which specialized client software must be installed separately on every workstation. In some cases, these workstations must be dedicated solely to test administration. Software must be developed for each platform (Mac OS, Windows) separately, and each time the software is updated or improved, administrators must ensure that every workstation receives the update, creating a significant maintenance burden.
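In contrast to that client-installed model, a minimal sketch of the server side of a browser-based delivery system, written against the Python standard library only, might look like the following. The URL path, port, and item content are placeholders invented for this example and are not part of any operational system described in this chapter.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical item bank held only on the server; browsers never see the full pool.
    ITEMS = [
        {"id": 1, "stem": "2 + 3 = ?", "options": ["4", "5", "6"]},
        {"id": 2, "stem": "7 - 4 = ?", "options": ["2", "3", "4"]},
    ]

    class TestHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The browser asks for /next-item; the server decides which item to deliver,
            # so item selection logic and test content stay server-side.
            if self.path.startswith("/next-item"):
                body = json.dumps(ITEMS[0]).encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        # Any PC, Mac, or Linux machine on the school network can now point a web
        # browser (or the test page's AJAX calls) at http://<server>:8000/next-item.
        HTTPServer(("0.0.0.0", 8000), TestHandler).serve_forever()

Because the item bank and the selection logic stay on the server, any machine with a web browser can act as a testing station, which is the property the B/S architecture trades on.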

Figure 8.2  The Browser/Server (B/S) architecture allows a Web browser to conveniently deliver a multimedia-rich individualized assessment to any PC that is connected to the Internet. Note that all the web options shown in the frame can be easily blocked so that test takers have no access to test-related information.


Figure 8.3  The cutting-edge Browser/Server (B/S) architecture allows schools to implement the CAT with little to no additional cost using their current computer labs and networks. Students in Grade 5 in Dalian, China, are taking a cognitive diagnostic English proficiency assessment.

To make CAT a diagnostic tool for many schools, the system design should take advantage of advances in web technology to create a browser-based test delivery application. Since the Web 2.0 environment has created a common platform for developing cross-platform applications quickly, the new CAT system will run on Windows, Mac, and Linux machines, turning almost any computer connected to the Internet into a potential test-delivery station. As a result, schools and districts will be able to make use of their existing computer and network equipment, with little additional cost. Moreover, the only requirement for test delivery is an Internet connection, and Internet access points are rapidly multiplying; with a browser/server CAT system, mobile testing becomes an eminently achievable reality.

Liu and colleagues (Liu, You, Wang, Ding, & Chang, 2010) reported a successful example of a B/S-architecture-based CD-CAT application in a recent large scale educational assessment in China. About 15,000 students in Dalian, China participated in a field test. A total of 2,000 PCs were connected. Thanks to the B/S architecture, these machines are not simply serving as testing terminals; their central processing units (CPUs) work together as one large computing resource that performs such large scale testing flawlessly.


Understanding Cognitive Diagnosis and Its Modeling

Attributes and Q-Matrices

More recently, there has been great interest in how formative assessment can be used to improve learning as a diagnostic or screening mechanism. As a result, a new trend in psychometric research is to classify students' mastery levels for a given set of attributes the test is designed to measure, where an attribute is a task, subtask, cognitive process, or skill involved in answering an item. For example, we are interested in understanding the process by which a student solves a particular task, so that from his/her performance we can infer his/her knowledge of the task and the strategy he/she is using to solve it. This process is referred to as cognitive diagnosis. Assume each examinee, say i, has a latent class α_i = (α_i1, . . . , α_iK) indicating his/her knowledge status, where α_ik = 1 indicates that examinee i has mastered attribute k and α_ik = 0 otherwise, i.e.,

\[
\alpha_{ik} =
\begin{cases}
1, & \text{if examinee } i \text{ has mastered attribute } k\\
0, & \text{otherwise.}
\end{cases}
\tag{8.1}
\]

The k th element, αik , of αi is a binary indicator of an examinee’s classification with regard to the k th attribute. For instance, in the case of fraction subtraction, k might denote mastery of converting a whole number to a fraction. Vector α is a K-dimensional latent class, and its values cannot be observed, but they can be estimated by appropriate latent class models. To understand how the attributes are utilized to construct a response, the relationships of which attributes are required for which items need to be identified by test developers, content experts, as well as cognitive researchers. Tatsuoka (1995) proposed to relate items with the attributes the test is designed to measure by a Q-matrix. An entry of the matrix qik indicates if the i th item measures the skill indexed by the k th attribute, in other words,

$$q_{ik} = \begin{cases} 1, & \text{if item } i \text{ measures skill } k, \\ 0, & \text{otherwise.} \end{cases} \qquad (8.2)$$


Let's consider a simple example of a 3-item test measuring 3 attributes. A Q-matrix, with rows indicating items and columns indicating attributes, can be given as follows:

$$Q = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix} \qquad (8.3)$$
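The same Q-matrix can also be written down directly in code. The sketch below is only an illustration of how the entries are read; the array and the printout are not part of the original example.

```python
import numpy as np

# Q-matrix from Equation 8.3: rows are items, columns are attributes.
Q = np.array([
    [0, 1, 0],   # item 1 requires attribute 2
    [1, 1, 0],   # item 2 requires attributes 1 and 2
    [0, 1, 1],   # item 3 requires attributes 2 and 3
])

# List the attributes required by each item (1-based labels for readability).
for j, row in enumerate(Q, start=1):
    required = [k + 1 for k, q in enumerate(row) if q == 1]
    print(f"Item {j} requires attribute(s): {required}")
```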

According to the relationship specified by the Q-matrix, in order to answer item one correctly, the student needs to master attribute two. Clearly item two requires attributes one and two, whereas item three requires attributes two and three. The usefulness of a pre-specified Q-matrix and the latent class modeling is apparent in that two students can have the same achievement level while their mastery status on the attributes differs. If each α can be classified from the student's responses, feedback from an exam can be more individualized, directly addressing a student's specific strengths and weaknesses. For example, in Figure 8.4, Bo needs further help on Attributes one to four, whereas Jane needs help on Attributes one, three, six, and seven, even though their total scores are the same. In order to estimate the values of α, item response models for cognitive diagnosis must be used. Though these methods were originally developed in a nonadaptive manner for a variety of existing tests, our goal is to design an adaptive algorithm that selects the next item according to how much we can say about the mastery level of the student on a specific attribute, such as division. Therefore, it is important to understand how cognitive diagnostic modeling works.

Figure 8.4  Two students can have the same achievement level. However, their statuses on the seven attributes are different.

Cognitive Item Response Models

Specialized latent class models for cognitive diagnosis need to be derived under assumptions about which attributes are needed for which items and how the attributes are utilized to construct a response. Let Xj be the item response for the jth item; Xj = 1 if the answer is correct, Xj = 0 if incorrect. Let Xj = 1 with probability Pj(α) = P{Xj = 1 | α} and Xj = 0 with probability Qj(α) = 1 − Pj(α), namely

$$X_j = \begin{cases} 1 & \text{with probability } P_j(\alpha), \\ 0 & \text{with probability } Q_j(\alpha), \end{cases}$$

where α is a K-dimensional latent class vector with 2^K possible outcomes. For a given test with n items, consistent with traditional latent variable models in psychometrics, X1, X2, . . . , Xn are modeled as statistically independent given the latent vector α = (α1, α2, . . . , αK). Throughout the progression of cognitive diagnosis research, an abundance of models have been proposed to provide cognitively diagnostic information in the assessment process (for details, see Hartz, Roussos, & Stout, 2002). What distinguishes models from one another are the assumptions that dictate how attributes are utilized to construct responses. Among these models, the Deterministic Inputs, Noisy "And" Gate (DINA) model (Junker & Sijtsma, 2001; Macready & Dayton, 1977) has been used widely by researchers and practitioners in simulation studies and large scale implementations. Let ηij denote whether the ith examinee possesses the attributes required for the jth item. Note that

$$\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}},$$

where α_ik and q_jk are defined in Equation 8.1 and Equation 8.2, respectively. Clearly,

$$\eta_{ij} = \begin{cases} 1, & \text{if examinee } i \text{ has mastered all the attributes item } j \text{ measures}, \\ 0, & \text{otherwise.} \end{cases} \qquad (8.4)$$

There are only two parameters in the DINA model: one is the slipping parameter, s_j, and the other is the guessing parameter, g_j, where s_j is the probability of slipping and incorrectly answering the item when η_ij = 1, and g_j is the probability of correctly guessing the answer when η_ij = 0. Specifically,

$$s_j = P(X_j = 0 \mid \eta_{ij} = 1), \qquad (8.5)$$

and

$$g_j = P(X_j = 1 \mid \eta_{ij} = 0). \qquad (8.6)$$

The DINA model partitions students into two classes for each item: those who have mastered all the attributes required by an item (ηij = 1) and those who have not (ηij = 0). The item response probability can be written as:

$$P(X_{ij} = 1 \mid \alpha_i) = (1 - s_j)^{\eta_{ij}}\, g_j^{\,1-\eta_{ij}}. \qquad (8.7)$$

The DINA model is one of the simplest cognitive models currently available, and it is computationally less intensive with regard to both parameter calibration and latent class classification. For a real-time, multi-user system such as CAT, computational efficiency is a desirable feature. Note that many other models can also be used for cognitive diagnosis, such as the Fusion model (Hartz, Roussos, & Stout, 2002), the NIDA model (Maris, 1999), the Hierarchical DINA model (de la Torre & Douglas, 2004), the DINO model (Templin & Henson, 2006), the Multicomponent Latent Trait model (Embretson, 1985), and so on. Several recent reports indicate that some of these models can be used to fit the types of response data collected from current achievement or state accountability tests. For example, McGlohen and Chang (2008) used the Fusion model and the 3PL model to calibrate item parameters on the basis of a simple random sample of 2,000 examinees for each of three administrations of a state-mandated large-scale assessment. They used BILOG MG and Arpeggio 1.2 to obtain the 3PL and Fusion-model-based item parameters, respectively. This large-scale assessment consisted of a math portion and a reading portion. This suggests that the Fusion model can conveniently fit response data from existing state assessments. Liu et al. (2010) used the DINA model and the 3PL model to successfully calibrate two sets of item parameters from the same achievement test, taken by about 120,000 students, in two subjects: math and English.
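As a minimal sketch of how Equations 8.4 and 8.7 operate together, the snippet below computes the DINA response probability for one examinee and one item; every numeric value is hypothetical.

```python
import numpy as np

def dina_prob_correct(alpha, q, s, g):
    """P(X = 1) under the DINA model (Equations 8.4 and 8.7).

    alpha : 0/1 vector of the examinee's attribute mastery
    q     : 0/1 vector, the item's row of the Q-matrix
    s, g  : the item's slipping and guessing parameters
    """
    # eta = 1 only if every attribute required by the item has been mastered
    eta = int(np.all(alpha >= q))
    return (1 - s) ** eta * g ** (1 - eta)

alpha = np.array([1, 1, 0])   # mastered attributes 1 and 2 (hypothetical)
q     = np.array([0, 1, 1])   # the item requires attributes 2 and 3
print(dina_prob_correct(alpha, q, s=0.1, g=0.2))   # eta = 0 here, so the probability equals g
```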


Basic Elements in Computerized Adaptive Testing

CAT Methods for Estimating θ

The theoretical groundwork of CD-CAT evolved from traditional computerized adaptive testing. To understand how a CD-CAT works, it is helpful to have a quick review of the basic elements of traditional CAT. Assume a latent trait θ is being measured, say math ability; then the objective of CAT is to adaptively estimate the student's θ. The most important component in CAT is the item selection procedure used to select items during the course of the test. According to Lord (1970), an examinee is measured most effectively when test items are neither too difficult nor too easy. Heuristically, if the examinee answers an item correctly, the next item selected should be more difficult; if the answer is incorrect, the next item should be easier. In doing so, able examinees can avoid answering too many easy items, and less able examinees can avoid too many difficult items. In other words, the test is tailored to each examinee's θ level, thus matching the difficulties of the items to the examinee being measured. Clearly, CAT can help teachers, schools, and states get more precise information about student achievement levels. To carry out the branching rule described above, a large item pool is needed in which the items are pre-calibrated according to their psychometric characteristics, such as item difficulty, item discrimination, and guessing probability, under certain statistical models for the item responses. Let Xj be the score for a randomly selected examinee on the jth item, where Xj = 1 if the answer is correct and Xj = 0 if incorrect. Let Xj = 1 with probability Pj(θ) and Xj = 0 with probability Qj(θ) = 1 − Pj(θ), namely

$$X_j = \begin{cases} 1 & \text{with probability } P_j(\theta), \\ 0 & \text{with probability } Q_j(\theta), \end{cases}$$

where θ has the domain (−∞, ∞) or some subinterval of (−∞, ∞). When the three-parameter logistic model (3PL) is used, the probability becomes

$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-a_j(\theta - b_j)}}, \qquad (8.8)$$

where a_j is the item discrimination parameter, b_j is the difficulty parameter, and c_j is the guessing parameter.
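A minimal sketch of Equation 8.8 in code follows; the parameter values are illustrative only.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (Equation 8.8)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# A moderately discriminating item of average difficulty with some guessing.
print(p_3pl(theta=0.5, a=1.2, b=0.0, c=0.2))
```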


The Maximum Item Information Approach

A standard approach to item selection in CAT has been to select the item with the maximum Fisher item information as the next item. Note that Fisher item information is defined as

$$I_j(\theta) = \left[\frac{\partial P_j(\theta)}{\partial \theta}\right]^2 \Big/\; P_j(\theta)\,[1 - P_j(\theta)],$$

and Fisher test information is defined as

$$I(\theta) = \sum_{j=1}^{n} I_j(\theta). \qquad (8.9)$$

An important feature of I(θ) is that the contribution of each item to the total information is additive. Since the latent trait θ is unknown, the optimal item selection rule cannot be implemented directly, but it may be approximated using the updated estimate θ̂ each time a new item is to be selected. According to Lord (1980), the next item to be selected should have the highest value of item information at the current trait estimate θ̂, which is referred to as the maximum item information criterion (MIC). Under the 3PL model, maximizing Fisher information intuitively amounts to matching item difficulty parameter values to the latent trait level of the examinee. Under general Item Response Theory (IRT) modeling assumptions, θ̂_n, the maximum likelihood estimator of θ, is asymptotically normal, centered at the true θ with variance approximated by I^(-1)(θ̂_n), where I(θ) is the Fisher test information function defined in Equation 8.9. Thus maximizing Fisher information is asymptotically equivalent to minimizing the sampling variance of the estimate θ̂_n, and for that reason the MIC method has been the most popular item selection algorithm for the last three decades.

Alternative Methods for Item Selection

Other sequential procedures designed for traditional CAT include Owen's (1969, 1975) approximate Bayes procedure, Chang and Ying's (1996) maximum global-information criterion, and Veerkamp and Berger's (1997) likelihood-weighted information criterion, among others. One of the most commonly recognized drawbacks of the standard MIC item selection method for CAT is that no matter how large the item pool is, only a small fraction of the items tend to be used, which wastes resources and increases security risks (Wainer, 2000). The MIC usually results in the selection of items with extremely high discrimination parameters at the beginning, followed by a descending pattern of discrimination parameters, causing items with low discrimination to be rarely administered.
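A minimal sketch of the MIC rule under the 3PL model (Equations 8.8 and 8.9) follows; the tiny item pool and the current estimate of θ are hypothetical.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def fisher_info_3pl(theta, a, b, c):
    """Fisher item information for a 3PL item at a given theta."""
    p = p_3pl(theta, a, b, c)
    q = 1 - p
    dp = a * (p - c) * q / (1 - c)        # derivative of the 3PL response function
    return dp ** 2 / (p * q)

# Hypothetical pool: one row per item, columns are a, b, c.
pool = np.array([[0.8, -1.0, 0.20],
                 [1.5,  0.2, 0.20],
                 [1.2,  1.0, 0.25]])

theta_hat = 0.3
info = np.array([fisher_info_3pl(theta_hat, a, b, c) for a, b, c in pool])
next_item = int(np.argmax(info))          # MIC: administer the most informative item
print(info, next_item)
```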


For a given item pool, the randomized item selection method yields the best test security in comparison with all other item selection methods; however, it also yields the least accurate trait level estimation. It has been shown that, with proper stratification and blocking techniques, the a-stratified item selection method (AST) equalizes the item exposure distribution and hence yields excellent test security, and it does so without sacrificing estimation efficiency (e.g., see Chang, Qian, & Ying, 2001; Chang & Ying, 1999). An original version of the AST method can be described as follows:

Step 1. Partition the item pool into K levels according to item discrimination.
Step 2. Partition the test into K stages.
Step 3. In the kth stage, select items from the kth level based on the similarity between item difficulty and θ̂, then administer the items.
Step 4. Repeat Step 3 for k = 1, 2, . . . , K.

The rationale behind AST is that, because accuracy generally becomes greater as the test progresses, one effective strategy is to administer items from the lowest a-level at the early stages of the test and items from the highest level at the last stage. At each stage, only items from the corresponding level are selected. Item pool stratification also affects item exposure rates. Because estimation accuracy generally improves as the test progresses, the AST method selects items conservatively in the early stages of the test and reserves the highly discriminating items, which strongly influence ability estimates, for the latter stages, when there is more certainty about the examinee's estimated ability. Thus, AST automatically adjusts step sizes in the estimation of θ throughout the course of the test, with smaller steps at the beginning to reduce the chance of extreme values in estimating θ and greater weight on highly discriminating items at the end to provide the greatest accuracy. While the AST method provides an important improvement over MIC in CAT, the original algorithm proposed by Chang and Ying (1999) did not fully consider the possibility of imposing multiple constraints on item selection. Numerous refinements have been made to overcome these limitations, including methods for selecting items to meet the kinds of constraints necessary for K–12 assessments (e.g., see Cheng, Chang, Douglas, & Guo, 2009). Many issues, both theoretical and applied, have been addressed (e.g., see Chang, 2004; Chang & Ying, 2008; Chang & Ying, 2009; Chang & Zhang, 2002).
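A minimal sketch of the stratified selection idea follows, under simplifying assumptions (equal-size strata, no content or exposure constraints); the pool values are hypothetical.

```python
import numpy as np

def a_stratified_pick(pool, stage, n_strata, theta_hat, used):
    """Pick the next item under a simple a-stratified (AST) scheme.

    pool  : array with columns (a, b); one row per item
    stage : current test stage, 0 .. n_strata - 1
    used  : set of item indices already administered
    """
    order = np.argsort(pool[:, 0])                # items sorted by discrimination a
    strata = np.array_split(order, n_strata)      # stratum 0 holds the lowest-a items
    candidates = [j for j in strata[stage] if j not in used]
    # within the current stratum, match item difficulty b to the provisional theta estimate
    return min(candidates, key=lambda j: abs(pool[j, 1] - theta_hat))

pool = np.array([[0.5, -0.8], [0.7, 0.3], [1.0, -0.2],
                 [1.4,  0.6], [1.8, 0.1], [2.0, -0.5]])
print(a_stratified_pick(pool, stage=0, n_strata=3, theta_hat=0.0, used=set()))
```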


CD Methods for Cognitive Diagnosis

Some attempts have been made to use current CATs to get diagnostic information. For example, Kingsbury (2009) has dubbed adaptive tests geared toward cognitive diagnosis "Idiosyncratic Computerized Adaptive Testing" (ICAT) and has found promising applications in providing teachers with information for targeted instruction. According to the MAP technical manual (NWEA, 2009, p. 42), MAP has the capability to calculate goal performance scores based on subscales, that is, the score from only those items pertaining to a given goal. These methods fall short, however, in providing a robust methodology for classifying students' latent classes and offering specific interventions according to the classified status. While an approximate measure of a student's mastery of a particular skill can be obtained simply from such a sub-score, statistically advanced cognitively diagnostic models provide a level of control in scaling, linking, and item banking unavailable with simpler methods. As an extension of traditional CAT, cognitive diagnostic CAT is designed to classify the mastery levels of students on the attributes the test is designed to measure. Over a dozen cognitively diagnostic latent class models have been proposed (e.g., see Embretson & Reise, 2000). Though these models can be used in a non-adaptive manner for a variety of existing tests, incorporating cognitive diagnosis into the sequential design of CAT has the potential to determine students' mastery levels more efficiently, along with the other advantages of CAT already identified. Sequentially selecting items based on an examinee's current classification is particularly appropriate for developing assessments that address today's challenges in learning. The goal of a CD-CAT is to tailor a test to each individual examinee via an item selection algorithm that allows the test to home in on the examinee's true attribute status α in an interactive manner. In this regard several new methods have been proposed: for example, Xu, Chang, and Douglas (2003), Cheng and Chang (2007), Cheng (2009), and McGlohen and Chang (2008). Among them the two methods proposed by Xu et al. (2003) are predominant; one is based on maximizing Kullback-Leibler information and the other on minimizing Shannon entropy (Shannon, 1948).

The K-L Information Method (KL)

In general, the KL information measures the distance between two probability distributions over the same parameter space (Cover & Thomas, 1991; Lehmann & Casella, 1998), and it is usually defined as

$$KL[g \,\|\, f\,] = E_f\!\left[\log \frac{f(X)}{g(X)}\right]. \qquad (8.10)$$


Note that the expectation is taken with respect to f(X), which is referred to as the true distribution of X, and g(X) is an alternative distribution. In the context of cognitive diagnosis, suppose α0 is the true latent vector of the examinee and α1 is an estimator. Now, let f(X) = P(Xj = x | α0) and g(X) = P(Xj = x | α1). To distinguish any fixed α0 from α1, examine the difference between the values of P at α0 and α1. Such a difference can be captured by the ratio of the two values, resulting in the well-known likelihood ratio test (Neyman & Pearson, 1936). By Neyman-Pearson theory (Lehmann, 1986), the likelihood ratio test is optimal for testing α = α0 versus α = α1. In other words, it is the best way to tell α0 from α1 when the cognitive item response model is assumed for the observed item responses. Definition: KL Item Information. Let α0 be the true latent class. For any vector α1 of the same size, the KL information of the jth item with response Xj is defined by

$$K_j(\alpha_1 \,\|\, \alpha_0) \equiv E_{\alpha_0}\!\left[\log \frac{P(X_j \mid \alpha_0)}{P(X_j \mid \alpha_1)}\right]. \qquad (8.11)$$

A straightforward calculation using Equation 8.11 shows that the KL information can be expressed as

$$\sum_{x=0}^{1} \log\!\left[\frac{P(X_j = x \mid \alpha_0)}{P(X_j = x \mid \alpha_1)}\right] P(X_j = x \mid \alpha_0). \qquad (8.12)$$
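As a small illustration of Equations 8.11 and 8.12, the sketch below evaluates the KL item information for a single DINA item when the two latent classes lead to different values of η; the slipping and guessing values are hypothetical.

```python
import numpy as np

def kl_item_info(p0, p1):
    """KL information (Equation 8.12) for a binary item.

    p0 : P(X = 1 | alpha_0), response probability under the true latent class
    p1 : P(X = 1 | alpha_1), response probability under the competing class
    """
    return p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1))

# Under DINA the two classes differ only through eta, so the probabilities are
# 1 - s (all required attributes mastered) versus g (at least one missing).
s, g = 0.1, 0.2
print(kl_item_info(p0=1 - s, p1=g))   # how well this item separates the two classes
```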

Assume α̂ is the jth estimate of the examinee's latent class α0 after the first j items are administered. Since α0 is not known, a simple way to construct a single index from K_j is by taking the average over all possible patterns of α:

$$KL_j(\hat{\alpha}) = \sum_{l=1}^{2^K} \sum_{x=0}^{1} \log\!\left[\frac{P(X_j = x \mid \hat{\alpha})}{P(X_j = x \mid \alpha_l)}\right] P(X_j = x \mid \hat{\alpha}), \qquad (8.13)$$

where K is the total number of attributes measured by the test. If we view Equation 8.13 as an objective function of α̂, the (j+1)th item should be selected to maximize KL_j(α̂). The (j+1)th response is then used to re-estimate α0, and the process repeats until a stopping rule is satisfied.

The Shannon Entropy Method (SHE)

Xu et al. (2003) proposed another promising method based on minimization of Shannon entropy (SHE; Shannon, 1948). Shannon entropy measures the uncertainty associated with the distribution of a random variable. Let X be a random variable that takes discrete values x1, x2, . . . , xI, and let P(X = xi) be the probability that X takes the specific value xi, i = 1, 2, . . . , I. The Shannon entropy of X is defined by

$$SHE(X) = \sum_{i=1}^{I} P(X = x_i)\,\log\!\left[\frac{1}{P(X = x_i)}\right]. \qquad (8.14)$$

In the context of cognitive diagnosis, suppose the examinee has answered the first j items; the SHE function is used to select the item that minimizes the uncertainty of the posterior distribution of the test taker’s attribute pattern estimate. Hence, for item j+1, the expected value of the SHE function for the posterior distribution of the test taker’s attribute pattern is defined as:

$$E[SHE_j(\alpha)] = E\!\left[\sum_{k=1}^{2^K} P(\alpha_k \mid X_1, X_2, \ldots, X_{j+1})\,\log\!\left[\frac{1}{P(\alpha_k \mid X_1, X_2, \ldots, X_{j+1})}\right]\right] \qquad (8.15)$$

$$= \sum_{x=0}^{1}\left\{\sum_{k=1}^{2^K} P(\alpha_k \mid X_1, X_2, \ldots, X_{j+1})\,\log\!\left[\frac{1}{P(\alpha_k \mid X_1, X_2, \ldots, X_{j+1})}\right]\right\} P(X_{j+1} = x),$$

where P(α_k | X_1, . . . , X_{j+1}) ∝ P(X_1, . . . , X_{j+1} | α_k) P(α_k). Intuitively, the item that produces the smallest expected value of SHE is associated with the least amount of uncertainty in the test taker's attribute pattern distribution and therefore will be chosen as the next item. Simulation studies were conducted to compare the performance of the two item selection algorithms, and the findings indicate that both the SHE and KL algorithms perform well regarding accuracy and consistency of classification (see Xu et al., 2003, and Liu, You, Wang, Ding, & Chang, under review). The item selection methods can also be used with other cognitive diagnostic models, such as the DINA model (Haertel, 1984; Junker & Sijtsma, 2001; Macready & Dayton, 1977), the Fusion model (DiBello, Stout, & Roussos, 1995; McGlohen & Chang, 2008), the NIDA model (Maris, 1999), and the DINO model (Templin & Henson, 2006).

CAT Methods for Estimating Both θ and α

A major advantage of a traditional CAT is that it tailors the test to fit the examinee's ability level θ in an interactive manner. The objective of cognitive diagnosis is to provide information about the specific skills in which the examinee needs help. It is therefore interesting to incorporate the two aims into a single item selection algorithm so that both the traditional ability level estimate (θ) and the attribute mastery feedback provided by cognitively diagnostic assessment (α) can be obtained adaptively.

The Shadow Test Approach

McGlohen and Chang (2008) proposed a two-stage method in which a "shadow" test functions as a bridge connecting information gathered at θ for IRT and information accumulated at α for the CDM. Building an item selection algorithm based on both θ and α combines the benefit of specific feedback from cognitively diagnostic assessment with the advantages of adaptive testing. They investigated three approaches: (1) item selection based on the traditional ability level estimate (θ), (2) item selection based on the attribute mastery feedback provided by cognitively diagnostic assessment (α), and (3) item selection based on both θ and α. Results from these three approaches were compared for theta estimation accuracy, attribute mastery estimation accuracy, and item exposure control. The θ- and α-based condition outperforms the α-based condition with regard to theta estimation, attribute mastery pattern estimation, and item exposure control. The θ-based condition and the θ- and α-based condition perform similarly in all three respects, but the θ- and α-based condition has an additional advantage in that it uses the shadow test method, which allows the administrator to incorporate additional constraints in the item selection process, such as content balancing and item type constraints, and to select items based on both the current θ and α estimates; moreover, it can be built on top of existing 3PL testing programs.

Dual Information Approach

Cheng and Chang (2007) propose a dual information method (DIM). Suppose the student has answered m items; the next item, say j, should be chosen by considering two Kullback-Leibler (KL) information indexes, one based on θ and the other based on α, denoted by KL_j(θ) and KL_j(α), respectively. According to Chang and Ying (1996),



$$KL_j(\hat{\theta}_m) = \int_{\hat{\theta}_m - \delta_m}^{\hat{\theta}_m + \delta_m} KL_j(\theta \,\|\, \hat{\theta}_m)\, d\theta, \qquad (8.16)$$

where

$$KL_j(\theta \,\|\, \hat{\theta}_m) = P_j(\hat{\theta}_m)\,\log\!\left[\frac{P_j(\hat{\theta}_m)}{P_j(\theta)}\right] + \left[1 - P_j(\hat{\theta}_m)\right]\log\!\left[\frac{1 - P_j(\hat{\theta}_m)}{1 - P_j(\theta)}\right], \qquad (8.17)$$


and δ_m → 0 as m → ∞, where θ̂_m is the mth estimate of θ. See Chang and Ying (1996) for a detailed discussion of δ_m. On the other hand, KL_j(α̂_m) is defined by

$$KL_j(\hat{\alpha}_m) = \sum_{c=1}^{2^K}\left\{\sum_{x=0}^{1} \log\!\left[\frac{P(X_j = x \mid \hat{\alpha}_m)}{P(X_j = x \mid \alpha_c)}\right] P(X_j = x \mid \hat{\alpha}_m)\right\}, \qquad (8.18)$$

where α̂_m is the mth estimate of α and K is the total number of attributes. If the purpose of a test is purely diagnostic, with no interest in estimating the student's achievement level θ, then according to Tatsuoka (2002), Xu, Chang, and Douglas (2003), and Cheng and Chang (2007), the next item should be selected by maximizing the KL information defined in Equation 8.18 across all available items in the pool. Since our objective is to tailor the test to each student according to both θ and α, Cheng and Chang (2007) propose using a weighted dual information index:

$$KL_j(\hat{\theta}_m, \hat{\alpha}_m) = w\, KL_j(\hat{\alpha}_m) + (1 - w)\, KL_j(\hat{\theta}_m), \qquad (8.19)$$

where 0 ≤ w ≤ 1 is a weight assigned to the KL information of α̂. The next item can therefore be selected to maximize the weighted dual information defined in Equation 8.19. Intuitively speaking, at the beginning of the test the value of w should be small, whereas at the end of the test w should be large. This is because at the early stages of a CAT it is unlikely that θ̂, the student's θ estimate, is accurate, and our primary concern is to get a more accurate estimate of the student's achievement level. According to Chang and Ying (1996), maximizing KL_j(θ̂_m) will lead to more accurate subsequent θ estimates. At later stages of a CAT, however, the emphasis should be on α.

Aggregate Ranked Information Method (ARI)

Recently, Wang, Chang, and Wang (2011) extended the dual information algorithm to a method called Aggregate Ranked Information (ARI). The rationale for developing the ARI method is that the two KL information functions defined in Equation 8.19 may not be on the same scale. Specifically, the integration in KL_j(θ̂) in Equation 8.19 is computed in terms of a summation as follows:

$$KL(\hat{\theta}) \approx \sum_{i=-k}^{k}\left\{ p(\hat{\theta})\,\log\!\left[\frac{p(\hat{\theta})}{p(\hat{\theta} + i\Delta\theta)}\right] + \left(1 - p(\hat{\theta})\right)\log\!\left[\frac{1 - p(\hat{\theta})}{1 - p(\hat{\theta} + i\Delta\theta)}\right]\right\}\Delta\theta, \qquad (8.20)$$

where


$$\Delta\theta = \frac{2\delta}{2k + 1}$$

is the step size. KL(α̂) in Equation 8.19 comprises 2^K addends, while the number of addends in KL(θ̂) depends on how the integration domain is sliced, that is, on the "k" in Equation 8.20. In addition, the sizes of the terms differ greatly because every addend in Equation 8.20 has Δθ as a multiplier, which means KL(θ̂) will always be much smaller than KL(α̂). Therefore KL(α̂) will play a dominant role in item selection. To solve this non-comparability issue, Wang et al. (2011) propose a modification that transforms the two information measures to ordinal scales, in such a way that each item has two percentiles, one for KL(θ̂) and one for KL(α̂). ARI is calculated as

$$ARI = \lambda\, pe\!\left(KL(\hat{\alpha})\right) + (1 - \lambda)\, pe\!\left(KL(\hat{\theta})\right), \qquad (8.21)$$

where pe(•) represents "percentile." The rationale behind this method is that by using the ordinal scale, the information captured by θ and α can be put together into one index, and the weight λ (0 ≤ λ ≤ 1) will reflect the true relative importance of the two pieces. Wang et al. (2011) also proposed several variations of Equation 8.21. Simulation results showed that ARI led to better estimation of both θ and α compared with Cheng and Chang's (2007) original dual method under various weights. In addition, the method is flexible enough to accommodate different weights for different attributes. Their findings also indicate that assigning a higher weight to an attribute that is difficult to estimate may increase the recovery rate of that attribute but may decrease the recovery rate of the whole pattern.

Incorporating Multiple Constraints in Both CAT and CD-CAT

An operational CAT program needs to consider various non-statistical constraints. Examples of such constraints include: a certain proportion of items should be selected from each content area (known as content balancing); correct answers should fall approximately evenly on options A, B, C, and D (known as answer key balancing); and only a limited number of "special" items are allowed on a test, such as items with negative stems (e.g., "Which of the following choices is NOT true?"). An operational CAT will allow for the inclusion of many constraints, including constraints to control the exposure rate of each item, which is defined as the ratio of the number of times the item is administered to the total number of examinees.
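As a small numerical illustration of the exposure-rate definition above, the counts, number of examinees, and cap value below are hypothetical.

```python
import numpy as np

admin_counts = np.array([480, 120, 35, 300, 5])   # times each item has been administered
N = 1000                                          # examinees tested so far
r_max = 0.25                                      # an illustrative exposure-rate cap

exposure = admin_counts / N                       # exposure rate per item
overexposed = np.where(exposure > r_max)[0]
print(exposure, overexposed)                      # items 0 and 3 exceed the cap
```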


If the algorithm only selects "the best items" in the pool, it will leave many items essentially unused. Item writing is a very expensive process; all items in the pool have gone through pre-testing and passed a rigorous review process, and hence should be actively used. In addition, it becomes easier to cheat effectively on an exam because the effective size of the item pool is severely shrunken. As testing in K–12 contexts moves even more heavily into large-scale applications, the potential for adverse consequences in current CAT implementations must be avoided. One technical challenge is to develop a user-friendly algorithm to balance various non-statistical constraints.

Content Balancing

There are essentially two types of content balancing: fixed content balancing and flexible content balancing. In the former, the number of items from each content area is fixed. In the latter, the number of items from each content area is constrained between a lower bound and an upper bound. Let n_m denote the number of items from content area m, m = 1, 2, . . . , M, where M is the number of content areas. For flexible content balancing, n_m must satisfy

$$l_m \le n_m \le u_m \qquad (8.22)$$

$$\sum_{m=1}^{M} n_m = L, \qquad (8.23)$$

where l_m and u_m are the lower and upper bounds, respectively (m = 1, 2, . . . , M, with M the total number of content areas), and L is the test length. Note that fixed content balancing is a special case of flexible content balancing in which l_m = u_m = n_m. Techniques to handle fixed content balancing have been proposed by many authors. Yi and Chang (2003) proposed a method that blocks the content effect by a pre-stratification process so that the item selection algorithm is able to handle fixed content balancing effectively. However, since most testing programs use flexible content balancing, applying an item selection method in a real testing situation requires equipping it with the capability of flexible content balancing. Methods for handling flexible content balancing have been developed by several authors. Among them, Stocking and Swanson's (1998) weighted deviation model (WDM) and van der Linden's linear programming approach (see van der Linden & Chang, 2003) are the most popular. These methods are capable of handling many practical constraints on both item content and item type and are therefore good choices for K–12 CAT applications. However, they require rather complex techniques from operations research, such as external linear programming software like CPLEX, which could be burdensome for some K–12 assessment developers.

The Two-Phase Content Balancing Method. To provide a simpler alternative, Cheng, Chang, and Yi (2007) proposed a two-phase content balancing method. In the first phase, l_m items are chosen from each content area for the first

$$L_1 = \sum_{m=1}^{M} l_m$$

items, to meet the lower bound constraints. Then, in the second phase, the remaining L_2 = L − L_1 items are selected within the upper bounds of each content area. The content areas from which to select each item are determined from a modified multinomial model (Chen & Ankenmann, 2004). In the multinomial model, each content area m has target proportion l_m/L_1 in the first phase and (u_m − l_m)/(U − L_1) in the second. Once a satisfactory sequence of content areas is obtained, items can be selected from the specified content areas with the method proposed by Yi and Chang (2003). In addition to content constraints, this method is able to handle all common constraints, including item exposure rate control. Moreover, the method can be used with different item selection methods in both CAT and CD-CAT.

Two-Phase Priority Score Content Balancing. Recently Cheng and Chang (2009) proposed a priority score extension of the two-phase method to handle all common constraints in conjunction with a variety of item selection methods. In this method, each item is assigned a priority score based on the weighted number of remaining permitted items for each relevant constraint. In the first phase, the weighted number of remaining permitted items for constraint m is given by

$$f_{jm} = \frac{l_m - x_m}{l_m}, \qquad (8.24)$$

where x_m is the total number of items that have been selected for the mth constraint at a given stage. Denote the constraint relevancy matrix by C, a J × M matrix defined over the item pool with entries c_jm = 1 if item j is relevant to constraint m, and 0 otherwise. Then a priority score for the jth item in the item pool (or stratum) can be computed as a function of these quantities over the M content constraint categories:

$$p_j = I_j \prod_{m=1}^{M} (f_{jm})^{c_{jm}}, \qquad (8.25)$$


where I_j can be the item information of item j at either θ̂ or α̂, the current estimates of θ and α, respectively. The next item is selected by maximizing the priority index p_j within the current item stratum. Note that when constraint m reaches its lower bound l_m, f_jm becomes 0, and the relevant priority index p_j turns 0 too. The item is therefore given lower priority than any other item in the pool that has a positive priority index. For instance, suppose a test has only two content constraints and one of them has already reached its lower bound. Then this fulfilled content area will become dormant, and no more items can be selected from it until the other content area catches up. As a result, all the lower bounds will be met at the end of the first phase. In the second phase, the f_jm's are computed by:

$$f_{jm} = \frac{u_m - x_m}{u_m}. \qquad (8.26)$$
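A minimal sketch of the two-phase priority scores (Equations 8.24 and 8.26) combined into the priority index of Equation 8.25 follows; the information values, constraint matrix, and bounds are hypothetical.

```python
import numpy as np

def priority_scores(info, C, x, lower, upper, phase):
    """Priority index p_j (Equation 8.25) with f_jm from Eq. 8.24 (phase 1) or Eq. 8.26 (phase 2).

    info         : item information I_j at the current theta (or alpha) estimate
    C            : J x M 0/1 matrix, c_jm = 1 if item j is relevant to constraint m
    x            : number of items already selected for each constraint m
    lower, upper : constraint bounds l_m and u_m
    """
    f = (lower - x) / lower if phase == 1 else (upper - x) / upper
    f = np.clip(f, 0.0, None)          # a constraint at its bound contributes 0
    return info * np.prod(np.where(C == 1, f, 1.0), axis=1)

# Hypothetical example: 4 items, 2 content constraints.
info = np.array([0.9, 0.7, 0.8, 0.5])
C = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
p = priority_scores(info, C, x=np.array([1, 0]),
                    lower=np.array([2, 1]), upper=np.array([3, 2]), phase=1)
print(p, int(np.argmax(p)))            # the item with the largest priority index is selected
```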

Again, when constraint m reaches its upper bound u_m, f_jm will be 0, and no more items from this content area will be selected as long as there are other content areas that have not yet been completely capped. It is worth noting that f_jm here is not limited to content constraints; it can be used for any constraint.

Item Exposure Rate Control. To reduce the impact of item sharing, item exposure rates should be controlled. The exposure rate of an item is defined as the ratio between the number of times the item is administered and the total number of examinees. As already noted, the maximum-information-based methods tend to select certain types of items, which may cause those items to be overexposed. Remedies to restrain the overexposure of some items have been proposed by many authors. The most common method for controlling exposure rates is the S-H procedure (Sympson & Hetter, 1985), which puts a "filter" between selection and administration. MIC-SH can effectively suppress the usage of the most overexposed items and spread their usage over the next tier of overexposed items. However, because items that are not selected cannot be administered, items with small probabilities of being selected will still have small exposure rates; thus, the S-H method does not increase exposure rates for underexposed items (e.g., see Chang & Ying, 1999). A new method has been proposed to guarantee item exposure rate control by coding the control task as a special constraint for any given item selection procedure (Cheng & Chang, 2009). Suppose constraint k′ requires that the exposure rate of item j cannot exceed r_j. Let N be the number of examinees who are taking the CAT and n_j be the number of examinees who have seen item j so far. Then f_jk′ can be computed with:




$$f_{jk'} = \frac{r_j - \left(\dfrac{N - n_j}{N}\right)}{r_j}, \qquad (8.27)$$

where (N − n_j)/N is the provisional exposure rate of item j. Following Equation 8.27, we can compute the priority index p for every item in the stratum; then the item with the b-parameter closest to θ̂ and with the largest p will be selected. It should be noted that AST tends to equalize item exposure rates. Let r_j = r for all j. By putting the constraint from Equation 8.27 into Equation 8.25, it can be assured that no item's exposure rate exceeds r.

One-Phase Item Selection. Instead of selecting items in a two-phase fashion, Cheng and Chang (2009) proposed a one-phase method that incorporates both upper bounds and lower bounds. The priority score for item j becomes

$$f_{1m} = \frac{u_m - x_m - 1}{u_m} \qquad (8.28)$$

and

$$f_{2m} = \frac{(L - l_m) - (t - x_m)}{L - l_m}, \qquad (8.29)$$

where t is the number of items that have already been administered. The quantity f_1m measures the distance from the upper bound, and L − l_m is the upper bound on the total number of items that can be selected from the other content areas. When f_2m = 0, the number of items from the other content areas has reached its maximum. Note that when x_m increases, f_1m decreases and f_2m increases. Therefore, the index p_j defined in Equation 8.25 tries to strike a balance between them; in other words, it keeps the number of items from content area m between the lower and upper bounds.

A Weighted Priority Index

Operational CAT and CD-CAT should have the capacity to handle many constraints, and it is often the case that some constraints are more important than others. The importance of each constraint can be quantified by a weight w_k, k = 1, 2, . . . , K. In practice, both the C matrix and the vector of K weights are specified before testing by content experts or test designers. Let I_j be the item information. In order to determine which item is best, a priority index for item j can be computed with:

$$p_j = I_j \prod_{k=1}^{K} (w_k f_{jk})^{c_{jk}},$$

where c_jk = 1 if constraint k is relevant to item j, and 0 otherwise. Thus, f_jk indicates the scaled "quota left" for constraint k. Cheng and Chang (2009) studied a two-phase version of this constraint-weighted (CW) method for a large-scale operational CAT program. A simulation study was conducted using an item pool and the test specifications and constraints of the operational placement test. The simulation results indicate that CW greatly outperformed the traditional control method in terms of constraint management. CW also maintains high measurement precision and improves item exposure control over the MIC method. Note that, though all the methods introduced in this section were originally developed for CATs for θ-estimation, they should work successfully for most CD-CATs.

Implementation of CD-CAT

Working with Existing Test Data

A major advantage of using a cognitive diagnostic model is that the feedback from the test is individualized to the student's strengths and weaknesses. Many researchers propose developing new cognitive diagnostic assessments that focus on the estimation of the attribute vector α (e.g., see the special JEM issue edited by DiBello and Stout, 2007). As a cost- and time-saving measure, it is also possible to apply cognitive diagnosis to existing assessments that were developed mainly for estimating θ, such as achievement measures. McGlohen and Chang (2008) developed an algorithm that combines the estimation of individual achievement levels with an emphasis on the diagnostic feedback provided by individual attributes. The Q-matrices were constructed by analyzing the test blueprints and the items in the booklets. The item parameters were calibrated on the basis of a random sample of 2,000 examinees for each of three administrations of a state-mandated large-scale assessment, using BILOG and Arpeggio to calibrate the 3PL and Fusion-model-based item parameters, respectively. Their study suggests that states have the potential to develop cognitively diagnostic formative assessments quickly and inexpensively by drawing on the wealth of items already developed for their state accountability assessments.

Develop CD-CAT for Large Scale Assessment

An important question that needs to be answered is how the item selection algorithms in a cognitive diagnostic CAT interact with the accuracy of the Q-matrix in selecting items and classifying examinees. In addition, the number of attributes measured by the items in the bank also potentially

affects the performance of the diagnostic CAT. Given that the length of a CAT is often limited, there is a potential trade-off between including more attributes in the test and allowing more items to measure each attribute. One would expect that the more attributes measured by a CAT, the fewer the items per attribute, and therefore the less reliable and accurate the test results. How can we quantitatively evaluate the degree to which Q-matrix accuracy and the number of attributes affect the classification rates of student attribute mastery or non-mastery in a diagnostic CAT? To answer these research questions, it is important to conduct real-person field tests as well as follow-up validity studies. Most recently, Liu et al. (2010) reported a large-scale pilot CD-CAT application in China, with about 38,600 students in paper-and-pencil pretesting and 584 in web-based CAT testing. In the study they found that the post-hoc approach to constructing the Q-matrix may not be ideal, since the attributes can only be matched with the items currently available in the item pool; as a result, an attribute being assessed may not have a sufficient number of items measuring it. A different approach in Liu et al. (2010) is that the attributes and the relationships among them were defined by content experts before test development, so that the test developers could write items following the instructions about the attributes, relationships, and other pre-specified requirements. Interestingly, when the Q-matrix was constructed before test development, the average point-biserial correlation between item scores and test scores tended to be higher than that obtained under the post-hoc approach. It is clear that providing cognitive diagnostic training to test developers before item writing results in higher-quality test items. Note that in measuring how closely performance on a test item is related to performance on the total test, the point-biserial is a key index in item analysis according to Classical Test Theory. Liu et al. (2010) also indicated that if the statistical models do not fit the pretest data well, appropriately adjusting the elements in the Q-matrix may yield better fit with the same data. One situation that frequently occurred in the process of item parameter calibration is that test developers and psychometricians worked together to tackle poor data fit; most of the time, the model fit could be improved by fine-tuning the Q-matrix after re-examining the cognitive processes that examinees might use to solve the problems. A CD-CAT system was developed to measure English achievement for grades five and six. About 38,600 students from 78 schools took the paper-and-pencil pretest with 13 different booklets. A Q-matrix based on eight attributes was constructed. Two sets of item parameters for 352 items were calibrated, one for the DINA model and the other for the 3PL model. Given that all eleven item writers went through two days of psychometric training emphasizing the 3PL and DINA models before writing the 352 items, both the DINA and 3PL models fit the data satisfactorily. The DINA model and the Shannon entropy minimization method described in Equation 8.15 were utilized. In a field test of cognitive diagnostic assessment designed to mimic English Proficiency Level II, a well-known English assessment in China, 200 PCs from eight schools in Beijing were connected via the Internet to a laptop server, and 584 fifth and sixth grade students from the eight schools participated in the CD-CAT tests. The assessment is a multimedia-rich Internet application, including extensive audio instructions and animations during the course of the testing. Despite the heavy load of "multi-tasking," the administration of the individualized assessment went smoothly. Ninety students were sampled to participate in two pilot validity studies; one compared the classification consistency between the CD-CAT and teachers, and the other between the CD-CAT and an existing achievement test. In the former, the mastery levels of the eight attributes classified by the CD-CAT system for each student were compared with those judged by their school teachers. The latter compared the CD-CAT results with a regular academic achievement test developed for evaluation of compulsory education in Beijing. According to Liu et al. (2010), the consistency levels for both comparison studies were quite high. In January 2011 Liu and her colleagues successfully conducted a large-scale online field test in Dalian, China. About 15,000 students participated in this three-day continuous assessment, with a maximum of 2,000 students taking the test simultaneously (Liu et al., under review). (See Figure 8.5 for one of the groups of students taking the test in their school's multimedia room.) Given that the scope of CD-CAT applications is growing, more and more research questions will be answered.

Conclusions

Over the past thirty years, student assessment has become an increasingly important feature of public education. For the wealth of assessment outcomes about student learning to be truly useful for teachers in instructional planning, new assessment methods are obliged to provide reliable and accessible information about how students think and what they understand, in order to pinpoint areas needing further (re)teaching. Currently, there are significant barriers to designing assessments that can be used simultaneously for such wide-ranging purposes as accountability and improvement. Recognizing the intrinsic limitations of summative assessment, educators are looking for new assessments to inform and track student learning during the year. Large numbers of vendors are now selling what they call

"benchmark," "diagnostic," "formative," and/or "predictive" assessments with promises of improving student performance. These systems often lay claim to the research documenting the powerful effect of formative assessment on student learning. However, the research in this area evaluated formative assessments of a very different character than essentially all current commercially available interim assessment programs (Perie, Marion, & Gong, 2007; Shepard, 2008).

Figure 8.5  Students in Grade 5 in Dalian China are taking a cognitive diagnostic English proficiency assessment by using their school's PCs.

In our view, a truly diagnostic instrument must be one that can be tailored to each individual student, and therefore CAT should be one of the most promising approaches. Although well-known and sophisticated CAT techniques are available for implementing large-scale admission and licensure testing, the psychometric foundations for assessment designs that would tailor assessments to each individual, and thereby incorporate individualized diagnostic information to improve teaching and learning, are still maturing. As emphasized in this chapter, the newly developing cognitive diagnosis and computerized adaptive testing methods have great potential for maximizing the benefit of assessments for students and for developing balanced assessment systems across classroom, district, state, national, and international levels that are mutually reinforcing and aligned to a set of world-class common core standards. One big challenge in bringing CAT to schools is the affordability of hardware, software, and network delivery. To this end, Liu and her colleagues


have set an exceptional example showing that a large-scale CD-CAT implementation can be based on the cutting-edge Browser/Server architecture, which is indeed a cost-effective and user-friendly alternative to the more traditional Client/Server design, given that it does not require specialized client software, extensive additional hardware, or detailed knowledge of the network environment. The experience Liu and her colleagues gained, from test development to overall quality control, may help many practitioners in their system designs and future large-scale implementations. However, if constructing this sort of diagnostic assessment from scratch is not practical due to nationwide budget cuts from the downturn of the economy, schools still have the potential to develop cognitively diagnostic formative assessments quickly and inexpensively by drawing on the wealth of items already developed for their state accountability assessments. Since the goal of formative assessment is to inform instructional decisions, these assessments should provide as much information about the current state of students' learning as possible. Thus, instead of merely providing a global score of examinee ability, formative assessments should ideally classify students' mastery of a given set of attributes pertinent to learning. These attributes constitute tasks, subtasks, cognitive processes, or skills involved in answering each test item. Classifying students' mastery in this way is referred to as cognitive diagnosis. With knowledge of which attributes students have mastered, teachers can target their instructional interventions to the areas in which students need the most improvement. The potential significance of large-scale CD-CAT applications lies in the substantially increased evidence that both achievement levels and skill-mastery levels can be accurately estimated. In this regard, Liu's team is demonstrating that CAT can be used innovatively not only to estimate an examinee's latent trait, but also to classify the examinee's mastery levels on the skills the assessment is designed to measure. The result of the field test reported in Liu et al. (2010) presents a clear illustration of how to validate the classification results generated by CD-CAT. It is interesting that they found high consistency between the diagnoses generated by the CD-CAT assessment and those made by the school teachers, and this provides evidence in support of the intended inferences and actions to be made based on the reported test results. Developing a validity rationale for any measurement instrument and gathering sufficient evidence is always important. This book chapter provides an array of information that might be useful for designing robust techniques to combine cognitive diagnosis with adaptive testing. Relating to adaptive item selection emphasizing cognitive diagnosis, an innovative future application of CD-CAT is to bring the CAT approach to cognitive diagnosis into the realm of web-based learning. Although the research on CD-CAT was originally inspired by the problems in K–12 accountability testing, its findings may also be beneficial to other domains such as quality-of-life measurement, patient-reported outcomes, and media and information literacy measurement (Chang, 2011). CAT has already had a substantial influence on the functioning of society by affecting how people are selected, classified, and diagnosed. The presented research will lead to better assessment and hence will benefit society. CAT is revolutionizing the way we address challenges in assessment and learning.

References

Chang, H. (2004). Understanding computerized adaptive testing: From Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methods for the social sciences (pp. 117–133). Thousand Oaks, CA: Sage Publications.
Chang, H. (2011, July). Building affordable CD-CAT systems for schools to address today's challenges in assessment. Paper presented at the 76th Annual and 17th International Meeting of the Psychometric Society, Hong Kong, China.
Chang, H., Qian, J., & Ying, Z. (2001). A-stratified multistage CAT with b blocking. Applied Psychological Measurement, 25, 333–341.
Chang, H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213–229.
Chang, H., & Ying, Z. (1999). A-stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211–222.
Chang, H., & Ying, Z. (2008). To weight or not to weight? Balancing influence of initial items in adaptive testing. Psychometrika, 73, 441–450.
Chang, H., & Ying, Z. (2009). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. The Annals of Statistics, 37, 1466–1488.
Chang, H., & Zhang, J. (2002). Hypergeometric family and item overlap rates in computerized adaptive testing. Psychometrika, 67, 387–398.
Chang, H., & van der Linden, W. J. (2003). Optimal stratification of item pools in alpha-stratified computerized adaptive testing. Applied Psychological Measurement, 27, 262–274.
Chen, S., & Ankenmann, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149–174.
Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74(4), 619–642.
Cheng, Y., Chang, H., & Yi, Q. (2007). Two-phase item selection procedure for flexible content balancing in CAT. Applied Psychological Measurement, 31, 467–482.
Cheng, Y., & Chang, H. (2007). The modified maximum global discrimination index method for cognitive diagnostic CAT. In D. Weiss (Ed.), Proceedings of the 2007 GMAC Computerized Adaptive Testing Conference. Retrieved from www.psych.umn.edu/psylabs/CATCentral/

Cheng, Y., & Chang, H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369–383.
Cheng, Y., Chang, H., Douglas, J., & Guo, F. (2009). Constraint-weighted a-stratification for computerized adaptive testing with nonstatistical constraints: Balancing measurement efficiency and exposure control. Educational and Psychological Measurement, 69, 35–49.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York, NY: Wiley.
De la Torre, J., & Douglas, J. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.
DiBello, L., & Stout, W. (Eds.). (2007). Special issue. Journal of Educational Measurement, 44, 285–392.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
Embretson, S. E. (1985). Multicomponent latent trait models for test design. In S. E. Embretson (Ed.), Test design: Developments in psychology and psychometrics (pp. 195–218). New York, NY: Academic Press.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Haertel, E. H. (1984). An application of latent class models to assessment data. Applied Psychological Measurement, 8, 333–346.
Hartz, S., Roussos, L., & Stout, W. (2002). Skills diagnosis: Theory and practice [User manual for Arpeggio software]. Princeton, NJ: Educational Testing Service.
Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
Kingsbury, C. (2009, November). An investigation of the idiosyncratic computerized adaptive testing (ICAT) procedure. Paper presented at the 2009 Association of Test Publishers Conference, Palm Springs, CA.
Lehmann, E. L. (1986). Testing statistical hypotheses (2nd ed.). New York, NY: Wiley.
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation (2nd ed.). Berlin: Springer.
Liu, H., You, X., Wang, W., Ding, S., & Chang, H. (2010, May). Large-scale applications of cognitive diagnostic computerized adaptive testing in China. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.
Liu, H., You, X., Wang, W., Ding, S., & Chang, H. (under review). The development of computerized adaptive testing with cognitive diagnosis for an English achievement test in China. Manuscript under review.
Lord, F. M. (1970). Some test theory for tailored testing. In W. H. Holzman (Ed.), Computer assisted instruction, testing, and guidance (pp. 139–183). New York, NY: Harper and Row.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Macready, G. B., & Dayton, C. M. (1977). Use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2, 99–120.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
McGlohen, M., & Chang, H. (2008). Combining computer adaptive testing technology with cognitive diagnostic assessment. Behavioral Research Methods, 40, 808–821.
Neyman, J., & Pearson, E. S. (1936). Contributions to the theory of testing statistical hypotheses. Statistical Research Memoirs, 1–37.
NWEA. (2009). Technical manual of MAP. Portland, OR: Northwest Evaluation Association.
Owen, R. J. (1969). A Bayesian approach to tailored testing (Research Bulletin 69-92). Princeton, NJ: Educational Testing Service.
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351–356.
Perie, M., Marion, S., Gong, B., & Wurtzel, J. (2007). The role of interim assessments in a comprehensive assessment system: A policy brief. Achieve, Inc. Retrieved from www.achieve.org
Quellmalz, E. S., & Pellegrino, J. W. (2009). Technology and testing. Science, 323, 75–79.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Shepard, L. (2008). Formative assessment: Caveat emptor. In C. A. Dwyer (Ed.), The future of assessment (pp. 279–304). New York, NY: Erlbaum.
Stocking, M. L., & Swanson, L. (1998). Optimal design of item banks for computerized adaptive tests. Applied Psychological Measurement, 22, 271–279.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item exposure rates in computerized adaptive testing. In Proceedings of the 27th annual meeting of the Military Testing Association (pp. 937–977). San Diego, CA: Navy Personnel Research and Development Center.
Tatsuoka, C. (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society, 51, 337–350.
Tatsuoka, K. K. (1995). Architecture of knowledge structures and cognitive diagnosis: A statistical pattern recognition and classification approach. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 327–359). Hillsdale, NJ: Erlbaum.
Templin, J., & Henson, R. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305.
U.S. Department of Education. (2009). Race to the Top program executive summary. Washington, DC: Author. Retrieved from http://www2.ed.gov/programs/racetothetop/executive-summary.pdf
Veerkamp, W. J. J., & Berger, M. P. F. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22, 203–226.
Wainer, H. (2000). Rescuing computerized adaptive testing by breaking Zipf's law. Journal of Educational and Behavioral Statistics, 25, 203–224.

Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum.
Wang, C., Chang, H., & Wang, X. (2011, April). An enhanced approach to combine item response theory with cognitive diagnosis in adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Way, W. (2006). Practical questions in introducing computerized adaptive testing for K–12 assessments (Research Report 05-03). Iowa City, IA: Pearson Educational Measurement.
Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473–492.
van der Linden, W. J., & Chang, H. (2003). Implementing content constraints in a-stratified adaptive testing using a shadow-test approach. Applied Psychological Measurement, 27, 107–120.
Xu, X., Chang, H., & Douglas, J. (2003, April). A simulation study to compare CAT strategies for cognitive diagnosis. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Yi, Q., & Chang, H. (2003). a-Stratified multistage CAT design with content-blocking. British Journal of Mathematical and Statistical Psychology, 56, 359–378.

Chapter 9

Applying Computer Based Assessment Using Diagnostic Modeling to Benchmark Tests

Terry Ackerman, Robert Henson, Ric Luecht, and John Willse
University of North Carolina at Greensboro

Jonathan Templin
University of Georgia

Abstract

This chapter presents an in-progress study applying a log-linear diagnostic model to an Algebra II benchmark test, with the ultimate goal of transforming the current pencil-and-paper format into a multistage computer adaptive assessment. Unlike previous applications of diagnostic models, which have been post hoc adaptations, in this project we worked with teachers to purposely build a test that matched a targeted Q-matrix. Several phases, including pilot testing, standard setting, administration, and reporting of results, are described, as are the next steps for converting this process to a multistage computerized test.



Currently we are part of the evaluation effort for a locally and state-funded project called the Cumulative Effect Mathematics Project (CEMP). As part of that effort, we are applying diagnostic classification modeling (DCM) to a benchmark test used in an Algebra II course. Our goal is to eventually make this a computerized DCM assessment. The CEMP involves the ten high schools in the target county that had the lowest performance on the end-of-course (EOC) tests in mathematics. The EOC test is part of the federally mandated accountability testing under the No Child Left Behind legislation. The ultimate goal of the CEMP is to increase and sustain mathematics scores at these ten high schools in line with the other schools in the district.

In North Carolina, teachers follow strict instructional guidelines called the "standard course of study." These guidelines dictate what objectives and content must be taught during each week of the school year; the instruction must "keep pace." Given this regimented pacing, teachers often struggle with how to effectively assess students' learning to make sure students are prepared to take the EOC test at the end of the school year. The test is very "high stakes" because it has implications for both the student (passing the course) and the teacher (evaluation of his or her effectiveness as a teacher).

One common method of formative assessment is the "benchmark test." Benchmark tests provide intermediate feedback on what the student has learned up to the point of administration so that remediation, if necessary, can be implemented prior to the end-of-course test. Benchmark tests constructed within a DCM framework have several advantages:

• Student information comes in the form of a profile of skills that the student has and has not mastered.
• The skills needed to perform well on the EOC are measured directly.
• The DCM profile format can diagnostically/prescriptively inform classroom instruction.
• The profile can help students better understand their strengths and weaknesses.
• When presented in a computerized format, immediate feedback is provided to the teacher and students.

Many DCMs are built upon the work of Tatsuoka (1985) and require the specification of a Q-matrix. For a given test, this matrix identifies which attributes each item measures. Thus, for a test containing J items and K attributes, the J × K Q-matrix contains elements qjk such that

$$q_{jk} = \begin{cases} 1 & \text{if item } j \text{ requires attribute } k \\ 0 & \text{otherwise} \end{cases} \qquad (9.1)$$
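For illustration, a Q-matrix can be stored as a simple binary item-by-attribute array. The short Python sketch below shows how the qjk entries of Equation 9.1 might be represented and queried for a hypothetical 6-item, 5-attribute matrix labeled with the Algebra II objectives; the specific 0/1 entries are invented and are not the matrix used in this study.

```python
import numpy as np

# Hypothetical Q-matrix for 6 items (rows) and 5 attributes (columns).
# Q[j, k] = 1 means item j requires attribute k (Equation 9.1).
attributes = ["1.03", "2.01", "2.02", "2.04", "2.08"]
Q = np.array([
    [1, 0, 0, 0, 0],   # item 1 requires attribute 1.03 only
    [1, 1, 0, 0, 0],   # item 2 requires attributes 1.03 and 2.01
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],   # item 6 requires two attributes
])

def required_attributes(j):
    """Return the labels of the attributes required by item j (1-indexed)."""
    return [attributes[k] for k in range(Q.shape[1]) if Q[j - 1, k] == 1]

print(required_attributes(6))   # -> ['2.04', '2.08']
```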


Figure 9.1  An example Q-matrix for six items and five attributes.

Instead of characterizing examinees with a continuous latent variable, DCMs characterize examinee i with a 0/1 vector/profile, αi, whose elements denote which of the K attributes the examinee has mastered. An example of an item-by-attribute Q-matrix is shown in Figure 9.1. Note that item six requires both attributes D and E, whereas attribute F is measured only by items two and five.

We chose to use the attributes defined by the course objectives and goals in the North Carolina Department of Public Instruction's standard course of study, because students would ultimately be evaluated on the EOC in relation to these objectives and goals. In addition, teachers were already familiar with those definitions and the implied skills. For Algebra II there were five goals:

• 1.03 Operate with algebraic expressions (polynomial, rational, complex fractions) to solve problems
• 2.01 Use the composition and inverse of functions to model and solve problems; justify results
• 2.02 Use quadratic functions and inequalities to model and solve problems; justify results; solve using tables, graphs, and algebraic properties; and interpret the constants and coefficients in the context of the problem
• 2.04 Create and use best-fit mathematical models of linear, exponential, and quadratic functions to solve problems involving sets of data; interpret the constants, coefficients, and bases in the context of the data; and check the model for goodness-of-fit and use the model, where appropriate, to draw conclusions or make predictions


• 2.08 Use equations and inequalities with absolute value to model and solve problems; justify results using tables, graphs, and algebraic properties; and interpret the constants and coefficients in the context of the problem.

Our initial work began with three master teachers. Master teachers are teachers who have demonstrated mastery on rigorous examinations that cover both Professional Teaching Knowledge (PTK) and subject-area knowledge; they are certified by the American Board for Certification of Teacher Excellence (ABCTE). We explained the concept of the Q-matrix to these three teachers and then had them develop a pool of items, each of which measured at least one of the attributes. From this pool of "benchmark" items a 28-item pencil-and-paper pilot assessment was created. These items were then pre-tested, and the quality of each item was evaluated using traditional classical test theory techniques. A final form was created, and the Q-matrix was further verified by another set of five master teachers, who were first shown a very basic example (Figure 9.2) to help them understand how to classify each item.
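The classical test theory screening mentioned above typically involves item difficulty and item-total discrimination statistics; the sketch below shows one common way these could be computed. The response matrix is hypothetical and the code is an illustration, not the project's actual analysis.

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items (1 = correct).
X = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
])

p_values = X.mean(axis=0)            # item difficulty (proportion correct)
total = X.sum(axis=1)                # each examinee's total score

item_total_r = []
for j in range(X.shape[1]):
    rest = total - X[:, j]           # corrected total with item j removed
    item_total_r.append(np.corrcoef(X[:, j], rest)[0, 1])

print("p-values:", np.round(p_values, 2))
print("corrected item-total r:", np.round(item_total_r, 2))
```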

Figure 9.2  An instructional example used with master teachers to review items on the final test and verify which attributes each requires.

Table 9.1  Dependability Coefficients for Each Attribute/Objective for Various Numbers of Raters

Raters    1.03    2.01    2.02    2.04    2.08
  1       0.38    0.73    0.34    0.48    0.66
  2       0.55    0.84    0.50    0.65    0.79
  3       0.64    0.89    0.60    0.74    0.85
  4       0.71    0.91    0.67    0.79    0.88
  5       0.75    0.93    0.72    0.82    0.91
  6       0.78    0.94    0.75    0.85    0.92
  7       0.81    0.95    0.78    0.87    0.93
  8       0.83    0.96    0.80    0.88    0.94
  9       0.84    0.96    0.82    0.89    0.95
 10       0.86    0.96    0.84    0.90    0.95
 11       0.87    0.97    0.85    0.91    0.95
 12       0.88    0.97    0.86    0.92    0.96

We also conducted a generalizability study to examine the dependability of the process of assigning attributes to items. The items were treated as the object of measurement (i.e., how consistently could an item be evaluated in terms of its classification on an objective). Sources of variability included the raters (i.e., the teachers indicating which attributes were required in order to answer the items) and the attributes influencing the items. Attributes were conceptualized as a fixed facet, so a separate analysis was conducted for each attribute. In G-theory there is a coefficient for relative decisions (i.e., ranking) and a dependability coefficient, Φ, for absolute decisions (i.e., criterion-referenced decisions). These dependability coefficients were calculated for one to twelve raters and are displayed in Table 9.1 for each attribute or objective. The row for five raters shows the values for this study. Objectives 2.01 and 2.08 received the most reliable ratings.

The final Q-matrix is shown in Figure 9.3. The average Q-matrix complexity was 1.36 (i.e., the average number of attributes measured by an item): sixteen of the items measure only one attribute, and nine measure two attributes. Item two, which was judged by the master teachers to be measuring attributes 1.03 and 2.01, has the following stem:

If one factor of f(x) = 12x² − 14x − 6 is (2x − 3), what is the other factor of f(x) if the polynomial is factored completely?
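To show how dependability coefficients such as those in Table 9.1 grow with the number of raters, the sketch below computes the absolute-decision coefficient Φ for a one-facet (item × rater) design. The variance components are hypothetical values chosen for illustration; they are not the estimates from this study.

```python
# One-facet G-theory dependability coefficient (absolute decisions), with items
# as the object of measurement and raters as the random facet:
#   Phi(n_r) = var_item / (var_item + (var_rater + var_residual) / n_r)
# The variance components below are hypothetical.
var_item, var_rater, var_residual = 0.12, 0.03, 0.20

def phi(n_raters):
    return var_item / (var_item + (var_rater + var_residual) / n_raters)

for n in (1, 2, 5, 12):
    print(n, round(phi(n), 2))   # dependability increases as raters are added
```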


Figure 9.3  The 25-item Q-matrix that was used as the first benchmark test for the Algebra II course.

Diagnostic Classification Modeling Background

One of the simpler DCMs is the DINA model (the deterministic input, noisy "and" gate model). This model was developed and utilized by Macready and Dayton (1977), Haertel (1989), Doignon and Falmagne (1999), Junker and Sijtsma (2001), and Tatsuoka (2002). For each item, this model divides examinees into two classes: those who have mastered all of the attributes required by the item and those who have not. In this model the probability of correctly responding to item j, where ξij indicates whether examinee i has mastered all of the attributes required by item j, can be written as

$$P(X_{ij} = 1 \mid \xi_{ij}, s_j, g_j) = (1 - s_j)^{\xi_{ij}}\, g_j^{(1 - \xi_{ij})} \qquad (9.2)$$


where sj is the probability that an examinee answers an item incorrectly even though he or she has mastered the required attributes (a "slip"), that is,

$$s_j = P(X_{ij} = 0 \mid \xi_{ij} = 1) \qquad (9.3)$$

and gj is the probability that an examinee answers an item correctly even though he or she has not mastered all the required attributes ("guessing"), that is,

$$g_j = P(X_{ij} = 1 \mid \xi_{ij} = 0) \qquad (9.4)$$
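A minimal sketch of the DINA response function defined in Equations 9.2 through 9.4 is given below; the Q-matrix row and the slip and guess values are hypothetical.

```python
import numpy as np

def dina_probability(alpha, q_row, slip, guess):
    """P(X = 1) under the DINA model (Equation 9.2).

    alpha : 0/1 mastery profile for one examinee
    q_row : 0/1 Q-matrix row for one item
    slip  : s_j, P(incorrect | all required attributes mastered)
    guess : g_j, P(correct | not all required attributes mastered)
    """
    xi = int(np.all(alpha >= q_row))   # 1 only if every required attribute is mastered
    return (1 - slip) ** xi * guess ** (1 - xi)

q_row = np.array([1, 1, 0])            # hypothetical item requiring attributes 1 and 2
print(dina_probability(np.array([1, 1, 0]), q_row, slip=0.10, guess=0.20))  # master: 0.90
print(dina_probability(np.array([1, 0, 1]), q_row, slip=0.10, guess=0.20))  # nonmaster: 0.20
```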

The DINA model may be too simplistic in some applications because any item partitions examinees into only two classes. If an examinee has not mastered one attribute, the assumption is that he or she will perform as well as a person who has not mastered any of the attributes. In these cases, it may be more realistic to assume that the probability of a correct response is a function of the required attributes that one has mastered (i.e., someone lacking only one of the required attributes may perform better than an examinee who has not mastered any of the required attributes). A model that allows for differing item responses that are a function of the item's required attributes is Junker and Sijtsma's (2001) NIDA (noisy input; deterministic "and" gate) model, which is based on the multiple classification latent class model developed by Maris (1999). This model can be expressed as

$$P(X_{ij} = 1 \mid \alpha_i, s, g) = \prod_{k=1}^{K} \left[ (1 - s_k)^{\alpha_{ik}}\, g_k^{1 - \alpha_{ik}} \right]^{q_{jk}}$$

where a new latent variable ηijk represents the ability of examinee i to correctly apply attribute k to the jth item, and sk and gk are now defined in terms of the examinee's ηijk values and the Q-matrix entries (the qjk values):

$$s_k = P(\eta_{ijk} = 0 \mid \alpha_{ik} = 1,\; q_{jk} = 1)$$

$$g_k = P(\eta_{ijk} = 1 \mid \alpha_{ik} = 0,\; q_{jk} = 1)$$
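An analogous sketch of the NIDA response function, in which the slip and guess parameters are attached to attributes rather than to items, follows; all numeric values are hypothetical.

```python
def nida_probability(alpha, q_row, slips, guesses):
    """P(X = 1) under the NIDA model: for each attribute the item requires,
    multiply in (1 - s_k) if the attribute is mastered, or g_k if it is not."""
    p = 1.0
    for a, q, s, g in zip(alpha, q_row, slips, guesses):
        if q == 1:                     # only attributes the item requires contribute
            p *= (1 - s) if a == 1 else g
    return p

q_row   = [1, 1, 0]                    # hypothetical item requiring attributes 1 and 2
slips   = [0.10, 0.15, 0.05]           # s_k, one per attribute
guesses = [0.25, 0.20, 0.30]           # g_k, one per attribute
print(nida_probability([1, 1, 0], q_row, slips, guesses))   # both mastered: 0.90 * 0.85
print(nida_probability([1, 0, 0], q_row, slips, guesses))   # one missing:   0.90 * 0.20
```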

The problem with the NIDA model is that, while it allows for differing expected item responses based on the required attributes that have been mastered (unlike the DINA), it does not allow for items with the same Q-matrix entries to differ (i.e., item-level analyses are not reasonable). Specifically,


for any item requiring attribute k, the slip parameters are constrained to be the same, as are the guessing parameters. As an alternative to the DINA and NIDA, the Reparameterized Unified Model (RUM) allows for both differing item parameters and differing item responses that depend on which of the attributes required by that item have been mastered. This model, defined by Hartz (2002), was based on the Unified Model, which was first introduced by DiBello, Stout, and Roussos (1995). In addition, the RUM (and the Unified Model; Roussos, DiBello, & Stout, 2007), unlike the DINA and NIDA models, introduces a continuous latent variable θi to account for attributes that are not specified in the Q-matrix. This model is given as

$$P(X_{ij} = 1 \mid \alpha_i, \theta_i) = \pi_j^{*} \prod_{k=1}^{K} r_{jk}^{*\,(1 - \alpha_{ik})\, q_{jk}}\; P_{c_j}(\theta_i) \qquad (9.5)$$
where πj* is the probability of getting item j correct given that the examinee has mastered all of the required skills and has a high θi, and rjk* is the discrimination parameter for item j on skill k (it indicates how well item j discriminates between mastery and nonmastery of skill k). The parameter rjk* can also be thought of as the "penalty" for not mastering attribute k. Pcj(θi) is a Rasch-type model with a negative difficulty parameter cj, and θi represents examinee i's knowledge not specified in the Q-matrix. (Note that θ is different from the latent ability estimated in traditional IRT models.)

In this particular study we decided to use a more general DCM called the Log-linear Cognitive Diagnosis Model, or LCDM (Henson, Templin, & Willse, 2009; Rupp, Templin, & Henson, 2010). The LCDM is a special case of a log-linear model with latent classes (Hagenaars, 1993) and thus is also a special case of the General Diagnostic Model (von Davier, 2005). The LCDM defines the logit of the probability of a correct response as a linear function of the attributes that have been mastered. For example, given the simple item 2 + 3 − 1 = ?, we can model the logit of the probability of a correct response as a function of mastery or nonmastery of the two attributes (addition and subtraction). Specifically,

$$\ln\!\left[\frac{P(X_{ij} = 1)}{1 - P(X_{ij} = 1)}\right] = \lambda_0 + \lambda_{add}\,\alpha_{add} + \lambda_{sub}\,\alpha_{sub} + \lambda_{add*sub}\,\alpha_{add}\,\alpha_{sub} \qquad (9.6)$$
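To make Equation 9.6 concrete, the sketch below evaluates the LCDM probability of a correct response to the 2 + 3 − 1 item for each of the four addition/subtraction mastery patterns; the λ values are hypothetical and are not estimates or standard-setting results from this study.

```python
import math

def lcdm_probability(a_add, a_sub, lam0, lam_add, lam_sub, lam_int):
    """P(X = 1) from the LCDM logit in Equation 9.6."""
    logit = lam0 + lam_add * a_add + lam_sub * a_sub + lam_int * a_add * a_sub
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical item parameters (lambdas).
params = dict(lam0=-1.5, lam_add=1.0, lam_sub=1.0, lam_int=1.5)

for a_add in (0, 1):
    for a_sub in (0, 1):
        p = lcdm_probability(a_add, a_sub, **params)
        print(f"add={a_add}, sub={a_sub}: P(correct) = {p:.2f}")
```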

Although the LCDM item parameters can be estimated from response data, it was important to define the parameters so that mastery classifications would be consistent with the standards set by the EOC test. To obtain these probabilities, a standard was set for all possible combinations of mastery; in this way we defined how a student would be classified as a master or nonmaster of each attribute.


Four of the five teachers who verified the Q-matrix also helped perform a standard setting analysis using a modified Angoff approach. For each item, teachers were asked to identify the expected proportion of 100 students with a particular attribute profile who would answer the item correctly. This question was repeated for all possible combinations of mastery and nonmastery of the attributes measured by that item. These proportions were then averaged across raters and used to determine the parameters for each item in the LCDM. For example, consider the following item:

1. If f(x) = x² + 2 and g(x) = x − 3, find f(g(x)).
   a. x² − 6x + 11
   b. x² + 11
   c. x² + x − 1
   d. x³ − 3x² + 2x − 6

Each teacher provided a judgment of the probability of a correct response for students who had not mastered the requisite skills for the item and for those who had mastered them. Figure 9.4 and Figure 9.5 present these two sets of probabilities. Based on the teachers' standard setting responses, the average probability of a correct response was calculated, and these averages were used to compute the item parameters.
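The conversion from averaged standard-setting judgments to LCDM parameters can be sketched for a one-attribute item such as Item 1 by inverting the logit in Equation 9.6. The two probabilities below are invented for illustration and are not the panel's actual averages.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical averaged Angoff-style judgments for a one-attribute item:
p_nonmaster = 0.25   # expected proportion correct for nonmasters of the attribute
p_master = 0.80      # expected proportion correct for masters of the attribute

lam0 = logit(p_nonmaster)         # intercept: logit for the nonmastery class
lam1 = logit(p_master) - lam0     # main effect: increase in logit due to mastery

print(round(lam0, 2), round(lam1, 2))   # approximately -1.10 and 2.48
```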

Figure 9.4  Standard setting results for Item 1, which requires only one attribute.


Figure 9.5  Standard setting results for Item 6, which requires two attributes.

Specifically, using the probabilities from the standard setting associated with each response pattern (the α's), we can compute the logit and solve for the item parameters (the λ's). We then administered the test and, using these λ's, obtained estimates of the posterior probability that each skill had been mastered by each student. From these posterior probabilities a mastery profile was created for each examinee. Examinees' posterior probabilities were categorized as mastery or nonmastery using the rule

Non-master