This edited volume presents a systematic analysis of conceptual, methodological and applied aspects related to the validity of educational assessments in Chile and Latin America.
Table of contents :
Contents
1 Introduction
1.1 A Book on Validity
1.2 Conceptual Contributions
1.3 Technical Contributions
1.4 Chapters That Illustrate Validation Processes
1.4.1 Based on Multiple Evidence
1.4.2 Based on Specific Evidence
1.5 Challenges Around Validity
1.6 To Conclude: Book Structure
References
2 Is Validation a Luxury or an Indispensable Asset for Educational Assessment Systems?
2.1 The Modern Definition of Validity and Validation
2.2 The History of Validity
2.2.1 Early Definitions of Validity
2.2.2 Messick’s Unified Construct of Validity
2.2.3 Cronbach’s Notion of Validation
2.2.4 Kane’s Contribution on Validity Arguments
2.2.5 Are Consequences of Testing Part of Test Validation?
2.3 The Standards for Educational and Psychological Testing
2.3.1 Evidence Based on Test Content
2.3.2 Evidence Based on Response Processes
2.3.3 Evidence Based on Internal Structure
2.3.4 Evidence Based on Relations to Other Variables
2.3.5 Evidence Based on Consequences of Testing
2.3.6 Validation as Integration of Different Types of Evidence
2.4 The Political Dimension of Validation
References
Part I Validity of Student Learning Assessments
3 How to Ensure the Validity of National Learning Assessments? Priority Criteria for Latin America and the Caribbean
3.1 Introduction
3.2 What Does It Mean to Validate Assessments?
3.3 Validation Criteria
3.4 Dimension of Test Alignment with the Official Curriculum
3.4.1 Criterion 1: The Design of the Assessment is Justified in Reference to the Curriculum
3.4.2 Criterion 2: The Assessment Domain is Operationalized by Taking into Consideration Actual Student Learning
3.4.3 Criterion 3: Test Results Allow Accurate and Unbiased Monitoring of the Achievement of Curricular Learning Over Time
3.5 Dimension of Curricular Validity of Performance Levels
3.5.1 Criteria: 4. Performance Levels Are Aligned with the Curriculum and 5. Performance Levels Are Operationalized with Actual Student Learning in Mind
3.5.2 Criterion 6: Performance Levels Describe Qualitatively Different Stages of Learning
3.5.3 Criterion 7: Performance Levels Balance Stability and Change in the Context of a Dynamic Curriculum Policy
3.6 Dimension of Consequential Validity of the Assessments
3.6.1 Criteria: 8. Assessment Results Are Effectively Communicated, and 9. There Are Formal Mechanisms to Support the Use of Assessments to Improve Learning
3.6.2 Criterion 10: There Are Formal Mechanisms to Monitor the Impact of Assessments on the Education System
3.7 Conclusions
References
4 Contemporary Practices in the Curricular Validation of National Learning Assessments in Latin America: A Comparative Study of Cases from Chile, Mexico, and Peru
4.1 Introduction
4.2 Methodology
4.3 Curriculum Validation Practices in Latin America
4.3.1 Dimension of Test Alignment with the Official Curriculum
4.3.2 Dimension of Curricular Validity of the Performance Levels
4.3.3 Dimension of Consequential Validity of Assessments
4.4 Achievements, Challenges, and Implications for a Validation Agenda
References
5 Learning Progress Assessment System, SEPA: Evidence of Its Reliability and Validity
5.1 Introduction
5.2 Learning Progress Assessment System, SEPA
5.3 Routine Validation Agenda in SEPA
5.3.1 Arguments Based on the Content of SEPA Tests
5.3.2 Validity Evidence on Internal Test Structure
5.3.3 Reliability Checks
5.3.4 Fairness Checks
5.4 Occasional Validation Studies Agenda in SEPA
5.4.1 Evidence on the Relationship with Other Variables: Correlation with SIMCE
5.4.2 Usage Studies as Validity Evidence Based on Consequences
5.5 Conclusions
References
6 Validation Processes of the National Assessment of Educational Achievements—Aristas: The Experience of the INEEd (Uruguay)
6.1 The Standardized Assessment in Uruguay and the Creation and Development of the INEEd
6.1.1 Standardized Assessment in Uruguay
6.1.2 The Creation and Development of the INEEd
6.2 Aristas, the National Assessment of Educational Achievement
6.2.1 Reading
6.2.2 Mathematics
6.2.3 Social-Emotional Skills
6.2.4 School Life, Participation and Human Rights
6.2.5 Family Context and School Environment
6.2.6 Learning Opportunities
6.3 Validation Processes for the Use of Aristas Results
6.3.1 Evidence of Content-Related Validity of Reading and Mathematics Tests: Defining the Conceptual Framework and Itemization Process
6.3.2 Evidence of Validity Associated with the Internal Structure of Reading and Mathematics Tests: Psychometric Analysis
6.3.3 Evidence of Validity Associated with the Content of the Social-Emotional Skills Instrument: Definition of the Conceptual Framework and Process of Itemization
6.3.4 Evidence of Validity Associated with the Internal Structure of the Social-Emotional Skills Instrument: Psychometric Analysis of the Instrument
6.4 Conclusions: Challenges Regarding the Use of Evaluation
References
7 The Validity and Social Legitimacy of the Chilean National Assessment of Learning Outcomes (SIMCE): The Role of Evidence and Deliberation
7.1 A Brief History of SIMCE
7.2 The External Committees Convened to Revise SIMCE
7.2.1 A Comprehensive Evaluation Framework for Judging SIMCE
7.3 Evidence to Support the Validity of SIMCE Result Interpretations
7.3.1 Validity as a Relation to Other Measures
7.3.2 Validity as Curriculum Alignment
7.3.3 Validity of the Interpretation of Comparisons Between Years
7.3.4 Quality of Education: More Than SIMCE and More Than Learning Achievement
7.3.5 Validity as Consequences and Impact of SIMCE
7.4 On the Social Legitimacy and Political Feasibility of SIMCE
7.4.1 Two SIMCE Review Committees to Address Crises of Social and Political Legitimacy
7.5 Conclusions and Discussion
7.5.1 A Validation Agenda not Only for SIMCE but also for the Categorization of Schools and the Quality Assurance System
7.5.2 The Institutionality for Conducting a Validation Agenda
7.5.3 On the Social Justification and Political Viability of the National Assessment System
References
Part II Validity of International Assessments
8 Test Comparability and Measurement Validity in Educational Assessment
8.1 What is Comparability and Why is It Important for Validity?
8.1.1 Comparability Between Scores from Different Forms of a Test, Applied on the Same Occasion
8.1.2 Comparability Between Scores from Different Forms of a Test, Applied on Different Occasions
8.1.3 Comparability Between National and International Test Results
8.1.4 Comparability Between Scores from Tests Administered to Different School Levels or Grades
8.1.5 Comparability Between Scores from the Same Test Administered to Different Populations or Sub-populations
8.1.6 Comparability Between Results from the Same Test Administered with Different Devices, to Equivalent Populations
8.2 Technical Considerations for Comparability Between Forms from the Same Test: Equating Between Forms
8.2.1 Equating Designs
8.2.2 Psychometric Methodology to Equate Forms
8.3 Technical Considerations on Comparability Between Measurements from Different Occasions: Non-Equivalent Groups
8.3.1 Equating Designs
8.3.2 Psychometric Methodology to Equate Forms
8.4 Technical Considerations on Comparability Between National and International Tests: Alignment and Concordance Between Scores
8.5 Conclusions
8.6 Appendix: Suggested Readings for Further Discussion on Issues of Comparability
References
9 Measurement of Factor Invariance in Large-Scale Tests
9.1 Introduction
9.2 Why is It Necessary to Measure Invariance in Large-Scale Testing?
9.3 Measurement Invariance and Measurement Bias
9.3.1 Measurement Bias
9.3.2 Invariance
9.4 Measurement of Invariance and Validity
9.5 An Approach to the Factor Analysis Model
9.6 Factor Invariance Levels
9.6.1 Configural
9.6.2 Strict
9.7 Goodness-of-Fit Indices to Assess Levels of Invariance
9.8 Other Aspects of Invariance Measurement
9.8.1 Measurement Invariance and Prediction Invariance
9.8.2 What to Do When There is no Measurement Invariance?
9.9 An Example: Measuring Factor Invariance of the PISA 2015 Environmental Awareness Scale
9.9.1 Background
9.9.2 Method
9.9.3 Results
9.9.4 Synthesis of Findings from the Example
9.10 Conclusions
References
10 Invariance of Socioeconomic Status Scales in International Studies
10.1 Introduction
10.2 Theoretical Framework
10.3 Methodology
10.3.1 Data
10.4 Analytical Strategy
10.5 Results
10.6 Discussion
10.7 Conclusions
References
11 Measuring Attitudes Towards Gender Equality in International Tests: Implications for Their Validity
11.1 Introduction
11.2 The International Civic and Citizenship Education Study (ICCS) 2009
11.2.1 Construction of Scales and Comparability
11.3 Studies on Gender Attitudes in Students
11.3.1 Measuring Attitudes Towards Gender Equality: Rights, Roles and Spaces for Participation
11.4 Data, Variables and Methods
11.4.1 Data
11.4.2 Variables
11.4.3 Method
11.5 Results
11.5.1 One-Dimensional Model Evaluation of Attitudes Towards Gender Equality
11.5.2 Two-Dimensional Model Evaluation of Attitudes Towards Gender Equality
11.5.3 Relationship of the 1 and 2 Factor Models with the Conventional Citizenship Scale
11.5.4 Scope for Latin America
11.6 Discussion
References
Part III Validity of Admissions and Certification Assessments
12 Validity of Assessment Systems for Admissions and Certification
12.1 Introduction
12.2 High-Stakes Tests: The Importance of Defining Their Uses
12.2.1 How to Explore Intended Uses: Program Theory or Logic Model
12.3 Admission Tests and Predictive Validity Studies
12.3.1 Definition of the Criterion
12.3.2 Other Technical Aspects
12.3.3 Example: Ability of the High School Ranking to Predict University Performance in Chile
12.4 Certification Tests, the Standard-Setting Process, and Their Validity
12.4.1 Example Regarding the Validity of Standards: Teacher Assessment in Chile
12.4.2 Criticism of Traditional Standard-Setting Methods
12.5 Validity and Bias in High-Stakes Tests
12.6 Consequential Validity: Unexpected Consequences of High-Stakes Tests
12.6.1 Equity
12.6.2 Other Possible Unexpected Consequences
12.7 How to Integrate Evidence on the Validity of High-Stakes Tests
12.8 Conclusion
References
13 Is the SABER 11th Test Valid as a Criterion for Admission to Colombian Universities?
13.1 Presentation
13.2 The SABER 11th Test
13.3 The Validity of SABER 11 as a Criterion for Admission to Higher Education
13.3.1 Prerequisites Survey
13.3.2 Results
13.4 Natural Science
13.5 Social Sciences
13.6 Mathematics
13.7 Quantitative Reasoning
13.8 Critical Reading
13.8.1 Content Validity
13.8.2 Evidence of Reliability
13.8.3 Predictive Validity Study
13.8.4 Results
13.9 Implications for Validating Various Uses of the Test
13.10 Conclusions
References
14 Validity Evidence of the University Admission Tests in Chile: Prueba de Selección Universitaria (PSU)
14.1 General Antecedents of the University Admission Exams in Chile
14.2 Uses of Test Scores
14.3 Dimensions of Validity That Are Relevant to This Type of Evidence
14.4 Evidence on the Validity of PSU Tests
14.4.1 Reliability of Test Scores
14.4.2 Validity Evidence Based on the Content of PSU Tests
14.4.3 Validity Evidence About Prediction of Academic Performance
14.4.4 Evidence on Predicting Student Retention
14.4.5 Evidence of Validity and Differential Prediction
14.5 Conclusions and Future Agenda
14.5.1 Future Challenges in the Context of Constituting an Admission System for Universities and Technical Higher Education Institutions in Chile
References
15 Validity of the Single National Examination of Medical Knowledge (Eunacom)
15.1 Introduction
15.2 Uses of the Instrument and Abilities Considered
15.3 Framework of Reference
15.4 Reliability
15.5 Validity
15.5.1 Evidence of the Internal Structure of the Examination
15.5.2 Evidence Linking EUNACOM to Test-Takers’ Performance During Their University Studies
15.5.3 Relationship Between EUNACOM Results and International Medical Exam Results
15.6 Evidence of Relationship Between Theoretical and Practical Examination
15.7 Evidence of Relationship Between EUNACOM and PSU
15.8 Relationship Between the Accreditation of Medical Careers in Chile and EUNACOM
15.9 Conclusions
References
Part IV Validity of Teacher Evaluations
16 Teacher Evaluation with Multiple Indicators: Conceptual and Methodological Considerations Regarding Validity
16.1 Introduction
16.2 Validity and Multiple Indicators in Teacher Evaluation: Conceptual Problems
16.2.1 Intended Uses and Usefulness for Public Policy
16.2.2 Frameworks, Standards, and Constructs
16.3 Methodological Considerations on Validity and Multiple Indicators
16.3.1 Instruments and Measurements
16.3.2 Instruments and Sources of Information
16.3.3 Reliability and Validity. Psychometric Models and Multiple Measures
16.4 Future Implications and Considerations
References
17 How Valid Are the Content Knowledge and Pedagogical Content Knowledge Test Results of the Teacher Professional Development System in Chile?
17.1 Introduction
17.2 Teaching Career System and the CK-PCK Test Results
17.3 Consequences of the CK-PCK Test Results
17.4 Institutions Involved in the CK-PCK Test
17.5 Development of the CK-PCK Test
17.5.1 Construct Definition
17.5.2 Table of Specifications
17.5.3 Item Development
17.5.4 Pilot Testing
17.6 Validation of the CK-PCK Test Results
17.6.1 Defining Validity and Validation
17.6.2 Assumptions Underlying the Use of the CK-PCK Test Results
17.6.3 How to Interpret the Evidence?
17.6.4 Assumption #1: CK and PCK Foster Students’ Learning
17.6.5 Assumption #2: The CK-PCK Test Content Adequately Reflects CK and PCK
17.6.6 Assumption #3: CK-PCK Test Results Are Reasonably Reliable
17.6.7 Assumption #4: Sources of Construct-Irrelevant Variance Do Not Excessively Influence Responses to the CK-PCK Test
17.6.8 Assumption #5: CK-PCK Test Results Can Identify Four Levels of Performance
17.7 Discussion
17.7.1 Is There Coherence Between Test Constructs and Intended Policy Use?
17.7.2 Is the Test Content Adequate?
17.7.3 Are Test Results Reasonably Reliable?
17.7.4 What are the Sources of Construct-Irrelevant Variance Affecting the Test Results?
17.7.5 Are There Four CK-PCK Performance Levels?
17.8 Final Recommendations
17.9 Future
References
18 The Portfolio in the National Teacher Evaluation System in Chile: Collecting Evidence of Validity as Part of the Instrument Construction Process
18.1 Introduction
18.2 Uses of the Portfolio in Chilean Teacher Assessment
18.3 Collecting Evidence of Validity
18.4 Construct Modeling and Evidence of Validity
18.5 The Stages of Portfolio Development
18.6 Characterizing the Construct: Portfolio Design and Content Evidence
18.7 Designing the Tasks: Portfolio Development and Teacher Response Processes
18.8 Creating the Outcome Space: Evidence of Rater’s Response Processes
18.9 Selection of a Measurement Model: Looking for Evidence of the Internal Structure of the Portfolio
18.10 Discussion
References
Jorge Manzi · María Rosa García · Sandy Taut (Editors)

Validity of Educational Assessments in Chile and Latin America
Editors:
Jorge Manzi, Pontificia Universidad Católica de Chile, Santiago, Chile
María Rosa García, Pontificia Universidad Católica de Chile, Santiago, Chile
Sandy Taut, Ministry of Education, Gunzenhausen, Germany
ISBN 978-3-030-78389-1    ISBN 978-3-030-78390-7 (eBook)
https://doi.org/10.1007/978-3-030-78390-7

Jointly published with Ediciones UC. Translation from the Spanish language edition: Validez de evaluaciones educacionales en Chile y Latinoamérica by Jorge Manzi, María Rosa García and Sandy Taut, © Ediciones Universidad Católica de Chile, 2019. Original publication ISBN 978-956-14-2471-5. All rights reserved.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Contents (chapters and authors)

1 Introduction (Jorge Manzi, María Rosa García, and Sandy Taut)
2 Is Validation a Luxury or an Indispensable Asset for Educational Assessment Systems? (Sandy Taut and Siugmin Lay)

Part I Validity of Student Learning Assessments
3 How to Ensure the Validity of National Learning Assessments? Priority Criteria for Latin America and the Caribbean (María José Ramírez and Gilbert A. Valverde)
4 Contemporary Practices in the Curricular Validation of National Learning Assessments in Latin America: A Comparative Study of Cases from Chile, Mexico, and Peru (Gilbert A. Valverde and María José Ramírez)
5 Learning Progress Assessment System, SEPA: Evidence of Its Reliability and Validity (Andrea Abarzúa and Johana Contreras)
6 Validation Processes of the National Assessment of Educational Achievements—Aristas: The Experience of the INEEd (Uruguay) (Carmen Haretche and Mariano Palamidessi)
7 The Validity and Social Legitimacy of the Chilean National Assessment of Learning Outcomes (SIMCE): The Role of Evidence and Deliberation (Lorena Meckes and María Angélica Mena)

Part II Validity of International Assessments
8 Test Comparability and Measurement Validity in Educational Assessment (Jorge González and René Gempp)
9 Measurement of Factor Invariance in Large-Scale Tests (Víctor Pedrero)
10 Invariance of Socioeconomic Status Scales in International Studies (Ernesto Treviño, Andrés Sandoval-Hernández, Daniel Miranda, David Rutkowski, and Tyler Matta)
11 Measuring Attitudes Towards Gender Equality in International Tests: Implications for Their Validity (Juan Carlos Castillo, Daniel Miranda, and Angélica Bonilla)

Part III Validity of Admissions and Certification Assessments
12 Validity of Assessment Systems for Admissions and Certification (María Verónica Santelices)
13 Is the SABER 11th Test Valid as a Criterion for Admission to Colombian Universities? (Julián P. Mariño, Adriana Molina, and Yadira Gómez)
14 Validity Evidence of the University Admission Tests in Chile: Prueba de Selección Universitaria (PSU) (Jorge Manzi and Diego Carrasco)
15 Validity of the Single National Examination of Medical Knowledge (Eunacom) (Beltrán Mena)

Part IV Validity of Teacher Evaluations
16 Teacher Evaluation with Multiple Indicators: Conceptual and Methodological Considerations Regarding Validity (José Felipe Martínez and María Paz Fernández)
17 How Valid Are the Content Knowledge and Pedagogical Content Knowledge Test Results of the Teacher Professional Development System in Chile? (Edgar Valencia, Martha Kluttig, and Beatriz Rodríguez)
18 The Portfolio in the National Teacher Evaluation System in Chile: Collecting Evidence of Validity as Part of the Instrument Construction Process (David Torres Irribarra and Álvaro Zapata)
Chapter 1
Introduction
Jorge Manzi, María Rosa García, and Sandy Taut
1.1 A Book on Validity

Educational assessment has developed strongly in Latin America in recent years. Almost all countries in the region regularly assess their students' achievement. Several also do so for selection purposes (for example, access to higher education), and it is increasingly common to develop assessments that address teacher performance. Moreover, international assessments have had a great impact, which has led to a sustained increase in the number of countries in the region that participate in and use the results of PISA, TIMSS and ERCE.

The results of these assessments, in addition to the direct uses assigned to them according to their declared purposes (training, diagnosis, selection, certification and promotion), constitute a fundamental basis for judging the progress of education systems, for making comparisons between groups (for example, men versus women, or socioeconomic and regional groups), and for estimating trends in educational achievement over time. As a result, assessments have become central tools for educational policy and strongly influence the opinions that elites and citizens hold about the educational system.

In this context, assessments in the region have reached a certain level of maturity, as have the technical capacities to design and implement them and to use their results. At the same time, they have been given increasingly broader uses, with greater impact at the individual, institutional and national levels.
Thus, the fundamental question this book seeks to answer is: Is the information provided by these tests and assessment programs of sufficient quality to make their interpretations and proposed uses valuable and defensible?

It is well established that assessments should not be used in the educational sphere without studies supporting the validity of the interpretation of the scores, the uses of the tests, and the comparability of their scores over time and between groups. However, there is consensus that, despite the recognition given to validity, relatively few educational assessment programs have accumulated sufficient evidence to support their interpretations and uses. The problem is that without validity evidence it is not feasible to know how the scores or trends resulting from those assessments should be interpreted, nor is it possible to support the uses of those scores, even if those uses are explicitly reported or enshrined in law. In short, the absence of validity evidence represents a serious threat to assessments, compromising their value as well as their political and technical viability.

Within the framework of the revised Standards for Educational and Psychological Testing, published in 2014 (American Educational Research Association [AERA] et al., 2014), this book addresses conceptual, methodological and applied aspects of the efforts made in several countries to validate different types of large-scale educational assessments. These range from traditional tests that assess student learning achievement to recent assessments of non-cognitive aspects, teacher assessments, and certification and selection tests. Additionally, the book compiles the experience of validity studies on the main international programs present in Latin America (PISA, TIMSS, ERCE, ICCS). Finally, it reveals the challenges that must be considered when assessments are used to compare countries, groups or achievement trends over time.

This book makes a unique contribution in our region for several reasons. One is that it reflects a contemporary view of validity. According to the Standards, validity refers to the amount of evidence that exists to support particular interpretations for the specific proposed uses of test scores. Thus, it is not a matter of establishing the validity of an assessment in a general way, since what needs to be validated is the interpretation and use of the scores in a given context and for a specified population. Consequently, as stated by Kane (2006), validation is a process that involves accumulating evidence that allows us to articulate (or refute) a validity argument, and it begins by explicitly stating the intended interpretations and uses. In this context, validity is a broad and flexible concept. The book illustrates this by including chapters that make a conceptual contribution to validity, as described later on, and other chapters that review not only empirical evidence about test scores but also the assessment development process and its consequences, showing the broad meaning of validity as it is currently understood.

A second contribution of this book rests on the fact that validity has been recognized as a requirement that should be assumed by those who decide to create and use assessment systems.
The Standards are clear on this issue: test developers and those who make decisions based on test results are responsible for conducting validation studies. It is therefore essential to take validity into account from the moment the decision to develop an assessment program is made. This implies allocating resources to provide sufficient evidence to support or refute the interpretation and use of the instruments involved. This requirement is especially relevant for assessments whose results have high stakes for individuals, groups or institutions.

Thirdly, this book illustrates how different countries have addressed the collection of validity evidence in various educational assessment systems. In this way, the book seeks to make decision-makers sufficiently aware of this matter, so that the development and use of assessments, especially those with the highest stakes, have adequate support for the decisions based on them.

We review the book's main contributions in more detail below, considering (i) conceptual contributions; (ii) technical contributions; and (iii) chapters that illustrate validation processes based on multiple or specific evidence. The chapters are thus covered according to the contribution they make; this presentation does not follow the order of the book's Table of Contents.
1.2 Conceptual Contributions

The book contains chapters that constitute important conceptual contributions regarding validity in particular spheres. The chapter by Taut and Lay, Is Validation a Luxury or an Indispensable Asset for Educational Assessment Systems?, opens this volume with a conceptual discussion of validity: it reviews the history of the concept, considering the contributions of its main referents, Messick (1989, 1994, 1995), Cronbach (1989) and Kane (1992), and establishes the current definition of the construct according to the international Standards for Educational and Psychological Testing (AERA et al., 2014). Based on the Standards, it summarizes the five sources of validity evidence for assessments: (i) test content; (ii) response processes; (iii) internal structure; (iv) relations to other variables; and (v) consequences of testing. Regarding this last source of evidence, the authors discuss the long-standing controversy between those who argue that the study of the consequences of assessments should not be part of the concept of validity (Borsboom et al., 2004; Cizek, 2012; Maguire et al., 1994; Popham, 1997; Wiley, 1991) and those who, in contrast, consider it relevant to study both intended and unintended consequences as part of validity, especially in assessments with high stakes for examinees (Kane, 2013; Lane & Stone, 2002; Lane et al., 1998; Linn, 1997; Messick, 1989, 1995; Shepard, 1997). The chapter's contribution also lies in explaining how these different pieces of evidence should be integrated into a line of argumentation on validity, and which actors need to be involved in the process. Finally, the authors analyze the political dimension of validation, pointing out the main obstacles that need to be tackled when research, policy and practice converge in the development of a validation program.
Another chapter that makes an important contribution at the conceptual level is the one by Ramírez and Valverde, How to Ensure the Validity of National Learning Assessments? Priority Criteria for Latin America and the Caribbean. It stands out for its innovative guidance for the development of validation studies in the field of student learning assessments. The authors propose ten priority criteria, or quality standards, for the validation of learning assessments in Latin America and the Caribbean (LAC). They organize them into three dimensions or sources of evidence: (i) evidence on the alignment of tests with the official curriculum; (ii) evidence on the curricular validity of the performance levels used to report assessment results; and (iii) evidence on the consequences or impact of assessments on the improvement of the education system in general, and of learning in particular. For each of these dimensions, the authors describe the criteria to consider and give examples of the aspects that are relevant to bear in mind.

Another important conceptual contribution of the book is found in the chapter by Santelices, Validity of Assessment Systems for Admissions and Certification, which approaches the study of validity in contexts of assessments with high stakes for individuals, such as selection processes for higher education and assessments associated with professional certification. The author discusses and provides examples of validity studies that examine the most frequent uses of this type of assessment: predictive validity studies in the case of selection for higher education, and the validity of performance standards for professional certification, as seen in the case of the Teacher Assessment in Chile. For the latter, the chapter describes the standard-setting process through which a cut-score is established (associated with a performance level), with the expectation that examinees above that score can be regarded as having the abilities or knowledge required to perform certain tasks; the analysis of validity evidence regarding those categories is also described. As proposed by the Standards (AERA et al., 2014), Santelices describes the need to start by making explicit the uses that will be given to the scores obtained in these assessments, which can be done by studying the program theory underlying the assessment. Subsequently, the author analyzes some of the most important limitations of predictive validity studies, which concern the definition of the variable that will be used as a criterion and later related to performance on the instrument of interest. Santelices also analyzes a common methodological problem in this type of study, known as restriction of range in the variability of the scores being analyzed (a brief, purely illustrative sketch of a standard correction for this problem is given at the end of this section). Finally, the author discusses the relevance of studying, as part of the validity argument, the biases and consequences that selection and certification systems may be producing, especially undesirable ones. Particularly relevant are those related to social equity, considering that they affect the social legitimacy of the assessments.

Finally, the analysis of conceptual aspects of validity is completed with the chapter by Martínez and Fernández, Teacher Evaluation with Multiple Indicators: Conceptual and Methodological Considerations Regarding Validity.
The authors analyze the study of validity in teacher assessment systems, reviewing both the conceptual and technical challenges faced when defining, operationalizing and measuring key aspects of a construct as complex as teaching practice. In this context, they start from the premise that the object of validation is not the instruments or indicators of teaching practice but rather the judgments or inferences derived from them, and that this requires evidence of various types and from various sources. The chapter considers teacher assessment systems that involve multiple indicators from the perspective of the validity of inferences about teacher performance, providing guidance on how to combine these different indicators into a model.

This chapter makes a conceptual contribution regarding teacher assessments that use multiple indicators to evaluate performance, analyzing the explicit and implicit assumptions that underlie modern models and systems of teacher accountability and assessment, and their implications for determining the validity of the inferences about performance drawn from these systems. The authors discuss, firstly, the purposes of teacher assessment systems and the most relevant conceptual problems related to the definition and operationalization of the constructs measured; secondly, the choice and properties of the instruments used to measure these constructs; thirdly, the approaches used to conceptualize and investigate reliability and validity with multiple indicators; and finally, the implications for the design and study of teacher assessment systems, based on multiple indicators, aimed at improving teaching practice and learning. Thus, this chapter is a good starting point for reading about validity studies associated with teacher assessment, since it offers a rich conceptual discussion that helps to understand the main dilemmas and challenges associated with achieving valid assessments in this context.
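As a purely illustrative aside, not drawn from the Santelices chapter itself: the restriction-of-range problem mentioned above is commonly handled with Thorndike's Case II correction, which re-expands an observed predictor-criterion correlation using the ratio of the applicant pool's standard deviation to that of the selected group. A minimal sketch with made-up numbers (the function name and all values are ours):

```python
import math

def thorndike_case2(r_restricted: float, sd_unrestricted: float, sd_restricted: float) -> float:
    """Correct a predictor-criterion correlation for direct range restriction
    on the predictor (Thorndike Case II)."""
    u = sd_unrestricted / sd_restricted  # applicant-pool SD relative to selected-group SD
    return (r_restricted * u) / math.sqrt(1 - r_restricted**2 + r_restricted**2 * u**2)

# Hypothetical example: among admitted students, the admission test correlates 0.25
# with first-year GPA, but the admitted group's test SD (60 points) is much smaller
# than the applicant pool's SD (110 points).
print(round(thorndike_case2(0.25, 110, 60), 3))  # prints 0.428, the corrected estimate
```

The numbers are invented; the point is simply that ignoring range restriction can substantially understate the predictive validity of a selection instrument.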
1.3 Technical Contributions

From another perspective, the book incorporates chapters that make substantive contributions to the study of validity at a technical level. In this line, González and Gempp, in Test Comparability and Measurement Validity in Educational Assessment, establish the important connection between validity and comparability in educational tests. If validity has to do with the interpretation of scores, then the comparability of scores is a necessary condition for making judgments about tests that are partially renewed over time, that have alternative forms, or whose results are used to report trends over time. The chapter identifies the different types of possible comparisons and sets out the designs that assessment developers should consider to establish comparability or equating. Finally, it discusses the most commonly used methodologies for obtaining comparable scores and provides practical recommendations for the design of comparability studies.

Pedrero, in Measurement of Factor Invariance in Large-Scale Tests, addresses the potential problem of tests that may have different meanings for examinees belonging to different groups. Although it is usual in educational assessment to assume that the construct or domain of interest can be assessed in different groups using the same instrument, there is awareness that this assumption may not be appropriate.
The construct may have a different meaning among social and cultural groups within the same country, or it may operate differently when an instrument is translated into different languages for international studies. To tackle this problem, Pedrero presents the technique of factorial invariance. With this tool, it is possible to judge the degree to which an instrument performs invariantly across groups, by characterizing the levels of invariance that have been established. Clearly, a lack of invariance poses a major threat to making valid comparisons among groups. The chapter illustrates factorial invariance with a scale taken from an international study.
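To make the machinery discussed in these two chapters concrete, here is a schematic formulation in our own notation (an illustration of standard practice, not an excerpt from the book): score comparability between two forms is often established through an equating transformation, and comparability across groups through a hierarchy of invariance constraints on a common factor model.

```latex
% Linear equating of a score x from form X onto the scale of form Y
% (mean-sigma linking under an equivalent-groups design):
\varphi(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)

% Multi-group factor model for the item-response vector of person i in group g:
x_{ig} = \nu_g + \Lambda_g \eta_{ig} + \varepsilon_{ig}, \qquad \mathrm{Cov}(\varepsilon_{ig}) = \Theta_g

% Increasingly demanding levels of invariance across groups g and g':
% configural:      same pattern of zero and free loadings in \Lambda_g
% metric (weak):   \Lambda_g = \Lambda_{g'}
% scalar (strong): \Lambda_g = \Lambda_{g'},\ \nu_g = \nu_{g'}
% strict:          \Lambda_g = \Lambda_{g'},\ \nu_g = \nu_{g'},\ \Theta_g = \Theta_{g'}
```

Roughly speaking, the more of these constraints hold, the more defensible it is to compare scale scores, and not just structure, across groups; the Pedrero chapter discusses how such levels are tested and judged in practice.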
1.4 Chapters That Illustrate Validation Processes

1.4.1 Based on Multiple Evidence

From another point of view, this book gathers chapters that illustrate validation processes that collect multiple pieces of evidence in order to reach a judgment on the validity of a particular program or assessment. Examples are the chapter by Valverde and Ramírez, Contemporary Practices in the Curricular Validation of National Learning Assessments in Latin America: A Comparative Study of Cases from Chile, Mexico and Peru, which carries out a comparative study of national learning assessment programs in these three Latin American countries; the chapter by Abarzúa and Contreras, Learning Progress Assessment System, SEPA: Evidence of Its Reliability and Validity, regarding language and mathematics tests administered to elementary and secondary school students in the Chilean school system; the chapter by Meckes and Mena, The Validity and Social Legitimacy of the Chilean National Assessment of Learning Outcomes (SIMCE): The Role of Evidence and Deliberation, referring to the Chilean national assessment system of school learning outcomes; the chapter by Mena, Validity of the Single National Examination of Medical Knowledge (EUNACOM), a test that Chilean and foreign medical doctors must take to practice in the Chilean public health system; the chapter by Valencia, Kluttig and Rodríguez, How Valid Are the Content Knowledge and Pedagogical Content Knowledge Test Results of the Teacher Professional Development System in Chile?, about the tests that teachers take in the context of the Teaching Career in Chile; and the chapter by Torres and Zapata, The Portfolio in the National Teacher Evaluation System in Chile: Collecting Evidence of Validity as Part of the Instrument Construction Process, regarding the instrument that gathers a recording of a lesson, along with documents and materials used by the teacher as part of his or her practice. A brief description of each chapter follows.

Firstly, the chapter by Valverde and Ramírez carries out a comparative study of national learning assessment programs in Latin America, delving into the cases of Chile, Mexico and Peru.
The authors analyze the extent to which the assessment programs in these three countries currently address (a) the alignment of the tests in their assessment systems with the official curriculum, (b) the curricular validity of the performance levels used to report test results, and (c) consequential validity evidence for the assessments. For each of these dimensions, the chapter explores the differences and similarities in the countries' assessment practices. The analysis is based on a taxonomy of comparison dimensions (along with guidelines, criteria and key questions) for the validation of national assessments that the same authors present in a conceptual chapter of this book. They conclude by identifying strengths and, especially, pending challenges in the validation of national assessments in Latin America.

The chapter by Abarzúa and Contreras presents multiple pieces of evidence on the validity of the Learning Progress Assessment System (hereinafter SEPA), which offers school administrators, principals and teachers information about student performance in language and mathematics, using standardized tests aligned with the Chilean national curriculum. The authors begin by describing this assessment program, presenting its main purposes and the types of information it produces, and then summarize validity evidence on SEPA following the guidelines of the Standards (AERA et al., 2014). They first review evidence from the program's regular validation agenda, which includes evidence regarding test content and internal structure, as well as reliability and fairness checks. They then present validity studies carried out on a more occasional basis, aimed at supporting arguments about relations to other variables (convergent and discriminant evidence) and about the uses and consequences of the assessments. This chapter is therefore a good example of the different types of evidence that, according to the Standards (AERA et al., 2014), can be accumulated in order to judge the validity of the interpretations and uses of assessment results, in this case in the field of learning achievement and students' progress in the school system.

Another chapter that discusses a wide range of validity evidence, and also engages the conceptual discussion of validity, is the one by Meckes and Mena. This chapter stands out conceptually by broadening the notion of validity following Newton (2007), who considers, as part of the criteria associated with the validation of an assessment system, its social legitimacy and its political and economic viability, given how decisive these may be for the understanding, implementation and continuity of an assessment system. The authors use this conceptual framework to analyze the consequences of, and public questioning directed at, the Chilean National Assessment of School Learning Results (SIMCE). To this end, they review the contributions of the external commissions that have examined the system twice, deliberating and concluding with a diagnosis and recommendations. The method followed and the contributions of these non-conventional, complementary validation exercises are presented in order to analyze the impact of this assessment system, which is associated with important consequences for schools. The chapter highlights the enormous complexity involved in installing, and simultaneously updating, an assessment system of this magnitude, whose uses and consequences have expanded over time.

In the same vein, another chapter that illustrates various types of validity evidence is the one by Mena, which refers to the EUNACOM.
This exam is used to certify the knowledge and abilities required to practice general medicine in Chile. Passing the test is a legal requirement for practicing medicine, for doctors who graduated both in Chile and abroad, and the test is also used to assign places for training in medical specialties. The author reviews evidence of the reliability and validity of this test from multiple perspectives, covering content validity evidence, internal test structure and relations to other variables. Regarding the latter, the author analyzes the relation between EUNACOM and (1) examinees' performance during their university studies; (2) the records and examinations of doctors who graduated abroad; (3) scores on the admission tests for Chilean universities (PSU); and (4) the accreditation of the universities where examinees were trained. The relation between the knowledge and practical sections of the exam is also reviewed. The evidence presented shows that there is adequate support for the interpretation and main use of this test. Thus, this chapter is another interesting example of a validation program that collects multiple pieces of evidence for a high-stakes assessment.

Another chapter that presents multiple sources of validity evidence, in the context of teacher assessment, is the one by Valencia, Kluttig and Rodríguez. The authors analyze validity evidence for the results of the disciplinary and pedagogical knowledge tests taken by teachers working in Chilean schools, in the context of the Chilean Recognition and Promotion System of Teachers' Professional Development (better known as the Teaching Career). As part of the validation study, they start by defining the assumptions underlying the appropriate use of results for the decisions the system seeks to inform. Then, based on empirical and theoretical evidence, they argue how reasonable each of the assumptions is, providing an initial judgment on the validity of the test scores. In this context, the chapter constitutes a good reference for guiding a program validation study, based on the conceptualization of validity in the Standards, which state that validity refers to the amount of evidence that exists to support particular interpretations for the specific intended uses of test scores (AERA et al., 2014). Thus, the authors show the process of validation as one of collecting evidence for constructing (or refuting) a validity argument, as proposed by Kane (2006).

Finally, one last chapter that presents a range of validity evidence for an assessment program, also in the context of Chilean teacher assessment, is the one by Torres and Zapata. The authors address validity evidence for the portfolio, an instrument used to assess teachers in Chile, not from the standpoint of scores and results but from that of the processes that give rise to those scores; the focus of the chapter is therefore on the instrument's development process. The relation between instrument development and the accumulation of validity evidence is addressed through the conceptual framework of Wilson's (2005) Construct Modeling, associating different stages of instrument development with different sources of validity evidence considered in the Standards (AERA et al., 2014). The chapter emphasizes the importance of seeing the validation process as an integral part of instrument development, not only to document afterwards the evidence supporting the intended uses but also to examine instrument quality within a framework of continuous improvement.
1.4.2 Based on Specific Evidence

This book also includes chapters that present more specific and delimited validity evidence, generally linked to one of the types of evidence proposed in the Standards (AERA et al., 2014). Examples include (a) presenting validity evidence associated with the instrument development process, as in the chapter by Haretche and Palamidessi, Validation Processes of the National Assessment of Educational Achievements—Aristas: The Experience of the INEEd (Uruguay); (b) presenting evidence of relations to other variables, with a particular focus on the predictive validity of university selection tests, as in the chapter by Manzi and Carrasco, Validity Evidence of the University Admission Tests in Chile: Prueba de Selección Universitaria (PSU), and the one by Mariño, Molina and Gómez, Is the SABER 11th Test Valid as a Criterion for Admission to Colombian Universities?; and (c) addressing invariance as a substantive attribute supporting validity, as in the chapter by Castillo, Miranda and Bonilla, Measuring Attitudes Towards Gender Equality in International Tests: Implications for Their Validity, based on the international ICCS study, and the one by Treviño and colleagues, Invariance of Socioeconomic Status Scales in International Studies, who address the invariance of this construct using data from different international testing programs (PISA, TERCE and ICCS). Below, we describe the contribution of each of these chapters in more detail.

Haretche and Palamidessi describe validity evidence related to the development process of the ARISTAS instruments, the achievement assessment program carried out by the National Institute for Educational Evaluation (INEEd) in Uruguay since 2017. An interesting aspect of this chapter is that ARISTAS evaluates educational achievement from a broad perspective, which not only includes cognitive dimensions associated with student performance in reading and mathematics but also considers, as achievements of the educational system, improvements in the school environment, the development of socioemotional skills, attitudes toward peers and participation, and the learning opportunities that teachers offer their students in the classroom. All these dimensions are assessed in the third and sixth grades of primary education and in the third year of secondary education. The authors describe validity evidence associated with the development process of the reading and mathematics tests and of the student questionnaires that assess socioemotional skills, along with evidence about the internal structure of both instruments.

The chapter by Manzi and Carrasco summarizes the validity evidence on the test with the highest individual stakes in the Chilean context: the University Selection Test (PSU). The focus is on predictive validity, an aspect that Santelices highlights in her chapter as a major requirement for selection tests. Besides predictive evidence centered on students' performance in their first year of university studies, the chapter includes information that projects predictive validity beyond that period, especially as a predictor of university dropout or mobility.
Complementarily, the chapter also includes the existing evidence on the curricular alignment of the PSU, based on an international review of such tests, and analyzes the evidence on validity and differential prediction of the scores, which makes it possible to address the possibility that the scores are biased (in this case, considering sex and type of school of origin, which in Chile is closely associated with socioeconomic origin).

The chapter by Mariño and colleagues reviews the predictive validity evidence of the SABER 11° test used for admission to Colombian universities, in relation to students' academic performance at university. The authors compare the predictive value of the specific knowledge tests implemented a few years ago as university admission criteria (Social Sciences, Language, Mathematics, Biology, Physics and Chemistry) with the new generic tests (critical reading, quantitative reasoning and citizenship competencies), regarding their ability to predict the average grade obtained by students during the first semester of university. The chapter analyzes the differences found between the two kinds of tests (specific knowledge versus generic abilities), substantiating the change toward using generic tests as admission criteria for Colombian universities. The authors also discuss the complexity created by the exam's multiple purposes and its implications for the study of validity.

Another example of a chapter that reviews specific validity evidence is the one by Castillo, Miranda and Bonilla. The authors analyze data from the International Civic and Citizenship Education Study (ICCS), which, in addition to measuring civic knowledge, includes other instruments related to the concept of citizenship, aimed at measuring attitudes, beliefs and behaviors. In particular, the authors analyze the validity evidence for the scale of attitudes toward equal rights between men and women, using a factorial invariance study. This chapter makes an important contribution to the book by questioning the use, in international studies, of scales that do not have sufficient validity evidence across countries. The authors warn about the risks of misinterpreting the scores of these scales in large testing programs and show the analyses needed to support validity in these studies.

Another example is the chapter by Treviño, Sandoval-Hernández, Miranda, Rutkowski and Matta. Based on data from three international studies, PISA, TERCE and ICCS, they analyze the invariance of socioeconomic status scales. These scales are essential for understanding and comparing the role of socioeconomic background in the performance of individuals and groups, so supporting their invariance across countries is very relevant. The chapter compares the invariance of the socioeconomic scales when the analyses are performed on all countries participating in each study with the more homogeneous Latin American context. The authors find lower levels of invariance in the more heterogeneous set of countries, suggesting that comparisons among countries should be made with caution.
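As with the restriction-of-range sketch above, the following is a toy illustration (simulated data and invented coefficients, nothing taken from the SABER 11 or PSU studies) of the basic form of the predictive-validity comparisons these chapters report: correlate each admission measure with first-semester university grades and compare the variance explained.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Simulated latent academic ability and three observed variables built from it.
ability = rng.normal(size=n)
specific_test = 0.60 * ability + 0.80 * rng.normal(size=n)   # hypothetical subject-knowledge score
generic_test = 0.75 * ability + 0.66 * rng.normal(size=n)    # hypothetical generic-abilities score
first_semester_gpa = 0.50 * ability + 0.87 * rng.normal(size=n)

def variance_explained(predictor: np.ndarray, criterion: np.ndarray) -> float:
    """Squared Pearson correlation between a single predictor and the criterion."""
    r = np.corrcoef(predictor, criterion)[0, 1]
    return r ** 2

print(f"R^2, subject-knowledge test vs GPA: {variance_explained(specific_test, first_semester_gpa):.3f}")
print(f"R^2, generic-abilities test vs GPA: {variance_explained(generic_test, first_semester_gpa):.3f}")
```

Real studies of this kind add the complications discussed in the chapters: corrections for range restriction, additional covariates, retention and dropout as alternative criteria, and checks for differential prediction across groups.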
1.5 Challenges Around Validity

This book is based on the premise that validation should be an effort parallel and integral to the development of educational assessments.
Even when assessments are not associated with consequences, and their results are intended to serve only diagnostic functions, support is needed to interpret their scores properly. However, it is also necessary to warn that many circumstances surrounding validation lead, in several cases, to dilemmas and potential tensions, some of which we mention in this introduction in order to encourage the reading of the chapters that address them.

• Who should be in charge of validating an assessment? The Standards state that this should be an effort engaging both those who create or develop assessments and those who decide to use them. The former must provide the fundamental evidence supporting the interpretation of the scores and the uses that guided the development process. Users are especially responsible for validation when they extend or modify the intended uses of a particular assessment. An exact delimitation of functions is clearly not possible, but where such a delimitation is feasible there should ideally be a coordinated plan for a validation program that not only provides initial evidence but also addresses changes in the instruments (which are often altered when new uses are established). It is equally necessary to consider the potential conflicts of interest that instrument developers may face when they receive compensation for the use of the assessments; it is therefore very important to promote the participation of third parties (especially from the academic field) who can carry out independent validation studies. All these efforts require technical and financial resources, which must be provided for, especially in the case of assessment instruments associated with public policies.

• How should different pieces of validity evidence be integrated? Is it necessary to accumulate all the types of evidence set out in the Standards? Validation represents a sustained effort over time, which must systematically address the different focuses relevant to an assessment, whether in terms of the interpretation of scores or of the established uses. In almost all cases, it is possible to establish the preeminence of a certain type of evidence (such as predictive validity for selection instruments). On other occasions, when an assessment has high personal or institutional stakes, consequential evidence of validity becomes particularly relevant. Furthermore, virtually every assessment must be able to demonstrate its alignment with the frame of reference that guides it (evidence about test content). Given the above, and as documented in several chapters of this book, more than one type of evidence is obtained in many cases, demanding professional judgment to combine the different pieces of evidence in accordance with the validity arguments that motivated the validation studies. In this integration, it will be necessary to prioritize the accumulated evidence according to its relevance to those arguments and to its methodological quality. In sum, although the Standards consider different types of evidence to support validity, not every type is required in every case; however, criteria are needed to integrate the different types of evidence, especially when results are inconsistent or contradictory.
• Validity and the reference framework. All assessments should be guided by an explicit reference framework (which should be relevant and well-founded). Alignment with this framework therefore becomes a fundamental basis of the validation work. In the educational context, this alignment poses particular challenges. The framework, which in the case of school assessments is generally represented by the curriculum, is exposed to revisions and changes that in many cases can compromise (or, in the most extreme case, hinder) the comparison of scores over time (see the chapter by González and Gempp). As a general rule, when curriculum updates become frequent, they strain the alignment of the assessment and affect the interpretation of scores. This does not imply that the frame of reference should remain unchanged (for if it did, it would risk obsolescence). However, review cycles should be reasonably timed and, if possible, planned, so that the entities in charge of assessments can make any necessary adjustments to maintain adequate consistency with that framework.

• What is the balance between focusing validation on the scores and on the process that leads to the scores? The chapters by Torres and Zapata, and by Valencia, Kluttig and Rodríguez, exemplify the importance of taking into account the processes that produce the scores. We refer mainly to the stages associated with the delimitation and analysis of the reference framework, as well as the processes that take place during the development of questions, stimuli or rubrics. These chapters warn us that validation should not focus only on studying the meaning and use of the scores once they have been produced, since the study of the processes that generate them gives us fundamental clues that improve the development of instruments, thus helping to guarantee better-quality information.

• Development of technical capacity to support validity research. This book demonstrates the wide variety of conceptual and technical knowledge required to establish and implement a validity program. Among other aspects, it requires personnel with conceptual, theoretical and applied knowledge of areas as diverse as the assessed domains and constructs, quantitative and qualitative research design, sampling, and the use of advanced statistical and psychometric tools. Such knowledge and competencies are usually scarce in Latin America, making it necessary to coordinate efforts to develop these capacities, both through advanced professional training (especially graduate studies in statistics and psychometrics) and through professional development within the institutions in charge of assessments. International studies (such as ERCE, PISA and ICCS) offer a privileged space to develop abilities in the participating national teams; ERCE, for example, has established regular training opportunities for participating countries.

• Can validity be sacrificed when there is little time and few resources? It is well known that educational policy sometimes faces windows of opportunity in which it is possible to agree upon and implement assessment systems. Under these circumstances, the time frame makes the development of prior validity studies unfeasible. This has happened and will continue to happen in all latitudes, especially in our region. Evidently, this type of context conflicts with the main message of this book. Our aspiration is for decision-makers to recognize the problem these
circumstances pose, and that they establish, as soon as possible, the conditions for the provision of validity evidence about the assessments. What is clearly undesirable, but unfortunately not an exception, is concentrating all resources and efforts on the development and implementation of assessment programs without accompanying them with validity research; validity evidence is also a way of protecting assessments against the mistrust or rejection that they may trigger in certain groups.

• What happens when assessments are set up with unreasonable expectations regarding their value? On many occasions, assessments are presented as central tools of certain education policies (for example, to promote the improvement of school learning or to stimulate teacher professional development). In these contexts, it is very important to establish studies of the consequential validity of assessments. In studies of this kind, as indicated in the chapters by Taut and Lay, and by Santelices, it is fundamental to establish the theory of action that guides assessment programs, identifying, from the perspectives of their proponents, the uses and purposes of such programs and the conditions under which the assessments could promote change. The mere inquiry into the theory of action allows an initial judgment about the realism of the expectations that accompany the assessment, which must be complemented with evidence about the actual uses, intended and unintended, that occur after the introduction of the assessment program. Such studies can gradually contribute to the regulation of expectations, since very high expectations are the ones most likely to go unmet. Through such studies it is also possible to address cases where too many purposes are sought with the same instrument, which represents a special case of unreasonable expectations.
1.6 To Conclude: Book Structure

This introductory chapter has reviewed the various contributions that this book makes from a comprehensive perspective. However, as observed in the Table of Contents, the structure is somewhat different from the presentation in this introduction. The chapter that follows presents a conceptual introduction to validation and to the validation processes of educational assessment systems. After that, the rest of the chapters are organized in four sections: (i) validity in student learning assessments; (ii) validity in international assessments and questionnaires; (iii) validity in assessments for selection and certification purposes; and (iv) validity in teacher assessments. Each section opens with a conceptual chapter that reviews how validity is studied for the assessments specific to that section. The subsequent chapters in each section address both multiple and specific validity evidence of assessments in different contexts in Latin America.
References

Abarzúa, A., & Contreras, J. (2019). Learning progress assessment system, SEPA: Evidence of its reliability and validity. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Borsboom, D., Mellenbergh, G., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061
Castillo, J. C., Miranda, D., & Bonilla, A. (2019). Measuring attitudes towards gender equality in international tests: Implications for their validity. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31–43. https://doi.org/10.1037/a0026975
Cronbach, L. J. (1989). Construct validation after thirty years. In R. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). University of Illinois Press.
González, J., & Gempp, R. (2019). Test comparability and measurement validity in educational assessment. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Haretche, C., & Palamidessi, M. (2019). Validation processes of the national assessment of educational achievements—Aristas: The experience of the INEEd (Uruguay). In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
Kane, M. T. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education/Praeger.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Lane, S., Parke, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24–28. https://doi.org/10.1111/j.1745-3992.1998.tb00830.x
Lane, S., & Stone, C. A. (2002). Strategies for examining the consequences of assessment and accountability programs. Educational Measurement: Issues and Practice, 21(1), 23–30. https://doi.org/10.1111/j.1745-3992.2002.tb00082.x
Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14–16. https://doi.org/10.1111/j.1745-3992.1997.tb00587.x
Maguire, T., Hattie, J., & Haig, B. (1994). Construct validity and achievement assessment. Alberta Journal of Educational Research, 40(2), 109–126.
Manzi, J., & Carrasco, D. (2019). Evidence of validity of the University Selection Test (PSU). In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Mariño, J., Molina, A., & Gómez, Y. (2019). Is the SABER 11° test valid as a criterion for admission to Colombian Universities? In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Martínez, J. F., & Fernández, M. P. (2019). Teacher evaluation with multiple indicators: Conceptual and methodological considerations regarding validity. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Mena, B. (2019). Validity of the single national examination of medical knowledge (EUNACOM). In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Meckes, L., & Mena, M. A. (2019). The validity and social legitimacy of the Chilean national assessment of learning outcomes (SIMCE): The role of evidence and deliberation. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–100). American Council on Education.
Messick, S. (1994). Foundations of validity: Meaning and consequences in psychological assessment. European Journal of Psychological Assessment, 10(1), 1–9. https://doi.org/10.1002/j.23338504.1993.tb01562.x
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8. https://doi.org/10.1111/j.1745-3992.1995.tb00881.x
Miranda, D., & Castillo, J. C. (2018). Measurement model and invariance testing of scales measuring egalitarian values in ICCS 2009. In A. Sandoval-Hernández, M. Isac, & D. Miranda (Eds.), Teaching tolerance in a globalized world (IEA Research for Education: A series of in-depth analyses based on data of the International Association for the Evaluation of Educational Achievement (IEA), Vol. 4, pp. 19–31). Springer.
Newton, P. (2007). Evaluating assessment systems. Qualifications and Curriculum Authority. http://dera.ioe.ac.uk/18993/1/Evaluating_Assessment_Systems1.pdf
Pedrero, V. (2019). Measurement of factor invariance in large-scale tests. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Popham, W. J. (1997). Consequential validity: Right concern-wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13. https://doi.org/10.1111/j.1745-3992.1997.tb00586.x
Ramírez, M. J., & Valverde, G. (2019). How to ensure the validity of national learning assessments? Priority criteria for Latin America and the Caribbean. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Santelices, V. (2019). Validity of assessment systems for admissions and certification. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Shepard, L. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–8. https://doi.org/10.1111/j.1745-3992.1997.tb00585.x
Taut, S., & Lay, S. (2019). Is validation a luxury or an indispensable asset for educational assessment systems? In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Torres, D., & Zapata, A. (2019). The portfolio in the national teacher evaluation system in Chile: Collecting evidence of validity as part of the instrument construction process. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Treviño, E., Sandoval-Hernández, A., Miranda, D., Rutkowski, D., & Matta, T. (2019). Invariance of socioeconomic status scales in international studies. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Valencia, E., Kluttig, M., & Rodríguez, B. (2019). How valid are the content knowledge and pedagogical content knowledge test results of the teacher professional development system in Chile? In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Valverde, G., & Ramírez, M. J. (2019). Contemporary practices in the curricular validation of national learning assessments in Latin America: A comparative study of cases from Chile, Mexico and Peru. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Wiley, D. E. (1991). Test validity and invalidity reconsidered. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in the social sciences: A volume in honor of Lee J. Cronbach (pp. 75–107). Erlbaum.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Lawrence Erlbaum Associates.
Jorge Manzi Doctor in Psychology from the University of California, Los Angeles, United States, and psychologist from Pontificia Universidad Católica de Chile. He is currently Full Professor at the School of Psychology at Pontificia Universidad Católica de Chile and leads the MIDE UC Measurement Center. His areas of expertise are educational assessment, social psychology and political psychology. During the last two decades, he has contributed to the development of educational assessments of national scope in Chile, such as the University Selection Test (the Chilean national test for admission to university) and the Chilean Teacher Evaluation System. He has also been a member of the technical team associated with UNESCO's TERCE and ERCE. Contact: [email protected].

María Rosa García Psychologist and Master in Psychology from Pontificia Universidad Católica de Chile. She is currently an assistant professor at the School of Psychology at the same university and a professional at the MIDE UC Measurement Center. She has done consulting and teaching, mainly on issues related to the construction of measurement instruments and learning assessment. Contact: [email protected].

Sandy Taut Deputy Head of the Quality Agency, Ministry of Education of Bavaria, Germany, Ph.D. in Education from the University of California, Los Angeles (UCLA), USA, and psychologist from the University of Cologne, Germany. She has worked and researched on issues related to educational assessment, teacher, instructional and school quality, and validation of measurement and assessment systems. Contact: [email protected].
Chapter 2
Is Validation a Luxury or an Indispensable Asset for Educational Assessment Systems? Sandy Taut and Siugmin Lay
2.1 The Modern Definition of Validity and Validation Most Latin American education systems have developed assessment programs, and their results constitute a significant reference for discussion on education in each country. These programs range from standardized national and international student assessment systems to teacher evaluation programs and university entrance exams. All these types of evaluations are included in the book you hold in your hands. The information that these assessments produce is expected to be used to inform different types of decisions by different stakeholders. For example, a standardized student assessment system can serve, in a first case, to monitor student competencies at different levels of the education system or, in a second case, to provide diagnostic and comparative information on what students in an institution know and can do. Consequently, users of assessment information are, in the first case, national, regional, and local policymakers and legislators, and in the second case, administrators, school leaders, teachers, students, and their families. This book aims to address a fundamental question in this regard: is the information provided by these tests and evaluation programs of sufficient quality to make these proposed interpretations and uses in fact useful and justified? This question refers to the standards of quality that a test or assessment system must meet. All contributions to this book refer to the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al.,
2014,1 hereinafter referred to as “the Standards”), which describe such expectations for educational (and psychological) testing and assessment systems. The most central of standards—in fact, the first one—points to the concept of validity. According to the Standards, validity refers to how much evidence exists to support particular interpretations of test scores for specific uses of tests. Therefore, the validity of a test or evaluation system does not exist by itself, but only in relation to specific interpretations and intended uses. It has been a long time since validity was understood as a property of the test or evaluation. The statement “this test has high validity” can no longer be said when we adhere to a modern understanding of validity and validation, one that is proposed in the Standards and in this book. For instance, can a school leader interpret the results of an eighth grade standardized mathematics test as diagnostic information about the skills his or her students have acquired up to eighth grade? Are the test results useful for drawing conclusions about teacher quality in mathematics in a given classroom? Can the school leader use the test results to compare his or her school to other schools? These are questions that relate to the validity of a student assessment system, and for each of the interpretations and uses mentioned previously, validity must be examined systematically. It is necessary to know how much evidence exists to support these interpretations and uses, and the extent to which the evidence tells us that these interpretations and uses are not justified and should, in fact, be avoided. The process of examining validity and gathering evidence to support (or refute) a validity argument is called validation (Kane, 2006). Validation begins by explicitly stating the intended interpretations and uses. As can be seen from the mathematics test example above, it is not always easy to identify or agree on what are the intended interpretations and uses of the test scores, and what would be unintended uses in a given context. An educational assessment system must often serve multiple purposes. It is also common for various stakeholders to have conflicting interests about the interpretations and uses of the obtained results. However, having clarity about the overall purposes of the tests and evaluations, the main users, and the interpretations, uses, as well as intended and unintended consequences (perhaps summarized in the form of a logic model or theory of action, see Santelices (2021) in this book) is of great importance as a basis for validation work to be productive. In the same logic, a detailed description of the construct to be measured is needed, because, as a conceptual framework, it tells us what interpretations may or may not be derived from assessment scores. For example, if we want to assess student achievement in a given area, then we must conceptually define what achievement means and implies in this context, so that the assessment really focuses on important aspects of this construct. In this case, one way to conceptually define achievement is through the curricular system that establishes learning objectives and competencies by educational level. If the assessment measures content or competencies not included in the curriculum, then the assessment result could no longer be interpreted as an indicator of student achievement in that particular area. 1
Footnote 1: The 2018 version of the Standards corresponds to the Spanish translation of the 2014 Standards edition; thus, it does not refer to a new version of the Standards.
Therefore, the validation process should explain the propositions that must be met in order for test scores to be interpreted as intended. Each interpretation for a specific use must be supported by evidence. Because testing and assessment systems have dynamic contexts and undergo frequent changes and/or adaptations, validation should be an ongoing process in which evidence is regularly collected in order to re-examine and strengthen a line of argument about the validity of each interpretation or use. Different sources of evidence are necessary and appropriate to generate and contrast a validity argument. According to the Standards, these include: (1) evidence based on test content, (2) evidence based on response processes, (3) evidence based on internal structure, (4) evidence based on relations to other variables, and (5) evidence based on test consequences. Each one of these sources of evidence will be explained in greater detail later in this chapter. Although each of these sources of evidence contributes separately to examining the propositions that underlie certain interpretations of the test, these do not constitute different types of validity, since validity is now understood as a unitary concept. The final, and the most challenging step in validation is to integrate the different sources of evidence into a coherent argument about the validity of a given interpretation and use of the test or assessment results. For example, one might want to validate the use of scores in a math test to compare the math skills of students from different schools in an education system. For this use, we could collect evidence that the test content is reflecting the curriculum implemented in the schools; examine that the test’s internal structure shows the different sub-domains of mathematical competence; ensure—through studies of response processes—that students understand the test items in the intended way; examine relations to other relevant variables (e.g., grades in mathematics, at school level); and implement studies on the intended and unintended consequences of testing on students, teachers, and schools. Then, we would have to integrate the evidence and reach a conclusion about the possibility of interpreting these math test scores as a basis for making comparisons between schools. While this book’s focus is on validity as the central aspect of ensuring a useful and effective testing and assessment system that fulfills its purposes, the Standards indicate that reliability/precision and fairness are additional, important aspects of quality. On the one hand, reliability/precision refers to the consistency of the score between different applications, forms, or parts of the test, and is typically reported as a reliability index and a standard error of measurement. Reliability/precision is a necessary, but not sufficient, condition for validity. That is, even when test scores are highly consistent, this does not imply that the instrument is able to diagnose, predict, or promote appropriate decisions. Fairness, on the other hand, points to the ethical obligation to give all examinees or test takers equal access to demonstrate their ability in the test without obstacles. Fairness also impacts on validity and reliability/precision, particularly when the test is biased towards certain subgroups, i.e., when—despite having the same level of the attribute to be measured—a specific subgroup obtains a lower score in the assessment than the rest of those being tested.
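To make the reliability/precision indices mentioned above concrete, the sketch below computes coefficient alpha (one common reliability index) and the corresponding standard error of measurement from a matrix of item scores. This is only a minimal illustration in Python; the simulated data and the choice of coefficient alpha as the reliability estimate are assumptions made for the example and are not prescribed by the Standards.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def standard_error_of_measurement(items: np.ndarray) -> float:
    """SEM = standard deviation of total scores * sqrt(1 - reliability)."""
    total = items.sum(axis=1)
    return total.std(ddof=1) * np.sqrt(1 - cronbach_alpha(items))

# Hypothetical 0/1 item responses: 200 examinees, 20 items (synthetic data)
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
scores = (ability + rng.normal(scale=1.2, size=(200, 20)) > 0).astype(float)

print(f"alpha = {cronbach_alpha(scores):.2f}")
print(f"SEM   = {standard_error_of_measurement(scores):.2f} raw-score points")
```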
We will now briefly describe the history of the concept of validity in order to explain how it became a unitary concept, how it evolved into an approach centered on interpretations and intended uses, and how it became an argument-based approach.
2.2 The History of Validity 2.2.1 Early Definitions of Validity Until the early 1920s, there were no major questions regarding the quality of the measurements used in research and their relationship to the constructs to be measured. This issue was only addressed in the course of the measurement industry boom for selection purposes in World War I. In order to achieve a consensus on the procedures and terms used in measurement, back then validity was defined as the degree to which a test or assessment measures what it claims to measure (Ruch, 1924). This corresponds to the classical definition of validity, and at that time referred to the measurement itself and not to the interpretation of the results or the use of the test. Based on this definition, two different approaches were developed in order to establish proof of validity about a test. On the one hand, a logical analysis of test content by experts could indicate whether the evaluation was measuring what it claimed to measure. On the other hand, the correlation between the test and what it was intended to measure was a method that could provide empirical evidence of validity. The development of these different approaches to establish test validity prompted the definition of validity to evolve, and various types or aspects of it began to be differentiated. For instance, Cronbach (1949) differentiated between logical and empirical validity, while the APA in its first Standards (American Psychological Association, 1952; American Psychological Association et al., 1954) went further and distinguished four types of validity: predictive, concurrent, content, and construct validity. These four types of validity also implied different types of analyses to provide relevant evidence. Predictive validity was to be established by correlating test results with a variable allegedly predicted by the measured construct, assessed some time later. Concurrent validity was to be established by evaluating the relationship between the results of a test and a variable measuring the same construct, at the same point in time. Content validity was to be established when the test was considered a sample of the universe that operationally defines the variable to be measured (e.g., evaluating writing speed through a writing exercise). Finally, construct validity was to be established when performance on the test was presumed to rest upon a hypothetical attribute or quality. That is, unlike content validity, this attribute would not be defined operationally but theoretically. Therefore, in construct validity, the interpretation of what is measured becomes relevant. In this case, no particular type of analysis existed by default, but theory-driven predictions had to be generated and tested, since neither logical analysis nor empirical evidence was sufficient.
Later, in the second and third editions of the Standards (APA et al., 1966, 1974), content and construct validity were preserved, while predictive and concurrent validity were brought together under a single category: criterion-related validity. In sum, validity initially did not refer to the interpretation of results or the use of the test, but to the measurement instrument itself. Additionally, four types of validity were differentiated: predictive, concurrent, content, and construct validity (or three, when considering predictive and concurrent validity as part of criterionrelated validity). However, these different aspects or approaches to validity were paving the path for alternative conceptions of validity.
2.2.2 Messick’s Unified Construct of Validity Messick (1989, 1994) criticized this traditional view on validity, considering it incomplete and fragmented, while making considerable contributions to the theory of validity. First, Messick indicates that validity no longer refers to a property of the test regarding whether or not it measures what it claims to measure, but validation is a scientific process that must focus on the meaning or interpretation of test scores. This change of focus is based on the fact that test scores do not exclusively depend on the questions and items in the test, but also on the examinees and the context. Thus, what needs to be valid is the scores’ meaning or interpretation, as well as any practical implications that arise from the test (Messick, 1995). Moreover, Messick recognizes that the study of validity not only plays a scientific role in measurement, but it has a social value and plays a political role too, by impacting on judgments and decisions outside the measurement realm. Because of this, validity must take into consideration not only the scores’ meaning as a basis for action, but also the social consequences of the use of the test. In other words, Messick integrates an ethical component into validity. Another contribution of Messick is to conceive validity as a unitary concept of construct validity. The author identifies two main threats to the valid interpretation of test scores: factors irrelevant to the construct and under-representation of the construct. Through the unitary concept of construct validity, Messick integrates logical analysis and empirical evidence into the validation process. Therefore, for this author, validity is no longer the result of a single analytical or empirical study, and he calls for as much analysis and evidence as possible to be gathered before establishing a validity claim. As can be seen, he no longer refers to types of validity, but to types of evidence that support validity. Therefore, according to Messick, the types of evidence must be combined, and finally, based on the evidence gathered, a judgment must be made as to whether or not the test scores can be interpreted as intended and whether the test can be used for the purposes for which it was intended. Messick’s unified concept of validity considers the aspects of criterion, content, and consequences within a construct validity framework. Therefore, construct validity is the validity of the test that looks
into the meaning of the score and the consequences of test use. Messick’s contributions to validity have been fundamental, and they strongly set the foundations for the fifth edition of the Standards (AERA et al., 1999) and inspired Kane’s more recent contributions (2015, 2016).
2.2.3 Cronbach’s Notion of Validation Cronbach, like Messick, emphasized the importance of constructs in validation. Together with Meehl, he pointed out that a construct is defined by a nomological network that would relate this construct to other constructs as well as to observable measures (Cronbach & Meehl, 1955). These expected relationships make construct validation possible. If in a validation study, those expected interrelationships are not found, then there would be a flaw in the posited interpretation of the test score or in the nomological network. In the first case, construct validity would not be supported by the evidence; in the second, the nomological network would have to be modified, and therefore, the construct would change. In this sense, the theory behind the construct is central, since it establishes the relationships between the different constructs, guiding the hypotheses that will be tested in the construct validation studies. In the same vein, Cronbach (1989) advocated a strong validation program. He described a weak program as one in which the researcher does not define a construct to lead the research, but rather the investigation aims to obtain some result without having a clear course. Whereas a strong validation program has the construct at its core, from which construct hypotheses are derived, and different types of relevant evidence are gathered to test them. In other words, establishing a construct in the strong validation program allows a more focused validation, while in the weak program this focus is lost. According to Cronbach, a strong program with well-defined constructs must be at the core of validation. Finally, Cronbach (1989) also indicates that there is an obligation to review the appropriateness of a test’s consequences and prevent them from being adverse. However, unlike Messick, Cronbach does not make any statements as to whether or not the consequences should be included in the validation process itself.
2.2.4 Kane’s Contribution on Validity Arguments Kane complements Cronbach’s and Messick’s contributions. While Messick emphasized the importance of making a general judgment of validity based on different types of evidence, Kane developed in more detail how such a judgment can be constructed and tested. Kane (1992) stated that the interpretation of the scores and the use of the test always involve reasoning based on interpretive and use arguments. This reasoning describes the propositions that lead from the test score—the premise—to the interpretations
and decisions based on that score—the conclusions. Hence, for Kane, the validation process consists of two steps: (1) all propositions underlying the interpretations and uses of test scores must be converted into an explicit argument for interpretation and use, and (2) the plausibility of these propositions must be critically assessed in a validity argument. In this way, the propositions guide the search for adequate methods and evidence that make it possible to support or reject any given interpretation and use of the test (Kane, 2015, 2016). In summary, Kane emphasizes the importance of the propositions underlying the interpretation and use of test scores, and elaborates the validation process as an argumentation exercise. Kane's contribution has been incorporated into the most recent editions of the Standards (AERA et al., 1999, 2014).
2.2.5 Are Consequences of Testing Part of Test Validation? Several scholars have highlighted the importance of evaluating measurement consequences in validation processes, in particular, those unintended consequences that are generated as a result of the assessment (Kane, 2013; Lane & Stone, 2002; Lane et al., 1998; Linn, 1997; Messick, 1989, 1995; Shepard, 1997). This issue is especially relevant in high-stakes contexts, since research has shown that several undesirable processes often occur in these systems that distort the results, for instance, the excessive preparation on the test content that can lead to a narrowing of the curriculum (Brennan, 2006; Kulik et al., 1984; Shepard, 2002). As Kane (2006) points out, high-stakes assessment programs must be considered as educational interventions that require thorough evaluation, like any other educational program or major reform attempt (see also Zumbo, 2009). Lane and her colleagues (1998) argue that it is essential to include both intended and unintended consequences of educational assessments as part of validation studies. The authors argue that these studies should consider the perspectives of all possible stakeholders in the assessment, from the community and policy makers, to students and their families (Linn, 1997). Moreover, this evaluation of the effects should also be conducted at various levels of analysis: at a program level, at a school district level, and at school and classroom levels (Lane & Stone, 2002). However, there is controversy on this issue, as other researchers have argued that while the consequences of assessments are important to consider, they should not be part of the validity concept (Borsboom et al., 2004; Cizek, 2012; Maguire et al., 1994; Popham, 1997; Wiley, 1991). For instance, Popham (1997) states that validity must take into account the accuracy of inferences that are made based on test scores, i.e., validity refers to the meaning of the scores. The consequences would be a step beyond inferences. Therefore, while examining the consequences is relevant, it should not be part of the validation process. Furthermore, Cizek (2012) states that scientific and ethical arguments are logically incompatible, and therefore, validity must refer to the inferences of the test scores, while the use—and therefore the consequences—would only be part of the justification for the use of the test. In other words, validity would
not depend on the use of the test. In this sense, these authors choose to leave the analysis of a test’s consequences out of the validation process. However, the Standards clearly position themselves in favor of including the consequences in the validation. In the following section, the Standards’ recommendations are described in more detail since they represent a key reference in the educational measurement context.
2.3 The Standards for Educational and Psychological Testing According to the most recent edition of the Standards, validity refers to “the degree to which evidence and theory support interpretations of test scores for proposed uses of tests” (AERA et al., 2014, p. 11). The first standard of validity expresses in the following way what is—in general—expected as the basis for any validation exercise: “Clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided” (Standard 1.0; AERA et al., 2014, p. 23). First, the Standards describe how the proposed interpretations and uses must be articulated. These should not only be clearly described, but also the population for which the test is intended must be defined, along with explicitly stating unforeseen interpretations, and/or interpretations that have not been sufficiently supported by evidence. If, following the interpretation and use of test scores, specific effects or benefits are expected, these should be supported by relevant evidence. This also applies to indirect benefits that are expected as a result of the assessment system in question. The degree to which the preparation for the test would change, or would have no effect, on the test result must also be explained. Finally, the samples and contexts in which validity studies are implemented must be described in sufficient detail to judge their quality and relevance. In addition, the multiple interpretations and uses of the test must be validated independently of each other. In principle, it seems difficult for an assessment to have multiple purposes and uses, but in practice this is not unusual. Nonetheless, each of them must be supported by sufficient evidence of validity to be justified. Then, the Standards distinguish between different types of evidence of validity that can be collected and that will be more or less relevant depending on the interpretation and use that we want to give to the test results. Each type of evidence is described below in more detail.
2.3.1 Evidence Based on Test Content

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified (Standard 1.11; AERA et al., 2014, p. 26).
The test content consists of the topics, wording, and format of the test items, tasks, and questions. Content-based evidence refers to the consistency between the test content and the content domain or the construct being measured. This type of evidence also analyzes whether the content domain is relevant to the interpretation we want to give to the test scores. This highlights the importance of the content domain specification, since it describes and classifies the content areas and types of items to be used. A test may fail to capture some important aspect of the construct to be measured, thus the construct would be underrepresented. For example, a test that aims to measure reading comprehension should consider a variety of texts and reading materials, in order to cover reading comprehension of all possible types of texts (e.g., essays, news, etc.). At the same time, test results could be affected by sources of variance irrelevant to the construct to be measured, as would be the case if the test had very long reading texts that required the examinees to have a great capacity to memorize. In this case, we would be evaluating not only reading comprehension, but also memory. This type of evidence may involve logical or empirical analyses. The use of expert judgment can support the search for evidence based on test content, and to assist in the identification of possible inconsistencies between test content and the content domain. In the example above, language teachers could judge how representative of the curricular standards the texts and questions are on a reading comprehension test. The search for this type of evidence is of utmost importance in situations where a test that was designed for a given use is to be used for a new purpose, since the appropriateness of the construct to be measured is related to the inferences that are made based on test scores. Evidence based on test content is also useful for addressing score differences between subgroups of examinees, as these may account for sources of variance that are irrelevant to the content domain. For example, an employment test that uses more advanced vocabulary than required for the job position may create an unfair disadvantage for applicants whose first language is not English, accounting for a source of variance irrelevant to the construct.
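A simple way to organize the expert judgments described above is to summarize, item by item, how strongly a panel considers each item aligned with the content domain, as in the hedged sketch below. The rating scale, the 80% agreement threshold, and the ratings themselves are assumptions invented for the example, not a procedure taken from the Standards.

```python
import numpy as np

# Hypothetical ratings: 5 curriculum experts rate 8 items on a 1-4 scale
# (1 = not aligned with the intended standard, 4 = fully aligned).
ratings = np.array([
    [4, 4, 3, 4, 2, 4, 1, 3],
    [4, 3, 4, 4, 2, 4, 2, 4],
    [3, 4, 4, 4, 1, 3, 2, 3],
    [4, 4, 4, 3, 2, 4, 1, 4],
    [4, 4, 3, 4, 2, 4, 2, 3],
])

mean_alignment = ratings.mean(axis=0)        # average rating per item
pct_aligned = (ratings >= 3).mean(axis=0)    # share of experts rating 3 or 4

for item, (m, p) in enumerate(zip(mean_alignment, pct_aligned), start=1):
    flag = "  <- review against the content domain" if p < 0.8 else ""
    print(f"Item {item}: mean rating {m:.1f}, {p:.0%} judged aligned{flag}")
```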
2.3.2 Evidence Based on Response Processes

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided (Standard 1.12; AERA et al., 2014, p. 26).
Another type of evidence is that which analyzes the consistency between the construct and the response process of test takers. For instance, in the case of a reading comprehension test, it is expected that there will indeed be an understanding of the words used and a global understanding of the texts presented, and not that the nature of the examinees’ responses reflects a different mental process. To collect this type of evidence, we can ask a sample of examinees about their response strategies, examine their response times or eye movements, or look into the relations between different parts of the test. Evidence based on response processes can also be used to detect measurement bias, as it can shed light on differences in the interpretation of scores between subgroups of respondents. This type of evidence could inform about capacities that may be influencing subgroups’ performance in different ways. This evidence can be gathered not only from examinees, but also from reviewers who can provide useful and relevant information. In the event that the test is scored or corrected by an external party, we can evaluate the consistency between the scorer’s correction processes and the criterion to be used according to the construct to be measured. For example, among raters who correct an essay, we must ensure that their grading is not influenced by the quality of the examinee’s handwriting, but that they use the criterion that refers to the construct they must evaluate.
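One of the sources of response-process evidence mentioned above is the examination of response times. The sketch below illustrates a simple screen for implausibly fast responses, which may signal guessing rather than the intended cognitive process; the synthetic log data and the 10-second threshold are assumptions made purely for illustration and would need to be replaced with real timing data and an empirically justified cut-off.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic log data: response times in seconds and correctness flags,
# for 500 examinees on 12 items (placeholders, not real test data).
times = rng.lognormal(mean=3.2, sigma=0.5, size=(500, 12))
correct = rng.integers(0, 2, size=(500, 12))

RAPID = 10.0  # seconds; an illustrative rapid-guessing threshold

for item in range(times.shape[1]):
    fast = times[:, item] < RAPID
    if fast.any():
        share = fast.mean()
        acc_fast = correct[fast, item].mean()
        acc_slow = correct[~fast, item].mean()
        print(f"Item {item + 1}: {share:.1%} rapid responses, "
              f"accuracy {acc_fast:.2f} (rapid) vs. {acc_slow:.2f} (normal pace)")
```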
2.3.3 Evidence Based on Internal Structure

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided (Standard 1.13; AERA et al., 2014, pp. 26–27).
Evidence based on the internal structure points to the degree of consistency in the relationship between the test items and the defined components of the construct to be measured. For instance, if the conceptual framework of the test defines the construct to be measured as unidimensional, then this must be reflected in item homogeneity. However, if the construct was multidimensional, the item response pattern should reflect this. For example, in a science test that assesses knowledge in the natural sciences, including chemistry, physics, and biology, one would expect chemistry items to be related to each other more strongly than with physics and biology items. There are different statistical techniques that allow us to examine and check if one
or more factors are associated with the items, such as exploratory factor analysis and confirmatory factor analysis. In the case of the science test, we would expect that the exploratory factor analysis would account for three factors according to the sub-dimensions (chemistry, physics, and biology).
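The factor-analytic check described above can be sketched as follows. Here, scikit-learn's FactorAnalysis (with varimax rotation, available in recent versions) is used as a simplified stand-in for a full exploratory factor analysis, and the simulated "science test" data are purely illustrative; a real study would also evaluate model fit and consider confirmatory models.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 1000

# Simulate three sub-domains (chemistry, physics, biology), 5 items each
domain_scores = rng.normal(size=(n, 3))
loadings = np.zeros((15, 3))
for d in range(3):
    loadings[d * 5:(d + 1) * 5, d] = 0.8
items = domain_scores @ loadings.T + rng.normal(scale=0.6, size=(n, 15))

fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
fa.fit(items)

# Rows of components_ are factors, columns are items; we expect each block
# of five items (one sub-domain) to load mainly on a single factor.
np.set_printoptions(precision=2, suppress=True)
print(fa.components_.T)
```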
2.3.4 Evidence Based on Relations to Other Variables

In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence (AERA et al., 2014, p. 16).
Evidence based on relationships with other variables takes on special relevance when the intended interpretation for the use of a test assumes a relationship between the construct and another variable external to the test. This external variable can be a similar or a different construct to the one intended to be measured, a criterion that the test should be able to predict, another test that measures the same construct, etc. For example, one might expect that in a test of critical thinking skills, results would depend on the intensity of instruction and practice with this type of problems. In this case, we could obtain evidence based on relationships with other variables by evaluating the relationship between the test score and indicators about the quality or frequency of the examinee’s instruction in this type of skill. We can also analyze the relationship between the result of a test and other measurements of the same construct or variables that are similar to the construct to be measured. In this case, the evidence would be convergent, since the two measurements are expected to be related. For example, the results of a multiple choice test that assesses knowledge in chemistry should be related to other tests that also assess knowledge in that discipline, but in a different format, such as tests using open-ended questions. On the contrary, we could evaluate the relationship between the test score and other measurements of variables that are theoretically different and should not be associated with the construct. The evidence in this case would be discriminant, as we would expect the relationship between the two to be low or non-existent. In the case of the chemistry test, its results should be less related to the result of tests in more distant disciplines, such as mathematics or history. The test–criterion relationship should be evaluated when the test is expected to predict an attribute operationally different from the test, as in the case of an employment test to select suitable applicants for a particular position. In this example, it may be important to gather evidence on whether performance on the test effectively predicts subsequent performance in that position. This would be a predictive study, as the test scores are expected to predict the criterion scores obtained later (i.e., subsequent performance). In cases where alternative measures are to be developed to assessments already approved for a given construct, the designed study would be concurrent, since both tests must be carried out at approximately the same time in
order to avoid differences that could be explained by the time lag. Whenever the test–criterion relationship is to be studied, we should report information on the criterion's technical quality and appropriateness, since the credibility and usefulness of these studies depend on this information. In order to gather this type of evidence, it is essential to establish an appropriate criterion and to measure it according to high technical standards.
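As an illustration of the convergent, discriminant, and test–criterion analyses just described, the sketch below computes the corresponding correlations on simulated data. The variables (a multiple-choice chemistry test, an open-ended chemistry test, a history test, and a later course grade) and all numerical values are invented for the example and carry no empirical claim.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Hypothetical scores for the same group of examinees
chem_mc = rng.normal(size=n)                                  # multiple-choice chemistry test
chem_open = 0.7 * chem_mc + rng.normal(scale=0.7, size=n)     # open-ended test, same construct
history = 0.1 * chem_mc + rng.normal(size=n)                  # distant discipline
course_grade = 0.6 * chem_mc + rng.normal(scale=0.8, size=n)  # later criterion

def r(x, y):
    """Pearson correlation between two score vectors."""
    return np.corrcoef(x, y)[0, 1]

print(f"Convergent evidence   (chem MC vs. chem open-ended): r = {r(chem_mc, chem_open):.2f}")
print(f"Discriminant evidence (chem MC vs. history):          r = {r(chem_mc, history):.2f}")
print(f"Test-criterion (predictive) evidence:                 r = {r(chem_mc, course_grade):.2f}")
```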
2.3.5 Evidence Based on Consequences of Testing

Decisions about test use are appropriately informed by validity evidence about intended test score interpretations for a given use, by evidence evaluating additional claims about consequences of test use that do not follow directly from test score interpretations, and by value judgments about unintended positive and negative consequences of test use (AERA et al., 2014, p. 21).
Test use entails several consequences that may or may not be aligned with the interpretation and use of the test initially intended by the test developer. First of all, evidence based on consequences of the test use must analyze whether the effect or consequence of the assessment is consistent with the intended test interpretation and use. For example, if it is claimed that a math test can be used by teachers to diagnose students’ needs for remediation classes regarding specific sub-areas, then evidence should be presented as to whether the test results actually provided useful diagnostic information as perceived by teachers. However, there may be other anticipated consequences of the use of the test that go beyond the direct interpretation of the score. It could be expected that a test primarily used for admission to higher education might indirectly improve teaching quality via teachers taking responsibility for the learning achievements of their students (i.e., accountability). This consequence of the use of the test should then also be validated. Additionally, there may be unforeseen consequences that may or may not affect the validity of the interpretation of the results. This would depend on whether or not these consequences originate from a source of error in the interpretation of the test score. In the example of the selection test, a subgroup of test takers (e.g., depending on age) may have a higher pass rate than the rest of test takers, and this would constitute an unintended consequence. If at the basis of this consequence, it is found that there are components in the test that negatively impact the score obtained by this subgroup, and that these are irrelevant to the construct to be measured, this consequence would invalidate the interpretation and use of the test (e.g., items in the test related to technology that are not part of the construct to be measured, and that favor younger test takers because they are more familiar than the rest of the test takers with this particular topic). However, if at the base of this consequence, we find in the student population an unequal distribution of the knowledge to be measured and this knowledge is relevant to the construct to be measured, then this difference would not invalidate the interpretation of the test results.
In this regard, it is essential to differentiate between (a) consequences arising from an error in the interpretation of the test score for an intended use and (b) consequences arising from another source unrelated to the test itself. The latter may influence the decision to use or not to use the test, but they do not directly invalidate the interpretation and use of the test. For instance, another unintended consequence would be that schools neglect the teaching of some subjects to focus only on those assessed in the higher education entrance test. This consequence could affect the decision to continue or not using the test, but it will not invalidate its interpretation and use.
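A first descriptive step in examining subgroup differences like the one discussed above is simply to compare outcomes across subgroups, as sketched below. The subgroups, pass rates, and the two-proportion z statistic used as a rough screen are assumptions for illustration; a gap found this way does not by itself establish whether it reflects construct-irrelevant bias or real differences in the attribute being measured.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical pass/fail outcomes (1 = pass) for two age subgroups of test takers
younger = rng.binomial(1, 0.72, size=600)
older = rng.binomial(1, 0.58, size=400)

p_young, p_old = younger.mean(), older.mean()
gap = p_young - p_old

# Pooled two-proportion z statistic as a rough screen (not a bias analysis itself)
p_pool = np.concatenate([younger, older]).mean()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(younger) + 1 / len(older)))
z = gap / se

print(f"Pass rate younger: {p_young:.2%}, older: {p_old:.2%}, gap: {gap:.2%}, z = {z:.1f}")
```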
2.3.6 Validation as Integration of Different Types of Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses (AERA et al., 2014, p. 21).
In order to determine what types of evidence should be collected to validate an interpretation of scores for an intended use of a test, it is useful to consider the propositions that support the intended interpretations and, subsequently, to look for relevant evidence. Because multiple propositions can support a given use of a test, it is necessary to seek evidence for each one of them and integrate all the evidence to validate the interpretation. For example, we might want to judge the validity of the use of a science test for determining whether students are prepared to take a more advanced course in that subject. A first statement that would support this use of the test is that the test scores effectively predict student success in the advanced science course. This statement would be examined with studies that provide evidence of relationships with other variables, such as the final grade in the advanced course. But other arguments must also be considered, such as that the assessment procedures are equally suitable for all students taking the test. We should also make sure that there are no secondary variables that may have an impact on the test result. For instance, the use of higher level language may negatively impact on the results of examinees whose second language is English, which would not reflect a lack of scientific knowledge, but rather a lack of language proficiency (i.e., a construct-irrelevant component). Another proposition that should be examined in this case is that all the knowledge areas to be measured in science, previously described in the specification of the content domain, are actually reflected in the test items, questions, and tasks (i.e., adequate construct representation). Furthermore, we should look for evidence that the examinees’ response process indeed reflects their knowledge in science and not, for example, their language and writing skills in open-ended questions and that the correction of the test is based
on clear criteria not influenced by other irrelevant factors such as the examinees’ handwriting (i.e., construct-irrelevant components). Another proposition to evaluate would be that the different dimensions of the domain knowledge-in-science should be reflected in the internal structure of the test. Therefore, if the test evaluates the areas of biology, chemistry, and physics, then we would expect the relationships between the items to account for these three dimensions (i.e., evidence on internal structure). Finally, another argument that should be examined is that the science test score is strongly and positively associated with scores obtained on other already accepted tests that measure knowledge in science (i.e., evidence on relations to other variables). By gathering sufficient evidence of validity to support the interpretation of scores for an intended use of the test, a provisional validity judgment can be generated. This judgment may support the current interpretation and use of the test, or suggest redefinitions of the construct to be measured, changes in the test and its application, or may lead to future studies to investigate other areas of the test. In our example, it is possible that the first proposition is supported by the evidence, i.e., studies show that the science test score effectively predicts the final grade in the advanced course. However, in another study, we may observe that those examinees whose native language is not English, systematically obtain lower scores in the science test than the rest of the examinees, which could be associated to an irrelevant component of the construct to be measured. To examine this, we would require a measurement bias study (e.g., of differential prediction of the test between language groups) in order to establish whether this is a difference due to bias or due to effective differences between language groups. Then, we would have to look for explanations for this result and weigh the evidence, complement it with other studies, and come to a preliminary conclusion. It is important to note that integration of the different types of validity evidence is the most complex step of the entire validation process. This is the step that requires the strongest expertise in measurement and assessment. Although it is based on empirical evidence that is systematically reviewed, it also involves expert judgments that interact with the specific context and that should be made transparent by those responsible for validation. New validity studies will be required each time the construct, method of assessment, application, or test population is modified. In these situations, further evidence of validity should be collected to support the interpretation of the scores in this new context. In summary, the validation process is a continuous process, which must be supported by the integration of all available evidence and should have an impact on the test itself and its use.
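The differential prediction study mentioned above can be sketched as a regression of the criterion on the test score, a group indicator, and their interaction; group-related differences in intercept or slope would point to differential prediction. The data and group labels below are invented for illustration, and a real bias study would add significance tests, effect-size benchmarks, and a substantive review of item content.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 800

group = rng.integers(0, 2, size=n)                   # 0 = native speakers, 1 = second-language examinees
test = rng.normal(size=n) - 0.3 * group              # test score with a hypothetical group gap
grade = 0.6 * test + rng.normal(scale=0.8, size=n)   # criterion: grade in the advanced course

# Differential prediction check: does the test-grade regression differ by group?
X = np.column_stack([np.ones(n), test, group, test * group])
coef, *_ = np.linalg.lstsq(X, grade, rcond=None)
intercept, slope, group_shift, slope_shift = coef

print(f"Common slope: {slope:.2f}")
print(f"Intercept difference for the second-language group: {group_shift:+.2f}")
print(f"Slope difference for the second-language group:     {slope_shift:+.2f}")
# Differences near zero are consistent with no differential prediction;
# sizable differences would call for a closer investigation of bias.
```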
2.4 The Political Dimension of Validation

Undoubtedly, the political nature of educational measurement and assessment makes validation a political process. In what follows, we present numerous problems that stress the complexity of validation in a political context. We do this in order to
generate awareness about this issue, since these political reasons often create important obstacles to validation as a fundamental process to ensure the quality of educational assessment.

One key question that needs to be answered in every measurement and assessment system is: who is responsible for validation and who provides the necessary financial resources? Test developers are often contracted entities that depend on the perceived quality of their work. Therefore, they are less likely to critically evaluate the validity of their products themselves, unless they are explicitly asked to do so and to report on it, or they are highly professional and abide by the Standards as a basis for their professional self-definition. Moreover, those who finance test development often do not have the technical capacity to understand how necessary it is for validation to take place; they follow a political logic and are more concerned with meeting deadlines, saving resources, and ensuring acceptance of the tests, while being less concerned with the assessment's quality. Finally, those affected by assessments often have even less technical expertise and sometimes less voice to effectively demand validation to support intended uses and avoid unintended uses of assessments. Because of this, it is common that there is no clarity as to who should be responsible for validating a test, and, consequently, resources are often not set aside for test validation. The Standards, however, are clear on this issue: test developers and those who make decisions based on test results are the ones responsible for conducting validation studies.

When, then, should validation take place? There are multiple conflicting interests regarding this question. Political agendas frequently demand that assessment programs be implemented quickly. Political commitment regarding assessment is often fragile, and thus the speed of implementation is often paramount. Furthermore, validation is a complex process that must begin before the assessment system is put into action, but must also consider real-life implementation conditions in order to reach justifiable conclusions (particularly in contexts of high-stakes consequences). Finally, validation must be continuous, as assessment contexts are subject to political pressures and therefore change over time. In summary, validation should start as soon as possible after the decision to develop an assessment system has been taken, should also consider real implementation conditions, and should then be a recurrent process that takes into account the changes occurring in the assessment context.

Taking into account political and time and resource pressures, is validation research independent enough to reach conclusions that may suggest changes to the assessment system in question? Reporting negative validation results is a sensitive issue, as it could be used as ammunition to end an assessment program, even though it would be unrealistic to expect to find only positive evidence regarding the validity of an assessment system. There is no easy or general solution to this problem; much depends on the specific context and interest groups, on who sets the research agenda on validity, who decides how the validation results will be communicated, and who decides what modifications are implemented based on these results.

In relation to the implementation of changes based on validation, it would be important that the legal bases and regulations for assessment programs incorporate
review cycles that allow these modifications to be made, especially in the case of high-stakes assessments. These legal foundations and regulations can provide continuity and useful guidelines for test developers and for those who make decisions based on test results. However, such guidelines must strike a good balance between being sufficiently explicit and not overly detailed. The risk of rigid regulations is that validation research may not inform adaptations or modifications to the assessment method and its implementation in a timely and feasible manner.

Furthermore, an essential condition for generating a research agenda on validity is the availability of well-prepared and trained professionals. However, the capacity for validation research in Latin America and the Caribbean is still insufficient, and professionals in measurement are likewise scarce. Hence, it is important to build capacity in educational measurement and assessment in the region. Nonetheless, even outside of Latin America and the Caribbean, validation does not play the role that it should according to professional standards of good practice.

In sum, if the technical aspects of validation are complex, the political dimension makes it even more intricate. This book is an important contribution to strengthening the awareness that in educational assessment it is indispensable to have standards of quality, particularly of validity, to ensure that decisions are supported by evidence. For evidence-based policy to be useful, educational assessments are needed whose results are used only for intended and validated interpretations and uses, and whose unintended interpretations and uses are explicitly identified so that they can be controlled.

"Validity theory is rich, but the practice of validation is often impoverished" (Brennan, 2006, p. 8). This conclusion is frequently expressed in educational assessment circles and, to our knowledge, remains applicable more than a decade later. However, much less has been written about what can be done to overcome this divide. Above all, policy makers who implement educational assessment systems should be aware of the importance and complexity of arriving at valid inferences, and at the same time be mindful that a test or assessment system will always be subject to a certain degree of imprecision and uncertainty, and will be more useful to some users than to others. Also, those responsible for validating the proposed interpretations, uses, and consequences should be specified by regulation or law, and validation documentation should be made publicly available in a timely manner, before major decisions are made based on the data. Without such a formal or legal obligation, there may never be enough time and resources for validation to be carried out in a sufficiently rigorous manner. Validity evidence can hardly ever be complete and definitive, but a serious effort should be visible.

In practice and in political circles, capacity in educational assessment is still lacking. Moreover, measurement experts often lack applied experience at the crossroads of policy and practice and are therefore not sufficiently sensitive to the political dimension of validation described previously. This contributes to the lack of mutual understanding and makes progress more difficult on the steep path to validation. We hope that this book will provide a basis for knowledge sharing and capacity building, and thus facilitate conversations about validity and validation among different stakeholders.
Returning to the title question, validation should not be a luxury, but a fundamental responsibility for educational assessment systems in Latin America and the world.
References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
American Psychological Association, Committee on Test Standards. (1952). Technical recommendations for psychological tests and diagnostic techniques: A preliminary proposal. American Psychologist, 7(8), 461–475.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51(2), 201–238.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1966). Standards for educational and psychological tests and manuals. American Psychological Association.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests. American Psychological Association.
Borsboom, D., Mellenbergh, G., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 3–16). American Council on Education/Praeger.
Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31–43. https://doi.org/10.1037/a0026975
Cronbach, L. J. (1949). Essentials of psychological testing. Harper & Brothers.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
Kane, M. T. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education/Praeger.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Kane, M. T. (2015). Explaining validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192
Kane, M. T. (2016). Validity as the evaluation of the claims based on test scores. Assessment in Education: Principles, Policy & Practice, 23(2), 309–311. https://doi.org/10.1080/0969594X.2016.1156645
Kulik, J. A., Kulik, C. C., & Bangert-Drowns, R. L. (1984). Effects of practice on aptitude and achievement test scores. American Educational Research Journal, 20(2), 435–447. https://doi.org/10.3102/00028312021002435
Lane, S., Parke, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24–28. https://doi.org/10.1111/j.1745-3992.1998.tb00830.x
Lane, S., & Stone, C. A. (2002). Strategies for examining the consequences of assessment and accountability programs. Educational Measurement: Issues and Practice, 21(1), 23–30. https://doi.org/10.1111/j.1745-3992.2002.tb00082.x
Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14–16. https://doi.org/10.1111/j.1745-3992.1997.tb00587.x
Maguire, T., Hattie, J., & Haig, B. (1994). Construct validity and achievement assessment. Alberta Journal of Educational Research, 40(2), 109–126.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–100). American Council on Education.
Messick, S. (1994). Foundations of validity: Meaning and consequences in psychological assessment. European Journal of Psychological Assessment, 10(1), 1–9. https://doi.org/10.1002/j.2333-8504.1993.tb01562.x
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8. https://doi.org/10.1111/j.1745-3992.1995.tb00881.x
Popham, W. J. (1997). Consequential validity: Right concern-wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13. https://doi.org/10.1111/j.1745-3992.1997.tb00586.x
Ruch, G. M. (1924). The improvement of the written examination. Scott, Foresman and Company.
Santelices, V. (2021). Validity of assessment systems for admissions and certification. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America (pp. XX–XX). Springer.
Shepard, L. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–8. https://doi.org/10.1111/j.1745-3992.1997.tb00585.x
Shepard, L. (2002). The hazards of high-stakes testing. Issues in Science and Technology, 19(2), 53–58.
Wiley, D. E. (1991). Test validity and invalidity reconsidered. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in the social sciences: A volume in honor of Lee J. Cronbach (pp. 75–107). Erlbaum.
Zumbo, B. (2009). Validity as contextualized and pragmatic explanation, and its implications for validation practice. In R. Lissitz (Ed.), The concept of validity (pp. 65–82). Information Age Publishing.
Sandy Taut Deputy Head of Quality Agency, Bavarian State Office for Schools, Germany. Ph.D. in Education from the University of California, Los Angeles (UCLA), USA, and psychologist from University of Cologne, Germany. She has worked and researched issues related to educational assessment, teacher, instructional and school quality, and validation of measurement and assessment systems. Contact: [email protected] Siugmin Lay Psychologist from Pontificia Universidad Católica de Chile and Ph.D. in Psychology from Royal Holloway University of London, United Kingdom. She is currently an adjunct researcher at MIDE UC Measurement Center, Pontificia Universidad Católica de Chile. Her areas of interest are intergroup relations and attitudes studied from a social psychology perspective. Contact: [email protected]
Part I
Validity of Student Learning Assessments
Chapter 3
How to Ensure the Validity of National Learning Assessments? Priority Criteria for Latin America and the Caribbean

María José Ramírez and Gilbert A. Valverde
3.1 Introduction

The number of learning assessments in Latin America and the Caribbean (LAC) has grown significantly. Countries introduce these assessment systems to monitor how well their educational systems pursue curricular objectives and to foster improvements in the system in general, and in student learning in particular. It is reasonable to ask, therefore, how helpful these tests are in substantiating inferences about achievement relative to the goals proposed in curricular policies, to what extent their results can be interpreted as a reflection of such learning, and whether they can be effectively used to promote improvement in learning. To answer these questions, validity evidence is needed.

So far, Latin American countries have been more concerned with installing the assessments than with validating them. While most countries introduced national assessments in the 1990s, these assessment regimes have been unstable. In weak institutional contexts, most resources are often put into installing the assessments. There is little capacity to document validity evidence (e.g., in technical reports) and even less to conduct validity studies or external audits.

1 For the purposes of this study, we accept these stated objectives as the fundamentals of the assessment policy in the region, although there may and should be other stated and undeclared objectives for these assessments.
We appreciate Elisa de Padua’s valuable collaboration in collecting and analyzing information on validation practices in learning assessment programs. We also thank all the professionals of the assessment programs contacted. Finally, our thanks to Patricia Arregui (GRADE) for her valuable contributions and comments. M. J. Ramírez Independent Consultant, Education, Alexandria, VA, USA G. A. Valverde (B) University At Albany, State University of New York, Albany, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. Manzi et al. (eds.), Validity of Educational Assessments in Chile and Latin America, https://doi.org/10.1007/978-3-030-78390-7_3
In such contexts, a focus on validity can be viewed as a threat to the legitimacy of the assessments, a legitimacy that was so difficult to build in the first place. However, now that assessments in the region have matured and become an indispensable part of public discussion on educational issues, it is imperative to prioritize the validity agenda: an agenda that ensures that evidence is collected to support the interpretations, uses, and policy decisions associated with the assessments.

Assessment validation is a must: it is necessary to ensure the assessments' political and technical feasibility. Without validity evidence, how could one know whether assessments measure what they are supposed to measure, or whether their results truthfully reflect student learning? Without validity evidence, it is impossible to know whether assessments really help improve the educational system. Evidence of validity is the basis for judging the quality of the information used to make decisions. Making inferences from information of questionable quality is no better than making decisions without any information at all. It is like measuring temperature without any idea whether the thermometer is working properly: the risks of misdiagnosis would be extremely high, and who knows whether the proposed remedies would help? Not having evidence of validity can lead to what policy analysts call a "type 3 error": solving the wrong problem (Mitroff & Featheringham, 1974).

The cost of introducing a national assessment without evidence of validity is too high: it is equivalent to making an investment with no way of judging its value. One can see how important it is to validate an assessment by considering the costs of not doing so. Consider, for example, the political costs of reporting a decline in educational achievement in a country when student results have, in fact, improved, or the social costs of incorrectly classifying a school as "insufficient" or "underperforming." Lack of validation evidence may lead to criticism that could damage or bring down national assessments.

The international community has developed various technical quality standards for validating learning assessments (see American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014; UNESCO Institute for Statistics [UIS] & Australian Council for Educational Research [ACER], 2017; Darling-Hammond et al., 2013). However, these standards have mainly been produced for developed countries, with their higher (compared to LAC) average levels of student achievement and typically more sophisticated assessment systems.

This chapter proposes ten priority criteria or quality standards for the validation of learning assessments in LAC. These have been classified into three dimensions or sources of evidence: (1) the dimension related to test alignment with the official curriculum, (2) the dimension related to the curricular validity of the performance levels used to report assessment results, and (3) the dimension of consequential validity: the assessment's impact on the improvement of the education system in general, and on the improvement of learning in particular.
The dimensions and criteria were selected based on: (1) learning assessment standards commonly accepted in the international community, (2) information collected by the authors in 2016 about validation practices in learning assessments in nearly 50 countries, including those representing best practices, and (3) the authors' 20 years of academic and professional experience in designing and implementing assessments, consulting, conducting external audits, and providing international technical assistance in learning assessment in LAC and other regions of the world.

This chapter addresses the validation of assessments from a conceptual perspective and delineates the situation observed in LAC with respect to the different assessment dimensions examined. It also presents examples of good validation practices from Canada and the United States. The arguments presented in this introduction are further developed in the chapter by Valverde and Ramírez, where the authors present in-depth case studies of validation practices in different Latin American countries.
3.2 What Does It Mean to Validate Assessments?

We define learning assessments as national tools for measuring the progression towards achieving curricular objectives (competencies, content, or skills, as appropriate). Their main purpose is to monitor learning at the country level, and to promote the improvement of the education system in general and of learning in particular.1 To this end, standardized tests are administered to all students (census), or to nationally representative samples, at key points in their school trajectories (e.g., at the end of the first or second cycle of primary education). The tests are usually paper-and-pencil2 and include multiple-choice questions, problems, or items and, to a lesser extent, open-ended questions. Assessment results are typically used to inform educational policy and practice, and may have consequences associated with them (e.g., incentives).

Validating an assessment means gathering evidence to support its interpretations, uses, and expected consequences. For example, if the assessment measures the attained curriculum, there should exist documentation (evidence) of the alignment between the tests and the curriculum. If the assessment program states that teachers will use the results to plan and improve their lessons, there must be evidence to show that this is possible. The definition proposed here is based on a unitary concept of validity, which also includes a dimension of consequential validity, or the impact of assessments on the education system.
2 Although computer tests are becoming more common.
3.3 Validation Criteria

Table 3.1 presents ten priority criteria for the validation of learning assessments in LAC. Each criterion is accompanied by an explanation and examples of evidence to validate the interpretation, uses, and expected impacts of the assessments. A distinction is made between evidence of products (e.g., assessment instruments) and evidence of processes (methods and procedures for producing the product).

Table 3.2 presents a checklist for reviewing the validity of national assessments. It operationalizes the criteria of the priority dimensions, presenting more specific examples of evidence for validation. Each row or type of evidence refers to one or more priority criteria identified in Table 3.1. The table also provides space for making notes on each type of evidence (e.g., "Alignment review panels are created, but there is no documentation").

Finally, we want to emphasize that while in any assessment there are other key standards or validity criteria, they are not the subject of this chapter (e.g., those related to field operations or data processing). Since in LAC assessments are predominantly used to monitor or verify compliance with learning goals established in curricular policies, we consider validation with respect to the curriculum to be a priority. Hence, the following criteria only refer to the dimensions identified as priorities: the dimension of alignment of tests with the official curriculum, the dimension of curricular validity of performance levels, and the dimension of consequential validity of assessments.
3.4 Dimension of Test Alignment with the Official Curriculum

In LAC, it is a priority to ensure that curricular learning assessment systems collect validity evidence that their tests effectively measure the curriculum. That is, they need to be aligned with the competencies, capabilities, content, or other equivalents described there. Table 3.1, in the dimension of test alignment with the official curriculum, presents three priority criteria of validity for the region, along with evidence necessary to validate each criterion. These are: (1) The design of the assessment is justified in reference to the curriculum, (2) The assessment domain is operationalized with actual student learning in mind, and (3) Test results allow accurate and unbiased monitoring of the achievement of curricular learning over time. Table 3.2 presents a more detailed checklist with examples of evidence for the validation of alignment between the curriculum and the tests.
Table 3.1 Priority dimensions and criteria for assessing the validity of national learning assessments in Latin America and the Caribbean

Dimension of test alignment with the official curriculum

Criterion 1: The design of the assessment is justified in reference to the curriculum
Explanation: There is an explanation of the why, whom, what, how, and when of the assessment, with explicit references to the curriculum. Possible interpretations and intended uses of the results of the assessments are identified. The assessment domain is specified (e.g., mathematics, language) referring to the competencies, objectives, content, or skills, and performance levels defined in the curriculum.
Validity evidence:
• Products: (a) general design of the assessment (the why, whom, what, how, and when it is evaluated) with references to the curriculum, (b) test specifications, (c) final version of the test.
• Processes: methods and procedures for (a) the designing of the assessment, (b) developing test specifications, and (c) producing the final version of the test.

Criterion 2: The assessment domain is operationalized with actual student learning in mind
Explanation: Recognizes the need for assessment to be aligned with what all students know and can do, from the most to the least advanced. The assessment domain and test specifications cover a wide spectrum of learning. The assessment results allow a characterization of what all students know and can do, even those who have more learning difficulties.
Validity evidence:
• Products: (a) final version of the test, (b) assessment domain and test specifications, (c) test results, and (d) information on actual student learning (e.g., results of classroom or international assessments).
• Processes: methods and procedures for (a) operationalizing the assessment domain and making specifications of tests, (b) collecting information about real student learning (e.g., classroom observations, reviewing the results of classroom and international assessments).

Criterion 3: Test results allow accurate and unbiased monitoring of the achievement of curricular learning over time
Explanation: Test results can be interpreted in terms of achieving curricular objectives. The results are published indicating the margin of error or uncertainty associated with the scores, percentages, differences, or other statistics. The margin of error is within the acceptable range, given the consequences associated with the assessment. The results are free of bias (e.g., gender, cultural, linguistic, ethnic). The results of different tests have been equated and put on the same score scale.
Validity evidence:
• Products: achievement scores and measurement errors, equating, and psychometric analysis (e.g., difficulty, discrimination, differential item behavior).
• Processes: methods and procedures for calculating learning outcomes and measurement errors, applying equating, and calculating psychometric characteristics of the tests.

Dimension of curricular validity of the performance levels

Criterion 4: Performance levels are aligned with the curriculum
Explanation: Performance levels are described and justified by different aspects of the curriculum (e.g., competencies, complexity, contexts). Performance levels represent necessary learning milestones towards achieving curricular goals. One of the performance levels corresponds to the achievement of the curricular objectives.
Validity evidence:
• Products: final version of the curricular specifications and psychometric specifications of the performance levels.
• Processes: methods and procedures for developing performance levels aligned with the curriculum.

Criterion 5: Performance levels are operationalized with actual student learning in mind
Explanation: Performance levels are designed to ensure that the assessment can describe what all students know and can do. They cover a wide spectrum of learning, from the most advanced to the most basic. There is a performance level that describes what the lowest performing students know and can do.
Validity evidence:
• Products: (a) results by performance levels, (b) final version of the curricular specifications and psychometric specifications of performance levels.
• Processes: methods and procedures for (a) collecting information on real student learning (e.g., classroom observations, classroom and international assessments), and (b) using such information to develop performance standards.

Criterion 6: Performance levels describe qualitatively different stages of learning
Explanation: Performance levels define milestones on a learning trajectory. What students know and can do at one level is clearly different from what they know and can do at the next. Students are classified into performance levels according to whether they have achieved the level of mastery associated with the appropriate cut scores. There is one performance level that is defined by default and that corresponds to students who do not reach the lowest cut score on the scaled score.
Validity evidence:
• Products: (a) final version of the curricular specifications and psychometric specifications of performance levels, with cut scores, descriptions of what students know and can do at each level, and titles for each level; (b) results according to performance levels.
• Processes: methods and procedures for (a) developing performance levels and (b) sorting students into different levels of performance.

Criterion 7: Performance levels balance stability and change, in the context of a dynamic curriculum policy
Explanation: Two versions of performance levels are used:
(1) Updated performance levels to monitor changes in learning in the short term (e.g., from one year to the next). They are reviewed and updated according to curricular reforms and changes.
(2) Invariant performance levels to monitor changes in learning in the long term (e.g., 10- or 20-year trends). They describe core learnings that are usually less affected by curricular adjustments.
Validity evidence:
• Products: (a) updated performance levels, (b) learning outcomes associated with the updated performance levels, (c) invariant performance levels, (d) learning outcomes associated with the invariant performance levels.
• Processes: methods and procedures for (a) reviewing updated performance levels, (b) assessing students using the updated performance levels, (c) developing invariant performance levels, and (d) assessing students using the invariant performance levels.

Dimension of consequential validity of the assessments

Criterion 8: Assessment results are effectively communicated
Explanation: Results are clearly, correctly, and in a timely manner disseminated in different media and formats (e.g., reports, workshops, brochures, videos), suitable for different audiences.
Validity evidence:
• Products: (a) communication products used for dissemination of results and associated information on the population of students tested and their schools and communities, (b) communication strategy used (e.g., communication products, audiences, timing), (c) levels of access, public trust, and understanding of results by different stakeholders.
• Processes: methods and procedures for designing the communication strategy (e.g., consultation with key audiences or stakeholders).

Criterion 9: There are formal mechanisms to support the use of assessments to improve learning
Explanation: There are formal/institutional mechanisms that encourage the use of assessments for improvement (e.g., school improvement plans, guidelines for school supervision, guidelines for teacher training, pedagogical resources consistent with the curriculum and the actual learning levels of students).
Validity evidence:
• Products: an inventory of formal mechanisms that use assessment inputs.
• Processes: methodology and procedures for collecting information on the uses of assessment results by different educational actors.

Criterion 10: There are formal mechanisms to monitor the impact of assessments on the education system
Explanation: Studies on the intended and unexpected consequences of assessments in the education system are regularly carried out, e.g., evaluations of how assessments have impacted student learning, teacher practice, or parental involvement, or how they are used to inform policy.
Validity evidence:
• Products: studies on the consequences of the assessments.
• Processes: formal mechanisms to carry out studies of consequential validity: commissions or audits, internal or external financing for such studies.

Note: Table constructed by the authors.
Table 3.2 Checklist for assessing the validity of national learning assessments in Latin America and the Caribbean
Columns: Criterion; Validity evidence; Comments on validity: (a) of the process of developing the test instrument, (b) of the test instrument.

Dimension of test alignment with the official curriculum
Criterion 1: The purposes of the assessment ("what for?") are formulated with explicit reference to the curriculum
Criterion 1: Possible interpretations and intended uses of the assessment results are identified
Criterion 1: The overall design of the assessment is described: who is assessed; what, how, and when are measurements carried out, with explicit reference to the curriculum
Criterion 1: The assessment domain is specified (e.g., mathematics, language) with reference to the competencies, objectives, contents, or skills defined in the curriculum
Criterion 1: The quantity and distribution of items in different categories is justified (e.g., contents, capacities) based on the curriculum
Criterion 1: The assessment domain is specified (e.g., mathematics, language) and performance levels are considered in the specification
Criterion 2: There is an explanation specifying the empirical relationship between the definitions of the assessment domains and the actual learning level of students who are lagging
Criterion 2: There is an explanation specifying the empirical relationship between the test specifications (or blueprint) and the learning of students who are lagging
Criterion 3: There is a clear explanation showing how test items constitute an adequate sample of the assessment domain
Criterion 3: The test items are correctly classified according to the categories of the test specifications
Criterion 3: The actual number and distribution of items in different categories matches the intended test specifications
Criterion 3: Test specifications and performance levels are used to guide item development
Criterion 3: Test results are published indicating that there is a margin of error or uncertainty associated with the scores, percentages, differences, or other statistics
Criterion 3: The level of reliability (accuracy) of the tests is reported, which should be equal to or greater than Cronbach's alpha = 0.70
Criterion 3: The classification error related to performance levels is reported, and its size is appropriate to the severity of the consequences associated with the tests
Criterion 3: Tests and items were subjected to Differential Item Functioning (DIF) analyses to avoid bias in the scores
Criterion 3: There is a clear explanation of how different test booklets administered in a single year were put on the same scale of scores through equating
Criterion 3: There is a clear explanation of how different tests administered in different years were put on the same scale of scores through equating

Dimension of curricular validity of the performance levels
Criterion 4: Performance levels are described and justified by referring to different aspects of the curriculum (e.g., competencies, complexity, contexts)
Criterion 4: The level of performance that corresponds to the achievement of the curricular objectives for the grade or cycle assessed is indicated
Criterion 5: The relationship between the lowest performance level and what the students with the weakest learning levels know and can do is shown
Criterion 6: What students know and can do at one performance level is qualitatively different from what they know and can do at the next performance level
Criterion 6: The distance between the cut scores associated with the performance levels is at least half of a standard deviation (SD = 0.50) on the score scale
Criterion 6: How students are classified into different performance levels depending on whether or not they reach the score associated with each level is explained
Criterion 6: The results by performance level distinguish between students that reach or do not reach the lower cut score. Students who do not reach it are classified at a level that is defined by default
Criterion 6: The lowest cut score is set so that no more than 25% of the students fall below it
Criterion 7: There is a clear explanation of how performance levels are reviewed and updated following curricular reforms and adjustments
Criterion 7: The use of updated performance levels to monitor learning in the short term is described
Criterion 7: The use of invariant performance levels to monitor learning over the long term is described

Dimension of consequential validity of the assessments
Criterion 8: Assessment results are published in time to inform decisions made by agencies and stakeholders in the education system, according to the objectives of the assessments
Criterion 8: Assessment results are disseminated in different media and formats (e.g., reports, workshops, brochures, videos) for different audiences (e.g., teachers and school leaders, parents, general public)
Criterion 8: The communication plan used to disseminate assessment results and information is described
Criteria 8–9: The extent to which educational stakeholders (e.g., parents, teachers, teacher educators, and policymakers) value, understand and use assessment results and information to make decisions about educational practice or policy is documented
Criterion 9: There are a variety of formal or institutionalized mechanisms for using assessment results and information
Criterion 10: There are formal mechanisms for regularly collecting information on the expected and unintended consequences of assessments

Note: Prepared by the authors.
3.4.1 Criterion 1: The Design of the Assessment is Justified in Reference to the Curriculum

To fit their purpose of monitoring curricular learning achievement, assessments must be aligned with the official curriculum. That is, the tests must measure the objectives, competencies, content, or skills (as appropriate) set out in the curriculum. Alignment between tests and curricula is essential so that assessment results can be interpreted as the achievement of curriculum objectives and, in particular, so that a higher score can actually be interpreted as an indicator of a higher level of mastery of the curriculum than a lower one. Alignment with the curriculum also means that performance levels correctly identify students who are at different stages of learning.

The assessment should be accompanied by documentation that makes it possible to judge the degree of alignment between the current curriculum and the tests. The assessment framework should clearly indicate the purposes and expected uses of the assessments based on the curriculum; for example, whether the purpose of the assessment is to monitor the achievement of curriculum objectives at the national, sub-national (e.g., regional), school, or classroom level. The framework should provide guidelines on how to correctly interpret assessment results in terms of the curriculum and use them to improve learning.
The design of the assessment should be based on the current curriculum. Test specifications describe and justify the objectives, competencies, content, or capabilities (as appropriate) to be assessed; the test format (e.g., paper-and-pencil or computer-based) and type of items (e.g., multiple-choice or open-ended questions); the situations or contexts in which students need to demonstrate what they have learned (e.g., abstract or applied mathematics problems); the cognitive complexity of the tasks to be performed; the time allowed to complete test booklets; and so on. Test specifications usually include double-entry tables indicating the number and type of items to be included in each test, classified according to different categories; for example, a table with axes for the contents and skills to be evaluated in mathematics, with the number of items to be included in each cell (crossing contents and skills). The distribution of items across categories is an indication of the importance (weight) assigned to each one in the assessment.

In LAC, validity evidence usually focuses on test specifications. Often, this is the only type of documentation available on instrument design. Indeed, the purposes of assessments are often described in very general terms in LAC, without specifying the type of interpretations they are designed to make possible or their appropriate uses. Specifying the purposes of the assessments is more difficult in the context of curricula that are not written with the express purpose of being measured (see the chapter by Valverde and Ramírez in this book).

Performance levels should inform the development of test items. This is fundamental for measuring each level well, with items that point to the competencies, curricular content, or capabilities that characterize it. Guidance for item development can come from either preliminary or definitive descriptions of performance levels. Preliminary descriptions can be used when there is no empirical evidence of student performance yet (i.e., before the tests are administered). Definitive descriptions can be used after the performance levels have been developed and adjusted in relation to the empirical evidence, i.e., to the test results.

In LAC, performance levels are not systematically used to develop new items. This often occurs because performance levels are not part of the curriculum documents; they are usually created afterwards, once test results are available. Given that performance levels seek to describe what students know and can do in relation to the curricular objectives, it is desirable that they be developed, at least in a preliminary way, before the tests are designed, and used as input for the test specifications. This typical absence of performance-level specification in test design distinguishes assessment programs in LAC from the more sophisticated assessment programs in the world.
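As a concrete illustration of the kind of documentation check implied by Criterion 1 (and by the checklist item in Table 3.2 on whether the actual item distribution matches the intended specifications), the following sketch compares an assembled test form against a content-by-skill blueprint. The blueprint counts, item identifiers, and category labels are invented for illustration; they are not taken from any of the programs discussed in this chapter.

```python
# Sketch: check an assembled test form against a content-by-skill blueprint.
# All counts and labels below are hypothetical examples.
import pandas as pd

# Intended number of items per (skill, content) cell, as in a double-entry table
blueprint = pd.DataFrame(
    {"Numbers": [6, 4], "Geometry": [5, 3], "Data": [4, 2]},
    index=["Knowing", "Applying"],
)

# Metadata for the items actually selected for the test form
items = pd.DataFrame({
    "item_id": ["M01", "M02", "M03", "M04"],
    "content": ["Numbers", "Numbers", "Geometry", "Data"],
    "skill":   ["Knowing", "Applying", "Knowing", "Applying"],
})

# Actual counts per cell, arranged in the same layout as the blueprint
actual = (items.pivot_table(index="skill", columns="content",
                            values="item_id", aggfunc="count", fill_value=0)
               .reindex(index=blueprint.index, columns=blueprint.columns,
                        fill_value=0))

# Cells where the assembled form departs from the intended specification
print(blueprint - actual)
```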
3.4.2 Criterion 2: The Assessment Domain is Operationalized by Taking into Consideration Actual Student Learning

Test design should take into consideration actual student learning. This ensures that an assessment can monitor the learning levels of all students, from the most advanced to those who are lagging behind.
Therefore, test specifications and performance levels must account for the competencies, knowledge, and skills of all students, including the least advanced. Consequently, items should be developed so that they cover the full range of students' abilities.

There is a clear tension between measuring curricular objectives and measuring what students who are lagging behind really know and can do. Curricular objectives usually correspond to the most sophisticated and difficult things measured in the tests. In LAC, what struggling students know and can do is often not measured, as it is considered too easy, too basic, and too distant from the curricular objectives. This renders these students invisible, and they are precisely the ones who need the most support to improve their learning. There is a close correlation between learning and students' socio-economic background, repeatedly confirmed by research and assessments in LAC. Ultimately, this means that making the most disadvantaged students invisible hides the children of the poorest and most vulnerable families in our societies from the sight of policymakers and the public.
3.4.3 Criterion 3: Test Results Allow Accurate and Unbiased Monitoring of the Achievement of Curricular Learning Over Time

Any assessment program must meet minimum technical requirements to ensure that its results can be interpreted in terms of curricular learning. These technical requirements relate to test reliability and measurement error, measurement bias, and year-to-year comparability of test results.
3.4.3.1 Reliability and Measurement Error
Test results should allow for precision (reliability) in monitoring the achievement of curricular learning over time. To interpret assessment results, one needs to know the accuracy of the assessment. It is also of crucial importance to publish the level of accuracy: assessment methods are probabilistic, and assessment users need information about the probabilities associated with their results.

The minimum acceptable level of accuracy (reliability) will depend on the consequences associated with an assessment. In no- or low-stakes assessments (e.g., sample-based testing for monitoring purposes), an internal consistency (Cronbach's alpha) of at least 0.70 is commonly accepted. In tests with higher stakes (e.g., census-based tests reporting at the school level), this indicator must be higher. Similarly, the acceptable classification error associated with performance levels may vary depending on the consequences.

The results report should account for the degree of uncertainty associated with the results by indicating, for example, which score differences are statistically significant.
This can be done by using colors, asterisks, or boldface text, or by grouping cases with similar results together (e.g., showing in the same cell of a table the regions whose results are similar to the national result). Another way of indicating the degree of uncertainty is to report the standard error associated with the scores.
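To make the reporting requirement concrete, the following is a minimal sketch of how internal consistency and the associated standard error of measurement might be computed from an item-score matrix. The response data are simulated here purely for illustration; in practice the matrix would come from the actual test administration.

```python
# Sketch: Cronbach's alpha and the standard error of measurement (SEM) for a
# matrix of scored items (rows = students, columns = items). The responses are
# simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
difficulty = np.linspace(-2, 2, 40)
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))      # Rasch-like probabilities
items = (rng.random((500, 40)) < p_correct).astype(float)  # 0/1 item scores

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_scores = items.sum(axis=1)
total_variance = total_scores.var(ddof=1)

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
sem = total_scores.std(ddof=1) * np.sqrt(1 - alpha)

print(f"Cronbach's alpha = {alpha:.2f} (0.70 is the floor cited above for low-stakes uses)")
print(f"SEM on the raw-score scale = {sem:.2f}")
```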
3.4.3.2 Measurement Bias
For tests to measure curricular learning well, it is important that they be free of bias. This means that every student should have an equal opportunity to demonstrate what they have learned, so that gender, geographical, or cultural context is not an obstacle to demonstrating their knowledge and skills. In more technical terms, it means that test scores do not contain systematic errors or interactions with specific groups of students. Bias may manifest itself as a general characteristic of the test or at the level of specific items. At the test level, it may be that the average score better predicts a criterion (e.g., secondary school grades) for one group of students (e.g., males) than for another (e.g., females). At the item level, differential item functioning (DIF) analysis is a commonly used technique to identify items that may be biased. This analysis is complemented by a qualitative judgment on each item, carried out by professionals.

Gender bias can occur in items that are contextualized in issues more familiar to males than to females, for example, in items about a soccer game. If a question on a reading test requires familiarity with soccer rules and if, for example, boys and girls differ in their knowledge of this subject, it will be easier for the former to answer correctly. However, this does not allow us to infer that boys attained more of the goals of the curriculum, or that they necessarily have better reading abilities. The only inference, and one irrelevant to curriculum policy, is that boys know more about soccer than girls. National assessments in LAC that do not yet include this type of bias analysis should consider including it.
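The DIF screen described above is often operationalized with the Mantel-Haenszel procedure, which compares two groups of examinees matched on total score. The sketch below is a simplified illustration, not a production implementation: the data frame, the column names, and the group labels ("reference", "focal") are hypothetical, and the total test score is used directly as the matching variable.

```python
# Sketch: Mantel-Haenszel DIF screen for one item, comparing a reference group
# (e.g., boys) and a focal group (e.g., girls) matched on total test score.
import numpy as np
import pandas as pd

def mantel_haenszel_dif(df, item_col, group_col, matching_col):
    """Return the MH common odds ratio and the ETS delta-MH statistic."""
    num, den = 0.0, 0.0
    for _, stratum in df.groupby(matching_col):        # one stratum per total score
        ref = stratum[stratum[group_col] == "reference"]
        foc = stratum[stratum[group_col] == "focal"]
        n = len(stratum)
        if len(ref) == 0 or len(foc) == 0:
            continue
        a = (ref[item_col] == 1).sum()   # reference correct
        b = (ref[item_col] == 0).sum()   # reference incorrect
        c = (foc[item_col] == 1).sum()   # focal correct
        d = (foc[item_col] == 0).sum()   # focal incorrect
        num += a * d / n
        den += b * c / n
    odds_ratio = num / den
    delta_mh = -2.35 * np.log(odds_ratio)  # on the ETS delta metric; |delta| >= 1.5 is often flagged
    return odds_ratio, delta_mh

# Hypothetical usage: one row per examinee, with a 0/1 item score, a group
# label, and the total score used as the matching variable.
# df = pd.read_csv("item_responses.csv")
# print(mantel_haenszel_dif(df, "item_17", "gender_group", "total_score"))
```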
3.4.3.3 Comparability
Countries need to monitor trends in learning over time. They must report whether learning outcomes have improved, worsened, or remained stable between assessments. To do this, it is essential that assessments be comparable; therefore, they must meet a number of technical requirements: they must measure the same assessment domain (e.g., reading) in the same student population and with equivalent samples (e.g., a national fourth-grade sample), and test scores must be on the same scale. It is on this last point that validity evidence is scarce in LAC, and, therefore, comparisons of some results over time tend to be dubious.

Measuring changes in learning requires that the tests be on the same score scale. This would be simple to achieve if exactly the same tests could be administered in
different years.3 However, this is neither possible nor desirable, for two main reasons: (1) the tests must be modified to be aligned with curricular updates,4 and (2) the test items must be renewed to replace items released to the public to show what is being measured and how. How, then, can change be measured with constantly changing tests? The key is to build instruments that measure the same assessment domain (e.g., reading), with certain variations to accommodate curriculum updates. To this end, one part of the test can be administered in exactly the same way in different years (e.g., by repeating half of the items in both tests), while the other part of the test is new (new items are developed for the other half). This design allows tests containing both old and new items to be placed on the same score scale. This procedure, known as equating, applies psychometric models from Item Response Theory (IRT). An important next step for LAC countries will be to use these types of methodologies to measure changes in learning.

There are limits to the validity of year-to-year comparisons between tests. When there are continuous curricular reforms that affect the fundamental elements of the assessed curriculum, these changes unavoidably influence test specifications, and the comparability of assessments administered in different years becomes more questionable. Unfortunately, there are countries in LAC that seem to be dedicated to constant cycles of curricular reform without regard to the effect these have on the validity of inter-annual comparisons (Fig. 3.1).
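To illustrate the common-item design described above, the following sketch applies a simple mean-sigma linking of item difficulty estimates from two administrations. This is only one of several IRT linking methods, and all of the numbers, year labels, and the function name are invented for illustration.

```python
# Sketch: mean-sigma linking of two administrations through common (anchor)
# items, using IRT item difficulty estimates. Values are illustrative only.
import numpy as np

# Difficulty estimates for the anchor items, as calibrated in each year
b_anchor_2019 = np.array([-1.20, -0.40, 0.10, 0.75, 1.30])   # base-year scale
b_anchor_2020 = np.array([-1.05, -0.20, 0.35, 0.90, 1.55])   # new-year scale

# Linear transformation that puts the 2020 calibration on the 2019 scale
A = b_anchor_2019.std(ddof=1) / b_anchor_2020.std(ddof=1)
B = b_anchor_2019.mean() - A * b_anchor_2020.mean()

def to_base_scale(values_2020):
    """Re-express 2020 abilities or item difficulties on the 2019 scale."""
    return A * np.asarray(values_2020) + B

# Once transformed, results can be compared across years on a common scale
theta_2020 = np.array([-0.8, 0.0, 0.6, 1.4])
print(to_base_scale(theta_2020))
```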
3 Responding to the principle: "If you want to measure change, do not change the measure."
4 By curricular updates, we understand adjustments that do not affect the fundamental elements of the assessed curriculum, that is, adjustments to the content, skills, or competencies to be achieved in a given grade or educational cycle. This is usually the case when curricular updates or reforms are made in LAC.

The United States Federal Assessment Program NAEP (National Assessment of Educational Progress)

NAEP is an international benchmark for best practice in educational assessment. It is a federal assessment that is used strictly for the purpose of monitoring learning over time. Since there is no national curriculum in the United States, NAEP assesses against assessment frameworks that have been validated and agreed upon by all the states in the nation. These assessment frameworks describe and justify in detail the assessment domain of each test. Assessment domains range from the more traditional areas of reading and mathematics to more innovative areas such as economics and foreign languages. The assessment frameworks remain stable for about 10 years.

The NAEP assessment frameworks are operationalized in the test and item specifications. The specifications offer the most concrete guidelines for designing tests and writing items. These guidelines include categories traditionally used in LAC, such as content and skills. They also include categories that are more innovative, such as level of item complexity (low, medium, or high), context (e.g., theoretical or applied mathematics), and item format (e.g., multiple-choice or open-ended). These documents are available online and serve as a reference for internal quality control and external audits. See the mathematics example here: https://nagb.gov/naepframeworks/mathematics.html

The alignment between tests and assessment frameworks is validated externally. Panels are formed with technical and political representation, including classroom teachers, school administrators, and other stakeholders (e.g., parents, civil society representatives), as well as curriculum and assessment specialists. This participation increases awareness of and commitment to the performance levels, and contributes to their social and face validity.

NAEP uses a wide range of state-of-the-art psychometric analyses. IRT models are used to put tests with different items on the same score scale, which allows score differences to be interpreted as real differences in student learning. The level of uncertainty associated with the scores (standard error) is known and reported both in reports intended for the public and in technical documentation. Bias analyses are performed as part of basic testing procedures.

Fig. 3.1 International example of best practices in alignment. Note: Prepared by the authors

3.5 Dimension of Curricular Validity of Performance Levels

To ensure assessment validity, LAC countries need to collect validity evidence on the performance levels reported by these assessments. Table 3.1, in the dimension of curricular validity of the performance levels, presents four priority criteria for LAC, along with the relevant validity evidence for each. These criteria are: (4) Performance levels are aligned with the curriculum, (5) Performance levels are operationalized with actual student learning in mind, (6) Performance levels describe qualitatively different stages of learning, and (7) Performance levels balance stability and change within the context of dynamic curriculum policies. Table 3.2 presents a checklist with more detailed examples of evidence on this dimension.
3.5.1 Criteria: 4. Performance Levels Are Aligned with the Curriculum and 5. Performance Levels Are Operationalized with Actual Student Learning in Mind

There is a growing trend in LAC to report results by performance levels (also called performance standards, achievement levels, learning levels, or equivalent terms). For example, countries may report assessment results by citing percentages of students who have reached the advanced, intermediate, or basic level. Performance levels give pedagogical meaning to the results, indicating what students at different performance levels know and can do. This is essential if teachers and other educational stakeholders are to understand the results of the assessments, value them, and be able to use them for improvement.
Performance levels serve a dual purpose. On the one hand, they convey an expectation of what students are supposed to know and be able to do according to the official curriculum (criterion-referenced element, with an absolute criterion). Usually, such descriptions correspond to the curricular expectations of the grade or assessment cycle. On the other hand, performance levels are meant to describe reality, or what students actually know and can do. They show the entire distribution of students' skills and indicate their relative position on the score scale (normative element). Performance levels should be aligned with both the curriculum and actual student learning. The descriptions associated with each level of performance should refer to the competencies, content, or skills specified in the curriculum. The highest level should correspond to the curricular expectations (key learning, terminal, or minimum learning requirements for the grade or cycle being assessed), and the lowest level should reflect what less advanced students actually know and can do. This requires tests that include items of different levels of difficulty, from the easiest to the most difficult, so that they can discriminate well not only at the top but also at the bottom of the distribution of skills.
In LAC, it is common to observe that performance levels are set too high in relation to the actual learning achieved by students. In some countries, up to half of the students do not reach the first cut score associated with the levels. Results of this type are of limited value in informing educational policy and practice, and do little to help the students who need the most support. The fact that performance levels are too demanding can be explained by a combination of factors:
(a) Performance levels rely on the curriculum alone,5 without considering the actual learning of the students.
(b) There is a distorted view of what students actually know and can do.
(c) There is political pressure to align national performance levels with those of international assessments to make the national education system look more rigorous.
(d) There is apprehension that once a lower level of performance is defined, it would be interpreted as a sufficient minimum level.
(e) The cut scores associated with performance levels are set in relative terms, based on the percentiles of the skill distribution (e.g., 50th, 75th, and 90th percentiles), without covering the lower percentiles.
5
In LAC countries, curricula are frequently written without reference to the evidence showing what is actually taught and learned in classrooms. Consequently, these curricula often set learning objectives that could hardly be attained by large percentages of the student population. This presents an additional challenge for designing assessments with performance levels that provide useful information about students who do not meet curricular expectations. It is also common that curriculum design involves no collaboration between experts in educational measurement and curricular experts. In such cases, the curriculum is not designed to be measurable, and it is often difficult to operationalize it with acceptable levels of validity for assessment purposes.
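Factor (e) in the list above can be illustrated with a small numerical sketch. If cut scores are placed mechanically at the 50th, 75th, and 90th percentiles of the observed score distribution, then by construction about half of the students fall below the first described level, regardless of what they actually know and can do. The scores below are simulated purely to show the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(250, 50, size=10_000)    # simulated scale scores

# Cut scores placed at fixed percentiles of the distribution (factor (e) above).
cuts = np.percentile(scores, [50, 75, 90])
levels = np.digitize(scores, cuts)           # 0 = below the first cut

labels = ["below level 1", "level 1", "level 2", "level 3"]
for k, name in enumerate(labels):
    share = np.mean(levels == k) * 100
    print(f"{name:>13}: {share:4.1f}% of students")
# By construction, roughly 50% of students end up below the first described level.
```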
3.5.2 Criterion 6: Performance Levels Describe Qualitatively Different Stages of Learning
Performance levels must account for qualitatively different stages of student learning. Any rigorous classification requires clearly defined, mutually exclusive categories, and this also applies to performance levels. The competencies or skills described in one level of performance must be clearly distinct from those described in the next; the same skills should not simply be paraphrased differently at different levels. The descriptions must also be written in simple, non-technical language, understandable to the widest possible audience. To achieve this, it is necessary to define a relatively small number of levels, with sufficient distance between them on the score scale. Adding more performance levels has important implications for test design, interpretation, and use of results. The more performance levels, the greater the demand for items to measure each level with the necessary accuracy, and the greater the risk of classification errors. In countries with accountability policies, classification errors have more serious repercussions than in countries without such policies. Thus, if assessment results are used for accountability purposes, countries need to make sure the level of accuracy in the assessment is sufficiently high. That is, educational authorities need to have a very high degree of certainty that a school is correctly classified as “unsatisfactory” or “underperforming” if this classification will affect its funding and reputation.
It is important to distinguish students who reach the lowest cut score associated with the performance levels from those who do not. Based on the test results, it can be inferred that students who do reach the lowest performance level have acquired the competencies associated with the first cut score. In contrast, it can be inferred that students who do not reach the lowest cut score do not have such skills. In LAC, this distinction is not always made: both groups are reported at a single performance level, as if they all had the skills associated with the first cut score. To differentiate between the two groups, it is important to introduce a default performance level that applies to students who do not reach the first cut score and therefore do not demonstrate the skills associated with it.
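The argument about classification errors can be made concrete with a back-of-the-envelope calculation. Assuming observed scores are approximately normally distributed around a student's true score with a known standard error of measurement, the chance of placing a student on the wrong side of a cut depends on how far the true score sits from the cut, measured in standard-error units. The figures below are hypothetical; they only illustrate why adding levels, and thus placing more students close to some cut score, inflates classification error.

```python
from math import erf, sqrt

def prob_misclassified(true_score, cut, sem):
    """Probability that measurement error pushes the observed score
    to the wrong side of the cut (normal approximation)."""
    z = abs(true_score - cut) / sem
    return 0.5 * (1 - erf(z / sqrt(2)))    # one-tailed normal tail probability

sem = 10.0                                 # hypothetical standard error of measurement
for distance in (2, 5, 10, 20):            # distance of the true score from the cut
    p = prob_misclassified(250 + distance, 250, sem)
    print(f"true score {distance:>2} points from the cut: "
          f"{p:.0%} chance of landing on the wrong side")
# The more levels there are, the closer the cuts sit to one another, the more
# students sit near some cut, and the larger the share classified into the wrong level.
```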
3.5.3 Criterion 7: Performance Levels Balance Stability and Change in the Context of a Dynamic Curriculum Policy
Performance levels must balance stability and change in order to monitor curricular learning over time. Stability is necessary to compare learning against the same benchmark. Change is necessary to keep pace with curricular updates. In other words, performance levels need to resolve the tension between (a) being invariant, to ensure that results are comparable over time, and (b) being aligned with a curriculum that is regularly updated.
United States Federal Assessment Program NAEP (National Assessment of Educational Progress)
- Reports results in four performance, or “Achievement,” levels: NAEP Advanced, NAEP Proficient, NAEP Basic, and Below Basic. To do this, it sets three cut scores. Students at the “Below Basic” level are those who did not reach the lowest cut score on the score scale.
- The cut scores associated with the performance levels are set so that there is a gap of approximately one standard deviation between them on the score scale. With a gap this large, descriptions of what students know and can do are substantively different. This spacing also helps keep classification errors low and the reliability of the results high.
- All methods and procedures for reaching the final performance levels are published in the public domain and are rigorously documented. This documentation serves as the basis for regular external audits.
- To resolve the tension between stability and change in performance levels, NAEP maintains assessments that measure two types of performance levels:
o Main NAEP measures performance levels that are updated according to the latest curriculum changes. It is used to report changes in learning in the short term (e.g., from one assessment to the next).
o The NAEP Long-Term Trend Assessment measures invariant performance levels that remain intact over time. It is used to measure changes in learning in the long term (e.g., over a 20-year period).
Assessment in the province of Ontario, Canada (Ontario Provincial Assessment Program)
- Reports results at four performance levels, numbered 1 through 4. An additional fifth level applies to students who do not reach Level 1. This level is defined by default and is described as follows: “The student does not show enough evidence of mastery of curriculum expectations to be categorized at Level 1.”
- Level 1 describes what students with weaker learning performance know and are able to do. It corresponds to the minimum level of observed skills or learning achieved. The cut scores have been set in such a way that only around 2% of students do not reach the cut score separating Level 1 from Level 2.
Fig. 3.2 International examples of best practices for reporting results by performance levels. Note Prepared by the authors
In the context of a dynamic curriculum, more and more countries in the region will need to resolve this tension. Figure 3.2 shows examples of how to do this.
3.6 Dimension of Consequential Validity of the Assessments
The current state of development of assessments in LAC brings to the forefront the following criteria for collecting and assessing evidence of the consequential validity or impact of assessments: (8) assessment results are effectively communicated; (9)
there are formal mechanisms to support the use of assessments to improve learning; and (10) there are formal mechanisms to monitor the consequences of assessments for the education system. Table 3.1, Dimension of Consequential Validity or Impact of the Assessments, explains the validity criteria and sources of evidence. Table 3.2 presents a checklist with examples of impact validation requirements.
3.6.1 Criteria: 8. Assessment Results Are Effectively Communicated, and 9. There Are Formal Mechanisms to Support the Use of Assessments to Improve Learning
Countries implement national learning assessments in order to monitor learning and encourage improvement. The theory of action is that assessment results, together with any associated information, will be effectively communicated to different stakeholders (e.g., parents, teachers, principals, politicians), who will use them systematically to make better decisions. Such evidence-based decisions will have a positive impact on educational policy and practice; that is, they will contribute to improving classroom teaching practices and student learning.
However, in LAC, it is common to hear that assessments have not produced the expected impact. Critics point out that a lot of data are produced but not turned into useful information for decision-making; that there is no assessment culture that would allow for the systematic use of such assessments; and that teachers do not use assessment information to make pedagogical decisions. The most critical argue that the assessments are affecting education negatively by narrowing the notion of educational quality, stigmatizing schools, encouraging competition instead of cooperation between schools, pushing students to drop out, narrowing the curriculum, stressing teachers, and so on (Falabella, 2014; Ministry of Education [MINEDUC], 2015). These effects are said to be more pronounced in the context of accountability policies, for example, when results are published and incentives are tied to school performance. Responding to these criticisms requires evidence of impact.
Communication of results is a major source of evidence to validate the impact of assessments. For assessments to have a positive impact, their results must be effectively communicated. That is, educational stakeholders (e.g., parents, teachers, ministry officials) need to have access to such information, understand it, value it positively, and use it effectively. There are very few studies that investigate the communication strategies of assessments in LAC (see Taut et al., 2009; Sempé & Andrade, 2017).
The existence of formal mechanisms for using assessment results is another source of validity evidence. Examples of such mechanisms include policy guidelines that promote the use of assessment results to inform school improvement plans, monitor overall school performance, or provide feedback to teacher training programs. The effective use of assessments requires these formal mechanisms, usually absent in
LAC. Assessment programs should be responsible for promoting the use of the information they generate to improve the quality of education.
3.6.2 Criterion 10: There Are Formal Mechanisms to Monitor the Impact of Assessments on the Education System
It is key to systematically and regularly collect evidence on the consequences or impact of assessments on the education system. Such evidence should cover both positive and negative, expected and unexpected consequences. It will serve either to confirm the theory of action that guides the assessments or to modify it. The evidence will also be used to respond to the main criticisms against assessments. To the extent that there is evidence of impact, it will be possible to justify (or modify) the use of assessments for decision-making; such evidence is essential for providing the system with credibility.
At the global level, the evidence on the consequences of assessments is mixed. On the more positive side, the report by Mourshed et al. (2010) concludes that the education systems that improve the most over time implement rigorous learning assessment policies. These assessments allow educational systems to systematically monitor learning and use the results as feedback. Such policies are especially relevant for countries where the quality of education is relatively low. On the other hand, there is abundant evidence of the negative impact that high-stakes national assessments can have on the education system, including curricular narrowing, student and teacher stress, and even increased marginalization of vulnerable populations (Kearns, 2011; Knoester & Wayne, 2017; Segool et al., 2013).
The global evidence regarding the impact of accountability policies is also varied. These policies cover everything from the publication of school results in ranked order (league tables) to the distribution of monetary incentives to schools. In some developing countries, these policies have had a positive impact on student learning, contributed to lowering student drop-out rates, and allowed better control and supervision by education stakeholders (including parents), thus reducing corruption. However, the evidence also suggests that in other developing countries, accountability policies have not had the expected impact (Bruns et al., 2011).
In LAC, there are few formal mechanisms in place to collect evidence on the impact of assessments on the school system. Rather, there are isolated studies on this subject, usually financed through external funds, independent of the assessment programs. Evidence on the impact of assessments on the education system in LAC countries is scarce but growing. The most optimistic report states that simply communicating assessment results has positive effects on student learning (de Hoyos et al., 2017). However, the evidence usually indicates a much more moderate impact, if any. In Chile, the accountability policies have not shown the expected impact either on
improving educational policies and practices in schools (Elacqua et al., 2015), or on parents’ decisions (Mizala & Urquiola, 2007) (Fig. 3.3).
3.7 Conclusions
Assessments in LAC have matured enough, and have gained enough influence on education systems and society at large, to warrant a call for validity evidence. This type of evidence is produced and used to support the claims that assessments effectively measure the curriculum, that their results can be interpreted in terms of performance levels, and that assessments have a positive impact on the education system in general, and on student learning in particular. This evidence is necessary to give assessments political credibility and viability, and to improve their technical characteristics. It is also key for avoiding the costs associated with the misuse of assessment results.
This chapter presents three priority dimensions or sources of validity evidence for LAC: (1) evidence regarding the alignment of tests with the official curriculum, (2) evidence regarding the curricular validity of the performance levels used to report the assessment results, and (3) evidence of consequential validity, or the impact of assessments on the improvement of the education system in general, and of learning in particular. For each of these dimensions, it provides criteria and examples of the evidence needed to validate the assessments.
The evidence of validation for these three dimensions in LAC is sporadic. It is in the dimension of alignment with the curriculum where the greatest amount of validation evidence is found. There are two main challenges in this dimension. The first is that LAC countries express considerable reluctance to develop a measurable curriculum. Therefore, curricular objectives are usually formulated in very general terms, without specifying their level of difficulty or complexity. This is in contrast to practice in more technically advanced countries and education systems, where close professional and institutional links in curriculum and assessment design set curricular goals in measurable terms from the outset. One reason for the high quality of assessments in the province of Ontario, Canada, is that they have a curriculum designed to be measurable. The second challenge is that curricula are constantly changing, either through reforms or updates, which makes it difficult to monitor learning achievement over time.
In the dimension of curricular validity of performance levels, countries must resolve technical–political tensions in order to measure the achievement of curricular learning. How many levels should be set? How demanding should these levels be? Where should these levels be set on the score scale? Another tension that countries need to resolve arises from the United Nations’ Sustainable Development Goals. Goal 4, “Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all,” will be measured in part by indicator 4.1.1, “Proportion of children and young people (a) in second or third grade, (b) at the end of primary school and (c) at the end of lower secondary school achieving at least a minimum proficiency level in reading and mathematics, by
The United States Federal Assessment Program NAEP (National Assessment of Educational Progress)
NAEP has written communication policies that identify responsibilities for reporting and disseminating results. NAEP strategic communication plans identify principles and priorities for disseminating assessment results. These plans are developed by NAEP’s Reporting and Dissemination Committee of the National Assessment Governing Board (NAGB) and are available online. The NAEP Validity Studies Panel reviews aspects of the validity and uses of NAEP publications. It also funds external studies to collect evidence on this aspect of validity. NAEP has a validity study schedule that covers evaluating the consequences of the reporting of results.
U.S. State Assessment Programs
There is a great deal of research in the United States on the impact of assessments on teaching practices and student learning. These studies are made possible by the existence of public and private funds that prioritize funding for impact validity studies. For example, a study of the impact of state assessments found that when performance levels had more positive names (labels), students were more likely to decide to continue their education at the post-secondary level (Papay, Murnane, & Willett, 2016). In New York, a study of the impact of new statewide fourth-grade assessments on teacher turnover found that turnover decreased in this grade after their introduction, including in relation to other grades that were not assessed (Boyd, Lankford, Loeb, & Wyckoff, 2008). In Texas, a study of the impact of the statewide assessment found that accountability policies had a more positive impact in schools with low average levels of achievement in mathematics, where students took more mathematics courses and achieved higher levels of performance over time. Surprisingly, the impact was reversed in schools with higher average levels of performance (Deming, Cohodes, Jennings, & Jencks, 2016).
Ontario Provincial Assessment Program
The EQAO (Education Quality and Accountability Office), the organization in charge of the assessments, regularly collects evidence on how its assessments are used and what impact they have on the education system. It implements the following activities:
- Internal reviews (e.g., through forums) to gather information about the tests and how they are conducted in schools; the relevance of reports to educational accountability and improvement; and the impact of testing on teacher training. These forums are conducted jointly by the Assessment Advisory Committee and more than 20 organized community interest groups representing public and private school principals, supervisory agents, teachers, boards of education and trustees, parents, and students.
- Regular external audits that cover the overall assessment process, including its impact on the education system. To this end, consultations are held with stakeholders and the general public. The external audit report is made available for public comment. As a result of internal and external reviews, the EQAO makes commitments to action to improve the assessments.
Fig. 3.3 International best practices of consequential validity of assessments. Note Prepared by the authors
gender.” When defining performance levels for their national assessments, countries should take into consideration the minimum proficiency level determined internationally by the global teams defining standards for reporting on SDG indicator 4.1.1. However, they should do this without compromising the alignment of performance levels with their own national curriculum.
The effective use of assessments requires creating an assessment culture in which education stakeholders can access, understand, value, and use assessments for improvement. Forming this assessment culture is a pending challenge in LAC. The centrality of this point contrasts with the marginal budget that many countries allocate to the communication of results and general information on assessments. It also contrasts with the lack of opportunities (e.g., workshops, courses) for teachers, managers, or ministry officials to reflect on the results obtained and on possible ways to improve them in the future.
Another important step is to move forward in consolidating an agenda for validating the consequences of the assessments. Collecting this type of evidence is not part of the countries’ working agenda. This puts the credibility of the assessments at risk and has important associated costs, such as the costs of decisions based on erroneous assumptions. One example would be assuming that teachers will use the results published in the reports to improve their teaching practices, when in fact the reports do not reach their audience and, when they do, teachers do not understand them. It is equally weak to assume that teachers will improve their teaching practices when the results show that most of their students are clustered at the lowest performance level. Without information on the students who are making the least progress, it is implausible that there will be improvements in the pedagogical practices of teachers and in the learning of these students.
A challenge for Latin American countries is to resolve the tensions inherent in meeting different validation criteria. For example, there is a conflict between evaluating curricular objectives and assessing what all students know and can do, including those who lag behind; or between reporting invariant performance levels and adjusting the levels according to curricular updates. It is important for countries to prioritize validity dimensions and criteria, depending on the degree of maturity of their assessments and the local context. For countries that are currently introducing assessments to their education systems, the dimension of assessment alignment with the curriculum would be a priority, with the minimal goal of producing technical documentation on test specifications. In countries that are starting to report results by performance levels, the priority should be for these reports to describe what students know and can do with regard to curricular objectives and actual student learning. Countries that have already put their assessments in place should focus on evidence of validity regarding the effective communication and impact of the assessments.
The cost of validating the assessments is lower than the cost of not validating them at all. Expenses associated with conducting an external audit, for example, may seem considerable compared to the annual budget of an assessment program. However, this cost is unlikely to exceed 5% of the annual budget, and its benefits may go far beyond one assessment cycle. The cost of not having evidence of validity, however,
can be much higher; for example, misinterpretation of the results by the press can lead to a political crisis and even force educational officials to resign.
The dimensions and validity criteria presented in this chapter can be used for different purposes:
– To carry out an internal review of the methods and procedures used. Assessment program teams can use them to define the technical standards to be met and to define their internal work routines for test validation.
– To define a validity study agenda. For example, it could be decided that in the current year a validity study will address the alignment of the tests with the curriculum, and that in the following year it will focus on the impact of the assessments on teaching practices.
– To provide technical assistance in assessment to countries. There is strong interest from the international community in strengthening learning assessment programs in the countries of the region. The validation criteria presented here may be useful in identifying weaknesses and technical assistance needs (for example, to review the evidence regarding effective communication of assessments).
– To guide external audits. Assessment programs should not only receive internal evaluations of their procedures and results; they must also go through external review. Claiming that a student, school, or education system has reached a certain level of performance is a statement that should be auditable. That is, it must be independently verifiable in order to justify the decisions made on the basis of these results. Conducting external audits is a regular practice at NAEP (the USA) and the Provincial Assessment Program (Ontario, Canada). In LAC, there are some notable, but still scarce, examples of this type of practice (see the chapter by Valverde and Ramírez in the present volume).
In LAC, validation reviews, where they exist, tend to be internal and more process-focused; that is, the assessment programs themselves collect or judge the evidence of validity. Being focused on processes, they are not conducive to creating a culture of external audits of products. It is assumed that if the process was carried out as planned (implementation fidelity), the product will be adequate as well. Countries would benefit from establishing a regular external audit policy that is transparent to citizens, including better documentation of the processes used and the assessment instruments produced. These audits should be seen as an opportunity to improve the assessments.
The degree of maturity of assessments in LAC and the level of influence they have gained in the public sphere require evidence of validity: evidence to support the claim that the assessments are fulfilling their purpose of monitoring the achievement of curricular objectives and encouraging improvement. The dimensions and criteria proposed here seek to support countries in their systematic search for such evidence. Following these criteria is expected to give greater credibility, as well as political and technical feasibility, to the assessments, in the interest of a better education for all.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Boyd, D., Lankford, H., Loeb, S., & Wyckoff, J. (2008). The impact of assessment and accountability on teacher recruitment and retention: Are there unintended consequences? Public Finance Review, 36(1), 88–111. https://doi.org/10.1177/1091142106293446
Bruns, B., Filmer, D., & Patrinos, H. A. (2011). Making schools work: New evidence on accountability reforms. World Bank. https://doi.org/10.1596/978-0-8213-8679-8
Darling-Hammond, L., Herman, J., Pellegrino, J., Abedi, J., Aber, J. L., Baker, E., & Steele, C. M. (2013). Criteria for high-quality assessment. Stanford Center for Opportunity Policy in Education.
de Hoyos, R., García-Moreno, V., & Patrinos, H. A. (2017). The impact of an accountability intervention with diagnostic feedback: Evidence from Mexico. Economics of Education Review, 58(C), 123–140. https://doi.org/10.1016/j.econedurev.2017.03.007
Deming, D. J., Cohodes, S., Jennings, J., & Jencks, C. (2016). When does accountability work? The Texas system had mixed effects on college graduation rates and future earnings. Education Next, 16(1), 71–76.
Elacqua, G., Martínez, M., Santos, H., & Urbina, D. (2015). Short-run effects of accountability pressures on teacher policies and practices in the Chilean voucher system in Santiago, Chile. School Effectiveness and School Improvement, 27(3), 385–405. https://doi.org/10.1080/09243453.2015.1086383
Falabella, A. (2014). The performing school: The effects of market and accountability policies. Education Policy Analysis Archives, 22(70), 1–29. https://doi.org/10.14507/epaa.v22n70.2014
Kearns, L. L. (2011). High-stakes standardized testing and marginalized youth: An examination of the impact on those who fail. Canadian Journal of Education, 34(2), 112–130.
Knoester, M., & Wayne, A. (2017). Standardized testing and school segregation: Like tinder for fire? Race, Ethnicity & Education, 20(1), 1–14. https://doi.org/10.1080/13613324.2015.1121474
Ministry of Education. (2015). Towards a complete and balanced system of learning assessment in Chile. Report task force for the revision of the Simce. Retrieved from http://www.mineduc.cl/wp-content/uploads/sites/19/2015/11/Informe-Equipo-de-Tarea-Revisi%C3%B3n-Simce.pdf
Mitroff, I. I., & Featheringham, T. (1974). On systemic problem solving and the error of the third kind. Behavioral Science, 19(6), 383–393.
Mizala, A., & Urquiola, M. (2007). School markets: The impact of information approximating schools’ effectiveness. Journal of Development Economics, 103(C), 313–335. https://doi.org/10.3386/w13676
Mourshed, M., Chijioke, C., & Barber, M. (2010). How the world’s most improved school systems keep getting better. McKinsey & Company.
Papay, J. P., Murnane, R. J., & Willett, J. B. (2016). The impact of test score labels on human-capital investment decisions. Journal of Human Resources, 51(2), 357–388.
Segool, N. K., Carlson, J. S., Goforth, A. N., Von der Embse, N., & Barterian, J. A. (2013). Heightened test anxiety among young children: Elementary school students’ anxious responses to high-stakes testing. Psychology in the Schools, 50(5), 489–499. https://doi.org/10.1002/pits.21689
Sempé, L., & Andrade, P. (2017). Final report: Evaluation of the use of census reports of students in the school. GRADE/FORGE.
Taut, S., Cortés, F., Sebastian, C., & Preiss, D. (2009). Evaluating school and parent reports of the national student achievement testing system (Simce) in Chile: Access, comprehension and use. Evaluation & Program Planning, 32(2), 129–137. https://doi.org/10.1016/j.evalprogplan.2008.10.004
UNESCO Institute for Statistics, & Australian Council for Educational Research. (2017). Principles of good practice in learning assessment. Retrieved from http://uis.unesco.org/sites/default/files/documents/principles-good-practice-learning-assessments-2017-en.pdf
Valverde, G., & Ramírez, M. J. (2019). Contemporary practices in the curricular validation of national learning assessments in Latin America: A comparative study of cases from Chile, Mexico and Peru. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
María José Ramírez Psychologist from the Pontificia Universidad Católica de Chile and Doctor in Education from Boston College, USA. Currently she is an international consultant in education. Her areas of interest include learning assessment and quality of education. Contact: [email protected]
Gilbert A. Valverde He has a degree in Philosophy from the University of Costa Rica and a Ph.D. in Comparative and International Education from The University of Chicago, USA. He is Dean of International Education and Vice-Provost for Global Strategy at the University at Albany, State University of New York and a member of the faculty in the Department of Educational Policy and Leadership. His areas of interest include the international comparative study of assessment policy and curriculum, educational measurement, and large-scale international testing. Contact: [email protected]
Chapter 4
Contemporary Practices in the Curricular Validation of National Learning Assessments in Latin America: A Comparative Study of Cases from Chile, Mexico, and Peru Gilbert A. Valverde and María José Ramírez
4.1 Introduction
National learning assessments have been growing in Latin America (LA) with the purpose of monitoring the mastery of the curriculum achieved in schooling and of fostering improvements in achievement. Almost all countries in the region now have a national assessment program; by contrast, in the early 1990s, fewer than five countries had one. Most programs state that one of their main purposes is to monitor compliance with the objectives established in their curricular policies1 (Ferrer, 2006; Ferrer & Fiszbein, 2015).
1
We understand by “curriculum policy” the set of policies, norms, ordinances, and institutional practices that make up the official orientations of the educational system to guide the execution of the curriculum. National tests or assessments are typically curricular policy instruments. We use the word “curriculum” to designate the curricular policy instruments that in different countries are called national curriculum, national plan of studies, standards, graduation profiles, mandatory minimum content, or similar, which have in common the function of prescribing the content of national schooling.
We appreciate Elisa de Padua’s valuable collaboration in collecting and analyzing information on validation practices in learning assessment programs. We also thank all the professionals of the evaluation programs contacted. Finally, our thanks to Patricia Arregui (GRADE) for her valuable contributions and comments.
G. A. Valverde (B) University of Costa Rica, San Pedro, Costa Rica. e-mail: [email protected]
M. J. Ramírez Independent Consultant, Education, Alexandria, VA, USA
In this chapter, we examine how three countries in LA validate the inferences they make from test results to see if their educational systems are reaching the objectives established in their curricular policies. How do their assessment programs back inferences about the achievement of the learning proposed in curricular policies? What evidence do they have to determine that the test results can be interpreted as a reflection of such learning and effectively used for improving learning outcomes? In other words, to what extent are these programs concerned with validating the interpretations and expected uses of their evaluations? These are mainly questions about the curricular validity of tests, which is often discussed as an aspect of content or construct validity (Valverde, 2000), and which is especially important since assessment instruments in LA have the explicit objective of monitoring performance levels established by curricular policies and achieved in schools.2
As described in other contributions to this volume, validation consists of various methods to accumulate and evaluate evidence to justify the inferences made from the assessments. There is a broad range of theoretical and methodological approaches to address different aspects of validity (Burstein et al., 1990; Kane, 2001; Sireci & Faulkner-Bond, 2014; Stobart, 2001). In this chapter, we focus only on certain aspects of validation: those related to the justification of the use of assessment results to assess the student performance levels specified in the prescribed curriculum. In our opinion, curricular validation should be a high priority in the validation agenda for school assessments, since many countries in LA explicitly cite the need to measure the learning achieved in their schools as one of the main objectives of their assessments.
Curricular validity is fundamentally important: it allows us to be confident that when a national assessment indicates that 20% of students fail to achieve some of the expected learning for their grade, nations are justified in focusing their future educational efforts on those areas where students struggle the most. Have we correctly identified the educational areas that we should focus on? Have we correctly identified the students (and their schools, geographic location, socioeconomic status, etc.) that should be the target of our efforts? To be concerned about curricular validity is to be concerned about the value of the conclusions derived from national assessments as tools for decision-making in the education system. Curricular validity therefore serves to justify our assessment-informed actions in the educational field. Obviously, in education systems in which assessments have serious consequences for students, schools, or other institutions or actors, curricular validity is of primary importance: it helps to assess the reasonableness of decisions, incentives, and sanctions that are based on assessment results.
This chapter describes a comparative study of national learning assessment programs in Chile, Mexico, and Peru. To this end, it uses a taxonomy or matrix of dimensions and criteria for the curricular validation of national assessments presented in the chapter by Ramírez and Valverde in this same book.
2
For the purposes of this study, we accept these stated objectives as the fundamental objectives of the policy under evaluation in the region, although there may and should be other stated and undeclared objectives for these evaluations.
The assessment programs in question are compared in terms of their efforts in test validation regarding (1) the alignment of the tests with the official curriculum, (2) the curricular validity of the performance levels used to report test results, and (3) the consequential validity of the assessments. For each of these dimensions, the chapter explores similarities and differences in the countries’ evaluation practices. Examples of specific validation procedures in the region are included, with careful consideration of their strengths and weaknesses, as well as our recommendations for improvement. The chapter concludes with a summary of findings, recommendations, and identification of challenges in the validation of national assessments that remain unresolved. We hope that the examination of these three cases may yield lessons that would be applicable to systems throughout LA.
4.2 Methodology
To investigate how the validation tasks are approached, we selected cases that would satisfy the criteria for a comparative case study: we chose three national student assessment programs in LA that represent the diversity of assessment practices and institutional designs (to capture the range of variability in the region) and for which there are sufficient data and information sources available. The latter criterion presents a major challenge for studying national assessments in the region: it is only recently, and in rare cases, that specific and detailed technical information about the procedures followed to develop national assessments is documented and disseminated to the public (or researchers). Therefore, this chapter focuses on three national assessment cases: Chile, Mexico, and Peru. These countries participate in large-scale international assessments, and their assessment systems have been technically influenced by such participation.
Our informants mention a common pattern in how international testing influenced their countries’ assessments: first, it brought to light concerns regarding the year-to-year comparability of test results, and, second, it familiarized the countries with technical procedures for measuring and communicating results in terms of performance levels. We have also verified, after a comprehensive review of the assessment agencies in LA, that the countries chosen are those that document their technical procedures in greatest detail and most transparently communicate them to the public. At the same time, they present differences that are representative of the different types of national assessment systems that currently exist in LA. Table 4.1 gives basic information on the cases.
This study began with a review of the technical documentation files available in countries that have participated in at least one round of the Programme for International Student Assessment (PISA).3
3
International large-scale comparative test conducted by the Organization for Economic Cooperation and Development (OECD).
Table 4.1 General data on the cases

Assessment name and characteristics
- Chile: Education Quality Measurement System (SIMCE). A census-based assessment that is carried out according to a multi-year calendar for different grades in elementary and high schools, in a variety of disciplinary areas.
- Peru: Student Censual Assessment (ECE), conducted annually in second grade of elementary school in literacy and mathematics.
- Mexico: National Plan for the Evaluation of Learning (Planea). The Evaluation of Achievement referenced to the National Educational System (ELSEN) is carried out every 4 years with a country-level sample and samples that are representative of each of the Mexican states. It evaluates four school subjects in two elementary grades and one secondary grade. Numerical and verbal reasoning are also evaluated in the third preschool grade.

Reference to curriculum
- Chile: The learning standards that are developed are based on the current national curriculum.
- Peru: The national curriculum is the framework document for the educational policy of elementary education and serves as the benchmark for ECE. It defines performance levels required for graduation from elementary education, national competencies, and the progression of these competency goals from the beginning to the end of elementary education, as well as expected competency levels by cycle, level, and type of school.
- Mexico: The national curricula for each of the school subjects assessed. However, in Mexico, a single document is not considered to be the sole specification of the national curriculum. For this reason, Planea’s reference documents are the study plans and programs, official student and teacher textbooks, worksheets, and various other official instructional materials.

Agency responsible
- Chile: The Agency of Education Quality, an independent unit that is part of the National System of Quality Assurance.
- Peru: The Office for Measuring the Quality of Learning, part of the Strategic Planning Secretariat of the Ministry of Education.
- Mexico: National Institute for the Evaluation of Education (INEE). An autonomous public body that reports directly to the Senate of the Republic of Mexico.

Note Prepared by the authors. The table presents general characteristics of the cases of Chile, Mexico, and Peru that will be analyzed in this study
Having identified the cases referred to in Table 4.1 as the best documented and sufficiently representative of the range of variability of national assessments in LA, we proceeded to study each of them in depth, carrying out the following tasks:
• Review of technical reports, as well as of the institutional websites of the learning assessment programs and the agencies in charge of the official curriculum, followed by interviews via e-mail, phone, or videoconferencing with people linked to each assessment program, with additional surveys and document analysis.
• Analysis of notes and reports of technical consulting work done by the authors in the assessment and curriculum agencies in the three countries under study, as well as in other countries of the region.
These data were used in a comparative case study to ensure that the contexts of analysis are analytically equivalent,4 but at the same time present similarities and meaningful differences in the variables of interest with respect to curricular validation. The need for sufficiently documented cases resulted in a sample of three of the most consolidated assessment systems in the region. It is important to recognize that many of the national assessment systems in LA are under-documented and less institutionally consolidated than our three cases. At the same time, the countries used in our study reflect typical relationships between assessment systems and official curricula in the region. In the comparative study reported here, we will illustrate the use of the fundamental validation criteria as a comparison tool, criteria that we present in the chapter by Ramírez and Valverde in the present book.
4
For example: the three cases are national evaluation programs; they also have an explicit relationship with the fundamental curriculum document(s) of the education system, and represent tests designed following similar psychometric models.
4.3 Curriculum Validation Practices in Latin America
It is important to recognize a central challenge for the curricular validation of assessments in LA. It is widely documented in other publications (Ferrer, 2006; Valverde, 2009) that official curricula in LA are not designed to be measurable. It is typical for the region that teams responsible for developing curricula do not have specialists in educational measurement among their members. Moreover, such teams commonly fail to consider standardized assessments as a valid tool to evaluate whether curricular objectives are achieved. This fundamental characteristic of most curricula is consistently mentioned in interviews with evaluation professionals in the region. It also appears in interviews with members of the teams entrusted with developing the curriculum. These teams of curriculum experts are commonly skeptical regarding the effort to operationalize curricular objectives in measurable terms in standardized assessments. As a result, an idiosyncratic challenge has emerged in LA: curriculum assessment is conducted in organizational contexts where curriculum developers and school assessment developers do not typically have institutionalized points of contact or participate in joint technical teams.
Given the above, teams that design assessments have to come up with ways to measure curricula that were not expressly designed to be measured. Our case studies are not homogeneous in this respect. Chile stands out the most: the country’s curriculum is designed by the Ministry of Education’s Curriculum and Evaluation Unit, expressly for the purpose of being measured and with careful consideration of how best to operationalize this goal. There is then an agency in charge of evaluating that curriculum (the Quality Agency). Peru, on the other hand, represents a case more typical of LA. Its official school curriculum is written around competencies by the Ministry of Education. However, it was not written with the idea that assessments should cover all the competencies and all of their components, and the theoretical and technical foundations of the curriculum do not always lend themselves easily to measurement. As is typical across the region, the assessment is carried out by the Quality Measurement Unit (UMC), a technical team completely unrelated to the one responsible for the curriculum, even though they both report to the Minister of Education. Mexico’s case is distinct as well: there is no single document that represents the official curriculum, but rather a number of documents and other materials, mentioned in Table 4.1, that must be analyzed to develop the assessments. Yet it is typical in one fundamental aspect: designers of national assessments in Mexico have to work with curriculum documents that were not created with measurement in mind (Valverde, 2009).
The three cases, therefore, allow us to examine the technical approach to curricular validation in contexts that are very different in terms of the structure of curricular policies and their relation to school assessments. They allow us to think about the variety of situations in curriculum and assessment policies in LA and the challenges they present for the validation of assessments.
4.3.1 Dimension of Test Alignment with the Official Curriculum
In terms of the different types of evidence for the dimension of alignment with the curriculum, we have pointed out that the challenges faced by each national case are different, and that in LA it is uncommon to have a curriculum designed to be assessed, even though the agencies responsible for the assessments are charged with measuring that curriculum. It is common in LA that the regulations governing assessment indicate that it should monitor the achievement of curricular objectives: that is, learning is supposed to be measured at regular intervals. In Table 4.2, we present in comparative form basic information about how each of the three national assessment agencies handles the issue of test alignment with the official curriculum, organized according to the criteria for the dimension of test alignment with the official curriculum that we presented earlier (see the chapter by Ramírez and Valverde, 2019).

Table 4.2 Alignment of tests with the national curriculum

Criterion: The design of the assessment is justified in reference to the curriculum
- Chile: The law states that the national assessment must verify compliance with the curriculum using performance standards. The justification for the assessment design is based on the results of the curriculum analysis used as the basis for the technical specifications of the assessment. These are refined through feedback from the Ministry’s expert teams, including curriculum developers.
- Peru: The national Elementary curriculum is structured around competencies. Learning standards specify the level that all students are expected to achieve by the end of the Elementary school cycle. The justification for the evaluation design (in this case, the Table of Specifications) is made on the basis of the learning standards.
- Mexico: There is no single document that outlines the curriculum. Assessments are based on curricula, textbooks, worksheets, and other instructional materials. The evidence of curricular validity is the results of the curricular analysis of this diverse material, used in the justification of the domains measured in the tests.

Criterion: The assessment domain is operationalized with actual student learning in mind
- Chile: Evidence of learning from national and international assessments, as well as from classroom assessments, is used as input to adjust the difficulty level of the assessment domain.
- Peru: Evidence from national and international diagnostic and census assessments is used to adjust the difficulty level of the tests.
- Mexico: Actual student results are used to adjust the difficulty level of national assessments (no mention of international assessments).

Criterion: Test results allow accurate and unbiased monitoring of the achievement of curricular learning
- Chile: Tests from different years are equated and put on the same score scale to ensure year-to-year comparability. An IRT score scale is used. The measurement error associated with the scores is reported, and the error is of acceptable size. DIF analysis is performed to avoid test bias.
- Peru: Tests are equated from one administration to the next, using standardized Rasch scale scores.
- Mexico: Tests are equated using IRT and multiple imputation to estimate plausible values per student, comparable from one administration to the next.

Note Prepared by the authors. The table presents a comparison of evidence of curricular validity (dimension of test alignment with the national curriculum) for the cases of Chile, Mexico, and Peru
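The last criterion in Table 4.2 mentions DIF analysis among the routine checks against test bias. One widely used screening statistic is the Mantel–Haenszel common odds ratio, sketched below for a single item with examinees matched on their score on the remaining items. The data are simulated and the thresholds mentioned in the comments (the ETS delta categories) are only indicative; operational DIF analyses add purification steps and expert review of flagged items.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, ability_mean, studied_difficulty):
    """Simulate Rasch responses to 10 background items plus one studied item."""
    theta = rng.normal(ability_mean, 1.0, n)
    b = np.linspace(-1.5, 1.5, 10)                       # background item difficulties
    background = (rng.random((n, 10)) <
                  1 / (1 + np.exp(-(theta[:, None] - b)))).astype(int)
    studied = (rng.random(n) <
               1 / (1 + np.exp(-(theta - studied_difficulty)))).astype(int)
    return background.sum(axis=1), studied

# Reference group; focal group sees a harder version of the studied item (DIF).
rest_ref, item_ref = simulate(3000, 0.0, 0.0)
rest_foc, item_foc = simulate(3000, 0.0, 0.5)

num, den = 0.0, 0.0
for s in range(11):                                      # stratify by rest score 0..10
    A = np.sum((rest_ref == s) & (item_ref == 1))        # reference group, correct
    B = np.sum((rest_ref == s) & (item_ref == 0))        # reference group, incorrect
    C = np.sum((rest_foc == s) & (item_foc == 1))        # focal group, correct
    D = np.sum((rest_foc == s) & (item_foc == 0))        # focal group, incorrect
    T = A + B + C + D
    if T == 0:
        continue
    num += A * D / T
    den += B * C / T

mh_odds_ratio = num / den
mh_delta = -2.35 * np.log(mh_odds_ratio)                 # ETS delta metric
print(f"MH odds ratio = {mh_odds_ratio:.2f}, MH D-DIF = {mh_delta:.2f}")
# Items with |MH D-DIF| above roughly 1.5 are usually flagged for expert review.
```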
In regard to the first criterion in Table 4.2, Chile presents both an exception and a typical case. It differs from other countries because its curriculum was designed with the explicit purpose of being measured through standardized tests, and it shares the LA tendency of explicitly prescribing that school assessments be aligned with the curriculum. The current General Education Law No. 20.370 (2009) states that the national assessment must verify the degree of compliance with the general objectives of the curriculum by measuring Learning Standards (or performance standards). Thus, test results are to be reported in terms of compliance with the standards that describe what students must know and be able to do to demonstrate compliance with the learning objectives stipulated in the current curriculum.
Chile and Peru both link the curriculum and assessments with the help of documents that outline (learning) standards. In Chile, the standards are developed based on the curriculum for the corresponding grade, and include the skills and knowledge students are supposed to have accumulated by the time the assessment is carried out. In developing these standards, specialists use empirical evidence from previous national and international assessments, as well as from classroom evaluations.
In Peru, the official curriculum, titled the National Elementary Education Curriculum (Ministry of Education, 2016), was recently amended.5 It is structured
As will be seen later, adjustments, reforms, and changes to the curricula are now very frequent in LA and present a challenge for curricular validity and for the year-to-year comparability of evaluations.
around competencies and includes learning standards that define the level that all students are expected to achieve by the end of the elementary school cycle. The learning standards in the official curriculum delineate the development of competencies at levels of increasing complexity, from the start to the end of the elementary school cycle, according to the sequence followed by most students who progress in a given competency. These standards are described through eight overlapping levels for primary education, for the different strands of the subjects that make up the national curriculum. This way of organizing the curriculum is relatively recent, and at present the assessment system is still grappling with the technical challenges of ensuring alignment with this new model of curriculum. The learning standards are intended to serve as benchmarks for assessing learning at both the classroom and national levels (the national Student Censual Assessment [ECE]). However, the census assessments evaluate some competencies but cannot, and do not claim to, account for all of them.
Paradoxically, we can consider Mexico as the most instructive example of the most common situation in LA. This claim may seem counterintuitive, since Mexico is unique in not having a single official curriculum document, but rather a set of documents. What is instructive, though, is the way in which the National Institute for the Evaluation of Education (INEE) in Mexico approaches its work with this set of documents: through a procedure of rigorous and disciplined curriculum analysis. Its methodology is interesting for its potential to address the dilemma faced by many countries in the region: the need to translate, in measurable terms, an official curriculum that was not properly designed to be measured.
Mexico’s official curriculum comprises plans and programs of study, textbooks, teacher’s books, files, the National Teacher Training Course, and other official curricular policy documents issued by different directorates of the Ministries and Sub-secretariats of Education, corresponding to educational levels, disciplinary areas, and curricula.6 To design tests aligned with the curriculum, the INEE begins by forming academic committees that analyze all relevant documents and determine the complete set of curricular goals by disciplinary area and grade, and then, based on this analysis, select the curricular areas to be measured in the National Plan for the Evaluation of Learning (Planea). Each Academic Committee (AC) is made up of approximately a dozen members: curriculum specialists in the disciplinary areas, academic specialists, textbook authors, school principals, teachers from teacher training institutions, in-service teachers, and methodological advisors from the INEE technical group. Each AC receives training in curriculum analysis, test design, and other fields considered essential for carrying out its work.7
In secondary school, for example, there are four curricula with different specific programs in each discipline area: general Secondary, Tele-secondary, industrial/technical Secondary, and Federal Secondary.
7
Training materials, working protocols, and instruments used in the work of ACs can be found on the INEE website: http://www.inee.edu.mx/index.php/proyectos/excale/excale-documentos-tecnicos.
The AC then analyzes a wide range of documents from Mexico's formal curriculum because, in the words of the Technical Manual: "Since in practice no document contains everything that should be taught or what is important curriculum-wise, at this first stage it is necessary to carry out content analysis of various sources that define the subject curriculum, such as the syllabi, textbooks and teacher's books, worksheets, as well as various instructional materials. This is in order to explicitly define the learning levels that need to be achieved according to the curriculum (in the corresponding subject) and to determine their scope" (Directorate of Testing and Measurement, 2005, p. 42).
The curriculum analysis conducted by each AC results in a double-entry table (grid) summarizing the student learning outcomes teased out from the discipline-area curriculum. This grid also includes preliminary recommendations regarding priority areas for assessment. In order to validate and complement the curricular analysis carried out by the ACs, consultations about curricular priorities are made to convenience samples of in-service teachers, and specialists are commissioned to conduct independent studies. The grid (see Fig. 4.1 for an example; Directorate of Testing and Measurement, 2005) is subject to additional analysis to identify content priorities and performance expectations in the formal curriculum and thus determine the elements to be considered in test designs. The curriculum analysis presented in such grids is complex and includes not only the curricular elements mentioned above but also relationships between content areas, between content and skills, epistemological aspects, and others.
Fig. 4.1 Expanded portion of a curriculum analysis grid in Spanish (primary education) from EXCALE (the predecessor of the current Planea assessment), Mexico, establishing curricular content and the relations between content elements identified in the curriculum analysis. These grids result from the analysis of various curricular documents and are used as inputs to identify assessment domains. Retrieved from "Manual técnico para el diseño de exámenes de la calidad y el logro educativo" (Technical Manual), Directorate of Testing and Measurement, 2005, Mexico City: National Institute for the Evaluation of Education
There is an analysis protocol to determine relevance, content chains,
and other elements from the empirical evidence gathered in the curricular analysis; this protocol is explained in detail in the technical materials. Decisions about the curricular expectations to be measured and the weight each expectation will have on the test are recorded in Tables of Contents, which are the basic test design documents. Such grids are used to design test items. Alignment is ensured by checking that each test item has its match in the Table of Contents and that the set of test items is properly distributed across the table. Evidence of Planea's curricular validity—and therefore of its alignment with the official curriculum—is based on the review of the documentation and analysis that produce the grids and Tables of Contents. So far, there are no independent studies or INEE post-hoc evaluations of how well the assessments are aligned with the curriculum. Indeed, this is typical throughout LA: validating an assessment is mostly interpreted as ensuring that all necessary procedures were followed faithfully. As in Mexico, in LA in general, the evidence of validity regarding curricular alignment is focused mainly on evaluating the reliability of the process rather than on auditing its results. As for the validity of comparing variation in achievement over time, the three cases are similar and are at the forefront of these practices at the regional level. They use psychometric procedures similar to those used in the international tests in which they participate to ensure that their test results are comparable over time. This represents an important milestone, since for many years it was not common for LA countries to produce valid year-to-year comparisons, even though assessment results were reported as if they were comparable over time. The only type of evidence that supported the validity of such comparisons was the identical design (typically called the specification table) of the test across years. This reasoning, of course, ignores the fact that both the students sampled for different rounds of assessment and the test items differ from year to year. The use of appropriate psychometric methods of test equating is therefore an important advance. However, LA is a region currently characterized by a dynamic curriculum policy environment: official curricula are frequently reviewed, refined, and reformed. So, how can one simultaneously recognize changes in curricular objectives and expect to have valid comparisons over time? We have not found any examples of countries that have specific plans for safeguarding the comparability of national assessments across changing curricula. This is a major challenge that many countries will have to address in the near future. Although our interviewees expressed concerns regarding the issue, technical solutions have yet to be developed.
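The chapter does not detail the specific equating designs used by each country, but the general logic of anchor-item linking can be sketched briefly. The following minimal example illustrates one simple approach—mean/mean linking under a Rasch model—for expressing two assessment rounds on a common scale; operational programs use more elaborate designs and calibration software, and every item label and difficulty value below is hypothetical.

```python
# Illustrative sketch (not any agency's actual procedure): placing two assessment
# rounds on a common scale via anchor items under a Rasch model, using the
# mean/mean linking method. Item difficulties are assumed to have been estimated
# separately for each year; all item labels and values are hypothetical.
import numpy as np

anchors_year1 = {"A01": -0.40, "A02": 0.10, "A03": 0.75, "A04": 1.20}
anchors_year2 = {"A01": -0.15, "A02": 0.35, "A03": 1.05, "A04": 1.40}

# Mean/mean linking: the shift that maps the year-2 metric onto the year-1 metric
# is the difference between the mean difficulties of the common (anchor) items.
common = sorted(set(anchors_year1) & set(anchors_year2))
shift = np.mean([anchors_year1[i] for i in common]) - np.mean([anchors_year2[i] for i in common])

def to_year1_scale(theta_year2: float) -> float:
    """Express a year-2 ability estimate on the year-1 scale."""
    return theta_year2 + shift

print(f"Linking shift: {shift:+.3f}")
print(f"Year-2 theta 0.50 on the year-1 scale: {to_year1_scale(0.50):.3f}")
```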
4.3.2 Dimension of Curricular Validity of the Performance Levels
Performance levels have been introduced to LA only recently, and the three country cases under review act as pioneers in this area (see Table 4.3).
Table 4.3 Results by performance levels

Validation criterion: Performance levels are aligned with the curriculum
Chile: By law, performance standards or levels of achievement are instruments that represent the curriculum. The national assessment is required to be based on performance levels. The degree of alignment between the curriculum and performance levels is evaluated by experts.
Peru: Performance levels are developed by assessment technical teams in reference to the learning standards included in the national curriculum. The degree of alignment between the curriculum and performance levels is evaluated by experts.
Mexico: Performance standards are evaluation tools; they are not part of the curriculum. They represent operational definitions of curricular objectives created by collegiate technical committees to translate the curriculum into measurable terms. They are currently used only in the assessments themselves, as well as in assessment reports and databases. The degree of alignment between the curriculum and performance levels is evaluated by experts.

Validation criterion: Performance levels are operationalized with actual student learning in mind
Chile: Performance levels are defined considering evidence of students' achievements in national and international assessments, and in classroom assessment.
Peru: Performance levels are defined considering evidence of student achievement in the national assessment.
Mexico: Performance levels are defined considering evidence of student achievement in pilot tests and national trials.

Validation criterion: Performance levels describe qualitatively different stages of learning
Chile: The development of performance standards in Chile takes two components into account: a qualitative and a quantitative one. Assessment results are categorized in three levels: Adequate, Elementary, and Insufficient. Each is defined by the corresponding knowledge and skills, together with a list of minimum requirements for the Elementary and Adequate levels. These specifications are developed through a process that considers the expectations of the curricular framework, as well as the achievements actually demonstrated by Chilean students.
Peru: ECE results are reported in terms of levels of performance (or achievement; both terms are used in official documents) for Mathematics, reading, and writing. According to their scores, high school students are grouped into four levels of achievement: Satisfactory, In Progress, Early Stage, and Before Early Stage. In turn, elementary students are sorted into three levels of achievement: Satisfactory, In Progress, and Early Stage. Each of these levels describes what a student who scores within a certain ability range knows and can do. Levels of achievement are inclusive, which means, for example, that students placed in the Satisfactory level have a high probability of adequately answering the Satisfactory level questions, as well as the questions for the In Progress and Early Stage levels.
Mexico: Planea describes four performance standards/levels of achievement: Below Basic, Basic, Medium, and Advanced. The task of specifying requirements for the levels of achievement is entrusted to collegiate committees, which are in charge of creating labels/names, establishing definitions, and identifying illustrative items corresponding to each level. The first committee focuses solely on reviewing curriculum documents, without reference to previous test results. Other committees are responsible for determining cut scores, following an adaptation of the Bookmark method in which work is based on empirical evidence from the tests.

Validation criterion: Performance levels balance stability and change in the context of a dynamic curriculum policy
Chile, Peru, and Mexico: Since performance levels have been introduced only recently, there are no defined policies or procedures yet to address changes in performance levels over time.

Note: Prepared by the authors. The table presents a comparison of evidence of validity of test performance levels according to validation criteria, for the cases of Chile, Mexico, and Peru.
In regard to the first criterion, Chile is an example of an education system that has high expectations regarding performance levels, which are called learning standards: their function is to make schools accountable for the learning of their students. In this sense, the performance levels describe a "non-waivable minimum" that is established as an aspiration in the official curriculum. In the interviews with ministry teams in Chile, our informants asserted that the alignment between performance standards and curriculum occurs naturally. This happens, we were told, because the professionals responsible for developing the official curriculum are also in charge of developing the qualitative descriptions of the performance standards. The teams consider this alignment achieved if the descriptions of the standards account for the previous steps or stages (either content or skills to be mastered) that must be met in order to achieve the curriculum objectives for the grade to be assessed. Since 2010, validation procedures have included input from teachers and discipline specialists who review the performance standards in light of the official curriculum. Although these validations are not intended to verify curricular alignment, they do show that the performance standards are understandable and that they effectively correspond to previous steps in the progression toward the objectives of the curriculum. Ministry professionals participate in these validations, ensuring that suggested changes are aligned with the curriculum. Finally, the National Council of Education in Chile evaluates and approves the performance standards before they are published. This evaluation is carried out with the help of discipline specialists and includes, among other criteria, the verification of the alignment between performance standards and the curriculum according to the judgment of national and international experts.8
In Mexico, the national curriculum is considered to be a broad set of official documents, which specify different areas of the formal curriculum and which are analyzed (in the procedure described above) for the purpose of designing tests. The standards or levels of performance, called "levels of achievement" in Mexico, are tools created by the INEE as categories for the analysis and communication of test results.
8 Obtained from Agreement 075/2012 of the National Education Council, available at http://www.cned.cl/public/Secciones/SeccionEducacionEscolar/acuerdos/Acuerdo_075_2012.pdf.
They were established in an academic seminar, convened by the INEE with the participation of national and international experts, where possible criteria for performance standards were considered, their appropriateness was evaluated, and recommendations were issued. The INEE's Directorate of Testing and Measurement, using recommendations from the academic seminar, finalized the levels of achievement and their basic descriptions. Curricular validation of these performance levels is obtained through the judgment of panelists as part of the Planea test design methodology. The test design specifies the analytical relationship between the curriculum and test items, and the use of test items to establish the cutoff scores that operationalize the levels of achievement. Performance levels are operationalized in assessments and are not present in the curriculum. Strictly speaking, what is validated in the case of Mexico is the interpretation of expert judgment with respect to the performance levels that are implicit in the national curriculum. In relation to the second criterion in Table 4.3, in Chile, Mexico, and Peru the domains of assessment are established with some consideration of actual student learning. In all three, for example, student scores on national and, at times, international assessments are used to set cut scores between levels of student achievement. The preferred technique is a variant of the Bookmark method (Karantonis & Sireci, 2006): teams are given booklets of items ordered by levels of empirical difficulty (i.e., according to student scores on the items). Different levels of performance and the items (and scores) that mark the cutoff points for each level are identified. The concern for creating performance levels, or standards, arises primarily from the agencies in charge of national assessments, rarely from the units responsible for the curriculum. Reporting by performance levels also represents a new challenge in terms of communication: moving from reporting "percentage of correct answers", or "scores" on tests, to reporting the percentage of students who reach different levels of performance. For example, in 2011 in Peru, there were two performance levels: Level 2 (students who have supposedly achieved the expected learning objectives for the grade) and Level 1 (students who are still in the process of achieving the expected learning for the grade). In Chile, three levels of performance are reported, and in Mexico there are four; in all three systems there is an additional implicit level: those students who do not reach the lowest level of achievement. The validation of the performance levels is mainly done based on evidence from test items. Therefore, the validation of the lowest implicit level is extremely problematic. This has important implications for the validity of the full range of performance levels. For example, in the 2011 Student Censual Assessment (ECE) in Peru (Fig. 4.2), 51% of students were below Level 1 in Mathematics (Ministry of Education, 2012). This meant that the performance levels established in the test did not give information about more than half of the students, which called into question the curricular validity of the levels of performance (i.e., the validity of the assessment in reference to what is actually taught and learned in the classrooms). Additionally, these results raised some important questions regarding equity in school education. In LA (and elsewhere), the poorest and most marginalized students and schools have the lowest levels of performance.
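Purely as an illustration of the mechanics just described—Bookmark-style placement of cut scores on items ordered by empirical difficulty, and the resulting share of students who fall below the lowest level—the following sketch uses simulated data. All difficulties, bookmark placements, and student abilities are hypothetical and do not correspond to any of the three national programs.

```python
# Stylized sketch of the Bookmark logic: under a Rasch model, each item's mapped
# location is the ability at which a student has a 67% chance of answering it
# correctly (RP67). Items are ordered by this location; a panelist's bookmark
# placement after item k yields a cut score at that item's location.
# All numbers below are simulated and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
item_difficulties = np.sort(rng.normal(loc=0.6, scale=1.0, size=40))  # hypothetical

RP = 0.67
rp_locations = item_difficulties + np.log(RP / (1 - RP))  # theta where P(correct) = RP

# Hypothetical bookmark placements (index of the item at which each bookmark is set).
bookmarks = {"Level 1": 10, "Level 2": 26}
cut_scores = {level: rp_locations[idx] for level, idx in bookmarks.items()}

# Hypothetical student ability estimates; note how many fall below the lowest cut.
thetas = rng.normal(loc=-0.3, scale=1.0, size=5000)
below_lowest = np.mean(thetas < cut_scores["Level 1"])

print({level: round(cut, 2) for level, cut in cut_scores.items()})
print(f"Share of students below Level 1: {below_lowest:.1%}")
```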
Fig. 4.2 Example of reporting in the 2011 Student Censual Assessment (ECE) in Peru, at the national and regional levels: Level 2, Achieved expectation (these students solve mathematical problems as expected for the grade); Level 1, Did not achieve expectation (these students solve simple mathematical problems); Below Level 1, Did not achieve expectation (these students have difficulties even in solving simple mathematical problems)
In the case of Peru, performance levels—and
therefore the test itself—were not designed to give information about the most vulnerable populations in the education system. In post-2011 assessment reports in Peru, student scores were sorted into three performance levels: Satisfactory, In Progress, and Early Stage, possibly in recognition of this problem (Ministry of Education, 2017). Students who were identified as "Below Level 1" in 2011 are now categorized as "Early Stage". Even so, the quality of information about this group is much lower than that for the higher levels. This situation is very common in LA, where one can find plentiful examples of tests with performance levels where about a quarter (or, at times, much higher percentages) of students do not reach the lowest achievement level and where, consequently, the levels show little face validity because they were not designed to give information about large populations of students (i.e., the tests are skewed toward measuring higher performance levels).9 This suggests that test equating is done under conditions where high proportions of students skip or fail to answer a significant number of items. This is problematic for stable comparisons and is especially questionable when assessments are to be used to monitor year-to-year improvements in lower levels of performance. It is also a shortcoming for LA countries participating in international tests such as PISA and the Trends in International Mathematics and Science Study (TIMSS) (Xu, 2009).
9 One possible reason for the apparent acceptance of performance-level definitions that exclude information about such high proportions of students may be the influence of international testing. Tests such as PISA or TIMSS focus primarily on higher levels of achievement, and large proportions of students from participating LA countries fail to reach the lowest performance level.
4.3.3 Dimension of Consequential Validity of Assessments
In all three countries, much is expected from the national assessments. As assessment systems become more consolidated, there is more research that seeks to gather evidence about their impact. It is expected that the communication of assessment results will generate a positive impact on the quality of learning; hence the importance of the criteria presented in Table 4.4. The first criterion of this dimension concerns the effective communication of assessments. One of the most important objectives of the dissemination of results, common to all three countries, could be called the "didactic" objective. It deals with explaining the assessment system to the audience, promoting access to the information, and teaching the public to read and interpret assessment results. This objective is high on Chile's priority list, even though the country has the longest assessment history of our three cases. Each major change—for example, the introduction of scaled scores on a new scale (resulting from improvements in psychometric calibration and equating), the use of performance levels, and other innovations—requires instructing the public on the accurate interpretation of the results and on making correct inferences about the quality of schooling. Studies on the use of assessments in Chile (Ramírez, 2012; Taut et al., 2009), in Mexico (de Hoyos et al., 2017; Organization for Economic Cooperation and Development [OECD], 2013) and in Peru (Sempé et al., 2017) address this important objective and, overall, conclude that progress has been made in generating public confidence in the assessment system and in evaluating the usefulness of its results. A recent study of the use of ECE reports in Peru (Sempé et al., 2017) points to the challenges that the reports and the overall communication strategy must overcome: the public in general, and educational stakeholders in particular, still find it difficult to understand the reports and to interpret even the simplest tables and graphs. Teachers and administrators often attribute suboptimal results to other actors (parents, etc.), and therefore the potential impact of assessments on their actions is mitigated by low accountability. As is the case in other parts of LA, Peru has no coherent education policy structures that would encourage using assessment information for political or administrative decision-making in education. Assessment results compete with other messages and intervention strategies that at times contradict the information generated by the assessments and communicated to the public. Another challenge that Peru shares with many LA countries is that many initiatives and new policies aimed at improving schooling outcomes are being implemented simultaneously, so any effort to isolate the specific impact of the assessment system is rendered futile. There is a noteworthy study of the reports on SIMCE results, commissioned from a university by the Chilean Ministry of Education. Through surveys of nationally representative samples and in-depth interviews, the study investigates how teachers, principals, and parents access, value, understand, and use SIMCE results. The study shows that teachers and administrators face no obstacles in accessing the information, but parents do. Stakeholders appreciate the clarity and usefulness of the information; however, they have difficulty remembering and understanding basic facts related to
Table 4.4 Impact of assessments

Validation criterion: Assessment results are effectively communicated
Chile: Assessment results are disseminated through various media for different audiences, including the official website, national reports, and reports for educational institutions, among others. By law, the assessment has consequences for schools, which are classified into four performance categories; this categorization is associated with a system of strong incentives and sanctions. There are no consequences for students. A study of access, appreciation, comprehension, and uses of assessment information has shown that teachers and administrators generally have access to SIMCE results, but have difficulty interpreting them and using them for improvement.
Peru: Results of the national census-based tests are disseminated through reports at the national and regional levels for educational administrators, teachers, and parents or guardians. By ministerial decree, results are disseminated at a national workshop. Assessment results have no consequences for students, teachers, or schools. In 2014, a survey was conducted on the access to and uses of ECE results reports; an additional study conducted in 2017 was devoted to the use of assessment reports in the education system.
Mexico: The results are published on the INEE website and disseminated through a Bank of Education Indicators (a national system of education indicators available since the 2004–2005 cycle) and the annual national report titled "Educational Landscape of Mexico". Assessment results have no consequences for students, teachers, or schools. There is no systematic evidence about the effectiveness of communication regarding assessment results.

Validation criterion: There are formal mechanisms to support the use of assessments to improve learning
Chile: SIMCE results are systematically used to inform decision-making at the national level and to evaluate the impact of intervention programs. However, a stronger assessment culture is needed at the regional, provincial, municipal, and school levels. Assessment results are also used for school accountability.
Peru: The evidence is mixed. A study on the uses of the reports suggests that key stakeholders perceive and value the assessments positively, but it also indicates that many stakeholders do not take responsibility for the results, attributing them to factors and agents outside their control.
Mexico: The intention of the assessment is for schools and teachers to take the diagnostic information provided by the test and develop roadmaps for improving school and teaching practice. There is little systematic evidence that the results are used in the intended manner.

Validation criterion: There are formal mechanisms to monitor the impact of assessments on the education system
Chile: There are some formal mechanisms, such as external research funds that finance studies in this area. SIMCE commissions have also contributed to generating evidence on the consequences of assessments.
Peru: There are no formal mechanisms to monitor consequences. No studies were found on the consequential validity of the assessments.
Mexico: There are no formal mechanisms to monitor consequences. There are isolated studies that analyze the impacts of the assessments.

Note: Prepared by the authors. The table presents a comparison of the consequential validity of the assessments by validation criteria for the cases of Chile, Mexico, and Peru.
the assessment. These results are especially relevant, since the reports are intended to be used for school improvement (Taut et al., 2009). There is much work to be done regarding the second criterion of consequential validity: the existence of formal mechanisms that support the use of assessments for improvement. In Chile, Mexico, and Peru alike, assessments have allowed education systems to focus on student learning. The results are increasingly used to inform policy. Chile is an outlier among our cases, since its assessment is the only high-stakes one. It is possible that this is the reason why the country boasts a considerable volume of national research on the impact of assessments. The use of SIMCE results for sorting schools into different performance categories is institutionalized and mandated by law. Such research endeavors are usually funded from sources, either national or international, that are unrelated to SIMCE. The Chilean Ministry of Education has prioritized research and the external audit of assessment procedures. A notable example is the SIMCE Task Force, whose report (Ministry of Education, 2015) outlines both the achievements and the challenges the country faces in regard to the impacts of the assessment. The report claims that, although performance levels represent an important advance in terms of giving SIMCE assessments greater pedagogical meaning, mechanisms are still lacking to facilitate their use by teachers. There is still no satisfactory evidence that the results are used to improve the curriculum implemented in the classroom. The SIMCE Task Force also notes that there is not yet sufficient evidence of how the introduction of performance levels influences priorities at the school level. Will schools focus their efforts on serving students with the lowest levels of achievement? Or, rather, will they continue to give equal attention to students performing at all levels? Regarding the third criterion, formal mechanisms to monitor the impact of assessments on the education system in general, and on learning in particular, are still weak. We failed to find any studies on Peru that address the issue. During our interviews with informants, we collected some preliminary evidence of the positive impact
of the assessment. However, it is methodologically impossible in all three countries to isolate the impact of the assessment, given the context of multiple simultaneous interventions. A study conducted in the state of Colima, Mexico (de Hoyos et al., 2017), argues that simply publishing the results of the assessment would have a positive effect on student learning. However, the study does not investigate the mechanisms through which this impact would occur. In Chile, learning assessment has been highlighted as one of the factors that may have helped improve the quality of education. The country is considered one of the most improved educational systems and one that has significantly improved student learning. It is claimed that such improvements may have stemmed from a system of learning assessment that is methodically used to inform decision-making (Ramírez, 2012). However, some studies on the impact of accountability policies in Chilean schools are inconsistent with this argument. For example, a recent report by the Economic Commission for Latin America and the Caribbean (ECLAC) (Santos & Elacqua, 2016) states that the use of SIMCE results for accountability does not seem to affect parental decisions regarding which school their child will attend. Another study, by Elacqua et al. (2016), suggests that the assessment is not associated with improved teaching practices. At the global level, it has been widely documented that high-stakes assessments can generate undesirable effects. One commonly documented consequence is the narrowing of the curriculum, of teaching practices, and of classroom evaluation. Curriculum policy in Chile (Ministry of Education, 2014) anticipates these possible problems of consequential validity, seeking to align and coordinate all of its curricular instruments and other policies with the curricular framework and the current curriculum. It also seeks to avoid curricular narrowing by developing assessments for a wide range of disciplinary areas, and by attempting to strengthen the assessment system to prevent "teaching to the test", which primarily means developing a large bank of high-quality test items.
4.4 Achievements, Challenges, and Implications for a Validation Agenda
Assessments originate from the idea that any assertion about the quality of education should be empirically verifiable. Validation, in turn, is the procedure for verifying the quality of the assessment system and its tools. Although the debate about the quality of assessment instruments in LA has been going on since the first assessments were introduced, it is only recently that the discussion has shifted toward the formal and technical aspects of instrument validation. This concern is undoubtedly the result of the evolution and consolidation of education assessment and, especially, of the continual strengthening of curriculum and assessment policies in the region. Naturally, this prioritizes the question of the
curricular validity of assessments and invites an examination of existing practices, an appraisal of achievements, and an identification of the challenges to be faced. This chapter explored the evidence of validation of existing learning assessments in three countries of the region: Chile, Mexico, and Peru. To this end, it focused on three dimensions related to the curricular validity of the assessments: (1) the alignment of the tests with the official curriculum, (2) the curricular validity of the performance levels used to report test results, and (3) the consequential validity of the assessments. First of all, we note the rise of efforts to develop tools and mechanisms that provide evidence of the curricular foundation of the assessment. Faced with the challenging task of measuring a curriculum that was not designed to be assessed, the INEE in Mexico, with its rigorous methodology of curricular analysis, serves as an important example for the region. The analysis performed on the curriculum documents requires focused efforts from all the committees involved and results in a set of measurable learning outcomes and well-defined performance levels. These are further subject to review by external audiences that can appraise their curricular validity. We have also observed the emergence of levels of performance (often called performance standards or levels of achievement); these are notable instruments aimed at improving evidence of curricular validity. The interest in these tools seems to have arisen in particular from two important influences: the way in which results are reported in international assessments, on the one hand, and the efforts of the assessment technical teams to find "measurable versions" of curricula that are difficult to operationalize, on the other. Standards and levels of performance have become more common in LA and are beginning to form part of the basic policy tools found in the articulation between curriculum and assessment. The formulation and monitoring of procedures for assessment design and implementation based on principles of technical rigor, good international practices, and state-of-the-art technology is also an important achievement. This includes not only the introduction of performance levels but also the use of sound psychometric procedures in the calibration and equating of tests to ensure year-to-year comparability, as well as careful documentation of the technical and theoretical rationale for all procedures followed in the development of assessment instruments. These are all essential steps toward the consolidation of what many observers have called an "assessment culture", which requires transparency and public monitoring of assessment procedures. The official curriculum is a formalization of national objectives in schooling, and therefore evaluations of curricular validity should include the informed participation of all stakeholders, most importantly teachers, parents, and guardians. One challenge that remains is the consideration of the validity of the tests with respect to the implemented curriculum (sometimes called instructional validity). Here two difficulties stand out in particular. First, the discussion around assessments at times does not address the question of how well the tests reflect the learning opportunities that actually occur in the classroom. In some cases, classroom teachers are not invited to provide input on the issue of curricular validity.
Second, the curricular validity of performance levels that fail to capture the learning of a large number of the most vulnerable students is clearly problematic. Performance levels that provide no valuable information about a quarter or more of students must be corrected. In some
cases, the lowest performance levels have specific descriptions of what students have actually learned, rather than simply a list of the curricular areas that they have yet to master. In these cases, the usefulness and curricular validity of the assessment are greater: they make it possible to track performance with respect to curricular expectations. Yet most LA countries do not approach the lowest performance levels in this way. This problem can be aggravated by some important factors. For example, national assessment systems in LA are often judged against their countries' results on large-scale international assessments. Typically, even the top LA performers in international tests receive some of the lowest scores among all participating countries. LA countries have a large proportion of students who do not reach satisfactory levels on assessments such as PISA or TIMSS. When international reports show that large proportions of students are below the lower performance levels, national audiences may interpret national assessment results with suspicion. Although home-grown performance levels better describe achievement in relation to the official curriculum, they often do not appear to be aligned with international test results. The challenge for national assessment systems is to explain to the public how a test with higher levels of validity with respect to the official country curriculum does not contradict international tests, but rather provides supplementary information that is useful for policy making. The United Nations' Sustainable Development Goals (UNESCO, 2017) will also come into play in the alignment of performance levels with official curricula. International agencies, and especially the development banks and donors that provide financial assistance and technical expertise to assessment systems in LA, should ensure that countries design performance levels that are aligned with their national curricula and that also serve to inform the Sustainable Development Goals. A technical solution is also needed to address the evolution of curriculum policy in LA. In a dynamic educational policy context, in which curriculum and educational standards are regularly reformed, new challenges arise. How will the results of assessments be comparable over time when the curriculum changes? Countries outside the region have resolved this issue in some cases by sacrificing comparability, in others by maintaining some constant curricular standards, and in yet others by adjusting only some standards to reflect changes in the curriculum. The comparable time series is longer when curricular standards are kept constant. It is also necessary to move from a procedural approach to assessment validation (verifying whether the procedures for developing assessments have been followed faithfully) to a model of internal and external audits of the results. The formation of SIMCE commissions in Chile, the studies of the use of assessment reports in Peru, and the validation panels in Mexico are examples of best practices in this area. But such efforts are still nascent, and many countries in LA are reluctant to undertake this type of audit for many reasons. First of all, nations may hesitate to allocate the time, labor, and money required to create and organize the documentation needed for such an external audit. There is also a fear that audits would result in dangerous summative
judgments for assessment systems that are still developing, often politically vulnerable, and perhaps poorly equipped to take external audit findings and incorporate them into improvement strategies. In regard to the latter, we should keep in mind that curricular validity is an idea that has only recently started to rise to prominence in LA assessment systems. It is a new concern that emerges as part of this historical moment of evolution and consolidation of national assessment systems. In our discussions with education administrators and technical team members, genuine concerns were voiced about the costs and potential negative effects of validation practices that go beyond ensuring that processes are followed faithfully. There is fear that further efforts could be very costly. Such concerns are, without doubt, justified: every validation procedure comes at a price. Interestingly, none of our informants mentioned another important cost: that of making a wrong policy decision based on invalid inferences about the teaching and learning levels of students. This cost, widely studied in economics and policy analysis, is also called the cost of "Type 3 error" (Mitroff & Featheringham, 1974), the error of trying to solve the wrong problem. Since we rarely have an accurate monetary estimate of the expenses associated with making a wrong decision, it is easy to dismiss that risk in contrast to the budget of a validation study, which can be estimated more accurately. However, the value of assessments for educational improvement rests on the assumption that their results provide valid information about how well students have mastered certain areas of the official curriculum. If we have little evidence of the validity of these inferences, the risk of making a Type 3 error is higher. This threat, historically, has not received attention in LA. Learning assessments are not an end in themselves; they are valuable to the extent that they contribute to making better decisions that foster student learning. For this to happen, it is essential to have evidence of validity, especially curricular validity. The three case studies presented in this chapter shed light on the existing evidence of validity in different LA countries. They also illuminate the path that other nations can take in order to gather evidence of curricular validity, which is fundamental for improving the technical quality of assessments. Most importantly, it is fundamental to ensure the political feasibility and public credibility of national assessments.
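The cost asymmetry discussed above can be made concrete with a stylized calculation. The figures below are entirely hypothetical and are included only to illustrate why a validation budget that looks large in isolation may be small relative to the expected loss from acting on invalid inferences.

```python
# Stylized illustration of the Type 3 error argument (all figures hypothetical).
validation_study_cost = 250_000          # hypothetical validation budget, USD
misdirected_policy_cost = 50_000_000     # hypothetical cost of a policy aimed at the wrong problem
p_type3_without_validation = 0.10        # hypothetical probability of acting on invalid inferences
p_type3_with_validation = 0.02           # hypothetical residual probability after validation

expected_loss_without = p_type3_without_validation * misdirected_policy_cost
expected_loss_with = p_type3_with_validation * misdirected_policy_cost + validation_study_cost

print(f"Expected loss without validation: {expected_loss_without:,.0f}")
print(f"Expected loss with validation:    {expected_loss_with:,.0f}")
```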
References
Burstein, L., Aschbacher, P., Chen, Z., & Sen, Q. (1990). Establishing the content validity of tests designed to serve multiple purposes: Bridging secondary–postsecondary mathematics. University of California Los Angeles, Center for Research on Evaluation, Standards and Student Testing.
Chile. Ministry of Education. (2014). Standard foundations of mathematical learning, language and communication Reading II Medium. Working paper. Unidad de Currículum y Evaluación.
Chile. Ministry of Education. (2015). Towards a complete and balanced system of learning assessment in Chile: Report of the task force for the revision of the SIMCE. Retrieved from https://www.mineduc.cl/wp-content/uploads/sites/19/2015/11/Informe-Equipo-deTarea-Revisi%C3%B3n-Simce.pdf
de Hoyos, R., Garcia-Moreno, V., & Patrinos, H. (2017). The impact of an accountability intervention with diagnostic feedback: Evidence from Mexico. Economics of Education Review, 58, 123–140. https://doi.org/10.1016/j.econedurev.2017.03.007
Directorate of Testing and Measurement. (2005). Technical manual for the design of tests of educational quality and achievement. National Institute for the Evaluation of Education (INEE).
Elacqua, G., Martínez, M., Santos, H., & Urbina, D. (2016). Short-run effects of accountability pressures on teacher policies and practices in the voucher system in Santiago, Chile. School Effectiveness and School Improvement, 27(3), 385–405. https://doi.org/10.1080/09243453.2015.1086383
Ferrer, G. (2006). Learning assessment systems in Latin America: Balance and challenges. Program for the Promotion of Education Reform in Latin America and the Caribbean (PREAL).
Ferrer, G., & Fiszbein, A. (2015). What has happened to learning assessment systems in Latin America? Lessons from the last decade of experience. World Bank Group.
General Education Act No. 20370. Official Journal of the Republic of Chile. Santiago de Chile, August 17, 2009.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342. https://doi.org/10.1111/j.1745-3984.2001.tb01130.x
Karantonis, A., & Sireci, S. G. (2006). The bookmark standard-setting method: A literature review. Educational Measurement: Issues and Practice, 25(1), 4–12. https://doi.org/10.1111/j.1745-3992.2006.00047.x
Mitroff, I., & Featheringham, T. (1974). On systemic problem solving and the error of the third kind. Behavioral Science, 19(6), 383–393. https://doi.org/10.1002/bs.3830190605
Organization for Economic Cooperation and Development. (2013). Education policy outlook: Mexico. OECD Publishing.
Peru. Ministry of Education. Office for measuring the quality of learning. (2012). Student census evaluation 2011: Results report for regional government authorities and specialists. Retrieved from http://umc.minedu.gob.pe/wp-content/uploads/2017/07/Informe-ECE-2007-2015-1.pdf
Peru. Ministry of Education. (2016). National elementary education curriculum. Retrieved from http://www.minedu.gob.pe/curriculo/pdf/curriculo-nacional-de-la-educacion-basica.pdf
Peru. Ministry of Education. Office for measuring the quality of learning. (2017). Report on the results of the student census evaluation 2007–2015. Series of evaluations and associated factors. Retrieved from http://umc.minedu.gob.pe/wp-content/uploads/2017/07/Informe-final_ECE-2007-2015-vfinal.pdf
Ramírez, M. J. (2012). Disseminating and using student assessment information in Chile. The World Bank.
Ramírez, M. J., & Valverde, G. A. (2019). How to ensure the validity of national learning assessments? Priority criteria for Latin America and the Caribbean. In J. Manzi, M. R. García & S. Taut (Eds.), Validity of educational assessment in Chile and Latin America. Ediciones UC.
Santos, H., & Elacqua, G. (2016). Socioeconomic school segregation in Chile: Parental choice and a theoretical counterfactual analysis. ECLAC Review, (119), 123–137. Retrieved from https://repositorio.cepal.org/bitstream/handle/11362/40792/RVI119_Santos.pdf?sequence=1&isAllowed=y
Sempé, L., Andrade, P., Calmet, F., Castillo, B., Figallo, A., Morán, E., & Verona, O. (2017). Final report: Evaluation of the use of census reports by students at school. Proyecto Fortalecimiento de la Gestión de la Educación en el Perú (FORGE).
Sireci, S., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. https://doi.org/10.7334/psicothema2013.256
Stobart, G. (2001). The validity of national curriculum assessment. British Journal of Educational Studies, 49(1), 26–39. https://doi.org/10.1111/1467-8527.t01-1-00161
Taut, S., Cortés, F., Sebastian, C., & Preiss, D. (2009). Evaluating school and parent reports of the national student achievement testing system (SIMCE) in Chile: Access, comprehension and use. Evaluation and Program Planning, 32(2), 129–137. https://doi.org/10.1016/j.evalprogplan.2008.10.004
United Nations Educational, Scientific and Cultural Organization. (2017). SDG 4 Data Digest 2017. The quality factor: Strengthening national data to monitor Sustainable Development Goal 4. UNESCO Institute for Statistics.
Valverde, G. A. (2000). Justified interpretation and appropriate use of measurement results. In Next steps: Where and how to advance learning assessment in Latin America? Program for the Promotion of Education Reform in Latin America and the Caribbean (PREAL).
Valverde, G. A. (2009). Standards and evaluation. In C. Cox & S. Schwartzman (Eds.), Educational policies and social cohesion in Latin America (pp. 57–88). Uqbar Editores.
Xu, Y. (2009). Measuring change in jurisdiction achievement over time: Equating issues in current international assessment programs. Ontario Institute for Studies in Education of the University of Toronto.
Gilbert A. Valverde holds a degree in Philosophy from the University of Costa Rica and a Ph.D. in Comparative and International Education from The University of Chicago, USA. He is currently Interim Dean of International Education and Vice-Provost for Global Strategy at the University at Albany, State University of New York, where he is also a researcher and professor in the Department of Educational Policy and Leadership. His areas of interest include the international comparative study of assessment policy and curriculum, educational measurement, and large-scale international testing. Contact: [email protected].
María José Ramírez is a psychologist from the Catholic University of Chile and holds a doctorate in Education from Boston College, USA. She is currently an international consultant in education, with a focus on assessment and the quality of education. Contact: [email protected]
Chapter 5
Learning Progress Assessment System, SEPA: Evidence of Its Reliability and Validity
Andrea Abarzúa and Johana Contreras
5.1 Introduction
In the educational measurement field, a distinction is often made between high- and low-stakes assessments. The former are oriented toward certification or selection into academic programs, while the latter are not tied to grades or certification and are used as a source of information for monitoring actions and planning future interventions. The Learning Progress Assessment System (SEPA, from the acronym in Spanish of Sistema de Evaluación de Progreso del Aprendizaje) presented in this chapter corresponds to this second type of assessment (American Educational Research Association et al., 2014; Ravela, 2006; Roeber & Trent, 2016). In a broad sense, assessment plays a central role in carrying out the social functions of any school system, particularly that of social distribution (Dubet, 2011; Perrenoud, 1996). Selection is a task that has lain with education systems since their origins and has been reinforced by the massification of education, since the educational institution is the one responsible for judging the academic skills that, at the end of schooling, are expected to allow access to different positions in the labor market (Duru-Bellat, 2009). In high-stakes assessments, this selective role is evident: the University Selection Test (PSU, from the acronym in Spanish Prueba de Selección Universitaria) in Chile or the High School Grades (NEM, from the acronym in Spanish Notas de Enseñanza Media)—which summarize grades assigned by teachers—operate as filters for access to higher education, having a visible and direct impact on school and social trajectories (Morris, 2011; Ravela et al., 2017). However, standardized low-stakes assessments have consequences too, whether they are intended to provide feedback on teaching and learning in the classroom (Heritage, 2010;
Ravela et al., 2017) or to monitor the implementation of educational interventions or policies based on individual outcomes (Organization for Economic Cooperation & Development, 2013). As several authors have argued, educational measurement (and psychometrics) has actively participated in the development of scientifically based tools aimed at the selection and social construction of individual differences in school performance (Ramos, 2018). This is why, although the literature classifies assessments as "high" or "low" stakes, they are never exempt from stakes.
Standardized assessments, both high- and low-stakes, have expanded considerably in recent decades. This expansion has been driven by the objective of improving quality and equity in educational outcomes, which has been strongly promoted since the 1990s by international organizations (United Nations Educational, Scientific and Cultural Organization [UNESCO], OECD, European Commission, etc.). Quality and equity are, in fact, gradually beginning to be measured more in terms of learning outcomes than in terms of access to educational systems (Mons, 2009). In addition, consistent with accountability policies and the expansion of the New Public Management model, standardized assessments have been used in many contexts for control, accountability, school ranking, family choice of school, and teacher evaluation purposes (Darling-Hammond, 2007; Falabella & de la Vega, 2016; Koretz, 2010; Mons, 2009). Faced with these uses, a critical movement has emerged in the specialized literature and in civil society, warning about the risks and lack of effectiveness of these test uses. Without distinguishing between high- and low-stakes assessments, the main actors in education—namely, pedagogical and management teams, students, and guardians—may become critical of and resistant to external assessments.
For all of the above reasons, the international Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014) urge assessment programs to be concerned with the uses and interpretations of test results and point out that consequences are a central element in defining the level of rigor with which the technical quality of an instrument should be reviewed. Accordingly, the first and most fundamental standard is the articulation of the interpretations that are planned to be made based on the scores derived from a test, and the provision of evidence that supports the validity of each of these interpretations. In this way, it can be determined to what extent the test is used according to its objectives.
In this context, this chapter addresses the reliability and validity of learning assessments. The questions this chapter expects to answer are the following: What rigorous processes allow us to ensure the quality of an educational assessment instrument? What actions can an assessment program take to ensure that it is as fair as possible and that the information it delivers is consistent with its intended uses? What actions should a program take to make sure that the assessment is used for the purposes for which it was designed? We will answer these questions based on the experience of the Learning Progress Assessment System (SEPA), developed by the MIDE UC Measurement Center at Pontificia Universidad Católica de Chile.
Via this case, we will go through what we call the validation process, illustrating the procedures that this program implements to ensure that its assessments, although considered low-stakes, are reliable and valid.
Firstly, the SEPA program will be presented through the description of its main components, objectives and expected impacts, and the actions undertaken to achieve them. Secondly, the program’s validation agenda will be presented, distinguishing routine procedures aimed at collecting evidence of reliability, fairness, and validity from time-spaced studies aimed at collecting larger empirical data. Finally, in the conclusion we comment on the contributions of the described validation actions and the challenges ahead.
5.2 Learning Progress Assessment System, SEPA
SEPA has been developed since 2007 by the MIDE UC Measurement Center of the Pontificia Universidad Católica de Chile. It consists of a set of standardized tests aimed at measuring the level of student learning with respect to the current curriculum. The program, as a whole, also includes advisory and accompaniment actions for users to manage and use learning results. SEPA's objective is to provide teachers, management teams, and school administrators and, secondarily, guardians and students, with reliable, timely, and useful information for making pedagogical and management decisions. The program emerged in response to the need, perceived by SEPA's founders, for Chilean schools to have assessments that provide detailed information on student learning at one point in time, but also on progress measured between two different time points. Moreover, this demand, posed directly by some educational foundations, arose in the context of educational policies aimed at monitoring schools' actions. SEPA currently comprises language and mathematics tests from first until 11th grade. Since its first experimental stage, SEPA has grown in both qualitative and quantitative terms. In its first version, in 2007, around 14,800 tests were administered to 7,400 students from 37 schools. In 2008, 101 schools were already affiliated with the program, and nearly 25,000 students were assessed. Of this total, 5,000 students (from 34 schools) had two available assessments, which allowed them to receive a complete report on Status, Progress, and Value Added. Of these 34 schools, 9 were municipal (public), 12 were private subsidized, and 13 were private non-subsidized1 (Hein & Taut, 2010). In the November 2017 version, a total of 104,500 tests were administered to evaluate approximately 50,900 students from 234 schools. Of these, 119 schools are municipal (51%), 66 are private subsidized (28%), and 49 are private non-subsidized (21%), distributed across 13 of Chile's 15 regions.
1 Editor's note: In Chile, there are three types of school administration: private non-subsidized, completely financed by families; private subsidized, corresponding to private property educational establishments financed with a State subsidy; and municipal, which are public property educational establishments financed with a State subsidy.
The measured construct is proficiency in the language and mathematics curriculum, which involves the analysis of the learning objectives, contents, and skills that, according to the Chilean Ministry of Education, students should achieve at each level of instruction. As far as item design is concerned, the format chosen by SEPA is mostly paper-and-pencil multiple-choice questions. Test administration is carried out in November (exceptionally in March, to assess the previous year's contents) by the schools themselves, guided by protocols and advice from MIDE UC, making sure that it takes place in the same period and under similar conditions in all the assessed classes. The distinctive feature of SEPA tests is that they are comparable between levels and years; that is, they allow estimating differences between examinees from both a cross-sectional and a longitudinal angle, offering the possibility of monitoring the learning of a cohort of students throughout their school trajectory. This type of comparison is achieved through the construction of tests that include common questions between two consecutive school levels (anchor items) and the use of statistical models based on Item Response Theory (IRT).2 These statistical analyses culminate in the reporting of learning outcomes as achievement percentages and as a standardized score presented on an incremental scale.3 More specifically, SEPA hands out three types of reports:
• Status reports: They present achievement results for each educational level and assessed area (thematic focus/skill). In addition, they provide detailed information by class and student regarding the degree to which the school year's learning objectives have been achieved. In this sense, the report combines a normative and a criterion-based assessment approach (Ravela, 2006).
• Progress reports: In simple terms, Progress is an estimate of the learning that a student has experienced between one assessment and another, obtained from the difference between the scores achieved in two consecutive tests, from one grade to the next. SEPA reports Progress by subject, at the individual and aggregate levels, by educational level and class. According to the literature, information on student Progress is very useful for teachers to make sense of learning and to make curricular adaptations that are more relevant to student needs (Wylie, 2017), because, while the Status report is an assessment of what the student has learned, the Progress report monitors what the student is learning (Briggs, 2017).
• Value Added reports: Value Added (VA) is an estimate of the contribution made by the school to student learning, obtained through a set of statistical techniques that seek to identify the unique effect of the school on student performance, above and
2 For readers interested in educational measurement and psychometrics, more details can be found in the SEPA technical report (available only in Spanish), which can be downloaded from http://sepauc.cl/zona-de-usuarios/recursos-informes-tecnicos/, or in the specialized literature, for instance, de Ayala (2009).

3 SEPA test results are reported on a vertical scale: for each subject, the eleven assessed grades are placed on one and the same score scale, and score progression reflects greater educational progress (that is, students' mean score in one grade is expected to be higher than the mean score in the immediately preceding grade).
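To make the anchor-item logic described above (and the vertical scale mentioned in footnote 3) more concrete, the following sketch shows one generic linking approach, mean-sigma linking of Rasch item difficulties between two adjacent grades. It is not SEPA's actual calibration procedure, and every item name and value is hypothetical.

```python
import numpy as np

# Hypothetical Rasch difficulty estimates from two separately calibrated
# adjacent-grade forms. Items "a1"-"a3" are the anchor items shared by both
# forms; the remaining items are unique to each grade.
grade_lower = {"a1": -0.40, "a2": 0.10, "a3": 0.65, "l_item1": -1.2, "l_item2": 0.3}
grade_upper = {"a1": -1.10, "a2": -0.55, "a3": 0.05, "u_item1": 0.9, "u_item2": 1.4}

anchors = ["a1", "a2", "a3"]
b_lower = np.array([grade_lower[i] for i in anchors])  # anchor difficulties, lower-grade scale
b_upper = np.array([grade_upper[i] for i in anchors])  # anchor difficulties, upper-grade scale

# Mean-sigma linking: find A, B such that b_lower ≈ A * b_upper + B,
# using the anchor items' means and standard deviations.
A = b_lower.std(ddof=1) / b_upper.std(ddof=1)
B = b_lower.mean() - A * b_upper.mean()

# Transform every upper-grade difficulty onto the lower-grade scale,
# producing a single (vertical) scale across the two grades.
grade_upper_linked = {item: A * b + B for item, b in grade_upper.items()}

print(f"Linking constants: A = {A:.3f}, B = {B:.3f}")
for item, b in grade_upper_linked.items():
    print(f"{item}: {b:+.2f} (on the lower-grade scale)")
```

SEPA links eleven grades per subject within an IRT framework; this two-grade example only illustrates how common items allow scores from different forms to be expressed on one incremental scale.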
Regarding the results and impacts anticipated by the program, SEPA expects users to handle the information in accordance with the uses it promotes. Based on the assessment-for-learning approach, the program aims for assessment information to be useful to different school community stakeholders (principals, teachers, students, peers, and guardians), seeking to improve the teaching-learning process and stressing feedback, given its known positive impact on academic performance (Hattie, 2008; Hattie & Yates, 2013; Heritage, 2010). Test results are expected to be complemented by other available assessments, constituting a body of empirical evidence upon which pedagogical and educational management decisions are made, and which triggers reflective processes in professional communities (Black & Wiliam, 2009; Lai & Schildkamp, 2013).

With regard to stakeholders, even though SEPA also includes visualizations of results for students and guardians, the declared expected uses focus on three institutional or individual stakeholders: administrators of public or private educational networks, leadership teams, and teaching teams. These stakeholders are expected to access the information and use it to make pedagogical or management decisions. Hence, those who administer a network of schools or a geographic area are expected to understand the information provided and use it to assess a whole process or cycle, starting with one school year, Progress over at least two consecutive years, and Value Added over two- or three-year cycles. This information should make it possible to follow up on the learning of a group of schools in the medium and long term.

At the level of management or leadership teams, a conceptual and reflective use is expected, that is, through triangulation of information, self-criticism, and the formulation of hypotheses about the multiple factors that can explain the results (with special focus on those that are controllable and on which the school community can have an influence). Additionally, instrumental use is expected, in the sense of data-based decision-making (Hein & Taut, 2010; Mandinach, 2012). Moreover, management teams are encouraged to reflect on the data and to generate favorable organizational conditions for a formative use of the assessment information.

In line with an understanding of learning that is consistent with the "incremental mindset" posited by Dweck (2006), a conceptual and reflective use is also expected at the classroom level. This entails that teachers have the necessary tools not only to adequately interpret the results, but also to use them formatively as a basis for teaching and learning feedback (Heritage, 2010).4
4 The professional team that supports users in the management and use of results has developed a conceptual model that substantiates its interventions, integrating the formative assessment approach with changes in beliefs regarding students' skills and the learning process (Dweck's mindset perspective). A shift is encouraged from beliefs that understand intelligence and learning as a genetic, individual, and static trait (fixed mindset) toward beliefs about learning as a social process and about skills as malleable and open to learning (growth mindset).
Ultimately, the program aims for all these actions and, in particular, the reflective processes they trigger, to have an effect on pedagogical practices and strategies, thereby accomplishing changes with an impact on student learning.

In the case of the SEPA tests, the validation agenda combines practical and conceptual needs and has two components: one of routine verifications that are executed as part of each year's procedures, and another of more widely spaced verifications that are executed outside the annual procedures, as they require a larger accumulation of information. Both agendas are described below. Some of the routine procedures that provide validity arguments about test content (construction and assembly) and internal structure are explained, and reliability and fairness corroborations are also presented.5 Then, occasional studies of relationships with other variables (SIMCE6 results) are described, and we refer to the agenda for collecting evidence on uses and consequences.
5.3 Routine Validation Agenda in SEPA

5.3.1 Arguments Based on the Content of SEPA Tests

As mentioned above, the definition of the construct assessed in SEPA involves a curricular analysis based on the learning objectives, contents, and skills expected for each grade. Following the recommendations of educational measurement and psychometrics (Lane et al., 2016; Wise & Plake, 2016), the operationalization of the construct to be measured is achieved through the elaboration of specification tables, designed and evaluated by disciplinary experts. These tables show the articulation between the assessed contents (by level, subject, and thematic focus7) and cognitive skills, serving as a basis for defining the number of questions per thematic focus that, according to their relative importance or weight in the current curriculum, guarantees an adequate representation of the contents (a small illustration follows the footnotes below). To paraphrase Koretz (2010), this reflects the sampling principle of testing: "Just as the accuracy of a poll depends on careful sampling of individuals, so the accuracy of a test depends on careful sampling of content and skills" (pp. 20–21).
5 The procedures described here are not all of those carried out regularly in SEPA; only the most relevant ones have been selected by way of illustration.

6 Editor's note: The Chilean national learning outcome assessment system, SIMCE, is a census-based assessment carried out every year by the Ministry of Education. For more details, see Meckes & Carrasco (2010).

7 A thematic focus is a group of specific contents under a broader topic. For instance, in mathematics, the focuses reported by SEPA are: numbers, algebra, geometry, statistics, and probability. In language, the focuses are: explicit information extraction, implicit information extraction, global sense, communicative situation and structural elements, and language knowledge and resources.
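As a rough illustration of how a specification table can translate curricular weights into item counts per thematic focus (the weights, focus names, and test length below are hypothetical, not SEPA's actual blueprint), a proportional allocation with largest-remainder rounding could look like this:

```python
import math

# Hypothetical curricular weights for the thematic focuses of a mathematics test.
weights = {"numbers": 0.35, "algebra": 0.25, "geometry": 0.20,
           "statistics": 0.12, "probability": 0.08}
test_length = 40  # total number of items in the form (hypothetical)

# Proportional allocation with largest-remainder rounding so that the
# counts add up exactly to the intended test length.
raw = {focus: w * test_length for focus, w in weights.items()}
counts = {focus: math.floor(x) for focus, x in raw.items()}
leftover = test_length - sum(counts.values())
for focus, _ in sorted(raw.items(), key=lambda kv: kv[1] - math.floor(kv[1]), reverse=True)[:leftover]:
    counts[focus] += 1

print(counts)  # {'numbers': 14, 'algebra': 10, 'geometry': 8, 'statistics': 5, 'probability': 3}
```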
In order to ensure the quality of the instrument and to gather validity evidence based on content, procedures are carried out that minimize construct under-representation (i.e., the instrument fails to capture relevant aspects of the construct) and the introduction of construct-irrelevant variance (i.e., the instrument measures more or different aspects than intended) (American Educational Research Association et al., 2014). These procedures, upon which test assembly rests, consist of successive item revisions by assessment specialists and curriculum experts. As suggested by the specialized literature (Lane et al., 2016; Rodríguez, 2016), these revisions are part of an iterative development process that starts with the selection and training of item developers (classroom teachers with experience in assessment), followed by advice during question development and design, direct revision by specialists of the SEPA team, and revision by the technical commission (SEPA specialists and a team of developers). These revisions are based on explicit criteria that consider the subject content, in terms of its conceptual accuracy and curricular alignment; the context that serves as the stimulus for the question (which must be relevant, appropriate, and free of bias against subgroups of the examined population); the quality and appropriateness of the item stem; the clarity of the correct answer (which must be unique and the best possible answer); the distractors' properties (plausible, at the same logical level, not subsumed in another option, targeting relevant mistakes, among others); and the coherence between questions, skills, and assessment indicators. Subsequently, the items that pass these stages are submitted to a revision by experts with extensive experience in the subject and in educational assessment, who decide on the continuity of each item among three possibilities: approval, approval with modifications, or rejection.8
5.3.2 Validity Evidence on Internal Test Structure

Test internal structure, as a type of validity evidence, obeys a simple and fundamental assumption: the validity of the uses of scores obtained in any assessment can only be asserted as long as there is a correspondence between the attributes to be measured and participants' observed scores. In simple terms, if SEPA tests were developed to measure knowledge and abilities in language and mathematics, it would be expected that test takers' responses are reasonably predicted by that knowledge and those skills, and not by other factors (fatigue, knowledge in other domains, etc.). The Standards (American Educational Research Association et al., 2014) state that if a test is designed for an essentially one-dimensional interpretation, this must be supported by a multivariate statistical analysis, such as factor analysis.
8 This process is afterwards complemented with the collection of empirical information about the items' psychometric properties, tested on a representative sample of the population of SEPA users. Based on these field-trial results, the items are approved or rejected for use in the final tests.
This is the case with SEPA, in which each subject test relies on the assumption of the existence of a single construct: mathematical or language (more specifically, reading comprehension) performance or ability. In this regard, Haladyna (2016) points out that among the analyses required for the interpretations of a test to be valid are those aimed at determining whether the assessed performance or skill consists of a single dimension or is composed of various dimensions, and whether a total score is sufficient or sub-scores are required in order to arrive at more valid interpretations of the assessed variable.

The purpose of factor analysis is to define the underlying structure of an observed data set, with the advantage of dealing with all variables simultaneously. Such an analysis should show that the variability of the observed scores is attributable to the dimension or dimensions that constitute the measured construct (Brown, 2006). As a technique, it is simple to implement, since it basically requires a set of responses representative of the population to be measured, together with a clear conceptual delimitation, which comes from the instrument specifications and from the actions taken during the development process to ensure that its content corresponds to a specific construct. In the case of SEPA tests, this analysis is carried out under a confirmatory rather than an exploratory approach. In the exploratory approach, the number of factors, as well as the relevance of the items to each factor, is undetermined; therefore, the estimation methods' task consists of establishing the best solution for a given data set (Brown, 2006). In the confirmatory approach, both the number of factors and the relationship of the items to each of them are specified in advance; hence, it is possible to study the extent to which the hypothesized structure of each test, in this case a one-dimensional structure, is confirmed.

As mentioned earlier, SEPA tests are based on a one-dimensionality assumption for each assessed subject. This means that they have been developed to assess a unitary construct, which is, in this case, the level of learning at a given educational level and subject. Thus, in these tests, scores are expected to lead to a single ranking of examinees, in which those with higher scores are those who, to a greater extent, have gained the knowledge that students are expected to achieve at each education level. In this way, we evaluate the degree to which the empirical information supports the claim that we are measuring one-dimensional constructs. Tables 5.1 and 5.2 below show the results of these analyses for the SEPA 2016 tests. They show that, at all levels, the one-factor solution has indicators that account for a good fit.9
9 The revised fit indicators are the absolute χ2 statistic and the comparative statistics CFI, TLI, and RMSEA associated with a one-factor model. By way of reference, the CFI and TLI indicators should always be larger than 0.90 (Hu & Bentler, 1999) and ideally above 0.95 (Rutkowski & Svetina, 2014; Schermelleh-Engel et al., 2003; Yu, 2002). RMSEA should be below 0.05 (Schermelleh-Engel et al., 2003; Yu, 2002). The p-value of the χ2 statistic should be larger than 0.05 to be considered acceptable, but this statistic is highly influenced by the large statistical power of big samples. For this reason, even though it is routinely estimated, its value is not considered substantively important for conclusions about test dimensionality.
Table 5.1 SEPA language tests' fit indicators

Level   χ2 p-value   CFI    TLI    RMSEA
1       0.00         0.96   0.95   0.03
2       0.00         0.96   0.96   0.03
3       0.00         0.96   0.96   0.03
4       0.00         0.98   0.98   0.02
5       0.00         0.96   0.96   0.03
6       0.00         0.97   0.97   0.03
7       0.00         0.97   0.97   0.02
8       0.00         0.97   0.97   0.02
9       0.00         0.96   0.96   0.02
10      0.00         0.97   0.97   0.02
11      0.00         0.92   0.92   0.03
Table 5.2 SEPA mathematics tests' fit indicators

Level   χ2 p-value   CFI    TLI    RMSEA
1       0.00         0.98   0.98   0.02
2       0.00         0.98   0.98   0.02
3       0.00         0.97   0.97   0.02
4       0.00         0.98   0.98   0.03
5       0.00         0.97   0.97   0.03
6       0.00         0.98   0.98   0.02
7       0.00         0.98   0.98   0.02
8       0.00         0.97   0.97   0.02
9       0.00         0.96   0.96   0.03
10      0.00         0.96   0.96   0.03
11      0.00         0.95   0.95   0.02
These indicators are consistent with the hypothesis that SEPA tests are one-dimensional and therefore constitute validity evidence in favor of their internal structure.
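The cut-offs summarized in footnote 9 can be written as a simple decision rule. The sketch below, with a hypothetical helper name and illustrative values taken from Table 5.1, is only meant to make those thresholds explicit; it is not part of SEPA's analysis pipeline.

```python
def one_factor_fit_ok(cfi: float, tli: float, rmsea: float,
                      cfi_min: float = 0.90, rmsea_max: float = 0.05) -> bool:
    """Apply the comparative-fit cut-offs cited in footnote 9.

    CFI and TLI should exceed 0.90 (ideally 0.95) and RMSEA should be
    below 0.05; the chi-square p-value is not used here because of its
    sensitivity to sample size.
    """
    return cfi > cfi_min and tli > cfi_min and rmsea < rmsea_max

# Illustrative check with the grade 4 language values reported in Table 5.1.
print(one_factor_fit_ok(cfi=0.98, tli=0.98, rmsea=0.02))  # True
```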
5.3.3 Reliability Checks

Reliability is a fundamental property of a test and the most widely studied. Despite this, only on rare occasions is its relevance made explicit and its suitability for a particular assessment explained (Cronbach & Shavelson, 2004). Conceptually, reliability is the complement of measurement error; it is therefore a necessary property for ensuring the technical quality of an assessment.
Table 5.3 SEPA language Cronbach's alpha indices

Level        Cronbach's alpha 2016   Cronbach's alpha 2017
1st Grade    0.82                    0.82
2nd Grade    0.85                    0.84
3rd Grade    0.88                    0.89
4th Grade    0.90                    0.90
5th Grade    0.88                    0.89
6th Grade    0.89                    0.88
7th Grade    0.89                    0.90
8th Grade    0.88                    0.90
9th Grade    0.85                    0.88
10th Grade   0.87                    0.86
11th Grade   0.86                    0.90
When reliability is not achieved, by extension, the use and interpretation of the assessment cannot be valid (American Educational Research Association et al., 2014).

The American Educational Research Association et al. (2014) Standards warn that the term reliability has been used in multiple ways. This diversity is due to the fact that the notion refers to metric properties that are not only the result of different estimation procedures, but that also respond conceptually to different facets of measurement in which this property can be studied. To promote greater consensus in language, the Standards propose conceptualizing reliability as measurement precision at the examinee level (American Educational Research Association et al., 2014). This conceptualization emphasizes the notion of replicability of the individual-level scores obtained on a given instrument. Thus, a good reliability analysis is one that allows the study of possible threats to the replicability of a test. In the Standards' words, it is the test developer's responsibility both to anticipate such threats and to provide evidence that the level of reliability achieved in an assessment is sufficient and adequate.

Depending on the type of test developed, there may be different sources of error or threats to replicability. In standardized tests composed entirely or mostly of closed-response items, an internal consistency index is the most frequent way to study precision. A measure of internal consistency is a measure of replicability, since only a highly consistent test would resist variations in the set of items (understood as representative samples of the same test), providing stable or invariant ability estimates. Among these measures, the most popular is Cronbach's alpha coefficient (Cronbach, 1951), and its correct interpretation consists of reporting it as the percentage of the individual differences found (described as the variance of the test scores obtained by examinees) that is attributable to the variance of examinees' true scores on the measured attribute (Cronbach & Shavelson, 2004).

In the case of SEPA tests, this index is estimated for both field trials and final tests. In the latter case, it is part of the psychometric verification routines prior to the development of the vertical scale.
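As an illustration of the coefficient discussed above, the following sketch computes Cronbach's alpha from a scored examinee-by-item matrix using the standard formula; the simulated responses are not SEPA data.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an examinees x items matrix of item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated dichotomous responses: ability-driven, so the items hang together.
rng = np.random.default_rng(7)
n_examinees, n_items = 1000, 40
ability = rng.normal(size=(n_examinees, 1))
difficulty = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))  # Rasch-like response model
responses = (rng.random((n_examinees, n_items)) < p_correct).astype(int)

print(f"alpha = {cronbach_alpha(responses):.2f}")
```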
Table 5.4 SEPA mathematics Cronbach's alpha indices

Level        Cronbach's alpha 2016   Cronbach's alpha 2017
1st Grade    0.82                    0.85
2nd Grade    0.83                    0.83
3rd Grade    0.86                    0.86
4th Grade    0.91                    0.90
5th Grade    0.91                    0.90
6th Grade    0.89                    0.89
7th Grade    0.90                    0.90
8th Grade    0.90                    0.91
9th Grade    0.90                    0.89
10th Grade   0.92                    0.92
11th Grade   0.89                    0.93
Tables 5.3 and 5.4 show the indices obtained for the 2016 and 2017 tests. As can be seen, these indices are highly satisfactory in all tests, indicating that the scores reported by SEPA tests are very consistent at all levels. In addition, this reliability provides a solid foundation for the procedures that allow the development of SEPA's vertical scale.

Reliability, measured as internal consistency, can also be studied in the context of the vertical scale, in which the tests applied to each educational level are treated as one large test that captures learning progress between levels. The vertical scale is obtained by means of statistical analysis procedures within the Item Response Theory framework, where, based on anchor items between levels, all tests for a given subject are treated as one. Within this framework, the reliability estimation can be carried out in a way analogous to the classical estimation (Adams, 2005). Under this approach, a reliability coefficient can be calculated for the examinees' ability estimates on the full vertical scale, composed of the eleven educational levels assessed for each subject. These coefficients are presented below for the years 2016 and 2017 (Table 5.5). As can be seen, these indices are also very high; hence, it can be established that, at the level of each test and of the entire scale, there are good internal consistency indices.
Table 5.5 SEPA scale, language and mathematics internal consistency indices (EAP reliability)

                  Language        Mathematics
Year              2016    2017    2016    2017
EAP reliability   0.93    0.92    0.95    0.95
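The EAP reliability reported in Table 5.5 is commonly computed as the variance of the EAP ability estimates divided by that variance plus the mean of the examinees' posterior variances (the convention used by IRT software such as ConQuest or TAM). Assuming those two quantities are available from a calibration, a minimal sketch with made-up values would be:

```python
import numpy as np

def eap_reliability(eap_estimates: np.ndarray, posterior_variances: np.ndarray) -> float:
    """EAP reliability: share of the estimated latent variance recovered by the
    EAP ability estimates (variance of EAPs over variance of EAPs plus the mean
    posterior variance)."""
    var_eap = np.var(eap_estimates, ddof=1)
    return var_eap / (var_eap + posterior_variances.mean())

# Illustrative call with made-up values (not SEPA calibration output).
rng = np.random.default_rng(0)
eaps = rng.normal(0.0, 1.0, size=5000)   # EAP ability estimates
post_var = np.full(5000, 0.08)           # posterior variance per examinee
print(f"EAP reliability = {eap_reliability(eaps, post_var):.2f}")  # close to 1/(1 + 0.08), about 0.93
```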
5.3.3.1 Test Information Function
Item Response Theory (IRT) allows the study of reliability without the restriction of having a single constant reliability index for the entire test, thereby overcoming one of the main limitations of models based exclusively on Classical Test Theory (de Ayala, 2009). By placing items and ability estimates on a single scale, IRT modeling allows precision estimates to be obtained at different difficulty levels of the test, accounting for the amount of information available at each level of the measured construct (where information plays a role analogous to reliability). To do this, the information function of the items and of the test as a whole is analyzed. This analysis must be complemented by a decision about the target information function of the test (de Ayala, 2009). In the case of SEPA tests, the target information function is expected to cover the largest possible proportion of the measured learning continuum. Since the test is designed to report the achievement of students who are at different performance levels, it should be informative across a wide range of scores. The way to verify this is to examine the alignment of the information function with examinees' ability estimates, where the greatest possible overlap is expected. For the 2017 SEPA tests, the test information functions are shown in Fig. 5.1. The line represents the information (precision) provided by the items at each educational level measured, and the histograms represent examinees' ability estimates at each level.
Fig. 5.1 Information function of the vertical scale (panels: vertical scale, Language; vertical scale, Mathematics)
As can be seen, there is an overlap between the information provided by the tests and the distribution of examinees at all levels, indicating greater reliability of the examinee ranking around the average performance of each educational level. Thus, it can be stated that, from the Item Response Theory perspective, there is also evidence in favor of the precision of SEPA tests, which supports the claim that they are highly reliable.
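For common IRT models the information function has a closed form; for a 2PL item with discrimination a and difficulty b, the item information at ability θ is a²P(θ)(1 − P(θ)), and the test information is the sum over items. The sketch below uses illustrative item parameters, not SEPA's.

```python
import numpy as np

def two_pl_probability(theta: np.ndarray, a: float, b: float) -> np.ndarray:
    """Probability of a correct response under the 2PL model."""
    return 1 / (1 + np.exp(-a * (theta - b)))

def test_information(theta: np.ndarray, items: list[tuple[float, float]]) -> np.ndarray:
    """Test information: sum of a^2 * P * (1 - P) over all items."""
    info = np.zeros_like(theta)
    for a, b in items:
        p = two_pl_probability(theta, a, b)
        info += a ** 2 * p * (1 - p)
    return info

# Illustrative item bank: (discrimination, difficulty) pairs spread over the scale.
items = [(1.2, -1.5), (0.9, -0.5), (1.1, 0.0), (1.3, 0.7), (1.0, 1.8)]
theta_grid = np.linspace(-3, 3, 7)
for theta, info in zip(theta_grid, test_information(theta_grid, items)):
    # The conditional standard error of measurement is 1 / sqrt(information).
    print(f"theta = {theta:+.1f}  info = {info:.2f}  sem = {1 / np.sqrt(info):.2f}")
```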
5.3.4 Fairness Checks

The study of the fairness of an assessment is especially relevant in the educational context, where results may show important gaps between the different groups that participate in the assessment, as a reflection of inequities or differences in access to educational opportunities. These differences in opportunity, as a social, economic, and cultural problem, translate into differences in performance on standardized measurements, raising the question of whether the test merely reflects these disparities or whether the assessment contributes to creating them. In the case of SEPA tests, the comparability of examinees from different cohorts and years implies assuming that the observed variance is not due to characteristics of the examinees other than their knowledge, but is associated only with their level of ability and is, therefore, free of distortion.

The study of a test's fairness is the generation of validity evidence about performance differences between the groups participating in an assessment, that is, about the degree to which these differences are attributable to their level of demonstrated ability and not to an interaction between membership in a certain group and the test. In this regard, the Standards (American Educational Research Association et al., 2014) state that all test development procedures should be guided by the principle of minimizing variance that is irrelevant to the measured construct. Membership in a particular group (e.g., gender, ethnicity) should not be a source of variance in assessment results, as the only source of relevant variance should be the level of proficiency in the measured construct. Interpretations of results are valid for the intended uses of the assessment only in the absence of sources of irrelevant variance.

In large-scale standardized assessments, in which administration procedures can reasonably be dismissed as a potential source of bias, it becomes crucial to determine whether there are factors other than examinees' ability, such as belonging to a specific group, that explain test results. There are basically two approaches to determine this: bias or differential functioning analysis at the item level, and bias or differential functioning analysis at the test level. The study of Differential Item Functioning (DIF) is a way of investigating distortions in the questions of a test that could benefit or harm certain groups (Camilli, 2006), considering examinee characteristics that are not directly related to the skill being measured (for example, sex, ethnicity, or rurality). When there is an empirical alert of bias, we can say that there is a distortion that affects the validity of the score interpretation, which would imply that, on certain questions, specific groups show a differential performance that would not be explained by their level of ability in the domain being evaluated (American Educational Research Association et al., 2014; Camilli, 2006).
To sum up, we speak of differential item functioning (DIF) when two (or more) groups of examinees who have the same level of ability show a different probability of correctly answering a certain question. There are different approaches and procedures for detecting differential item functioning; in this case, logistic regression analyses are carried out, which allow us to check whether belonging to a particular group (for example, being male or female) is a significant predictor of the probability of success on the test items. To complement this technique, the criteria established by the Educational Testing Service (ETS) are used to classify items according to the degree to which they show evidence of differential functioning. Table 5.6 shows the three levels established by ETS (Zwick, 2012). These differential functioning categories are reported with a negative sign (−) when the probability of a correct response, once controlled for ability level, is lower for the defined focal group than for the reference group, and with a positive sign (+) when the probability of a correct response is higher in the focal group than in the reference group.

The results of the differential item functioning analysis of the SEPA 2017 tests by gender are presented below (see Tables 5.7 and 5.8), with girls specified as the focal group and boys as the reference group. As can be seen in this analysis, 99% of the reviewed items in both subjects do not present an empirical alert of differential functioning. In each subject, three items stand out as showing possible differential functioning. In the case of language, two of them favor boys and one favors girls. In mathematics, all three items favor boys. Since this analysis is strictly empirical, these items are then reported to the test development teams in order to determine whether there are content-related reasons that support this differential functioning.
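As one way to implement the logistic-regression screening described above (a generic sketch in the spirit of Swaminathan and Rogers' procedure, not SEPA's production code; the data are simulated and the function name is hypothetical), an item can be tested for uniform DIF by comparing nested models with a likelihood-ratio test:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def uniform_dif_lr_test(item: np.ndarray, total: np.ndarray, group: np.ndarray) -> float:
    """Likelihood-ratio test for uniform DIF on one item.

    Compares a logistic regression of the item score on the total score
    (ability proxy) against one that also includes group membership.
    Returns the p-value; adding a total*group term would extend this to
    non-uniform DIF.
    """
    base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
    full = sm.Logit(item, sm.add_constant(np.column_stack([total, group]))).fit(disp=0)
    lr_stat = 2 * (full.llf - base.llf)
    return chi2.sf(lr_stat, df=1)

# Simulated example (not SEPA data): a DIF-free item answered by two groups.
rng = np.random.default_rng(42)
n = 2000
group = rng.integers(0, 2, size=n)      # 0 = reference (boys), 1 = focal (girls)
ability = rng.normal(size=n)
total = (rng.random((n, 30)) < 1 / (1 + np.exp(-ability[:, None]))).sum(axis=1)
item = (rng.random(n) < 1 / (1 + np.exp(-(ability - 0.2)))).astype(int)
print(f"p-value = {uniform_dif_lr_test(item, total, group):.3f}")  # a large p-value is expected: no DIF flag
```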
Table 5.6 Categories of differential item functioning established by ETS

Category   Differential functioning
A          Negligible or nonsignificant
B          Slight or moderate
C          Moderate to large
Table 5.7 Number of items in each differential functioning level, language sector
Criterion   P(χ2MH) > 0.05 or |αMH| ≤ 1   P(χ2MH) ≤ 0.05 and 1