
Measuring College Learning Responsibly
Accountability in a New Era

Richard J. Shavelson

Stanford University Press Stanford, California

To Ali, Amy, Justin, Karin, Patti, Potter, and Remy

Stanford University Press
Stanford, California

©2010 by the Board of Trustees of the Leland Stanford Junior University. All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or in any information storage or retrieval system without the prior written permission of Stanford University Press.

Printed in the United States of America on acid-free, archival-quality paper

Library of Congress Cataloging-in-Publication Data
Shavelson, Richard J.
Measuring college learning responsibly : accountability in a new era / Richard J. Shavelson.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8047-6120-8 (cloth : alk. paper)
ISBN 978-0-8047-6121-5 (pbk. : alk. paper)
1. Universities and colleges—United States—Examinations. 2. Educational tests and measurements—United States. 3. Educational accountability—United States. 4. Education, Higher—United States—Evaluation. I. Title.
LB2366.2.S53 2010
378.1'664—dc22
2009019665

Typeset by Westchester Book Group in 10/14 Minion

Contents

Figures and Tables
Preface
Abbreviations
1 Assessment and Accountability Policy Context
2 Framework for Assessing Student Learning
3 Brief History of Student Learning Assessment
4 The Collegiate Learning Assessment
5 Exemplary Campus Learning Assessment Programs
6 The Centrality of Information in the Demand for Accountability
7 Accountability: A Delicate Instrument
8 State Higher-Education Accountability and Learning Assessment
9 Higher-Education Accountability Outside the United States
10 Learning Assessment and Accountability for American Higher Education
Notes
References
Index

Figures and Tables

Figures

2.1 Framework for student learning outcomes
3.1 Sampling of questions on the Pennsylvania Senior Examination
3.2 Relationship between mean SAT/ACT scores and CLA scores
3.3 Collegiate Learning Assessment performance task
4.1 Collegiate Learning Assessment structure
4.2 CLA in-basket items from the “Crime” performance task
4.3 Faculty perceptions of the CLA’s performance tasks
4.4 Relationship between academic domain and performance task type
6.1 Accountable to whom?
7.1 Standard organizational production model
8.1 State-by-state report card
8.2 National Center for Public Policy and Higher Education’s “Learning Model”
8.3 State profiles of performance on learning assessment measures
8.4 State-by-state performance by white and nonwhite students on the Collegiate Learning Assessment
8.5 Value-added performance on Collegiate Learning Assessment
10.1 Learning outcomes

Tables

3.1 Summary of Tests and Testing Programs by Era
3.2 Characteristics of the Collegiate Learning Assessment
3.3 Critique an Argument
4.1 Criterion Sampling Approach and the Collegiate Learning Assessment
4.2 Scoring Criteria for Performance Tasks
4.3 Criteria for Scoring Responses to Analytic Writing Prompts
4.4 Students’ Mean (Standard Deviation) Perceptions of CLA Performance Tasks
5.1 Cross-Campus Comparison on Dimensions of Development, Philosophy, Operation, and Impact
6.1 Competing “Cultural” Views of Accountability
8.1 Most Frequent Indicators in State Higher-Education Performance Reports by Type
8.2 Early State Learning Assessment Programs
8.3 Direct Measures of Learning Found in State Report Cards
8.4 State Higher-Education Report Cards Database: Information Collected
8.5 State Higher-Education Report Cards

Preface

THIS BOOK HAS BEEN GESTATING for almost twenty years. It was conceived, unbeknownst to me at the time, when a program officer at the National Science Foundation asked if I thought that a collegiate version of NAEP could be built.1 I wondered why the government would want a one-size-fits-all, largely multiple-choice test for all colleges and universities in their full diversity. What good might come of information provided by a collegiate NAEP with scores reported publicly in league tables? Why adopt wholesale for higher education an assessment built to monitor mandatory precollegiate education? I paused then and said that that wasn’t a good idea, and, if it was tried, I would oppose it. I didn’t see how a single, narrowly gauged achievement test of basic skills could be developed in a manner sensitive to the diversity of education and missions in the nation’s institutions of higher education, including the development of higher-order cognitive abilities and personal and interpersonal skills. I didn’t see how information provided by a single, general test could be used to improve teaching and learning in higher education. And I didn’t see why it would be appropriate to adopt a solution to mandatory precollegiate education for elective higher education, knowing the strengths and limitations of large-scale assessments in an accountability context, as well as the political uses and misuses that have been made of such tests.

I then lost sight of the question of higher-education accountability for a couple of years until a friend, a music professor at a small midwestern liberal arts college, phoned. He had been appointed to a campus-wide committee charged with responding to the North Central Accreditation and School Improvement Association’s mandate to assess student learning. He wondered if I thought it
appropriate that his college replace its current system of assessing students with an on-demand, multiple-choice test of largely factual and procedural knowledge in the humanities, social sciences, and sciences to meet accreditation demands. He explained that currently all seniors completed a capstone course with high performance expectations; for example, his opera students had to stage an opera, among other requirements. This, he thought, was more relevant to his students’ achievement than a humanities multiple-choice test. He asked if I saw something wrong in his thinking. I told him that I didn’t think so and suggested that perhaps his committee and his college were overreacting to the accreditation mandate. The questions raised about a collegiate NAEP returned in a new context. A few years later, learning assessment and accountability came to my attention again, this time in a newspaper article. On Sunday, September 27, 1998, the New York Times alerted readers to the New York State Education Department’s plan to evaluate public and private colleges and publish the findings as early as 2001. The department planned to convene a higher-education advisory council of college presidents to guide its efforts to produce a “report card” based on a mandatory test for the state’s higher-education institutions, public and private. New York was following a trend in the United States (and other countries, such as Britain and Australia) toward increased higher-education accountability. The State University of New York, for one, demurred; the proposal needed further study; a system-wide committee was appointed to do the review. The New York situation weighed on me. What alternatives were there to one-size-fits-all assessment? What alternatives were there to U.S.-style accountability? Is the K-12 vision embodied in the No Child Left Behind federal legislation the only reasonable option? These questions were on my mind when a program officer from the Atlantic Philanthropic Service Company (APS), Myra Strober, invited me to lunch to talk about trends in higher education, especially the push for accountability. Myra had just taken a leave from Stanford to direct APS’s higher-education grants program and was in the process of framing a portfolio of new projects. When I told her my concerns about accountability trends, she, too, became concerned about the possible unintended negative consequences for higher education. My discussion with Myra ultimately led to support for the work contained herein, in large part a grant from APS (now called Atlantic Philanthropies). Once Myra asked for a proposal, she turned everything over to Jim Spencer, her predecessor, to avoid any conflict of interest, she and I both being from Stanford.


In this text I examine current practice in assessment of learning and higher-education accountability. By “assessment of learning” I mean the use of both direct measures of achievement (e.g., certification examinations) and ability (e.g., Graduate Record Examination, Collegiate Learning Assessment) and indirect measures (graduation and retention rates, time to degree, job placement and employer satisfaction, and student surveys of engagement). By “accountability” I mean the collection, provision, and interpretation of information on higher-education quality sought by educators and policy makers who have responsibility for assuring the public and “clients”—students, parents, businesses, and government—that invest in education, training, or research. The goal of this text is to provide education policy makers—in the academy, in government, and in the public—with an overview and critical analysis of options for crafting learning assessment and accountability systems that meet needs for campus teaching and learning improvement and external accountability. Along the way, I identify alternative conceptions of and procedures for assessment and accountability systems, some of which may substantively improve college teaching and learning, both in general education and in the disciplines, while at the same time informing external audiences. The book begins by introducing the higher-education policy context in the United States and the current demand for learning assessment and external accountability (Chapter 1). A number of tensions emerge, not the least of which is between the formative (institutional improvement) and summative (comparative) functions of accountability and who controls that agenda. A second, related tension is whether and to what extent campuses’ performances are publicly compared with one another. Chapters 2 through 5 address the quest to assess student learning. Chapter 2 distinguishes among direct and indirect measures of learning, arguing that indirect measures do not measure learning, and distinguishes learning (relatively permanent change in behavior over time) from achievement (level of academic performance at one time point) and propensity to learn (level of achievement within a student’s reach with minimal scaffolding). A framework is then presented for considering assessment of learning and achievement, ranging from knowledge and reasoning within a domain (e.g., quadratic equations) or major (e.g., mathematics) to broad reasoning, decision making, and communicating within the sciences, social sciences, and humanities; to quantitative, verbal, and spatial reasoning; to general ability. The framework locates current learning assessments

and provides a crosswalk among different notions and recommendations for measuring learning outcomes. In Chapter 3, the 100-year history of learning assessment in higher education is sketched, drawing lessons to be learned from the past for the design of learning assessment and showing that the current debate is not new. I then turn to currently available, externally provided learning assessments and what they attempt to do, concluding that the recent Collegiate Learning Assessment (CLA) offers a great deal of promise. Chapter 4 provides detailed information about the CLA, as it is, arguably, the newest, most innovative assessment of college learning today and relatively little is known about its philosophy and technical qualities. The last chapter of the learning assessment sequence (Chapter 5) examines undergraduate learning assessment as practiced on campuses; campus-based assessment efforts are essential to meet both formative and summative accountability demands. External assessments signal areas in need of improvement by benchmarking campus performance against the performance of campuses viewed as peers; local campus information is needed to pinpoint challenges and to conjecture and test out possible ways of improving learning. The variability among even exemplary campus assessment programs becomes immediately apparent—in how they were started and are sustained, in what they did (do), and in their intended and unintended consequences for student learning and teaching. The goal here is to identify programs and their implementation and operation that appear to have salutary effects on teaching and learning and draw lessons for the design of learning assessment and accountability systems. Chapters 6 through 9 focus on accountability. Chapter 6 addresses the centrality of information in accountability and the “cultural conflict” among academe, government, and clients. Although conflict is inevitable, it can nevertheless be productive when cool heads prevail; given the politics of higher-education accountability, the assumption of cool heads is tenuous. Rather, there is considerable room for mischief on all sides. Chapter 7 examines the role of accountability in a democracy, drawing implications for the role of accountability in higher education. There is a tension between accountability for formative (improvement) and summative (external informative) purposes, as well as between accounting for actions and accounting for outcomes. Moreover, the application of accountability to higher education gives rise to issues such as the presumption of control and causality, the role of sanctions, and the power of whoever controls the stories or accounts

that provide interpretations of accountability information for the public. What becomes clear is that accountability is a powerful policy instrument but a delicate one, one that, if misapplied, may lead to as much mischief as good.

Chapter 8 explores current state-level accountability practices in the United States. How many states have such practices? What do these practices look like? How do they vary? What consequences, intended and unintended, do they appear to have? Performance reporting of some kind dominates in a wide variety of forms. While myriad indicators are published, few states actually report direct measures of learning. And states report so many indicators that performance reports lack focus; the public and policy makers are overwhelmed by data. In Chapter 9 I analyze accountability systems in different parts of the world, including the European Community generally, especially England and Scandinavia, and Australia, New Zealand, and Hong Kong. Clear alternatives to current practice in the United States (although that is changing) exist. Outside the United States, quality assurance has taken hold. Accreditation, assessment of learning for cognitive and responsibility outcomes, and quality assurance are, in a certain combination, shown to be viable alternatives to current practice in the United States.

The book concludes (in Chapter 10) by setting forth a vision of an assessment and accountability “system” that integrates the findings from the previous chapters. I envision a multifaceted approach to the assessment of learning that includes cognitive outcomes in the majors and in broad abilities, including critical thinking, analytic reasoning, problem solving, and communicating. This vision of learning assessment also encompasses individual and social responsibility outcomes, including the development of personal identity, emotional competence, resilience, and perspective taking (interpersonal, moral, and civic). Learning assessment, both internal and external to colleges and universities, is a centerpiece for a quality assurance system of accountability that incorporates accreditation and assessment. Such a system provides both formative and summative information to higher educators, policy makers, clients, and the public while addressing the tension of conflicting policy and education cultures.

Inevitably, some readers will find some topics of little or no interest. For example, I have not distinguished between public and private four-year institutions or distinguished institutions by Carnegie classification. I believe that because I have been a faculty member and dean at both public (University of California at Los Angeles and University of California at Santa Barbara) and
private (Stanford) universities, what I say here can be applied fruitfully across these institutional types (although perhaps more to some types, such as liberal arts colleges, than to, say, research universities). Setting clear goals, building programs to reach them, monitoring progress, and feeding back findings that provide a basis for improvement and experimentation would seem to be beneficial across the spectrum. Moreover, community colleges and for-profit institutions are not addressed specifically. To be sure, what is said about four-year public and private colleges and universities here may be informative for community colleges and for-profit institutions. However, none of the examples or case studies presented draw on these institutions. Nor was consideration given to their differences and what might be said about them that would differ from what is said about four-year campuses. Simply put, they were beyond the scope of this work. I am indebted to many colleagues, not the least of whom were program officers at APS overseeing this work—Myra Strober, Jim Spencer, Ted Hullar, and Ray Handlan. I have already described Myra’s role. Jim Spencer, in his review of my proposal, said it all sounded academic and why didn’t I immerse myself in practice? (Jim’s an engineer.) His advice led to my involvement in the creation of the Collegiate Learning Assessment. The responsibility for guiding my grant, however, largely fell on the shoulders of Ted Hullar, who replaced Myra as higher-education program director for APS. His strong support for the project and his patience in the face of slow progress were motivating and greatly appreciated. Ultimately, as the APS higher-education program was phased out, Ted saw to it that I had the resources needed to complete the work and write this text; Ray Handlan did the same, following Ted as my contact. I am also deeply indebted to Maria Araceli Ruiz-Primo, formerly of Stanford University and now at the University of Colorado–Denver. She helped design, analyze, and report the empirical research conducted for the book. And she patiently read and critiqued a number of chapters. I am also indebted to Blake Naughton, who, as a graduate student at Stanford, helped conceive and design the study of state accountability systems; to Anita Suen, who assisted with research reported in Chapter 8; and to Gayle Christensen, now at the University of Pennsylvania, who as a graduate student and then a Humboldt Fellow at the Max Planck Institute in Berlin provided research support for the chapter on international approaches to accountability (Chapter 9). Finally, a debt of gratitude goes to Lee Shulman, who provided support, advice, wisdom, and encouragement throughout the project.


My colleagues at the Council for Aid to Education—Roger Benjamin, Roger Bolus, and Steve Klein—provided invaluable support for the chapter on the Collegiate Learning Assessment (Chapter 4). My experiences with them in the development and now the use of the CLA proved formative in my thinking about the assessment of learning and its role in higher-education accountability.

Abbreviations

AAC&U  Association of American Colleges and Universities
AACC  American Association of Community Colleges
AASCU  American Association of State Colleges and Universities
AAU  Association of American Universities
ACT  ACT (formerly American College Testing program)
ACU  Assessment Centered University
AP or APS  Atlantic Philanthropic or Atlantic Philanthropic Service
ACE  American Council on Education
CAAP  Collegiate Assessment of Academic Proficiency
CAE  Council for Aid to Education
CIRP  Cooperative Institutional Research Project
CLA  Collegiate Learning Assessment
CNE  Comité National d’Evaluation (France)
College BASE  College Basic Academic Subjects Examination
COMP  College Outcomes Measures Project
CRS  College Results Survey
EAQAHE  European Association for Quality Assurance in Higher Education
ECTS  European Credit Transfer System
ENQA  European Network for Quality Assurance in Higher Education
ETS  Educational Testing Service
EVA  Danish Evaluation Institute
EU  European Union
FU  Flexible University
GPA  grade point average
GPRA  Government Performance and Results Act
GRE  Graduate Record Examination
K-12  kindergarten through 12th grade
LSAT  Law School Admissions Test
LOU  Learning Outcomes University
MAPP  Measure of Academic Proficiency and Performance
NAALS  National Assessment of Adult Literacy Survey
NAEP  National Assessment of Educational Progress
NAICU  National Association of Independent Colleges and Universities
NASULGC  National Association of State Universities and Land-Grant Colleges
NCLB  No Child Left Behind Act
NCPPHE  National Center for Public Policy and Higher Education
NGA  National Governors Association
NCPI  National Center for Postsecondary Improvement
NPEC  National Postsecondary Education Cooperative
NSSE  National Survey of Student Engagement
OECD  Organization of Economic Cooperation and Development
PEA  Progressive Education Association
RAE  Research Assessment Exercise
SAT  SAT (College Admissions Test)
SCLU  Student Centered Learning University
SHEEO  State Higher Education Executive Officers
TIAA/CREF  Teachers Insurance and Annuity Association / College Retirement Equities Fund
UAP  Undergraduate Assessment Program
UK  United Kingdom
VSA  Voluntary System of Accountability
VSNU  Association of Dutch Universities
WTO  World Trade Organization

Measuring College Learning Responsibly

1

Assessment and Accountability Policy Context

ONE MEASURE OF THE IMPACT of a National Commission Report is that it stirs debate and changes behavior. Most such reports, however, come with great fanfare and exit, almost immediately, leaving hardly a trace. The report of former U.S. Secretary of Education Margaret Spellings’ Commission on the Future of Higher Education—A Test of Leadership: Charting the Future of U.S. Higher Education—is an exception to this rule (www.ed.gov/about/bdscomm/list/hiedfuture/reports/final-report.pdf). It spurred and continues to spur debate; it has demonstrably changed behavior.

This chapter sets the policy context for the quest to assess undergraduates’ learning and hold higher education accountable. What follows is a characterization of the Spellings Commission’s recommendations and those of professional associations for a new era of accountability, along with academics’ critiques of the proposals. The chapter then sketches some of the major issues underlying assessment and accountability and concludes with a vision of a new era in which learning is assessed responsibly within the context of an accountability system focused on teaching and learning improvement, while at the same time informing higher education’s various audiences.

Spellings Commission Findings and Recommendations

While praising the accomplishments of American higher education, the Spellings Commission said that the “system” had become complacent. “To meet the challenges of the 21st century, higher education must change from a system primarily based on reputation to one based on performance. We urge the creation of a robust culture of accountability and transparency throughout higher education”
(p. 21). The Commission considered “improved accountability” (p. 4) the best instrument for change, with colleges and universities becoming “more transparent about cost, price and student success outcomes” and “willingly shar[ing] this information with students and families” (p. 4). The Commission found fault with higher education in six areas; the three most pertinent here are:

• Learning: “The quality of student learning at U.S. colleges and universities is inadequate and, in some cases, declining” (p. 3).
• Transparency and accountability: There is “a remarkable shortage of clear, accessible information about crucial aspects of American colleges and universities, from financial aid to graduation rates” (p. 4).
• Innovation: “Numerous barriers to investment in innovation risk hampering the ability of postsecondary institutions to address national workforce needs and compete in the global marketplace” (p. 4).

Student learning was at the heart of the Commission’s vision of a transparent, consumer-oriented, comparative accountability system. Such a system would put faculty “at the forefront of defining educational objectives . . . and developing meaningful, evidence-based measures” (p. 40) of the value added by a college education. The goal was to provide information to students, parents, and policy makers so they could judge quality among colleges and universities. In the Commission’s words (p. 4):

Student achievement, which is inextricably connected to institutional success, must be measured by institutions on a “value-added” basis that takes into account students’ academic baseline when assessing their results. This information should be made available to students, and reported publicly in aggregate form to provide consumers and policymakers an accessible, understandable way to measure the relative effectiveness of different colleges and universities.
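The Commission’s phrase “takes into account students’ academic baseline” is typically given statistical form. The sketch below shows one common way of doing so, under the assumption that campus-level means are available: regress a campus’s mean outcome score on its mean entering-ability score and read the residual as “value added.” The campus data and numbers are hypothetical, and this is an illustration of the general logic, not the Commission’s or any testing program’s prescribed model.

```python
import numpy as np

# Hypothetical campus-level data (illustrative only): mean entering SAT/ACT
# composite and mean senior assessment score for six campuses.
mean_sat = np.array([1000.0, 1050.0, 1100.0, 1150.0, 1200.0, 1250.0])
mean_outcome = np.array([1080.0, 1100.0, 1170.0, 1160.0, 1260.0, 1280.0])

# Least-squares line predicting the outcome from the entering baseline.
slope, intercept = np.polyfit(mean_sat, mean_outcome, deg=1)
expected = intercept + slope * mean_sat

# "Value added" in this sketch = observed mean outcome minus the outcome
# expected from the baseline alone; positive residuals exceed expectation.
value_added = mean_outcome - expected
for campus, resid in enumerate(value_added, start=1):
    print(f"Campus {campus}: value added = {resid:+.1f} points")
```

Published value-added analyses are more elaborate (student-level data, adjustments for sampling and measurement error, uncertainty bands), but the residual-from-expectation logic is the part the Commission’s language points to.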

The Commission was particularly tough on the current method of holding higher education accountable: accreditation. “Accreditation agencies should make performance outcomes, including completion rates and student learning, the core of their assessment as a priority over inputs or processes” (p. 41). The Commission recommended that accreditation agencies (1) provide comparisons among institutions on learning outcomes, (2) encourage progress and continual improvement, (3) increase quality relative to specific institutional missions, and (4) make this information readily available to the public.


Higher Education Responds to the Commission’s Report

At about the same time that the Commission released its report, higher-education associations, anticipating the Commission’s findings and recommendations and wanting to maintain control of their constituent institutions’ destinies, announced their take on the challenges confronting higher education. In a “Letter to Our Members: Next Steps,” the American Council on Education (ACE), American Association of State Colleges and Universities (AASCU), American Association of Community Colleges (AACC), Association of American Universities (AAU), National Association of Independent Colleges and Universities (NAICU), and the National Association of State Universities and Land-Grant Colleges (NASULGC) enumerated seven challenges confronting higher education (www.acenet.edu/AM/Template.cfm?Section=Home&CONTENTID=18309&TEMPLATE=/CM/ContentDisplay.cfm):

• Expanding college access to low-income and minority students
• Keeping college affordable
• Improving learning by utilizing new knowledge and instructional techniques
• Preparing secondary students for higher education
• Increasing accountability for educational outcomes
• Internationalizing the student experience
• Increasing opportunities for lifelong education and workforce training

Perhaps the most astonishing “behavior change” came from AASCU and NASULGC. These organizations announced the creation of the Voluntary System of Accountability (VSA). Agreeing with the Spellings Commission on the matter of transparency, these organizations created the VSA to communicate information on the undergraduate student experience through a common web reporting template or indicator system, the College Portrait. The VSA, a voluntary system focused on four-year public colleges and universities (www.voluntarysystem.org/index.cfm), is designed to do the following:

• Demonstrate accountability and stewardship to the public
• Measure educational outcomes to identify effective educational practices
• Assemble information that is accessible, understandable, and comparable

Of course, not all responses to the Commission’s report and the associations’ letter were positive in nature or reflective of behavior change. The report, as well as the letter, was roundly criticized. Critics rightly pointed out that the proposals did not directly address the improvement of teaching and learning
but focused almost exclusively on the external or summative function of accountability. The recommendation for what appeared to be a one-size-fits-all standardized assessment of student learning by external agencies drew particular ire (but see Graff & Birkenstein, 2008). To academics, any measure that assessed learning of all undergraduates simply was not feasible or would merely tap general ability, and the SAT and GRE were available to do that. Moreover, it was not possible to reliably measure a campus’s value added. Finally, cross-institutional comparisons amounted to comparing apples and oranges; such comparisons were nonsensical and useless for improving teaching and learning.

The critics, moreover, pointed out that learning outcomes in academic majors varied, and measures were needed at the department level. If outcomes in the majors were to be measured, these measures should be constructed internally by faculty to reflect the campus’s curriculum. A sole focus on so-called cognitive outcomes would leave out important personal and social responsibility outcomes such as identity, moral development, resilience, interpersonal and inter-cultural relations, and civic engagement. The report had failed, in the critics’ view, to recognize the diversity of higher-education missions and students served. It had not recognized but intruded upon the culture of academe in which faculty members are responsible for curriculum, assessment, teaching, and learning. The higher-education system was just too complex for simple accountability fixes. Horse-race comparisons of institutions at best would be misleading to the public and policy makers, and at worst would have perverse effects on teaching and learning at diverse American college and university campuses.

Assessment and Accountability in Higher Education

The Commission report and the multiple and continuing responses to it set the stage for examining assessment and accountability in higher education in this text. The focus here is on accountability—in particular, the assessment of student learning in accountability. This is not to trivialize the other challenges identified by the Commission or by the professional higher-education organizations. Rather, the intent is to tackle what is one of the three bottom lines of higher education: student learning, which is the hardest outcome of all to get a good handle on. (The other two are research and service.) As we saw, there is a tug-of-war going on today as in the past among three forces: policy makers, “clients,” and colleges and universities. The tug-of-war
reflects a conflict among these “cultures.” The academic culture traditionally focuses on assessment and accountability for organizational and instructional improvement through accreditation, eschewing external scrutiny. “Clients”—students and their parents and governmental agencies and businesses—rely on colleges and universities for education, training, and research. They want comparative information about the relative strengths and weaknesses among institutions in order to decide where to invest their time and economic resources. And policy makers are held responsible by their constituencies to ensure high-quality education. Consequently, policy makers have a need to know how well campuses are meeting their stated missions in order to assure the public. Reputation, input, and process information is no longer adequate for this purpose. As the Commission noted, “Higher education must change from a system primarily based on reputation to one based on performance” (p. 21).

All of this raises questions such as, “What do we mean by student learning?” “What kinds of student learning should higher education be held accountable for?” “How should that learning be measured?” “Who should measure it?” And “How should it be reported, by whom, to whom, and with what consequences?”

The Commission’s report and its respondents also raised questions about the nature of accountability. The Commission took a client-centered perspective—transparency of performance indicators, with intercampus comparative information for students and parents. Four-year public colleges and universities have, in the most extreme response, in the VSA, embraced this perspective. The Commission’s vision is shared by the policy community. The policy community’s compact with higher education has been rocked by rising costs, decreasing graduation rates, and a lack of transparency about student learning and value added. No longer are policy makers willing to provide resources to colleges and universities on a “trust me” or reputational basis; increased transparency of outcomes and accountability are demanded.

In contrast, most higher-education professional organizations view accountability as the responsibility of colleges and universities and their accrediting agencies. External comparisons are eschewed (with exceptions noted above); internal diagnostic information for the improvement of the organization and teaching and learning is sought. This is not to say colleges and universities do not recognize the challenges presented to them in the 21st century, as we saw in the open letter issued by the major higher-education organizations in the United States. They do, and they want to control accountability rather than be controlled by it.


These varying views of accountability lead back to first principles and questions. “What is accountability?” “What should campus leaders be held accountable for—valued educational processes? Valued outcomes? Both?” “How should accountability be carried out?” “Who should carry it out?” “Who should get to report findings?” “What sanctions should be meted out if campuses fail to measure up?” “Should there be sanctions and, if not, what?” “What are states currently doing to hold their colleges and universities accountable?” “How do other nations hold their higher-education systems accountable?” “What seems to be a reasonable and effective approach to accountability for the United States going forward into the 21st century?”

A Vision of Higher-Education Assessment and Accountability in a New Era

The vision of assessment and accountability presented in this text is one of continuous improvement of teaching and learning by campuses evolving into learning organizations, with progress based on an iterative cycle of evidence, experimentation, action, and reflection. The vision, in part, is one of direct assessment of student learning on cognitive outcomes in the major and in general or liberal education (measured by the Collegiate Learning Assessment). However, the vision of learning outcomes goes beyond the cognitive to individual and social responsibility outcomes, including, for example, the development of one’s identity, emotional competence, perspective taking (moral, civic, interpersonal, intercultural), and resilience.

Colleges and universities would be held accountable by regional agencies governed by boards composed of higher-education leaders, policy makers, and clients. These agencies would be accountable to a national agency of similar composition. Agencies would conduct academic audits and report findings publicly, in readily accessible form, to various interested audiences. The audit would focus on the processes a campus has in place to ensure teaching and learning quality and improvement. To do this, the audit would rely on and evaluate the campus’s assessment program. The campus assessment program would be expected to collect, analyze, and interpret data and feed back findings into campus structures that function to take action in the form of experiments aimed at testing ideas about how to improve teaching and learning. Over time, subsequent assessments would monitor progress made in the majors, in general or liberal education, and by individual students. In addition to providing data on student learning outcomes, the audit program would include other indicators of
quality—for example, admission, retention, and graduation rates and consumer quality surveys. The audit findings—not the learning assessment findings per se—would be made public. The report, based on data from the campus assessment program and a report by an external expert visiting panel, would include appraisals as to how rigorous the institution’s goals were, how rigorous the assessment of those goals was, how well the institution had embedded quality assurance mechanisms throughout the organization (including delving deeply into a sample of departments and their quality assurance processes), and how well the institution was progressing toward those goals. The report would also include a summary of the general strengths and weaknesses of the campus and its quality assurance mechanisms. In this way such published academic audits would “have teeth” and would inform both educators within the institution and policy makers and clients outside.

2

Framework for Assessing Student Learning

OVER THE PAST TWENTY-FIVE YEARS the public, along with state and federal policy makers, has increasingly pressured colleges and universities to account for student outcomes. More recently the mantra has been to create a “culture of evidence” to guide improvement (e.g., Shavelson, 2007b). As part of the move to greater accountability than in the past, states today have some form of performance reporting, and about half (Naughton, Shavelson & Suen, 2003; see Chapter 7) have what Gormley and Weimer (1999, p. 3) call report cards: “a regular effort by an organization [in our case, a state] to collect data on two or more other organizations [public colleges and universities in the state], transform the data into information relevant to assessing performance [“indicators”], and transmit the information to some audience external to the organizations themselves [public, parents, students, policy makers].” (Italics in original.)

Although virtually all state reports provide indicators of student “learning,” these indicators are typically proxies—for example, graduation rates or student surveys. Today, states and campuses are being pressured to measure learning directly. The Spellings Commission (U.S. Department of Education, 2006), for example, has called for standardized tests of students’ critical thinking, problem solving, and communication skills (see Chapter 1). While most agree that colleges should track student learning, they may frequently have in mind different outcomes (e.g., knowledge in the majors vs. broad abilities like critical thinking), different ways of measuring these outcomes (indirect vs. direct measures), and different notions about what learning is—it is often confused with achievement. This chapter begins by clarifying what is meant by direct and indirect learning measures and argues that the latter do not measure
learning: Direct measures of learning should be used. The chapter then distinguishes among learning, achievement, and propensity to learn and describes the kinds of data collection designs needed to measure each. By the very definition of learning as a permanent change in observable behavior over time, so-called indirect measures cannot measure learning. In order to clarify what we mean by “assessing learning outcomes,” a framework is presented for conceiving and displaying these outcomes. The chapter concludes by using that framework to justify a recommendation to focus on three main learning outcomes: (1) knowledge and reasoning in the majors; (2) broad abilities such as critical thinking, analytic reasoning, and problem solving; and (3) individual and social responsibility.

Direct and Indirect Measures of Learning

Until quite recently indicators of student learning have been based largely on indirect measures, including graduation rates; progress or retention rates; employment rates; student, employer, and alumni satisfaction (Naughton, Shavelson & Suen, 2003; e.g., College Results Survey, see Zemsky, 2000; or NCPI, 2002); and student reports of the campus academic environment (e.g., National Survey of Student Engagement [NSSE]; Kuh, 2003). These measures are considered to be indirect because there is a big gap between, for example, graduation rates or students’ reports of their learning and their actual learning as a relatively permanent change in observed behavior over a period of time. Indirect measures of learning are not actual measures of learning because they do not directly tap observable behavior change. For example, even though NSSE has been developed to measure those indicators that past research has shown to be correlated with performance on direct measures of learning, student self-reports on this survey are uncorrelated (typically correlations of less than 0.15) with direct learning measures (Carini, Kuh & Klein, 2006; Pascarella, Seifert & Blaich, 2008). To reiterate, indirect measures of learning aren’t. That said, such measures (e.g., of persistence, graduation rates) may be important indicators of campus performance in themselves or for improving educational processes. For example, NSSE may provide valuable insights into campus processes that support learning and might become the focus of experimentation to improve learning and teaching and surrounding support structures.

Direct measures of learning provide concrete observable evidence of behavior change. Such measures typically include scores on licensure (e.g., teacher or nurse certification) and graduate school admissions examinations (GRE; e.g., Callan & Finney, 2002; Naughton, Shavelson & Suen, 2003; Shavelson & Huang,
2003; see also National Center for Public Policy and Higher Education, 2002, 2004, 2006, 2008). Increasingly, broad measures of critical thinking, communication, and decision making have been used. Examples of these assessments include the Collegiate Learning Assessment (Klein et al., 2005; Klein et al., 2007; Miller, 2006; Shavelson, 2007a,b; Shavelson, 2008a,c), the Collegiate Assessment of Academic Proficiency, and the Measure of Academic Proficiency and Progress (Dwyer, Millett & Payne, 2006). Chapters 3 and 4 provide details on direct measures of learning, especially the Collegiate Learning Assessment.
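To give a feel for what “correlations of less than 0.15” means in practice, the following sketch simulates a self-report scale that shares only a sliver of variance with a direct assessment score and then computes the Pearson correlation. The data and numbers are invented for illustration; they stand in for, rather than reproduce, the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_students = 500

# Simulated data (illustrative only): a direct measure (e.g., a scored
# performance task) and a self-report scale constructed to share only a
# small fraction of its variance with the direct measure.
direct_score = rng.normal(loc=1150.0, scale=150.0, size=n_students)
standardized = (direct_score - direct_score.mean()) / direct_score.std()
self_report = 0.1 * standardized + rng.normal(size=n_students)

r = np.corrcoef(direct_score, self_report)[0, 1]
print(f"Pearson r = {r:.2f}")            # expected value is about 0.10
print(f"Shared variance (r squared) = {r ** 2:.1%}")
```

Squaring the correlation makes the point starkly: an r of roughly 0.1 means the self-report accounts for only about 1 percent of the variance in the direct measure.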

On Learning, Achievement, and Propensity to Learn

Assessment of learning is a catch phrase that includes “indirect” and “direct” “measures of learning.” The phrase is understood vaguely by the public and policy makers; but it communicates its intent—to focus on important outcomes, student learning being the most important, not simply on college inputs and processes as a basis for holding higher education accountable. However, this phrase is technically incorrect. Learning is defined as a relatively permanent change in a person’s behavior (e.g., knowledge, problem solving ability, civic engagement, personal responsibility) over time that is due to experience rather than maturation. In order to measure students’ cognitive learning, tasks are developed in which “correct or appropriate processing of mental information is critical to successful performance” (Carroll, 1993, p. 10). Moreover, we need to measure students’ performance at two or more time points and to be able to interpret the change in their behavior as learning due to environmental factors (e.g., experience, instruction, or self-study). While this argument may seem picky, it turns out to be an important consideration in designing student learning assessments and in interpreting learning indicators in state report cards and elsewhere (e.g., Astin, 1993a).

This definition of learning rules out indirect measures of such factors as graduation rates, time to degree, and surveys of satisfaction (e.g., Zemsky, 2000) and student engagement (e.g., National Survey of Student Engagement; Kuh, 2001, 2003) as bearing directly on learning. These output measures do not tap the student-learning outcomes that include cognition (knowledge, reasoning, problem solving, writing), personal growth (ability to accept responsibility, manage on one’s own), social engagement, and civic engagement (described in Chapter 3). Moreover, indirect measures refer to groups of students, not individual students; yet learning, in the last analysis, is a within-individual phenomenon. Finally, indirect measures do not focus on change over time but on rates at one point in time.


The phrase direct measures of learning is typically a misnomer, as well. For the most part, what gets measured by direct measures of learning is not learning but achievement. Achievement is the accumulation or amount of learning in (1) formal and informal instructional settings, (2) a period of self-study on a particular topic, or (3) a period of practice up to a point in time when student performance is measured (see Carroll, 1993, p. 17). That is, learning is about change in behavior. Most direct measures of learning that get reported to the public do not measure change. Rather, they measure the status of a group of students (e.g., seniors) at a particular point in time. What is measured when students sit for a certification examination or for a graduate admissions examination is achievement, not learning. Moreover, in interpreting that achievement, higher education alone cannot be said to be the “cause” of learning, as students may have learned outside of college while attending college. Attributing causality to one or another agent is problematic for learning assessment and accountability (see Chapters 6 and 7). Finally, learning and achievement need to be distinguished from propensity to learn, which is perhaps what we would ideally like to know about students. Propensity to learn may be defined as a student’s achievement under conditions of scaffolding (Vygotsky, 1986/1934), the provision of sequential hints or supports as the student attempts to perform a task or solve a problem (“dynamic assessment” is an exemplar; e.g., Campione & Brown, 1984; Feuerstein, Rand & Hoffman, 1979). That is, with a little assistance, how well can a student perform? And by implication, how much is she likely to learn from further instruction? Or, put another way, is the student able to apply what she has learned in college (and elsewhere) successfully in new learning situations? Most direct measures of students’ learning are actually measures of their achievement at a particular point in time. Attribution of causality for learning— e.g., solely to a college education—is not warranted, although the college most likely was a major part of the cause. To examine learning, individual students need to be tracked over time. Although ultimately we may want to know a student’s propensity to learn, we do know that prior achievement is the best predictor of future achievement (e.g., Carroll, 1993), so the achievement indicator of “learning” seems a good proxy.1
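The distinction between achievement and learning is, at bottom, a data-collection design distinction. A minimal sketch with invented scores contrasts the two: a senior-year average is an achievement (status) indicator, while the change computed for the same students tested at entry and again as seniors is the kind of evidence a learning claim requires.

```python
import numpy as np

# Hypothetical scores for the same five students measured at entry (time 1)
# and again in the senior year (time 2). Invented numbers, for illustration.
entry_scores = np.array([1020.0, 1100.0, 980.0, 1150.0, 1060.0])
senior_scores = np.array([1140.0, 1180.0, 1100.0, 1230.0, 1150.0])

# Achievement: status of the group at a single point in time.
senior_mean = senior_scores.mean()

# Learning: within-student change between the two occasions.
change = senior_scores - entry_scores

print(f"Mean senior achievement (one time point): {senior_mean:.0f}")
print(f"Mean within-student change (two time points): {change.mean():+.0f}")
# A senior-only average says nothing about change; only repeated measurement
# of the same students supports an interpretation in terms of learning.
```

Propensity to learn would require yet another design: re-testing with graduated hints or scaffolds and observing how performance responds.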

Framework for Assessing Achievement and Learning

Having distinguished learning, achievement, and propensity to learn and argued that most assessment of learning in the current accountability context is actually assessment of achievement, I ask you to consider now the question of what achievement and learning should be measured. Should students’ factual and
conceptual knowledge in a domain such as economics be measured? Should their ability to reason analytically and write critically be measured? Should their ability to adapt to and learn in novel situations be measured? Should achievement be limited to the so-called cognitive domain and not the personal, social, and moral? As will be seen in the next chapter, answers to these questions have differed over the past one hundred years. Currently, however, the answer seems to be “all of these.” Americans hold diverse goals for their colleges and universities, as Immerwahr (2000, table 3—national sample information) reported in a national survey. The public wanted graduates with:

• sense of maturity and ability to manage on own (71 percent of respondents)
• ability to get along with people different from self (68 percent)
• improved problem solving and thinking ability (63 percent)
• high-tech skills (61 percent)
• specific expertise and knowledge in chosen career (60 percent)
• top-notch writing and speaking skills (57 percent)
• responsibilities of citizenship (44 percent)

A conceptual framework, then, is needed to help answer the question of what might or should be measured to “assess learning.” To this end research on cognition (e.g., Bransford, Brown & Cocking, 1999) and cognitive abilities (e.g., Martinez, 2000; Messick, 1984) has been integrated to create a framework for considering cognitive outcomes of higher education (Shavelson & Huang, 2003; Shavelson, 2007a,b). Cognitive outcomes range from domain-specific knowledge acquisition (e.g., Immerwahr’s questionnaire item “Specific expertise and knowledge in chosen career”) to the most general of reasoning and problem-solving abilities (Immerwahr’s questionnaire item “Improved problem solving and thinking ability”).

One caveat is in order before proceeding to the framework. Learning is highly situated and bounded by the context in which initial learning occurred. Only through extensive engagement, deliberative practice, and informative feedback in a domain such as “quadratic equations” does this knowledge become increasingly decontextualized for a learner. At this point knowledge transfers to similar situations in general and so enhances general reasoning, problem solving, and decision making in a broad domain (in this case, mathematics) and later to multiple domains as general quantitative reasoning (e.g., Bransford, Brown & Cocking, 1999; Messick, 1984; Shavelson, 2008b). Moreover,

[Figure 2.1 Framework for student learning outcomes. The figure arrays outcomes from concrete and content-oriented to abstract and process-oriented: knowledge, understanding, and reasoning in major fields and professions, built through direct experience (example: ETS’s Major Field Tests); broad abilities, that is, reasoning, critical thinking, problem solving, decision making, and communicating in broad domains, both disciplinary (humanities, social sciences, sciences) and responsibility (personal, social, moral, and civic) (example: Collegiate Learning Assessment); general reasoning, spatial, quantitative, and verbal, reflecting inheritance and accumulated experience (example: Graduate Record Examination); and, most generally, fluid and crystallized intelligence. Source: Adapted from Shavelson, 2007a.]

what is learned and how well it transfers to new situations depends on the natural endowments, aptitudes, and abilities that students bring with them. These aptitudes and abilities are a product of their education (in and out of school) in combination with their natural endowments (e.g., Shavelson et al., 2002). A useful framework for distinguishing higher-education outcomes, then, must capture this recursive complexity. Moreover, it must allow us to see what cognitive outcomes different tests of learning attempt to measure. One possible framework for capturing knowledge and reasoning outcomes is presented in Figure 2.1 (from Shavelson, 2007a,b; Shavelson & Huang, 2003; see also Cronbach, 2002, p. 61, table 3.1; Martinez, 2000, p. 24, figure 3.2). The framework ranges from domain-specific knowledge, such as knowledge of chemistry, to what Charles Spearman called general ability or simply G. (G is used in the framework to denote general ability and to avoid the antiquated interpretation of G as genetically determined; see Cronbach, 2002; Kyllonen & Shute, 1989; Messick, 1984; Snow & Lohman, 1984.)2


Working from domain-specific knowledge toward general ability, we find increasingly general abilities, such as verbal, quantitative, and visual-spatial reasoning (and more; see Carroll, 1993), that build on inherited capacities and are typically developed over many years in formal and informal education settings. These general reasoning abilities, in turn, contribute to fluid intelligence and crystallized intelligence. “Fluid intelligence is functionally manifest in novel situations in which prior experience does not provide sufficient direction; crystallized intelligence is the precipitate of prior experience and represents the massive contribution of culture to the intellect” (Martinez, 2000, p. 19).

Of course, what has been presented is an oversimplification. Knowledge and abilities are interdependent. Learning and achievement depend not only on instruction but also on the knowledge and abilities that students bring to college instruction. Indeed, instruction and abilities most likely combine or interact to produce learning. This interaction evolves so that different abilities are called forth over time. Moreover, different and progressively more challenging learning tasks are needed in this evolution (Snow, 1994; Shavelson et al., 2002). Consequently, what is sketched in Figure 2.1 does not behave in strict, orderly fashion. (The figure could have been flipped 90 or 180 degrees!) The intent is heuristic: to provide a conceptual framework for discussing learning outcomes and their measures.

Domain-Specific Knowledge and Reasoning

By domain-specific knowledge and reasoning is meant knowledge in the domain of, for example, physics, sociology, or music, and its use to reason through a task or problem. This is the kind of knowledge that would be assessed to gauge students’ learning in an academic major. Domain-specific knowledge corresponds to such valued higher-education outcomes as “high-tech skills” or “specific expertise and knowledge in chosen career.” Domain-specific knowledge and reasoning can be divided into four types (e.g., Li, Ruiz-Primo, & Shavelson, 2006):

• Declarative (knowing that)—knowing and reasoning with facts and concepts (e.g., the Earth circles the sun in a slightly elliptical orbit)
• Procedural (knowing how)—knowing and reasoning with simple and complicated routines (e.g., how to get the mass of an object with a balance scale)
• Schematic (knowing why)—knowing and reasoning with a system of procedural and declarative knowledge (predicting, explaining, modeling; for example, knowing why San Francisco has a change of seasons over the course of a year)
• Strategic (knowing when, where, and how to apply these other types of knowledge)—so-called meta-cognitive knowledge and reasoning (e.g., knowing and reasoning when to apply the quadratic equation to solve a problem)

Conceptual and empirical support for these distinctions comes from diverse areas. Brain imaging studies have found that different types of knowledge, especially declarative knowledge and procedural knowledge, are localized in different areas of the brain (for a short summary, see Bransford, Brown & Cocking, 1999). Cognitive science research (Bransford, Brown & Cocking, 1999; Pellegrino, Chudowsky & Glaser, 2001) has provided evidence not only of declarative, procedural, and strategic knowledge, but also of what we have called schematic knowledge (Gentner & Stevens, 1983). Distinctions among these various types of knowledge have been made in K-12 content standards (e.g., Bybee, 1996) and in test-development frameworks for large-scale assessments such as the 2009 NAEP Science Assessment Framework. In practice, most tests of domain-specific knowledge still focus on declarative knowledge, as exemplified by, for example, ETS’s Major Field Tests.

Disciplinary and Broad Abilities

Disciplinary and Broad Abilities3 are complex combinations of cognitive and motivational processes (“thinking”). They come closest to what is implied when we hear that the cognitive outcomes of higher education include critical thinking, problem solving, and communicating. They differ in their specificity. Disciplinary abilities are developed within a discipline—e.g., historians use historiography to disentangle events; statisticians use randomized trials as an “ideal” for modeling nonexperimental data; and physicists reason with diagrams to resolve forces. Disciplinary abilities are typically developed within a major and are closely linked to disciplinary knowledge. Broad abilities are generalized from specific, related disciplines. For example, reasoning in the behavioral and social sciences—as developed in anthropology, political science, psychology, sociology—is generalized from the discipline across common disciplinary reasoning features, such as the use of experimentally generated empirical evidence for arguing knowledge claims. These abilities are organized broadly into areas such as the humanities, social sciences, and sciences.


These reasoning processes underlie verbal, quantitative, and spatial reasoning, comprehending, problem solving, and decision making. They can be called upon within a discipline (e.g., physics) and more generally across domains as situations demand, hence their name. Broad abilities are developed well into adulthood through learning in and transfer from nonschool and school experiences and repeated exercise of domain-specific knowledge. Knowledge development, of course, occurs in conjunction with prior learning interacting with previously established general reasoning abilities. Consequently, these developed abilities are not innate or fixed in capacity (e.g., Messick, 1984).

Disciplinary and Broad Abilities, along with different types of knowledge, play out in achievement situations: "In educational achievement, cognitive abilities and ability structures are engaged with knowledge structures in the performance of subject-area tasks. Abilities and knowledge combine in ways guided by and consistent with knowledge structure to form patterned complexes for application and action" (Messick, 1984, p. 226; see also Shavelson et al., 2002). As tasks become increasingly broad—moving from a knowledge domain (discipline) to a field such as social science and then to broad everyday problems—general abilities exercise greater influence over performance than do knowledge structures and domain-specific abilities.

Many of the valued outcomes of higher education are associated with the development of these broad abilities. For example, two important higher-education outcomes are "improved problem solving and thinking ability" and "top-notch writing and speaking." Assessments of learning currently in vogue, as well as some developed in the mid 20th century, tap into these broad abilities. Most have focused primarily at the level of areas—sciences, social sciences, and humanities.4 Nevertheless, many of the area tests (e.g., Collegiate Assessment of Academic Proficiency [CAAP], Undergraduate Assessment Program [UAP]) divide an area such as science into questions on physics, biology, and chemistry. Because too few questions are available in each discipline to produce reliable domain-knowledge scores, an aggregate, broader science area score is provided, even though the questions focus on knowledge at the level of a discipline (see Figure 2.1). The science area score falls between domain-specific knowledge and general reasoning abilities. Other tests are more generic, focusing on critical thinking, writing, and reasoning. Some examples are the GRE's Issues and Analytic Writing prompts, the College BASE (Basic Academic Subjects Examination), the Academic Profile (recently replaced by the Measure of Academic Proficiency and Progress [MAPP]), CAAP, and what was ETS's Undergraduate Assessment Program Field Tests.


Indeed, many tests of broad abilities contain both area (e.g., sciences) and general reasoning and writing tests.

Intelligence: Crystallized, Fluid, and General

General reasoning abilities occupy the upper parts of Figure 2.1. These abilities have developed over significant periods of time through experience (e.g., school) in combination with one's inheritance. They are the most general of abilities and account for consistent levels of performance across heterogeneous situations. Cattell (1963) argued that intelligence involves both fluid and crystallized abilities. "Both these dimensions reflect the capacity for abstraction, concept formation, and perception and eduction [sic; Spearman's term] of relations" (Gustafsson & Undheim, 1996, p. 196). The fluid dimension of intelligence "is thought to reflect effects of biological and neurological factors" (p. 196) and includes speed of processing, visualization, induction, sequential reasoning, and quantitative reasoning. It is most strongly associated with performance on novel tasks. The crystallized dimension reflects acculturation (especially education) and involves language and reading skills (e.g., verbal comprehension, language development, as well as school-related numeracy) and school achievement (see Carroll, 1993).

These reasoning abilities are distal from current college instruction. They typically are interpreted as verbal ("crystallized") and quantitative ("fluid") reasoning and are measured by tests such as the SAT or GRE. They are developed over a long period of time, in school and out. Nevertheless, there is some evidence of short-term college impact on these abilities (e.g., Pascarella & Terenzini, 2005).

The most general of all abilities is general intelligence—the stuff that fuels thinking, reasoning, decision making, and problem solving—and accounts for consistency of performance across vastly different novel and not-so-novel situations. General intelligence involves induction "and other factors involving complex reasoning tasks" (Gustafsson & Undheim, 1996, p. 198). Although education might ultimately be aimed at cultivating intelligence (Martinez, 2000), changes in intelligence due to learning in college would be expected to be quite small and distal from the curriculum in higher-education institutions.

What to Assess When Assessing Learning?

The question of what to assess when we assess learning, then, is much more complex than thought at first. It seems that domain knowledge and reasoning in a broad domain (e.g., natural sciences) falls well within the purview of academic disciplines and liberal-arts programs. Here students are expected to delve deeply


into a subject matter and develop considerable declarative, procedural, and schematic knowledge. Moreover, the strategic knowledge they develop includes planning and goal setting, strategies for reaching goals, and monitoring progress toward those goals that are known to be effective in that domain. For example, in physics, one strategy for solving force and motion problems is the use of force diagrams. Such diagrams help students know when, where, and how to apply their knowledge of mechanics. Similarly, broad abilities that include verbal, quantitative, and spatial reasoning; decision making; problem solving; and communicating fall within the purview of liberal or general education. Here students are expected to draw broadly on what they have learned to address everyday practical problems that do not necessarily have convergent answers but involve trade-offs, moral issues, and social relations in addition to domain specific knowledge. As Shavelson and Huang (2003) noted, it is curious that these more complex abilities fall early in the college curriculum, whereas the domain-specific abilities fall later in the major; perhaps the two should be reversed. Throughout the history of assessing learning in American higher education, the pendulum has swung between a focus on domain knowledge and on broad abilities. However, this is not an either-or situation. The two should be balanced in assessing learning—although today the pendulum has shifted to the broad ability part of Figure 2.1, as we shall see in the next chapter. Finally, and not typically thought of when learning outcomes are discussed, although they should be (e.g., Shavelson & Huang, 2003; Shavelson 2007a), are the so-called soft skills (creativity, teamwork, and persistence; Dwyer, Millett & Payne, 2006) or “individual and social responsibility” skills (personal, civic, moral, social, and intercultural knowledge and actions; AAC&U, 2005). To a large extent, this neglect grows out of limitations in measurement technology, lack of research (funding), and suspicion. As Dwyer, Millett and Payne (2006, p. 20) pointed out, “At the present state of the art in assessing soft skills, the assessments are, unfortunately, susceptible to . . . undesirable coaching effects.” Nevertheless, such “soft” outcomes are important and should be measured; not to measure them would mean they would likely be ignored.

Reprise

The design of any higher-education accountability system will, as one of its most important outputs, include an assessment of student learning. While there seems to be unanimity as to the importance of this student learning, there is


disagreement as to how learning might be defined and measured. Learning indicators can be distinguished as to whether they are indirect (reflecting the consequences of learning) or direct (tapping into what and how much has been learned). Indirect measures of learning are not and cannot be such measures. Learning is a relatively permanent change in behavior over time that results from the interaction of an individual with the environment. To gauge learning, student performance needs to be measured at two points in time; indirect measures typically reflect a single time point. Moreover, most accountability systems that profess to measure learning directly actually measure achievement—the relative or absolute level of students’ performance at a particular point in time. Finally, perhaps the best possible measure of learning, but one that would be problematic for large-scale accountability, would be a measure of students’ propensity to learn in new situations. The important consequences of this definitional hieroglyphics are twofold. First, accountability designers need to be clear on what student performance outcome is intended to be measured—achievement or learning. Second, if an accountability system intends to measure student learning, student performance should be measured at least at two points in time. This might be accomplished by following up students upon entry and exit from college. The importance of measuring the performance of all or a representative sample of students longitudinally cannot be overemphasized. Or learning might be measured cross-sectionally by comparing the performance of freshmen and, say, seniors. If the cross-sectional tack is taken, the system would have to provide measures on all or a representative sample of students in their freshman and senior years, adjusting for any change of demographics in the two classes over that time period (e.g., Klein et al., 2007; Klein et al., 2008). This still leaves open the question of what to assess. On the basis of the framework for cognitive learning outcomes, knowledge and reasoning in the disciplines and broader abilities that include critical reasoning, decision making, problem solving, and communicating in the areas of the humanities, social sciences, and sciences should be the focus of learning and assessment of learning. Consequently, a single measure of learning is unlikely to fill the bill; multiple measures—some standardized to benchmark performance and some institutionally built to diagnose curricular strengths and weaknesses—are needed. That is, a balance is needed between external (“standardized”) learning assessments for benchmarking and signaling purposes and internally developed assessments closely reflecting a campus’s mission for improvement.
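To make the design distinction concrete, here is a minimal sketch, mine rather than the book's, using invented scores on an unnamed 0–100 assessment. It shows how a longitudinal estimate of learning (the same students tested at entry and at exit) and a cross-sectional estimate (separate freshman and senior samples) would each be computed; as noted above, in practice either estimate would also be adjusted for entering ability and shifts in student demographics.

    # Hypothetical data; a sketch of the two designs, not an operational procedure.
    from statistics import mean

    # Longitudinal design: the same students tested at college entry and at exit.
    entry_scores = [52, 61, 47, 58, 66, 50]   # invented freshman-year scores
    exit_scores  = [63, 70, 55, 71, 74, 60]   # the same students as seniors
    longitudinal_gain = mean(e2 - e1 for e1, e2 in zip(entry_scores, exit_scores))

    # Cross-sectional design: different freshman and senior samples in the same year.
    freshman_sample = [51, 59, 48, 62, 55]
    senior_sample   = [64, 69, 58, 72, 66]
    # The raw difference in cohort means stands in for learning only if the cohorts
    # are comparable; operationally it would be adjusted for entering ability and
    # demographic differences between the two classes.
    cross_sectional_gain = mean(senior_sample) - mean(freshman_sample)

    print(f"Longitudinal estimate of learning:   {longitudinal_gain:.1f} points")
    print(f"Cross-sectional estimate of learning: {cross_sectional_gain:.1f} points")

The longitudinal estimate rests on change within students; the cross-sectional estimate rests on comparability of two different cohorts, which is why the demographic adjustment matters.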


Finally, any assessment of learning should include the so-called soft skills (a preferred term, following the American Association of Colleges and Universities, is “individual and social responsibility”). At present, widespread agreement has not been reached upon what set of such knowledge and skills should be measured or how to measure them, but there are currently a number of efforts under way to move this agenda forward. One possible idea for measuring these skills goes like this (Shavelson, 2007a,b). Consider a performance task involving the local environment. Students might be given an “in basket” of information (scientific reports, newspaper articles, opinion editorials, statistical and economic information) and be asked to review arguments made by local environmentalists and the business community for and against removing an old dam. In reviewing material from the in basket, students would find that environmentalists wanted to return the land to its prior state, supporting the natural streams and rivers, flora and fauna that once thrived there and providing hiking, camping, and fishing recreation with personal and commercial benefits. Students also would find that the environmentalists were pitted against other community members who use the manmade lake for fishing, boating, and swimming. Moreover, students would find that homes, restaurants, and other commercial establishments had been built up around the lake since the dam was constructed, that the dam is used to generate the county’s power, and that excess energy is sold to other counties. On the basis of their review and analysis, students would be asked to outline the economic, political, social, and ethical pros and cons of removing the county’s dam and to arrive at a recommendation for a course of action. While there would be no single correct answer, the quality of their reasoning—the application of their social responsibility skills—could be judged.

3

Brief History of Student Learning Assessment

YOU MIGHT CONCLUDE, reading policy documents, newspapers, or even the first chapter of this book, that the current focus on learning outcomes is something new. But that's not so. For well over one hundred years, assessment-of-learning "movements"—which usually measured achievement and assumed that it reflected college learning—have come and gone. However, there are excellent examples of learning being assessed intentionally, as we shall see, and plenty of good models of assessment that could and probably should inform today's practice (for details, see Shavelson, 2007b).

Four periods in the history of learning assessment can be distinguished: (1) origins of standardized testing of learning in higher education (1900–1933), (2) assessment of learning for general and graduate education (1933–47), (3) rise of the test providers (1948–78), and (4) era of external accountability (1979–present). For ease of reference, the tests and testing programs are summarized by each of these periods in Table 3.1.

Table 3.1. Summary of Tests and Testing Programs by Era

Origins of standardized testing of learning: 1900–1933
  Missouri Experimental School Study: objective tests of arithmetic, spelling, penmanship, reading, English, and composition
  Thorndike MIT Engineers Study: objective tests of mathematics, English, and physics
  Pennsylvania Study: objective tests of general culture (literature, fine arts, history and social studies, general science), English (e.g., spelling, grammar, vocabulary), mathematics, and intelligence

Assessment of learning for general and graduate education: 1933–47
  Chicago College General Education: constructed-response and objective tests focusing on analysis, interpretation, and synthesis
  Cooperative Study of General Education: objective tests of general culture, mathematics, and English (based on the Pennsylvania Study) and inventories of general life goals, satisfaction in reading fiction, social understanding, and health
  Graduate Record Examination (GRE) Program:
    1936: Objective Profile Tests of content (e.g., mathematics, physical sciences, social studies, literature, and fine arts) and verbal ability (cf. Pennsylvania Study)
    1939: Above plus 16 advanced tests in major fields (e.g., biology, economics, French, philosophy, sociology) for academic majors
    1946: General Education Tests that included the Profile Tests plus "effectiveness of expression" and a "general education index"
    1949: Verbal and Quantitative Aptitude Tests created as stand-alone tests to replace the Verbal Factor Test and the Mathematics Test in the Profile Tests
    1954: Area Tests, "entirely new measures of unusual scope . . . [providing] a comprehensive appraisal of the college student's orientation in three principal areas of human culture: social science, humanities, and natural science" (ETS, 1954, p. 3); replaced the Profile and General Education Tests

Rise of the test providers: 1948–78
  ETS: Undergraduate Assessment Program that included the GRE tests
  ACT: College Outcomes Measures Project, which evolved from constructed-response tests to objective tests to save time and cost
  New Jersey: Tasks in Critical Thinking, constructed-response tests

Era of external accountability: 1979–present
  ETS: Academic Profile and Measure of Academic Proficiency and Progress (MAPP), largely objective tests
  ACT: Collegiate Assessment of Academic Proficiency (CAAP), largely objective tests
  CAE: Collegiate Learning Assessment, constructed-response tests

source: R. J. Shavelson, 2007b, table in the appendix of monograph.


Origins of Standardized Testing of Learning in Higher Education: 1900–1933

The first third of the 20th century marked the beginning of standardized, objective testing to assess learning in higher education, spurred by the success of standardized "objective" mental testing in World War I. In 1916 the Carnegie Foundation for the Advancement of Teaching led the testing movement when five graduate students and William S. Learned, Carnegie staff member and learning-assessment visionary, tested students "in the experimental school at the University of Missouri in arithmetic, spelling, penmanship, reading, and

English composition, using recognized tests, procedures, and scales, and a statistical treatment that though comparatively crude was indicative” (Savage, 1953, p. 284). E. L. Thorndike’s study of engineering students followed, testing MIT, University of Cincinnati, and Columbia students on “all or parts of several objective tests in mathematics, English and physics” (Savage, 1953, p. 285). These tests focused on content knowledge, largely tapping facts and concepts (declarative knowledge) and arithmetic routines (procedural knowledge; see Figure 2.1). The early tests were “objective” in the sense that students responded by selecting an answer (e.g., in a multiple choice test) where there was one correct answer. These tests gained reliability in scoring and content coverage per unit of time over the theretofore widely used essay examination. The monumental Pennsylvania Study (1928–32)—published tellingly as The Student and His Knowledge—emerged from this start; it tested thousands of high school seniors, college students, and even some college faculty members on extensive objective tests of largely declarative and procedural content knowledge. The study was conducted by Learned—“a man who had clear and certain opinions about what education ought to be . . . [with] transmission of knowledge as the sine qua non” (Lagemann, 1983, p. 101)—and Ben D. Wood, director of collegiate educational research at Columbia College and former E. L. Thorndike student who held the view, as did Learned, “that thinking was dependent upon knowledge and knowledge dependent upon facts” (Lagemann, 1983, p. 104). In many ways, the Pennsylvania study was extraordinary and exemplary with its clear conception of what students should achieve and how to measure learning; in other ways, it clearly reflected its time with its focus on factual and procedural knowledge and compliant students sitting for hours of testing.1


In the 1928 pilot study no less than 70 percent of all Pennsylvania college seniors, or 4,580 students, took the assessment, as did about 75 percent of high school seniors, or 26,500 high school students. Of the high school seniors, 3,859 entered a cooperating Pennsylvania college, and 2,355 of those students remained through their sophomore year (1930) and 1,187 through their senior year (1932) (Learned & Wood, 1938, p. 211). The assessment itself was a whopping twelve hours and 3,200 items long—yet the examiners expressed regret at not being more comprehensive in scope! It covered nearly all areas of the college curriculum, contained selected-response questions (e.g., multiple-choice, matching, true-false), focusing mostly on declarative knowledge and procedural knowledge (see Chapter 2 and Figure 2.1)— that is, factual recall and recognition of content and application of mathematical routines (see Figure 3.1). The main study focused on student learning and not simply on knowledge (achievement) in the senior year. To examine student learning, Learned and Wood (1938) followed high school seniors and tested them as college sophomores in 1930 and again as seniors in 1932. The Pennsylvania Study is noteworthy for at least four reasons. First, it laid out a conception of what was meant by undergraduate achievement and learning.2 That is, the study focused on the nature, needs, and achievements of individual students, assuming “the educational performance of school and college as a single cumulative process the parts of which, for any given student, should be complementary” (Learned & Wood, 1938, p. xvi). More specifically, achievement resulted from college learning, which the researchers defined as the accumulation of breadth and depth of content knowledge. The second noteworthy aspect of the assessment was its span of coverage. In terms of the cognitive outcomes framework, it focused heavily and comprehensively at the knowledge level, especially on declarative and procedural knowledge (see Figure 2.1). Nevertheless, the assessment program included an intelligence test, so that it spanned the extremes of the cognitive outcomes framework— content knowledge and general ability (Chapter 2). The third noteworthy aspect of the study was that the technology for assessing student learning and achievement followed directly from the researchers’ study framework. That objective-testing technology, influenced by behavioral psychology and especially the work of E. L. Thorndike and spawned by the Army Alpha test developed for recruitment in World War I, created a revolution (Figure 3.1). If knowledge was the accumulation of learning content, objective testing—the new technology—could be used to verify, literally index, the accumulation of that knowledge. In Learned and Woods’ words, “The question, instead of requiring


written answers, will be of a sort to test memory, judgment, and reasoning ability through simple recognition. . . . By this method a large amount of ground can be covered in a short time" (1938, p. 372).

IV. GENERAL SCIENCE, Part II
Directions. In the parenthesis after each word or phrase in the right-hand column, place the number of the word or phrase in the left-hand column of the same group which is associated with that word or phrase.
14. 1. Unit of work; 2. Unit of potential difference; 3. Unit of electrical current; 4. Unit of heat quantity; 5. Unit of power; 6. Unit of force; 7. Unit of pressure
Calorie (4); Dyne (6); Erg (1); H.P. (5); Volt (2); Ampere (3); B.T.U. (4); Atmosphere (7); Foot-pound (1); Watt (5)

V. FOREIGN LANGUAGE ... Multiple Choice
9. Sophocles' Antigone is a depiction of (1) the introduction of laws into a barbarous state, (2) the prevailing of sisterly love over citizenly duty, (3) idyllic peasant life, (4) the perils of opposing oneself to Zeus
10. Of Corneille's plays, (1) Polyeucte, (2) Horace, (3) Cinna, (4) Le Cid, shows least the influence of classical restraint

VIII. MATHEMATICS
Directions. Each of the problems below is followed by several possible answers, only one of which is entirely correct. Calculate the answer for each problem; then select the printed answer which corresponds to yours and put its number in the parenthesis at the right.
5. If two sides of a triangle are equal, the opposite angles are (1) equal (2) complementary (3) unequal (4) right angles . . . (1)

Figure 3.1 Sampling of questions on the Pennsylvania Senior Examination (Learned & Wood, 1938, pp. 374–78).

The fourth exemplary aspect of the study was that it did, unlike many accountability systems today, distinguish achievement from learning. It defined achievement as the accumulation of knowledge and reasoning capacity at a particular point in time and learning as change in knowledge and reasoning over the college years. In some cases, the comparison was across student cohorts ("cross-sectional"—high school seniors, college sophomores, and college seniors), and in other cases it was longitudinal (the same high school seniors in 1928, tested again as college sophomores in 1930 and then as seniors in 1932). These various designs


presage current-day assessments of learning by, for example, the Council for Aid to Education’s Collegiate Learning Assessment.

Assessment of Learning for General Education and Graduate Education: 1933–1947

The 1933–47 era saw the development of general education and general colleges in universities across the country and the evolution of the Graduate Record Examination (GRE). The Pennsylvania Study demonstrated that large-scale assessment of student learning could be carried out—a sort of existence proof—and individuals as well as consortia of institutions put together batteries of tests primarily to assess cognitive achievement. Perhaps most noteworthy of this progressive period in education was the attempt not only to measure cognitive outcomes across the spectrum shown in Figure 2.1 but also to assess personal, social, and moral outcomes of general education. Here I briefly treat the learning assessment in general education, because it was an alternative to rather than an adaptation of the Carnegie Foundation's view of education and learning assessment. I then focus attention on the GRE.

Evolution of General Education and General Colleges

The most notable examples of general-education learning assessment in this era were developed by the University of Chicago College and the Cooperative Study of General Education (for additional programs, see Shavelson & Huang, 2003). The former had its roots in the progressive era; the latter had its roots in the Carnegie Foundation's conception of learning but embraced some progressive notions of human development, as well.

In the Chicago program a central University Examiner's Office, not individual faculty in their courses, was responsible for developing, administering, and scoring tests of student achievement in the university's general education program (Present and Former Members of the Faculty, 1950). Whereas the Pennsylvania Study assessed declarative knowledge (recall and recognition of facts) and procedural knowledge (application of routines), the Chicago examinations tested a much broader range of knowledge and abilities (Figure 2.1): the use of knowledge in a variety of unfamiliar situations (strategic knowledge); the ability to apply principles to explain phenomena (schematic knowledge); and the ability to predict outcomes, determine courses of action, and interpret works of art (schematic and strategic knowledge). Open-ended essays and multiple-choice questions demanding interpretation, synthesis, and application of new texts (primary sources) characterized the comprehensive exams.3


The Cooperative Study of General Education, conducted by a consortium of higher-education institutions, stands out from individual institutional efforts such as that at Chicago for cooperative efforts to build an assessment system to improve students' achievement and well-being. These institutions initiated the study on the beliefs that several institutions could benefit from a cooperative attack on the improvement of general education; that by sharing costs of test development and use, more could be done cooperatively than singly; and that a formative (improvement) rather than summative (win-lose compared to others) assessment was likely to lead to this improvement (Executive Committee of the Cooperative Study in General Education, 1947; Dunkel, 1947; Levi, 1948; see Chapter 6). Accordingly, the consortium developed instruments such as the Inventory of General Goals in Life, the Inventory of Satisfactions Found in Reading Fiction, the Inventory of Social Understanding, and the Health Inventories.

The Evolution of the Graduate Record Examination

While assessment of undergraduate learning was in full swing, so were Learned and Wood, parlaying their experience with the Pennsylvania Study into an assessment for graduate education. In setting forth the purpose of the Co-operative Graduate Testing Program, as it was initially called, Learned noted that demand for graduate education had increased following the Depression, that the AB degree had "ceased to draw the line between the fit and the unfit" (Savage, 1953, p. 288), and that something more than number of college credits was needed on which to base decisions about admissions and graduate-student quality. In the initial stages of the GRE, students were tested only after they had gained admission to graduate school, but that changed three years later to an admission test for undergraduates seeking graduate work and was formalized by graduate school deans in 1942. The overall goal of the project, then, was improvement of graduate education.

In consort with the graduate schools at Columbia, Harvard, Princeton, and Yale in October 1937, Learned's team administered seven tests to index the quality of students in graduate education; this was the first administration of what was to be the GRE. A year later Brown University joined the ranks, followed by Rochester and Hamilton in 1939, and Wisconsin, Iowa, Michigan, and Minnesota in 1940. By 1940, the test battery had become a graduate-school entrance examination with increasing subscriptions. In 1945, 98 institutions had enlisted in the program, and in 1947 the number jumped to 175. The program, then, was a success. But it was also a growing financial and logistical burden at a time when the Carnegie Foundation was struggling to keep


its faculty retirement system (TIAA) afloat.4 As we shall see, these stresses provided the stimulus for the foundation to pursue an independent national testing service. The original GRE, like the Pennsylvania Study’s examinations, was a comprehensive objective test focused largely on students’ organized content knowledge, but it also tapped verbal reasoning (see Figure 2.1). The test was used to infer students’ fitness for graduate study (Savage, 1953). In 1936, a set of “Profile” Tests was developed on content intended to cover the areas of a typical undergraduate general education program (Educational Testing Service, 1953, 1954). To be completed in two half-day sessions totaling six hours, the tests measured knowledge in “mathematics, physical sciences [differentiated into physics and chemistry in the first revision of the examination], social studies [reduced to history, government, and economics], literature and fine arts [revised to “general literature” and the fine arts], one foreign language [dropped in the first revision], and the verbal factor” (Savage, 1953, p. 289). “The Verbal Factor Test was developed primarily as a measure of ability to discriminate word meanings” (Lannholm & Schrader, 1951, p. 7). In 1939, the second revision of the GRE added sixteen Advanced Tests in subject major fields—biology, chemistry, economics, engineering, fine arts, French, geology, German, government, history, literature, mathematics, philosophy, physics, psychology, and sociology—to complement the Profile Tests (Lannholm & Schrader, 1951; Savage, 1953).5 Combining the elementary and advanced tests, total testing time in 1940 was two periods of four hours each.6 In the spring of 1946, the general-education section of the GRE’s Profile Tests became available. The general-education section overlapped the Profile Tests and added tests of “effectiveness of expression” and a “general education index” (Educational Testing Service, 1953). Consequently, for a short period of time the GRE offered both the Profile Tests and the General Education Test. In spring 1947, the Graduate Record Office (GRO) launched an “ambitious program to involve 20,000 students at fifty-odd accepted colleges and universities in giving the revised examination” (Savage, 1953, p. 292) for the purpose of establishing norms. A second purpose of the Carnegie–Ivy League project was to assist institutions in assessing program effectiveness and individual student need as a means to improvement, much like the Cooperative Study of General Education. “Although scores were not published, they probably made their contribution to the solution of a variety of institutional problems . . . , and the G.R.O. got its new norms” (Savage, 1953, p. 292).


In the fall of 1949, the GRE Aptitude Test was introduced (Lannholm & Schrader, 1951), replacing the verbal and quantitative portions of the Profile Tests. This shift to aptitude testing was quite significant in the evolution of learning assessment in higher education. Operationally, in 1950 the mathematics and the verbal factor tests were discontinued as part of the Profile Tests (Educational Testing Service, 1953, p. 3), creating the basis of the current-day GRE with its quantitative and verbal sections. In 1952 the familiar standardized scale of the Educational Testing Service (ETS), with a mean of 500 and a standard deviation of 100, was introduced for the purpose of reporting GRE scores. This change in the GRE marked the beginning of an important shift away from the measurement of content knowledge to the measurement of broad abilities, especially verbal and quantitative reasoning, as the basis for making admission and fellowship decisions (see Figure 2.1). Then, in 1954, ETS announced Area Tests, replacing the Profile Tests and the Tests of General Education with a means of “Assessing the Broad Outcomes of Education in the Liberal Arts” (Educational Testing Service, 1954). The Area Tests focused on academic majors in the social and natural sciences and the humanities. They were “intended to test the student’s grasp of basic concepts and his ability to apply them to the variety of types of materials which are presented for his interpretation” (Educational Testing Service, 1954, p. 3) and were considered “important to the individual’s effectiveness as a member of society” (Educational Testing Service, 1966, p. 3). The tests emphasized reading comprehension, and interpretation; the tests often provided the requisite content knowledge “because of the differences among institutions with regard to curriculum and the differences among students with regard to specific course selection” (Educational Testing Service, 1966, p. 3). This, then, was one more step away from the recall-based Pennsylvania Study and the GRE in earlier years to a test of broader reasoning abilities (Figure 2.1). And unlike the lengthy Pennsylvania tests and the extensive Chicago comprehensives, the “new” GRE Area Tests took 3.75 hours of testing time.
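As an aside for readers unfamiliar with such reporting scales, a scale with a mean of 500 and a standard deviation of 100 is simply a linear transformation of each examinee's score relative to a norming sample; in generic form (the particular norming sample and equating procedures ETS used are not specified here),

    S = 500 + 100 · (X − X̄) / s_X

where X is the examinee's raw (or equated) score and X̄ and s_X are the mean and standard deviation of the norming sample.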

The Rise of the Test Providers: 1948–1978

During the period following World War II and with funding from the GI Bill, postsecondary education enrollments mushroomed, as did the number of colleges to accommodate the veterans and the number of testing companies to assist colleges in screening them. The most notable among those companies were the Educational Testing Service, which emerged in 1948, and the American College Testing program, which emerged in 1959 (becoming simply ACT in 1996).


By the time the Carnegie Foundation had moved the GRE to ETS and moved out of the testing business, it had left an extraordinarily strong legacy: objective, group-administered, cost-efficient testing using selected response—now solely multiple-choice—questions. Precursors to the major learning assessment programs today were developed by testing organizations in this era (e.g., Shavelson & Huang, 2003). These 1960s and 1970s testing programs included ETS’s Undergraduate Assessment Program, which incorporated the GRE, and ACT’s College Outcomes Measures Project (COMP). The former evolved via the Academic Profile into today’s Measure of Academic Proficiency and Progress (MAPP), and the latter evolved into today’s Collegiate Assessment of Academic Proficiency (CAAP). Simply put, the Carnegie Foundation’s conception of learning assessment at the turn of the 20th century had an immense influence on what achievement has been tested in higher education and the nature of achievement tests today. However, several developments in the late 1970s augured for a change in the course set by Learned and Wood. Faculty members were not entirely happy with multiple-choice tests. They wanted to get at broader abilities, such as the ability to communicate, think analytically, and solve problems, in a holistic manner. This led to several new developments including ETS’s study of constructed response tests (Warren, 1978), ACT’s open-ended assessments of learning, and the State of New Jersey’s Tasks in Critical Thinking. These assessment programs embraced what college faculty considered as important learning measures. For a short period of time, these assessment programs set the mold; but due to time and cost limitations, as well as scoring issues, they either faded into distant memory or morphed into multiple-choice tests. Warren (1978, p. 1) reported on an attempt to measure academic competence with “free-response questions.” The examination tapped communication skill, analytic thinking, synthesizing ability, and social/cultural awareness. He encountered two consequential problems—scoring and interpretation. Scoring by faculty was complex and time consuming, especially for students from nonselective institutions. Interpretation was complicated because questions that fell conceptually into a common domain did not hang together empirically. At about the same time, ACT was developing the College Outcomes Measures Project. The COMP began as an unusual performance-based assessment that sought to measure skills for effective functioning in adult life in social institutions, in using science and technology, and in using the arts (an area not often addressed by large-scale assessments at the time). The test’s contents were sampled


from materials culled from everyday experience, including film excerpts, taped discussions, advertisements, music recordings, stories, and newspaper articles. The test sought to measure three process skills—communication, problem solving, and values clarification—in a variety of item formats: multiple choice, short answer, essay, and oral response (an atypical format). COMP, then, was path breaking, bucking the trend toward multiple-choice tests of general abilities by directly observing performance in simulated, real-world situations. The test was costly in time and scoring, however. In the 1977 field trials students were given six hours to complete it; testing time was reduced to 4.5 hours in the 1989 version. Raters were required to score much of the examination. As a consequence, and characteristic of trends in assessment of learning, a simplified Overall COMP was developed as a multiple-choice only test. In little more than a decade, however, this highly innovative assessment was discontinued due to the costliness of administration and scoring. Roughly the same story can be told about Tasks in Critical Thinking (e.g., Erwin & Sebrell, 2003). The assessment grew out of the New Jersey Basic Skills Assessment Program (1977), New Jersey’s effort to assess student learning in a manner consistent with faculty members’ notion of what was important to assess—students’ performance on holistic, meaningful tasks. Tasks in Critical Thinking was a “performance-based assessment of the critical thinking skills of college and university students . . . [that measured the] ability to use the skills of inquiry, analysis, and communication” (Educational Testing Service, 1994, p. 2) where the prompts “do not assess content or recall knowledge” (p. 2). “A task resembles what students are required to do in the classroom and the world of work” (p. 3). Each task took ninety minutes to complete; students and tasks were randomly matched so that each student received only one task. Local faculty scored students’ performances based on extensive scoring guides. The New Jersey project ended due to the recession of the late 1980s and early 1990s; ETS took over marketing the examination but no longer supports it. The influence of the Carnegie Foundation, then, waned in the mid 1970s. However, as we shall see, the foundation’s vision of objective, selected-response testing remained in ETS’s and ACT’s learning assessment programs.

The Era of External Accountability: 1979–Present

By the end of the 1970s, political pressure to assess student learning and hold campuses accountable coalesced. While only a handful of states (e.g., Florida, Tennessee) had some form of mandatory standardized testing in the 1980s, public and political demand for such testing increased into the new millennium (Ewell, 2001).


To meet this demand, some states (e.g., Missouri) created incentives for campuses to assess learning; campuses responded by creating learning assessment programs.

Tests of College Learning

ETS, ACT, and others were there to provide tests. By this time a wide array of college learning assessments was available, following in the Carnegie Foundation tradition of objective tests. ETS currently provides the Measure of Academic Proficiency and Progress; ACT provides the Collegiate Assessment of Academic Proficiency. MAPP, a multiple-choice test battery, measures college-level reading, mathematics, writing, and critical thinking in the context of the humanities, social sciences, and natural sciences. It was designed to enable colleges and universities to assess their general education outcomes with the goal of improving the quality of instruction and learning. CAAP, also multiple choice, measures the domains of reading, writing, mathematics, science, and critical thinking. It was designed to enable postsecondary institutions to measure, evaluate, and enhance the outcomes of their general education programs.

From 1979 onward significant contributions to objective testing were realized, especially with the rapid evolution in computing capacity. ETS pioneered work in test-item scaling and equating (item response theory) and in computer adaptive testing. However, as we shall see, it was up to a newcomer, the Council for Aid to Education (CAE), a spin-off of the RAND Corporation, to take the next step and marry open-ended assessment of real-world holistic tasks and computer technology to create the next generation of learning assessments for higher education.

Vision for Assessing Student Learning

As we saw at the end of the 1970s, objective testing did not fit with the way faculty members assessed student learning or wanted student learning to be assessed. For them, life is not a multiple-choice test. Life does not present itself as a clearly defined statement of a problem or task with a set of specific alternatives from which to choose. Rather, faculty members sought open-ended, holistic, problem-based assessment, something like that found in the COMP and in Tasks in Critical Thinking. Intuitively, faculty members suspected that the kind of thinking and performing students exhibited on multiple-choice and other highly structured tests


was different from what they exhibited on more open-ended tasks. Empirical evidence supports their intuition. While a multiple-choice test and a “constructedresponse” test may produce scores that are positively correlated with each other, this correlation does not mean that the kind of thinking and reasoning involved is the same (e.g., Martinez, 2000; National Research Council, 2001). In a variety of domains student performance varies considerably when the same task is presented as a multiple-choice question, an open-ended question, or a concrete performance task. Lythcott (1990) and Sawyer (1990), for example, found that “it is possible . . . for [high school and college] students to produce right answers to chemistry problems without really understanding much of the chemistry involved” (Lythcott, 1990, p. 248). Moreover, Baxter and Shavelson (1994) found that middle school students who solved complex hands-on electric circuit problems could not solve the same problems represented abstractly in a multiplechoice test; these students did not make the same assumptions that the test developers made. Finally, using “think aloud” methods to tap into students’ cognitive processing, Ruiz-Primo et al. (2001) found very different reasoning on highly structured and loosely structured assessments; in the former case the students “strategized” as to what alternative fit best, and in the latter case they reasoned through the problem. To be concrete about the difference between multiple-choice and open-ended assessments and what is measured, consider the following example (described in a bit more detail below): College students are asked to pretend they work for DynaTech—a company that produces industrial instruments—and have been asked by their boss to evaluate the pros and cons of purchasing a SwiftAir 235 for the company. Concern about such a purchase has risen with the report of a recent SwiftAir 235 accident. When provided with an in-basket of information, some students, quite perceptively, recognized that there might be undesirable fallout if DynaTech’s own airplane crashed while flying with DynaTech’s instruments. Students were not prompted to discuss such implications; they had to recognize these consequences on their own. There is no way such insights could be picked up by a multiple-choice question. Finally, consistent with the view of faculty, members of the secretary of education’s Higher Education Commission and the Association of American Colleges and Universities (AAC&U) have a particular type of standardized learning assessment in mind—the Council for Aid to Education’s Collegiate Learning Assessment. In the words of the American Association of State Colleges and Universities (AASCU) (2006, p. 4):


The best example of direct value-added assessment is the Collegiate Learning Assessment (CLA), an outgrowth of RAND’s Value Added Assessment Initiative that has been available to colleges and universities since spring 2004. The test goes beyond a multiple-choice format and poses real-world performance tasks that require students to analyze complex material and provide written responses (such as preparing a memo or policy recommendation).

The AASCU (2006, p. 4) goes on to say, "Other instruments for direct assessment include ACT's Collegiate Assessment of Academic Proficiency (CAAP), the Educational Testing Services's [sic] Academic Profile and its successor, the Measure of Academic Proficiency and Progress (MAPP), introduced in January 2006. Around for more than a decade, these assessments offer tools for estimating student general education skills."

To complete this brief history, then, consider the new kid on the block, the Council for Aid to Education's Collegiate Learning Assessment, the successor of assessments such as COMP and Tasks (for details, see the next chapter).7 Admittedly I am on shaky ground by presenting as history a current development—the CLA. Historians are a cautious lot. For them, history up to the current time stops no closer than twenty years from the present. Historians notwithstanding, the CLA just might provide a window into the future of standardized learning assessments.

The Collegiate Learning Assessment

Just as the new technology of objective testing revolutionized learning assessment at the turn of the 20th century, so have new information technology and statistical sampling technology ushered in a change in college learning assessment at the turn of the 21st century. And yet, in some ways, the "new" assessment technology is somewhat a return to the past; it moves away from selected-response, multiple-choice tests to realistic, complex, open-ended tasks. These new developments are best represented by the Collegiate Learning Assessment (e.g., Benjamin & Hersh, 2002; Klein et al., 2005; Shavelson, 2007a,b). The CLA, whose roots can be traced to progressive notions of learning, focuses on critical thinking, analytic reasoning, problem solving, and written communication (see goals in Chapter 2). These capabilities are tapped in realistic "work-sample" tasks drawn from work, education, and everyday issues that are accessible to students from the wide variety of majors and general education programs found on college campuses (see Table 3.2).


Table 3.2. Characteristics of the Collegiate Learning Assessment

Open-ended tasks
  • Taps critical thinking, analytic reasoning, problem solving, and written communication
  • Provides realistic work samples
  • Features alluring task titles such as "Brain Boost," "Catfish," "Lakes to Rivers"
  • Applies to different academic majors

Computer technology
  • Interactive Internet platform
  • Paperless administration
  • Natural language-processing software for scoring written communication
  • Online rater scoring and calibration of performance tasks
  • Reports institution's (and subdivision's) performance (and individual student's performance confidentially to student)

Focus
  • Institution or divisions or programs within institutions
  • Not on individual students' performance (although their performance is reported to them confidentially)

Sampling
  • Samples students so that not all students perform all tasks
  • Samples tasks for random subsets of students
  • Creates scores at institution or subdivision/program level as desired (depending on sample sizes)

Reporting
  • Controls for students' ability so that "similarly situated" benchmark campuses can be compared
  • Provides value-added estimates—from freshman to senior year or with measures on a sample of freshmen and seniors
  • Provides percentiles
  • Provides benchmark institutions

source: R. J. Shavelson, 2007a, Chart 1 and Characteristics of the Collegiate Learning Assessment (p. 32).

The capacity to provide these

rich tasks is afforded by recent developments in information technology. The assessment is delivered on an interactive Internet platform that produces a paperless, electronic administration. Written communication tasks have been scored using natural language–processing software, and performance tasks are scored by online raters whose scoring is monitored and calibrated. Reports are available online. The CLA also uses sampling technology to move away from testing all students on all tasks as was done in the whopping twelve-hour and 3,200-item Pennsylvania Study in 1928. The focus then was on individual student development; CLA focuses on program improvement, with limited information provided to students confidentially (i.e., not available to the institution). Institutional (and subdivision) reports provide a number of indicators for interpreting performance. These include anonymous benchmark institution comparisons; percent of institutions scoring below a certain level; and value added over and above performance expected in the institution, based on admitted-student abilities (see


[Figure 3.2 is a scatterplot: each point is a school's mean CLA Total Score (vertical axis, roughly 700 to 1600) plotted against its mean SAT (or converted ACT or SLE) score (horizontal axis, roughly 700 to 1600), with separate points and fitted lines for freshmen and seniors.]

Figure 3.2 Relationship between mean SAT/ACT scores (in SAT units) and CLA scores for 176 schools tested in the fall of 2007 (freshmen) and spring of 2008 (seniors). Source: Council for Aid to Education (2008). 2007–2008 CLA Technical Appendices. New York: author (p. 2); www.cae.org/content/pdf/CLA.in.Context.pdf.

Figure 3.2), through cross-sectional comparisons, and through longitudinal cohort studies or some combination. For example, Figure 3.2 shows the performance of entering freshmen (fall 2007) and seniors (spring 2008) at a set of colleges participating in the CLA. Each point on the graph represents the average (mean) college performance on the SAT/ACT and the CLA; the swarm of points shows the relationship between colleges’ mean SAT/ACT and CLA scores. A number of features in this are noteworthy. First, and perhaps most encouraging, the boxes and line (seniors) fall significantly (more than 1 standard deviation) above the circles and line (freshmen). This finding may be interpreted to mean that college does indeed


Table 3.3. Critique an Argument

A well-respected professional journal with a readership that includes elementary school principals recently published the results of a two-year study on childhood obesity. (Obese individuals are usually considered to be those who are 20 percent above their recommended weight for height and age.) This study sampled 50 schoolchildren, ages 5–11, from Smith Elementary School. A fast-food restaurant opened near the school just before the study began. After two years, students who remained in the sample group were more likely to be overweight relative to the national average. Based on this study, the principal of Jones Elementary School decided to confront her school's obesity problem by opposing any fast-food restaurant openings near her school.

source: www.cae.org/content/pdf/CLA.in.Context.pdf.

contribute to student learning (as do other life experiences). Second, most colleges (dots) fall along the straight (“regression”) line of expected performance based on ability for both freshmen and seniors—but some fall well above and some well below. This means that by students’ senior year, some colleges exceed expected performance compared to their peers, and some perform below expectation. So it matters not only that a student goes to college but also where that student goes.8 The assessment is divided into three parts—analytic writing, performance tasks, and biographical information; the first two are pertinent here. Two types of writing tasks are administered. The first, “Make an Argument,” invites students to present an argument for or against a particular position. For example, the prompt might be: “In our time, specialists of all kinds are highly overrated. We need more generalists—people who can provide broad perspectives.” Students are directed to indicate whether they agree or disagree and to explain the reasons for their position. In a similar vein, the second type of writing task asks students to “Critique an Argument” (see the example in Table 3.3). Students’ responses have been scored by raters in some years and in other years by a computer with a natural language–processing program. The performance tasks present real-life problems to students, providing an “in-basket” of information bearing on the problem (see Figure 3.3). Some of the information is relevant, some not; some is reliable, some not. Part of the problem is for the students to decide what information to use and what to ignore. Students integrate these multiple sources of information to arrive at a problem solution, decision, or recommendation. Students respond in a real-life manner by, for example, writing a memorandum to their boss analyzing the pros and cons of alternative solutions and recommending what the company should do. In scoring performance, there are a set of recognized, alternative, justifiable solutions to the problem and alternative solution


You are the assistant to Pat Williams, the president of DynaTech, a company that makes precision electronic instruments and navigational equipment. Sally Evans, a member of DynaTech's sales force, recommended that DynaTech buy a small private plane (a SwiftAir 235) that she and other members of the sales force could use to visit customers. Pat was about to approve the purchase when there was an accident involving a SwiftAir 235. You are provided with the following documentation:

1. newspaper articles about the accident
2. federal accident report on in-flight breakups in single-engine planes
3. Pat's e-mail to you and Sally's e-mail to Pat
4. charts on SwiftAir's performance characteristics
5. amateur pilot article comparing SwiftAir 235 to similar planes
6. pictures and description of SwiftAir models 180 and 235

Please prepare a memo that addresses several questions, including what data support or refute the claim that the type of wing on the SwiftAir 235 leads to more in-flight breakups, what other factors might have contributed to the accident and should be taken into account, and your overall recommendation about whether or not DynaTech should purchase the plane.

Figure 3.3 Collegiate Learning Assessment performance task. Source: www.cae.org/content/pdf/CLA.in.Context.pdf.

Currently, human judges score students’ responses online, but by 2010, the expectation is that responses will be scored by computer.

The CLA does not pretend to be the measure of collegiate learning. Rather, as the Council for Aid to Education points out, there are many outcomes for college education; the CLA focuses on critical reasoning, problem solving, and communication. Moreover, with its institutional (or school/college) focus, it does not provide detailed, diagnostic information about particular courses or programs (unless the sampling is done at a program level). Rather, other institutional information, in conjunction with the CLA, is needed to diagnose problems. Moreover, campuses need to systematically test out possible solutions to those problems. (See Benjamin, Chun, & Shavelson, 2007, for a detailed explanation of how the CLA might be used for improvement.) The CLA, then, sends a strong signal to the campus to dig deeper.


Reprise

The assessment of student learning is a top priority today in the quest to hold campuses accountable. Although it is portrayed as the “new thing,” student learning assessment has a long and distinguished history that can be traced to the Carnegie Foundation for the Advancement of Teaching at the turn of the last century. Over this time period, we have seen the following changes in learning assessment:
• from institutionally initiated to externally mandated
• from internally written to externally provided
• from content based to ability based, i.e., from assessing primarily declarative and procedural knowledge to assessing generic reasoning abilities
• from extensive coverage of many subjects toward narrower subject-specific coverage
• from an emphasis on an individual’s level of competence (against some standard) to an emphasis on his relative standing (i.e., in comparison to others)
• from lengthy objective (e.g., multiple-choice) and constructed-response (e.g., essay) tests toward standardization, multiple-choice format, and short test lengths
• from multiple approaches to holistic assessment of broad abilities to largely admissions tests (most learning assessments failing to survive for long due to limited technology)

Nevertheless, there is much to be learned from the past in the design of an accountability system today. Over seventy-five years ago the Pennsylvania Study proved to be quite sophisticated and apropos for today in that it was built on (1) a well-articulated notion of achievement and learning, one that is prevalent today; (2) a comprehensive notion of what knowledge (across the college subjects of humanities, social science, and science) campuses should be developing in students; (3) a data collection design that indexed both achievement at one point of time for cohorts of high school seniors, college sophomores, and college seniors and learning of the same cohort of high school seniors throughout their college careers; and (4) state-of-the-art objective testing technology.

Over time, especially in the progressive era, assessment of learning in areas other than cognitive outcomes came into vogue, signaling a broader conception of outcomes to be tested than is apparent today. Philosophical differences emerged between the Carnegie vision of knowledge accumulation and the progressive movement’s concern for practical application of and reasoning with knowledge.


Disagreement emerged as to the relative value of the two and persists today. The Carnegie vision can be traced to the empiricist philosophers and their focus on internalization of regular patterns in the environment (e.g., Case, 1996). Learned and Wood (1938, pp. 7–8) state the position well when they say that content knowledge “must be a relatively permanent and available equipment of the student; that it must be so familiar and so sharply defined that it comes freely to mind when needed and can be depended upon as an effective cross-fertilizing element for producing fresh ideas; [and that] a student’s knowledge, when used as adequate evidence of education . . . should represent as nearly as possible the complete individual.” One consequence of this concept is that knowledge can be divided into particular content areas, instruction proceeding step by step from one learning objective to the next. A second consequence is that assessment of learning should sample individual pieces of content from a knowledge domain (declarative and procedural knowledge) bit by bit in objective-test fashion, as did the tests developed for the Pennsylvania Study and as do many current learning assessments (e.g., the Measure of Academic Proficiency and Progress, and the Collegiate Assessment of Academic Proficiency).

In contrast, the progressive era notion of knowledge stemmed from a rationalist position. This position held that knowledge is built up by the student and has its own internal structure. One consequence of this notion was that knowledge should be constructed in a guided-discovery fashion by engaging a student’s natural curiosity and structuring abundant opportunities for exploration and reflection. A second consequence is that from a knowledge domain the assessment of learning should sample complex tasks that have embedded in them both knowledge and reasoning demands that multiple-choice tests are unable to tap adequately. This philosophy was implemented in the University of Chicago’s Examiner’s Office and can be seen most recently in the Collegiate Learning Assessment.

The relative emphasis on what should be learned in college has shifted between knowledge and broad abilities over the past hundred years. In the first half of the 20th century learning assessment emphasized declarative and procedural content knowledge; in the second half, assessments emphasized broad abilities and reasoning. The evolution of the current-day GRE reflects this trend nicely. In the end, some balance between outcomes seems reasonable such that learning assessments should tap the knowledge, broad abilities, and general reasoning levels reflected in the cognitive framework introduced in Chapter 2.


Assessment of learning in general education should focus on the last two levels; learning assessment in the majors should focus on the first two levels. Assessment of learning, then, should tap multiple cognitive outcomes ranging from declarative knowledge to broad domain abilities to verbal, quantitative, and spatial reasoning.

However, as noted in Chapter 2, emphasis on cognitive outcomes is insufficient for assessing learning.9 Learning assessment needs to include what the AAC&U calls Individual and Social Responsibility Outcomes, such as civic engagement, ethical reasoning, intercultural knowledge and actions, and self-development. The current focus solely on cognitive outcomes is too narrow, judging by the outcomes the public expects from higher education and the mission statements of colleges and universities. Measurement problems in high-stakes accountability need to be addressed; these very problems affect indirect measures of student learning such as the National Assessment of Student Engagement and any high-stakes measures of cognitive outcomes. Simply put, a broader set of outcomes should be incorporated into learning assessments.

Just as the learning outcomes assessed have changed over the past hundred years, so have those organizations that provide learning assessments. In the first half of the 20th century, foundations and colleges provided the assessments; in the last half, external testing organizations provided the assessments. Testing organizations will continue to provide a wide array of assessments that campuses can use to assess students’ learning. These assessments have mainly focused on cognitive outcomes and have varied in their emphasis on knowledge, abilities, and reasoning, reflecting the philosophical differences noted above.

If one assumes, as did Learned at Carnegie, that learning amounts to the accumulation of knowledge, and that the purpose of college education is to fill the student vessel full of that knowledge so that it is readily available for use, multiple-choice assessments of declarative and procedural knowledge would be appropriate for campuses to use to index performance in a major and in general education. Assessments such as MAPP and CAAP can be traced back to Learned’s notion. If, on the other hand, one assumes that learning amounts to the construction of knowledge and reasoning capacities within a knowledge domain or across domains in complex, meaningful, real-world tasks, as did the progressives and, apparently, as do today’s faculty, a campus might seek assessments that employ constructed responses to tasks that are complex and require a variety of knowledge and reasoning types to complete them.


Evidence that such assessments can be built and fielded comes from the Chicago Examiner’s Office examinations, through the attempts in the 1980s to construct such examinations, to the present-day Collegiate Learning Assessment.

Externally provided learning assessments, however, are not tied directly to any particular campus curriculum. They consequently strike a common denominator such that their content overlaps, generally, with most curricula. As a consequence, they tend to tap multiple content areas in assessing knowledge and reasoning. In doing so, they place considerable demand on strategic knowledge, both within a domain (e.g., psychology) and across domains (e.g., reasoning in social sciences). Strategic knowledge is closely related to general ability or “G.” However, as we pointed out, G is relatively stable by the time students get to college. Consequently, such measures as the SAT, which largely taps G, will be a good predictor of performance on these externally provided measures. Two implications follow:
1. Care must be taken in interpreting findings; externally provided measures may not be as sensitive to learning in the curriculum taught at the campus as might be desired and useful. Moreover, invidious comparisons that arise from the demand for summative accountability may lead to misinterpretation of learning assessment findings.
2. Externally provided measures are insufficient indices of student learning. They need to be supplemented with locally devised assessments that are sensitive to campus goals and curricula.

Regardless of the nature of the learning outcome measure, externally provided examinations largely serve a signaling function. They flag areas of strength and weakness that campuses might attend to. If MAPP or CLA, for example, signaled a problem with general education outcomes, it might not be sufficiently diagnostic to pinpoint, on a particular campus, what might be improved. In attending to these signals, then, campuses inevitably need campus-contextualized information on student achievement and learning in order to formulate interventions (see Chapter 5). For this, more in-depth assessments of students’ performances might be called for, in conjunction with an understanding of the particular program and its context. Consequently, these externally provided assessments need to be augmented by campus-specific measures.

The value of externally provided assessments, then, lies in their ability to benchmark performance (e.g., by norm data, by comparison sets of institutions, by value added) and signal that attention is needed.


Assessment of learning should go beyond externally provided assessment and include context-sensitive indicators of learning. It should be combined with a campus’s willingness to experiment with and study the effects of improvement alternatives. Some speak of this as building a “culture of evidence.”

4

The Collegiate Learning Assessment

THE COLLEGIATE LEARNING ASSESSMENT (CLA) has a long, if relatively unknown, pedigree, as we saw in Chapter 3, stemming from the progressive era’s conception of learning in the late 1930s. Yet it is also a newcomer in the sense that measures like the CLA, in their most recent incarnations, faded away about twenty years ago. There is, then, a bit of the unknown about the CLA, especially given its current prominence in higher-education assessment and policy circles. Hence, this chapter highlights one among several learning assessments.1 It begins with background on the origin of the CLA and describes its underlying philosophy. This is followed by a description of the assessment tasks and criteria for scoring them. Then attention turns to score reliability and validity. The chapter concludes with a reprise that addresses published criticism of the assessment.

Background: A Personal Perspective

The CLA was conceived jointly by Roger Benjamin, Steve Klein, and me (for origins, see Benjamin & Hersh, 2002; Chun, 2002; Hersh & Benjamin, 2002; Klein, 2002a,b). Benjamin came from a political economy background and as a former dean and provost was concerned that colleges and universities were making decisions in the absence of good information about student learning. To be sure, information about learning was available—in grades, pass rates, graduation rates, and the like. But there was no way to benchmark how good was good enough. From Benjamin’s vast experience, he, along with colleague Dick Hersh, a former university president, felt and continues to feel that campuses could do more to improve student learning. Both Klein and I, as psychologists, shared Benjamin’s belief but had come from backgrounds in assessment—traditional and especially alternative, nontraditional assessment.


We conceived of the CLA much as it has evolved today, as described in Chapter 3. Moreover, as Benjamin had recently become president of the Council for Aid to Education (CAE), we had an organizational structure for putting our ideas into practice. CAE is a national nonprofit organization based in New York City. Initially established in 1952 to advance corporate support of education and to conduct policy research on higher education, CAE today also focuses on improving quality in and access to higher education. CAE was an affiliate of the RAND Corporation from 1996 to 2005,2 and that relationship fostered research and development of the CLA. The CLA is central to CAE’s focus. (Incidentally, CAE is also the nation’s sole source of empirical data on private giving to education, through the annual Voluntary Support of Education survey and its Data Miner interactive database.)

In conceiving the CLA, we were clear on several issues. First, none of us believed in “high-stakes” political use of assessment to improve institutional performance. Rather, our concern was and is with signaling to campuses how well they are doing in improving student learning compared to benchmark peers, as well as compared to their own goals over time. CAE and its board are on record about this matter, as follows:

We support improving assessment, especially assessment of student learning outcomes in undergraduate education. The goal of undergraduate learning assessment should be to help faculty and administrators . . . use measures to improve teaching and learning. The Collegiate Learning Assessment (CLA) is one tool designed for this purpose. . . . We strongly believe that a national testing regime is not appropriate for America’s higher education system. The greatness of American higher education rests in its independence, diversity of missions, and commitment to teaching, research, and service of the highest quality. A one-size-fits-all testing regime would run counter to the historical success of our postsecondary education sector, inject opportunities for inappropriate political intrusion, and weaken its future ability to innovate and compete in multiple ways. (www.ed.gov/about/bdscomm/list/hiedfuture/4th-meeting/benjamin.pdf)


Second, we do not believe the CLA provides all of the information campuses need to make informed improvement decisions (Benjamin, 2008). The CLA signals to campuses how well they are doing against what would be expected of their students’ performance, against their own goals, and against benchmark campuses on broad abilities (Figure 3.2) of critical thinking, analytic reasoning, problem solving, and communication (e.g., Klein et al., 2008). At least two ingredients are missing to capture the overall picture for improvement: (1) External assessments of students’ learning in the majors are needed, as are such assessments of students’ learning in the areas of responsibility—personal, social, moral, and civic (Chapter 3). (2) Internal measures of student learning are needed (Shavelson, 2008a,b). Such measures would be sensitive to the particular curriculum and context of the campus and provide diagnostic information about where improvements in teaching and learning might be made. This is not to say the CLA cannot be used internally; it can be (see Benjamin, Chun & Shavelson, 2007). CLA-like tasks can be used as teaching tools, and classroom interchanges around these tasks can produce a wealth of diagnostic information. That said, the assessment of learning is broader and more deeply contextualized than CLA-type tasks can tap, and campus assessment programs can serve that function (see Chapter 5).

Finally, simply providing information about student learning and including this information in some kind of report card or balanced score card (Chapter 8) does not guarantee that it will do any good. A campus needs the will and capacity to make use of this information. In particular, as will be seen in Chapter 5, the campus needs its president on down to deans, department chairs, faculty, and students to be in the feedback loop; improvement needs to be highly valued, experimented on, and closely monitored.

Today, the CLA is run by about fifteen full-time CAE staff members and twelve to fourteen part-time consultants. Compared to that of most testing organizations, the staffing is lean. This means that a great deal of the attention given to the CLA is operational, as the assessment grew from fourteen participating campuses in 2005 to over three hundred in the spring of 2007. It also means that while the CLA is being researched at CAE, funding for that research—and producing, publishing, and otherwise publicizing it—has been slower than desired. Nevertheless, there is a substantial body of publications (see, for example, www.cae.org/content/pro_collegiate_reports_publications.htm). What is currently missing is a review summarizing this body of research (although Klein et al., 2007, and Klein et al., 2008, come close).


This chapter attempts to bring this research and new analyses together in presenting much of what is known about the CLA.

Underpinnings of the CLA

The CLA, unlike other assessments of undergraduates’ learning, which are primarily multiple-choice tests, is an assessment composed entirely of constructed-response tasks that are delivered and scored on an Internet platform (see Chapter 3). The CLA was developed to measure undergraduates’ learning—in particular their ability to think critically, reason analytically, solve problems, and communicate clearly. The assessment focuses on campuses or on programs within a campus—not on producing individual student scores. Campus- or program-level scores are reported, both in terms of observed performance and as value added beyond what would be expected from entering students’ SAT scores. This said, the CLA also provides students their scores on a confidential basis so they can gauge their own performance.

The assessment consists of two major components: a set of performance tasks and a set of two different kinds of analytic writing prompts (see Figure 4.1 and see Chapter 3 for examples of tasks and prompts). The performance task component presents students with a problem and related information and asks them either to solve the problem or to recommend a course of action based on the evidence provided. The analytic writing prompts ask students to take a position on a topic, make an argument, or critique an argument.

As noted, the CLA differs substantially, both philosophically and theoretically, from most learning assessments, such as the Measure of Academic Proficiency and Progress (MAPP) and the Collegiate Assessment of Academic Proficiency (CAAP) (Chapter 3; Benjamin & Chun, 2003; Shavelson, 2008a,b). Such learning assessments grew out of an empiricist philosophy and a psychometric/behavioral tradition. From this tradition, everyday complex tasks are divided into component parts, and each is analyzed to identify the abilities required for successful performance. For example, suppose that components such as critical thinking, problem solving, analytic reasoning, and written communication were identified. Separate measures of each of these abilities would then be constructed, and students would take a test (typically multiple-choice) for each. At the end of testing, students’ test scores would be added up to construct a total score. This total score would be used to describe, holistically, students’ performance. This approach, then, assumes that the sum of component part test scores equals holistic performance.


Figure 4.1 Collegiate Learning Assessment structure. The CLA’s broad abilities (critical thinking, analytic reasoning, problem solving, communicating) are assessed through performance tasks and analytic writing tasks, the latter comprising Make an Argument and Critique an Argument prompts. Source: R. Shavelson; www.cae.org/content/pdf/CLA.in.Context.pdf.

In contrast, the CLA is based on a combination of rationalist and sociohistorical philosophies in the cognitive-constructivist and situated-in-context traditions (e.g., Case, 1996; Shavelson, 2008b). The CLA’s conceptual underpinnings are embodied in what has been called a criterion sampling approach to measurement (McClelland, 1973). This approach assumes that complex tasks cannot be divided into components and then summed. That is, it assumes that the whole is greater than the sum of the parts and that complex tasks require the integration of abilities that cannot be captured when divided into and measured as individual components.

The criterion-sampling notion goes like this: If you want to learn what a person knows and can do, sample tasks from the domain in which that person is to perform, observe her performance, and infer competence and learning from the performance. For example, if you want to find out not only whether a person knows the laws governing driving a car but also whether she can actually drive a car, don’t judge her performance solely with a multiple-choice test. Rather, also administer a behind-the-wheel driving test. The task would include a “sample” of “real-life” driving conditions, such as starting a car, signaling and pulling into traffic, turning left and right into traffic, backing up, and parking.


Table 4.1. Criterion Sampling Approach and the Collegiate Learning Assessment

Criterion Sampling Approach
• Samples tasks from real-world domains
• Samples “operant” as well as “respondent” responses
• Elicits complex abstract thinking (operant thought patterns)
• Provides information on how to improve on tasks (cheating is not possible if student can actually perform the criterion task)

Collegiate Learning Assessment
• Samples holistic, real-world tasks drawn from life experiences
• Samples constructed responses (no multiple-choice)
• Elicits critical thinking, analytic reasoning, problem solving, and communication
• Provides tasks for teaching as well as assessment

source: R. Shavelson; www.cae.org/content/pdf/CLA.in.Context.pdf.

Based on this sample of performance, it would be possible to draw inferences about her driving performance more generally. Based on the combination of a multiple-choice test on driving laws and this performance assessment, it would be possible to draw inferences about her knowledge and performance.

The CLA follows the criterion-sampling approach by drawing from a domain of real-world tasks that are holistic and based on real-life situations (Table 4.1). It samples tasks and collects students’ operant responses. That is, the task of, say, writing a memorandum corresponds to real-life tasks. Moreover, the initial operant responses students generate may be modified with feedback as they encounter new material in an “in-box” and cross-reference documents. These responses parallel those expected in the real world. There are no multiple-choice items in the assessment; indeed, life does not present itself as a set of alternatives with only one correct course of action. Finally, the CLA provides CLA-like tasks to college instructors so they can “teach to the test” (Benjamin, Chun & Shavelson, 2007). With the criterion-sampling approach, teaching to the test is not a bad thing. If a person “cheats” by learning and practicing to solve complex, holistic, real-world problems, she has demonstrated the knowledge and skills that educators seek to develop in students. That is, she has learned to think critically, reason analytically, solve problems, and communicate clearly. Note the contrast with traditional learning assessments, for which practicing isolated skills and learning strategies to improve performance may lead to higher scores but are unlikely to generalize to a broad, complex domain.


CLA Performance Tasks and Scoring

From Chapter 3 recall the DynaTech performance task (see Figure 3.3; see also Shavelson 2007a,b; 2008a,b), which exemplifies the type of performance tasks found on the CLA and their complex, real-world nature. In that task the company’s president is about to approve the acquisition of a SwiftAir 235 for the sales force when the aircraft is involved in an accident. The president’s assistant (the examinee) is asked to evaluate the contention that the SwiftAir is accident prone, given an in-basket of information. The examinee must weigh the evidence and use this evidence to support a recommendation to the president. The examinee is asked the following:
• Do the available data tend to support or refute the claim that the type of wing on the SwiftAir 235 leads to more in-flight breakups? What is the basis for your conclusion?
• What other factors might have contributed to the accident and should be taken into account?
• What is your preliminary recommendation about whether or not DynaTech should buy the plane, and what is the basis for this recommendation?

Consider another performance task, “Crime” (Shavelson 2007a,b; 2008a,b,c). The mayor of Jefferson is confronted with a rising number of crimes in the city and their association with drug trafficking. This issue arises just as the mayor is standing for reelection. He has proposed increasing the number of police. His opponent, a City Council member, has proposed an alternative to increasing the number of police—increased drug education. Her proposal, she argues, addresses the cause and is based on research studies. As an intern to the mayor, the examinee is given an in-basket of information regarding crime rates, drug usage, relationship between number of police and robberies, research studies, and newspaper articles (see Figure 4.2). The examinee’s task is to advise the mayor, based on the evidence, as to whether his opponent is right about both drug education and her interpretation of the positive relationship between the number of police and the number of crimes.

Performance tasks are scored analytically and holistically (Table 4.2). Judges score specific components of each answer (typically 0 for incorrect or 1 for correct) and also provide holistic judgments of overall critical thinking and writing (on a Likert-type scale). Holistic and component scores are summed to create a total score. A different analytic scoring system is developed for each performance task. This is necessary because tasks vary in the demands they make on, and the weight given to, critical thinking, analytic reasoning, problem solving, and communication to successfully carry out the task.3
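To make the scoring arithmetic concrete, here is a minimal sketch in Python with invented numbers; the function name, the number of components, and the ratings are hypothetical and do not reproduce CAE’s operational scoring procedures.

# Illustrative only: combine a judge's analytic component scores (0/1) and
# holistic ratings into a raw performance-task score, as described above.
def raw_performance_score(component_scores, holistic_ratings):
    # component_scores: 0/1 judgments for the task-specific components
    # holistic_ratings: Likert-type ratings for overall critical thinking and writing
    return sum(component_scores) + sum(holistic_ratings)

# Hypothetical response: seven of nine components credited, plus holistic
# ratings of 4 (critical thinking) and 5 (writing).
print(raw_performance_score([1, 1, 0, 1, 1, 1, 0, 1, 1], [4, 5]))  # prints 16

Because the number of components differs from task to task, raw totals such as this are comparable only after rescaling, a point taken up in the discussion of raw scores below.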

This is the fifteenth drug-related arrest in Jefferson this month, and the police are calling it an epidemic. Sergeant Heather Kugelmass said, “Drugs are now the number one law enforcement problem in Jefferson. Half of our arrests involve drugs.” Mayor Stone has called for more money to hire more police officers to reduce the growing crime rate in Jefferson. But the Council is divided on what to do. City Council members Alex Nemeth and LeighAnn Rodd called a press conference to demand that the rest of the council support an increase in the police budget. “If we put more cops on the street,” they said, “we will show that criminals are not welcome in Jefferson.” Mayoral candidate Dr. Jamie Eager called for a different approach. “More police won’t make a difference. We need more drug treatment programs,” Eager said. “The problem is not crime, per se, but crimes committed by drug users to feed their habits. Treat the drug use, and the crime will go away.” The Council is slated to debate the proposed budget increase for police at its next meeting.

JEFFERSON TOWNSHIP – On Monday police arrested a man suspected of robbing the Smart-Shop grocery store of $125. The arrest came less than six hours after Ester Hong, the owner of the Smart-Shop store, reported the robbery. The suspect, Chris Jackson, was found just a few blocks from the store, and he put up no resistance when police arrested him. He was apparently high on drugs he had purchased with some of the money taken from the store. Ms. Hong told reporters that Mr. Jackson came into the store just after it opened and demanded all the money from the cash register. He threatened the owner with a knife, and Ms. Hong gave him all the cash she had. The suspect fled, and Ms. Hong called the police. A few hours later police responded to a telephone complaint and found Mr. Jackson in an alley a few blocks from the store. The arresting officer said he appeared to be stoned and did not attempt to evade arrest. The officers found a syringe and other drug paraphernalia in Jackson’s pocket. He was charged with armed robbery and possession of drugs.

[The in-basket also includes two charts: “Crime Rates and Police Officers in Columbia’s 53 Counties,” plotting the number of police officers per 1,000 residents against the number of robberies and burglaries per 1,000 residents, and “Crime Rate and Drug Use in Jefferson by Zip Code,” listing the number of crimes in 1999 and the percent of the population using drugs for each zip code.]

Figure 4.2 CLA in-basket items from the “Crime” performance task. Source: www.cae.org/content/pdf/CLA.in.Context.pdf.


Table 4.2. Scoring Criteria for Performance Tasks

Evaluation of evidence
How well does the student assess the quality and relevance of evidence, by doing the following?
• Determining what information is or is not pertinent to the task at hand
• Distinguishing between rational claims and emotional ones, fact from opinion
• Recognizing the ways in which the evidence might be limited or compromised
• Spotting deception and holes in the arguments of others
• Considering all sources of evidence

Analysis and synthesis of evidence
How well does the student analyze and synthesize data and information, by doing the following?
• Presenting his or her own analysis of the data or information (rather than accepting it as is)
• Avoiding and recognizing logical flaws (e.g., distinguishing correlation from causation)
• Breaking down the evidence into its component parts
• Drawing connections between discrete sources of data and information
• Attending to contradictory, inadequate, or ambiguous information

Drawing conclusions
How well does the student form a conclusion from his or her analysis, by doing the following?
• Constructing cogent arguments rooted in data or information rather than speculation or opinion
• Selecting the strongest set of supporting data
• Prioritizing components of the argument
• Avoiding overstated or understated conclusions
• Identifying holes in the evidence and suggesting additional information that might resolve the issue

Acknowledging alternative explanations and viewpoints
How well does the student consider other options and acknowledge that his or her answer is not the only perspective, by doing the following?
• Recognizing that the problem is complex and has no clear answer
• Proposing other options and weighing them in the decision
• Considering all stakeholders or affected parties in suggesting a course of action
• Qualifying responses and acknowledging the need for additional information in making an absolute determination

source: www.cae.org/content/pdf/CLA.in.Context.pdf.


Analytic Writing Tasks and Scoring

The CLA contains two types of analytic writing tasks, one asking students to make (build) an argument and the other asking them to critique an argument (see Chapter 3 for examples). Analytic writing invariably depends on clarity of thought in expressing the interrelated skill sets of critical thinking, analytic reasoning, and problem solving. Students’ performances, then, depend on both writing and critical thinking as integrated rather than separate skills. Writing performance is evaluated using component and holistic scores that consider several aspects of writing, depending on the task. More specifically, both types of tasks are scored using criteria in Table 4.3, as appropriate to the particular task.


Table 4.3. Criteria for Scoring Responses to Analytic Writing Prompts

Analytic writing skills invariably depend on clarity of thought. Therefore, analytic writing and critical thinking, analytic reasoning, and problem solving are related skill sets. The CLA measures critical thinking performance by asking students to explain in writing their rationale for various conclusions. In doing so, their performance is dependent on both writing and critical thinking as integrated rather than separate skills. We evaluate writing performance using holistic scores that consider several aspects of writing depending on the task. The following are illustrations of the types of questions we address in scoring writing on the various tasks.

Presentation
How clear and concise is the argument? Does the student:
• Clearly articulate the argument and the context for that argument;
• Correctly and precisely use evidence to defend the argument; and
• Comprehensibly and coherently present evidence?

Development
How effective is the structure? Does the student:
• Logically and cohesively organize the argument;
• Avoid extraneous elements in the argument’s development; and
• Present evidence in an order that contributes to a persuasive and coherent argument?

Persuasiveness
How well does the student defend the argument? Does the student:
• Effectively present evidence in support of the argument;
• Draw thoroughly and extensively from the available range of evidence;
• Analyze the evidence in addition to simply presenting it; and
• Consider counterarguments and address weaknesses in his/her own argument?

Technical Considerations: Reliability and Validity

Standardized assessments are obliged to provide information about the reliability of scores and the validity of score interpretations. Although considerable research on these technical considerations has been done with the CLA (e.g., Klein et al., 2005; Klein et al., 2007; Klein et al., 2008),4 because it is a fairly new assessment there are clearly missing pieces of information that need to be provided in the near future. Such pieces of information will be pointed out along the way.

Reliability

Reliability refers to the consistency of scores produced by a measurement procedure such as the CLA. If a test produces reliable scores, a person would be expected to get about the same score taking the test from one occasion to the next, assuming no intervening learning or maturation (“test-retest” reliability); about the same score from one form of the test to another form (“equivalent-forms” reliability); about the same score from one item to another on a single test (“internal-consistency” reliability); or about the same score from one rater to another rater (“inter-rater” reliability).5
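As a concrete, if simplified, illustration of two of these coefficients, the short Python sketch below computes an inter-rater correlation and an internal-consistency estimate (Cronbach’s alpha) for made-up scores; the data and the specific formulas shown are assumptions for illustration, not the procedures used in the CLA analyses cited later.

from statistics import mean, pvariance

def pearson(x, y):
    # Inter-rater reliability for a single pair of raters: the correlation
    # between two raters' scores over the same set of examinees.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pvariance(x) ** 0.5 * pvariance(y) ** 0.5)

def cronbach_alpha(items):
    # Internal consistency: items is a list of score lists, one list per
    # scored component, each ordered by examinee.
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - sum(pvariance(i) for i in items) / pvariance(totals))

# Hypothetical scores for five examinees.
print(round(pearson([12, 15, 9, 18, 14], [11, 16, 10, 17, 13]), 2))                   # about .95
print(round(cronbach_alpha([[1, 0, 1, 1, 0], [1, 1, 1, 0, 0], [0, 0, 1, 1, 0]]), 2))  # about .46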


Each method for estimating reliability produces a reliability coefficient ranging from 0 (no consistency) to 1.00 (perfect consistency). Coefficients above .70 are useful for aggregates (e.g., campus scores); coefficients above .80 are useful when individual student scores are reported.

The CLA produces a variety of scores. It produces total “raw” scores and raw scores for performance and writing tasks. Moreover, it provides value-added scores for total, performance, and writing tasks. Raw scores are produced by the scoring rubrics for each of the CLA tasks. For all tasks, a raw score is the sum of the analytic- and holistic-score components. As the number of components varies from one task to another, raw performance-task scores are scaled to an SAT standardized score. School-level scores are the average scores earned by students at a particular campus. For example, if a campus has a sample of one hundred students responding to a performance task, the school-level raw score would be the average of those one hundred students’ performance-task raw scores (see Klein et al., 2007).

The CLA also reports “value-added” scores (Klein et al., 2008). Value-added scores reflect the extent to which a campus performed as expected, better than expected, or worse than expected on the CLA, based on the “quality” of its students upon matriculation as indexed by SAT or ACT scores (see Figure 3.2). Freshman CLA scores are predicted from the students’ SAT scores. A better-than-expected score arises when a campus’s raw CLA score is higher than its expected or predicted CLA score (above the regression line in Figure 3.2). An expected score arises when a campus’s raw CLA score falls on or close to expected (represented by the line). And a below-expected score arises when a campus’s raw CLA score falls below the regression line. That is, for each participating school, a “discrepancy” score is calculated that measures the distance the school’s CLA score falls from what would be expected for a given level of SAT input. So a campus has a Freshman Discrepancy Score and a Senior Discrepancy Score. Each discrepancy score provides an estimate of above, at, or below expectation. In addition to discrepancy scores, the CLA reports a campus’s Value-Added Score, which is its Senior Discrepancy Score minus the Freshman Discrepancy Score. To summarize (a minimal numerical sketch of these calculations follows the list):
• Freshman Discrepancy Score: CLA freshman raw score minus the expected score based on the SAT
• Senior Discrepancy Score: CLA senior raw score minus the expected score based on the SAT
• Value-Added Score: Senior Discrepancy Score minus the Freshman Discrepancy Score
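The Python sketch below, using invented school-level means, mirrors the arithmetic of these definitions: expected CLA scores come from a regression of campus mean CLA scores on campus mean SAT scores, estimated separately for freshmen and seniors. This is one simplified reading of the procedure, offered for illustration only; the operational analyses described by Klein et al. (2008) are more elaborate than this.

from statistics import mean

def fit_line(x, y):
    # Ordinary least-squares slope and intercept: expected CLA given SAT.
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Hypothetical campus means (SAT, CLA) for four schools, freshmen and seniors.
fresh_sat, fresh_cla = [1000, 1100, 1200, 1300], [1020, 1110, 1180, 1290]
senior_sat, senior_cla = [1010, 1090, 1210, 1290], [1130, 1190, 1290, 1400]

fb, fa = fit_line(fresh_sat, fresh_cla)    # freshman regression
sb, sa = fit_line(senior_sat, senior_cla)  # senior regression

school = 0  # first hypothetical campus
freshman_discrepancy = fresh_cla[school] - (fa + fb * fresh_sat[school])
senior_discrepancy = senior_cla[school] - (sa + sb * senior_sat[school])
value_added = senior_discrepancy - freshman_discrepancy
print(round(freshman_discrepancy, 1), round(senior_discrepancy, 1), round(value_added, 1))

A positive result here indicates seniors sitting farther above the expected-performance line than freshmen, consistent with the interpretation of Figure 3.2 given earlier.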


Below, reliabilities are reported for raw scores, discrepancy scores, and value-added scores. This is done for performance-task and analytic-writing raw scores at both the individual and school levels and for discrepancy scores and value-added scores, which are defined only at the school (or program) level.

Performance Task Raw Scores. Reliability data are available for seven performance tasks from spring 2006. The mean and median internal-consistency reliability of raw scores for individual students are .83 and .85, respectively, with a range from one performance task to another of .79–.88 (see Klein et al., 2005, for earlier, similar findings). School-level internal-consistency reliabilities should be higher than individual-level reliabilities because average scores are typically more stable than individual scores (Klein et al., 2007). The mean and median internal consistencies for school-level performance-task scores are .90 and .91, respectively, with a range of .81–.93. The total performance-task score (aggregating over tasks as matrix sampled) internal consistency was .85 at the individual level and .93 at the school level.

Performance-task scores are based on raters’ analytic evaluations of students’ responses. A different type of reliability coefficient, the inter-rater reliability coefficient, is used to index the consistency of raters’ ratings of student performance-task responses. It reflects the extent to which judges order students’ performances from low to high consistently. In spring 2006, the mean and median inter-rater reliabilities for a single rater—the correlation between two raters’ scores for a sample of students—were .79 and .81, respectively, with a range of .67–.84. In fall 2007, the mean and median inter-rater reliabilities were .86 and .86, respectively, with a range of .82–.98 (see Klein et al., 2005, for similar findings).

Critical Writing Raw Scores. For four Make an Argument prompts, the mean and median internal-consistency reliabilities for individual-level raw scores in fall 2007 were .94 and .95, respectively, with a range of .93–.95. The corresponding reliabilities for school-level scores were .97 and .97, with a range of .97–.98. With respect to four Critique an Argument prompts given in fall 2007, the mean and median internal consistencies for individual-level scores were .70 and .71, respectively, with a range of .68–.72. At the school level, mean and median reliabilities were .84 and .84, respectively, with a range of .84–.84. These reliabilities are somewhat lower than other measures reported here but certainly acceptable at the school level.

Critical-writing raw scores, like performance-task raw scores, are based on raters’ evaluations of students’ responses.


However, for critical-writing prompts, students’ performance might be rated by a human or a machine. Klein et al. (2007; see also Klein et al., 2005) reported inter-rater reliabilities for a single rater based on scores from two human raters to range from .80 to .85, while the human-machine inter-rater reliability was .78, based on data from 2005.

Discrepancy and Value-Added Scores. Reliabilities for discrepancy and value-added scores are expected to be lower than those for raw scores (Klein et al., 2007; Klein et al., 2008). That is because measurement error is compounded by having two measurements involved: SAT and CLA. This has led some (e.g., Banta & Pike, 2007; Kuh, 2006) to conjecture that CLA total value-added scores were unreliable. It turns out that this is not the case. Klein et al. (2007) reported discrepancy-score reliabilities for freshmen and seniors to be .77 and .70, compared to total raw score reliabilities of .94 and .86. As the number of students at a campus (the sample size) increases, so does the reliability of these scores.

If discrepancy-score reliabilities are expected to be low, the difference between two such scores should be lower still. To see if this were so, Klein et al. (2007) estimated value-added score reliability to be .63. Contrary to what might be expected, this is a strong indication of consistency given the complexities of the value-added score. Again, as sample size increases, so does the reliability of value-added scores.

This said, the CLA’s value-added approach is a pragmatic solution to a difficult real-world problem; over time it will inevitably be revised as better methods become available. To see its limitations, consider the “ideal” way of estimating value added, in which the same cohort of students is followed from freshman to senior year—a longitudinal design. In this case, the CLA’s value-added approach, adjusting for that cohort’s mean SAT score upon matriculation to the college, works well. But longitudinal studies are expensive and difficult to carry out with the churning of students in and out of a college. Moreover, it takes four years to get an estimate of value added. Consequently, most campuses opt for a cross-sectional design. This design collects SAT and CLA scores for freshmen in the fall and for the senior class in the spring of an academic year (e.g., fall 2006 and spring 2007). The design uses the freshman SAT-CLA scores as a proxy for what the seniors’ scores would have been when the seniors were freshmen four years prior. But the “surviving” seniors are not the same as the entering class four years previously; not all students in the freshman cohort have become seniors. That means that both the freshman and senior discrepancy (“residual”) scores need to be estimated to control for differences in SAT scores over time.


And that leaves room for doubt as to whether the adjustment is proper. If the adjustment is not proper, interpretation of value-added scores in one year is tricky for a campus, and change over years is even trickier. Bottom line: Multiple indicators are needed to make informed decisions about areas in need of improvement.

Summary of Reliability Evidence. Fairly extensive evidence suggests that CLA raw scores are adequately reliable, especially for reporting school-level performance. Moreover, both discrepancy scores and value-added scores are, perhaps, unexpectedly adequate, based on the magnitude of measurement error that they might introduce.

Validity

Validity refers to the degree to which a proposed interpretation of a measurement is warranted by conceptual and empirical evidence. In the case of the CLA, validity depends on the evidence that supports its claim to measure analytic reasoning, critical thinking, problem solving, and communication. There are a variety of ways a validity argument can be built. One way is to argue that the tasks on the CLA are representative of real-world tasks drawn from a variety of life situations. Often, expert judges are used to evaluate this representativeness claim. Another way to establish validity is to show that scores on the CLA correlate with other measures as expected. For example, a positive correlation between the CLA and a measure of, say, critical thinking would provide such evidence. A third way is to show that the CLA predicts future performance of experts and novices, or life outcomes, perhaps through correlations with grade-point averages. And a fourth way is to establish that the kind of thinking expected—analytic reasoning, problem solving, for example—is actually demanded when students perform CLA tasks. This is typically done via a “think aloud” method, in which students verbalize their thoughts as they work through a task.

A measurement is never “validated.” That is, validation is an ongoing process of building evidence—confirmatory and disconfirmatory—that leads to changes in the measurement, in the conceptual underpinnings, or in both. The CLA, being a new instrument, is in the beginning stages of validation, as we will see. Much progress has been made, but more remains to be done.

Content Representativeness. The CLA claims to contain real-world, holistic tasks sampled from domains such as education, science, health, environment, art, and work. Of course, these are not the real-world tasks themselves but simulations of such tasks.


The question, then, is to what extent students and faculty view the tasks in this way and believe that the capacity to perform the tasks is valuable—what a college education is supposed to prepare students for. To a small degree, data exist to address the question of representativeness, but only for the performance tasks, not the critical writing tasks.

Faculty Perceptions of Performance Tasks. In a study designed to set performance levels on the CLA’s performance tasks, Hardison and Valamovska (2008) collected faculty members’ perceptions of the tasks. These data are important because the forty-one faculty members were selected to be widely representative of faculty across the country, regionally, by public or private college, by academic field, and by rank. Moreover, the faculty members became intimately familiar with the CLA performance tasks through extensive review of the tasks themselves and extensive reading and discussion of student responses to the tasks. More specifically, faculty responded to a questionnaire on a five-point Likert-type scale (1 = strongly disagree . . . 5 = strongly agree), with items tapping whether the following occurred:
• An important educational construct was measured.
• What is measured on the CLA is taught in college courses.
• Performance tasks measured what they were intended to measure (critical thinking, etc.).
• Performance on the test would predict important life outcomes.
• Training students on the tasks would help them get ahead in life.
• Known groups would perform better on the tasks (e.g., professors would be expected to perform better than dropouts on the CLA).

These faculty seemed to be in a position to judge issues of importance, overlap with courses taught in college, whether the tasks measured analytic reasoning (etc.), and perhaps differences between known groups. However, it is a stretch to believe that they could predict the future. Nevertheless, for completeness, those findings are reported along with the others. In general, the lowest mean might be expected for the scale tapping the overlap between courses taught and the CLA. College courses tend to focus more on knowledge in the subject being taught and less on broad reasoning abilities (Figure 2.1).

The results are shown in Figure 4.3. Consistent with expectation, the lowest mean rating was given to the overlap between CLA performance tasks and what is taught in college courses. With respect to whether the performance tasks measure an important educational outcome and whether they measure what they are supposed to measure, the faculty agreed or strongly agreed.


Figure 4.3 Faculty perceptions of the CLA’s performance tasks (mean ratings, on a 1–5 scale, for the six questionnaire items: Important, Taught, Intended, Predict Life, Get Ahead, Known Groups). Source: R. Shavelson; www.cae.org/content/pdf/CLA.in.Context.pdf.

As for predicting the future, the faculty agreed that the CLA would do so, although that is a far conjecture. Finally, faculty agreed that the performance tasks would distinguish known groups, but that, too, is as much conjecture as experience. The evidence, such as it is, suggests that faculty who have studied the performance tasks and read a substantial number of student papers varying in quality viewed the CLA performance tasks as reflecting important educational outcomes, measuring what they were intended to measure, and distinguishing “experts” from “novices.” They also felt that these tasks were somewhat different from what was taught in college courses. And finally, they viewed the tasks as predictive of life outcomes and getting ahead if taught.

Student Perceptions of Performance Tasks. The CLA regularly collects students’ perceptions of its performance tasks (e.g., Klein et al., 2005). The most recent data available are for freshmen in fall of 2006 and seniors in spring of 2007. They were asked to evaluate the tasks on a set of eight items, six of which are pertinent to content representativeness. Unfortunately, the Likert-type scales associated with each item on the questionnaire differ from one another in number of scale points, and so a succinct summary of findings like that in Figure 4.3 is not possible. The questions are paraphrased and the mean (standard deviation) response provided for freshmen and seniors in Table 4.4.

Freshmen and seniors agree that the CLA performance tasks are “mostly different” from those encountered in their classes (Question 1). This is what might be expected if the CLA were measuring broad ability to perform holistic, real-world tasks. Just as the faculty did, students saw the differences between CLA and classroom tasks; this difference perhaps reflects broad reasoning ability and knowledge in the major. Moreover, they considered the tasks to be good at tapping their ability to analyze and communicate (Question 3).


Table 4.4. Students’ Mean (Standard Deviation) Perceptions of CLA Performance Tasks

1. How similar are the CLA tasks to those you do in college (1 = Very Different . . . 4 = Very Similar)?
   Freshmen: 2.08 (.87); Seniors: 1.96 (.87)
2. How interesting was the task compared to course assignments and exams (1 = Boring . . . 5 = Far More Interesting)?
   Freshmen: 2.98 (.99); Seniors: 2.93 (1.01)
3. How good are the CLA tasks at measuring ability to analyze and present a coherent argument (1 = Very Poor . . . 5 = Very Good)?
   Freshmen: 3.59 (1.05); Seniors: 3.61 (.97)
4. How difficult was the task compared to your college exams (1 = Much Easier . . . 5 = Much Harder)?
   Freshmen: 2.74 (.92); Seniors: 2.47 (.92)
5. Do you agree that more professors should use tasks like this one in their courses (1 = Strongly Disagree . . . 5 = Strongly Agree)?
   Freshmen: 2.91 (1.08); Seniors: 2.99 (1.09)
6. What is your overall evaluation of the quality of this task (1 = Terrible . . . 4 = Fair . . . 7 = Excellent)?
   Freshmen: 4.86 (1.10); Seniors: 4.86 (1.11)

source: R. Shavelson.

Both freshmen and seniors viewed the performance tasks to be about as interesting as college tasks (Question 2) and at about the same level of difficulty as college tasks (Question 4). They were neutral about having more professors use these tasks (Question 5) but rated the overall quality of the tasks as “fair” to “good” (Question 6). Perhaps the most important evidence for the validity of CLA task-score interpretation is the finding that students say the tasks are different from what they encounter (Question 1), and that the tasks tapped their ability to analyze and communicate (Question 3). This is just what the CLA says about its tasks. Students are neutral about having more such tasks in courses and view the tasks as about as interesting and challenging as those encountered in their courses.

Relationship of CLA Scores to Related Measures. Another way to examine the interpretative validity of CLA scores is to see whether they “behave” as might be expected. For example, since both the CLA and the SAT measure broad abilities, the latter measuring broader abilities than the former (see Figure 2.1), CLA scores and SAT scores should be positively correlated with one another. Also, since CLA tasks tap critical thinking (in part), scores on these tasks should be positively correlated with scores on other critical-thinking measures. Moreover, science majors would be expected to perform slightly higher on science-like CLA tasks than humanities or social science majors would, and vice versa. Even though all CLA tasks tap broad reasoning (etc.) abilities, some special domain knowledge might help, at least in comprehending the task presented (see Figure 2.1).


Males and females might be expected to perform similarly, but the majority-minority gap might be found on CLA tasks.

Correlation with SAT. The correlation between SAT scores and CLA performance and writing scores should be positive and of moderate magnitude, as both tap into cognitive abilities, although the SAT score taps verbal and quantitative aptitudes and the CLA tasks tap broad domain abilities more closely tied to education (more “crystallized abilities” than the SAT). The correlations between the SAT and CLA for seniors in 2006 and 2007 (N ~ 4,000), for example, are as follows: performance task—.55 and .57, respectively; analytic writing—.57 and .50, respectively. The freshman correlations are of similar magnitude (Klein et al., 2007). So the CLA is “behaving” as expected.

The SAT-CLA correlation at the school level, however, is considerably higher, on the order of .88 for freshmen and seniors on CLA total score (available at the school level and not the individual level due to matrix sampling); .91 and .88, respectively, for the performance task; and .79 and .83, respectively, for the writing task (Klein et al., 2007). The higher correlations at the school level than at the individual level arise because school mean scores are more reliable than individual scores, and there are systematic differences between campuses on both the SAT and the CLA. The high correlation at the school level does not mean that the SAT and CLA measure the same thing, as some believe (Klein et al., 2007). Rather, the two measures share about 60 percent to 80 percent of their variance at this level, leaving room for college effects. Such effects are reflected, in part, by seniors at all SAT levels scoring higher than freshmen across campuses (Figure 3.2).

Moreover, the CLA and SAT measure different things. As a thought experiment, imagine coaching students on the CLA and the SAT. The coaching would take very different forms, because the two assessments measure different things and require somewhat different thinking processes. Incidentally, there is about a .91 correlation between LSAT and bar exam scores at the school level. The rank ordering of school means on the LSAT corresponds almost perfectly with the differences in bar exam passing rates among law schools. Does this mean that the LSAT and the bar exam are measuring the same knowledge and skills? Hardly (see Klein, 2002b).
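The “share about 60 percent to 80 percent of their variance” figure follows from squaring the school-level correlations just reported; the snippet below is simply that arithmetic, included as a quick check rather than a reanalysis of any CLA data.

# Proportion of variance shared by two measures = square of their correlation.
for r in (0.79, 0.83, 0.88, 0.91):   # school-level SAT-CLA correlations cited above
    print(r, "->", round(r * r, 2))  # 0.62, 0.69, 0.77, 0.83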


one major to another, and the fact that seniors’ GPA is based on a diverse set of courses, the magnitude of the correlation should be fairly low—say, about .35 (Sackett, Borneman & Connelly, 2008). This is typically the range for SAT–freshman GPA correlations within campuses. And this is what is found with the CLA for seniors in 2007. The CLA-GPA correlation for performance tasks was .28, for Make an Argument .23, and for Critique an Argument .25. The direction and magnitude of these correlations did not change when they were computed separately by students’ major area of study. Note that these values are the average within-school correlations for the nonrandom sample of students who elected to participate in the CLA at their campuses. Correlation with Critical-Thinking Measures. If the CLA taps important aspects of critical thinking, CLA scores should correlate positively and moderately with other measures of critical thinking, such as the Watson-Glaser Critical Thinking Appraisal. The relationships among CLA, MAPP, and CAAP scores, along with specific measures of critical thinking, are currently being studied; but results are not yet available. Correlation Between Academic Domain and Task Type. The CLA taps broad cognitive abilities developed in humanities, social science, and science domains. This leads to the conjecture that science and engineering majors might do better on performance tasks based on science and engineering scenarios, humanities majors better on humanities scenarios, and social science majors better on social science tasks. A counter conjecture would be that while these tasks vary, they all tap basically the same cognitive abilities. Moreover, students take courses in all three domains. Any differences, especially after adjusting for differences in SAT scores between majors, should be very small at most. It turns out that differences do exist across the academic domains, both before and after adjusting for differences in SAT scores. In Figure 4.4, seniors’ SAT-adjusted mean CLA scores in 2007 are presented for three types of performance tasks (science-engineering, social science, and humanities) and four academic-major groupings (science-engineering, social science, humanities, and “other,” including business and service majors). The mean scores for the academic groupings after SAT adjustment are science-engineering (n = 855), 1,178; social science (n = 788), 1,204; humanities (n = 641), 1,199; and other (n = 2,036), 1,168. These mean differences are statistically significant: “Other” performs, on average, below the remaining groupings; science scores do not differ significantly from those of either the humanities or


[Figure 4.4 charts SAT-adjusted mean CLA score (roughly 1,100 to 1,260) by performance task type (science-engineering, social science, humanities), with separate lines for academic domain: science-engineering, social science, humanities, and other.]
Figure 4.4 Relationship between academic domain and performance task type (SAT-adjusted scores: seniors 2007). Source: R. Shavelson.

the social sciences, although the social science scores are higher than the humanities scores. The interaction of task type and academic domain bears directly on the competing conjectures: Is there or isn’t there a relationship between academic domain and task type? First, and perhaps surprisingly, students majoring in the social science domain scored, on average, higher than students in other domains across all three task types (Figure 4.4). However, the means for social science and science-engineering students on the science-engineering task type are quite close (1,197 and 1,189, respectively), as are the means for social science and humanities students on the humanities tasks (1,239 and 1,226, respectively). Across the board, the “other” grouping fell considerably lower than the rest. There does, then, seem to be a small (about 1 percent of variance) relationship between academic domain and task type, but the high performance of the social science students across domains muddies the water a bit. Correlation of Performance Task Scores with Gender and Minority Status. It is important to ensure that measures of learning do not contain bias or have an adverse impact on various groups of students. Consequently, attention focuses, for example, on the performance of men and women and of majority and minority students. No statistically significant relationship was found between gender and mean unadjusted performance task scores. However, when an adjustment is


made for SAT, women scored, on average, .30 standard deviations higher than men. Moreover, white students scored about .50 standard deviations higher than nonwhite students (a smaller difference than what is typically observed on other cognitive tests) before covariate adjustment. However, adjusting for SAT, this mean difference is not statistically significant (p < .071). Cognitive Demands. Finally, if CLA tasks tap students’ reasoning, problem solving, and critical thinking skills, having students think aloud while performing these tasks should reveal the degree to which the tasks are having their intended impact on thinking (e.g., Taylor & Dionne, 2000). Unfortunately, such “cognitive validity” data have not yet been collected for the CLA. Summary of Validity Evidence. Validating test score interpretations is an ongoing process. This is especially true of the CLA, as it has only recently been developed. Much remains to be done. That said, the evidence that does exist supports the proposed interpretation of CLA scores. The tasks on the CLA, according to faculty and students, do vary from typical tasks found in college courses. Moreover, the CLA scores correlate with SAT scores, as would be expected. And CLA scores tend to be sensitive to students’ academic domain and the type of task presented (science-engineering, social science, humanities), with social science students scoring, on average, higher than students in the other domains (adjusting for SAT). Finally, the white-minority gap disappears once CLA scores are adjusted for SAT; a gender gap appears (women scoring higher than men) once SAT is taken into account.
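Several of the comparisons above (by academic domain, gender, and race/ethnicity) rest on covariate adjustment: group differences in CLA scores are re-expressed after removing the part predictable from SAT. The following is a minimal sketch of that kind of adjustment, using synthetic data and made-up coefficients rather than the CLA’s actual models; it is meant only to show the arithmetic of an “SAT-adjusted” gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: two groups whose SAT distributions
# differ, with CLA scores driven largely by SAT and no built-in group effect.
n = 2000
group = rng.integers(0, 2, size=n)                     # 0/1 group indicator
sat = 1000 + 100 * group + rng.normal(0, 150, size=n)
cla = 200 + 0.8 * sat + rng.normal(0, 120, size=n)

sd = cla.std(ddof=1)

# Unadjusted gap: raw difference in group means, in CLA standard-deviation units.
unadjusted = (cla[group == 1].mean() - cla[group == 0].mean()) / sd

# SAT-adjusted gap: the coefficient on the group indicator when CLA is
# regressed on SAT and group together (an ANCOVA-style adjustment).
X = np.column_stack([np.ones(n), sat, group])
beta, *_ = np.linalg.lstsq(X, cla, rcond=None)
adjusted = beta[2] / sd

print(f"unadjusted gap: {unadjusted:+.2f} SD; SAT-adjusted gap: {adjusted:+.2f} SD")
```

With these invented numbers the unadjusted gap is sizable simply because the groups differ in SAT, while the adjusted gap is near zero; with real CLA data the adjustment can also move a gap in the other direction, as with the gender difference reported above.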

Reprise

The Collegiate Learning Assessment was developed as a measure of students’ broad abilities—critical thinking, analytic reasoning, problem solving, and communicating. These abilities appear to be the kinds of college outcomes valued by educators, policy makers, and the public. The CLA took an approach that differs from the traditional one of analyzing complex performance into its component psychological abilities and measuring each with, typically, a multiple-choice test. Rather, the CLA adopted a criterion-sampling approach to measure complex performance by sampling holistic, real-world tasks drawn from life situations. The CLA assumes that the whole is greater than the sum of its psychological-component parts. The evidence from both faculty and student encounters with CLA tasks (“content representativeness”) supports this claim so far.


The assessment was developed to send a signal to campuses as to how well their students are performing (Benjamin, 2008). This the CLA does by providing value-added scores and benchmarking a campus’s performance with those of its peers (Klein et al., 2008). The parent organization of the CLA, the Council for Aid to Education, is on record as stating that the intent of the CLA is to provide feedback to campuses for the improvement of teaching and learning and not for high-stakes external comparisons (Benjamin, 2008). CLA’s board recognized the diversity of college student bodies and missions and noted that one size does not fit all. The CLA is a relatively new assessment (with a pedigree dating back to the 1930s), and so information about its reliability and validity is being gathered. At present, although there is fairly extensive and strong evidence of reliability, some validity studies have been done and reported here, some are in progress, and some remain to be begun. This said, validation is a process, not an end; hence, studies need to be done continually to improve the measurement and the construct definition. One way of reprising the technical information about the CLA is to address its more vocal critics (Banta, 2008; Banta & Pike, 2007; Kuh, 2006; Pike, 2008; Shermis, 2008).6 Banta (2008, p. 4) laments shortcomings of the CLA (and other measures of broad abilities, including the MAPP and CAAP), saying, “Dear colleagues, the emperor has no clothes.” She, Pike (2008), and Shermis (2008) enumerate a number of limitations. Evidence from this chapter, and Klein et al. (2007) and Klein et al. (2008), will be brought to bear on each claim. • Tests like the CLA are measures of prior learning, as evidenced “by the near perfect .9 correlation between CLA scores and SAT/ACT scores at the institution level” (Banta, 2008, p. 3). There is no doubt that tests of cognitive ability reflect prior learning or achievement at a given point in time (see Chapter 2). Indeed, prior learning has been found to be the best predictor of future learning. Just how much prior learning is tapped by the CLA is another story. The .9 correlation reflects the SAT-CLA correlation for total scores at the school level. This school-level correlation ranges from .6 to .8 when performance and writing task scores are examined separately. However, the best measure of prior learning is not the school-level correlations that aggregate over students and capture campus-level SAT-CLA relationships but the individual-level correlations between students’ SAT scores and their CLA scores. This correlation was found to be in the .5 realm. Even adjusting for


unreliability, these correlations are not perfect, suggesting that the CLA measures something other than the SAT, which Banta uses as an index of prior learning. • The high correlation between CLA and SAT scores means that there is little room in which to observe college impact on student learning. That is, a correlation of .9 accounts for 80 percent of the total variation in CLA scores. (Recall the square of the correlation coefficient can be interpreted as the percent of variance shared by two measures.) Surely some of the remaining 20 percent, so the argument goes, is captured by demographic differences at campuses, test-taker maturation, motivation and anxiety, and measurement error. To be sure, some of that 20 percent is taken up by such factors. However, when CLA scores are predicted from SAT scores and student demographics, the proportion of variance shared in common stays roughly the same (Klein et al., 2007; Klein et al., 2008). Finally, measurement error cannot take up shared variance, as it is unpredictable, by definition. • A corollary of this reasoning (Pike, 2008) is that the variation among students is large within a campus, and the variation between campuses is small. However, there is ample evidence of substantial variation among campuses’ CLA scores (see Figure 3.2). And campuses with the same mean SAT score vary considerably in the level of their students’ mean performance on the CLA. • There is inadequate evidence of the technical quality of CLA scores—retest reliability is missing, construct and content validity studies are sparse to nonexistent, and so on (Banta, 2008; Pike, 2008; Shermis, 2008). As described in this chapter, extensive reliability information has been reported for the CLA, and it appears to be adequate. True, retest reliability has not been reported, but what would that look like? Traditionally, to find retest reliability, the same test is given on two separate occasions about two weeks apart, assuming no intervening learning has occurred. Such information’s value is far less than the cost of collecting such data with the CLA, for two reasons. The first is the high cost (financially, motivationally, logistically) of retesting within a short time period. The second is that, except in longitudinal applications of the CLA (which have been few),7 retest reliability is much less relevant than internal consistency and inter-rater reliabilities that speak to the quality of scores at a particular point in time (freshman and senior years with the CLA). Moreover, as pointed out previously, there is evidence about the content representativeness of CLA tasks and of the CLA’s construct validity in the form of correlations with other measures. This


said, a great deal of work needs to be done in collecting additional evidence regarding correlational validity (e.g., correlation with other measures of critical thinking) and “cognitive” validity, making sure the CLA tasks evoke the kind of thinking they are intended to evoke (critical thinking, problem solving, etc.). • Value-added scores are unreliable and to be mistrusted. “I also confess to a great deal of skepticism about the wisdom of attempting to measure value added,” states Pike (2008, p. 9). Banta (2008, p. 4) tells readers that “the reliability of value-added measures is about .1, just slightly better than chance.” To be sure, there is room for skepticism about value-added scores; from a measurement perspective they are prone to errors and misinterpretation. Also, there are different methods available for measuring value added, and each method might paint a somewhat different picture. However, the evidence summarized in this chapter shows that both discrepancy scores (the discrepancy between expected and observed scores for seniors across campuses, for example) and value-added scores (the difference between senior discrepancy scores and freshman discrepancy scores) were reasonably reliable, the former around .70–.75 and the latter around .63 (see Klein et al., 2007; Klein et al., 2008). The CLA uses discrepancy and value-added scores because simply comparing campuses’ raw scores would be misleading due to the great variability in the ability of these campuses’ entering freshmen. Discrepancy and value-added scores attempt to level the playing field and provide benchmarks for campuses by which to judge their performance (a simple numerical sketch of these scores follows this list). • No tasks are content-free, so differential performance on tasks is to be expected, depending on a student’s academic preparation. As we saw, there is a slight relationship between academic domain and performance on CLA tasks. But this was very small. Moreover, since the CLA focuses on campus-level (or program-within-campus–level) performance with matrix sampling, randomly assigning students to tasks, any relationship between academic preparation and task type is balanced out. • A corollary of this reasoning is that there is no course on college campuses in which students would learn the broad abilities assessed by the CLA. Shermis (2008, p. 10) asks whether CLA-measured competencies are “something that would likely be an outcome of a general education course? If so, which one? English? Math? Introductory psych?” These questions are revealing. In the CLA view, the goal is to transcend “course” fixes and speak of an integrated general or liberal education that builds over the college years toward these


competencies. Shermis is right: No single course can do the trick, and that is just the message the CLA intends to send. • Students are not motivated to take the CLA, and, consequently, their observed performances are not reflective of their true performances. Without doubt, motivation is an issue for all testing, not just the CLA. We know that motivation is high on college and graduate-school entrance examinations and on certification examinations. These are high-stakes tests for students’ futures. Where the stakes are low for the test taker, as with the CLA, motivation is an issue. To address this issue (and to get an adequate sample) campuses vary in the incentives they do or do not provide students, and this might account for between-campus differences. Klein et al. (2007) have studied, correlationally, the relationship of various incentives and no incentives with campus CLA scores. They found no systematic relationship (Klein et al., 2007). Moreover, CAE believes that assessments of learning—the CLA and campus measures—need to become an integral part of college students’ education. Once students see the benefit of having information about their ability to reason analytically, solve problems, and communicate clearly, that becomes a source of motivation (see the CLA’s frequently asked technical questions 2007–2008, www.cae.org/content/pdf/CLA.Facts.n.Fantasies.pdf). As will be seen in Chapter 5, some campuses have achieved this, but it is rare at present. • There is no urgent need to compare institutions. Homemade assessments are to be preferred because they are more likely to be closely linked to a campus’s curriculum than a test designed to assess “a generic curriculum” (Shermis, 2008, p. 12). While there is clearly a role for campus assessment programs in the improvement of learning (see Chapter 5), there is also a need for benchmarking (Benjamin, 2008). Campus-grown assessments cannot tell administrators, faculty, students, and the public whether the campus is doing as well as it might do in fostering student learning. As Graff and Birkenstein (2008) point out, It is simply not true, as the antistandardization argument has it, that colleges are so diverse that they share no common standards. Just because two people, for example, don’t share an interest in baseball or cooking, it does not follow that they don’t have other things in common—or that, just because several colleges have different types of faculties or serve different student populations, they can share no common pedagogical goals. A marketing instructor at a community


college, a biblical studies instructor at a church-affiliated college, and a feminist literature instructor at an Ivy League research university would presumably differ radically in their disciplinary expertise, their intellectual outlooks, and the students they teach, but it would be surprising if there were not a great deal of common ground in what they regard as acceptable college-level work. At the end of the day, these instructors would probably agree—or should agree—that college-educated students, regardless of their background or major, should be critical thinkers, meaning that, at a minimum, they should be able to read a college-level text, offer a pertinent summary of its central claim, and make a relevant response, whether by agreeing with it, complicating its claims, or offering a critique. Furthermore, though these instructors might expect students at different institutions to carry out these skills with varying degrees of sophistication, they would still probably agree that any institution that persisted in graduating large numbers of students deficient in these basic critical-thinking skills should be asked to figure out how to do its job better.
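To make the discrepancy and value-added calculations described earlier in this list concrete, here is a minimal numerical sketch. The campus names, SAT means, and CLA means are invented, and the expected-score equation is a simple linear fit across campuses rather than the CLA’s actual models; the sketch shows only the core arithmetic: a campus’s discrepancy score is its observed mean minus the mean expected from its students’ SAT, and its value-added score is the senior discrepancy minus the freshman discrepancy.

```python
import numpy as np

# Invented campus-level means for illustration (not real CLA data).
campuses = ["A", "B", "C", "D", "E"]
sat_mean = np.array([1010.0, 1080.0, 1150.0, 1220.0, 1290.0])
cla_freshman = np.array([1015.0, 1075.0, 1160.0, 1205.0, 1295.0])
cla_senior = np.array([1120.0, 1150.0, 1270.0, 1290.0, 1385.0])

def discrepancy(observed, sat):
    """Observed campus mean minus the mean expected from SAT (linear fit across campuses)."""
    slope, intercept = np.polyfit(sat, observed, 1)
    expected = intercept + slope * sat
    return observed - expected

# Value added: senior discrepancy minus freshman discrepancy, campus by campus.
value_added = discrepancy(cla_senior, sat_mean) - discrepancy(cla_freshman, sat_mean)

for name, va in zip(campuses, value_added):
    print(f"campus {name}: value added = {va:+.1f} CLA points relative to expectation")
```

The operational CLA computations are more elaborate (the expectations are estimated across the full set of participating campuses), but the benchmarking logic is this same subtraction of expected from observed performance.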

The CLA, then, represents a different approach to the assessment of student learning than other such measures (e.g., CAAP, MAPP). Moreover, it measures students’ critical thinking, analytic reasoning, problem solving, and communication competencies in a realistic, holistic manner. And it does so reliably and validly. Not surprisingly, I believe it is the best alternative for measuring undergraduates’ learning of broad abilities.

5

Exemplary Campus Learning Assessment Programs

IMPROVEMENT OF TEACHING AND LEARNING in our colleges will proceed only so far with summative assessment of student learning, the focus of the previous chapter. Such assessment signals the need for improvement overall and perhaps in some specific areas. It also may provide formative information when used to guide instruction in class (Benjamin, Chun & Shavelson, 2007). At worst, however, these assessments punish colleges that are not meeting expectations without providing adequate information on what to improve and how to improve it. For substantive, not symbolic, responses to accountability to pay off, campuses need in-depth, context-sensitive diagnostic information about student learning. Such information cannot be provided by external assessments alone. External assessment, then, needs to be supplemented with closely aligned internal assessments of students’ learning and with an analysis of organizational structures and processes that afford or constrain students’ learning. In a somewhat overused phrase, colleges and universities need to become “learning organizations.” Moreover, they need to recognize that “doing good” is not enough. Their goals should be such that the proverbial bar is raised higher and higher in response to their own prior performance and their peer institutions’ performance. And to do so, they need both external assessments of learning that provide benchmarks for judging how well they are doing (Benjamin, 2008) and internal measures to diagnose where improvement is needed (Shavelson, 2008a,b). Finally, once problems are diagnosed, campuses need to adopt a spirit of experimentation to judge which alternative solutions are effective. In a word, alignment of external assessment with internal assessment is essential for a campus to learn and grow productively.


Campuses, then, need to develop internal, formative assessments for generating context-sensitive diagnostic information for improvement. Is this possible beyond a symbolic response? Are there examples—existence proofs and models—to guide those institutions seeking to respond substantively to improving teaching and learning? If so, what do these campus assessment programs look like? How did they start? What keeps them going? Have they had the impact they intended? What challenges have they faced? These questions are posed with the recognition that a great deal has been written about campus assessment of learning and that literature is readily accessible (e.g., Banta & Associates, 2002; Peterson, Vaughn & Perorazio, 2001). Here we look closely at six campus assessment programs that, within the past ten years, have been widely recognized by peers as “successful” and “exemplary” at one time or another (recently, the Council for Higher Education Accreditation has given awards to exemplary programs; see Eaton, 2008). These are not “representative” in some statistical sense, they are not unanimously acclaimed, and they will not necessarily be at the top of their game by the time you read this chapter. Nevertheless, they were selected to give a sense of the variation in approaches campuses have taken to assessing student learning and to improving it. The goal is to identify a range of campus assessment practices that might be adopted and adapted by other campuses seeking to assess learning and experiment with the improvement of teaching and learning. This chapter begins by describing two benchmark campus learning assessment programs that reflect the variability in possible approaches. It then focuses on the findings from a case study of campuses recognized by the field as exemplary, examining their inception, philosophy, operation, and impact. It closes by drawing conclusions for the design of campus learning assessment programs.

Benchmark Campus Learning Assessment Programs

Over the past twenty-five years, two campus assessment-of-learning programs have arguably stood out as exemplary, serving as benchmarks for other campuses. To be sure, they have evolved over time, but their distinctive features remain. Perhaps somewhat surprisingly, they are as different as they are similar in many important respects. While both programs sprang from visionary leaders, one focused on individual student development of holistic, problem-focused, real-world critical thinking and social responsibility abilities and skills. The other focused on campus-level improvement of underlying, general psychological abilities and skills. While both campuses—Alverno College and Truman


State University—view assessment of learning as part of their mission, they have varied as to how much they integrated that assessment into their teaching and learning processes. The goal here is to characterize the similarities and differences between the two assessment programs, along a set of dimensions that can then be applied to the campuses in the case study that follows. Although Alverno and Truman have been identified over the past ten years, at one time or another, as exemplars by peers and experts in the learning assessment community, the potential for rhetoric about accomplishments even from these two campuses (let alone the campuses in the case study below) sometimes exceeds reality. And although both have been widely recognized, they have their critics, for different reasons.

Alverno College

Alverno College has a four-year, undergraduate liberal arts program for women and coeducational graduate programs. The college is dedicated to the student—her learning, personal and professional development, and service to the community. Located in Milwaukee, the nineteenth-largest city in the United States, Alverno serves about 2,500 students and offers more than sixty undergraduate programs (majors, minors, and associate degrees) in four schools: School of Arts and Sciences, School of Business, School of Education, and School of Nursing. Alverno believes that education means “being able to do what one knows” (Loacker & Mentkowski, 1993, p. 7). Since 1973, students graduate only if they have demonstrated an appropriate level of performance on eight abilities: (1) communication, (2) analysis, (3) problem solving, (4) valuing in decision making, (5) social interaction, (6) developing a global perspective, (7) effective citizenship, and (8) aesthetic engagement. To assess the level of student performance, Alverno developed an extraordinary program of performance assessment. The program was initiated in response to concerns about the quality of its academic programs raised during accreditation about thirty-five years ago. The then president, Sister Joel Read, challenged each department to identify important questions being raised in its discipline and then to decide on the critical concepts that should be taught and the most appropriate methods for teaching them. This exercise led to a key question that drove the curricular reform and the assessment process: “What kind of person were we [Alverno faculty] as educators seeking to develop?” (Loacker & Mentkowski, 1993, p. 6). This question triggered the definition of the outcomes, characteristics, and abilities that were expected from the students as a result of their education at the college.


The basic notion that emerged at Alverno was that assessment of these learning outcomes should incorporate samples of the performances the college seeks to prepare students for. Consequently, Alverno built a performance assessment system. The system assessed students’ performance, in realistic tasks and contexts, on the specific abilities the college considered essential learning outcomes. Students were expected to demonstrate competence within a range of situations (assessment tasks) in which they might find themselves (Loacker & Mentkowski, 1993), such as giving a speech, writing a business plan, or designing a scientific investigation. This program embraced a criterion-sampling philosophy based on McClelland’s (1973) approach to the measurement of competence. For McClelland (1973, p. 7), learning assessment tasks should be samples of criterion situations: “If you want to test who will be a good policeman, go find out what a policeman does. Follow him around, make a list of his activities, and sample from that list in screening applicants.” Assessment at Alverno is considered an integrating, developmental experience. Its main purpose is to support students in developing their own strengths on each of the learning outcomes. The assessment process is integrated into curriculum and teaching to enhance students’ developmental experiences. Supporting individual student development, then, is the core of the Alverno system, a system in which both faculty and administrators take responsibility for their roles in student development. Moreover, the assessment program is built to measure developmental trajectories, a concept that Alverno has used from the program’s inception, and one that has been put in the spotlight by the National Research Council (2001). Indeed, in the early 1970s, Alverno conceived the development of the eight abilities as successive and increasingly sophisticated as students moved through their studies; for example: “To meet general education requirements, the student will show analytical skills at the four basic levels: observing, making inferences, making relationships, and integrating concepts and frameworks. All these are integrated with the content of her general education courses” (Loacker & Mentkowski, 1993, p. 9). All assessments are developed to provide an opportunity for students to demonstrate one of the eight abilities. Tasks are sampled, and a student’s performance or “criterion behavior” is evaluated. The assessments’ criteria for success are public. Students receive feedback on their performance and on how to improve it. They are encouraged, as well, to assess themselves and their own goals (Loacker & Mentkowski, 1993). The idea is that if students can be taught


to perform on samples of criterion tasks, they have been taught to perform in real-world situations. “Cheating,” in the sense of performing on various samples of criterion tasks, is not problematic. If students can perform well on the assessment, that means they are likely to perform well in a comparable real-world situation. Upon entry to Alverno, for example, students are videotaped as they give a persuasive talk; each subsequent year they give another persuasive talk and are videotaped again. Over a four-year period, then, students’ development on this criterion task is monitored and evaluated. Formative feedback from a review panel, including representatives from the Milwaukee business and government communities, provides for individual development in this criterion situation. At the same time, external participation at Alverno develops critical links with the community and public service. Since the beginning, faculty have sought different strategies to ensure multiple perspectives and data sources on student learning. For example, since the start of the program faculty have kept written portfolios with copies of key performances as a cumulative record of each student’s development. About seven years ago, a digital portfolio was created. The portfolios enable students to follow their learning progress throughout their years of study. Although the assessment program focuses on student learning trajectories on the eight critical abilities, this information is also used to evaluate academic programs and the institution as a whole. In this way, Alverno evolves over time with feedback as to how well it is meeting its goals for student learning in a systematic, rather than an intermittent, way. Alverno has run its assessment program organizationally first through an Office of Research and Evaluation (which later became the Assessment Center). Three years after the assessment program began, the college created the Office of Research and Evaluation and charged it with describing (1) developmental trajectories, (2) models of professional performance, (3) knowledge and skills students should develop, and (4) expectations of what graduates would need (Loacker & Mentkowski, 1993). The Assessment Center now is a department that works closely with students, faculty, staff, and the southeastern Wisconsin business community to provide services related to assessment at Alverno. The assessment program, then, is a coherent system created by the faculty and embedded in a supportive culture. Coherence is achieved by articulating and integrating educational mission, values, assumptions, principles, theory, and practice. Moreover, “it relies on the re-conceptualization of the use of time, academic


structures, and other resources to bring about increasingly effective learning for students” (Loacker & Mentkowski, 1993, p. 20). Perhaps the signature characteristic of the assessment program is that assessment has been tightly integrated into the students’ learning processes, the faculty’s vision and enactment of their teaching and learning, and administrators’ commitment to student development. This “assessment as learning” approach has earned Alverno worldwide recognition (Banta, 2002).

Truman State University

Truman State University, formerly Northeast Missouri State University, is located in Kirksville, Missouri. A four-year, liberal arts university with more than six thousand students, Truman offers forty-five undergraduate and six graduate areas of study in twelve academic divisions, such as science, language and literature, mathematics and computer science, education, and social science. The university seeks to advance knowledge; create an environment for freedom of thought and inquiry; and develop the personal, social, and intellectual growth of its students. Truman State’s widely recognized institutional culture of assessment was spurred in part by the State of Missouri’s approach to higher-education accountability and in part by a visionary president. Missouri was early in creating financial incentives to encourage its colleges and universities to assess and report on student learning. Administrators at Truman State took leadership among the state’s campuses and spearheaded student-learning assessment, capitalizing both on that leadership and on the resources made available. Learning assessment began in the 1972–73 academic year, when President Charles J. McClain invited graduating students to sit for comparative (senior) exams. Early in his administration McClain made clear that the traditional use of inputs for assessing the quality of the institution (e.g., resources, reputation; see Chapter 8) would be replaced by methods focusing on student learning outcomes and value-added models for measuring quality (Cartwright Young & Knight, 1993). He wanted to demonstrate that the university made a difference in students’ knowledge, skills, and attitudes, and that graduates were nationally competitive in their chosen fields. The university referred to its assessment program as value-added, even though the data collected did not always fit a value-added model (Cartwright Young & Knight, 1993). The assessment program typically has tested students in their first, third (at 75 credit hours), and senior years at the university with multiple methods. The assessments provide both indirect and direct outcome measures. In contrast to


Alverno’s use of performance assessments, Truman focuses on surveys, questionnaires, and nationally standardized instruments that measure broad, underlying cognitive abilities (knowledge and broad domain reasoning; see Figure 2.1). The important advantage of using these types of assessments over locally developed assessments is that they provide an external reference for benchmarking student achievement against peer institutions (Magruder & Cartwright Young, 1996a). Different sets of instruments are used, depending on the student’s academic year. For example, freshmen are administered the Cooperative Institutional Research Program (CIRP) survey, which profiles the entering class on field of study, highest degree planned, college choice, ethnic background, and self-ratings of various abilities and skills. In the senior year, students take a majors’ test prior to graduation. Recently, the Collegiate Learning Assessment (CLA) has been administered to a sample of freshmen and seniors, providing a value-added measure. Although seniors are tested in every discipline with externally normed tests (e.g., ETS’s major field tests), their graduation does not depend on test performance. Since 1985, seniors have taken capstone courses that seek to integrate subfields within a major. Many of the courses require that students demonstrate the knowledge and skills that faculty have identified as learning priorities within the major. Faculty in each major, then, determine the content of the capstone. This flexibility acknowledges the faculty’s responsibility for the curriculum. However, it also leads to considerably different capstone experiences across majors. For example, in one major students write a thesis, in another they present papers or projects at an organized forum outside class, and in still another they take a comprehensive exam (Cartwright Young & Knight, 1993). One of the few assessments developed by the university is a student portfolio, a requirement created in 1988 in response to a petition from President McClain for an instrument that could demonstrate students’ achievement and learning. Currently, all students are required to develop a portfolio of their best work, accompanied by a reflective essay, written in the senior year, on their growth in knowledge, skills, and attitudes in college (Kuh, Gonyea & Rodriguez, 2001; Magruder & Cartwright Young, 1996b). Students learn about the portfolio requirement as freshmen, hear more about it periodically during their course of study, and fully develop the portfolio as seniors. A final type of assessment employs interviews. Since 1992 faculty-student teams have conducted interviews to gather information not collected in other surveys on issues such as teaching-learning strategies and learning experiences (Cartwright Young & Magruder, 1996).


Compared with the program at Alverno, Truman’s assessment program focuses less on individual student improvement and more on aggregate measures of performance that reflect the campus’s academic programs. Portfolios, the exception, are used for formative feedback in meetings between advisors and students. Portfolios have been identified as the characteristic that has put Truman’s assessment program back on the map as an exemplar (Kuh, Gonyea & Rodriguez, 2001). Critical to the success of the Truman assessment program was its incremental and low-key manner of implementation. Unlike the way things were done at Alverno, at Truman the president chose not to create a central assessment office. (An Advisory Committee for Assessment was created at the beginning of the 1990s.) The rationale was that such an office would reduce faculty interaction. What was critical for the success of the development and implementation of this assessment program was the extensive role modeling by President McClain and Vice President Darrell Krueger in the use of assessment data at the assessment program’s inception (Cartwright Young, 1996; Cartwright Young & Knight, 1993). They were particularly “adept at suggesting program innovations that increased faculty interaction, conveyed higher expectations for students’ academic development, and heightened students’ involvement in learning” (Cartwright Young & Knight, 1993, p. 29). Assessment became the university’s mechanism for using a common vocabulary and an organizational focus (Cartwright Young & Knight, 1993). Other keys to the success of the program have been the faculty’s role in implementing the program and the type of assessment information provided to them. Faculty-administration conversations grounded in assessment data have been critical for developing assessment-based improvements at the university (Magruder & Cartwright Young, 1996b). Also, faculty have been directly involved in developing specifications for assessing students. The process of determining what learning objectives to assess has benefited faculty, curriculum, and course development (Magruder & Cartwright Young, 1996b). Faculty receive, annually, information on their students’ performance, along with university averages and norms when available. However, they do not receive comparative departmental data (Cartwright Young & Knight, 1993).

Exemplary Learning Assessment Programs

That Alverno and Truman State are so very different but also so highly regarded demonstrates that, not surprisingly, there is no consensus as to the “best” way to assess learning in higher education. But it also raises questions as to what it is


about these programs and perhaps others that has made them archetypes in the field. While Alverno and Truman State are, arguably, benchmarks in the learning assessment community, other institutions have become well known as exemplary, too. In order to learn from these institutions, answers were sought to questions such as how their assessment programs originated; what assessment of learning means on the campus, including its underlying philosophy; how the assessment program was organized and used; and how it impacted teaching and learning. Because there is often a gap between rhetoric and reality in the world of higher-education assessment and accountability (e.g., Newman, 2003), an in-depth case study approach (Yin, 2003) seemed appropriate to develop an understanding of campus assessment programs. Much of the current assessment literature is descriptive and champions innovation and effort more than it analyzes program design and use. The case study reported here collected data from a broad variety of individuals and documents to characterize or profile four campuses’ assessment programs. Assessment program was defined quite broadly—as a college’s or university’s effort to systematically measure undergraduate student learning indirectly (by proxies such as graduation rates, student surveys) or directly (via instruments such as the CLA, MAPP, CAAP, GRE, or certification examinations; see Chapters 3 and 8). Here we provide an overview of the questions that drove the study and describe the campuses that participated. Site selection and methods used for data collection and analysis are described in the appendix at the end of this chapter.

Research Questions

The study sought to understand the origins, philosophy, operation, and impact of exemplary campus assessment-of-learning programs. To this end, it addressed four questions: (1) How did these programs come into being—e.g., were they institutionally initiated, externally mandated, or both? (2) What philosophy underlies the program’s assessment of students’ learning—e.g., performance competence or cognitive ability? (3) How does the assessment program operate—e.g., what structures and policies shape the program? (4) What is the impact of the program, intentionally and unintentionally, on administrators, faculty, and students and on the improvement of teaching and learning? In sum, four dimensions of campus assessment programs were addressed: development, philosophy, operation, and impact.


Case Study Sites

Four campuses participated in the study. Site selection took into account a number of institutional and program characteristics, as well as recommendations from researchers and policy analysts who pointed to the colleges and universities as having a particularly innovative or effective assessment program. Case studies were conducted during the 2003–4 academic year at these institutions. In order to protect their anonymity, pseudonyms are used for the campuses. Each is described briefly here:

• Learning Outcomes University (LOU) is an urban state university committed to outcomes-based education. With an enrollment of about thirty thousand students, the campus offers more than 180 academic programs, from associate degrees to doctoral and professional degrees. This university has been considered a service-learning campus, linking university programs with the community. The campus is noted for graduating a high percentage of professionals (e.g., dentists, nurses, physicians, and social workers) in this state. Its learning assessment program encompasses both general education and the majors, with emphasis on the former.

• Student-Centered Learning University (SCLU) is a small, somewhat rural state university. It is committed to outcomes-based education, both in general education and in academic majors. The campus offers about fifty academic programs, including graduate degrees. Most of the university’s roughly 3,700 students come from segments of society that have been traditionally underserved by the educational system. Its learning assessment program encompasses both general education and the majors, with emphasis on the latter.

• Assessment-Centered University (ACU) is a medium-size public university located in a rural setting. The campus of about seventeen thousand students is also committed to outcomes-based education. This university offers about seventy academic programs, including bachelor’s and graduate degrees. More than 60 percent of the students come from within the state, and most are white. The university’s learning assessment program encompasses both general education and the majors, with emphasis on the former.

• Flexible University (FU) is a large public university located in a suburban area of a large city. It offers about seventy academic programs, including associate, bachelor, graduate, and professional degrees, to about thirty


thousand students. More than 91 percent of the students come from within the state. The learning assessment program encompasses both general education and the majors, with emphasis on the former.

Findings: Comparing and Contrasting Campuses

Here the four campuses’ learning assessment programs are compared and contrasted on four dimensions: development, philosophy, operation, and impact. Within each dimension, specific, concrete evidence portrays a campus.

Development—Impetus

Accreditation served as the common impetus for assessing learning on all four campuses (Table 5.1). Given the popular perception that accreditation has “no teeth” and has not been an effective accountability mechanism, especially for those who seek cross-campus comparative information, this finding might be somewhat surprising. However, in support of the skeptic, the research team found that seeking accreditation might be a necessary, but is certainly not a sufficient, condition for stimulating campus learning assessment. The desire for accreditation combined with a campus vision, especially a vision espoused by the president or chancellor—or combined with state policy incentives—led all four campuses to assess student learning. With respect to vision, the president at ACU believed in data-based evidence on the value the campus adds to student learning and the key role that assessment played: “It’s perhaps a little cliché to say, but it really is true—we were interested in better understanding what value we add before someone told us that we had to do that. . . . So I think it’s important that the learning assessment really take the lead in our efforts, because that is, after all, our primary reason for being” (ACU president). The effect of the accreditation application was especially strong on campuses with professional schools. Specialized accreditation, combined with a certification examination, created a culture of assessment within the particular school. A LOU department chair reported, “A lot of it’s driven by accreditation, but I think it’s also driven by the faculty’s dedication to quality of teacher education. . . . Because the [State] Professional Standards Board mandated that we were going to have a unit assessment plan and have it in place and operating by this past year.” Moreover, state incentives played directly into the ACU president’s vision. The executive director of the assessment office at ACU recalled, “We had a legislative mandate from the state . . . that mandated assessment at all of the


Table 5.1. Cross-Campus Comparison on Dimensions of Development, Philosophy, Operation, and Impact

[Table 5.1 compares the four case-study campuses (Learning Outcomes University, Student-Centered Learning University, Assessment-Centered University, and Flexible University) on the following topics. Development—Impetus: state higher-education policy and funding; accreditation; university leader. Philosophy: processes vs. outcomes; trait vs. criterion sample; focus (student centered, feedback to programs, feedback to students). Operation: chancellor/provost support (faculty hiring policy, promotion and tenure, link to improvement); assessment director (stature, coherent vision, work with faculty); assessment vs. planning office; oversight committees; program-based committees; size vs. relationship with faculty; top-down, bottom-up, or both; feedback to program in place; technical (psychometric) capacity; stages of maturity (age in years, outcomes developed, assessment system developed, feedback systems in place); instrumentation (in-house assessment office, in-house program committee, standardized commercial). Impact—Consequences: faculty burden; faculty improvement around student learning and assessment. The individual cell entries of the table are not reproduced here.]