
Assessing Competence in Medicine and Other Health Professions

Assessing Competence in Medicine and Other Health Professions

Claudio Violato
Professor and Assistant Dean
University of Minnesota Medical School

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2019 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-138-59634-4 (Hardback)
International Standard Book Number-13: 978-1-4987-8508-2 (Paperback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Violato, Claudio, author.
Title: Assessing competence in medicine and other health professions / by Claudio Violato.
Description: Boca Raton, Florida : CRC Press, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2018032562 | ISBN 9781138596344 (hardback : alk. paper) | ISBN 9781498785082 (pbk. : alk. paper) | ISBN 9780429426728 (e-book)
Subjects: | MESH: Clinical Competence | Professional Competence | Educational Measurement | Formative Feedback.
Classification: LCC RA399.A1 | NLM W 21 | DDC 616—dc23
LC record available at https://lccn.loc.gov/2018032562

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Dedication

LeRoy Douglas Travis (October 14, 1940–August 9, 2017)
A dear friend and inspirational teacher

Contents

Preface ix
Author xi

SECTION I  FOUNDATIONS 1
1 Introduction 3
2 Competence and professionalism in medicine and the health professions 35
3 Statistics and test score interpretation 55
4 Correlational-based methods 83

SECTION II  VALIDITY AND RELIABILITY 109
5 Validity I: Logical/Face and content 115
6 Validity II: Correlational-based 139
7 Validity III: Advanced methods for validity 163
8 Reliability I: Classical methods 191
9 Reliability II: Advanced methods 213

SECTION III  TEST CONSTRUCTION AND EVALUATION 231
10 Formats of testing for cognition, affect, and psychomotor skills 235
11 Constructing multiple-choice items 257
12 Constructed response items 285
13 Objective structured clinical exams 301
14 Checklists, rating scales, rubrics, and questionnaires 319
15 Evaluating tests and assessments: Item analyses 341
16 Grading, reporting, and standard setting 365

Glossary of Terms 395
Index 409

Preface

Assessing competence in medicine and other health professions continues to be a challenge. The assessment and evaluation of competence and learning have their origins in antiquity but are only now beginning to emerge as a unified field. This is due to developments in statistical and mathematical theory, test theory, and advances in computer hardware and software, as well as the developing internet. Testing and assessment are the acts of quantifying an educational or psychological dimension or assigning a numerical value to it. Evaluation requires a value judgment to be made based on the measurement.

While the history of testing dates back two millennia, it has emerged on a large scale only in the 20th century, coincidental with mass education. The current status of testing is paradoxical: it is on the increase and receives overwhelming public support, and yet most teachers, instructors, and professors (who construct, administer, and interpret the vast majority of tests) have little or no formal education in testing. Tests can be used to motivate students, enhance learning, provide feedback, evaluate educational programs, and support research. Other functions include curriculum revisions, selection and screening, certification, and guidance and counselling. Tests can be used in a variety of ways and for several evaluation functions.

Testing is likely to expand in modern society. The improvement of healthcare professional competence depends, in part, on the improvement of testing and assessment of medical competence; the alarming rate of medical errors that currently result in death or other negative outcomes may thereby be reduced.

My purpose in writing this book is to provide a resource for teachers, assessors, administrators, and pedagogues generally, as well as students in health professions education, for understanding, implementing, and critically evaluating testing, assessment, and evaluation. This is a comprehensive book on how to assess competence in medicine, medical education, and healthcare throughout the clinician's career. It is organized into 16 chapters in three main sections: (1) Foundations, (2) Validity and reliability, and (3) Test construction and evaluation. Each chapter begins with an advanced organizer and contains summaries, reflections, and exercises. The book also contains a glossary of testing and statistical terms (e.g., reliability, validity, standard error of measurement, biserial correlation) as well as problems and data. Fundamental problems will be presented—e.g., calculating reliability, conducting item analyses, running confirmatory factor analyses; here, appropriate software (e.g., SPSS, Iteman, EQS) and examples of its use are demonstrated.

There are several titles on assessment and evaluation in the health professions, but none is a single-authored textbook. All are edited, multi-authored books. When dealing with a discipline (e.g., biology, assessment), edited multi-authored books are not generally regarded as "textbooks" by professors and students. These edited collections suffer from the problems of such books: the emphasis is skewed (it does not represent the curriculum) by the selection of contributors, and the writing is uneven for the same reason; they lack the "authoritativeness" that a recognized single author lends a textbook. Additionally, they lack the clarity and comprehensiveness (partly because of the multi-authorship) that a textbook in the field requires. Currently, none of the existing books in this area is a textbook in the real sense. Therefore, I have tried to write a usable, balanced, well-written, and authoritative textbook for assessing competence in medicine and other health professions.

Author

Claudio Violato, PhD, is Professor and Assistant Dean, Assessment, Evaluation and Research at the University of Minnesota Medical School. He has taught at and held leadership positions at Wake Forest School of Medicine, the University of Calgary, the University of British Columbia, the University of Victoria, Kwantlen University, and the University of Alberta. Dr. Violato's publications in medical education cover competency-based assessment, psychometrics, research methods, leadership, and clinical reasoning and cognition. In addition to 10 books, Dr. Violato has published more than 300 scientific and technical articles, abstracts, and reports in major journals such as Academic Medicine, Medical Education, British Medical Journal, Canadian Medical Association Journal, and the Lancet. He has received millions of dollars in research funding from various institutions. Some of Dr. Violato's recent honors and awards include the "Outstanding Achievement Award" from the Medical Council of Canada "For Excellence in the Evaluation of Clinical Competence" and the "Innovation Award for the development of the Physician Achievement Review Program" from the Royal College of Physicians and Surgeons of Canada.


Section I

Foundations

The Foundations section consists of four chapters. Chapter 1 introduces the need for testing and assessment; deals with questions of "what are testing, assessment, measurement, and evaluation?"; provides a brief history of testing; describes the common types of tests and assessments; and summarizes the uses of tests. While the history of testing dates back two millennia, it has emerged on a large scale only in the 20th century, coincidental with mass education. The current status of testing is paradoxical: it is on the increase and receives overwhelming public support, and yet most professionals who construct, administer, and interpret the vast majority of tests have little or no formal education in testing. Testing is likely to expand in modern society. The improvement of healthcare professional competence depends, in part, on the improvement of testing and assessment of medical competence; the alarming rate of medical errors that currently result in death or other negative outcomes may thereby be reduced.

In Chapter 2, the somewhat controversial topic of competence in medicine is discussed. This chapter deals with the definition of competence, controversies surrounding it, and its multiple forms. In this chapter, there is a discussion of the various methods of assessing competence over the career span. Competence and professionalism are complex, interrelated, multidimensional constructs commonly based on three primary components: knowledge, skills, and attitudes. Entrustable professional activities (EPAs) are clinical actions that require the use and integration of several competencies and milestones critical to safe and effective clinical performance. To assess the competence of physicians for EPAs requires complex and comprehensive assessments. The present views define medical competence and professionalism as the ability to meet the relationship-centered expectations required to practice medicine competently. For Asian countries, professionalism is a Western concept without a precise equivalent in Asian cultures. Instruments and procedures that are designed to measure competence and professionalism are challenged on validity, reliability, and practicality.


A variety of professional organizations have recently weighed in on the constructs of competence and professionalism. The Canadian Medical Education Directions for Specialists (CanMEDS) framework consists of seven roles: manager, communicator, professional, scholar, expert, health advocate, and collaborator; it has been used as the theoretical underpinning of physician competencies.

Chapter 3 deals with somewhat more technical content, the basic statistics in testing. It includes descriptive statistics, standard scores, and graphical analyses. The understanding of statistics and data analysis allows for a greater appreciation of the differences and similarities between learners and also helps the teacher better organize and interpret test scores and other educational measures. The normal curve is a particularly important distribution, as are skewed, bimodal, and rectangular distributions. A characteristic of many frequency distributions is central tendency: the scores tend to cluster around the center. There are three measures of central tendency or average: the mode, median, and mean. Other descriptions include measures of dispersion, the range, and the standard deviation. Norms and standard scores are an important aspect of test score interpretation and reporting. There are four types of standard scores: z-scores, T-scores, stanines, and percentiles.

Correlational techniques—indispensable methods in testing—are presented, detailed, and discussed in Chapter 4. The correlation is a statistical technique for studying the relationship between and among variables. Several other statistical techniques have evolved based on the original Pearson's r: rank-order correlation, biserial correlation, regression analyses, factor analysis, discriminant analysis, and cluster analysis. The correlation coefficient, r, must always take on values between +1.0 and −1.0 since it has been standardized to fit into this range. The sign indicates the direction of the relationship (i.e., either positive or negative), and the magnitude indicates the strength of the relationship (weak as r approaches 0; strong as r increases and approaches +1.0 or −1.0). The coefficient of determination (r²) is an indicator of the variance accounted for in y by x. The following major correlational techniques are presented: Pearson product–moment, Spearman rank-order, biserial, point biserial, regression, multiple regression (linear, logistic), factor analysis (exploratory factor analyses [EFA], confirmatory factor analyses [CFA]), structural equation modelling, and hierarchical linear modelling.
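As a concrete illustration of these quantities, here is a minimal Python sketch using invented scores for two exams (the book itself demonstrates packages such as SPSS, Iteman, and EQS rather than code); it computes z-scores, T-scores, Pearson's r, and the coefficient of determination:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores for 50 students on two exams (illustrative only).
exam1 = rng.normal(loc=70, scale=8, size=50)
exam2 = 0.6 * exam1 + rng.normal(loc=28, scale=6, size=50)

# Standard scores: z = (x - mean) / SD; T = 50 + 10z.
z_scores = (exam1 - exam1.mean()) / exam1.std(ddof=1)
t_scores = 50 + 10 * z_scores

# Pearson product-moment correlation between the two exams.
r = np.corrcoef(exam1, exam2)[0, 1]
r_squared = r ** 2  # coefficient of determination: variance in exam2 accounted for by exam1

direction = "positive" if r > 0 else "negative"
print(f"r = {r:.2f} ({direction}), r^2 = {r_squared:.2f}")
```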

1 Introduction

ADVANCED ORGANIZERS

[Advanced organizer diagram: evaluation of the competence of the individual healthcare professional in the 20th and 21st centuries; teaching and learning; courses and curriculum; programs and institutions; types of tests (achievement, psychological, skills); patient safety and medical errors; testing and assessment as quantification (numbers or quantities); statistics (basic, correlation); computers (scanning, algorithms); test theory (CTT, G, IRT); 3,500 years of history; regulation and licensing; medical and healthcare testing.]

ASSESSMENT OF MEDICAL AND HEALTHCARE COMPETENCE

The assessment of medical and healthcare competence continues to be one of the most challenging aspects of the education, training, licensing, and regulation of healthcare professionals such as doctors, nurses, dentists, optometrists, and other allied healthcare workers. The report To Err Is Human: Building a Safer Health System,1 of the Institute of Medicine of the National Academy of Sciences in the United States, made some staggering claims: nearly 100,000 people die annually in American hospitals as the result of medical mistakes. Subsequent commentators have suggested that this is an underestimate and that the actual mortality rate is much higher. Some argue that the number of medical mistakes is much higher than is commonly accepted because most of the errors are not reported. A recent report by leading American researchers, employing more detailed and advanced methods than previously, has estimated that a conservative figure for the real death rate due to medical error is more than 250,000 per year. Medical error is the third most common cause of death in the United States after heart disease and cancer but ahead of chronic obstructive pulmonary disease, suicide, firearms, and motor vehicle accidents.2 An international report on adults' healthcare experiences in seven countries (New Zealand, the United Kingdom, the United States, Australia, Canada, Germany, and the Netherlands) indicated that 12%–20% of adults experienced at least one medical error in the 2 years of the study.3 These findings have triggered international discussion, concerns, and controversies about patient injuries in healthcare.

The major factors underlying medical errors are thought to be system-based factors (e.g., miscommunication on the ward) as well as person factors resulting in drug overdoses or interactions, misdiagnoses, surgical mistakes, incorrect medications, and simple carelessness. Patient safety, a topic that had previously been little understood and even less discussed in healthcare systems, has become a public concern in most Western countries. Patient safety has now become a mantra of modern medical practice. Despite this, thousands of people are injured or die from medical errors and adverse events (incapacitation, serious injury, or death) each year. Worldwide, this figure may run into millions. Leaders in healthcare systems have emphasized the need to reduce medical errors as a high priority. Doctors, as main participants, have been called upon to address the underlying systems causes of medical error and harm. Unfortunately, several studies4 have shown that more than half of the hospital doctors surveyed have not even heard of the report To Err Is Human. The magnitude of the problem is thought to be similar in the United Kingdom, Canada, and elsewhere in the world.2

While both system-based factors and person factors are at the root of medical errors, it is now believed that the impact of some person factors has been underestimated: physician carelessness, lack of knowledge, lack of professionalism, physician exhaustion and sleeplessness, and poor self-assessment, particularly of personal limitations in medical skills.5,6 There is concern that the prevailing tendency to put the emphasis on systems while not holding individuals responsible for errors will weaken accountability for physician performance. Failure to identify individual factors may contribute significantly to the risk of adverse events and may shift the focus of patient safety away from the clinician to a purely systems-based approach. The assessment of the competence of individual healthcare professionals therefore looms larger than ever.


Assessment is also commonly known as testing. Testing has its roots in antiquity and has undergone rapid advances in the later part of the 20th and the first part of the 21st century. This is because developments necessary to its emergence—statistical and mathematical theories, advances in test theory, computer technology and optical readers, online testing, and social and political policy—have come only in the last several decades. Currently, the field is undergoing rapid development and change, bringing exciting possibilities and challenges. Before describing the current status of testing and its history, however, we must describe and distinguish testing, assessment, and evaluation.

BOX 1.1: Licensing physicians throughout the ages

Prior to systematic testing as in modern times, the licensing of physicians has nevertheless been regulated. Control of the medical marketplace through licensing, prosecutions, and penalties has a long history and is not unique to modern society. Several cases illustrate the practices of the past several hundred years.

Jacoba Felicie paced nervously in her room, glancing through her notes a final time before she set out for the courthouse.7 Powerful forces, including the Dean and Faculty of Medicine at the University of Paris, were allied against her. This was Paris in 1322, and the physician guilds and university faculty had increased in power and control of the medical marketplace. They were seeking to consolidate their regulation of medical practice. They decided that Jacoba was a particularly good case to prosecute as she was a woman practicing medicine. The Dean and Faculty of Medicine were determined to put a stop to the illegal practice of medicine. For some time, the Parisian faculty had wanted to gain stronger control over various practitioners of medicine such as surgeons, barbers, and empirics, whether male or female. The Dean and the Faculty of Medicine charged Jacoba with illegally visiting the sick, examining their limbs, bodies, urine, and pulse, prescribing drugs, and collecting fees. The Dean and Faculty were most outraged because she actually cured some patients, frequently after conventional physicians had failed to do so. The Dean and Faculty of Medicine who prosecuted her did not deny her skill or even that she cured patients. They argued that Jacoba had not read the proper texts; medicine was a science to be acquired through proper reading of texts such as Galen and through lectures and discourses based on the written word. Medicine was not a craft to be learned empirically.

Jacoba argued in court that the intent of the law was to forbid the practice of medicine by ignorant and dangerous quacks and charlatans, but that this did not apply to her as she was both knowledgeable and skillful. She also argued that she was fulfilling a particular need with female diseases because conventional modesty precluded male practitioners from dealing with these. Many of Jacoba's patients came to court that day to testify to her skill, knowledge, and caring. Jacoba must have been crushed when the court ruled in favor of the Dean and Faculty of Medicine that she was not legally constituted to practice medicine in Paris. She was therefore prohibited from practicing medicine in the future on penalty of imprisonment. The court acknowledged her patients' positive testimonies, but this was not deemed relevant to her legal status as a medical practitioner. Thus the case of Jacoba ended, and we have no further historical record of her. Perhaps she left Paris to practice medicine.

The French medical establishment continued to battle with illegal practitioners of medicine, including the famous skirmishes with Louis Pasteur in the mid-19th century,8 some 500 years after the case of Jacoba. One incensed physician even challenged Pasteur to a duel. Louis Pasteur, primarily a chemist and microbiologist, is one of the main founders of the germ theory of disease. Although his discoveries reduced mortality from puerperal fever, created the first vaccines for rabies and anthrax, and eventually revolutionized medical practice, the medical establishment in France was hostile to him as an unlicensed interloper.

THE RENAISSANCE AND THE CASE OF LEONARDO FIORAVANTI

The Milanese physicians had been plotting against him since his arrival from Venice in 1572. Fioravanti had been arrested and imprisoned by officers of the Public Health Board in Milan on the sketchy charge of not medicating in the accepted way.9 After 8 days in prison, Fioravanti was becoming increasingly outraged by the indignity he was suffering. The Milan medical establishment considered him an outsider, an alien, and an unwelcome intruder. They finally were able to have him incarcerated. Fioravanti was not a conventional medical charlatan hawking his nostrums in the piazza and then moving on. Nor was he a run-of-the-mill barber-surgeon. He had practiced medicine for years in Bologna, Rome, Sicily, Venice, and Spain. He had an MD from the University of Bologna, had published several medical texts, had developed many medicines, and was a severe critic of much of conventional medical practice. The Milan physicians were not welcoming and considered him a foreign doctor.

A prison guard provided pen and paper for Fioravanti, and in his most elegant and formal language he wrote a letter to Milan's public health minister from "Leonardo Fioravanti of Bologna, Doctor of Arts and Medicine, and Knight." He asked to be released from prison and to "medicate freely as a legitimate doctor." A paid messenger delivered the letter to the health office located in the Piazza del Duomo.

The health minister, Niccolo Boldoni, was responsible for overseeing every aspect of medical practice in Milan, from examining midwives, barber-surgeons, and physicians, to collecting fees, imposing fines, inspecting apothecaries, and ruling on appeals. The letter from the doctor and knight Leonardo Fioravanti claimed that the Milan physicians were in a plot to stop him from providing care and cures to the sick of Milan. Moreover, he claimed that the Milan physicians were a menace to their patients and did more harm than good with quack treatments, poisonous medicines, and careless and arrogant behaviors. Fioravanti challenged the minister to assign 25 of the sickest patients to him and an equal number to Milan doctors of the minister's choosing, claiming that he—Fioravanti—would cure his patients more quickly and better than the other doctors. It is unlikely that this early clinical trial ever occurred, as there is no historical record of it, but Boldoni and the Milan court set Fioravanti free.

ASSESSMENT, TESTING, AND EVALUATION

Assessment involves the assignment of numbers, quantities, or characteristics to some dimension. Evaluation is the process of interpreting or judging the value of the assessment. Testing, as used in the usual sense, is a subset of assessment: in the classroom, in the clinic, or on the ward, it is assessment of educational outcomes. In recent years, the focus for classroom-, clinic-, or ward-based assessment has been on performance assessment (also called authentic, direct, or alternative assessment). This sort of assessment involves "real life," open-ended activities that are intended to measure aspects of higher-order thinking and professional conduct, which together can be referred to as competence.10

In psychology and health sciences education, assessment instruments include intelligence tests, achievement tests, personality inventories, biomedical tests, and any classroom tests that you have taken in school, college, and medical school. All of these are measurement instruments because they attempt to quantify (assign a number to) some dimension, whether it is length (ruler), time (clock), intelligence (IQ test), or scholastic aptitude for medical school (Medical College Admission Test—MCAT). Each assigns a number to a physical (length), psychological (intelligence), or educational/achievement/aptitude (MCAT) dimension. In addition to the MCAT, another very important test in medicine is the United States Medical Licensing Examination (USMLE), which is given in steps, reflecting emerging competence over time in the physician's development.


Step 1 assesses understanding and application of important concepts of the sciences basic to the practice of medicine. Here the focus is on principles and mechanisms underlying health, disease, and modes of therapy. Step 2 assesses the application of medical knowledge, skills, and understanding of clinical science essential for the provision of patient care under supervision. Step 3, the final examination, assesses the application of medical knowledge and understanding of biomedical and clinical science essential for the unsupervised, independent practice of medicine, with emphasis on patient management in ambulatory settings. In all cases, scores or numbers are derived from these tests. Tests and assessment devices, therefore, are measurement instruments in education and psychology. Long before systematic or standardized testing in the United States and Europe, however, physician licensing was rigorously practiced (see Box 1.1 for historical examples).

Evaluation involves value judgments. To make an evaluation, you interpret a measurement according to some value system. A measurement may be a clinical teacher reporting Jason's score on a procedural skills test, such as intubation, as 19. If the teacher interprets this score and concludes that Jason is excellent at intubation, then this is an evaluation.

Assessment = assigning a number
Measurement = assigning a number
Testing = assigning a number
Evaluation = assessment or measurement or testing + a value judgment

Test scores generally serve as measurements or attempts to quantify some aspect of student or clinician educational functioning. The letter grades (A, B, C, D, F) or adjective descriptions (Excellent, Fair, Poor) associated with these numbers are evaluations based on these test scores. In some medical schools, students are provided with a class ranking, which is a type of evaluation, because the higher the rank, the better the standing. Evaluation, therefore, is measurement plus value judgments. The quality of the evaluation depends on both the quality of the measurement and the care with which this result is interpreted. A careless interpretation of good-quality data is likely to lead to a poor evaluation, just as a careful interpretation of shoddy data would. Evaluations by professors (or others) of student performance that are based on little or no data (that is, subjective evaluations) are likely to be of very poor quality. One of the main aims of this book, therefore, is to help course professors, clinical teachers, assessment experts, licensing authorities, and others to develop good-quality assessments (measurement instruments) and to conduct careful interpretations of the results. Thus, their evaluation of student and clinician performance will be enhanced.
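To make the distinction concrete, here is a minimal Python sketch (not from the book; the letter-grade cut-scores and the 20-point intubation rubric are hypothetical) showing how an evaluation adds a value judgment on top of a measurement:

```python
# A measurement is just a number; an evaluation applies a value judgment to it.
def letter_grade(percent_score: float) -> str:
    """Map a measured percentage score to a letter grade using hypothetical cut-scores."""
    cut_scores = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for cut, grade in cut_scores:
        if percent_score >= cut:
            return grade
    return "F"

def rate_intubation(checklist_score: int) -> str:
    """Judge a procedural skills score (e.g., Jason's intubation score of 19)
    against a hypothetical 20-point checklist."""
    if checklist_score >= 18:
        return "Excellent"
    if checklist_score >= 14:
        return "Fair"
    return "Poor"

print(letter_grade(84))      # measurement 84 -> evaluation "B"
print(rate_intubation(19))   # measurement 19 -> evaluation "Excellent"
```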


BRIEF HISTORY OF TESTING

Modern testing of physicians and other healthcare professionals is well established and relies on scientific processes and procedures encompassed in the field of psychometrics. While the science of testing and assessment for the purpose of certification and licensure has developed in the past century or so, the regulation of medical and healthcare practice has its roots in antiquity (see Box 1.2 for its origins).11

According to the Book of Judges, the tribe of Gilead defeated the invading tribe of Ephraim (around 1370–1070 BCE). The Gileadites captured the fords of the Jordan opposite Ephraim. When any of the fugitives of Ephraim asked to cross over to return home, they were given a test. To discern whether the request was to be allowed, the requester was asked to pronounce the word "Shibboleth." The men of Gilead would say to him, "Are you an Ephraimite?" If he said, "No," then they would say to him, "Say now, 'Shibboleth.'" But he said, "Sibboleth," for he could not pronounce it correctly. Then they seized him and slew him at the fords of the Jordan. Thus there fell at that time 42,000 of Ephraim.12 This was a final exam indeed!

The first country in the world to implement standardized testing on a broad scale was Ancient China. Called the imperial examination, the main purpose of the test was to select candidates for specific jobs in the government.13 The imperial examination was established during the Sui Dynasty in 605 AD. It lasted for 1,300 years and was abolished in 1905 during the Qing Dynasty.14 During its period of use, the imperial examination system played a central role in the Chinese imperial government. It served as a tool for political and ideological control, functioned as a proxy for education, produced an elite social class, and became a dominant cultural force in traditional Chinese society.15 The examination system was an attempt to recruit candidates on the basis of merit rather than on the basis of family or political connection. The texts studied for the examination were the Confucian classics. After a period of emphasis on memorization, without practical application and with a narrow scope, the exam underwent change (circa 960), stressing the understanding of underlying ideas and the ability to apply classical insights to contemporary problems. Students sometimes spent 20–30 years memorizing the classics in preparation for a series of up to eight examinations in philosophy, poetry, mathematics, and so on. By the 20th century, the imperial examination was considered outdated and inadequate.

Meanwhile, an examination system modelled on the Chinese imperial exam was adopted in England in 1806. Its purpose was to select specific candidates for positions in Her Majesty's Civil Service. This system was later applied to education and influenced testing in the United States as it became a model for standardized tests. These practices include using standard conditions for the test (e.g., quiet setting, proctors supervising), standard scoring procedures (e.g., using exam scorers who were blinded to the candidates' identity), and protocols (e.g., scoring rubrics).


BOX 1.2: Regulation of medical practice since antiquity

MEDICAL PRACTICE IN ANTIQUITY

A set of laws known as the Code of Hammurabi (circa 1740 BC), named after the founder of the Babylonian empire, has come down to us from the Babylonians.16 These 282 statutes or common laws governed nearly all aspects of social, political, economic, and professional life, including those pertaining to physicians, surgeons, veterinarians, midwives, and wet nurses. Carefully prescribed details were devoted to specifying the relationship between patients and practitioners, including fees and penalties. Problems of "internal medicine" were dealt with by physicians of the priestly class, who saw to internal disorders caused by supernatural factors. The surgeon who dealt with physical problems, however, was accountable for both remuneration and liability to earthly courts. If a doctor performed surgery, generally with a bronze knife, and saved the life or eyesight of an upper-class citizen, he was to be paid 10 shekels of silver. A similar outcome for a commoner was worth 5 shekels, and only 2 shekels for a slave. If the outcome for the upper-class citizen was bad (blindness or death), the doctor's hand was amputated. If a slave died because of the surgery, the doctor had to provide a replacement; if the slave was blinded, the doctor had to pay only half the slave's value in silver. The code provided further detail for many procedures, including those of the veterinarians ("doctor of an ox or ass").

Probably the most famous physician of all time and the founder of clinical medicine is Hippocrates (circa 460–360 BCE) of Greek antiquity, the putative author of the Corpus Hippocraticum. Upon graduation from medical school, many modern physicians continue to take the Hippocratic Oath, the model of the ideal physician. Many historians question whether Hippocrates actually wrote this oath or even the essays attributed to him. Some even question whether Hippocrates was a real person or was a composite created later by Greek and Roman scholars. Nonetheless, even in antiquity, there were rules, policies, and regulations on how to behave as a physician.

After Hippocrates, Galen is probably the next most famous physician in history. His works and texts continued to be studied by medical students and scholars for hundreds of years after his death. When Galen ventured to Rome in 161 AD, he was met with hostility by the medical establishment. For 5 years, he was able to continue to practice medicine, lecture, and conduct public discussions under the protection of the powerful Emperor Marcus Aurelius, who named him the "first of physicians and philosophers."17 Eventually, Galen returned to Greece, complaining that he had been driven out of Rome by the medical establishment, who saw him as an interloper. Galen did subsequently return to Rome, honoring a request from Marcus Aurelius. He remained there for the rest of his life.

MODERN REGULATIONS AND PROCEDURES FOR LICENSING PHYSICIANS

The modern rules, policies, and regulations governing the practice of medicine today are well established. In most jurisdictions—the United States, Canada, Europe—like the Code of Hammurabi, regulations govern nearly every aspect of the patient–physician relationship, clinical guidelines, best practices, evidence-based medicine, fees, collegiality, and so forth. In the United States, for example, a number of organizations, such as the American Board of Internal Medicine (ABIM), the American Board of Medical Specialties (ABMS), and the state licensing authorities (e.g., the Medical Board of California), as well as doctor organizations such as the American Medical Association (AMA), are all involved in governing the assessment, licensing, and behavior of physicians.

Nearly all jurisdictions in the world nowadays require physicians to be university educated and to earn an MD (medical doctor degree) or other approved degree (e.g., MBBS, MBChB). In many jurisdictions, students who have earned a medical degree are required to undergo further postgraduate supervised clinical training. In the United States and Canada, this is called residency; in the United Kingdom and several Commonwealth countries, the trainee is called a house officer or senior house officer. Depending on the medical specialty and jurisdiction, residency can be 1–6 years in duration. Once a doctor has passed all relevant examinations and qualifying procedures, the physician may be granted a license in a specified jurisdiction to practice medicine without direct supervision.

TRADITIONAL METHODS OF TESTING

In the United States, written tests to assess students were not generally used until Horace Mann introduced them in 1845. This was seen as a radical and controversial innovation when American educators adopted written tests in the late 1800s. Testing quickly became an important element in America's modern public school system as well as in postsecondary education, including medical schools.

THE FLEXNER REPORT

In 1910, Abraham Flexner produced the now famous Flexner Report, Medical Education in the United States and Canada: A Report to the Carnegie Foundation for the Advancement of Teaching, which was commissioned by the Carnegie Foundation.18 At that time, there were approximately 157 medical schools that were loosely affiliated with educational institutions, and the education of physicians was primarily a for-profit business. The knowledge and skills of the graduating physicians were highly variable, resulting in an over-production of poorly educated medical practitioners.

BOX 1.3: History of licensure and regulation in the United States

Flexner provided a review of medical education and medical institutions and reviewed the economic and social factors that had an impact on the delivery of medicine of that era. He concluded that the situation was untenable and that medical education needed to be radically transformed. Today, with a few rare exceptions, medical school organization in the United States and Canada consists of four years: the first two years of preclinical or laboratory sciences, consisting of foundational knowledge in anatomy, physiology, pharmacology, and pathology, followed by two years of clinical science experiences. The Flexner Report also contained arguments for evidence-based and scientific-based medicine. It has had an impact on the medical school admissions process, the learning environment, the qualifications and expectations of medical teachers, and the need for standardized licensure exams. The Flexner Report has been pivotal to most of modern medical education, including the development of standardized testing and assessment in medicine.

Meanwhile, developments in large-scale standardized testing were taking place in Europe and the United States, largely as a consequence of World War I. Large-scale testing in Germany for army inductees had been in progress since 1905. Alfred Binet and Theodore Simon had discussed the application of intelligence testing for the French army in 1910. In the United States, Lewis Terman had completed the revision and standardization of the Binet scales in 1917, and these principles of mass testing were soon applied to the American military effort, resulting in the Army Alpha and Beta Tests of intelligence. Nearly 2 million recruits were tested with these instruments before the end of World War I.19 This testing was seen as so successful that, after the war, large-scale standardized testing swept the American school systems. The major types of test used throughout the 20th century were pencil-and-paper multiple-choice questions (MCQs).

STATISTICAL AND MATHEMATICAL THEORIES

Another important development that underlay this widespread testing was the development of modern statistics. The large databases that were generated from this prodigious testing required improvements in statistical methods. Beginning in the late 19th century, there was a series of statistical and mathematical initiatives. The major contributors to the field have been Sir Francis Galton (polymath, correlation), Sir Ronald Fisher (analysis of variance), Karl Pearson (mathematical statistics, correlation), Charles Spearman (measurement of intelligence, factor analysis), EL Thorndike (educational measurement), Fredrick Kuder (reliability), Lee J Cronbach (reliability, generalizability theory), George Rasch (item response theory), Fredric M Lord (item response theory), Karl Joreskog (confirmatory factor analysis), Peter Bentler (structural equation modelling), and many others.

Group-administered final exams

TEST THEORY

There are three major interrelated test theories today: (1) classical test theory (CTT), (2) generalizability theory (G-theory), and (3) item response theory (IRT). All three theories are fundamental to the field of psychometrics: the theory and technique of psychological and educational measurement. This includes the objective measurement of attitudes, personality traits, skills and knowledge, abilities, and educational achievement. Psychometric researchers focus on the construction and validation of assessment instruments such as questionnaires, tests, raters' judgments, and personality tests, as well as statistical research relevant to the measurement theories.

CTT is so called because it was developed first in psychometrics. The premise is simple: any observed score (e.g., a test score) is composed of the "real" score or true score plus error of measurement: X (Observed score) = T (True score) + e (error of measurement). The early foundational work of scholars like Karl Pearson, Charles Spearman, EL Thorndike, and Fredrick Kuder was based on this idea.

G-theory was developed by Cronbach et al.20 as an advance over CTT. In CTT, each observed score has a single true score and a single source of error of measurement. G-theory is a statistical framework for conceptualizing and investigating multiple sources of variability in measurement. An advantage of G-theory is that researchers can estimate what proportion of the variation in test scores is due to factors that often vary in assessment, such as raters, setting, time, and items. Anyone who has watched Olympic diving has observed the effect of different sources of variance: the divers' scores vary with differences in performance, with the different raters (judges), and with the items (components of the dive). The variation in scores, therefore, comes from multiple sources. In healthcare assessment, the same situation obtains when a student performs skills that are rated by two or more judges or raters. Both CTT and G-theory continue to play a role in testing and measurement. Both of these theories and their applications will be discussed in subsequent chapters (Chapters 8 and 9).

The third major theory, IRT, is also known as latent trait theory. Like CTT and G-theory, it can be used for the design, analysis, and scoring of tests, questionnaires, and assessments measuring abilities, attitudes, or other variables. IRT is based on mathematical modelling of candidates' responses to questions or test items, in contrast to the test-level focus of CTT and G-theory. This model is widely used with MCQs (which are scored right or wrong) but can also be applied to rating scales, patient symptoms (scored present or absent), or diagnostic information in disease. In IRT, it is assumed that the probability of a response to an item is a mathematical function of person and item characteristics. The person characteristic is conceptualized as a latent trait such as aptitude, achievement, extraversion, or sociability. The item characteristics consist of difficulty, discrimination (how well items distinguish between people), and guessing (e.g., on multiple-choice items). All three psychometric theories—CTT, G-theory, and IRT—have their relative advantages and disadvantages and will be further discussed in later chapters (Chapters 8 and 9).
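A minimal Python sketch (not from the book; the score distributions and item parameters are invented for illustration) can make the CTT decomposition and an IRT item response function concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Classical test theory: X = T + e, simulated for 1,000 examinees.
true_scores = rng.normal(loc=70, scale=10, size=1000)  # T
errors = rng.normal(loc=0, scale=5, size=1000)         # e
observed = true_scores + errors                        # X

# Under CTT, reliability is the proportion of observed-score variance
# that is true-score variance: var(T) / var(X).
reliability = true_scores.var() / observed.var()
print(f"Simulated reliability: {reliability:.2f}")  # close to 10**2 / (10**2 + 5**2) = 0.80


# Item response theory: probability of a correct response under the
# three-parameter logistic (3PL) model.
def p_correct_3pl(theta, a, b, c):
    """Probability that a person with ability theta answers correctly an item
    with discrimination a, difficulty b, and guessing parameter c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

print(f"P(correct) = {p_correct_3pl(theta=0.0, a=1.2, b=-0.5, c=0.2):.2f}")
```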

COMPUTERS

The advent of computers has played a large role in the expansion and evolution of testing in the last 60 or so years. Many of the complex statistical techniques that have been applied to testing, such as correlational and factor analysis, and test theories like IRT, have only been possible with the use of computers. As large mainframe computers became commonplace in universities and some other institutions during the 1960s, large databases from testing programs could be stored and the data analyzed with software programs. At the same time, software programs that could be installed on the computers also became available. Prior to this, users had to write their own software from scratch. Data could be entered into the computers (e.g., with punch cards, optical scanning sheets, bubble sheets, keyboards) and analyzed, and reports generated for individual test-takers.

By the 1980s, optical scanning machines (which read bubble sheets marked with pencils) had been more or less perfected and matched to desktop computers. Large numbers of bubble sheets (e.g., 1,000) can be quickly and efficiently entered into a desktop computer, where sophisticated software can quickly analyze the test results and produce individual reports and psychometric results for the test.

The next evolutionary step, which began in the 1990s, was the elimination of pencil and paper (i.e., bubble sheets) so that students could take a test directly on a computer. This is called computer-based testing (CBT). The main advantage of CBT historically has been report generation and quick feedback. With the advent of the personal computer, CBT has functioned primarily as the computer-administered version of paper-and-pencil tests. This provided some advantages over paper-and-pencil testing in test administration and item innovation. Some disadvantages include the need for up-to-date hardware and software and for large test centers to accommodate large-group testing.
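The core of the scoring step such software performs can be sketched in a few lines of Python (the answer key and scanned responses below are hypothetical; real packages add full psychometric analyses on top):

```python
from collections import Counter

answer_key = "BDACA"        # hypothetical 5-item answer key
responses = {               # hypothetical responses read from bubble sheets
    "S001": "BDACA",
    "S002": "BDACB",
    "S003": "ADACA",
    "S004": "BDBCA",
}

# Total score per examinee: count of responses matching the key.
scores = {sid: sum(r == k for r, k in zip(resp, answer_key))
          for sid, resp in responses.items()}

# Item difficulty (p-value): proportion of examinees answering each item correctly.
correct_counts = Counter()
for resp in responses.values():
    for item, (r, k) in enumerate(zip(resp, answer_key), start=1):
        correct_counts[item] += (r == k)
item_difficulty = {item: correct_counts[item] / len(responses)
                   for item in range(1, len(answer_key) + 1)}

print(scores)           # {'S001': 5, 'S002': 4, 'S003': 4, 'S004': 4}
print(item_difficulty)  # {1: 0.75, 2: 1.0, 3: 0.75, 4: 1.0, 5: 0.75}
```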

Taking the USMLE, Step 1

CBT can be innovative, allowing flexible scoring of items. Test items can use sound or video to create multimedia items. An item may contain a 40-second video and audio sequence of a doctor performing a focused physical exam on the left lower quadrant of the abdomen, for example. This can be followed by a series of MCQs for the student. Content innovation also relates to the use of dynamic item types such as drag-and-drop, point-and-click, or hovering over hot-spots. Future developments in CBT are likely to focus on item innovations that measure complex cognitive outcomes such as clinical judgment and professionalism.


ONLINE TESTING

Online testing refers to the delivery of tests via the internet. This approach provides a new medium for the distribution of test materials, reports, and practice manuals, as well as for the automated collection of data. Even traditional paper-and-pencil materials can be delivered online as PDF files using e-book publishing technologies. Theoretically, anyone with an internet connection could take a test at any time from anywhere in the world. Such an approach provides much more flexibility in testing than has been possible in the past. Online testing also highlights a whole set of issues: confidentiality, cheating, test-taker identification, hacking, breaching the test bank, and so on.

CURRENT STATUS OF TESTING

The current status of testing is paradoxical. On the one hand, rapid advances in computer technology, statistics, and test theory have brought impressive new capabilities to the assessment of learning and performance. On the other hand, most classroom testing is still very primitive. In education in general, there are probably 500 million or so tests given in classrooms in the United States every year; the vast majority would not meet even minimal standards for appropriate tests. How can this be?

The main reason for the poor quality of tests in most classrooms is that teachers have little knowledge of basic educational measurement or test construction principles. Teachers (including professors) receive almost no education whatsoever in these theories and practices in their university programs. In the United States, most of the 50 states require no explicit training in measurement and assessment as part of teacher certification. Most states simply require completion of an accredited teacher education program. A majority of teacher education programs, however, have neither a compulsory nor an optional course in educational measurement and assessment. Indeed, many programs have no course offering in these areas at all. Most American teachers have no education in this subject. For those who teach in post-secondary institutions and medical schools (i.e., instructors, professors), the situation is even worse. Not only do the instructors, clinical teachers, and professors have no education in testing and assessment, they have no instruction in basic pedagogical principles or learning theories. The situation is similar throughout most of the world.

There is a further irony to the situation. While, as we have seen, testing is on the increase and teachers, instructors, and professors as well as university administrators know little about it, the general public is very much in favor of increased testing in schools,21 including post-secondary institutions. A majority (>80%) of the American public is in favor of using standardized testing in their community for core subjects (English, Math, Science, History, and Geography), for problem-solving skills, and for the ability to write. Finally, more than half of Americans are in favor of having students repeat their grade based solely on standardized national achievement test performance.

To sum up the current situation in educational institutions, then: the public overwhelmingly supports testing, as does the current political climate, but most teachers, who are the primary constructors and administrators of tests, know little about appropriate testing practices and standards. The situation is similar in post-secondary education, including medical schools. Testing and assessment are likely to increase in the future, but most instructors and professors know little about appropriate testing practice and standards.

The demands of assessment are currently rigorous and are likely to become even more so. Professors and instructors may spend as much as 20% or 30% of their professional educational time in assessment-related activities. These include designing, developing, selecting, administering, scoring, recording, evaluating, and revising tests and assignments. Healthcare instructors, professors, and medical education leaders such as ES Holmboe, DS Ward, RK Reznick, and others22 are very concerned about the quality of tests and measurements and lack confidence in them. Currently, medical education is under pressure to adopt competency-based approaches that emphasize outcomes. This shift in emphasis requires more effective evaluation and feedback by professors and instructors than is currently the case. The existing faculty is not fully prepared for this task even for the traditional competencies of medical knowledge, clinical skills, and professionalism, never mind the newer competencies of interdisciplinary teamwork, quality improvement, evidence-based practice, and systems. Medical and healthcare faculty welcome relevant and useful training or assistance in measurement and assessment, but they feel frustrated by their lack of training and support.19

THE ROLE OF ASSESSMENT IN TEACHING, LEARNING, AND EDUCATION

There are essentially four categories in which testing and assessment play a role: (1) teaching and learning, (2) program evaluation and research, (3) guidance and counselling, and (4) administration. While these are interrelated and test results can be used in several categories, they are described in turn.

Teaching and learning

All teaching has some goal or objective as its purpose. The objective may be implicit or explicit (see Chapter 10 for a discussion of objectives). Teaching, then, is for the purpose of having students reach some intended learning outcome (objective). The only sound means by which a teacher can determine the extent to which students have achieved the objectives is to assess performance. The outcome of the testing, therefore, is an evaluation not only of student performance but of the effectiveness of the teaching itself. A main function of testing is to measure the extent to which the instructional objectives have been met. Here are some examples of learning objectives that can be measured by testing or assessment, including direct observation with a checklist:

1. The student will be able to describe in writing the interaction of tRNA and mRNA in protein synthesis.
2. At the conclusion of the neurology course, students can write a cost-effective approach to the initial evaluation and management of patients with dementia.
3. At the conclusion of the internal medicine clerkship, third-year medical students will be proficient in the diagnosis and management of hyperlipidemia, hypertension, diabetes, chronic obstructive pulmonary disease, angina, and asymptomatic HIV.
4. Student nurses will demonstrate proper handwashing technique prior to changing a dressing on a patient.

Tests also have a number of other functions in teaching and learning.

MOTIVATION AND ENGAGEMENT

Tests motivate students to work and study harder than they might otherwise. With the many demands made on medical students and other healthcare students, tests can motivate students to set a high priority on material that will be tested. Also, frequent short tests will motivate students to sustain a consistently higher effort than infrequent, long-interval tests. Few things can motivate students to read a scientific article, a chapter, or a whole textbook as a forthcoming test can. Any technique that teachers can use to motivate students is desirable.

In a recent study, researchers were able to definitively demonstrate the impact of testing on medical student effort. Using the online curriculum system employed by first-year medical students, researchers were able to track activities through logons, amount of time engaged, and areas and content of use.23 As illustrated in Figure 1.1, the data on Digital-Space (D-Space) use are summarized on a weekly basis for 114 first-year medical students. The dates for the seven first-year course examinations are also indicated on the figure with drop-down arrows and identified by a corresponding course name.


Figure 1.1 Summary of 2003/2004 D-Space use and course examination schedule for the class of 2006 first-year medical students. (Line graph of D-Space Use [Hits/Week] by Date [Weekly Summary], August 2003 through May 2004, with examination dates marked for the Principles of Medicine, Blood, Musculoskeletal & Skin, Cardiovascular, Respiratory, Renal, and Endocrine courses.)

As is depicted by the fluctuations in the use of course materials and resources online, the medical students as a class show a dramatic increase in academic engaged time 1 or 2 weeks before a scheduled examination. In preparation for the "Principles of Medicine" course examination, for example, students in the week ending September 8 recorded a total of 531 D-Space hits. In the week ending September 15, the day of the scheduled examination, the students recorded an over five-fold increase in D-Space usage at 2,865 hits. The week that followed the course examination, the students' use dropped roughly ten-fold to a baseline low of 238 hits. This pattern of online use of course materials and resources is repeated both on an individual student basis and for the class as a whole throughout the year. The same pattern of academic engaged time seen for "Principles of Medicine" was obtained for all seven of the scheduled course examinations. In particular, the frequency of D-Space use, based on the number of recorded hits in the weeks just before and after a course examination, varied from two- to ten-fold. Student effort, motivation, and academic engaged time are determined by the examinations: engagement peaks just prior to the examinations and drops to near zero just after the examination.
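As a quick check on these figures (a sketch using only the hit counts quoted above, not the underlying dataset), the fold changes around the "Principles of Medicine" examination work out as follows:

```python
# Weekly D-Space hit counts quoted in the text around the "Principles of
# Medicine" examination (week ending September 15).
hits_before = 531       # week ending September 8
hits_exam_week = 2865   # week ending September 15 (exam day)
hits_after = 238        # week following the examination

increase = hits_exam_week / hits_before  # ~5.4, the "over five-fold" increase
decrease = hits_exam_week / hits_after   # ~12, described above as roughly ten-fold

print(f"Pre-exam increase: {increase:.1f}-fold")
print(f"Post-exam drop: {decrease:.1f}-fold")
```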

Increasing student motivation to work and academic engagement will also increase their learning. When students anticipate a forthcoming test, they will attend to material more closely, increase their study time, and work harder to learn the relevant material than they otherwise would. Also, the anticipation of a test improves students' learning set so that they increase their memory capacities for the material (through rehearsal, elaboration, and organization). Students will make a conscious effort to improve their knowledge and understanding of material when they anticipate a test as opposed to when they do not. Testing also promotes overlearning, which occurs when students systematically study and prepare for a test. Of course, some of the material learned for the test will be forgotten afterwards (hence the term overlearning), but more material will be retained in the long run than if overlearning had not occurred. This "forgotten" material is much easier to recall or relearn, even years later, than if it had to be acquired without prior knowledge. Thus, testing not only promotes learning in the immediate term but results in longer-lasting, more stable learning as well.

FEEDBACK

Test scores provide feedback for both the faculty and the student. The professor can evaluate the efficacy of instruction based on the students' performance. If, for example, no student in the class was able to correctly diagnose endometriosis in a case presented on the test, then the professor can conclude that the teaching of the clinical presentation of endometriosis failed. Similarly, if students are unable to describe the interaction of tRNA and mRNA in protein synthesis on a test, this indicates that the teaching of this material failed. These outcomes also indicate the need for review of this material. Feedback can also be provided directly to the student. Based on performance, the student can evaluate progress as a whole or in the specific sub-areas measured by the test. If, in the class presented in Figure 1.1, a student performed well on six of the tests but scored poorly on "Endocrine," this is clear diagnostic information for the student. If, on closer inspection, the student discovers that he did particularly poorly on the reproductive sections of "Endocrine," this provides an even more precise diagnosis of the learning required. This information can also provide insights for the student about effort. The test may indicate that effort has been adequate or that it needs to be increased.

TEACHING

Giving a test is itself a form of teaching. Students may learn substantially merely from writing a test, since they will have to formulate answers to questions. In one study, Foos and Fisher24 gave students a short essay about the American Civil War to read. Half of the students were then given an initial test and half were not. Two days later, all the students took a final common exam: those students who took the initial test generally did better than those who did not. It may well be that the act of retrieving the information for the initial test altered and strengthened it in memory. In the Foos and Fisher study, then, actually taking the initial test may have taught the students about the Civil War. This is called the "forward effect of testing." These findings can be applied in medical courses as well.25 A professor may give a test in obstetrics and gynecology two weeks before the final exam. The main purpose of this test is to help students learn and organize the material, as they will need to formulate answers to the test. A secondary purpose of the test is to provide feedback to the students on their current performance before the final exam. Taking the test will teach students about obstetrics and gynecology, capitalizing on the forward effect of testing.


Program evaluation and research

Tests can also be useful to evaluate educational programs and to conduct basic research. One of the most widely evaluated and researched educational approaches in medicine is problem-based learning (PBL). PBL generally consists of the following characteristics26: (1) medical problems (e.g., case presentation of acute abdominal pain in a 28-year-old woman) are used as a trigger for learning, (2) lectures are used sparingly, (3) learning is facilitated by a tutor, (4) learning is student initiated, (5) students collaborate in small groups for part of the time, and (6) the curriculum includes ample time for self-study. Many of the program evaluation studies have used test data (e.g., course examinations in pediatrics, USMLE Step 1, etc.) to compare performance between students in PBL curricula and those in more conventional curricula (e.g., lectures focused on systems). These curriculum comparison studies have been reviewed extensively over the past 25 years with mixed results: sometimes students in PBL curricula outperform comparison groups in conventional curricula and sometimes they do not. Schmidt et al.26 went beyond simple program evaluation to examine basic theoretical aspects of PBL through a careful review of the PBL findings. They concluded that PBL works because it encourages the activation of prior knowledge in the small-group setting and provides opportunities for elaboration on that knowledge. Activation of prior knowledge results in the understanding of new information related to the problem and facilitates transfer to long-term memory. Real problems (e.g., myocardial infarction) as presented in PBL arouse interest that motivates learning. Skilled tutors permit flexibility in cognitive organization compared to the rigid structures of worksheets or questions added to problems. Although initially students do not study much beyond the learning issues required, the self-study focus increases the development of self-directed learning. The learning that results from PBL comes from both group collaboration and individual knowledge acquisition.

In another evaluation study, researchers27 studied the relationship between basic biomedical knowledge and clinical reasoning outcomes in a clinical presentation curriculum. They hypothesized a model of medical students' knowledge as a function of their aptitude for medical school, basic science achievement, and clinical competency. A variety of models, each an explicit representation of a set of observed and latent variables, were tested using structural equation modelling, a statistical technique that allows researchers to test the goodness-of-fit of competing models. The models tested included knowledge encapsulation theory and an emphasis on clinical knowledge. Consideration was also given to an alternative model that includes both the basic science knowledge and clinical knowledge components as contributing independently to the diagnostic or clinical reasoning skills (CRS) of medical students. Test data were collected from a total of 589 students (292 males, 49.6%; 297 females, 50.4%) who completed medical school. The data consisted of the following medical students' aptitude, achievement, and performance measures: (1) the MCAT subtests (verbal reasoning [VR], biological sciences [BS], physical sciences [PS], and writing sample [WS])*; (2) undergraduate grade point average (UGPA) at admission; (3) basic science achievement in the first 2 years of medical school (Y1, Y2); (4) clinical performance in the medical school's single clerkship year (Y3); and (5) the Medical Council of Canada's (MCC) Part I test, which consists of a seven-section, 196 multiple-choice question (MCQ) examination on declarative knowledge (e.g., medicine, pediatrics, psychiatry) and an approximately 60-case or 80-item CRS test designed to assess problem-solving and clinical decision-making abilities. Figure 1.2 shows the final three-latent-variable model with its parameter estimates and goodness-of-fit index values for the CFI, SRMR, and RMSEA (Chapter 7, pp. 164–172). In this maximum likelihood model fit, the theoretical structure of the model is supported, with a covariance between the aptitude for medical school and basic science achievement latent variables. The model achieves the cutoff value for the CFI at .905 and, with n > 250, the SRMR (.054) and RMSEA (.105) values are close to the criteria set for robustness and non-robustness conditions.

[Figure 1.2 is a path diagram linking three latent variables (basic science achievement, aptitude for medical school, and clinical competency) to their observed indicators, with standardized path coefficients and the model fit indices CFI = .905, SRMR = .054, and RMSEA = .105. Abbreviations: VR = verbal reasoning, BS = biological sciences, PS = physical sciences, WS = writing sample; UGPA = undergraduate grade point average; YR 1–YR 3 = Years 1–3; MCQ = MCC Part I multiple-choice question exam; CRS = MCC Part I clinical reasoning skills exam.]

Figure 1.2  A model of the relationship between basic science achievement, aptitude for medical school, and clinical competency.

* A new version of the MCAT (2015) has been implemented since this research was conducted. This is discussed in detail in Chapter 5.

In comparison, the test values obtained for the knowledge encapsulation and emphasis-on-clinical-knowledge models were below the minimum 0.90 CFI cutoff, at 0.81 and 0.82, respectively. The path coefficient from the clinical competency latent variable to the MCC Part I CRS subtest was large at .85, accounting for 72% of the variance (.85² ≈ .72). The MCAT–VR subtest has a small positive (.15) path coefficient on the clinical competency latent variable, which may reflect the importance of verbal proficiency in the clerkship year. The Y3 variable, however, has a small negative (−.10) path coefficient on the clinical competency latent variable, which may reflect the variability of the clinical performance measures obtained from medical students during their different clerkship rotations. As anticipated, the aptitude for medical school latent variable is related to the MCAT subtests and UGPA, proximal measures obtained at the beginning of students' medical school experience. These findings support an integrated medical school curriculum that would nurture the inherent connection between the basic sciences and clinical competency in the further development of CRS in medical students. The results of this program evaluation and research study, therefore, have implications for curriculum design and delivery in developing medical reasoning skills. With this study, then, the researchers were able not only to evaluate the effects of curriculum content but to explore a basic research problem as well.
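For readers unfamiliar with the fit indices cited above, the following minimal Python sketch shows how the CFI and RMSEA are computed from a fitted model's chi-square statistic, degrees of freedom, and sample size, and how a standardized path coefficient translates into variance explained. The chi-square values are hypothetical, chosen only for illustration; they are not the statistics reported in the study.

import math

def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index: 1 - (model misfit / null-model misfit)."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 - d_model / d_null

def rmsea(chi2_model, df_model, n):
    """Root mean square error of approximation."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))

# Hypothetical values for illustration only (n = 589 as in the study).
chi2_m, df_m = 210.0, 32      # fitted model
chi2_0, df_0 = 1900.0, 45     # independence (null) model
n = 589

print(f"CFI   = {cfi(chi2_m, df_m, chi2_0, df_0):.3f}")
print(f"RMSEA = {rmsea(chi2_m, df_m, n):.3f}")

# A standardized path coefficient of .85 implies the latent variable
# accounts for .85**2, about 72%, of the variance in the indicator.
print(f"Variance explained by a .85 path: {0.85**2:.0%}")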

Administration monitoring

Administrative decisions about education, from university administrators to government officials, can be informed and improved by test results. Standardized test results such as USMLE data, for example, might be used by a dean to monitor the medical school's overall performance and take appropriate action if any is required. Suppose that such results indicated that the students in a given medical school were performing poorly in renal compared with other similar schools and relative to their own performance in cardiology; this might suggest to the dean a need to revamp the renal curriculum. Similarly, associations of medical schools might monitor the whole country's performance. Test data may provide a national picture showing, for example, inadequate knowledge of anatomy among beginning physicians. This might suggest the need for administrative action in educational policy on a national scale.

CURRICULUM DECISIONS

Results on a standardized basic skills test (such as the Medical Council of Canada Examinations—MCCE) might show that students in a medical school are achieving better scores with basic science curriculum A (e.g., PBL) than other students with basic science curriculum B (e.g., systems). The administration and faculty of the medical school may decide to adopt curriculum A as a standard. A dean of a medical school may discover that students who take an elective advanced clinical skills course score 15 points higher on the MCCE than students who do not take the course. She might decide to expand the course offering, thereby providing other students this advantage on the MCCE.
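A minimal sketch of the kind of two-group comparison behind such a decision, using only Python's standard library and made-up MCCE scores (the numbers and group sizes are hypothetical, not data from any real school):

from statistics import mean, stdev
from math import sqrt

# Hypothetical MCCE scores for students who took the elective advanced
# clinical skills course versus those who did not.
took_course = [512, 498, 530, 505, 521, 515, 508, 526]
no_course = [495, 488, 502, 491, 499, 485, 500, 493]

diff = mean(took_course) - mean(no_course)

# Welch's t statistic as a rough check that the difference is not noise.
se = sqrt(stdev(took_course) ** 2 / len(took_course) +
          stdev(no_course) ** 2 / len(no_course))
t = diff / se

print(f"Mean difference: {diff:.1f} points, t = {t:.2f}")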


SELECTION AND SCREENING

Tests are used to improve selection and screening decisions. Selection decisions for programs for the gifted are frequently made on the basis of test data (e.g., IQ and achievement). Screening for admission to college can be based, at least in part, on the Scholastic Aptitude Test (SAT), for medical school on the Medical College Admission Test (MCAT), and for law school on the Law School Admission Test (LSAT). Similarly, the Graduate Record Examination (GRE) is used by many universities to select for graduate school.

GUIDANCE AND COUNSELLING

Test scores have long been used in clinical psychology and school guidance to help improve decisions. Results from intelligence, personality, interest, aptitude, and achievement tests can be used by a guidance counsellor to help students make the best decisions about their future schooling, career, and work. In medical school, test data can inform counselling for the selection of a medical specialty. Test and assessment information can also help psychologists and psychiatrists to provide the best possible care for their clients and patients. Test scores alone, however, should never be used to make guidance and counselling decisions. Rather, they should be used by the professional to assist in making the best possible decision.

CERTIFICATION AND LICENSING

Nearly all professional organizations (physicians, lawyers, accountants, optometrists, nurses, etc.), and many other groups as well, have formal certification and licensing procedures involving tests. Thus, to be eligible to practice as a physician, lawyer, dentist, or psychologist, to name a few, candidates must write and pass certification exams. In the health professions, it is also common practice to assess practical clinical skills, such as conducting a physical exam, taking a patient history, and performing procedures, using standardized objective structured clinical exams (OSCEs). Many organizations have implemented, or are moving toward implementing, certification and licensing exams utilizing written tests, computer-based tests (CBT), and OSCEs as a means of determining the competency of their licensees. This is likely to expand and increase in the future.

WORKPLACE-BASED ASSESSMENTS

Many health professionals now undergo assessment in the workplace when they are in professional practice. A common assessment is multisource feedback (MSF), also referred to as "360° evaluation." This has emerged as an important approach for assessing professional competence, behaviors, and attitudes in the workplace. Early attempts to develop MSF questionnaires in medicine focused on the assessment of residents in the late 1970s. Today, MSF tools are being used in the United States, Canada, and Europe (in the Netherlands and the United Kingdom) across a number of physician specialties. Typically, this feedback is collected using surveys or questionnaires designed to collect data from various respondents (e.g., peers, coworkers, patients), including the healthcare professional in a corresponding self-assessment on the measurement instrument. MSF has gained widespread acceptance for the evaluation of professionals and is seen as a catalyst for the practitioner to reflect on where change may be required.28 MSF originated in industry because reliance on a single supervisor's evaluation was considered an inadequate approach to the assessment of a worker's specific abilities. Similarly, physicians work with a variety of people (e.g., medical colleagues, consultants, therapists, nurses, coworkers) who are able to provide a better assessment and a more contextually based understanding of physician performance than any single person could. In MSF, physicians may complete a self-assessment instrument and receive feedback from a number of medical colleagues and non-physician coworkers (e.g., nurses, psychologists, pharmacists), as well as their own patients. Different respondents focus on characteristics of the physician that they can assess (e.g., patients are not expected to assess a physician's clinical expertise) and together provide a more comprehensive evaluation than could be derived from any one source alone. MSF is gaining acceptance and credibility as a means of providing doctors with relevant information about their practice to help them monitor, develop, maintain, and improve their competence. The assessment focuses on clinical skills, communication, collaboration with other healthcare professionals, professionalism, and patient management. This assessment system is a means of maintaining and improving physician competence through systematic feedback.
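As a rough sketch of how MSF ratings might be summarized for a feedback report (the respondent groups, items, and ratings below are hypothetical and do not represent any particular MSF instrument), each source's mean can be computed over only the items that source is asked to rate, and the self-assessment can then be compared with the mean of the other sources:

from statistics import mean

# Hypothetical MSF ratings on a 1-5 scale for one physician, grouped by
# respondent source; None marks an item a source does not rate
# (e.g., patients are not asked to rate collaboration with colleagues).
ratings = {
    "self": {"communication": 4.0, "collaboration": 4.5, "professionalism": 4.0},
    "medical_colleagues": {"communication": 3.6, "collaboration": 4.1, "professionalism": 4.4},
    "coworkers": {"communication": 3.9, "collaboration": 4.3, "professionalism": 4.5},
    "patients": {"communication": 4.2, "collaboration": None, "professionalism": 4.6},
}

def source_mean(items):
    """Mean of the items a given source actually rated (skip unrated items)."""
    rated = [v for v in items.values() if v is not None]
    return mean(rated)

report = {source: round(source_mean(items), 2) for source, items in ratings.items()}
others = mean(v for s, v in report.items() if s != "self")

print(report)
print(f"Self rating vs. mean of other sources: {report['self']:.2f} vs {others:.2f}")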

TYPES OF TESTS AND EVALUATION PROCEDURES

Tests and evaluation procedures are frequently classified into one or more categories. Following are some of the most commonly used categories of test types.

1. Individual or group administered. Some tests, like psychological ones such as IQ, or performance medical skills tests (e.g., phlebotomy), must be individually administered. Many educational tests, however, can be administered to a group ranging from a typical classroom to many hundreds of students simultaneously. Almost all medical or healthcare classroom tests are of the group type.

2. Teacher-made or standardized tests. Sometimes these are also referred to as informal (teacher-made) or formal (standardized) tests. Teacher-made tests are constructed by the teacher to measure some specific aspect of achievement (e.g., pathology, anatomy, biochemistry, pediatrics, obstetrics/gynecology) relevant to that course. Such a test will likely be administered only once to that group of students. Standardized tests, on the other hand, are constructed by testing experts, usually to measure some broad range of aptitude or achievement relevant to many medical or other healthcare students. Unlike teacher-made tests, the standardized test is not designed for one specific course or healthcare school. Standardized tests are intended to be used (and re-used) in broader jurisdictions, such as state-wide (e.g., board certification exams), nationally (e.g., USMLE step exams), or internationally (e.g., MCAT). Moreover, these tests are administered in formal situations according to standardized conditions. Finally, these formal tests have known statistical properties (frequently referred to as psychometric properties), since they have been administered in advance to a peer group of those who will write the test.

3. Speed or power tests. Speeded tests have time constraints on completing them. All test-takers are given a fixed amount of time, designed so that many or most writers will not be able to complete the test in the allotted time. Power tests, on the other hand, do not have time as a significant constraint. While there is of course some upper limit on the time available, most test-takers can complete the test in the allotted time. Most course tests are power tests, where time should not be a significant factor in performance. Tests should be speeded only if speed of response is a significant element of the dimension measured by the test. For most courses, the professor is not so much interested in students' speed of response as in the accuracy and depth of response when adequate time is provided.

4. Supply or selection tests. Supply or selection refers to the type of items that make up the test. Selection items are those where students or candidates must select a correct answer from several options (e.g., multiple-choice items). For supply items, the student must provide or supply the answer or response to a question or instructions (e.g., short-answer items). The selection item provides a much more structured task for the test-taker than does the supply item, especially for essay questions. Since for supply items the student must provide the answer, skills such as organization, writing, and fluency play a prominent role. Supply items are also called constructed-response items, and selection items are called forced-choice items.

5. Objective or subjective tests. This distinction refers to how the test is scored and graded, not to the item types. Objective tests can be scored by anyone in possession of the key or answer sheet; indeed, machines (optical scanners or software programs) can score these tests. Subjective tests can be scored only by an expert in the subject matter, as expert judgment is required to evaluate the constructed answer. Selection items can generally be scored objectively. The objective–subjective distinction refers to the nature of the scoring process.

6. Norm-referenced or criterion-referenced tests. Another important way to classify tests is based on how the results are interpreted: against norm group performance or against some criterion. For norm-referenced tests, an individual's performance is evaluated against the performance of a norm (peer) group. A professor may interpret a student's physiology test score with reference to the rest of the class, perhaps even using class rank. It may be below average, above average, or about average. For criterion-referenced tests, however, the norm group's performance is irrelevant. Rather, a criterion for "passing" or "mastery" performance is established before the test is written or the skill is performed. Irrespective of the norm group's performance, an individual's score is evaluated against this standard, frequently called a cutoff score. Some standardized tests used in American and Canadian university and healthcare assessment are norm-referenced. The GRE and the MCAT are two tests of this type. The GRE consists of major sections measuring verbal and quantitative skills. Taking the GRE is required for entry to many, but not all, university graduate schools in the United States and Canada. Criterion-referenced tests are used more widely for licensure testing by boards of professional organizations. The USMLE step tests, the Medical Council of Canada exams, and medical licensing exams in Britain are all examples of criterion-referenced tests. Establishing minimum performance levels (MPLs) for these tests (cutoff scores) usually requires a complex series of decisions by subject-matter experts with the help of testing experts. While the idea of criterion-referenced measurement is appealing, its use has remained narrow largely because of the theoretical and practical difficulties of establishing standards. The test for acquiring a driver's license is an example of a common criterion-referenced test. Another pair of terms used synonymously with criterion-referenced and norm-referenced is mastery tests in contrast to survey tests. For mastery tests, the candidate must exceed some pre-established standard (criterion) to have achieved mastery. By contrast, a survey test surveys the candidate's knowledge, but mastery is not evaluated against some standard. (A brief sketch contrasting the two interpretations follows this list.)

7. Achievement, intelligence, and personality tests. Achievement tests attempt to measure the extent to which a student has "achieved" in some subject matter or on some skill. This achievement is thought to represent the level of learning due to motivation, effort, interest, and work habits. A typical test in a cardiovascular course is an example of an achievement test. Intelligence may play a part in the outcome. Intelligence or aptitude is thought to be the potential for learning or "achieving"; the actualization of this potential is achievement itself. Intelligence, or this potential for learning, is usually measured by IQ tests such as the Wechsler Adult Intelligence Scale, while aptitude for medicine is measured by the MCAT. A personality test measures the aspects of a person's psychological makeup that are not intellectual. Generally, these tests seek to measure the level of a person's adjustment, attitudes, feelings, or specific traits like introversion, sociability, and anxiety. Sometimes personality tests, like the Minnesota Multiphasic Personality Inventory, are used for clinical diagnoses to assess the degree of neurosis or psychosis of an individual. Researchers have studied the role of personality in medical student selection and performance. One test that has been used for this purpose is the NEO-PI, referred to as the Big Five personality inventory, which measures five broad traits, domains, or dimensions used to describe human personality: neuroticism (N), extraversion (E), openness (O), conscientiousness (C), and agreeableness (A). Conscientiousness has been found to be a predictor of performance in medical school and becomes increasingly significant as students and residents advance through medical training. Other traits concerning sociability (i.e., extraversion, openness, and neuroticism) have also been found to be relevant for performance and adaptation in the medical environment.29 The traits of neuroticism and conscientiousness are related to stress in medical school: low extraversion, high neuroticism, and high conscientiousness result in highly stressed students (brooders), while high extraversion, low neuroticism, and low conscientiousness produce low-stressed students (hedonists).30

8. Summative, formative, placement, and diagnostic evaluation. Each of these types of evaluation refers to the use of the test results.
Summative. This is a "summing up" of all assessments and measurements in order to determine final performance. Assigning the letter grade (A, B…F) or the final percentage value (e.g., 73.4%) is a form of summative evaluation.
Formative. This is monitoring the situation. During the course of instruction, the professor wishes to monitor whether students are moving toward the final objectives so as to take appropriate action. This usually involves quizzes, mid-term tests, and the like.
Placement. Sometimes it is necessary to determine where in the educational sequence the student should be placed (e.g., advanced, remedial). Professors and administrators may be uncertain, for example, where to place a student who has failed a clinical rotation or clerkship. Should the student continue on to the next year while remediation is undertaken, or should the student be held back until the clinical rotation is redone and passed? Placement evaluation might be conducted using previous records, test data, and interviews to make this decision.
Diagnostic. When students have persistent learning problems, it may be necessary to diagnose the problem. Psychological, medical, and educational data are frequently necessary to make a diagnostic evaluation. Does the student have visual or hearing impairments? Does the student have any learning disabilities?
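Here is the brief sketch referred to under item 6: a minimal Python illustration, with made-up class scores and a hypothetical cutoff, of how the same raw score receives a norm-referenced interpretation (a percentile rank within the class) and a criterion-referenced interpretation (pass or fail against a pre-set cutoff).

from bisect import bisect_left

class_scores = sorted([52, 58, 61, 64, 66, 70, 71, 73, 75, 78, 81, 85, 90])
cutoff = 65          # hypothetical minimum performance level (MPL)
student_score = 64

# Norm-referenced interpretation: where does the score fall in the peer group?
percentile_rank = 100 * bisect_left(class_scores, student_score) / len(class_scores)

# Criterion-referenced interpretation: does the score meet the standard?
outcome = "pass" if student_score >= cutoff else "fail"

print(f"Percentile rank in class: {percentile_rank:.0f}")   # about the 23rd percentile
print(f"Against the cutoff of {cutoff}: {outcome}")          # fail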

THE ORGANIZATION OF THIS TEXTBOOK

This is a comprehensive textbook on how to assess competence in medicine, medical education, and healthcare throughout the clinician's career. The textbook is organized into three main sections: (1) Foundations, (2) Validity and reliability, and (3) Test construction and evaluation, consisting of a total of 16 chapters.

Foundations

The Foundations section consists of four chapters. Chapter 1, of course, is the introduction: it establishes the need for testing and assessment, deals with the questions of "what are testing, assessment, measurement, and evaluation?", provides a brief history of testing, describes the common types of tests and assessments, and summarizes the uses of tests.


In Chapter 2, the somewhat controversial topic of competence in medicine is discussed. This chapter deals with the definition of competence, the controversies surrounding it, and its multiple forms. The chapter also discusses the various methods of assessing competence over the career span. Chapter 3 deals with somewhat more technical content, the basic statistics used in testing. These include descriptive statistics, standard scores, and graphical analyses. Correlational techniques—indispensable methods in testing—are presented, detailed, and discussed in Chapter 4. The following major correlational techniques are presented: Pearson product-moment, Spearman rank-order, biserial, and point-biserial correlations; regression and multiple regression (linear, logistic); factor analysis (exploratory factor analysis [EFA] and confirmatory factor analysis [CFA]); structural equation modelling; and hierarchical linear modelling.

Validity and reliability

The section on Validity and reliability contains a discussion of the multiple forms of validity (face, content, criterion-related, construct, advanced methods, and a unified view of validity) and a rationale for its three chapters on validity and two on reliability, for a total of five chapters. Chapter 5—Validity I—deals with logical and content validity and describes the use of tables of specifications (blueprints) to enhance these types of validity. Chapter 6—Validity II—contains correlation-based and between-group-differences aspects of validity, such as criterion-related validity (concurrent and predictive) and construct validity (between-group differences, factorial validity, experimental approaches, and data structure—regression, discriminant, and cluster analyses). Chapter 7—Validity III—advanced methods, is about advanced techniques of validity such as multitrait multi-method (MTMM) analysis, confirmatory factor analyses, structural equation modelling, systematic reviews and meta-analyses, hierarchical linear modelling, and a unified view of validity. Reliability is discussed in two chapters. Chapter 8—Reliability I—deals with the multiple forms of reliability (test–retest, internal consistency, parallel forms, inter-rater, intra-rater, Ep2, Se, etc.). The technical and mathematical theories of CTT are developed and detailed in this chapter. Chapter 9—Reliability II—develops and details the technical and mathematical theories of IRT and Ep2.

Test construction and evaluation

In Chapter 10, the various formats of testing for cognition, affect, and psychomotor skills are summarized. These include selection-type items (MCQs), constructed-response items, checklists, and other item formats. There is a discussion of Bloom's taxonomies of cognitive, affective, and psychomotor skills; Miller's Pyramid; and further work on the use of tables of specifications. Chapter 11 deals specifically with multiple-choice items: how to write items, types of MCQs, number of options, how to construct MCQs, and appropriate levels of measurement. In Chapter 12, constructed-response items are detailed. These include essays, short-answer and matching items, and hybrid items such as the script concordance test. Measuring clinical skills with the OSCE is discussed in Chapter 13. The focus is on describing the OSCE and its use in measuring communication, patient management, clinical reasoning, diagnosis, physical examination, case history, and so forth. Additional issues of case development and scripts, training assessors, and training standardized patients are detailed. Chapter 14 deals with checklists, questionnaires, rating scales, and direct observations. The following commonly used assessments are discussed: in-training evaluation reports (ITERs), the mini-CEX, multisource feedback, chart audits, and semi-structured interviews. Chapter 15 is about evaluating tests and assessments using item analyses: conducting classical item analyses with MCQs (item difficulty, item discrimination, and distractor effectiveness), conducting item analyses with OSCEs, conducting item analyses with constructed-response items, and computing the reliability coefficient and errors of measurement. Grading, reporting, and methods of setting cutoff scores (pass/fail) are discussed in Chapter 16. The focus is on norm-referenced versus criterion-referenced approaches and on setting an MPL utilizing the Angoff, Ebel, and Nedelsky methods, as well as empirical methods (borderline regression, cluster analyses, etc.). The book also contains a Glossary of testing and statistical terms (e.g., reliability, validity, standard error of measurement, biserial correlation), along with problems and data: fundamental problems are presented (e.g., calculating reliability, conducting item analyses, running confirmatory factor analyses), and appropriate software (e.g., SPSS, Iteman, EQS) and examples of its use are demonstrated. There are also comprehensive references and several appendices.

SUMMARY

The assessment and evaluation of learning has its origins in antiquity but is only now beginning to emerge as a unified field. This is due to developments in statistical and mathematical theory and test theory, advances in computer hardware and software, and the growth of the internet. Testing and assessment are the acts of quantifying an educational or psychological dimension, or assigning a numerical value to it. Evaluation requires a value judgment to be made based on the measurement. While the history of testing dates back two millennia, it has emerged on a large scale only in the 20th century, coincident with mass education. The current status of testing is paradoxical: it is on the increase and receives overwhelming public support, and yet most teachers, instructors, and professors (who construct, administer, and interpret the vast majority of tests) have little or no formal education in testing. Tests can be used to motivate students, enhance learning, provide feedback, evaluate educational programs, and conduct research. Other functions include curriculum revisions, selection and screening, certification, and guidance and counselling. Tests can be used in a variety of ways and for several evaluation functions. Testing is likely to expand in modern society. The improvement of healthcare professional competence depends, in part, on the improvement of testing and assessment of medical competence. The alarming rate of medical errors that currently result in death or other negative outcomes may thereby be reduced.

REFLECTIONS

1. How can assessment and evaluation reduce medical errors? (500 words maximum)
2. Distinguish between testing, assessment, quantification, and evaluation. (500 words maximum)
3. Briefly summarize the history of testing from antiquity to modern times. (500 words maximum)
4. Describe the various functions of testing as used in medical education. (500 words maximum)
5. Compare and contrast formative, summative, and placement evaluation. (500 words maximum)
6. Describe the various types of tests and evaluation procedures. (500 words maximum)
7. What roles have statistics and computers had in the development of testing? (500 words maximum)

REFERENCES

1. Kohn KT, Corrigan JM, Donaldson MS (1999). To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press. 2. Makary MA, Daniel M (2016). Medical error—The third leading cause of death in the US. British Medical Journal 353:i2139. doi:10.1136/bmj.i2139. 3. Schoen C, Osborn R, Doty M, Bishop M, Peugh J, Murukutia N (2007). Toward higher performance health systems: Adults' healthcare experiences in seven countries. Health Affairs 26(6):717–734. 4. Brand C, Ibrahim J, Bain C, Jones C, King B (2007). Engineering a safe landing: Engaging medical practitioners in a systems approach to patient safety. Internal Medicine Journal 37:295–302. 5. Newman-Toker DE, Pronovost PJ (2009). Diagnostic errors—the next frontier for patient safety. Journal of the American Medical Association 301:1060–1062. 6. Sibinga EM (2010). Clinician mindfulness and patient safety. Journal of the American Medical Association 304:2532–2533. 7. Magner LN (2005). A History of Medicine (2nd edition). New York: Taylor & Francis, pp. 154–155. 8. Debre P (2000). Louis Pasteur. Baltimore, MD: Johns Hopkins University Press.


9. Eamon W (2010). The Professor of Secrets. Washington, DC: National Geographic Publishing. 10. Hodges B, Lingard L (eds) (2013). The Question of Competence: Reconsidering Medical Education in the Twenty-First Century. Ithaca, NY: Cornell University Press. 11. Violato C (2016). A brief history of the regulation of medical practice: From Hammurabi to the national board of medical examiners. Wake Forest Journal of Science and Medicine 2:122–129. 12. American New Standard Bible, Book of Judges, 12: 6. 13. Dubois PH (1970). A History of Psychological Testing. Boston, MA: Allyn Bacon. 14. Miyazaki I (1981). China’s Examination Hell: The Civil Service Examinations of Imperial China. New Haven, CT: Yale University Press. 15. Wang R (2013). The Chinese Imperial Examination System. Plymouth, UK: Scarecrow Press. 16. Johns CH (2000). The Oldest Code of Laws in the World: The Code of Laws Promulgated by Hammurabi, King of Babylon. Union, NJ: Lawbook Exchange, Ltd. 17. Sigerist HE (1961). History of Medicine, Volume II: Early Greek, Hindu, and Persian Medicine. New York: Oxford University Press. 18. Flexner A (1910). Medical Education in the United States and Canada: A Report to the Carnegie Foundation for the Advancement of Teaching, Bulletin No. 4. New York City, NY: The Carnegie Foundation for the Advancement of Teaching. 19. McArthur DL (1983). Educational Testing and Measurement: A Brief History, (Report #216). Los Angeles, CA: Center for the Study of Evaluation, University of California. 20. Cronbach LJ, Nageswari R, Gleser GC (1963). Theory of generalizability: A liberation of reliability theory. The British Journal of Statistical Psychology 16:137–163; Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: John Wiley. 21. Rose LC, Gallup AM (2007). The 39th annual Phi Delta Kappa/Gallup poll of the public’s attitudes toward the public schools. Phi Delta Kappa 89(1):33–48. 22. Holmboe ES, Ward DS, Reznick RK, Katsufrakis PJ, Leslie KM, Patel VL, Ray DD, Nelson EA (2011). Faculty development in assessment: The missing link in competency-based medical education. Academic Medicine 86(4):460–467. 23. Donnon T, Violato C, DesCôteaux JG (2006). Engagement with online curriculum materials and the course test performance of medical students. International Journal of Interactive Technology and Smart Education 10:150–156. 24. Foos PW, Fisher RP (1988). Using tests as learning opportunities. Journal of Educational Psychology 80:179–183.


25. Pastotter B, Bauml KH (2014). Retrieval practice enhances new learning: The forward effect of testing. Frontiers in Psychology 5. doi:10.3389/fpsyg.2014.00286. 26. Schmidt HG, Rotgans JI, Yew EH (2011). The process of problem-based learning: What works and why. Medical Education 45:792–806. 27. Donnon T, Violato C (2006). Medical students' clinical reasoning skills as a function of basic science achievement and clinical competency measures: A structural equation model. Academic Medicine 81(10 Suppl):S120–S123. 28. Donnon T, Al Ansari A, Al Alawi S, Violato C (2014). The reliability, validity, and feasibility of multisource feedback physician assessment: A systematic review. Academic Medicine 89(3):511–516. 29. Doherty EM, Nugent E (2011). Personality factors and medical training: A review of the literature. Medical Education 45(2):132–140. 30. Tyssen R, Dolatowski FC, Røvik JO, Thorkildsen RF, Ekeberg O, Hem E, Gude T, Grønvold NT, Vaglum P (2007). Personality traits and types predict medical school stress: A six-year longitudinal and nationwide study. Medical Education 41(8):781–787.

2  Competence and professionalism in medicine and the health professions

ADVANCED ORGANIZERS

• Competence and professionalism are complex, interrelated, multidimensional constructs commonly based on three primary components: knowledge, skills, and attitudes. Miller described a four-tier pyramid of clinical ability, identifying the base of the pyramid as knowledge (measuring what someone knows), followed by competence (knows how), performance (shows how), and last, at the summit, action (behavior).
• Entrustable professional activities (EPAs) are clinical activities that require the use and integration of several competencies and milestones critical to safe and effective clinical performance. Each EPA is based on a number of competencies and milestones, where EPAs are units of work and competencies are abilities of persons. To assess the competence of physicians for EPAs requires complex and comprehensive assessments.
• In the English-speaking world, present views define medical competence and professionalism as the ability to meet the relationship-centered expectations required to practice medicine competently.
• A variety of professional organizations have recently weighed in on the constructs of competence and professionalism: the American Board of Internal Medicine (ABIM) in the early 1990s, followed by the Association of American Medical Colleges (AAMC), the Accreditation Council for Graduate Medical Education (ACGME), and the National Board of Medical Examiners (NBME), all of which focused on professionalism. In addition, the ABIM, in partnership with the American College of Physicians, the American Society of Internal Medicine, and the European Federation of Internal Medicine, developed a Physician Charter.



• Another model of medical competence is the Canadian Medical Education Directions for Specialists (CanMEDS) Framework. The CanMEDS Framework consists of seven roles: manager, communicator, professional, scholar, expert, health advocate, and collaborator; these roles have been used as the theoretical underpinnings of physician competencies.
• British and American views of medical professionalism are quite similar. For Asian countries, however, professionalism is a Western concept without a precise equivalent in Asian cultures.
• Instruments and procedures that are designed to measure competence and professionalism are challenged on validity, reliability, and practicality.
• Assessing competence: objective structured clinical examinations, assessment of language and communication, in-training evaluation reports, workplace assessment and multisource feedback, and direct observations of competence.
• Teaching and physician competence—does teaching medicine make you a better doctor?
• The overall results provide evidence of the effectiveness of the present direct observation and assessment in a clinical environment. The repeated measures analyses showed that students improved substantially on clinical skills, communication, and professionalism.
• Reflection, particularly self-reflection, is thought to be an important part of competence and professionalism. Unfortunately, reflection has been used with nearly as many meanings as has competence, but in the health science professions it is viewed as an important part of lifelong personal and professional learning and competence.

INTRODUCTION

Competence and professionalism are complex, interrelated, multidimensional constructs. They are commonly based on four primary components: knowledge, skills, attitudes, and behavior. Miller has described a four-tier pyramid (Table 10.2) of clinical ability, identifying the base of the pyramid as knowledge (measuring what someone knows), followed by competence (knows how), performance (shows how), and last, at the summit, action (behavior).1 Clinical competence involves a complex interplay between attributes displayed within the physician–patient encounter, which enables physicians to effectively deliver care. These attributes share an intimate relationship with one another and include the ability to take a proper history, complete a physical examination, communicate effectively, manage a patient's health, interpret clinical results, synthesize information, collaborate with other health professionals, advocate for population health, and so forth. These attributes are seen as integral to almost all medical practice and to practice in other health professions. Competence is composed of many different measurable variables, latent variables, and factors relating behavior to environment. Medical competence is a reciprocal link and interaction of a person's medical knowledge, skills, attitudes, and behavior. Clinical competence is determined and standardized as it relates to the practice of medicine in treating patients. Competence and performance are frequently distinguished from one another. Performance in the practice of medicine is the appropriate application of clinical competence to a clinical situation. Clinical competence is a construct that is not easy to measure directly. Professionalism as it relates to the practice of medicine is another construct which has not been adequately defined and is difficult to measure. Epstein and Hundert's definition of professional competence is that it is the "habitual and judicious use of communication, knowledge, technical skills, clinical reasoning, emotions, values and reflection in daily practice for the benefit of the individual and community being served."2 Hodges and Lingard have pointed out that in the past decade competence has grown to the status of a "god term" in medicine and the other healthcare professions.3 Together with professionalism, it has become a principal notion of what medical education should be striving for in the 21st century. They express concern that the concepts of competence and competency-based medical education (CBME) have become ubiquitous but remain only vaguely defined and poorly understood. Although CBME has promise (commitment to outcomes, focus on learner-centeredness, a move away from time-based education), there are perils (reductionism, emphasis on minimum performance levels, a utilitarian focus, logistical challenges for individually paced progress). According to Hodges and Lingard, competence is so widely used, in so many different ways, that it risks having no meaning at all. They concluded that the question, "what is competence?," remains to be satisfactorily answered.

MEDICAL COMPETENCE

Competence refers to an area of performance that can be described, operationalized, and measured.4 These performance areas are patient care, medical knowledge, professionalism, practice-based learning and improvement, systems-based practice, and interpersonal and communication skills. Although multi-faceted, medical competence can be determined through testing and assessments. There has been a great deal of activity in testing and assessment in the past 100 years, and these are likely to play an even larger role in the determination of competence for licensure and regulation of medical practice in the future.5 The future focus will be on competency-based testing of, among other things, entrustable professional activities (EPAs).

ENTRUSTABLE PROFESSIONAL ACTIVITIES

An EPA is a clinical activity that requires the use and integration of several competencies and milestones critical to safe and effective clinical performance.6 Each EPA—for example, #8: "Give or receive a patient handover to transition care responsibility"—is based on a number of competencies and milestones, where EPAs are units of work and competencies are traits of persons. To assess the competence of physicians for this EPA requires complex and comprehensive assessments.


Although EPA-based assessment is gaining momentum, challenges of feasibility of implementation remain. This assessment strategy also provides an opportunity to improve training and assessment by increasing focus on appropriate supervision and regulation of students, residents, and physicians to assess competence. Entrustability is achieved when the learner can perform the professional activity safely and effectively without supervision.

HEALING, COMPETENCE, AND PROFESSIONALISM

There have been healers in society since time immemorial. The tradition of the physician–healer in Western society dates back to Hippocrates and is known and cherished by medical practitioners everywhere. The healer offers advice and support in matters of health and ministers to and treats the sick. For centuries, the Hippocratic Oath and its modern derivatives have served as the foundation of competence in medicine. There are two major elements of medicine as a profession: a specialized body of knowledge and a commitment to service. But, according to Swick,7 modern medical professionals have become more closely connected to the application of expert knowledge at the expense of functions central to the good of the public they serve. He warns that medical professionals have become distracted from their public and social purposes and have thus lost a distinctive voice. Strengthening medical professionalism is one way to recover that distinctive voice.

DEFINITION OF MEDICAL PROFESSIONALISM

In the English-speaking world, the service of the physician–healer has been organized around the concept of the professional. Present views define medical competence and professionalism as the ability to meet the relationship-centered expectations required to practice medicine competently. The concept of professionalism has evolved over the past several decades. There are three stages in the evolution of medical professionalism: the first (1980s–early 1990s) was dominated by debates over professionalism and commercialism; a second (1990s) was dominated by calls to define medical professionalism as a concept and as a competency; the third (late 1990s–current) called for definitions highlighting the need to develop measures and metrics.8 A review in 2002 reported that half of the medical schools in the United States had identified between four and nine elements of professionalism and had developed written criteria and specific methods for their assessment. Northeastern Ohio Universities College of Medicine described professionalism by eight elements: (1) reliability and responsibility, (2) honesty and integrity, (3) maturity, (4) respect for others, (5) critique, (6) altruism, (7) interpersonal skills, and (8) absence of impairment. The University of New Mexico School of Medicine's list contained similar elements, but it included communication skills and respect for patients while omitting altruism. The University of California, San Francisco School of Medicine focused on just four aspects of professionalism: (1) professional responsibility, (2) self-improvement and adaptability, (3) relationships with patients and families, and (4) relationships with members of the healthcare team.9 Professional values common to undergraduate medical curricula are altruism, respect for others, and additional humanistic qualities such as honor, integrity, ethical and moral standards, accountability, excellence, and duty/advocacy.

A variety of professional organizations have recently weighed in on the constructs of competence and professionalism. The ABIM began its humanism project, which led to Project Professionalism in the early 1990s. Similarly, the AAMC, the ACGME, and the NBME focused on professionalism. In addition, the ABIM, in partnership with the American College of Physicians, the American Society of Internal Medicine, and the European Federation of Internal Medicine, developed a Physician Charter.10 According to the ABIM, professionalism in medicine requires the physician to serve the interests of the patient above self-interest. Professionalism aspires to altruism, accountability, excellence, duty, service, honor, integrity, and respect for others. The elements of professionalism required of candidates seeking certification and recertification from the ABIM encompass: a commitment to the highest standards of excellence in the practice of medicine and in the generation and dissemination of knowledge; a commitment to sustain the interests and welfare of patients; and a commitment to be responsive to the health needs of society.11 Altruism is the essence of professionalism and fundamental to this definition. It demands that the best interests of patients, not self-interest, guide physicians. Respect for others (ranging from patients to medical students) is the essence of humanism. A second large organization defining professionalism is the ACGME: professionalism is manifested through a commitment to carrying out professional responsibilities, adherence to ethical principles, and sensitivity to a diverse patient population. Moreover, it lists the positive aspects of professionalism: respect, regard, integrity, and responsiveness to patient and society that supersede self-interest. The 2015 joint initiative of the ACGME and the ABIM resulted in The Internal Medicine Milestone Project, a definition of medical competence and professionalism.12 The NBME has identified 60 behaviors of professionalism and has created an intriguing pictorial representation of professionalism composed of a circular core of knowledge and skills surrounded by seven supporting qualities: altruism; responsibility and accountability; leadership; caring; compassion and communication; excellence and scholarship; respect, and honor and integrity. The ABIM, the ACGME, the NBME, the AAMC, the Society for Academic Emergency Medicine, and the European Federation of Internal Medicine mostly concur with elements such as altruism, accountability, excellence, duty/advocacy, humanistic qualities, ethical and moral standards, service, honor, integrity, and respect for others. In 1995, the ACGME and the Royal College of Physicians and Surgeons of Canada (RCPSC) endorsed "professionalism" as one of the six general competencies for physicians.


CanMEDS and competence

Another model of medical competence has been developed by the RCPSC: the Canadian Medical Education Directions for Specialists (CanMEDS) Framework (Figure 2.1). The CanMEDS Framework consists of seven roles: manager, communicator, professional, scholar, expert, health advocate, and collaborator; these roles have been used as the theoretical underpinnings of physician competencies. The CanMEDS roles are thought to be dynamic and interrelated, but they are also distinct and cohesive. The CanMEDS diagram is designed to illustrate how these roles fit together and describe the competent physician. Competence in this model goes well beyond the medical expert role (although it is central) to the six other competencies.

Figure 2.1  Adapted from the CanMEDS Physician Competency Framework with permission of the Royal College of Physicians and Surgeons of Canada. Copyright © 2005.

Cross-cultural concepts of medical professionalism

The British approach to medical professionalism is grounded in patient-centered professionalism in the structural and normative context of medicine in the United Kingdom. The professional physician requires medical knowledge, skills, modernity, empathy, honesty, listening, and communication skills to be an effective team player. Professionalism is composed of expert knowledge and skill, ethicality, and service to patients. There are some differences between the American and British views on professionalism. British definitions of professionalism are more patient-centered than those in the United States. Additionally, the British emphasize service, but not altruism, as a core element of professionalism as in the United States. In general, however, British and American views of medical professionalism are quite similar. For Asian countries, professionalism is a Western concept without a precise equivalent in Asian cultures. The term does not easily translate into any Asian language or way of thinking regarding autonomy, service, and justice, which are understood quite differently in the East, where the 2,000-year-old Confucian tradition gives priority to collectivist, authoritarian, and hierarchical cultural standards. (In Hong Kong, the ideal of medical professionalism has been adhered to owing to British rule.) Despite this, Keio University School of Medicine in Tokyo has successfully introduced a professionalism curriculum that both supports Japan's cultural traditions and affirms the school's academic mission by using a sequential, overlapping, small-group discussion approach. The success of this initiative demonstrates that even in a culture unfamiliar with the concept of medical professionalism, it can be taught in a way that is appreciated by students and faculty.13 In China, medical professional bodies have also been formed and professional ethical standards are acknowledged, with the focus on individual patients rather than the community. The concepts of the patient–physician relationship and the primacy of patient interest are in concordance with Western views and have been widely accepted by Chinese society and healthcare workers.14 In Taiwan, Tsuen-Chiuan et al.15 employed the ABIM medical professionalism definitions. They explored the factors of commitment to care, righteous and rule-abiding, pursuing quality patient care, habit of professional practice, interpersonal relationship, patient-oriented issues, physician's self-development, and respect for others, and found that "commitment to care" was perceived as least important by Taiwanese medical graduates. Cultural sensitivity was perceived as pursuing quality patient care rather than as respect, as in the ABIM definition. In Vietnam, there has been no clear definition of medical professionalism. The most popular concept guiding physicians' practice is medical ethics. Nhan et al.16 found that the perceptions of medical professionalism of Vietnamese medical students and physicians were fairly similar to the ABIM definitions: integrity, social responsibility, professional practice habits, ensuring quality care, altruism, and self-awareness. Social responsibility was perceived as least important, and self-awareness as most important, by Vietnamese medical students. These constructs of medical professionalism were relatively similar to those found in the ABIM definitions but with some Vietnamese cultural differences centered on sentimentality. The positive aspects of this sentimental culture are humanity and respect for others, but the negative effects are a lack of principles, being too accommodating, and unreasonableness.


Assessing competence and professionalism
Assessing these two constructs, especially given the multiple definitions, is challenging. Instruments and procedures that are designed to measure competence and professionalism are challenged on:
● Validity—does the instrument measure competence as it is intended to measure?
● Reliability—is the assessment objective and are the results consistent and reproducible?
● Practicality—can it be assessed practically with students, residents, and physicians with available assessors?

MILLER'S PYRAMID OF CLINICAL COMPETENCE
Miller proposed a framework for assessing levels of clinical competence as a pyramid: knowledge at the base, followed by competence, performance, and, at the pinnacle, action. The lowest two levels test cognition: novice learners may know something about a neurological examination, for example, or they may know how to do a neurological examination. The upper two levels test behavior to determine whether learners can apply what they know in practice: they can show how to do a neurological examination, but can they actually do a neurological examination in practice?

Structured assessment: Objective structured clinical examinations
The OSCE is a performance-based assessment with some objectivity; it concentrates mainly on skills and, to a lesser degree, on knowledge (see Chapter 13). Even though Barrows and Abrahamson had described the technique of using standardized patients in 1964, the OSCE did not come into general use until the 1970s. The use of oral exams for assessing clinical competence continued for a long time despite their limitations of reliability and validity. Harden et al. combined the use of multiple stations and standardized patients and identified three ideal criteria for assessing clinical competence in an OSCE.17 In an OSCE format, the candidates rotate through a series of stations (mostly with an SP; the ones without SPs are referred to as "static stations"), where they are required to perform a clinical task. During this rotation, all the candidates receive the same medical encounter and are assessed by an examiner (physician judges or SPs) who provides a score based on preset criteria on a checklist. Some of the attributes measured through OSCEs include the ability to take a proper and adequate history, complete a physical examination, communicate effectively, manage a patient's health, interpret clinical results, and synthesize information. The OSCE has been used for assessing the competence of health professionals, including physicians, nurses, optometrists, dentists, and others. Its utility, however, is poor for assessing attitudinal and behavioral performance, which is better assessed in situ during clerkship and residency, for example.
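To make the checklist-based scoring described above concrete, here is a minimal sketch of how station scores might be aggregated for one candidate. The station names, checklist items, and scoring scheme are hypothetical illustrations, not taken from any particular OSCE.

```python
# Minimal sketch of OSCE checklist scoring (hypothetical stations and items).
# An examiner marks each checklist item as done (1) or not done (0); a station
# score is the percentage of items completed, and the exam score is the mean
# across stations.

checklists = {
    "history_taking": [1, 1, 0, 1, 1, 1],   # six-item checklist for one candidate
    "physical_exam":  [1, 0, 1, 1, 0],
    "communication":  [1, 1, 1, 1],
}

def station_score(items):
    """Percentage of checklist items the candidate completed."""
    return 100.0 * sum(items) / len(items)

station_scores = {name: station_score(items) for name, items in checklists.items()}
overall = sum(station_scores.values()) / len(station_scores)

for name, score in station_scores.items():
    print(f"{name:15s} {score:5.1f}%")
print(f"{'overall':15s} {overall:5.1f}%")
```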


In-training evaluation reports
In-training evaluation is the process of observing and documenting the performance of students, residents, or other trainees in a naturalistic clinical setting. Typically, the attending physician observes the trainee in a particular department, ward, or discipline one or more times during an assessment period. In-training evaluation reports (ITERs) are then completed by the preceptor or assessor at the end of training (or at any other point), usually on a standardized form assessing clinical knowledge, skills, and attitudes. ITERs are used by most disciplines to assess the clinical competence of residents during the two- to five-year supervised training period. ITERs are also used for giving formative feedback to the trainees on their competence. The construct measured through ITERs can be based on professional roles such as the CanMEDS roles (e.g., communicator, medical expert, collaborator) or those set by professional bodies such as the ABIM (e.g., professionalism). ITERs have been in use for many years for assessing the clinical competence of residents. They have been improved substantially (in reliability and validity) over the last several decades and have been made discipline specific.

Assessment of language and communication
Medical language encompasses the wide array of terms associated with medical knowledge, along with the professional language that is needed for communicating effectively with patients and colleagues in the practice of medicine. Professional communication in medicine is also related to the ability to demonstrate clinical skills. One of the most significant advancements in the assessment of language for professional purposes has been the move toward authentic assessment: language proficiency in situations and tasks that are specific to professional practice (i.e., situations with ecological validity). The goal of authentic assessment is to simulate the conditions of professional interaction. The conditions need to accurately sample the range of experiences that are most likely to be representative of professional practice. Although not necessarily identical to genuine experiences (e.g., the use of SPs in OSCEs), the interaction requires candidates to engage strategically in solving unpredictable problems by communicating in real time. Analytic criteria can be employed on rating scales to assess fluency, pronunciation, comprehensibility, vocabulary, grammar, and so on. Overall or holistic assessment allows the assessor to provide an overall impression of professional language proficiency. The OSCE is useful for the authentic assessment of professional language proficiency as it can employ multiple cases and examiners, authentic tasks, real-time communication, and psychometrically based rating scales. An example of such a scale is the Canadian Language Benchmark Assessment (CLBA), which has been adapted to reflect medical contexts (cultural-communication, legal, ethical, and organizational aspects of the practice of medicine). For more naturalistic assessment, both spoken and written language samples can be collected in situ and analyzed for authentic assessment of language and communication proficiency.


Workplace assessment and multisource feedback
The Physician Achievement Review (PAR) in Alberta is probably the best case of medical workplace assessment employing multisource feedback (MSF). This MSF was developed specifically for assessing the performance of practicing physicians. The PAR instruments were developed in the late 1990s in response to the College of Physicians and Surgeons of Alberta's initiative to improve the quality of medical practice.18 The approach consists of four types of data—self, medical colleagues (physicians), coworkers (non-physicians), and patients—all completing standardized questionnaires. The competencies assessed by the four instruments vary: clinical competence appears in the self and peer physician questionnaires only, while humanistic attributes, patient communication, professionalism, collegiality, and so on are in all the instruments. The PAR instruments have been used for assessing surgery, pediatrics, and other specialties.19 Violato et al.19 factor-analyzed the PAR instruments and identified a number of factors of physician performance, such as communication skills, patient management, clinical assessment, psychosocial skills and humanism, and professional development. The instruments were found to have evidence of adequate reliability (α > 0.90 for most scales; Ep2 > 0.65 for most assessments) and substantial evidence for validity (face, content, factorial, convergent, and discriminant). A comparative factor analytic study was conducted by Violato and Lockyer, comprising 2,306 peer surveys for 304 participating physicians from internal medicine, pediatrics, and psychiatry; a four-factor solution was reported (patient management, clinical assessment, professional development, communication).20 They reported adequate reliability (α = 0.98 and Ep2 = 0.83), and nearly 70% of the variance was accounted for by the four factors. These studies provide empirical evidence for the reliability and validity of the PAR instruments. A recent 5-year longitudinal MSF study of family physicians shows considerable stability and evidence of construct validity of the PAR instruments.21 This workplace-based approach employing MSF has also been used in the Netherlands, Britain, the United States, Bahrain, and elsewhere.

TEACHING AND PHYSICIAN COMPETENCE
Teaching has been regarded as part of medical competence and professionalism since antiquity.22 Over the millennia, physicians, from Hippocrates, Galen, and Osler to current professors, have taught patients, medical students, and other health professionals, through direct instruction, modelling, or publications. An interesting question that has arisen is, "Does teaching medicine make you a better doctor?" Perhaps teaching requires the physician to have a more complex and subtle understanding of the subject matter than do non-teachers. Moreover, perhaps the teachers have an understanding of how to convey this subject matter to novices and learners, thus developing meta-cognitive awareness of the knowledge, skills, attitudes, and professionalism of medical competence. Additionally, the need to role model medical competence and professionalism for learners may improve these qualities in the teacher.


Physicians themselves identify teaching as a factor that enhances performance, although existing data to support this relationship have been limited. Researchers in Alberta23 recently studied more than 3,500 physicians, some of whom had academic teaching appointments at university medical schools and some who did not, to determine whether there were differences in clinical performance scores, as assessed through MSF data, based on clinical teaching. MSF data for 1,831 family physicians, 1,510 medical specialists, and 542 surgeons were collected from physicians' medical colleagues, co-workers (e.g., nurses and pharmacists), and patients. Typically, an average of around eight colleagues, eight co-workers, and 25 patients anonymously rated the doctors on standardized instruments. These data were analyzed in relation to information about physician teaching activities, including the percentage of time spent teaching during patient care and academic appointments. The results indicated that higher clinical performance scores were associated with holding any academic appointment and, generally, with any time teaching versus no teaching during patient care. This was most evident for data from medical colleagues, where these differences existed across all specialty groups. Patient data results were mixed, as were the data for co-workers. For ratings by colleagues (i.e., other physicians), the results were clear: teachers among family physicians, medical specialists, and surgeons were rated as more competent, better clinicians, and more professional than were non-teachers. In this study, higher involvement in teaching was associated with higher clinical performance ratings from medical colleagues and co-workers for all physicians. These results suggest teaching as a method to enhance and maintain high-quality clinical performance and to promote the maintenance of competence and the improvement of professionalism. They also support the idea that teaching medicine is integral to competence and professionalism.

DIRECT OBSERVATIONS AND COMPETENCE
The evaluation of the clinical competence of medical students and residents can be done through the use of direct observation. There has been inconsistency, however, in how best to measure and compare performance on clinical skill domains. In response to this problem, the ABIM proposed the use of the mini-clinical evaluation exercise (mini-CEX) to evaluate learners in the completion of a patient history and physical examination that results in the demonstration of organized clinical judgment and efficient counseling skills. In its standard form, the mini-CEX is a seven-item global rating scale that is designed to evaluate medical students' and residents' patient encounters in about 15–20 min. The mini-CEX is specifically designed to assess the skills that learners require in actual patient encounters and also to reflect the educational requirements that are expected of those learners. As described by Norcini et al.,24 the multiple uses of the mini-CEX with trainees allow for greater variability across different patient encounters, which results in improved reliability and validity as a measure of clinical skill practice and development.


It is a performance-based evaluation method that is used to assess selected clinical competencies (e.g., patient interview and physical examination, communication, and interpersonal skills) within a clinical training context. The mini-CEX can be used widely in a broad range of clinical settings, and there is mounting psychometric evidence for its use in the direct observation of single-patient encounters. Kogan et al.,25 in a systematic review, identified the mini-CEX as having the strongest validity evidence of instruments for the direct observation and assessment of the clinical skills of medical trainees. In a recent meta-analysis, Alansari et al.26 supported this finding with evidence of construct and criterion-related validity of the mini-CEX as an important instrument for the direct observation of trainees' clinical performance and competence. The mini-CEX assesses seven dimensions: (1) medical interviewing, (2) physical examination, (3) humanistic qualities/professionalism, (4) clinical judgment, (5) counseling skills, (6) organization/efficiency, and (7) overall competence. According to Norcini et al.,24 the mini-CEX assesses candidates at the top of Miller's Pyramid (i.e., "shows how" and "does").

Growth of competence
Objective performance in medical competence domains (e.g., professionalism, patient management) can be measured by collecting and analyzing activities in situ, but collecting the necessary amount of data can be difficult. Nonetheless, this can be done through repeated-measures designs in which the student, resident, or physician is repeatedly observed in structured situations by trained observers and their performance is assessed objectively on standardized instruments (e.g., the mini-CEX). This growth, or rate of learning, can be usefully represented by learning curves. They allow assessment in real time and may allow us to determine how much practice or learning is required to achieve a particular level of competence. According to the Thurstone classical learning curve,27 learning increases with time or practice according to a negative exponential function (improvement decreases over time as learning occurs); the rate of learning can be determined from the slope of the curve, and maximal performance for the practice or time period is given by the upper asymptote. Learning curves have not been used very much in medical education generally. In a recent study,28 faculty members were trained to assess students and provide formative feedback during a clinical encounter with the adapted mini-CEX, utilizing principles of CBME and direct observation of third-year clerkship students (Table 2.1). There were 27 assessors and 108 students, for a total of 1,001 assessments during required clerkships from May to February. The mean number of assessments per student was 8.97 (SD = 2.57), with a range of 1–15. The mean time for assessment was 25.31 min and for feedback 19.69 min. Students were rated on a five-point scale based on their degree of entrustability on each competency (1 = not close to meeting criterion; 2 = not yet meets criterion; 3 = meets criterion; 4 = just exceeds criterion; 5 = well exceeds criterion). The data used for the analyses were primarily direct observations of student–patient encounters in naturalistic situations in the clinical environment during the clerkships.
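The negative exponential learning curve described above can be fitted directly to repeated observations. The sketch below does this with SciPy; the input values are illustrative (similar to the clinical reasoning means in Table 2.1), and the parameterization and starting values are assumptions for the example, not part of the original study.

```python
# Fit a Thurstone-style negative exponential learning curve,
# y(t) = a - (a - b) * exp(-c * t), where a is the upper asymptote
# (maximal performance), b is the starting level, and c is the learning rate.
import numpy as np
from scipy.optimize import curve_fit

days = np.array([27, 65, 108, 154, 213, 271], dtype=float)   # time in clerkship
scores = np.array([2.74, 3.01, 3.17, 3.22, 3.33, 3.44])      # illustrative mean ratings

def learning_curve(t, a, b, c):
    return a - (a - b) * np.exp(-c * t)

params, _ = curve_fit(learning_curve, days, scores, p0=[3.5, 2.5, 0.01])
a, b, c = params
print(f"asymptote a = {a:.2f}, starting level b = {b:.2f}, learning rate c = {c:.4f}")
print("predicted rating after one year:", round(learning_curve(365, *params), 2))
```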


Table 2.1  Means, standard deviations, and ranges of direct observations of patient encounters with the mini-CEX in structured clinical settings. Cell entries are mean (SD); columns are time in clerkship (mean days).

Mini-CEX competency                  27           65           108          154          213          271          Total
1. Communication (range 1–5)         3.08 (0.68)  3.44 (0.87)  3.40 (0.73)  3.48 (0.67)  3.60 (0.63)  3.67 (0.68)  3.44 (0.73)
2. Medical interview (range 1–5)     2.81 (0.79)  3.18 (0.91)  3.15 (0.71)  3.30 (0.72)  3.30 (0.67)  3.39 (0.68)  3.18 (0.77)
3. Physical exam (range 1–5)         2.63 (0.72)  3.09 (0.66)  3.01 (0.46)  3.16 (0.67)  3.19 (0.56)  3.22 (0.64)  3.03 (0.66)
4. Professionalism (range 2–5)       3.31 (0.72)  3.58 (0.79)  3.47 (0.69)  3.63 (0.64)  3.68 (0.68)  3.70 (0.71)  3.56 (0.72)
5. Clinical reasoning (range 1–5)    2.74 (0.70)  3.01 (0.83)  3.17 (0.64)  3.22 (0.65)  3.33 (0.65)  3.44 (0.67)  3.15 (0.73)
6. Management planning (range 1–5)   2.65 (0.70)  3.05 (0.78)  3.04 (0.62)  3.11 (0.65)  3.20 (0.59)  3.31 (0.57)  3.06 (0.68)
7. Organizational efficacy (1–5)     2.77 (0.73)  3.03 (0.84)  3.15 (0.61)  3.24 (0.62)  3.20 (0.60)  3.34 (0.66)  3.12 (0.70)
8. Oral presentation (range 1–5)     2.81 (0.74)  3.04 (0.77)  3.08 (0.59)  3.20 (0.57)  3.14 (0.49)  3.29 (0.51)  3.09 (0.64)
9. Overall assessment (range 1–5)    2.86 (0.61)  3.19 (0.75)  3.22 (0.57)  3.30 (0.56)  3.33 (0.57)  3.39 (0.58)  3.21 (0.63)

Observation & feedback               Range        Mean (SD)
Observing time (min)                 10–90        25.31 (11.05)
Coach feedback (min)                 5–45         19.69 (7.45)

The descriptive statistics for the variables in the analyses are presented in Table 2.1. As can be seen from these data, the smallest total mean for the mini-CEX items is 3.03 (physical exam), while the largest is 3.56 (professionalism). The SDs are in the two-thirds to three-quarters of a scale point range (mean SD = 0.73). All items range from 1 to 5, with the exception of professionalism (2–5). The mean time for assessment was 25.31 min (range: 10–90 min). The mean time for feedback to the students by the assessor was 19.69 min (range: 5–45 min).
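Summaries of this kind can be produced directly from the raw assessment records. A minimal sketch with pandas is shown below; the file and column names are hypothetical assumptions, not the study's actual data files.

```python
# Compute mean, SD, and range of mini-CEX ratings by time period,
# mirroring the layout of Table 2.1. File and column names are assumed.
import pandas as pd

df = pd.read_csv("minicex_assessments.csv")    # one row per observed encounter

# Bin days-in-clerkship into sextiles (six roughly equal-sized time periods).
df["period"] = pd.qcut(df["days_in_clerkship"], q=6)

summary = df.groupby("period")["clinical_reasoning"].agg(["mean", "std", "min", "max", "count"])
print(summary.round(2))
```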

Growth and learning curves
Figures 2.2 and 2.3 contain a longitudinal analysis for six time periods (sextiles) during the assessments (10 months; May 2016–February 2017). There were approximately 165 observations for each of the six time periods.


Figure 2.2  Growth of competence during clinical clerkships (May–February): mean ratings for physical examination, medical interview, clinical reasoning, and management planning plotted against days in clerkship (27–271), with performance bands from novice and learner through entrustable, proficient, and expert.

Figure 2.3  Growth of competence during clinical clerkships (May–February): mean ratings for communication, professionalism, organizational efficacy, and oral presentation plotted against days in clerkship (27–271), with the same performance bands (novice, learner, entrustable, proficient, expert).

Each of the eight competencies from the mini-CEX is depicted on the graphs. A close inspection of Figure 2.2 reveals that there was an increase in the means of the four competencies over time.

Components with eigenvalues > +1.0 account for more information or variance than a single variable and should be retained as factors. The Kaiser Rule is the most commonly used approach to selecting the number of components (factors): retain all components with eigenvalues ≥ 1.0. In other words, components that account for at least as much information as an average single item should be retained.
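As an illustration (not code from the book), the eigenvalues behind the Kaiser Rule can be computed directly from the item correlation matrix; the data file and column contents here are assumed.

```python
# Eigenvalues of a correlation matrix and the Kaiser rule (retain eigenvalues >= 1).
import numpy as np
import pandas as pd

df = pd.read_csv("minicex_items.csv")        # assumed file of item scores
R = df.corr().to_numpy()                     # item intercorrelation matrix

eigenvalues = np.linalg.eigvalsh(R)[::-1]    # sorted from largest to smallest
explained = 100 * eigenvalues / eigenvalues.sum()

for i, (ev, pct) in enumerate(zip(eigenvalues, explained), start=1):
    decision = "retain" if ev >= 1.0 else "drop"
    print(f"component {i}: eigenvalue = {ev:.3f}, variance = {pct:.1f}%  ({decision})")
```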

Rotation of factors
A primary goal of factor analysis employing extraction is to discover the optimum number of factors in the solution that (1) accounts for the maximum variance but (2) remains parsimonious. This is usually the first step in factor analysis and results in an "unrotated" matrix. It is a compromise between a large number of factors to account for maximal variance and the fewest factors to maintain parsimony or simplicity. A second major goal is to rotate the factors so as to maximize their interpretability, with a "rotated" matrix producing a simple structure. This is a pattern of loadings in which each item loads strongly (i.e., >0.70) on only one of the factors but only weakly on the other factors. The following guidelines can be used to evaluate the magnitude of factor loadings:

Loading      Overlapping variance (%)   Evaluation
>0.71        >50                        Excellent
0.63–0.70    40–49                      Very good
0.55–0.62    30–39                      Good
0.45–0.54    20–29                      Fair
0.32–0.44    10–19                      Poor

Methods of rotation
Varimax, the most common rotation option, is an orthogonal rotation of the factors that maximizes the variance of the squared loadings on each factor. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. Direct oblimin rotation is the method for a non-orthogonal (oblique) solution: that is, the factors are allowed to be correlated with each other. While there are other rotation methods, they are not widely applicable in medical education.
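For readers curious about what the varimax criterion does computationally, here is a compact sketch of the standard SVD-based algorithm (this is an illustrative implementation, not the SPSS code, and the example loading matrix is made up).

```python
# Varimax rotation of a (variables x factors) loading matrix to simple structure.
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    p, k = loadings.shape
    R = np.eye(k)                     # accumulated rotation matrix
    d_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion
        G = loadings.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(G)
        R = u @ vt
        d_new = s.sum()
        if d_new < d_old * (1 + tol):
            break
        d_old = d_new
    return loadings @ R

# Example: unrotated loadings of four variables on two components (made-up numbers)
A = np.array([[0.80, 0.30],
              [0.75, 0.35],
              [0.40, 0.70],
              [0.35, 0.75]])
print(np.round(varimax(A), 3))
```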

Interpretation of factors
Factors are usually interpretable when some observed variables load highly on them and the rest do not. Ideally, each variable would load on one and only one factor. The greater the item loading, the more the variable is considered to be a clean measure of the factor. Sometimes there are "split" loadings, when a variable loads on two factors (e.g., 0.65 on one factor and 0.52 on another). This means that the variable measures elements of both factors. The cutoff size of loading to be interpreted is somewhat subjective, but a good rule of thumb is that only variables with loadings of 0.40 and above are interpreted. A final aspect of the interpretation is to characterize a factor by assigning it a name, a process that is just as much an art as it is a science. Interpretation of factors can be based on the variable names with the largest loadings and on the types of variables that group together through their correlations with the factors (e.g., several reading, writing, and speaking tests may group together, so we call the factor "Language"). The usefulness, replicability, and complexity of factors are also considered in interpretation. Is the solution replicable with different samples? Are some factors trivial or outliers, or do they fit in the hierarchy of theoretical explanations about a phenomenon? The most critical element of a factor analysis is the identification of the number of factors, even more important than extraction and rotation techniques.


Extraction effectiveness is tied to the number of factors. The more factors extracted, the better the fit and the greater the percentage of variance in the data "accounted for" by the factor solution. The more factors extracted, however, the less parsimonious the solution. Extracting as many factors as variables would account for 100% of the variance, but this is a trivial solution. The main goal is to extract as few factors as possible while accounting for as much variance as possible; parsimony (as few factors as necessary) is an important goal and should not be lost. In confirmatory factor analysis (Chapter 7), the number of factors is selected to correspond to the number of theoretical processes underlying a research area. A hypothesized factor structure can be confirmed by asking whether the theoretical number of factors adequately fits the data. Construct validity evidence for the factors requires that scores on the latent variables (factors) correlate with scores on other variables, or that scores on the latent variables change with experimental conditions as predicted by theory.

Running factor analysis
The following example illustrates the use of factor analysis. Direct observation has been widely accepted as a good way to evaluate the clinical competence of medical students and residents. The mini-clinical evaluation exercise (mini-CEX) has been proposed for assessing a set of clinical competencies (e.g., medical interview, physical examination, professionalism, and communication) in the completion of a patient history within a medical training context. The mini-CEX is designed to reflect educational requirements during the teaching round and to assess skills that residents require in actual patient encounters. Several useful characteristics, including direct observation, ready use in day-to-day practice, and immediate feedback to the learner after the physical examination, make the mini-CEX an excellent educational tool that helps learners become aware of their strengths and opportunities for improvement. The following study illustrates the use of the mini-CEX with third-year medical students participating in mandatory clerkship rotations (surgery, pediatrics, obstetrics/gynecology, internal medicine, psychiatry, emergency medicine, family medicine, neurology, radiology). There were 108 students (57 men, 52.8%; 51 women, 47.2%) in the study. Students were observed by faculty members in real patient encounters in situ. An adapted version of the mini-CEX was employed to directly assess clerkship students' medical competence (Table 4.3). Students were rated on a five-point scale based on their degree of entrustability on each competency (1 = not close to meeting criterion; 2 = not yet meets criterion; 3 = meets criterion; 4 = just exceeds criterion; 5 = well exceeds criterion). The data were entered into SPSS. To run a factor analysis with SPSS, select "Dimension Reduction" and then "Factor." The factor analysis dialogue window should open.

Table 4.3  Adapted version of the mini-CEX to directly assess clerkship students' medical competence

Mini-CEX Assessment Form
Assessor __________________  Student __________  Date _____________
Patient Problem/Dx(s) ____________________________  Patient Age ___  Patient Sex: M/F
Setting (I/O) ________  ENCOUNTER COMPLEXITY: Low/Moderate/High
Mini-CEX time: Observing ___ min   Assessor providing feedback to student ___ min

Variables                                 Mean (SD)      Median   Min     Max      Skewness
1. Communication skills                   3.47 (0.71)    3.00     1.00    5.00     0.06
2. Medical interviewing skills            3.22 (0.76)    3.00     1.00    5.00     0.13
3. Physical examination skills            3.09 (0.64)    3.00     1.00    5.00     0.20
4. Professional/humanistic qualities      3.57 (0.69)    3.00     2.00    5.00     0.44
5. Clinical reasoning                     3.23 (0.73)    3.00     1.00    5.00     0.21
6. Management planning                    3.13 (0.70)    3.00     1.00    5.00     0.31
7. Organization/efficacy of encounter     3.16 (0.71)    3.00     1.00    5.00     0.31
8. Oral presentation                      3.25 (0.64)    3.00     1.00    5.00     0.36
Observing time (min)                      25.94 (11.0)   25.00    10.00   100.00   3.01
Feedback time (min)                       20.19 (7.0)    20.00    5.00    45.00    0.63
Encounter complexity                      2.06 (0.57)    2.00     1.00    3.00     0.01



Select the mini-CEX items and move them to the "Variables:" pane.
1. Select "Extraction." The Extraction dialogue window should open.
2. The default "Method" should be "Principal components." If not, select it.
3. "Extract" should be "Based on eigenvalue," with "Eigenvalue greater than 1" as the default. If not, select it. (You could extract any number of factors by selecting "Fixed number of factors" if you had a good reason—theoretical or practical—to extract a particular number of factors that may not correspond to eigenvalues greater than 1.)
4. The maximum number of iterations for this iterative process is 25. Leave this as the default unless you have a very good reason to change it.
5. Click on "Continue."



Select the "Rotation" button in the dialogue box.
6. Under "Method," select "Varimax."
7. In "Display," "Rotated solution" should be selected. If not, select it.
8. The maximum number of iterations for this iterative process is 25. Leave this as the default unless you have a very good reason to change it.
9. Click on "Continue."


Click on "Options" in the dialogue box.
10. For "Missing Values," "Exclude cases listwise" should be selected.
11. Under "Coefficient Display Format," set "Absolute value below:" to 0.40.
12. Click on "Continue." You should now be back at the main dialogue box, and the factor analysis is ready to run. Click on "OK" and the procedure should execute.
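For readers who prefer to work outside SPSS, a rough Python equivalent of this run is sketched below. It assumes the open-source factor_analyzer package is installed and that its recent API (a principal-axis method and varimax rotation) is available; the package, its call signatures, and the data file name are assumptions and are not part of the book.

```python
# Approximate Python equivalent of the SPSS principal components + varimax run.
import pandas as pd
from factor_analyzer import FactorAnalyzer   # third-party package, assumed available

df = pd.read_csv("minicex_items.csv")        # assumed file containing the nine items

fa = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
fa.fit(df)

print("eigenvalues:", fa.get_eigenvalues()[0].round(3))
print("rotated loadings:")
print(pd.DataFrame(fa.loadings_, index=df.columns).round(3))
print("proportion of variance explained:", fa.get_factor_variance()[1].round(3))
```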

Output from SPSS
A large amount of output will ensue from the analysis. The key is to identify and interpret the output relevant to the present analysis and problem. We begin with Table 4.4, which contains the initial eigenvalues and the variance accounted for.



Table 4.4  Total variance explained: principal component extraction

                                Initial eigenvalues
Component                       Total    Variance (%)   Cumulative %
1. Communication                5.774    64.151         64.151
2. Medical interview            1.225    8.051          72.202
3. Physical exam                0.586    6.514          78.716
4. Professionalism              0.451    5.009          83.726
5. Clinical reasoning           0.390    4.330          88.056
6. Management planning          0.323    3.584          91.640
7. Organizational efficacy      0.287    3.191          94.831
8. Oral presentation            0.264    2.930          97.761
9. Overall                      0.202    2.239          100.000

The eigenvalues are listed from largest to smallest for the nine variables. Notice that Component 1 has a corresponding eigenvalue of 5.774, which accounts for 64.151% of the variance in the data. The magnitude of the next several eigenvalues decreases markedly, each accounting for a small amount of the variance. A close inspection of Table 4.4 reveals that there are two eigenvalues greater than 1, which together account for 72.202% of the variance. Additionally, the table shows how much of the variance in this solution is accounted for by each component.


Component 2, for example, accounts for 8.051% of the common variance as determined by the rotation sums of squared loadings. Therefore, two factors are selected in this analysis.

Interpreting the factors
Table 4.5 contains the varimax (orthogonally) rotated factors; convergence occurred in three iterations. Based on the factor loadings, theoretical meaning, and coherence, the next task is to name the factors. Factor 1 is Clinical Competence because the main large loadings (0.680–0.814) are from the following variables, all reflecting clinical competence:

1. Physical exam
2. Clinical reasoning
3. Management planning
4. Organizational efficacy
5. Oral presentation

Factor 2 (Professionalism & Communication) is named based on the factor loadings, theoretical meaning, and coherence, as it has large primary loadings from communication, medical interview, and professionalism, ranging from 0.649 to 0.841.

Table 4.5  Varimax-rotated component matrix (convergence in three iterations; loadings below 0.40 suppressed)

Variables                     Factor 1:               Factor 2: Professionalism
                              Clinical competence     and communication
1. Communication                                      0.813
2. Medical interview          0.518                   0.649
3. Physical exam              0.704
4. Professionalism                                    0.841
5. Clinical reasoning         0.814
6. Management planning        0.808
7. Organizational efficacy    0.680                   0.510
8. Oral presentation          0.710                   0.468
9. Overall                    0.733                   0.531

There are several "split loadings" (variables that load on more than one factor) in Table 4.5: medical interview, management planning, organizational efficacy, oral presentation, and overall. Variable #2 (Medical interview), for example, loads both on Factor 1 (Clinical Competence) at 0.518 and on Factor 2 (Professionalism & Communication) at 0.649. This split makes theoretical sense, as medical interviewing skill is part of both clinical competence and communication skills and professional demeanor. The other split loadings also make theoretical sense, since they are part of more than one factor.

The results of this factor analysis allow us to identify the underlying theoretical structure of the various distinct but interrelated variables of the mini-CEX. The following conclusions can be drawn from the results.

1. A number of items (9) are reducible to a few basic underlying elements, latent variables, or factors.
2. Based on the eigenvalues > 1.0 rule, there are two clear factors.
3. The two factors account for nearly three-fourths (72.20%) of the total variance, a good result.
4. The magnitude of the variance accounted for by the factors themselves provides theoretical support for construct validity. Clinical Competence accounts for the largest proportion of the variance (64.15%), while Professionalism & Communication accounts for a smaller proportion (8.05%). This is a theoretically meaningful result.
5. The factors overall are meaningful and cohesive and provide theoretical support. The split loadings also provide supporting evidence, since it is expected that some items are part of more than one factor.

The overall factor analysis then helps us reduce the complexity of understanding the patient encounter of medical students into fundamental underlying factors.

DISCRIMINANT ANALYSIS
Sometimes we wish to investigate factors that result in group differences (e.g., passing or failing an assessment). Candidates can be grouped based on prior performance on related assessments. We can investigate, for example, whether performance on institutionally based tests (e.g., physiology, cellular biology, etc.) is related to passing the USMLE Step 1 board examination by employing discriminant analysis. This procedure can evaluate whether, based on a calculated discriminant function, participants can be correctly categorized into pass/fail groups from prior test scores. Can our discriminant analysis correctly classify pass/fail students based on the prior test scores? If so, this may allow the school to provide early detection of students who may be at risk of failing Step exams. For the purposes of discriminant analysis, a grouping (dependent) variable is identified (e.g., pass/fail). The independent variables are the test scores from the Year 1 institutional exams. During the analysis, a central mean score is calculated for the dependent variable, and Wilks' lambda statistic indicates whether there is a statistically significant difference between the two group means. A canonical correlation is calculated to indicate the strength of the relationship between the test scores and group membership. A canonical discriminant function coefficient is calculated for each test; the larger the absolute value for a variable (test), the greater its contribution to the discrimination between the two groups.


Finally, a classification table is created that compares the actual classification of candidates based on their test scores with the statistical classification of group membership based on a classification function created by the statistical analysis of the test data. More precisely, how well can we classify students into pass/fail on Step 1 based on our discriminant function? Effective classification based on this known-groups analysis may allow early detection of students academically at risk.
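A minimal sketch of such an analysis is shown below using scikit-learn's linear discriminant analysis; the file name, course names, and outcome coding are hypothetical assumptions for illustration.

```python
# Sketch of a discriminant analysis: classify pass/fail on USMLE Step 1 from
# Year 1 institutional test scores (file and column names are assumptions).
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

df = pd.read_csv("year1_scores.csv")
X = df[["physiology", "cellular_biology", "anatomy"]]   # prior test scores
y = df["step1_pass"]                                     # 1 = pass, 0 = fail

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print("discriminant function coefficients:", lda.coef_.round(3))
print("classification table:")
print(confusion_matrix(y, lda.predict(X)))
print("cross-validated accuracy:", round(cross_val_score(lda, X, y, cv=5).mean(), 3))
```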

CLUSTER ANALYSIS
Cluster analysis categorizes data into homogeneous groups. It has been useful in several applications in medical education, such as exploring the validity of standard setting for cutoff scores. Cluster analysis can be used to identify groups of similar performances in a cohort using mathematical concepts of distance and similarity among performances. Cluster analysis is considered more objective than human judgment in setting a cutoff score for passing an exam. Thus, it has been used for validating standard-setting methods that require expert judgments to set pass/fail percentages, for example. Can a group of experts set authentic cutoff scores to make pass/fail decisions about students? To explore the validity of expert judgment for this purpose, cluster analysis can be used to identify the natural number of distinct groups (pass versus fail) of students based on their performance. Similarly, we can identify three groups: pass, fail, and borderline. Based on the performance patterns of the three clusters on an assessment (e.g., clinical skills), we can identify them as pass, borderline, and fail. The valid cutoff score may lie at some point between the borderline and fail scorers.
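The sketch below illustrates this idea with k-means clustering; the data file, column name, and three-cluster choice are assumptions for the example rather than a prescribed standard-setting procedure.

```python
# Sketch of cluster analysis for standard setting: look for natural
# pass / borderline / fail groupings in clinical skills scores (assumed data).
import pandas as pd
from sklearn.cluster import KMeans

scores = pd.read_csv("clinical_skills_scores.csv")[["total_score"]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
scores["cluster"] = kmeans.labels_

# Order clusters by their mean score and label them fail / borderline / pass.
order = scores.groupby("cluster")["total_score"].mean().sort_values().index
labels = dict(zip(order, ["fail", "borderline", "pass"]))
scores["group"] = scores["cluster"].map(labels)

print(scores.groupby("group")["total_score"].agg(["count", "min", "max", "mean"]))
```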

SUMMARY AND MAIN POINTS
The correlation is a statistical technique to study the relationship between and among variables. Several other statistical techniques have evolved based on the original Pearson's r: rank-order correlation, biserial correlation, regression analyses, factor analysis, discriminant analysis, and cluster analysis.

● The correlation coefficient, r, must always take on values between +1.0 and −1.0, since it has been standardized to fit into this range. The sign indicates the direction of the relationship (i.e., either positive or negative); the magnitude indicates the strength of the relationship (weak as r approaches 0; strong as r increases and approaches +1.0 or −1.0); and the coefficient of determination (r2) is an indicator of the variance accounted for in x by y. Cause and effect cannot be interpreted from correlation, which only indicates a relationship.
● Other correlations are important in assessment and measurement: Spearman's rho coefficient and the biserial (rb) and point-biserial (rpb) correlation coefficients. When both x and y are continuous, Pearson's r is the appropriate coefficient. If both x and y are ordinal, then Spearman's rho is the correct correlation. When x is continuous and y is artificially nominal, rb is the correct coefficient. Finally, when x is continuous and y is naturally nominal, rpb is correct.
● We employ several independent variables to improve the prediction of the outcome on a dependent variable. The multiple regression equation is Y′ = β1X1 + β2X2 + … + βkXk + c.
● Factor analysis is a collection of methods used for exploring the correlations among a number of variables, seeking the underlying clusters or subsets called factors or latent variables. It addresses the following: the number of factors needed to summarize the pattern of correlations in the correlation matrix; the amount of variance in a dataset accounted for by each factor; the factors that account for the most variance; and the meaning of the factors and how they are to be interpreted.
● Discriminant analysis can evaluate whether, based on a calculated discriminant function, participants can be correctly categorized into pass/fail groups based on prior test scores.
● Cluster analysis categorizes data into homogeneous groups. It has been useful in several applications in medical education, such as exploring the validity of standard setting for cutoff scores.

REFLECTION AND EXERCISES
Reflections 4.1: Write a brief response (250 words maximum) for each item below.
1. Describe the use of correlation in health sciences education.
2. Compare and contrast r, rho, regression, and biserial correlation.
3. What is multiple regression and how is it used?
4. Summarize the theory underlying factor analysis.
5. Compare and contrast discriminant and cluster analysis.

Exercise 4.1: Clerkship clinical scores
The dataset below contains clerkship clinical scores (pediatrics, surgery, internal medicine), grade point average (GPA), and a knowledge MCQ test. Enter the data into an SPSS file.
1. Compute descriptive statistics for all five variables. Describe the nature of the distributions (i.e., heterogeneity—variance/SD, skewness, central tendencies)—250 words maximum.
2. Compute Pearson's r among all five variables. Interpret the results (i.e., magnitude, direction, and coefficient of determination)—250 words maximum.


ID   Peds   Surgery   IM   GPA   MCQ
1    70     35        31   3.0   112
2    80     40        33   2.0   93
3    65     27        38   2.2   98
4    64     32        32   3.1   114
5    55     26        29   2.0   95
6    62     34        28   2.5   106
7    58     34        24   2.5   107
8    73     43        30   3.3   132
9    67     35        32   3.5   123
10   62     36        26   3.3   127
11   68     34        34   3.9   136
12   60     31        29   2.6   102
13   64     36        28   2.1   92
14   59     28        31   2.5   98
15   52     27        25   1.8   87
16   63     35        28   2.0   90
17   69     35        34   3.8   134
18   58     24        34   1.7   83
19   47     22        25   3.4   133
20   65     33        32   3.4   124
21   56     29        27   2.5   104
22   53     33        20   2.4   110
23   59     30        29   2.7   109
24   52     22        30   2.3   104
25   60     33        27   3.0   111
26   62     31        31   3.2   109
27   49     28        21   2.4   100
28   52     31        21   2.1   97
29   52     23        29   2.6   103
30   75     42        33   3.1   110
31   53     24        29   2.9   103
32   59     29        30   2.4   96
33   57     26        31   2.3   95
34   54     31        23   3.4   117
35   51     33        18   2.9   100
36   66     37        29   2.4   101
37   69     40        29   2.5   96
38   59     25        34   2.0   88
39   71     37        34   1.8   85
40   56     27        29   2.0   90


Exercise 4.2: Factor analysis Using the above data, run a factor analysis (i.e., determine the number of factors, the rotation method, etc.). Interpret the results (e.g., how much variance was accounted for, name the factors, specify if the factors are cohesive)—500 words maximum

Exercise 4.3: Multiple regression Using the above data, run a multiple regression with GPA as the dependent variable and the remaining variables as independent. Use a stepwise regression method. Interpret the results (e.g., how much variance was accounted for, which is the optimal model)—500 words maximum

REFERENCES 1. Pearson K. (1895) Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58: 240–242.

Section II  Validity and reliability

IMPORTANT CHARACTERISTICS OF MEASUREMENT INSTRUMENTS
There are three critical elements of any assessment procedure or measurement instrument: validity, reliability, and usability or feasibility. Reliability has to do with the consistency of measurement; usability deals with the practicality of using an assessment procedure; validity has to do with the extent to which the instrument measures whatever it is supposed to measure. That is, validity focuses on the question of how well an assessment carries out its intended function: the extent to which an assessment or test measures whatever it is supposed to measure. Face validity, the most superficial of the four types, focuses on the test's appearance: does it appear to fit its intended purpose? Content validity deals with the issue of the adequacy of the sampling of the test: does it contain the correct content? Both of these types of validity can be considered logical validity. Criterion-related validity, an empirically based concept, pertains to the correlations between a test and current performance on other relevant criteria (concurrent validity) and a test's ability to predict future performance on a relevant criterion (predictive validity). Finally, construct validity subsumes all of the other forms of validity in establishing the extent to which the test measures the hypothesized entity, process, or trait. A fifth form of validity has been proposed recently: consequential validity.* This refers to the consequences of the use of a particular assessment instrument or test.


There may be positive and negative outcomes for the test-takers, educational institutions, and society as a whole. Consequential validity may also be referred to as utility, in that it refers to a measure's application. The use and social consequences of assessment as a whole have received little empirical examination in comparison to statistical validity (criterion-related and construct). A measure cannot be used appropriately—and therefore lacks validity—if it does not measure what it purports to measure. Conversely, a measure will not be valid—quantify what it is intended to measure—if it is not used appropriately for its intended purpose. Each purpose affects the other (empirical evidence and consequences of use), so both purposes have been referred to as a type of validity.
* Messick S. Validity. In RL Linn (Ed.), Educational Measurement (3rd ed., pp. 13–103). New York: Macmillan, 1989.
While there are three critical elements of any measurement instrument, including educational tests—validity, reliability, and usability—reliability is a precondition to validity. A test must be reliable to be valid. A test which is reliable, however, is not necessarily valid. Reliability, therefore, is a necessary but not sufficient condition for validity. (For a further discussion of reliability see Chapters 8 and 9.) Finally, neither validity nor reliability can guarantee usability or practicality. The polygraph illustrates an instrument with low practicality under some conditions. This device measures physiological arousal on various modalities such as the electroencephalograph, galvanic skin response, heart rate, and so on. It is highly reliable and valid for measuring fear responses in the laboratory. The polygraph is not very practical or usable for the same measurement under naturalistic conditions, because a person must be strapped to electrodes and other equipment. This limitation greatly reduces the use value of the polygraph. Before we begin a detailed examination of validity, let us further illustrate the interrelations between validity, reliability, and usability. If your wristwatch is five minutes fast and you always keep it that way, it is highly reliable but lacks validity (since it produces the wrong time). Alternatively, if your watch is both consistent and tells the correct time, then it has both high reliability and validity. Finally, your watch may sometimes run fast and sometimes run slow in unpredictable ways, so that it is neither reliable nor valid. In all of the above cases, however, the watch has high usability, as you can strap it onto your wrist and go about your daily affairs. By contrast, a grandfather clock or an atomic clock may be far more reliable and valid than your watch, but both lack practicality and usability. Similarly, the balance-arm weight scales that are common in physicians' offices are probably more reliable and valid than your bathroom weight scale, but they are not as usable or practical (due to cost and size). Ultimately, the value of a measuring instrument (including educational tests) must involve a carefully balanced consideration of reliability, validity, and usability.

THE NATURE OF VALIDITY
Does a ruler measure length? Does a clock measure time? Does a thermometer measure temperature or a speedometer measure velocity? These questions may appear trite and the answers self-evident, but they are the essence of validity.


Does the Stanford–Binet (an IQ test) really measure intelligence, or the Law School Admission Test (LSAT) measure a predilection for legal knowledge? Does the Medical College Admission Test (MCAT) measure an ability to acquire medical knowledge? Did the final exam in history that you wrote in college really measure your knowledge of history? These questions are less trite than the first set and the answers are less evident. Indeed, these are essential questions about test validity. These questions have to do with the nature of validity. In daily life, we are surrounded by a myriad of measurements that we rarely question. We automatically accept the information imparted by the thermometer, the wristwatch, the weight scale, the ruler, and many other instruments. We may not be so quick to accept without question the results of an anatomy test, however. Perhaps we felt the test was unfairly difficult or didn't ask central questions about the subject matter. These are concerns about the test's validity. Validity, however, is not an all-or-none phenomenon—it is not discrete. It is a matter of degree. An atomic clock is more valid than an inexpensive wristwatch, but both validly measure time. Your physician's weight scale is more valid than your bathroom scale, though both measure weight. Another point about validity is that it is situation specific. The validity of an instrument is clearly restricted to particular conditions. A speedometer is clearly valid for measuring velocity but not temperature. Similarly, the Stanford–Binet is valid for predicting academic achievement (to some extent) but not for predicting eventual happiness in life, economic success, or leadership qualities. The central question for the validity of a test is "What is it valid for?" To answer this question adequately, there are four levels of analysis:
1. Face Validity
2. Content Validity
3. Criterion-related Validity
   a. predictive
   b. concurrent
4. Construct Validity
In some conventional discussions, three forms of validity are generally recognized (content, criterion-related, and construct). Face validity is not regarded as a form of validity per se. A joint committee of the American Psychological Association, National Council on Measurement in Education, and the American Educational Research Association has prepared a document, Standards for Educational and Psychological Testing (Washington, DC: American Psychological Association, 2014), which describes validity and sets standards for test use. In this treatment, face validity is not regarded as a form of validity. Some textbook writers also do not discuss this form as a separate kind of validity. Nevertheless, in the present discussion and in Chapter 5, it is discussed as a separate type of validity because it influences classroom climate and many aspects of assessment in the health professions. Therefore, for present purposes, face validity is considered a form of validity.


Face validity has also become particularly important recently because of the emphasis on "performance" assessment, such as assessing clinical competency, where face validity is critical. The levels of validity are generally regarded as hierarchical, so that a higher level includes those below it. It is not always necessary to analyze every level of validity for every situation or use of an assessment. Indeed, there are some useful rules of thumb for classifying the validities and the procedures for establishing them. These are summarized in Table 1. The first two levels of validity, face and content, are not fully empirical in nature. The other two levels, criterion-related and construct, are empirical in nature. In short, face and content validity can be established without recourse to empirical evidence, while criterion-related and construct validity require data for their demonstration.

Table 1  Four types of validity

Face Validity
  Definition: The appearance of the instrument
  Evidence: Non-empirical
  How to establish: Make the instrument appear appropriate

Content Validity
  Definition: The adequacy with which an instrument samples the domain of measurement
  Evidence: Non-empirical
  How to establish: Construct a table of specifications (blueprint)

Criterion-related Validity
  Definition: The extent to which the instrument relates to some criterion of importance
  Evidence: Empirical (correlation, regression, exploratory factor analysis, discriminant analysis, etc.)
  How to establish: Study the relationship of the scores with some criterion of importance

Construct Validity
  Definition: The psychological processes that underlie measurement on the assessment device
  Evidence: Empirical (multiple correlation, confirmatory factor analysis, structural equation modelling, experimental)
  How to establish: Manipulate the test scores through experimental procedures and observe results. Study relationships among the latent variables or constructs

THE NATURE OF RELIABILITY
Reliability has to do with the consistency of measurement. It is a necessary condition for validity.


While reliability is a precondition for validity, it does not guarantee it. That is, a measurement instrument which is reliable is not necessarily valid. As we have seen, a clock which always runs ten minutes fast is reliable since it is consistent, but it is not valid since it gives the incorrect time. It is evident, then, that reliability is a necessary but not sufficient condition for validity. Alternatively, an instrument which is valid must be reliable. What is meant by the consistency of measurement? What factors can lead to inconsistency of measurement? How can you tell whether some measurement is consistent or not? These questions are at the heart of the concept of reliability. A simple example can help to clarify the nature of reliability. Suppose that you measured the length of your kitchen table with a ruler and it turned out to be 50 inches long. Later that day, you began to doubt this measurement, so the next morning you measured the table again. This time it turned out to be 46 inches long. Which is correct? To settle the matter, you measured the table a third time, and this time it was 53 inches long. Still unsure, you measured the table several more times, with a different result each time. What is the problem here? There are two possibilities: (1) the table keeps changing length every time it is measured, or (2) something is wrong with your measurement instrument. Under normal conditions, of course, tables do not change length, so in this case the problem must be the measurement instrument. Upon closer examination you discover that your ruler is made of rubber and the differing results are due to the inconsistency of the measurement. This is why, of course, rulers are made from wood, plastic, or metal, so that they don't stretch and shrink from time to time. This example is not merely unlikely or trite, though, as many tests and assessments in health professions education do behave like rubber rulers: they produce inconsistent results or unreliable measurements. The concern with reliability in educational, psychological, and health professions education measurement is paramount because of the difficulty of producing consistent measures of achievement and psychological constructs. In measuring physical properties of the universe, reliability is usually not a big problem (for example, think of measuring height, velocity, temperature, or weight), while it is a central problem for measuring the educational and performance characteristics of health professionals and other people. Reliability is a multifaceted concept rather than a singular idea. Indeed, there are several ways of thinking about and discussing reliability. Four methods of establishing reliability are usually recognized. The four methods or techniques for determining the reliability of a measurement instrument are summarized in Table 2 and listed below.
1. Test–Retest
2. Parallel Forms
   a. given at the same time
   b. given at different times
3. Split-Half
4. Internal Consistency


Table 2  Various methods of estimating reliability

Test–retest
  Reliability measure: Stability over time
  Procedure: Give the same test to the same group of subjects at different times (hours, days, weeks, months, etc.)

Parallel forms (same time)
  Reliability measure: Form equivalence
  Procedure: Give two forms of the same test to the same group at the same time

Parallel forms (different time)
  Reliability measure: Form equivalence and stability over time
  Procedure: Give two forms of the same test with a time interval between the two forms

Internal consistency
  Reliability measure: Split-half, Cronbach's α, KR20, Ep2
  Procedure: Give the test once and apply the Kuder–Richardson formula, the Cronbach's alpha coefficient formula, or generalizability analysis. Apply analysis of variance methods to get variance components.

These techniques are not only different methods for establishing reliability but each type produces a somewhat different type of reliability as well. While all forms of reliability focus on consistency, there are different aspects of the testing to which the consistency is relevant. These various methods of understanding and determining reliability are discussed in detail in Chapters 8 and 9.
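As a concrete illustration of the internal-consistency approach, the sketch below computes Cronbach's alpha from a single administration using the standard formula, alpha = (k/(k − 1)) × (1 − Σ item variances / variance of total scores). The ratings shown are made-up numbers for the example only.

```python
# Internal consistency: Cronbach's alpha from a single test administration.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array-like, rows = examinees, columns = test items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative ratings for five examinees on four items (made-up numbers)
ratings = [[4, 3, 4, 5],
           [2, 2, 3, 2],
           [5, 4, 4, 4],
           [3, 3, 2, 3],
           [4, 4, 5, 4]]
print(round(cronbach_alpha(ratings), 2))   # approximately 0.90 for these data
```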

5  Validity I: Logical/Face and content

ADVANCED ORGANIZERS
• This chapter deals with two forms of validity, logical and content, commonly referred to as face and content validity.
• Face validity has to do with appearance: does the test appear to measure whatever it is supposed to measure? Face validity provides an initial impression of what a test measures but can be crucial in establishing rapport, motivation, and setting classroom climate.
• An issue with face validity is that many educators, psychologists, and others judge assessments only on the basis of face validity. When judgments are made under uncertainty, a number of cognitive biases and heuristics involving superficial aspects of the assessments result in an overreliance on face validity.
• Content validity involves sampling or selecting. The domain of measurement must be clearly defined, detailing the cognitive processes involved using the levels of Bloom's taxonomy (knows, comprehends, applies, analyzes, synthesizes, evaluates/creates).
• Enhancing content validity may be achieved most directly through the use of a table of specifications (TOS). A well-designed and carefully developed TOS will provide a sound plan for a test. The closer the match between the test's sampling of the content and the learning outcomes, the higher the content validity.
• Prior to developing a TOS, we must know what the instructional objectives of the program or course are. Instructional objectives are goals of instruction that specify student or learner behavior as an outcome of instruction.
• The Delphi procedure, a method employing a systematic procedure to achieve consensus of group judgments, may be used to further enhance the content validity of a test or other assessment. Evidence for content validity is therefore supported by consultations with experts in the relevant field and a high degree of agreement among experts on the relevance of the contents for measuring the domain of interest.

115

116  Validity I: Logical/Face and content

LOGICAL AND CONTENT
This chapter deals with the first two forms of validity, logical and content, which are most commonly referred to as face and content validity (see Table 1 in section 2).

FACE VALIDITY
Although it is sometimes considered to be only a superficial type, face validity can play a crucial role in many assessment situations. Face validity has to do with appearance: does the test appear to measure whatever it is supposed to measure? From the candidate's perspective, does the instrument seem to measure the relevant domain? Besides providing an initial impression of an assessment, face validity can be crucial in establishing rapport, motivation, and determining how seriously it will be taken. Little can undermine a respondent's motivation and seriousness more quickly than an assessment that is perceived to be inappropriate or unfair. Face validity thus plays a pivotal role in assessment, as we shall see in this chapter. While some testing authorities understate the importance of face validity and even regard it as a misnomer, it can play a crucial role in many testing situations. In addition to the question, "Does the test appear to measure whatever it is supposed to measure?" another relevant question is, "Does the test look right and fair?" Why is this so important? Besides providing an initial impression of what a test measures, face validity can be crucial in establishing rapport, motivation, and setting classroom climate. There is little that makes a group of students hostile as quickly as a test that is perceived to be inappropriate or unfair. Here are typical medical student-written comments on a gastrointestinal exam they perceived as unfair:

• "This was not adequately covered in lecture and certainly wasn't a learning objective"
• "We DID NOT do histology"
• "These questions are unfair"
• Another student offered a divine invocation: "DIOS MIO!"

Students may react with dismay and anger toward the exam, instructor, and course, resulting in poor motivation and a generally negative classroom climate. Face validity thus plays an important role in testing. The Wechsler tests of intelligence also provide examples of the importance of face validity. These tests were developed by David Wechsler at Bellevue Hospital in New York to be used with Americans. Two main tests, the Wechsler Intelligence Scale for Children, now in its fifth edition (WISC-V), and the Wechsler Adult Intelligence Scale, currently in its fourth edition (WAIS-IV), have been widely used historically throughout the world to measure IQ. The Information subtest, for example, contains general knowledge items rooted in American culture; to examinees in other countries such items can appear irrelevant or biased, undermining the face validity of the tests.1


Another exam format, the objective structured clinical exam (OSCE), is very widely used in the health professions to assess a variety of skills, procedures, and cognitive processes. These include taking a case history, conducting a physical exam, interpreting X-rays, blood pressure, etc., providing a diagnosis, and prescribing treatments. Although the OSCE has been used extensively for more than 40 years, there continue to be issues with its empirical validity (discussed in detail in Chapter 13). As it involves clinical situations, frequently including actors portraying patients, the OSCE has high face validity. An issue with face validity is that many educators, psychologists, and others judge assessments only on the basis of face validity. Nearly every medical school employs the personal interview as a major criterion for selection into medical school, for example. In a recent systematic review of 75 studies assessing the use of interviews involving many thousands of medical school applicants, Patterson, Knight, Dowell et al.2 found that the medical school interview has near zero predictive validity for performance in medical school. Many of these studies have also shown that the interview has nearly zero reliability or agreement between interviewers, whether the interviews are done in a panel or individually. In other words, the personal interview is of little or no value in distinguishing between suitable and unsuitable candidates for medical school. The personal interview for law school, dentistry, optometry, and other health professions is equally dismal for selection. Similar findings pertain to the job interview, where prospective candidates for a job are interviewed individually or by a panel. Many decades of research have shown that the job interview lacks both reliability and predictive validity in selecting for a job. The job interview is probably the most widely used criterion for job selection even though it has poor reliability and validity. The unavoidable conclusion about the personal interview for job selection, admission to medical school, or other programs is that it is a poor assessment method lacking in reliability and validity. Nonetheless, the personal interview continues to be widely used at great effort and expense. Consider the medical school personal interview. A typical mid-sized medical school with a class of 120 will interview around 500 applicants every year. An interview can last for 1–3 h and may involve several interviewers. Accordingly, 500–1,500 h of candidates' time is involved and 1,500–2,000 h of professors' time, as more than one professor generally interviews each candidate. Many of the applicants may travel from out of town or out of state at considerable expense. Overall then, the personal interview is a massive undertaking costing thousands of hours of human time and financial costs in professors' salaries and applicants' time and travel. This is repeated in the United States more than 150 times every year and perhaps thousands of times worldwide. How can such a massive effort continue to be devoted to such a useless activity? One of the explanations is face validity. Most people have a firm belief that they can discern human qualities (whether for a job or medical school suitability) by interviewing or talking with prospective candidates. When people—even experts—are faced with complex judgments without complete information, they typically rely on the intuitive system of judgment rather than the analytic cognitive system.
Daniel Kahneman—the Nobel Laureate psychologist—and his colleague Amos Tversky concluded that when judgments are made under conditions of uncertainty, a number of cognitive biases and heuristics involving superficial aspects of the assessments result in an overreliance on face validity.3 In this context, a cognitive heuristic is a simple procedure that helps find answers to difficult questions. When faced with uncertainty or incomplete information, experts frequently rely on intuition to answer difficult questions. One of the cognitive heuristics that comes into play under such conditions is the illusion of validity of expert judgment: those who make the judgments express a high degree of confidence even though they are likely to be wrong.4 This interplay between the intuitive illusion of validity and a high degree of confidence probably explains the continued persistence of medical school interviews, where professors want to "look candidates in the eye" or discern the "cut of their jib." Candidates also favor the interview—thereby also harboring the illusion of validity—in that they want an opportunity to prove themselves. These heuristics and biases emerge in low-validity environments such as medical school interviews.4 A further discussion of low-validity environments is included in Box 5.1.

BOX 5.1: Advocatus diaboli: Low-validity environments

Daniel Kahneman, who won the Nobel Prize in Economic Sciences for his revolutionary work in cognitive psychology, has challenged the rational model of judgment and decision making. Rational thought (the analytic system of cognition) is especially undermined in conditions of uncertainty or incomplete information, such as in business, medicine, politics, psychology, and education. These conditions of uncertainty or incomplete information result in low-validity environments.4 Under these conditions, non-rational or intuitive thought seems to predominate. Many aspects of business (e.g., the job interview, stock market fluctuations), medicine (e.g., longevity of cancer patients, diagnosis of cardiac disease), politics (e.g., selecting candidates for office, success of social programs), psychology (e.g., diagnosis of mental illness, effectiveness of psychotherapy), and education (e.g., selection for medical school, assessing teacher effectiveness) occur in low-validity environments. Aspects of assessment and evaluation in medical and other health professions education frequently occur in low-validity environments as well. This includes assessing "professionalism," "communication skills," "team work," "interprofessional education," and even "clinical skills and procedures." The overall conclusion from decades of research is that formulas in the form of algorithms are better than experts for predicting outcomes, especially in low-validity environments. Part of the reason for this is that experts attempt to be clever and consider complex combinations of features in making their predictions. Such enhanced complexity reduces validity. Another part of the reason is that experts attend to irrelevant stimuli and cues, which further reduces validity.

A number of studies have shown that human decision makers are inferior to a prediction formula even when they are given the score predicted by the formula!4 In medical school admissions, the final determination is made by faculty members who interview the candidate. It appears that conducting the interview is likely to diminish the accuracy of the selection procedure when the interviewers also make the final admission decision, which is common practice in medical schools.2 The interviewers tend to be overconfident in their intuitions, thus assigning too much weight to their personal impressions and too little weight to other sources, resulting in lowered validity. The same biases and heuristics are operative for residency selection. To echo Kahneman, the research in this area suggests a surprising conclusion: to maximize predictive accuracy, final decisions should be left to formulas, especially in low-validity environments. Expert judgment—with the illusion of apparent high face validity—muddies the waters and reduces validity. In a different domain, forensic auditing by expert accountants, the results are eerily similar to medical expert judgment.11 Grazioli et al. constructed a computer algorithm to detect fraudulent financial statements. The software correctly identified the frauds 87% of the time, while the expert auditors correctly identified only 45% of the cases. When reviewing the cases, the auditors articulated their thinking aloud, which was audio recorded. On analysis of these talk-aloud protocols, Grazioli et al. discovered that many irrelevant cues (e.g., superficial similarity to previous cases, such as age) affected judgment. A confirmation heuristic, "most people don't commit fraud so probably this one didn't either", also played a role. Moreover, emotions such as generosity or "giving the benefit of the doubt" were part of the mix. Notwithstanding their dismal performance, the auditors expressed a great deal of confidence in the correctness of their judgments. The accountant auditors were suffering from the illusion of validity of their judgments.

BOX 5.2: Advocatus diaboli: Expertise and confidence

Recent research indicates that most people are not very accurate in assessing their own performance, competence, or expertise in relation to either objective standards or peer assessments. Kruger and Dunning,12 in a series of experiments employing undergraduate students, found that poor performers in a variety of intellectual and social domains tended to overestimate their competence or performance compared to objective measures or peer assessments. Conversely, high performers tended to underestimate their own performance compared to objective outcomes or peer assessments.

They concluded that the cognitive skills necessary to perform well in a domain are the same as those needed to recognize good performance in that domain. In other words, poor performers lack the meta-cognitive skills (self-appraisal, knowledge and organization of the task, standards of performance in the task, memory processing, etc.) not only to be competent and perform well but also to accurately assess their own performance. Most people—whether they perform well or not—express high degrees of confidence in their judgments or performance, especially in the face of uncertainty. A recent study13 focused directly on the confidence (i.e., as self-assessed) expressed by medical students, residents, and faculty physicians in their diagnostic accuracy on clinical cases (correct vs. incorrect as objectively assessed). On a case-by-case basis, Friedman et al. found a weak relationship between confidence and diagnostic accuracy. Residents were overconfident (confidence was not related to correctness) in 41% of the clinical cases, faculty physicians in 36%, and medical students in 25%. Even experienced clinicians appear unaware of the correctness of their diagnoses at the time they make them. Practice improvement and reduction of medical errors therefore cannot rely exclusively on clinicians' perceptions of their own performance or needs. Rather, external (peer or objective) feedback is required. What factors characterize people who are so poor at self-assessment? Kruger and Dunning12 suggested that the knowledge and skills whose absence leads to poor performance are the very same ones that are necessary for accurate self-assessment. Therefore, a paradox results: poor performers can only improve the accuracy of their self-assessments by improving their performance first. Kruger and Dunning have indicated that unskilled people are frequently unaware of their own "incompetence" because they lack the very skills needed both to perform well and to self-assess accurately. Hodges et al.14 have identified a similar phenomenon with novice physicians, who were found to be unskilled and unaware of it. The main conclusion is that even officially designated experts (e.g., physicians) who are unskilled are highly confident in their own judgments. In low-validity environments, the heuristics or biases of judgment are invoked. Experts are often able to produce quick answers to difficult questions by substitution, thus creating coherence where in fact there is none. A classic substitution heuristic has been found in answering the following question: "How happy are you with your life these days?" In response to this question, people usually substitute (unconsciously) the following question: "What is my mood right now?"4 When experts (and others) employ these cognitive heuristics to answer questions, they express a high level of confidence in their subjective judgment even though the judgment is likely wrong.

By way of summary then, face validity, while the most superficial sort of validity, can and does play a significant role in both educational and psychological assessment and evaluation. Indeed, classroom climate can be adversely or positively influenced by the face validity of a particular test. It is important, therefore, for the faculty to be aware of this fact and the "public relations" role of testing. Face validity, however, plays a secondary role to content validity.

CONTENT VALIDITY
Content validity concerns the extent to which an assessment adequately samples the domain of measurement—the content. A patient questionnaire which requires the patient to evaluate the physician's clinical skills and cognitive knowledge would lack content validity, because these are not areas in which a patient has sufficient expertise. A peer questionnaire in which physicians evaluate a colleague's clinical skills and cognitive knowledge would have content validity. In the case of this questionnaire, the content refers both to what is sampled (i.e., clinical skills and cognitive knowledge) and to who does the rating. Content validity therefore involves sampling or selecting. The domain of measurement must be clearly defined and detailed. Not only must the content areas or subject matter be identified but so must the cognitive processes involved. The cognitive processes refer to the levels of Bloom's taxonomy5 (knows, comprehends, applies, analyzes, synthesizes, evaluates/creates).* We must determine not only the content to be assessed but also the cognitive processes or skills that have been specified by a task or performance analysis. Enhancing content validity may be achieved most directly through the use of a TOS. Prior to developing a TOS, we must know what the instructional objectives of the program or course are.

* Since the original publication there have been revisions of this taxonomy. The major and most controversial change is at levels 5 and 6: the revised version has demoted "evaluation" to level 5 and substituted "creating" at level 6 as the highest level. See Table 5.1 for further details.

INSTRUCTIONAL OBJECTIVES
Instructional objectives are goals of instruction that specify student or learner behavior as an outcome of instruction. These should specify overt student or learner performance during or at the end of instruction. Two types of instructional objectives are usually identified: (1) general and (2) specific.

General instructional objectives
A general instructional objective is an intended outcome of instruction that has been stated in general enough terms to encompass a domain of student performance. A general instructional objective must be further defined by a set of specific learning outcomes to clarify instructional intent and describe the type of performance students are expected to demonstrate. The following general instructional objectives, which have been adopted by a number of medical schools from the AAMC physician competency reference set,6 describe domains of student performance. All graduates, when they receive an MD, must be able to:

1. Demonstrate knowledge of the basic, clinical, and behavioral sciences and apply this knowledge to patient care (Knowledge for Practice).
2. Communicate and interact effectively with patients, their families, and members of the inter-professional healthcare team (Interpersonal and Communication Skills).
3. Function as a member of an inter-professional healthcare team and provide patient care that is compassionate, appropriate, and effective for the treatment of health problems and the promotion of health in diverse populations and settings (Patient Care).
4. Demonstrate a commitment to upholding their professional duties guided by ethical principles (Professionalism).
5. Demonstrate the ability to investigate and evaluate their care of patients, to appraise and assimilate scientific evidence, and to continuously improve patient care based on constant self-evaluation and life-long learning (Practice-Based Learning and Improvement).
6. Demonstrate awareness and understanding of the broader healthcare delivery system and the ability to effectively use system resources to provide patient-centered care that is compassionate, appropriate, safe, and effective (Systems-Based Practice).
7. Demonstrate the skills to participate as a contributing and integrated member of an interprofessional healthcare team to provide safe and effective care for patients and populations (Interprofessional Collaborative Practice).
8. Demonstrate the qualities and commitment required to sustain lifelong learning and personal and professional growth (Personal and Professional Development).

Guidelines for stating general instructional objectives

• Begin each objective with a verb (demonstrates, knows, understands, appreciates, etc.)
• State at the proper level of generality (clear, concise, readily definable)
• State in terms of student performance (e.g., demonstrates knowledge of clinical sciences), not in terms of instructor performance (e.g., will teach clinical sciences)
• Include only one objective per statement
• State in terms of the performance outcome, or product (communicates effectively), not the learning process (acquires communication skills)
• State in terms of the skill or terminal performance (understands the ethics of a situation), not the specific subject-matter topics (ethics of end-of-life assistance)

Specific learning outcomes
These objectives are sometimes also referred to as specific objectives, performance/skill objectives, behavioral objectives, and measurable objectives. An intended outcome of instruction is stated in terms of specific, measurable, and observable student or learner performance. Specific learning outcomes describe the types of performance that learners will be able to exhibit when they have achieved a general instructional objective.

General instructional objective:
1. Demonstrates knowledge of the basic, clinical, and behavioral sciences and applies this knowledge to patient care (Knowledge for Practice).

Possible specific learning outcomes:
1.1. Summarizes the normal structure and function of the human body and each of its major organ systems.
1.2. Describes cell and molecular biology for understanding the mechanisms of acquired and inherited human disease.
1.3. Identifies altered structures and function of major organ systems that are seen in common diseases and conditions.
1.4. Explains clinical, laboratory, and radiologic manifestations of common diseases and conditions.
1.5. Integrates behavioral, psychosocial, genetic, and cultural factors associated with the origin, progression, and treatment of common diseases and conditions.
1.6. Explains the epidemiology of common diseases and conditions within a defined population and systematic approaches useful in reducing the incidence and prevalence of these maladies.
1.7. Applies knowledge of the impact of cultural and psychosocial factors on a patient's ability to access medical care and adhere to care plans.

General instructional objective:
2. Demonstrates competence in presenting information orally.

Possible specific learning outcomes:
2.1. Delivers a patient presentation to peers effectively.
2.2. Projects voice and delivery.
2.3. Uses time effectively.
2.4. Fields questions about patient presentation effectively.
2.5. Introduces topics in such a way as to generate interest.
2.6. Discusses topics using examples and illustrations.
2.7. Concludes topic by summarizing main ideas.
2.8. Uses visual and multimedia aids effectively.
2.9. Involves audience in the presentation of information.
2.10. Explains test results to patients at an appropriate level.

CHARACTERISTICS OF GOOD OBJECTIVES
When writing specific learning outcomes or objectives, the performance or observable learner behavior must be clearly identified. The specific learning outcome, or objective, can be further defined and clarified by stating the conditions and criteria for that student performance.

Performance: an objective is useful to the extent it specifies clearly what students must be able to do when they demonstrate mastery of the objective. Given ten clinical problems, the student will solve 90% correctly.

Condition: an objective is useful to the extent it clearly states the conditions under which the learning/performance must occur (i.e., circumstances and materials). Given ten clinical problems, the student will solve 90% correctly.

Criteria: an objective is useful to the extent it specifies the quality or expected level of performance. Given ten clinical problems, the student will solve 90% correctly.

Guidelines for stating specific learning outcomes

• Describe the specific performance or observable behavior expected
• Describe a performance or behavior that is measurable
• Begin each specific learning outcome with a verb that specifies definite observable performance (discusses, writes, projects, concludes)
• Describe the performance in the specific learning outcome so that it is relevant to the general instructional objective
• Include only one learning outcome per statement (writes a case history), not multiple learning outcomes per statement (writes and presents a case history)
• State the specific learning outcome free of course content so that it can be used with various units of study

For example: This is good: diagnose diabetes. This is poor: diagnose diabetes in the case of Mr. Smith who is a 53-year-old obese white male.

TAXONOMIES OF OBJECTIVES
Various taxonomies or classification schemes have been developed for objectives. An important such taxonomy is the one developed by Benjamin Bloom for the cognitive domain. This is summarized in Table 5.1. The taxonomy is hierarchically organized into six categories ranging from verbatim recall or knowledge at the lowest level to the process of evaluation at the highest level. As well as a description of each level, each category contains example verbs from which to construct instructional objectives. In addition to specifying intended performance for students, instructional objectives should be written at the appropriate cognitive level. The information in Table 5.1 will help you both write instructional objectives and subsequently write test items that measure those instructional objectives.

Table 5.1  Bloom's taxonomy of instructional objectives in the cognitive domain

1. Knowledge/Remembering(a)
   Definition: Recall or recognition of learned material
   Example verbs: define, describe, identify, list, match, name, recall, recognize, remember

2. Comprehension/Understanding
   Definition: Translation, interpretation, and extrapolation of information
   Example verbs: defend, explain, rewrite, paraphrase, infer, re-word, interpret

3. Application/Applying
   Definition: Apply knowledge to an unfamiliar or new situation
   Example verbs: predict, prepare, use, relate, solve, modify, operate, compute, discover, apply

4. Analysis/Analyzing
   Definition: Break down material into its constituent components and identify their interrelationships
   Example verbs: identify, differentiate, discriminate, infer, relate, distinguish, detect, classify

5. Synthesis/Evaluating(a)
   Definition: Combining elements to form a new whole
   Example verbs: combine, compose, design, devise, organize, plan, revise, summarize, deduce, produce

6. Evaluation/Creating
   Definition: Making judgments about the value of materials or methods
   Example verbs: appraise, compare, conclude, contrast, criticize, support, justify, assess

Source: Anderson and Krathwohl7
(a) The second term at each level (remembering, understanding, etc.) has been added since the revision of this taxonomy. The major and most controversial change is at levels 5 and 6: the revised version has demoted "evaluation" to level 5 and substituted "creating" at level 6 as the highest level. All the levels are stated as verbs, suggesting that learning is an active process. Some debate continues on the order of levels 5 and 6, but this revised version has gained acceptance overall.
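As a rough illustration of how the verb lists in Table 5.1 can be put to work when blueprinting a test, the toy Python sketch below guesses the Bloom level of an objective from its opening verb. The verb sets are abbreviated from Table 5.1, and the matching is deliberately naive; this is an illustrative aid, not a validated classifier.

```python
# Abbreviated verb lists adapted from Table 5.1 (not exhaustive).
BLOOM_VERBS = {
    "knowledge/remembering": {"define", "list", "name", "recall", "identify"},
    "comprehension/understanding": {"explain", "paraphrase", "interpret", "infer"},
    "application/applying": {"apply", "use", "solve", "compute", "predict"},
    "analysis/analyzing": {"differentiate", "distinguish", "classify", "detect"},
    "synthesis/evaluating": {"design", "compose", "organize", "summarize"},
    "evaluation/creating": {"appraise", "justify", "criticize", "assess"},
}

def guess_bloom_level(objective: str) -> str:
    """Naively classify an objective by its first word (e.g., 'Explains...')."""
    word = objective.lower().split()[0]
    for level, verbs in BLOOM_VERBS.items():
        # Accept either the base verb or a simple third-person form ("explains").
        if word in verbs or word.rstrip("s") in verbs:
            return level
    return "unclassified"

print(guess_bloom_level("Explains clinical, laboratory, and radiologic manifestations"))
print(guess_bloom_level("Appraise the quality of a published study"))
```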

TABLE OF SPECIFICATIONS
A TOS is a blueprint of the assessment device or procedure. Just as an engineer develops a detailed plan of a bridge that is to be built and an architect draws a blueprint of the building that is to be erected, so too the educator begins with a plan for the assessment device. This plan specifies the content areas to be assessed as well as the cognitive processes and skills that are to be measured. Thus, it is called a table of specifications. A well-designed and carefully developed TOS will provide a sound plan for an assessment. The more accurately the assessment samples both the content and the cognitive processes, the higher the content validity. If a researcher develops an assessment with high content validity, then a good measurement may result. Content validity concerns the extent to which the test adequately samples the domain of measurement, the content. In a first-year medical school neurosciences course, 275 new terms have been introduced and students are required to know them. The domain of measurement refers to the definitions of all the terms. You would probably not want to include all of these on a test, however, as it would be too long, tedious, and exhausting. Thus, you might randomly select 75 terms from the total domain of 275 terms to include in the test. The extent to which the 75 terms selected adequately represent the total number of terms determines the content validity of the test. Content validity is usually the most important consideration in course, unit, or subject-matter tests. Content validity involves sampling or selecting. The domain of measurement, therefore, must be clearly defined and detailed. Not only must the content areas or subject matter be identified but so must the cognitive processes involved. The cognitive processes refer to the levels of Bloom's taxonomy (knows, comprehends, applies, analyzes, synthesizes, evaluates). You must determine not only the subject matter to be tested but also the cognitive outcomes that have been specified by the instructional objectives. Enhancing content validity may be achieved most directly through the use of a TOS. An assessment device that is constructed without a TOS results in a poor-quality test lacking in content validity. A similar result would be obtained by constructing a bridge or building without blueprints. The TOS is a chart where, by convention, the content areas to be measured are itemized along the vertical axis, while the cognitive outcomes are specified along the horizontal axis. Use the following steps to construct the TOS (a small worked sketch follows this list):

1. Prepare a heading on a blank sheet of paper or file. Write the subject (e.g., cardiovascular), date, and educational level (e.g., Year 3) of the test.
2. Identify all of the content areas to be tested and list these in the left-hand margin in the order they were taught.
3. Across the horizontal axis, specify the levels of learning outcomes specified in the instructional objectives.
4. Determine the total number of items that will be on the test.
5. In the right-hand margin, write the number and percentage of items that will be devoted to each specified content area, and do the same for the objectives.
6. In each cell (intersection of content area and level of learning outcome), write the number of items and the percentage of the total test that these items represent (some cells may be blank).
7. Carefully consider the percentage of test items in each cell and ensure that this accurately reflects the emphasis and time spent in the teaching and learning of this content. If only 10% of the total time was spent learning names of scientists and their discoveries, then only 10% of the test items should be devoted to this.
8. Check each item carefully to ensure that it indeed measures at the intended cognitive level (Chapters 11 and 12 contain detailed discussions of item construction). If, for example, a mathematics problem intended to measure application outcomes is given but the problem has been previously presented in class, then the item probably only measures at the knowledge level, since the students may have simply memorized the answer.

A well-designed and carefully developed TOS will provide a sound plan for your test. The more accurately the test samples the content and learning outcomes, the higher the content validity. If a professor or course committee develops a test with high content validity, then a good test usually results. Examples of tables of specifications are presented in Tables 5.2 and 13.4. As we have already seen, Table 5.2 contains a TOS for a Year 3 cardiovascular exam. Table 13.4 contains a TOS for a 12-station OSCE assessing clinical, communication, and counselling skills employing standardized patients. The OSCE stations are classified by content area (patient presentation) and level of assessment (e.g., physical exam/diagnosis/management). As OSCEs are "clinical exams"—history taking, physical exam, counselling a patient, ordering tests, etc.—they frequently, but not always, involve a standardized patient. This TOS covers the content and levels of assessment in medicine and some pediatrics, illustrating the use of a TOS for clinical, communication, and counselling skills. Table 5.3 is the published "table of specifications" for a standardized test, Step 1 of the United States Medical Licensing Examination (www.usmle.org/step1/#content-outlines). Notice that it is called "Test Specifications" but essentially specifies only the content of the test (left-hand margin) and the percentage ranges of items in each category. There is no indication of the cognitive levels of measurement. Therefore, this is not a true TOS but only a partial one, although the content is classified as "System" and "Process."
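The bookkeeping behind steps 4–7 can be sketched in a few lines of Python. The content areas and item counts below are invented for illustration (loosely echoing the layout of Table 5.2) and are not a prescribed blueprint.

```python
# Rows are content areas; columns are cognitive levels of the blueprint.
LEVELS = ["knowledge", "comprehension", "application", "higher"]

tos = {
    "Examination of the patient": [3, 5, 1, 1],   # items per cognitive level
    "Heart failure":              [6, 5, 2, 2],
    "Preventive cardiology":      [6, 8, 3, 3],
}

total_items = sum(sum(row) for row in tos.values())

# Row totals: compare these percentages with instructional time (step 7).
for area, row in tos.items():
    pct = 100 * sum(row) / total_items
    print(f"{area}: {sum(row)} items ({pct:.0f}% of the test)")

# Column totals: how the test weights the cognitive levels overall.
for j, level in enumerate(LEVELS):
    col = sum(row[j] for row in tos.values())
    print(f"{level}: {col} items ({100 * col / total_items:.0f}%)")
```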

DELPHI PROCEDURE
A Delphi procedure may be used to further enhance the content validity of a test or other assessment. This name derives from the 8th century BC Oracle of Delphi, who gave prophecies about important future events. Today, the Delphi is a method employing a systematic procedure to achieve consensus of group judgments where expert input is required. It is assumed that input from several experts is more valid than individual judgments. In testing and assessment, this method seeks agreement from experts on the content and processes of an assessment.

Table 5.2  Table of specifications (test blueprint)
Cardiovascular, Year 3    Date_________

Number of items by content area and level of understanding (Knowledge / Comprehension / Application / Higher), with totals as number of items (percentage of the 75-item test):

1. Examination of the patient: 10 items (13%)
   a. History: 1 / 2 / 0 / 0 = 3 (4%)
   b. Electrocardiography: 1 / 2 / 1 / 0 = 4 (5%)
   c. Echocardiography: 1 / 1 / 0 / 1 = 3 (4%)
2. Heart failure: 15 items (20%)
   a. Clinical aspects: 2 / 3 / 0 / 0 = 5 (7%)
   b. Pathophysiology: 2 / 1 / 1 / 1 = 5 (7%)
   c. Management: 2 / 1 / 1 / 1 = 5 (7%)
3. Arrhythmias, sudden death and syncope: 12 items (16%)
   a. Genesis of arrhythmias: 1 / 1 / 0 / 0 = 2 (3%)
   b. Genetics: 1 / 1 / 0 / 0 = 2 (3%)
   c. Diagnosis: 1 / 2 / 1 / 1 = 5 (7%)
   d. Management: 1 / 1 / 0 / 1 = 3 (4%)
4. Preventive cardiology: 20 items (27%)
   a. Vascular biology: 1 / 2 / 1 / 0 = 4 (5%)
   b. Risk for cardiac disease: 2 / 2 / 1 / 2 = 7 (9%)
   c. Treatment of hypertension: 1 / 2 / 0 / 1 = 4 (5%)
   d. Obesity, diet and nutrition: 2 / 2 / 1 / 0 = 5 (7%)
5. CV disease and other organ systems disorders: 18 items (24%)
   a. Heart in endocrine disorders: 1 / 2 / 0 / 0 = 3 (4%)
   b. Rheumatic fever: 1 / 2 / 0 / 0 = 3 (4%)
   c. Oncology and CV diseases: 1 / 2 / 1 / 2 = 6 (8%)
   d. Renal and CV diseases: 1 / 2 / 1 / 2 = 6 (8%)

Column totals: Knowledge 23 (31%); Comprehension 31 (41%); Application 9 (12%); Higher 12 (16%); Total 75 items.

Table 5.3  Information provided by the United States Medical Licensing Examination website: USMLE Step 1 test specificationsa

System                                                          Range (%)
General principles of foundational scienceb                      15–20
Organ systems and other categories (immune system; blood & lymphoreticular system; behavioral health; nervous system & special senses; skin & subcutaneous tissue; musculoskeletal system; cardiovascular system; respiratory system; gastrointestinal system; renal & urinary system; pregnancy, childbirth, & the puerperium; female reproductive & breast; male reproductive; endocrine system; multisystem processes & disorders; biostatistics & epidemiology; population health; social sciences)      60–70

Process                                                         Range (%)
Normal processesc                                                10–15
Abnormal processes                                               55–60
Principles of therapeutics                                       15–20
Otherd                                                           10–15

Source: www.usmle.org/step-1/#content-outlines (accessed April 15, 2016).
a Percentages are subject to change at any time. See the USMLE Web site for the most up-to-date information.
b The general principles category includes test items concerning those normal and abnormal processes that are not limited to specific organ systems. Categories for individual organ systems include test items concerning those normal and abnormal processes that are system-specific.
c This category includes questions about normal structure and function that may appear in the context of an abnormal clinical presentation.
d Approximately 10%–15% of questions are not classified in the normal processes, abnormal processes, or principles of therapeutics categories. These questions are likely to be classified in the general principles, biostatistics/evidence-based medicine, or social sciences categories in the USMLE Content Outline.

Gabriel and Violato8 illustrated this procedure in a recent study in which they developed and tested an instrument for assessing knowledge of depression in patients with non-psychotic depression. A TOS was created for three levels of cognitive outcomes of Bloom's taxonomy: knowledge, comprehension, and application. It was decided to write 27 multiple-choice questions (MCQs) to cover the content areas and levels of measurement. The initial items and the TOS were developed based on empirical evidence from an extensive review of the literature, theoretical knowledge, and consultation with national and international psychiatry experts. The MCQ items were written following basic rules for item construction so as to avoid common technical item flaws (see Chapter 11). A volunteer panel of experts met on three occasions to review the items for the following: (1) appropriateness of difficulty and relevance for patients as examinees, (2) concise, clear language free from medical or psychiatric jargon at the appropriate level (Grade 9), (3) knowledge to be demonstrated in a specific area of depression or its treatment, and (4) agreement of at least three experts on the correct answer for each question. Additional experts were asked to rate the relevance of each MCQ in sampling patient knowledge of depression and its treatment on a 5-point scale (1 = irrelevant, 2 = slightly relevant, 3 = moderately relevant, 4 = significantly relevant, and 5 = highly relevant). The 12 psychiatrists (mean of 22 years' experience) included both men and women experts in mood disorders. There were nine at the rank of professor, two at associate professor, and one at assistant professor. Three of the experts were invited for an informal panel discussion of the instrument and an in-depth review of the individual items. Each of the remaining nine experts independently rated each item for its relevance in testing knowledge of depression and its treatment on the five-point scale. The experts achieved consensus on the relevance of each item for testing patient knowledge of depression. There were no significant differences in experts' ratings by sex or years of experience. There was very high overall agreement (88%) among experts about the relevance of the MCQs for testing patient knowledge of depression and its treatments. The majority of the items were rated as highly or significantly relevant (mean = 4.4, SD = 0.67, range = 1–4). There was a significant positive relationship (r = 0.35, p < 0.01; r = 0.33, p < 0.05) between having the necessary knowledge about the risks of relapse (subscale #2) and being aware of the symptoms of depression (subscale #4), on the one hand, and having knowledge of different biological and psychological treatments (subscale #5), respectively. It is assumed that when patients understand the causes of depression, they will be able to think about treatment options more rationally. There were also positive correlations (r = 0.30, p < 0.05; r = 0.27, p < 0.05) between subscale #5, "understanding biological and psychological treatments," and subscale #3, "knowledge of etiology and triggers of depression," and subscale #4, "knowledge of symptoms," respectively. There were no correlations of the subscales with subscale #1, "definition of terms."

The evidence for content validity therefore is supported by two main factors. First, the MCQ test was initially developed based on empirical evidence from an extensive literature review and from consultations with experts in the field of depression. Second, a very high degree of agreement was achieved among experts on the relevance of its contents for measuring patient knowledge of depression and its treatments (very high mean = 4.4) using a Delphi procedure. The formal inclusion of external experts illustrates the Delphi procedure for enhancing content validity.
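A minimal sketch of the kind of summary computed in such a Delphi round is shown below. The ratings are hypothetical, and "agreement" is operationalized here simply as the proportion of experts rating an item 4 or 5 on the relevance scale, which is only one of several possible definitions.

```python
import numpy as np

# Hypothetical relevance ratings (rows = experts, columns = MCQ items),
# on the 1-5 scale described above; the values are invented for illustration.
ratings = np.array([
    [5, 4, 4, 3, 5],
    [4, 4, 5, 4, 5],
    [5, 5, 4, 4, 4],
    [4, 5, 4, 3, 5],
])

item_means = ratings.mean(axis=0)          # mean relevance per item
item_agreement = (ratings >= 4).mean(axis=0)  # share of experts rating 4 or 5

for i, (m, a) in enumerate(zip(item_means, item_agreement), start=1):
    print(f"Item {i}: mean relevance = {m:.2f}, agreement = {a:.0%}")

print(f"Overall agreement across all items: {(ratings >= 4).mean():.0%}")
```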

UPDATING CONTENT VALIDITY: THE 2015 MEDICAL COLLEGE ADMISSION TEST
The Medical College Admission Test (MCAT) has been widely used by most medical schools in the United States, Canada, and elsewhere for more than 80 years. In 1928, F.A. Moss created an early version, the Moss Scholastic Aptitude Test, to improve selection and reduce high failure and attrition rates. It remained in use until 1946, when the first revision was undertaken by the Association of American Medical Colleges; the test was subsequently renamed twice, first to the Professional Aptitude Test and finally to the MCAT in 1948.9 The Moss version, used from 1928 to 1946, was criticized for its lack of breadth and its focus on the recall or recognition of facts. The post-World War II versions were thought to be an improvement and provided better prediction of medical student success beyond the first 2 years of basic sciences or preclinical performance in medical school. Subsequent revisions have been composed of four subscales measuring basic sciences and social or verbal reasoning ability. The objectives of early versions concentrated heavily on designing a reliable and valid assessment that would not only aid in selection but also produce estimates of future success, namely predictive validity. Minor revisions occurred in 1962, followed by changes that emphasized a focus on the sciences in 1977, the introduction of a Writing Sample subtest in 1991, and finally a major overhaul for 2015. The new version consists of four sections:

• Chemical and physical foundations of biological systems
• Critical analysis and reasoning skills
• Biological and biochemical foundations of living systems
• Psychological, social, and biological foundations of behavior

According to AAMC documents,10 "the blueprints [i.e., TOS] for the new exam shift focus from testing what applicants know to testing how well they use what they know." This new TOS is presumably based on scientific evidence accumulated over the past one or two decades about both the changing healthcare system and advances in testing and psychometric theory and practice. The content validity of assessment devices should be updated as science, assessment, and psychometrics evolve and change.

SUMMARY AND CONCLUSIONS
This chapter dealt with two forms of validity, logical and content, commonly referred to as face and content validity. Face validity has to do with appearance: does the test appear to measure whatever it is supposed to measure? Face validity provides an initial impression of what a test measures but can be crucial in establishing rapport, motivation, and setting classroom climate.

An issue with face validity is that many educators, psychologists, and others judge assessments only on the basis of face validity. Nearly every medical school uses the personal interview as a major criterion for selection into the school, for example. How can such a massive effort continue to be devoted to such a useless activity? One of the explanations is face validity. When people—even experts—are faced with complex judgments without complete information, they typically rely on the intuitive system of judgment rather than the analytic cognitive system. When judgments are made under uncertainty, a number of cognitive biases and heuristics involving superficial aspects of the assessments result in an overreliance on face validity.

Content validity concerns the extent to which an assessment adequately samples the domain of measurement—the content. Content validity therefore involves sampling or selecting. The domain of measurement must be clearly defined and detailed, and the assessment must sample the cognitive processes involved, that is, the levels of Bloom's taxonomy (knows, comprehends, applies, analyzes, synthesizes, evaluates). Enhancing content validity may be achieved most directly through the use of a table of specifications. Prior to developing a TOS, we must know what the instructional objectives of the program or course are. Instructional objectives are goals of instruction that specify student or learner behavior as an outcome of instruction.

A TOS is a blueprint of the assessment device or procedure. The educator begins with a plan for the assessment device. This plan specifies the content areas to be assessed as well as the cognitive processes and skills that are to be measured. A well-designed and carefully developed TOS will provide a sound plan for your test. The closer the match between the test's sampling of the content and learning outcomes, the higher the content validity.

The Delphi procedure may be used to further enhance the content validity of a test or other assessment. It is a method employing a systematic procedure to achieve consensus of group judgments where expert input is required. In testing and assessment, this method seeks agreement from experts on the content and processes of an assessment. Evidence for content validity therefore is supported by consultations with experts in the relevant field and a high degree of agreement among experts on the relevance of the contents for measuring the domain of interest. Content validity needs to be updated based on evolving scientific and societal changes and needs.

REFLECTIONS AND EXERCISES

Reflections 5.1: Face validity
1. Briefly define face validity.
2. Why is it so important that educational tests have high face validity?
3. Construct four questions at the application level of knowledge that measure basic concepts in arithmetic (e.g., area, percentage). Write several versions of each question so that it has face validity for four groups of people (e.g., physicians, carpenters, mechanics, bank tellers, teachers, carpet layers, etc.). A particular question should measure the same concept for everyone but should be stated so as to have face validity for each group (e.g., If 12 cm are cut from a board 80 cm long, what percentage is removed?—application for carpenters. If six students leave a class of 40 students, what percentage has left the class?—application for instructors).
4. Discuss the aspects of the test that you would examine to improve face validity when developing a first-year medical school test on immunology.

Reflections 5.2: Content validity
1. Briefly define content validity.
2. How can a table of specifications be used to enhance content validity?
3. Construct a table of specifications for a test in a subject area in which you have some expertise. Make the test 50 items long and use at least four levels of learning outcomes from Bloom's taxonomy. Break the content areas into appropriate subdivisions.
4. Identify two assessments in health sciences education that may have content validity problems (e.g., in-training evaluation reports—ITERs). Specify what these shortcomings are.

Exercise 5.1: Instructional objectives (18 marks)
Purpose: to study the nature of instructional objectives and to practice writing both general instructional objectives and specific learning outcomes. Objectives enable instructors to plan and execute instruction and to fairly assess and evaluate student achievement and performance.

DIRECTIONS
1. Write three general instructional objectives in any content/subject area of your choice. Indicate the cognitive level of each according to Bloom's taxonomy. (3 marks)
2. For each of the general instructional objectives, list five specific learning outcomes that describe observable student performance as a result of instruction. Indicate the cognitive level of each according to Bloom's taxonomy. (15 marks)

SUBMIT TO THE INSTRUCTOR
One page with three general instructional objectives, each followed by five specific learning outcomes (also indicating the level of Bloom's taxonomy).

Exercise 5.2: OSCE table of specifications assignment (20 marks)
Purpose: to study the nature of tables of specifications for objective structured clinical exams (OSCEs) utilizing standardized patients.

DIRECTIONS
1. Design and develop a table of specifications/blueprint for an OSCE in any specialty of interest (e.g., surgery, psychiatry, obstetrics/gynecology, internal medicine, etc.).
2. Use the template in Table 13.4: patient and presentation, assessments, differential diagnoses and/or management.

SUBMIT TO THE INSTRUCTOR
One page with five stations (10 marks), each with a patient name and condition, followed by assessments (5 marks) and possible differential diagnoses and/or management (5 marks).

Exercise 5.3: Delphi procedure 10 marks Purpose: to develop the plans for a Delphi procedure. DIRECTIONS

1. Design and develop a Delphi procedure for enhancing content validity of an assessment (use the TOS in Exercise #1 or #2 or design a new one) 2. Identify experts (e.g., surgeons, etc.) who will participate in your Delphi study. 3. How many experts will you include? 4. Design the questionnaire for the Delphi procedure (provide some examples (at least 3) of the items that will be in your assessment) and the multi-point scale that you will use. SUBMIT TO THE INSTRUCTOR One page with three items (6 marks) and the rating scale that the experts will use (4 marks).

REFERENCES

1. Violato C. Effects of Canadianization of American-biased items on the WAIS and WAIS-R information subtests. Canadian Journal of Behavioural Science. 1984; 16(1):36–41.
2. Patterson F, Knight A, Dowell J, Nicholson S, Cousans F, Cleland J. How effective are selection methods in medical education? A systematic review. Medical Education. 2016; 50(1):36–60.
3. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. Science. 1974; 185(4157):1124–1131.
4. Kahneman D. Thinking, Fast and Slow. Macmillan: New York, 2011.
5. Bloom BS, Engelhart MD, Furst EJ, Hill WH, Krathwohl DR. Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain. David McKay Company: New York, 1956.
6. Englander R, Cameron T, Ballard AJ, Dodge J, Bull J, Aschenbrener C. Toward a common taxonomy of competency domains for the health professions and competencies for physicians. Academic Medicine. 2013; 88:1088–1094.
7. Anderson L, Krathwohl DA. A Taxonomy for Learning, Teaching and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman: New York, 2001.
8. Gabriel A, Violato C. The development of a knowledge test of depression and its treatment for patients suffering from non-psychotic depression: A psychometric assessment. BMC Psychiatry. 2009; 9:56. doi:10.1186/1471-244X-9-56.
9. Donnon T, Paolucci EO, Violato C. The predictive validity of the MCAT for medical school performance and medical board licensing examinations: A meta-analysis of the published research. 11th Ottawa International Conference, Barcelona, Spain, July 2004.
10. Association of American Medical Colleges. The new score scales for the 2015 MCAT exam, 2015. Available at www.aamc.org/mcat2015/admins.
11. Grazioli S, Johnson PE, Jamal K. A cognitive approach to fraud detection. Journal of Forensic Accounting. 2006; VII(1):65–88; University of Alberta School of Business Research Paper No. 2013–676. Available at SSRN: http://ssrn.com/abstract=920222 or doi:10.2139/ssrn.920222.
12. Kruger J, Dunning D. Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology. 1999; 77(6):1121–1134.
13. Friedman CP, Gatti GG, Franz TM, Murphy GC, Wolf FM, Heckerling PS, Fine PL, Miller TM, Elstein AS. Do physicians know when their diagnoses are correct? Implications for decision support and error reduction. Journal of General Internal Medicine. 2005; 20(4):334–339.
14. Hodges B, Regehr G, Martin D. Difficulties in recognizing one's own incompetence: Novice physicians who are unskilled and unaware of it. Academic Medicine. 2001; 76(10 Suppl):S87–S89.

6  Validity II: Correlational-based

ADVANCED ORGANIZERS

Validity—the extent to which a test measures whatever it is intended to measure—is usually classified into four types: (1) face, (2) content, (3) criterion-related, and (4) construct. The topics of this chapter, criterion-related and construct validity, both require empirical evidence (primarily correlational) for their support. Criterion-related validity involves concurrent and predictive forms, both of which require validity coefficients. Construct validity subsumes all forms of validity plus a further determination of the psychological and educational processes involved in the construct.

1. When we are interested in how performance on a test correlates with performance on some other criterion, we are concerned about criterion-related validity. There are two categories of this kind of validity: (1) predictive (how well does a test predict some future performance?) and (2) concurrent (how well do two tests intercorrelate concurrently?).
2. The determination of a test's criterion-related validity requires empirical evidence in the form of correlation coefficients (r). Interpreted in the context of validity, r is called a validity coefficient. The validity coefficient can be transformed into the coefficient of determination (r²), which in turn is used to determine the percentage of variance in the criterion that is accounted for by the test.
3. In predictive validity, individual scores on the criterion can be estimated using regression techniques. This strategy allows the prediction of individual scores on some future criterion.
4. Construct validity focuses on the truth or correctness of a construct and the instruments that measure it. A construct is defined as an entity, process, or event which is itself not observed and can be measured only indirectly. Establishing the validity of constructs also requires determination of the validity of relevant instruments. Establishing construct validity is a complex process.
5. One of the most widely researched and readily measurable constructs in education and psychology is intelligence. Systematic measurement and research in the area have been going on for nearly 100 years, beginning with the work of Alfred Binet in France. Much controversy, however, continues to surround the validity of the construct of intelligence.
6. A number of factors influence the validity of tests. Some of these factors are internal to the tests, such as the directions on the test. Others, such as noise or distractions during test administration, are external to the test. We summarize 12 such factors in this chapter.

CRITERION-RELATED VALIDITY
When we are interested in how performance on a test correlates with performance on some other criterion, we are concerned about criterion-related validity. The criterion may be any performance on some other measure. There are two subcategories of criterion-related validity: (1) predictive and (2) concurrent. Predictive validity refers to how current test performance correlates with some future performance on a criterion and thus involves the problem of prediction. Concurrent validity refers to how test performance correlates concurrently (at the same time) with some criterion. If you develop a test which will be used for screening and hiring management personnel, you are dealing with a prediction problem because it is the future performance of the candidate as manager that is in question. A pencil-and-paper test of knowledge of computers represents a concurrent validity problem where the simultaneous criterion is actual skill in computer use. Since the empirical procedures used in both predictive and concurrent validity are essentially the same (correlation), and since each involves a criterion external to the test, they are classified together under criterion-related validity and involve the use of validity coefficients.

VALIDITY COEFFICIENTS

Evaluating the criterion-related validity of a test requires examining the magnitude of the correlation coefficient, which is referred to as the validity coefficient. Validity coefficients are merely correlations that are interpreted within the context of validity. Interpretations can be aided further by using the coefficient of determination (r²) and then determining the percentage of variance (a statistical summary of the total differences in test scores—see Chapter 4) that is accounted for in the criterion by the test. The percentage of variance accounted for is derived by multiplying the coefficient of determination by 100 (r² × 100 = percent of variance accounted for). When the correlation is used within the context of predictive validity, it is called the predictive validity coefficient. In the context of concurrent validity, it is called the concurrent validity coefficient.
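As a minimal illustration of these calculations, the short Python sketch below computes a validity coefficient, the coefficient of determination, and the percentage of variance accounted for. The score values and variable names are invented for the example, not data from any study discussed here.

import numpy as np

# Invented paired scores: a predictor test and a criterion measure for ten examinees
predictor = np.array([30, 27, 33, 24, 29, 31, 26, 35, 28, 32])
criterion = np.array([220, 208, 231, 199, 215, 224, 205, 238, 212, 229])

r = np.corrcoef(predictor, criterion)[0, 1]   # validity coefficient
r_squared = r ** 2                            # coefficient of determination
percent_variance = r_squared * 100            # % of criterion variance accounted for by the test

print(f"r = {r:.2f}, r squared = {r_squared:.2f}, variance accounted for = {percent_variance:.1f}%")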


PREDICTIVE VALIDITY

How well does test performance predict some future performance? Figure 6.1 schematically represents this problem. Generally, a criterion (also the dependent variable) and a predictor (independent variable) are identified, and their intercorrelation is called the predictive validity coefficient. The arrow in the figure which points from the predictor to the criterion indicates that interest is unidirectional, from the predictor to the future. Six examples are given in Part A of Figure 6.1. In the Medical College Admissions Test (MCAT) example, the correlation between this predictor and medical school test scores, which is the criterion, is approximately r = 0.61.1 This indicates that 37% of the variance (0.61² × 100 = 37%) in medical school test scores is accounted for by performance on the MCAT. The correlation between Law School Admissions Test (LSAT) scores as a predictor and law school Grade Point Average (GPA) as the criterion is approximately r = 0.51 (0.51² × 100 = 26% of the variance in law school GPA is accounted for by performance on the LSAT). The correlation between the Reading Readiness Test (given to preschool children) and performance on a standardized reading achievement test given at the end of grade one is r = 0.60. Therefore, 36% of the variance (0.60² × 100 = 36%) in reading achievement is accounted for by reading readiness before children begin school. Finally, in example six of Figure 6.1, Infant Tests (vocalization, locomotion, manipulation skills of hands and fingers, attention span, goal directedness) correlate with childhood IQ at approximately r = 0.20. This accounts for 4% (0.20² × 100 = 4%) of the variance. Given the above data, are the predictors good, moderate, or poor? How large does the predictive validity coefficient have to be?

Part A: Predictive validity
1. MCAT scores → Board test scores during medical school (r = .61)
2. Medical school GPA → Residency directors' ratings (r = .18)
3. GRE scores → Graduate school GPA (r = .48)
4. LSAT scores → Law school GPA (r = .51)
5. Reading Readiness Test scores → School achievement test scores (r = .60)
6. Infant Tests → Childhood IQ (r = .20)

Part B: Concurrent validity
1. IQ scores ↔ Achievement test scores (r = .47)
2. Personality scores ↔ Achievement test scores (r = .30)
3. Attitude scores ↔ Achievement test scores (r = .10)

Figure 6.1  Schematic representation of criterion-related validity.


Figure 6.2  Venn diagram of the correlation between MCAT and board exams taken during medical school (r = .61; r² = .37).

Interpreting the predictive validity coefficient

Figure 6.2 is a schematic of the correlation and overlap between the MCAT and board exams taken during medical school. It is quite rare to derive a predictive validity coefficient greater than r = 0.61. This is because the criteria we wish to predict are extremely complex (e.g., achievement, IQ, marital satisfaction, managerial skills), and because people and situations are continuously changing. Whether or not a validity coefficient is large enough depends on the benefits obtained by making the predictions, the cost of testing, and the cost and validity of alternative methods of prediction. Obviously, the higher the predictive validity coefficient, the better the prediction. In some circumstances, a validity coefficient of r = 0.60 may not justify an extremely expensive and time-consuming examination, while in others, a coefficient of r = 0.20 may make an appreciable difference. In the predictive validity examples in Figure 6.1, the MCAT, LSAT, GRE, and Reading Readiness Test are moderate to good predictors of their respective criteria (23%–37% of the variance is accounted for). The Infant Tests, however, are poor predictors of childhood IQ, as only a small amount of the variance is accounted for (4%). Determining the efficacy of a predictor, therefore, is based on a number of factors including the magnitude of the validity coefficient, the variance accounted for in the criterion, the cost of testing, and the financial and social costs of alternative procedures. So far the discussion has focused on judging the overall predictive validity of the test. But how can we predict an individual's score on the criterion, given a score on the predictor?

Regression: Predicting individual scores

If the correlation between two variables is known, as in the case of predictive validity, a person's outcome on the criterion can be predicted using regression techniques. There are two regression approaches: simple and multiple regression; the latter builds on the former (see Chapter 4).

A particular student, Jason, has an MCAT total score of 30 and an average science GPA of 3.7. To predict Jason's Step 1 score based on the MCAT alone, we would use the simple regression (regression results from Chapter 4):

Y′ = 2.422 × MCAT + c
Y′ = 2.422 × 30 + 138.655 = 211.315

To improve the prediction of Jason's performance on Step 1, we now use the results from the multiple regression and equation 4.3 (Chapter 4). With Jason's MCAT total score of 30 and average science GPA of 3.7, the prediction for performance on Step 1 of the USMLE becomes:

Y′ = 2.422 × MCAT + 9.171 × GPA + c

Substituting the values for Jason:

Y′ = 2.422 × 30 + 9.171 × 3.7 + 112.972 = 219.56

This is a 13% improvement in predictive validity over the simple regression on MCAT total alone, obtained by employing a second variable, average science GPA. These results indicate that average science GPA contributes unique variance beyond that predicted by the MCAT alone. The percentage of variance accounted for with both variables (0.345 × 100 = 34.5%) is an improvement over the percentage with only the MCAT (0.306 × 100 = 30.6%).

Such predictions also have course and classroom applications. Suppose that a professor of Biochemistry knows that the correlation between a Quiz (Mean = 15; SD = 2) given in September and the Final exam (Mean = 68; SD = 10) given in December is r = 0.72. If Katherine achieves a score of 8 on the Quiz, what is her predicted score on the Final? The simplest way to carry out the prediction is using standard scores:

z_criterion = r × z_predictor  (6.1)

where
z_criterion = z score on the criterion
r = validity coefficient
z_predictor = z score on the predictor

Employing equation 3.6, Katherine's z score on the Quiz is z = (8 − 15)/2 = −3.5.

Now using equation 6.1, the z score on the Final is z = 0.72 × −3.5 = −2.52.

To derive Katherine's predicted score on the Final, use equation 3.6, z = (X − M)/SD:

−2.52 = (X − 68)/10

Re-arranging,

X = 68 + (−2.52)(10) = 43

All things remaining equal, Katherine’s predicted score on the Final may mean that she will fail the course. Now is the time, of course, for the professor to intervene and make sure “that all things do not remain equal.” The prediction has allowed the teacher to identify Katherine as a student at risk.
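The two prediction strategies above are easy to script. The Python sketch below reproduces the worked examples; the regression coefficients are simply those quoted from Chapter 4, and the helper function name is mine, not part of any statistical package.

# Predicting an individual criterion score from a predictor score using
# the standard-score method (equation 6.1) together with equation 3.6.

def predict_criterion(x, mean_x, sd_x, mean_y, sd_y, r):
    """Return the predicted criterion score for a predictor score x."""
    z_x = (x - mean_x) / sd_x      # z score on the predictor (equation 3.6)
    z_y = r * z_x                  # z score on the criterion (equation 6.1)
    return mean_y + z_y * sd_y     # convert back to raw-score units

# Katherine: Quiz (M = 15, SD = 2), Final (M = 68, SD = 10), r = 0.72
print(predict_criterion(8, 15, 2, 68, 10, 0.72))        # about 42.8, i.e., 43

# Jason: Step 1 predicted from the regression results quoted above
step1_simple = 2.422 * 30 + 138.655                     # MCAT only
step1_multiple = 2.422 * 30 + 9.171 * 3.7 + 112.972     # MCAT plus average science GPA
print(step1_simple, step1_multiple)                     # about 211.3 and 219.6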

CONCURRENT VALIDITY A similar problem to predictive validity is that of concurrent validity. Part B of Figure 6.1 depicts the concurrent validity problem. Compared to predictive validity (Part A), concurrent validity has no predictor. Both variables are called criteria (independent variables). The line with an arrow at both ends is used to indicate a simultaneous, reciprocal relationship. The two tests or measures are taken simultaneously. In predictive validity, on the other hand, time (months or years) separates the two measures. The concurrent validity problem arises most frequently when we wish to identify the correlations between two theoretically related but separate domains, such as IQ test scores and achievement test scores that are given more or less at the same time. Theoretically, we expect these two domains to be related. Similarly—though perhaps less obvious—we might expect attitude test scores


and personality test scores to be related to achievement test scores. To determine the validity of these tests, research must be carried out. The personality and attitude tests must be given simultaneously (e.g., within a few days) to determine their intercorrelations. Used in this context, the correlation is called a concurrent validity coefficient.

Interpreting the concurrent validity coefficient

When examining the correlation between the tests, the magnitude of the correlation requires considerable interpretation. The actual magnitude required depends on the theoretical relationship between the two domains. We would expect, for example, that the correlation between IQ and achievement should be higher than that between personality and achievement. The higher correlation between IQ and achievement (r = 0.47) compared to personality and achievement (r = 0.30) makes theoretical sense. Therefore, the pattern of correlations in Figure 6.1, Part B provides evidence for the overall concurrent validity of achievement. If the correlation between two tests is too high, it suggests that the two tests are measuring the same variable.
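Computationally, a concurrent validity study of this kind reduces to correlating measures collected from the same examinees at about the same time. A minimal Python sketch (all scores invented for illustration):

import numpy as np

# Invented scores from the same ten examinees on three concurrently administered measures
iq          = np.array([112,  98, 105, 120,  93, 101, 110,  97, 115, 104])
achievement = np.array([ 78,  64,  70,  85,  60,  69,  74,  66,  80,  71])
attitude    = np.array([ 55,  60,  48,  52,  57,  49,  61,  50,  58,  54])

# Each off-diagonal entry is a concurrent validity coefficient
print(np.corrcoef(np.vstack([iq, achievement, attitude])).round(2))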

CONSTRUCT VALIDITY

Construct validity is a multifaceted and complex idea that has undergone much discussion, debate, and revision since its elegant definition by Cronbach and Meehl.2 As a central idea for construct validity, they proposed the nomological network approach, which identifies, defines, and operationalizes constructs, contains theoretical propositions, and specifies linkages between constructs. The linkages may be correlation-based (Pearson's r, regression, factor analysis, etc.) or based on experimental hypothesis testing specifying between-group differences (analyses of variance, etc.). Construct validity focuses on the truth or correctness of a construct and the instruments that measure it. What is a construct? A construct is an "entity, process, or event which is itself not observed" but which is proposed to summarize and explain facts, empirical laws, and other data.3 In the physical sciences, gravity and energy are two examples of hypothetical constructs. Gravity has been proposed to explain and summarize facts such as planets in orbit, the weight of objects, and mutual attraction between masses. Gravity, defined as a process or force, cannot be directly observed or measured; only its effects can be identified and quantified. Energy, also an abstraction or construct, is used to explicate such disparate phenomena as photosynthesis in plants, illumination from a light bulb, and the engine that propels a jet liner. In psychological and medical educational measurement, examples of constructs include intelligence, scholastic aptitude, communication, clinical competence, honesty, clinical reasoning, and creativity. These constructs have been proposed to explain, summarize, and organize empirical relationships and response consistencies. Clinical competence, like gravity, cannot be


directly measured; its existence must be inferred from behavioral measurements. Similarly, anxiety is indicated by behavioral and physiological markers, while creativity is indexed by products (e.g., paintings, novels, inventions) though it is thought to be a process. In any case—whether gravity, clinical competence, honesty, or anxiety—construct validity requires the gradual accumulation of information from a variety of sources. In effect, it is a special instance of the general procedure of validating a theory in any scientific endeavor. The ultimate purpose of validation is explanation, understanding, and prediction. Cronbach4 has proposed that all forms of validity are really construct validity. Content and criterion-related evidence are pieces of the puzzle on which we can focus our attention. For many tests, it follows that all forms of validity should be considered and synthesized to produce an explanation and understanding. Therefore, all data from the test or data relevant to it are pertinent to its construct validity. There are, however, specific procedures that contribute to construct validity. These are explicated below.

In general, several important steps and procedures are central to establishing construct validity. These may be summarized as follows:

1. Identify and describe the meaning of the construct.
2. Derive theoretical support for the construct.
3. Based on the theory, develop items, tasks, or indicators of the construct.
4. Develop a theoretical network for the construct that can be empirically established by correlation. If, for example, the test is to measure neuroticism, then it should correlate with clinical ratings of neuroticism made by psychologists and psychiatrists. A test of mechanical aptitude, as another instance, should correlate with measures such as ability to fix an engine or assemble machinery.
5. Conduct research to obtain the data necessary to investigate the correlations between the variables in the theoretical framework.
6. Design experiments based on the construct, theory, and correlations to test for causal relationships.
7. Evaluate all of the relevant evidence and revise the theory, construct, and measures if necessary.
8. Fine-tune the measures of the construct by eliminating items and revising the tasks.
9. Return to Step 3 and proceed again.

Probably the best developed, most widely researched, and most readily measurable construct in psychology and education is intelligence. The process of establishing the construct validity of intelligence will be summarized here since it serves as an excellent illustration.

The construct validity of intelligence The concept of intelligence has its roots in antiquity. The early Greek philosophers such as Plato and Aristotle speculated about and debated the nature of


intelligence. More recently, Charles Darwin offered an analysis of intelligence in his book, The Descent of Man (1871), as did his cousin, Sir Francis Galton, in his book, Hereditary Genius (1869). These original but crude conceptions involved equally crude measures such as reaction time and attention span. With the work of Alfred Binet in France at the beginning of the twentieth century, understanding of intelligence and its measurement underwent considerable advances. In 1904, Binet was appointed by the French government to a committee to investigate the causes of intellectual disability among public school children. Binet decided to refine the concept of intelligence and develop a test which measured important aspects of it. Based on rational and logical analyses (procedures for establishing content validity), Binet constructed a test which measured verbal and numerical reasoning. By giving the test to large numbers of children at different ages, Binet was able to develop norms of mental age (MA). Binet then identified various criteria (school performance, reading ability) that were relevant to the test and gathered data to see if the test was correlated with these. These early tests did show impressive results, both as concurrent and predictive measures. It was the German psychologist Wilhelm Stern who took the next logical step and divided MA (as determined by the test) by chronological age and multiplied this quotient by 100 to produce the familiar Intelligence Quotient (IQ). That is,

IQ = (MA / CA) × 100

where MA is mental age and CA is chronological age.

The French psychiatrist, Theodore Simon, joined Binet in his later work in revision of the test to produce the Binet–Simon Intelligence Tests. These were valuable beginnings in the measurement of intelligence. In 1916, the Stanford University psychologist Lewis Terman, using the Binet–Simon test as a model, produced a test for use in the United States. This is now called the Stanford–Binet Tests of Intelligence. Terman et al. gave the tests to many thousands of participants and studied the nature of the distribution of the scores, correlations between the scores and other criteria (school success, memory ability, reading ability, etc.), and the internal consistency of the subtests across different populations. In addition, Terman launched a longitudinal study (which is still ongoing) that has produced important data relevant to the stability/instability of intelligence over the life span. Many thousands of other researchers have since administered the tests to many millions of people and have empirically studied their relationships to other theoretically relevant variables. A number of other measures of intelligence have since been developed by other psychologists such as David Wechsler (the Wechsler Intelligence Scales) as well as others (e.g., the Otis–Lennon, Full-Range Picture Vocabulary Test). The data that have accumulated over the course of the past century or so involving thousands of scientists, millions of subjects, and many different measures are all relevant to the construct validity of IQ, intelligence, and the tests which are


used to measure it. Based on this research and these data, the following general conclusions can be made5:

● MA (as measured by the original Binet scales) increases steadily until cognitive maturity is reached in late adolescence or early adulthood. Thereafter, it stabilizes. This is in keeping with the general principles of growth (height, for example).
● People who score poorly on IQ tests generally also have difficulties in theoretically related tasks such as reading, memory, arithmetic, and abstract reasoning. Conversely, those who achieve high scores on IQ tests do well on these tasks.
● Children who achieve high scores on IQ tests tend to do well in school; those who achieve low scores tend to do poorly.
● Subjects with clinical abnormalities affecting brain growth and organization (e.g., Down's syndrome, phenylketonuria) do very poorly on IQ tests.
● IQ scores show stability over a number of years and even over the entire life span.
● The IQs of identical twins show very high correlations even when the siblings are raised apart.
● Experiments to alter or increase IQ have yielded inconsistent results. While this last conclusion still remains controversial, some psychologists have concluded that intervention attempts have demonstrated that IQ is not very malleable and represents a stable trait.

The above results are in substantial agreement with logical, theoretical, and empirical expectations of intelligence. Thus, current evidence provides support for the construct validity of intelligence and some tests which measure it (Wechsler tests, Stanford–Binet). Even so, the construct of intelligence as it is currently formulated and operationalized is constantly undergoing critical scrutiny and is the center of much controversy in the continuing process of construct validation.

Factors that influence validity

A number of test characteristics, administration conditions, and other factors can seriously affect the validity of a test. Twelve such factors are summarized below.

1. Directions of the test. If the directions are vague, misleading, or unclear, this can have detrimental effects on test performance. Students may run out of time to complete the test, for example, if the directions about the time allotted are unclear or misleading. Long, complicated directions may similarly distract and confuse test-takers and reduce validity.
2. Reading level. If many of the candidates cannot understand the questions because the reading level is too difficult, validity is compromised. A reading level that is appropriate for graduate students may be too difficult for second-year nursing students.
3. Item difficulty/ease. If the items on a test are either too difficult or too easy for the students in question, then validity is compromised.
4. Test items inappropriate for the instructional objectives. If the items fail to correspond to the objectives, content validity is undermined. Items that measure synthesis are inappropriate for comprehension-level objectives. Conversely, items that measure knowledge are not appropriate for application objectives. A public health question on a pharmacology exam where there were no public health instructional objectives compromises validity.
5. Time problems. Insufficient time to take the test will reduce its validity. This is because students may fail to complete some portion of the test or because they may be so rushed that they give only superficial attention to some items. In an OSCE requiring a detailed patient history and a complex neurological physical exam in 10 min, validity may be undermined.
6. Test length. A test which is too short will fail to sample adequately the domain of measurement and thus reduce content validity. Essay tests or other constructed-response type tests (e.g., short answer) frequently suffer from this problem because a few essay items may not adequately sample the relevant domain.
7. Unintended clues. Wording of questions which provides clues to the answer reduces test validity. These clues can take the form of word associations, grammatical errors, plural–singular connections, and carryovers from previous questions.
8. Improper sampling of content/outcomes. Even when the test length is adequate, there may be improper sampling of content and outcomes, such as too much emphasis on a content area (or too little). This will reduce content validity. A typical student comment illustrates this: "Biochem/genetics course was poorly organized with unclear objectives and material tested somewhat unfairly."
9. Noise or distractions during test administration. External noise or distractions within the examination room can affect performance and thus validity.
10. Cheating and copying. Students who cheat or copy from others do not provide their own performance and thus invalidate their results.
11. Scoring of the test. Careless or global scoring strategies by the assessor can influence test validity in a negative fashion.
12. The criterion problem. Tests which are to predict some future performance or which are to relate to some other measure of performance concurrently must correlate with some measure called the criterion. Frequently, the criterion is very difficult to define, operationalize, and measure. This is called the criterion problem. Consider, for example, a test which is to predict future performance in "professionalism." How can this criterion be defined, operationalized, and measured? What content and processes should be included on this test? How do you determine the relevant aspects of professionalism to which the test should be correlated? These interrelated issues constitute the criterion problem. When the criterion is general and abstract, it is difficult to construct a test which will show correlations to the criterion. Conversely, the more specific and precise the criterion is, the easier it is to sample on a test. The problem is one of criterion-related validity.


correlations to the criterion. Conversely, the more specific and precise the criterion is, the easier it is to sample on a test. The problem is one of criterion-related validity.

Factor analysis and construct validity

The following study illustrates the use of factor analysis for evidence of construct validity. Multisource feedback questionnaires, including self-assessments, are a feasible means of assessing the competencies of practicing physicians, including surgeons, in communication, interpersonal skills, collegiality, and professionalism. In a study of 201 surgeons, data were collected on a 34-item instrument with a five-point scale. There were 25 general surgeons, 25 orthopedic surgeons, 24 obstetricians and gynecologists, 24 otolaryngologists, 24 ophthalmologists, 20 plastic surgeons, 20 urologists, 15 cardiovascular and thoracic surgeons, 13 neurosurgeons, 6 general practice surgeons, and 5 vascular surgeons. The data were entered into SPSS. To run a factor analysis with SPSS, use the following procedure. Select "Dimension Reduction," then select "Factor." The following dialogue window should open. Select Items 1–34 and click them over to the "Variables:" pane as below.


1. Select “Extraction.” The following dialogue window should open. 2. The default “Method” should be “Principal components.” If not, select it. 3. “Extract” should be “Based on eigenvalue.” Eigenvalue greater than 1 should be the default. If not, select it. (You could use any number of factors by selecting “Fixed number of factors” if you had a good reason—theoretical or practical—to extract a particular number of factors which may not correspond to Eigenvalue greater than 1.) 4. The maximum number of iterations for this iterative process is 25. Leave this as the default, unless you have a very good reason to change it.


5. Click on “Continue.”


Select “Rotation” button in the dialogue box. 6. Under “Method,” select “Varimax” 7. In “Display,” “Rotated Solution” should be selected. If not, select it. 8. The maximum number of iterations for this iterative process is 25. Leave this as the default, unless you have a very good reason to change it. 9. Click on “Continue.”



Click on "Options" in the dialogue box. 10. For "Missing Values," "Excluded cases listwise" should be selected. 11. Under "Coefficient Display Format," the "Absolute value below:" should be set at 0.40. 12. Click on "Continue." Now you should be back at the main dialogue box and the factor analysis is ready to run. Click on "OK" and the procedure should execute.

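For readers who prefer a scriptable alternative to the SPSS menus, a roughly equivalent analysis can be run in Python. The sketch below uses the third-party factor_analyzer package; the data file name, the item column names, and the details of the package interface shown here are assumptions for illustration rather than a prescription.

import pandas as pd
from factor_analyzer import FactorAnalyzer  # assumed third-party package

# Assumed file with columns item1 ... item34, one row per surgeon
df = pd.read_csv("surgeon_self_ratings.csv")
items = df[[f"item{i}" for i in range(1, 35)]]

# Principal-component extraction with varimax (orthogonal) rotation,
# retaining four factors (the eigenvalue > 1 result reported in Table 6.1)
fa = FactorAnalyzer(n_factors=4, method="principal", rotation="varimax")
fa.fit(items)

eigenvalues, _ = fa.get_eigenvalues()
loadings = pd.DataFrame(fa.loadings_, index=items.columns)

print(eigenvalues[:6].round(3))                    # compare with Table 6.1
print(loadings[loadings.abs() >= 0.40].round(2))   # suppress loadings below 0.40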

Output from SPSS

A large amount of output will ensue from the analysis. The key is to identify and interpret the output relevant for the present analysis and problem. We begin with Table 6.1, which contains the initial eigenvalues and the variance accounted for. While this table has been edited for brevity (Components 7–28 are omitted), there are as many eigenvalues output as there are items (34). The eigenvalues are listed hierarchically from largest to smallest. Notice that Component 1 has a corresponding eigenvalue of 17.939, which accounts for 52.761% of the variance in the data. The magnitude of the next several eigenvalues decreases markedly, each accounting for only a small amount of the variance. A close inspection of Table 6.1 reveals that there are four eigenvalues greater than 1 and that together they account for 65.660% of the variance. Additionally, the table shows how much of the variance in this solution is accounted for by each component; e.g., Component 2 accounts for 18.639% of the common variance as determined by


Table 6.1  Total variance explained—Principal component extraction

                Initial eigenvalues                      Rotation sums of squared loadings
Component   Total    Variance (%)  Cumulative (%)    Total   Variance (%)  Cumulative (%)
1           17.939   52.761         52.761           6.668   19.613        19.613
2            1.649    4.850         57.612           6.337   18.639        38.252
3            1.507    4.433         62.045           4.940   14.530        52.782
4            1.229    3.615         65.660           4.378   12.877        65.660
5            0.974    2.863         68.523
6            0.934    2.747         71.270
29           0.157    0.462         98.523
30           0.125    0.368         98.891
31           0.118    0.346         99.238
32           0.100    0.295         99.532
33           0.090    0.263         99.796
34           0.069    0.204        100.000
(Components 7–28 are omitted for brevity.)

the rotation sums of squared loadings. Therefore, four factors are selected in this analysis.
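The percentages in Table 6.1 follow directly from the eigenvalues: with 34 standardized items, the total variance is 34, so each eigenvalue divided by 34 (and multiplied by 100) gives that component's percentage of variance. A quick check in Python:

eigenvalues = [17.939, 1.649, 1.507, 1.229]           # first four values from Table 6.1
n_items = 34                                          # total variance of 34 standardized items

percent = [ev / n_items * 100 for ev in eigenvalues]  # per-component % of variance
print([round(p, 2) for p in percent])                 # approximately 52.76, 4.85, 4.43, 3.61
print(round(sum(percent), 2))                         # approximately 65.66 for the four retained components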

Interpreting the factors

Table 6.2 contains the varimax (orthogonally) rotated factors; convergence occurred in 11 iterations. Based on the factor loadings, theoretical meaning, and coherence, the next task is to name the factors. Factor 1 is Clinical Performance because the main large loadings (0.677–0.715) are from the following items:

4. Within my range of services, I perform technical procedures skillfully
5. I select diagnostic tests appropriately
6. I critically assess diagnostic information
7. I make the correct diagnosis in a timely fashion
8. In general, I select appropriate treatments

Many other relevant items load on this factor (Table 6.2). Similarly, Factors 2 (Patient Care), 3 (Communication & Humanist), and 4 (Professional Development) are named based on the factor loadings, theoretical meaning, and coherence. There are many "split-loadings" (items that load on more than one factor) in Table 6.2. For example, Item #14 (I maintain confidentiality of patients and their families) loads both on Factor 1 (Clinical Performance) at 0.600 and on Factor 2 (Patient Care) at 0.523. This split makes theoretical sense as confidentiality is part of both clinical performance and patient care. The many other split-loadings also make theoretical sense since those items are part of more than one factor.

Table 6.2  Varimax-rotated component matrix (loadings with absolute value below 0.40 suppressed; convergence in 11 iterations). Source: Violato et al.6

The matrix shows the loadings of the 34 items on four factors: 1 Clinical performance, 2 Patient care, 3 Communication & humanist, and 4 Professional development. The items are:

1. I communicate effectively with patients
2. I communicate effectively with patients' families
3. I communicate effectively with other health care professionals
4. Within my range of services, I perform technical procedures skillfully
5. I select diagnostic tests appropriately
6. I critically assess diagnostic information
7. I make the correct diagnosis in a timely fashion
8. In general, I select appropriate treatments
9. I maintain quality medical records
10. I handle transfer of care appropriately
11. I make it clear who is responsible for continuing care of the patient
12. I communicate information to patients about rationale for treatment
13. I recognize psychosocial aspects of illness
14. I maintain confidentiality of patients and their families
15. I co-ordinate care effectively for patients with health care professionals
16. I co-ordinate the management of care for patients with complex problems
17. I respect the rights of patients
18. I collaborate with medical colleagues
19. I am involved with professional development
20. I accept responsibility for my professional actions
21. I manage health care resources efficiently
22. I manage stress effectively
23. I participate in a system of call for care for my patients when unavailable
24. I recognize my own surgical limitations
25. I handle requests for consultation in a timely manner
26. I advise referring physicians if outside of scope of my practice
27. I assume appropriate responsibility for patients
28. I provide timely information to referring physicians about mutual patients
29. I critically evaluate the medical literature to optimize clinical decisions
30. I facilitate the learning of medical colleagues and co-workers
31. I contribute to quality improvement programs and practice guidelines
32. I participate effectively as a member of the health care team
33. I exhibit professional and ethical behavior to my physician colleagues
34. I show compassion for patients and their families


Construct validity

The results of this factor analysis provide evidence of construct validity of the self-report assessment of surgeon competence.

1. A large number of items (34) are reducible to a few basic underlying elements, latent variables, or factors.
2. Based on the eigenvalues >1.0 rule, there are four clear factors.
3. The four factors account for two-thirds (65.66%) of the total variance, a good result.
4. The magnitude of the variance accounted for by the factors themselves provides theoretical support for construct validity. Clinical Performance accounts for the largest proportion of the variance (52.761%), while the other factors are much smaller. This is expected in surgical practice.
5. The factors overall provide theoretical support and are meaningful and cohesive. The split-loadings also provide supporting evidence since it is expected that some items are part of more than one factor. Other items load on only one factor as expected. Item #30 (I facilitate the learning of medical colleagues and co-workers) loads only on Factor 4 (Professional Development) at 0.733, as theoretically expected.

The overall factor analysis then provides evidence of construct validity of the self-report assessment of surgeon competence.

SUMMARY AND MAIN POINTS

Validity—the extent to which a test measures whatever it is intended to measure—is usually classified into four types: (1) face, (2) content, (3) criterion-related, and (4) construct. Face and content, which are the most important types on teacher-made tests, are non-empirical forms of validity. Criterion-related and construct validity both require empirical evidence (primarily correlational) for their support. Criterion-related validity involves concurrent and predictive forms, both of which require validity coefficients. Construct validity subsumes all forms of validity plus a further determination of the psychological processes involved in the construct. Finally, there are a number of factors, some inherent to the test and some due to the administration, which affect validity.

1. Three critical factors characterize educational tests: validity, reliability, and usability.
2. Validity, at its most general level, is the extent to which the test measures whatever it is supposed to measure.
3. There are four main types of validity: (1) face, (2) content, (3) criterion-related, and (4) construct.


4. Face validity has to do with the appearance of the test. It can affect classroom climate and mood, as well as influence testees' motivation.
5. Content validity, usually the most important consideration in classroom tests, is defined as the extent to which the test adequately samples the domain of measurement. This form of validity can be enhanced by the use of a table of specifications.
6. When we are interested in how performance on a test correlates with performance on some other criterion, we are concerned about criterion-related validity. There are two categories of this kind of validity: (1) predictive (How well does a test predict some future performance?), and (2) concurrent (How well do two tests intercorrelate concurrently?).
7. The determination of a test's criterion-related validity requires empirical evidence in the form of correlation coefficients (r). Interpreted in the context of validity, r is called a validity coefficient. The validity coefficient can be transformed into the coefficient of determination (r²), which in turn is used to determine the percentage of variance in the criterion that is accounted for by the test.
8. In predictive validity, individual scores on the criterion can be estimated using regression techniques. This strategy allows the prediction of individual scores on some future criterion.
9. Construct validity focuses on the truth or correctness of a construct and the instruments that measure it. A construct is defined as an entity, process, or event which is itself not observed and can be measured only indirectly. Establishing the validity of constructs also requires determination of the validity of relevant instruments. Establishing construct validity is a complex process.
10. One of the most widely researched and readily measurable constructs in education and psychology is intelligence. Systematic measurement and research in the area has been going on for nearly 100 years, beginning with the work of Alfred Binet in France. Much controversy, however, continues to surround the validity of the construct of intelligence.
11. A number of factors influence the validity of tests. Some of these factors are internal to the tests, such as the directions on the test. Others, such as noise or distractions during test administration, are external to the test. We summarized 12 such factors in this chapter.

REFLECTION AND EXERCISES

Reflection 6.1 Defining criterion-related and construct validity

1. Briefly define criterion-related validity (250 words maximum).
2. Briefly define construct validity (250 words maximum).


Exercises 6.1 Criterion-related validity

1. From Exercise 4.2: Clerkship Clinical Scores (Chapter 4), compute the validity coefficient (i.e., r) for Peds as a predictor of the GPA. Treat the GPA as the criterion and Peds as the predictor.
2. Compute the validity coefficient for a composite predictor (sum of Peds, Surgery, IM, MCQ) of GPA. Is there a significant increase in predictive efficiency by combining the results of Peds, Surgery, IM, and MCQ rather than using just Peds? Explain your answer.
3. Compute the predicted scores on the GPA for students who achieve the following scores on Peds: 12, 17, 8, 10, 14, 16, 9, 15, 7, 11.
4. A researcher developed a new test of clinical reasoning, the Clinical Reasoning Test (CRT). This test consists of two subscales, Processing Speed and Quantitative Reasoning, and provides a total score as well. In order to investigate the validity of the CRT, the researcher gave the test to 86 third-year clinical students. He also obtained the students' MCAT Verbal Reasoning scores and their cumulative GPA. These data are summarized below. Enter the data into an SPSS file. Is there any evidence of the criterion-related validity of the CRT? What kinds? Are there any sex differences in the CRT or other variables? Explain your answer.
5. From the CRT dataset, is there any evidence for the validity of the GPA as a measure of academic achievement? What kinds of validity? Explain your answer.

Dataset from the Clinical Reasoning Test (CRT), GPA, MCAT verbal reason, CRT processing speed, CRT quantitative reasoning, and total CRT for third-year clinical students

ID 1 2 3 4 5 6 7 8 9 10 11 12 13
Sex Male Male Female Male Male Female Female Male Male Male Female Female Female
Total GPA 3.75 3.68 3.97 2.84 3.58 3.62 3.89 3.39 3.62 3.12 3.84 3.42 3.81
MCAT verbal reason 10 9 7 9 11 9 6 13 11 11 13 9 10
CRT processing speed 99.55 95.45 97.27 90.00 105.45 98.18 96.36 88.18 106.36 98.18 105.45 98.18 90.45
CRT quantitative reasoning 94.09 95.45 97.27 88.64 104.55 97.27 89.09 93.64 103.18 105.91 105.00 97.73 95.91
Total CRT 193.64 190.91 194.55 178.64 210.00 195.45 185.45 181.82 209.55 204.09 210.45 195.91 186.36


ID 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
Sex Male Female Female Male Male Female Male Male Female Female Female Male Female Female Female Female Male Female Female Female Male Male Male Male Male Female Female Male Male Male Female Female Male Female Female Male
Total GPA 3.69 3.72 3.02 2.94 3.01 3.56 3.28 3.47 3.82 3.41 3.28 3.38 3.43 3.05 3.61 3.65 3.30 3.37 3.97 3.60 3.50 3.72 3.58 3.69 2.86 3.52 3.61 3.08 2.92 4.00 3.58 3.49 3.83 3.67 3.71 3.90
MCAT verbal reason 9 12 7 8 11 9 7 10 12 6 9 9 13 9 8 8 8 11 8 9 6 9 8 11 10 8 8 10 7 9 12 10 7 9 9 7

CRT processing speed 94.55 97.73 60.00 74.55 88.18 93.64 90.91 85.91 93.18 90.45 92.27 103.64 89.09 100.00 80.00 97.27 85.91 89.55 83.64 102.73 100.00 86.82 90.91 80.18 81.36 83.18 84.09 90.00 78.18 79.09 104.55 95.00 85.00 81.82 103.64 75.45

CRT quantitative reasoning 100.00 90.00 80.43 81.36 92.27 90.45 78.18 90.45 100.91 93.64 89.55 90.91 99.09 97.73 91.82 90.91 94.09 101.82 87.27 100.45 104.09 82.73 88.64 76.36 83.64 80.45 83.18 96.82 76.36 94.09 105.45 94.09 92.27 85.45 104.09 85.45

Total CRT 194.55 187.73 140.43 155.91 180.45 184.09 169.09 176.36 194.09 184.09 181.82 194.55 188.18 197.73 171.82 188.18 180.00 191.36 170.91 203.18 204.09 169.55 179.54 156.55 165.00 163.64 167.27 186.82 154.55 173.18 210.00 189.09 177.27 167.27 207.73 160.91


ID 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86

Sex Female Male Female Male Male Male Female Male Male Male Female Female Male Female Male Male Male Female Male Male Male Male Female Male Male Male Female Male Male Male Female Female Male Male Female Male Male

Total GPA 3.31 3.78 3.33 4.00 3.45 3.62 3.14 3.49 3.25 3.65 3.93 3.65 3.75 3.55 3.63 3.75 3.11 3.63 3.38 3.65 3.08 3.89 3.69 3.77 3.01 2.85 3.92 3.38 2.00 4.00 3.55 3.25 3.68 3.72 3.61 3.38 3.96

MCAT verbal reason 9 10 10 11 11 9 11 12 10 10 9 11 9 10 9 11 7 8 7 9 11 8 11 8 10 7 10 9 10 9 11 8 10 5 11 10 6

CRT processing speed 96.36 99.09 94.09 87.73 100.00 99.09 95.00 96.36 81.82 110.00 101.36 102.73 80.91 91.36 109.55 84.55 90.91 102.73 95.91 107.73 105.00 90.45 95.45 85.91 104.09 95.00 94.09 80.91 81.82 78.64 93.18 79.55 83.18 91.82 85.91 99.55 89.55

CRT quantitative reasoning 96.82 98.18 101.82 87.27 109.55 95.91 93.18 96.82 77.27 108.64 98.18 100.45 94.55 94.09 98.64 93.18 89.55 88.64 101.82 106.36 100.91 96.36 102.27 81.82 109.55 88.18 96.82 95.45 82.73 90.45 104.09 89.09 88.64 89.55 94.55 98.64 100.45

Total CRT 193.18 197.27 195.91 175.00 209.55 195.00 188.18 193.18 159.09 218.64 199.55 203.18 175.45 185.45 208.18 177.73 180.45 191.36 197.73 214.09 205.91 186.82 197.73 167.73 213.64 183.18 190.91 176.36 164.55 169.09 197.27 168.64 171.82 181.36 180.45 198.18 190.00


Exercises 6.2 Construct validity

1. Briefly describe the general procedure for determining construct validity.
2. Run a factor analysis on these data (Total GPA, MCAT Verbal Reason, CRT Processing Speed, CRT Quantitative Reasoning, Total CRT). Use the Kaiser rule for the number of factors and varimax rotation. Name the factors.
3. From the CRT dataset, is there any evidence of the construct validity of the CRT? Discuss.

REFERENCES

1. Dunleavy DM, Kroopnick MH, Dowd KW, Searcy CA, Zhao X. The predictive validity of the MCAT exam in relation to academic performance through medical school: A national cohort study of 2001–2004 matriculants. Academic Medicine, 2013, 88, 666–671.
2. Cronbach LJ, Meehl P. Construct validity in psychological tests. Psychological Bulletin, 1955, 52, 281–302.
3. MacCorquodale K, Meehl PE. On a distinction between hypothetical constructs and intervening variables. Psychological Review, 1948, 55(2), 95–107, pp. 95–96.
4. Cronbach LJ. Essentials of Psychological Testing (6th ed). New York: Harper & Row, 1990.
5. Sternberg RJ, Kaufman SB (Eds.). The Cambridge Handbook of Intelligence. Cambridge: Cambridge University Press, 2011.
6. Violato C, Lockyer J, Fidler H. Multisource feedback: A method of assessing surgical practice. British Medical Journal, 2003, 326, 546–548.

7 Validity III: Advanced methods for validity

ADVANCED ORGANIZERS

• This chapter deals with advanced methods for studying validity: structural equation modelling (SEM), confirmatory factor analysis (CFA), the multitrait–multimethod matrix (MTMM), systematic reviews and meta-analysis, hierarchical linear modelling (HLM), and a unified view of validity.
• Structural equation modeling is a family of statistical techniques used for the systematic analysis of multivariate data to measure underlying hypothetical constructs (latent variables) and their inter-relationships. It is a powerful statistical tool in health sciences education research, and in particular in investigating construct validity. SEM builds upon statistical techniques such as correlation, regression, and analysis of variance (ANOVA).
• CFA is an extension of exploratory factor analysis (EFA) and a subset of SEM. CFA provides a means by which researchers can model a priori latent variables or factors; this method identifies any variability (systematic and error variance) in a measured variable that is not associated with the latent construct. In addition, it can model the relationships between factors. The structural relations are the relations between the factors and are the core of SEM.
• MTMM is a method that involves correlating scores on different attributes (e.g., communication, physical examination) across different methods (e.g., patient questionnaire, direct observation); this gives a more comprehensive assessment of the construct measured than do correlations of traits assessed by single methods. The minimum requirement for using MTMM in a construct validity study is the use of two traits and two methods. The inter-correlations may thus provide evidence of convergent and divergent validity.


• Systematic reviews and meta-analyses are sometimes combined in a single study. A systematic review is a literature review that collects and critically analyzes several research studies, using explicitly specified methods whereby the papers are located, identified, and selected. A meta-analysis is a statistical analysis that combines the results of multiple empirical studies. The purpose of meta-analysis is to use statistical approaches to derive a pooled estimate of the underlying regularities or constructs.
• HLM is an elaborated method of regression that can be used to analyze the variance in dependent variables when the data are hierarchically organized, such as students within courses within schools. Because of the shared variance, simple regression is inadequate and more advanced methods such as HLM are more appropriate. HLM can account for the shared variance in hierarchically structured data such as students within courses, which are within a school, which are within pedagogical approaches (e.g., organ systems vs. PBL curricula).
• Validity is a unitary concept called construct validity. A construct is a hypothetical entity (i.e., theory) that specifies the purpose or intent of the assessment and interpretation of relevant data. Evidence is gathered to support the interpretation of the construct and interpretation of scores. Several complex steps are required for validity evidence.

Over the past several decades, a number of advanced methods for studying validity, particularly construct validity, have been developed and used. These include SEM, CFA, MTMM approaches, HLM, meta-analysis, and a unified approach to validity. These are the subject of this chapter.

STRUCTURAL EQUATION MODELLING Structural equation modeling (SEM) is a family of statistical techniques used for the systematic analysis of multivariate data to measure underlying hypothetical constructs (latent variables) and their inter-relationships. It is a powerful statistical research tool in health sciences education research and in particular in investigating construct validity. SEM builds upon statistical techniques such as correlation, regression, and ANOVA. It combines the strength of the confirmatory, data reducing ability of factor analysis with multi-regression techniques of path analysis to explicate the direct and indirect relationships between measured and hypothetical (latent) variables. The use of SEM has increased in psychological and educational research in the past few years but has not been widely applied in medical education research. In this section, the basic tenets of SEM, the development of SEM as an integrated statistical theory, the central features of model creation, estimation, and model fit to data are outlined. The strengths and weaknesses are discussed together with an example of the use of SEM in medical education research. There is a discussion of areas where there are opportunities for the application of SEM for the study of validity.


Basics of SEM

SEM is a confirmatory approach that provides a mechanism to study the hypothesized underlying structural relationships between latent variables or constructs. The development of SEM was based on the integration of three key components: (1) path analysis, (2) factor analysis, and (3) the development of estimation techniques for model fit. Path analysis employs models that represent the hypothesized causal connections among a set of variables and estimates the magnitude and significance of each connection. Path analysis moves beyond asking whether independent variables predict a phenomenon (regression analysis) to examining the interrelationships between the variables. CFA is an extension of EFA and allows a researcher to model a priori how measured variables identify latent constructs. CFA provides a means by which researchers can model latent variables; this method identifies any variability (systematic and error variance) in a measured variable that is not associated with the latent construct. Integration of these statistical methods, which are referred to as measurement (factor analysis) and structural (path analysis) models, gave rise to the development of SEM described by researchers such as Karl Joreskog and Peter Bentler.1 The structural relations are the relations between the latent variables and are the core of SEM. The measurement model (Figure 7.1) represents how the latent variables are measured by indicator variables and describes the measurement properties of the indicator variables.2 The structural model defines the relationships between latent variables (and possibly observed variables that are not indicators


Figure 7.1  A structural equation model depicting three latent variables (A, B, and C) and the relationship between measurement models and a structural model. The path analytic model contains three variables; variable A has both a direct and an indirect effect on C. Source: Modified from Nachtigall et al.3,4


of latent variables) and allows for the determination of the extent of association between these variables.
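To make the distinction between measurement and structural models concrete, a small SEM can be written in a lavaan-style model syntax and estimated in software. The Python sketch below uses the third-party semopy package as one possible tool; the package interface, the variable names, and the data file are assumptions for illustration, not part of the studies cited in this chapter.

import pandas as pd
import semopy  # assumed third-party SEM package

# Measurement models: latent variables A, B, and C each measured by three indicators.
# Structural model: A predicts C both directly and indirectly through B (cf. Figure 7.1).
model_description = """
A =~ a1 + a2 + a3
B =~ b1 + b2 + b3
C =~ c1 + c2 + c3
B ~ A
C ~ A + B
"""

data = pd.read_csv("assessment_indicators.csv")  # assumed file of observed indicator scores
model = semopy.Model(model_description)
model.fit(data)

print(model.inspect())           # parameter estimates for measurement and structural paths
print(semopy.calc_stats(model))  # fit statistics (assumed helper in the same package)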

COMPARISON OF SEM WITH OTHER STATISTICAL METHODS

SEM is a confirmatory approach which allows for the analysis of hypothesized inter-relationships between latent constructs. Use of SEM requires statistical tools which are based upon regression, ANOVA, and correlation. While regression and ANOVA define the degree of significance of relationships between variables, they make it difficult for researchers to model the underlying constructs that independent variables might load upon. SEM is built upon the multivariate techniques of factor and path analysis. While these methods are quite strong independently, SEM subsumes them and allows for a higher level of abstraction through the development of structural models of hypothesized constructs. Path analysis analyzes structural models between observed variables and depicts direct and indirect causal effects based upon hypotheses of causal effects. However, each variable is a single indicator, and there is the assumption that the variables are measured without error. The strength of SEM over path analysis is that SEM allows for a structural model to be created between latent variables, or a combination of measured and latent variables, and the path coefficients between latent factors in an SEM are corrected for measurement error (observed variables have measurement errors), because the hypothetical constructs are measured by multiple indicators from which measurement error is removed. Factor analysis is a data-reducing technique for explaining correlations among variables in terms of factors that are unobservable and superordinate to the measured variables. As we have seen, there are two major applications of factor analysis: exploratory (EFA) and confirmatory (CFA). EFA does not require a priori hypotheses regarding the number of underlying factors or the relationships between measured variables and factors. Because of its atheoretical nature, it is not typically used as an SEM procedure. CFA, on the other hand, requires a priori hypotheses and a proposed model (like SEM) about the number of factors and the nature of the relationships between measured and hypothetical constructs. Unlike SEM, CFA cannot model causality or the temporal relationships between variables. SEM subsumes CFA, in that CFA is the measurement model of an SEM, outlining the relationships between indicators and underlying hypothetical constructs.

SEM: AN INTEGRATED RESEARCH METHOD SEM is a flexible and powerful statistical tool for the development, refinement, and validation of theories and hypothesized relationships between variables. SEM requires fundamental understanding of statistical methods and concepts. It is considered to be primarily confirmatory, in that models are specified a priori based upon theory and previous exploratory work. Nonetheless, it can also be exploratory in the sense that it provides a mechanism whereby competing models can be tested or models can be re-specified (i.e., re-drawn to improve fit).


The relationships among the variables should be justifiable, based on theory or on previous research findings. A researcher must specify a model, or competing models, a priori, before any analysis is performed. Analysis and interpretation of an SEM, especially the underlying hypothetical constructs represented as the latent variables, depend on how those variables were measured. Therefore, SEM not only subsumes many statistical concepts (correlation, ANOVA, multiple regression), it also requires that the researcher understand measurement theory, meaning that the psychometric properties (reliability and validity) of each tool used to measure an underlying construct have to be elucidated. It is important to understand that SEM tests the whole model for goodness of fit and provides information as to the relevance of various measurement components, as well as relationships between variables that can be reviewed based upon the theory that was used to create the model. The development of a model occurs systematically. Bollen and Long4 have outlined five steps that characterize most applications of SEM: (1) model specification, (2) identification, (3) estimation, (4) testing fit, and (5) re-specification. These operational steps are necessary in order to increase the likelihood that the observed data will fit the predicted model. First, a research problem is outlined and a model is specified. This problem and subsequent questions influence and are influenced by the underlying theory that has surrounded the work in that area. The theory underlying the model should be supported by the presence of preliminary empirical evidence gathered by reliable and valid psychometric tools and analyzed by the appropriate univariate and multivariate techniques. SEM relies upon a priori hypotheses, and a researcher has to be able to define the theoretical underpinnings and how the research question was developed. The population under study also has an impact on the development of theory and research question(s) that are of interest. A testable model, or competing testable models, is then developed based upon the research question and theory. The model is then specified, where the relationships between the variables must be explained and the measurement and the structural models are explicitly defined. Here the two component parts—the measurement model and the structural model—need to be explicated. Second, the model is reviewed for identification. This determines whether it is possible to find unique values for the parameters of the specified model. In the measurement model, the data should contain multiple indicators of each latent variable (or at least two indicators). One indicator alone of a hypothesized latent construct would result in a biased measurement.2,5 While some suggest that at least three indicators should be used to account for as much of the variance as possible in the latent variable, practically, most SEM models use two indicators at a minimum. The indicators must be selected carefully, and the reliability and validity of the psychometric tools used for the measured variables should be well described. Identification determines if there are more variables measured than parameters to be estimated. For models to be properly empirically assessed, they should be identified or overidentified, meaning that the information in the data (which are the known values: variances and covariances) is equal to or exceeds the information being estimated (unknown values: parameter estimations, measurement error, etc.).


If the unknowns exceed the knowns, then the model is underidentified. For overidentification of a model, there should be two or more observed variables for each latent factor.6

Third, the model is estimated. Estimation techniques were developed to ascertain how well a model fits the observed data, based upon the extent to which the model-implied covariance matrix is equivalent to the data-derived covariance matrix. The most common methods of estimation are maximum likelihood (ML), generalized least squares (GLS), and asymptotic distribution-free (ADF) estimation. These procedures are iterative, meaning that the calculations are repeated until the best parameter estimates are obtained. ML is the default estimation technique in most SEM software and is the most widely used. ML assumes that the variables are multivariate normal (i.e., the joint distribution of the variables is multivariate normal) and requires large sample sizes. ML estimates the parameters that maximize the likelihood (the probability) that the predicted model fits the observed data, based upon the covariance matrix. The covariance, defined as cov_xy = r_xy × SD_x × SD_y (the Pearson product–moment correlation between variables X and Y multiplied by the standard deviations (SD) of X and Y), is the statistic primarily used in SEM. The covariance matrix used in SEM is meant to capture patterns of correlation among a set of variables and to explain as much of their variance as possible with the model specified by the researcher.

GLS is based upon the same assumptions as ML and is used under the same conditions (it performs less well with smaller sample sizes; therefore, ML is recommended). GLS reduces the sum of the squared deviations (or variances) between the predicted model and the observed model and is more popular in regression analyses. Asymptotically distribution-free (ADF) estimation techniques (such as arbitrary distribution least squares [ALS]) may be used if some measured variables are dichotomous and others are continuous, so that multivariate normality cannot be assumed, or if the distributions of the continuous variables are nonnormal. These estimation techniques are not as readily used.

The SEM study begins by collecting a sample from the population to which the researcher wishes to generalize. Descriptive and univariate analyses are performed to determine whether the data meet the assumptions for SEM. The assumptions for SEM are similar to those of most statistical methods for parametric data, in that SEM assumes multivariate normality, independence of observations (the assumption of local statistical independence), and homoscedasticity (uniform variance across measured variables).

Fourth, the model fit is estimated. Generally, the estimation is based on testing the null hypothesis expressed as

Σ = Σ(θ)   (7.1)

where Σ is the population covariance matrix of the observed variables, Σ(θ) is the covariance matrix implied by the specific model, and θ is a vector containing the free parameters of the model.
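To make the covariance algebra concrete, the following minimal Python sketch (with made-up numbers, not data from any study in this chapter) computes a covariance from a correlation and two standard deviations, assembles a small sample covariance matrix, and counts the p(p + 1)/2 distinct knowns that the model-implied matrix Σ(θ) must reproduce.

```python
import numpy as np

# Covariance from a correlation and two standard deviations (hypothetical values):
# cov_xy = r_xy * SD_x * SD_y
r_xy, sd_x, sd_y = 0.45, 8.0, 5.0
cov_xy = r_xy * sd_x * sd_y              # 18.0

# Sample covariance matrix S for p observed variables (rows = examinees).
rng = np.random.default_rng(7)
scores = rng.normal(size=(300, 6))       # 300 cases, p = 6 observed indicators
S = np.cov(scores, rowvar=False)         # 6 x 6 data-derived covariance matrix

# Distinct "knowns" (variances and covariances) available to estimate
# the free parameters of the model-implied matrix Sigma(theta).
p = S.shape[0]
knowns = p * (p + 1) // 2                # 21 when p = 6
print(cov_xy, knowns)
```

With p = 6 there are 21 knowns: a model with fewer than 21 free parameters is overidentified (and therefore testable), one with exactly 21 is just-identified, and one with more than 21 is underidentified.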


A test statistic allows us to test the null hypothesis that the specified model leads to a reproduction of the population covariance matrix of the observed variables. To assess the goodness of fit of the observed data to the model, various fit indices have been developed. The following section describes the most commonly reported fit indices and the parameters associated with their use. There is no fixed rule as to which one to use or which combination to use, and there is still little agreement on what represents the best fit.2 Fit criteria indicate the extent to which the model fits the data. Only the χ2 provides a significance test; the other measures are descriptive and fall into three categories: measures of overall model fit, measures based upon model comparisons, and measures of model parsimony.2

The χ2 statistic as used in SEM assesses whether the calculated covariance matrix is equal to the model-implied covariance matrix. It tests the null hypothesis that the difference between the two is zero. Therefore, a nonsignificant χ2 value (p-value > 0.05) with its associated degrees of freedom means that the null hypothesis is accepted and the model is considered to fit the data. In other words, we do not want a significant difference between the sample covariance matrix and the model-implied covariance matrix.

There are problems with the χ2 statistic in SEM. Theoretically, it has no upper bound, and therefore its values are not interpretable in a standardized way. It is also sensitive to sample size: as sample size increases, so does the χ2 value, which means that the model may be rejected; as sample size decreases, so does the χ2 value, which may produce non-significant results even though there may be considerable difference between the sample and model-implied covariances. Using χ2 alone can lead to an inflated Type I error rate for model rejection. To reduce the sensitivity of the χ2, some have suggested that the χ2 be divided by the degrees of freedom (χ2/df) and that the ratio should be less than three for a good fit. Unfortunately, this still does not address the problem of sample size dependency.

Descriptive goodness-of-fit measures are used in conjunction with the χ2 value to determine the overall fit of the empirical data to the model. Descriptive measures of overall model fit assess the extent to which an SEM corresponds to the empirical data, based on the difference between the sample covariance matrix and the model-implied covariance matrix (a measure of the average of the residual variances and covariances between the sample matrix and the model-implied matrix). Descriptive measures based on model comparisons assess the fit of a model compared to the fit of a baseline model. Baseline models are either an independence model, where it is assumed that the observed variables are measured without error, or a null model, where all parameters are fixed to zero. Descriptive measures of model parsimony are used primarily when competing models are being compared. Typically, values of these measures range from zero (no fit) to one (perfect fit), while some provide a badness-of-fit measure ranging from zero to one. Common fit indices include the root mean square error of approximation (RMSEA), the standardized root mean square residual (SRMR), the goodness of fit index (GFI), the Akaike information criterion (AIC), the comparative fit index (CFI), and a number of others. The most frequently reported index in the literature is the CFI, which compares a predicted model with a baseline model (Table 7.1).

Table 7.1  A comparison of fit of three models of the development of medical expertise

Goodness of fit indices: GFI = goodness of fit index; CFI = comparative fit index; SRMR = standardized root mean squared residual; RMSEA = root mean squared error of approximation; χ2 = chi square; χ2D = delta chi square.

Model                                                      GFI     CFI     SRMR    RMSEA   χ2                  χ2D
Knowledge Encapsulation (full model with no constraints
  on the covariance among BSA, CC, and AFM)                0.973   0.968   0.038   0.063   88.11 (df = 29)     —
Independent Influence (reduced model: covariance
  between BSA and CC fixed to 0)                           0.886   0.811   0.163   0.191   443.91 (df = 29)    355.80 (dfD = 1)
Distinct Domain (reduced model: covariance among
  BSA, AFM, and CC fixed to 0)                             0.856   0.779   0.225   0.170   514.93 (df = 29)    462.82 (dfD = 3)

Note: AFM, aptitude for medicine; BSA, basic science achievement; CC, clinical competency.

Is there a consensus as to what constitutes an adequate fit? Some have suggested that model fit depends upon more than the fit indices. The recommendation is that the model should be identified (parameters = observations), the iterative estimation procedure should converge using ML or another technique, the parameter estimates for the model should be within the range of permissible values, the standard errors of the parameter estimates should be of reasonable size, and the residual matrix (observed covariances minus predicted covariances) should have residuals that are approximately zero.7 Furthermore, as mentioned above, the χ2 test should not be used on its own to assess goodness of fit. Usually several indices are provided, representing the different classes of fit criteria. It is suggested that the χ2 value and its p-value, χ2/df, the RMSEA and its confidence interval (CI), the SRMR, the NNFI, and the CFI be reported. In practice, however, usually the χ2 value, CFI, SRMR, and RMSEA are presented as indicators of goodness of fit. For model comparisons, the χ2 difference test and the AIC should be provided. The literature strongly suggests that sample sizes of ≤250 should be treated with caution; the recommended procedure is to report multiple fit indices such as the CFI (≥0.95) in combination with the SRMR (≤0.08) and RMSEA (≤0.06), as this combination tends not to reject models under non-robust conditions.8 Finally, as Bentler noted, model selection should be guided by the principle of parsimony: if several models fit the data, the simplest should be selected.9

A model can only be rejected; it can never be proven to be valid. Models should not be re-specified based upon statistical criteria; they should be re-specified based upon theoretical criteria. If the fit is acceptable, the proposed relationships in the measurement model between the latent and the observed variables and in the structural model between the latent variables are supported by the data. The data collected are empirically assessed through estimation methods that attempt to fit the data to the hypothesized models; fit indices are calculated to assess the goodness of fit. Fit indices such as the CFI should be >0.90. If the fit is poorer than this, then the model can be re-specified.

Fifth, the model may be re-specified. The first step is to re-specify the measurement model, which is then analyzed to assess fit. If the fit is acceptable, the SEM model can then be analyzed. The model is then assessed and re-specified, if need be, based upon statistical output and theoretical relevance. Finally, conclusions can be drawn from the analyses and compared against theory.
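These reporting conventions can be collected into a small helper function; the sketch below simply encodes the cutoffs quoted in this chapter (χ2/df < 3, CFI ≥ 0.95, SRMR ≤ 0.08, RMSEA ≤ 0.06) and uses hypothetical fit values, so it is an aid to reading output rather than a substitute for theoretical judgment.

```python
def summarize_fit(chi2, df, p_value, cfi, srmr, rmsea):
    """Apply the fit-reporting heuristics described above to one fitted model."""
    return {
        "chi2 nonsignificant (p > 0.05)": p_value > 0.05,
        "chi2/df < 3": (chi2 / df) < 3,
        "CFI >= 0.95": cfi >= 0.95,
        "SRMR <= 0.08": srmr <= 0.08,
        "RMSEA <= 0.06": rmsea <= 0.06,
    }

# Hypothetical values for a fitted model (not taken from Table 7.1)
print(summarize_fit(chi2=45.2, df=29, p_value=0.028,
                    cfi=0.97, srmr=0.04, rmsea=0.05))
```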

SAMPLE SIZE AND MODEL FIT

Sample size is important for goodness of fit because the estimation procedures used to calculate the model parameters require a sample large enough to obtain meaningful parameter estimates.10 Some have suggested that the sample size needs to be more than 25 times the number of parameters, with a minimum subject-to-parameter ratio of 10:1; the lower bound of the total sample size should be approximately 100–200.11 Bentler proposed that when sample sizes are small, multiple competing models should be tested; if some of the models are rejected using the fit indices, then the sample size is probably large enough because there is enough power to reject competing models.
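These rules of thumb translate into simple arithmetic; the small sketch below is only an illustration of the heuristics just cited (the 10:1 subject-to-parameter ratio, the 25-times rule, and an absolute lower bound), not a sample-size formula in its own right.

```python
def minimum_sample_size(free_parameters, ratio=10, times_rule=25, floor=200):
    """Most conservative of the sample-size heuristics quoted above."""
    return max(ratio * free_parameters,       # 10:1 subject-to-parameter ratio
               times_rule * free_parameters,  # 25 x the number of parameters
               floor)                         # lower bound of roughly 100-200

print(minimum_sample_size(20))   # a model with 20 free parameters -> 500
```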


STRENGTHS AND WEAKNESSES OF SEM

SEM is a theory-strong approach. It supersedes other multivariate analyses because it can model the relationships between “error-free” latent variables by partialing out measurement error from multiple, “imperfectly reliable” indicators. Furthermore, SEM can be used for both cross-sectional and longitudinal studies. A good fit does not imply a strong effect on the dependent variable. Even a high proportion of explained variance is no proof of causality; the best we can expect from SEM is evidence against a poor model but never proof of a good one. Well-designed, theory-strong longitudinal studies, however, can provide powerful evidence for the effect of one variable on another.

SEM does have a number of weaknesses. SEM is dependent upon a well-developed theory. If SEM is used primarily for exploratory purposes, then model fit may be more a function of statistical fit than of theoretical fit. Furthermore, if the model is misspecified due to weak theory, unclear hypotheses, or poor study design, the causal relationships between the variables will be misinterpreted.11 SEM, like other analytic procedures, cannot compensate for unreliable assessment (see Chapters 8 and 9). If measuring instruments with poor reliability are employed, the SEM will be laden with error. The use of only one measure to identify a latent variable is also a weakness, in that it reduces the amount of variability that can be identified in the latent variable, thus producing a biased measurement. SEM cannot compensate for instruments with poor reliability and validity, poorly specified theoretical models, inadequate sampling, or misinterpretation of the fit indices.

WHERE SEM WOULD BE USEFUL IN MEDICAL EDUCATION RESEARCH

Although identified as a potentially powerful research tool, SEM has not been used much in health sciences education research. Some interesting SEM work has been done in medical education research, such as CFA and path analyses. SEM has been used to test competing hypothetical models of how basic science knowledge and clinical knowledge are used in diagnostic reasoning to assess medical expertise,12 to assess the predictive validity of various standard and nonstandard selection criteria on medical school performance,13 to assess students’ motivation for choosing a specialization,14 to assess a causal model of the influence of educational interventions and motivation in problem-based learning,6 to assess the factor structure of a psychometric tool measuring readiness to engage in self-directed learning, and to assess the stability of communication skills in medical students as measured in objective structured clinical examinations (OSCEs) administered at different times,15 but none of these approaches has employed the full SEM model of latent variable path analysis.

Researchers have created an LVPA assessing the predictive validity of the MCAT and UGPA on standardized outcome measures (USMLE Steps 1–3 examinations)


as mediated through latent variables such as “undergraduate achievement,” “aptitude for medicine,” and “performance in medicine.”12 Potential applications of SEM in medical education research could include longitudinal measurement of student learning, verifying the predictive power of the selection process, predictive validity studies using cognitive and non-cognitive factors as predictors of student success, testing aspects of clinical performance, and test scale development.
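To show what specifying such a model can look like in practice, here is a hypothetical sketch using the Python package semopy and its lavaan-style syntax; the variable names, the simulated data, and the model itself are illustrative assumptions rather than the analysis reported in the next section.

```python
import numpy as np
import pandas as pd
import semopy  # assumed available; accepts lavaan-style model descriptions

# Hypothetical measurement model (latent =~ indicators) and structural model (latent ~ latent)
model_desc = """
AptitudeForMedicine   =~ mcat_ps + mcat_bs + mcat_vr + ugpa
PerformanceInMedicine =~ step1 + step2ck + step3
PerformanceInMedicine ~ AptitudeForMedicine
"""

# Simulated stand-in data so that the sketch runs end to end
rng = np.random.default_rng(0)
n = 500
aptitude = rng.normal(size=n)
performance = 0.6 * aptitude + rng.normal(scale=0.8, size=n)
df = pd.DataFrame({
    "mcat_ps": aptitude + rng.normal(scale=0.5, size=n),
    "mcat_bs": aptitude + rng.normal(scale=0.5, size=n),
    "mcat_vr": aptitude + rng.normal(scale=0.7, size=n),
    "ugpa":    aptitude + rng.normal(scale=0.9, size=n),
    "step1":   performance + rng.normal(scale=0.5, size=n),
    "step2ck": performance + rng.normal(scale=0.5, size=n),
    "step3":   performance + rng.normal(scale=0.6, size=n),
})

model = semopy.Model(model_desc)
model.fit(df)                      # maximum likelihood estimation by default
print(model.inspect())             # parameter estimates
print(semopy.calc_stats(model))    # chi2, df, CFI, RMSEA, AIC, and related fit indices
```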

LATENT VARIABLE PATH ANALYSIS

The following LVPA was performed as a confirmatory method to evaluate the findings of previous studies that have used correlation-based methods (e.g., Pearson’s r, multiple regression) to test the predictive validity of the MCAT and other cognitive, achievement, and demographic variables on the performance of medical students on licensure examinations. A total of 548 physicians (292 men, 53.3%; 256 women, 46.7%) who had graduated from Wake Forest School of Medicine from 2009 to 2014 participated. There were three types of data: (1) admissions data, consisting of undergraduate GPA and MCAT subtest scores (physical sciences, biological sciences, and verbal reasoning); (2) course performance data (Year 1, Year 2, and Year 3); and (3) performance on the NBME exams (i.e., Step 1, Step 2 CK, and Step 3).12 The data were used to test the fit of a hypothesized model using latent variable path analysis. In this model, the development of elaborate knowledge networks evolves through a process of biomedical knowledge acquisition, practical clinical experience, and an integration or encapsulation of both theoretical and experiential knowledge. The encapsulation process of basic science knowledge begins as soon as medical students are introduced to real patients through clinical encounters or presentations.16

Figure 7.2 shows the final three-latent variable model with parameter estimates and goodness-of-fit indices, GFI, CFI, SRMR, and RMSEA. In this ML estimation, the theoretical structure of the model is supported with the significant correlation (r = 0.38, p  0.70 with two raters, at least eight to nine tasks are required. With three raters, fewer tasks will be required (perhaps 4–5). The precise number of raters and tasks that are optimal will ultimately depend on the nature of the tasks (e.g., spoken language, clinical skills, etc.) and the competence and training of the raters.

G-studies are designed to assess the dependability of a particular assessment technique, while decision studies (D-studies) are designed to gather data for decisions about individuals. D-studies rely on the evidence generated by G-studies to design dependable measures for a particular decision and a particular set of facets about which we would like to generalize. The goal is to design a measure that samples sufficient numbers of instances from different facets of a universe of observations to yield a dependable estimate of the universe score of the assessment.
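The D-study logic can be sketched numerically for the simplest single-facet (person × rater) design, where the projected generalizability coefficient for n′ raters is Eρ² = σ²p / (σ²p + σ²res/n′); the variance components below are made up for illustration, and the multi-facet case (raters and tasks) extends the same idea.

```python
def g_coefficient(var_person, var_residual, n_raters):
    """Projected generalizability coefficient for a person x rater design."""
    return var_person / (var_person + var_residual / n_raters)

# Hypothetical variance components from a G-study
var_p, var_res = 30.0, 10.0
for n in (1, 2, 3, 5):
    print(n, "raters:", round(g_coefficient(var_p, var_res, n), 3))
```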

ITEM RESPONSE THEORY

IRT, also known as latent trait theory, is based on the relationship between performance on a test item and the ability or trait that the item was designed to measure. The items aim to measure the ability (or trait) that underlies the performance. In many medical education assessment situations, there is an underlying variable, or latent trait, of interest that is easily intuitively understood, such as “professionalism” or “diagnostic reasoning.” Latent traits (also called constructs) are not directly observable (see Chapter 7). Although such a trait can be easily described, and its attributes listed, it cannot be measured directly.

IRT models, in general use with dichotomous items (e.g., MCQs scored as right-wrong), differ only in terms of the number of parameters describing the relationship between the examinee and the test item. The standard (i.e., unidimensional) IRT models include the one-parameter (Rasch), the two-parameter, and the three-parameter logistic models. Each of these allows us to compute the probability that an examinee will answer a specific test item correctly based on characteristics of the item and the examinee’s knowledge or ability. Item characteristic curves (ICCs) are graphical functions that represent the probability of getting the item correct as a function of the examinee’s ability. An item’s location is defined as the amount of the latent trait needed to have a 0.50 probability of getting the item correct (Figure 9.4). The trait level is labelled ability (θ), with mean = 0 and standard deviation = 1. Like z-scores, the values of θ typically range from −3 to +3. IRT offers the important advantage of estimating an examinee’s θ on the trait.

IRT includes the one-parameter logistic (1PL) model, which employs only a single item parameter (difficulty) in estimating ability, θ. The two-parameter logistic (2PL) model employs both difficulty and discrimination, and the three-parameter logistic (3PL) model characterizes the probability as a function of the examinee’s ability with three parameters representing the item’s discriminating power, difficulty level, and guessing.
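The three standard models can be written directly as logistic functions of θ; in the sketch below (with illustrative parameter values), setting the guessing parameter c to 0 gives the 2PL, and additionally fixing the discrimination a to a common constant gives the 1PL (Rasch) model.

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """3PL item response function: probability of a correct response given theta.

    a = discrimination, b = difficulty (location), c = lower asymptote (guessing).
    c = 0 reduces this to the 2PL; fixing a to one common value as well gives the 1PL.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Probability of a correct response at several ability levels for one illustrative 3PL item
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(p_correct(theta, a=1.5, b=0.5, c=0.2), 3))
```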


Figure 9.4  Item characteristic curve (ICC). [The figure plots the probability of a correct response against ability (θ, from −3 to +3), marking the item’s location and its discrimination (D).]

The difficulty of an item is the percentage or proportion of people who got the item correct. If everyone gets the item correct, it is an easy item; if very few test-takers get the item correct, it is a very difficult item. Item discrimination has to do with the extent to which an item distinguishes or discriminates between high-test scorers and low-test scorers. These item characteristics are discussed in detail in Chapter 15.

An item’s location (on the x-axis) is defined as the amount of the latent trait needed to have a 0.50 probability of correctly answering the item. The higher the difficulty parameter (P—analogous to CTT), the higher the trait level an examinee needs in order to get the item correct. The item discrimination (D) indicates the steepness of the ICC at the item’s location. It indicates how strongly the item is related to the latent trait. Items with high discrimination are better at differentiating examinees around the location point; conversely, items with low discrimination are poorer at differentiating. Figures 9.5 and 9.6 illustrate ICCs with varying D and θ levels. Figure 9.7 shows two ICCs with different discrimination and θ and a guessing parameter (Curve 1), indicated by the lower asymptote. The inclusion of a guessing parameter indicates that examinees low on the trait may still get the item correct.

Analyses with more parameters tend to provide more information about the item than do analyses with fewer parameters, as depicted in Figure 9.8. The one-parameter ICC (1PL—right) is flatter and represents a more difficult item than the other two, with a 0.50 probability at about θ = 1.5. The two-parameter ICC (2PL—left) is sharper, with θ = −1.0. Finally, the 3PL (middle) has the most information, showing a lower asymptote with sharp discrimination and, for a 0.50 probability, θ = 0.


Figure 9.5  Two ICCs with the same discrimination but different θ. [Probability of a correct response plotted against ability (θ); the two curves have the same discrimination (D) but different locations.]

Figure 9.6  Two ICCs with different discrimination and different θ. [Probability plotted against ability (θ); the curves differ in both discrimination (D) and location.]

The nature of ICCs with varying item discrimination and difficulty is summarized in Table 9.4. Item discrimination and difficulty are independent of each other. Graphically, discrimination is depicted as flat when it is low and S-shaped when it is high. Difficulty places the curve along the x-axis, with difficult items to the left of 0 and easy items to the right of 0.


Figure 9.7  ICCs with different discrimination and θ and a guessing parameter (Curve 1). [Probability plotted against ability (θ); Curve 1 has a nonzero lower asymptote, reflecting guessing.]

Figure 9.8  ICC with 1PL (right), 2PL (left), and 3PL (middle). [Probability plotted against ability (θ) for the three logistic models.]


Table 9.4  Summary of the item characteristic curve for difficulty and discrimination

Item difficulty   Item discrimination   Curve characteristic
Easy item         Low                   Flat, with p(θ) > 0.50; ICC to the right of 0
Easy item         Medium                Sharper, with p(θ) > 0.50; ICC to the right of 0
Easy item         High                  S-shape, with p(θ) > 0.50; ICC to the right of 0
Moderate item     Low                   Flat, with p(θ) ≈ 0.50; ICC around 0
Moderate item     Medium                Sharper, with p(θ) ≈ 0.50; ICC around 0
Moderate item     High                  S-shape, with p(θ) ≈ 0.50; ICC around 0
Difficult item    Low                   Flat, with p(θ) < 0.50; ICC to the left of 0
Difficult item    Medium                Sharper, with p(θ) < 0.50; ICC to the left of 0
Difficult item    High                  S-shape, with p(θ) < 0.50; ICC to the left of 0

When an item has 0 discrimination, all difficulty levels result in the same horizontal line at a value of p(θ) = 0.50, because the value of the item difficulty for an item with no discrimination is undefined. As summarized in Table 9.4, when an item is easy, p(θ) = 0.50 occurs at a low ability level, and when it is difficult, this value occurs at a high ability level.

ADVANTAGES AND DISADVANTAGES OF IRT

All theories that are concerned with reliability, such as CTT, G-theory, and IRT, focus on measurement error. Central to all three theories is the nature of measurement error.

Advantages of IRT compared to both CTT and G-theory
● IRT provides several improvements in assessing items and people
● The difficulty of items and the ability of people are scaled on the same metric
● The difficulty of an item and the ability of a person can be compared
● IRT models can be test and sample independent, thus providing greater flexibility where different samples or test forms are used
● These qualities of IRT are the basis for computerized adaptive testing

Disadvantages of IRT compared to both CTT and G-theory
● Scoring generally requires complex procedures and understanding compared to CTT
● CTT provides simple-to-understand single reliability indices such as Cronbach’s alpha
● In IRT, analogous indices such as the separation index are difficult and inaccessible
● It is difficult to decompose the specific sources of error in IRT, while G-theory is designed, in part, for this purpose
● G-theory allows generalization of dependability to well-defined universes
● Compared to CTT, both G-theory and IRT have low usability or feasibility

REFLECTIONS AND EXERCISES

Reflections 9.1
The following analysis of variance table (Table 9.5) consists of the MS for 125 students assessed by three observers in an OSCE station assessing chest pain. This was a fully crossed design. Calculate the generalizability coefficients for assessors rating all 125 students.
● Three raters ρ²
● Two raters ρ²
● One rater ρ²

Table 9.5  Analysis of variance summary table
Source       df     MS       E(MS)
Person       124    100.60   σ²ε + n_o σ²p
Observers      2     53.40   σ²ε + n_p σ²o
Residual     362     10.20   σ²ε

Exercise 9.1
Steps in calculating generalizability coefficients
1. Calculate the variance components
2. Determine which components contribute to universe and obtained score variance in D-study designs [using E(MS)]
3. Calculate the ρ²s (person variance components)

Exercise 9.2
Calculate the variance components from this G-study
1. MS_p×o = σ²ε
2. MS_p = σ²ε + n_o σ²p
3. n_o σ²p
4. n_o = 3, σ²p
5. MS_o = σ²ε + n_p σ²o
6. n_p σ²o
7. n_p = 125, σ²o

Exercise 9.3
The variance components are
1. σ²ε
2. σ²p
3. σ²o

Exercise 9.4
The generalizability coefficients are
1. Three raters ρ²
2. Two raters ρ²
3. One rater ρ²

REFERENCE

1. Cronbach, L.J., Rajaratnam, N., & Gleser, G.C. (1963). Theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 16, 137–163; Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: John Wiley.

SECTION III  TEST CONSTRUCTION AND EVALUATION

In Chapter 10, the various formats of testing for cognition, affect, and psychomotor skills are summarized. These include selection type items (multiple-choice questions), constructed response, checklists, and other item formats. There is a discussion of Bloom’s taxonomies of cognitive, affective, and psychomotor skills, Miller’s Pyramid, and further work on the use of tables of specifications.

Chapter 11 deals specifically with multiple-choice items, especially how to write items, types of MCQs, number of options, how to construct MCQs, and appropriate levels of measurement. MCQs are objective tests because they can be scored routinely according to a predetermined key, eliminating the judgments of the scorers. The multiple-choice item consists of several parts: the stem, the keyed-response, and several distractors. The keyed-response is the right answer or the option indicated on the key. All of the possible alternative answers are called options. The stems should include verbs such as define, describe, identify (knowledge); defend, explain, interpret (comprehension); and predict, operate, compute, discover, apply (application).

In Chapter 12, constructed response items are detailed. These include essays, short answer, matching, and hybrid items such as the Script Concordance Tests. These item types are sometimes also called subjective tests because scoring involves subjective judgments of the scorer. Constructed-response questions are usually intended to assess students’ ability at the higher levels of Bloom’s taxonomy: application, analysis, synthesis, and evaluation. There are at least two types of essay questions, the restricted and extended response forms. The restricted response question limits the character and breadth of the student’s composition.


There are two basic approaches to the scoring of constructed-response test questions: analytic scoring and holistic scoring.

Measuring clinical skills with the objective structured clinical exam (OSCE) is discussed in Chapter 13. The focus is on describing the OSCE and its use in measuring communications, patient management, clinical reasoning, diagnoses, physical examination, case history, etc. Additional issues of case development and scripts, training assessors, and training standardized patients are detailed. Content checklists and global rating scales are used to assess observed performance of specific tasks. Candidates circulate around a number of different stations containing various content areas from various health disciplines. Most OSCEs utilize standardized patients (SPs), who are typically actors trained to depict the clinical problems and presentations of real issues commonly taken from real patient cases. OSCE validity concerns are (1) content based—defining the content to be assessed within the assessment; (2) concurrent based—this requires evidence that OSCEs correlate with other competency tests that measure important areas of clinical ability; and (3) construct based—one of the ways of establishing construct validity with OSCEs is through differentiating the performance levels between groups of examinees at different points in their education.

Chapter 14 deals with checklists, questionnaires, rating scales, and direct observations. Checklists, rating scales, rubrics, and questionnaires are instruments that state specific criteria to gather information and to make judgments about what students, trainees, and instructors know and can do on outcomes. They offer systematic ways of collecting data about specific behaviors, knowledge, skills, and attitudes and can be used at higher levels of Miller’s Pyramid. A checklist is an instrument for identifying the presence or absence of knowledge, skills, and behaviors. Surgery has made extensive use of checklists to improve patient safety. A rating scale is an instrument used for assessing the performance of tasks, skill levels, procedures, processes, and products. Rating scales allow the degree or frequency of the behaviors, skills, and strategies to be rated. Rating scales state the criteria and provide selections to describe the quality or frequency of the performance. A five-point scale is preferred because it provides the best all-around psychometrics.

Chapter 15 is about evaluating tests and assessments using item analyses: conducting classical item analyses with MCQs (item difficulty, item discrimination, and distractor effectiveness), conducting item analyses with OSCEs, conducting item analyses with constructed response items, and computing the reliability coefficient and errors of measurement. A complete analysis of a test requires an item analysis together with descriptive statistics and reliability. There are three essential features of an item analysis for multiple-choice questions: (1) difficulty of the item, (2) item discrimination, and (3) distractor effectiveness. All of these criteria apply to every other test or assessment form (e.g., OSCE, restricted essay, extended essay, survey) except for distractor effectiveness, since there are no distractors in these test formats. The difficulty of the item is the percentage or proportion of people who got the item correct. If everyone gets the item correct, it is an easy item; if very few


test-takers get the item correct, it is a very difficult item. Item difficulty (P) is usually expressed as a proportion such as P = 0.72 (72% got it correct). Item discrimination (D) is the extent to which an item discriminates between high-test scorers and low-test scorers. The point-biserial is commonly used for item discrimination, D. Distractor effectiveness refers to the ability of distractors in attracting responses.

Grading, reporting, and methods of setting cutoff scores (pass/fail) are discussed in Chapter 16. The focus is on norm-referenced versus criterion-referenced approaches, setting a minimum performance level (MPL) utilizing the Angoff method, the Ebel method, and the Nedelsky method. Grading systems can be based on norm-referenced (grading on the curve) or on criterion-referenced bases. All of these systems have some problems associated with them. The main and historically oldest symbols for grading are the letter grades, usually ranging from A to F. Substitutes have been attempted but have achieved little success because these usually involve reducing the number of categories (e.g., good, satisfactory, unsatisfactory). Numerical grades (percent or 1–10) have also met with limited success, as have pass/fail systems and checklists of objectives, as well as empirical methods (borderline regression, cluster analyses, etc.).

The Dean’s Letter or Medical Student Performance Evaluation employs multiple reporting systems dealing with the cognitive, affective, and skills domains. There are provisions for reporting learner achievement, effort, attitudes, interest, social and personal development, and other noteworthy outcomes. The wealth of information on the MSPE, however, should be balanced against the need for brevity and simplicity so that it can be readily understood.
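The item difficulty (P) and point-biserial discrimination (D) summarized above for Chapter 15 can be computed directly from a 0/1 score matrix; the sketch below uses a tiny set of made-up responses.

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (hypothetical data)
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
])

difficulty = scores.mean(axis=0)     # P: proportion of examinees answering each item correctly
total = scores.sum(axis=1)           # total test score for each examinee

# Point-biserial discrimination: correlation between each item (0/1) and the total score
point_biserial = np.array([
    np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])
])

print(np.round(difficulty, 2))
print(np.round(point_biserial, 2))
```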

10  Formats of testing for cognition, affect, and psychomotor skills

ADVANCED ORGANIZERS
• It is common practice to think of human behavior in three categories or domains: cognition (thinking), affect (feeling), and psychomotor (doing). Although we tend to discuss human behavior in these separate ways, most behavior involves all three aspects or domains.
• Thurstone’s negative exponential or hyperbolic model learning curve represents a principle of learning in a variety of disciplines and for a variety of learners; learning increases rapidly with practice but attains an upper limit quickly and then flattens out.
• Notwithstanding scientific advances in the study of thinking, perception, memory, and learning, much of assessment in medicine and the other health professions is based on lore, intuition, anecdotal evidence, and personal experience. The use of Bloom’s taxonomies is helpful in systematizing assessment.
• Much of the testing in these domains involves selection items (e.g., multiple choice) or constructed response items (e.g., essays).
• Jean Piaget’s theory of intelligence is that we adapt to the world employing the dual cognitive processes of assimilation and accommodation. The theory both helps to explain key ideas such as deliberate and interleaved practice, retrieval practice, and learning styles and suggests how to assess them.
• The script-concordance test (SCT) is used to assess the ability to interpret medical information under conditions of uncertainty, tapping Piaget’s highest levels of cognitive functioning. There still remain a number of challenges in assessing the processes of problem-solving, abstract reasoning, and meta-cognition.
• Students should be continuously tested so as to capitalize on the testing effect. Using retrieval practice with testing—working memory to recall or retrieve facts or knowledge—is more effective than reviewing content or re-reading text.
• Both the experimental work and the correlational-based research suggest the need for purposeful, direct teaching for integration and encapsulation of basic sciences into clinical reasoning and clinical skills. This integration and encapsulation needs to be assessed and tested dynamically to determine the cognitive processes and outcomes of such pedagogy.
• Bloom is credited for developing a taxonomy in the affective (feeling) domain that provides a framework for teaching, training, assessing, and evaluating the effectiveness of training as well as curriculum design and delivery.
• In the affective domain, the structure of professionalism, aptitude, achievement, and personality is accounted for by Aptitude for Science (cognitive variables and Extraversion, a characteristic of working with people, e.g., in medicine); Aptitude for Medicine (cognitive variables and Conscientiousness and, inversely, Neuroticism); Professionalism (composed of the peer assessment and Openness); General Achievement (composed in almost equal parts of GPA and Agreeableness); and Self-Awareness.
• In health sciences education, psychomotor skills and technical skills—rarely defined in the health sciences literature—most commonly refer to motor and dexterity skills associated with carrying out physical examinations, clinical procedures, and surgery, and also with using specialized medical equipment.
• Miller proposed a framework for assessing levels of clinical competence that encompasses the cognitive, affective, and psychomotor domains. In the pyramid, the lowest two levels test cognition (knowledge); this is the basis for initial learning of an area. The higher levels (shows, does) require more complex integration in the cognitive, affective, and psychomotor domains.
• Direct observations in situ or objective structured performance-related examinations (OSPREs) are methods to assess the higher levels of Miller’s Pyramid.

INTRODUCTION

It is common practice in education and psychology to think of human behavior in three categories or domains: cognition (thinking), affect (feeling), and psychomotor (doing). Although we tend to discuss behavior in these separate ways, most behavior involves all three aspects or domains. The physician working with a patient, for example, is collecting information, arriving at a clinical impression and perhaps a diagnosis (thinking); making eye contact, nodding, smiling, showing empathy (feeling); and palpating, percussing, and touching (doing). Aspects of all three domains are employed in complex, interactive ways. Nonetheless, for purposes of instructional design and assessment, we tend to approach these tasks in their separate domains.


While recognizing the interrelatedness of all three domains, in this chapter we will emphasize and discuss each in turn. Much of formal testing and assessment in education, in general, is in the cognitive domain (e.g., sciences, humanities, mathematics, etc.). Less formal testing occurs in the affective domain (e.g., music and art appreciation) and the psychomotor domain (e.g., lighting a Bunsen burner, keyboarding). In this chapter, we will in turn discuss assessment in the three domains with particular reference to medical and health profession education.

THE COGNITIVE DOMAIN

The cognitive domain has to do with thinking. We have learned a great deal about cognition, learning, and memory since Hermann Ebbinghaus’ pioneering work more than 100 years ago.1 His focus was on developing quantitative descriptions of psychological processes such as learning and forgetting and the precise mathematical assessment of these processes. Louis Thurstone was similarly preoccupied with systematizing the assessment of learning processes into mathematical equations, as well as with the development of factor analysis (Chapters 4 and 6) and theories of intelligence. Thurstone’s negative exponential or hyperbolic model learning curve represents a principle of learning in a variety of disciplines and for a variety of learners.2 This model depicts learning as increasing rapidly with practice but attaining an upper limit quickly and then flattening out. Figure 10.1 depicts data on the growth of aspects of medical competence (physical exam, medical interview, clinical reasoning) during clinical clerkships.

Figure 10.1  Negative exponential growth curves of competence. [Growth curves for competence during clinical clerkships: performance on the physical exam (PE), interview, clinical reasoning, and management plan plotted against days in clerkships (15–150).]


Cognitive and educational psychologists have made scientific advances in the study of thinking, perception, memory, learning, and teaching in the past 100 years. Still, much of teaching and assessment in medicine and the other health professions is based on lore, intuition, anecdotal evidence, and personal experience. The use of Bloom’s taxonomy is helpful in systematizing assessment. Bloom’s taxonomy of the cognitive domain is described in detail in Chapter 5. It is a simple, practical, and useful classification for assessment, learning, and teaching. Recall that there are six major categories, and the general objectives at each level provide the action verbs that are useful in writing objectives or constructing test items in each of these categories. The taxonomy also helps to distinguish cognition at the higher levels (i.e., analysis, synthesis, evaluation, creating) from the more basic levels (i.e., knowledge, comprehension, application). While knowledge can span all six categories, most assessment, teaching, and learning in medicine occurs at the first three levels. Much of the testing in these domains involves selection items or constructed response items.

Selection items

The most common type of selection item is the multiple-choice question (MCQ). Other forms of selection items include true-false and matching. In this item type, the answer is selected from a list following a prompt or a question; hence the term selection items. The construction and use of MCQs are detailed in Chapter 11. The other selection-type items (true-false, matching) are so flawed from an assessment perspective that they are not used much and should be avoided in health sciences education.

Constructed response items

The most common type of constructed response item is the essay. These are also referred to as open-ended items. Unlike MCQs, there is no list from which to select the answer. The student must construct the answer following a prompt or question. The major advantage that constructed response items have over selection items is that learning in the highest domains of Bloom’s taxonomy can be assessed. In this item type, students can demonstrate fluency, flexibility, originality, elaboration, language mastery, and creativity in their responses. The construction and use of constructed response items are detailed in Chapter 12.

Cognitive processes and development

Jean Piaget’s theory3 has several useful key ideas for learning and assessment in the health sciences. His theory of intelligence is that we adapt to the world employing the dual cognitive processes of assimilation (the process by which material is taken into cognitive structures—subsumption) and accommodation (the change made to cognitive structures as a result of assimilation). These cognitive structures are called schemas and are representations of perceptions, ideas, and/or actions which go together. In its most practical sense then, Piaget’s


theory both helps to explain key ideas such as deliberate and interleaved practice, retrieval practice, and learning styles and how to assess them.

Assimilation takes in information through sensory input and attentional selection. This information is fitted into existing schemas until the schemas can no longer hold new information. Accommodation occurs by restructuring, elaborating, or discarding the schema. Meaningful improvement in learning, therefore, results when the schemas or cognitive structures are altered. Such accommodation requires effortful learning. Organizations of schemas, or cognitive maps, help students organize their learning and facilitate the assimilation of new information and the subsequent accommodation by expanding, altering, or reorganizing schemas.

Piaget proposed that human development proceeds through four stages of cognition: the sensorimotor, preoperational, concrete operational, and formal operational periods. The sensorimotor stage is where infants construct knowledge and understand the world through sensorimotor interactions with objects (e.g., grasping, sucking, and tactile feeling). Infants progress from reflexive, instinctual action at birth to the beginning of symbolic thought (i.e., language) by age 18–24 months or so. The second stage is the preoperational period, where children’s behavior is mainly characterized by symbolic play and manipulating symbols. This preoperational stage is characterized by intuitive, ego-centric behavior that lacks logic (ages 2–6 years). Behavior is intuitive and impulsive. The third stage (ages 7–11 years), concrete operations, is characterized by the use of logic. At this stage, children apply logic mainly to concrete events or objects. They may be able to use inductive reasoning—drawing inferences from observations in order to make a generalization—but not deductive reasoning (using a generalized principle in order to try to predict the outcome of an event). The final stage, the formal operational stage (ages 11–13 years; adolescence and beyond), is characterized by the logical use of symbols and abstract concepts. At this point, the person is capable of hypothetical and deductive reasoning. The adolescent or adult can employ hypothetical, counterfactual thinking (what-if), which is frequently required in science and healthcare. Other features of this stage include

● Abstract thought that considers possible outcomes and consequences of actions
● Metacognition, the capacity for thinking about thinking, which allows reflection on one’s own thought processes and monitors them
● Problem-solving, with the ability to systematically solve a problem in a logical and methodical way

Most healthcare professionals and students should be functioning in the formal operational stage of cognitive development and therefore readily learn from symbolic content (speech, text, observation) and can be assessed accordingly. The SCT4 is used to assess the ability to interpret medical information under conditions of uncertainty. It presents: (1) poorly defined clinical situations in which respondents must choose between several realistic options; (2) a format that allows flexible responses, thus mirroring cognitive processes in problem-solving


situations; and (3) scoring that takes into account the variability of experts’ responses to clinical situations. This type of assessment is rooted in the highest stage of Piaget’s cognitive reasoning. It is as important to assess or test the cognitive processes of this stage as it is to test the content of clinical reasoning, as the SCT attempts to do. There still remain a number of challenges in developing assessments that measure not only the clinical content but also the processes of problem-solving, abstract reasoning, and meta-cognition.

Continuous learning and assessment

With continuous learning, students acquire knowledge and increase learning intelligence (learning how-to-learn) through meta-cognition. The human brain is remarkably changeable. This phenomenon, known as neuroplasticity, refers to changes in neural pathways, synapses, and myelination due to learning, thinking, and changes in behavior. Although the architecture and gross structure of the brain are largely genetically determined, the detailed structure and neural networks are shaped by experience and can be modified.5 A recent naturalistic study has elegantly demonstrated this. To become a licensed taxi driver in London, trainees must learn the complex layout of London’s streets over 4 years. Trainees who passed the tests of London streets had an increase in gray matter volume in their posterior hippocampi in brain scans. Controls and trainees who failed did not have structural brain changes.6

Continuous learning is valuable for its own sake and because it results in building new neural connections and intellectual capabilities. Students learn content and can increase their intelligence, both admirable educational outcomes. Both instructors and students need to work hard to form the cognitive structures and neural networks of meaningful learning that is deep and durable.

Additionally, students should be continuously tested so as to capitalize on the testing effect. Using retrieval practice with testing—working memory to recall or retrieve facts or knowledge—is more effective than reviewing content or re-reading text. Long-term memory is increased when some of the learning period is devoted to retrieving the information to-be-recalled. Testing practice produces better results than other forms of studying. Students who test themselves during learning or practice recall more than students who spend the same amount of time re-reading the complete information. Retrieving information for a test will alter and strengthen it. Taking a test teaches students about a subject, and they will perform better on a subsequent test than students who only reviewed or re-read the material. This is called the forward effect of testing. A recent randomized study of neurology continuing medical education courses compared the effects of repeated quizzing—test-enhanced learning—and repeated studying on retention. A final test covering all information points from the course was taken 5 months after the course. Performance on the final test by neurologists showed that repeated quizzing led to significantly greater long-term retention (almost twice as much) relative to both repeated studying and no further exposure.7


THREE COMPETING THEORIES OF CLINICAL REASONING

Schmidt et al.8 proposed a cognitive structure of medical expertise model based on the accumulation of clinically relevant knowledge about disease signs and symptoms referred to as illness scripts. In this model, the development of elaborate knowledge networks evolves through a process of biomedical knowledge acquisition, practical clinical experience, and an integration or encapsulation of both theoretical and experiential knowledge. There is some support for the knowledge encapsulation theory, which postulates that basic science knowledge has an indirect influence on diagnostic reasoning by contributing directly to clinical knowledge. A large-scale study employing more than 20,714 medical graduates lent support to the encapsulation theory.8 In this model, the biomedical or basic science knowledge in the first 2 years of medical school precedes and is eventually incorporated during the acquisition of clinical knowledge. Accordingly, basic science knowledge becomes encapsulated or reorganized into causal representations of illness scripts that lead to the formal process of diagnostic and clinical reasoning.

In the distinct world model, basic science knowledge and clinical knowledge are unique domains with their own distinct structures and characteristics.9 Diagnostic or clinical reasoning, from the clinical knowledge obtained from patient presentations of signs and symptoms, develops into a taxonomy of disease. The activation of clinical knowledge is primary in the diagnostic or clinical reasoning process. Although this process rarely relies on the strict use of basic science knowledge, pathophysiological explanations can provide further support for the explanation of clinical phenomena.

Another theory of medical expertise is the independent influence model. Both basic science knowledge and clinical knowledge are independently related to diagnostic performance or clinical competency. Basic science knowledge provides a foundational basis in medical diagnosis. Clinical experience and knowledge are acquired temporally (second half of medical school and residency) and geographically (on the wards and clinics) separate from basic science knowledge (in the first half, in classrooms and labs). As accurate diagnostic thinking develops with gained experience, students and residents draw on knowledge of the anatomy and physiology of the human body. In constructing clinical case representations, they recognize clinical phenomena and activate basic science knowledge to account for unexplained symptoms and to specify a diagnosis.

Experimental work

Several experimental studies have addressed the question of whether and how the basic sciences can become integrated into clinical reasoning or clinical skills. Employing medical students, physician assistant students, and nursing students, researchers compared a spatial and temporal proximity of clinical content to basic sciences versus a purposeful, explicit teaching model that exposes relationships


between the domains. They concluded that proximity alone is insufficient for integration but that explicit and specific teaching may be necessary to facilitate integration for the learner. Integration requires teaching basic science information in a manner that creates a relationship between basic science and clinical science, thus resulting in conceptual coherence—mental representations of clinical and basic science information that can help learners find meaning in clinical problems.10

In another study, undergraduate dental and dental hygiene students (n = 112) were taught the radiographic features and pathophysiology underlying four intrabony abnormalities with a test-enhanced (TE) condition (thought to enhance learning and cognitive integration) and a study (ST) condition. TE participants outperformed those in the ST condition group on immediate and delayed diagnostic testing, but there were no differences in a memory test between the groups. The inclusion of the basic science test appears to have improved the students’ understanding of the underlying disease mechanisms learned and also improved their performance on a test of diagnostic accuracy.11

Both the experimental work and the correlational-based research (e.g., Pearson’s r, regression, and latent variable path analyses) suggest the need for purposeful, direct teaching for integration and encapsulation of basic sciences into clinical reasoning and clinical skills. This integration and encapsulation needs to be assessed and tested dynamically to determine the cognitive processes and outcomes of such pedagogy. Such complex hypothesized models of medical expertise (aptitude for medical school, basic science achievement, and clinical competency) employing longitudinal data can be studied within structural equation modelling (SEM). Accordingly, the development of medical expertise and clinical competency can be empirically verified.

AFFECTIVE DOMAIN

The affective domain has to do with feelings and emotions. The areas in psychology that generally deal with the affective domain are personality, motivation, interests, and attitudes. Bloom is also credited with developing a taxonomy in the affective domain.12 As with the cognitive domain, the affective domain taxonomy provides a framework for teaching, training, assessing, and evaluating the effectiveness of training as well as curriculum design and delivery (Table 10.1). Table 10.2 is a list of ten typical adjectives for healthcare professionals in the affective domain. Each is followed by some behavioral descriptors. Most of these are values at the higher levels of Bloom’s taxonomy (levels 3, 4, 5).

Self-/peer assessment

Self- and peer assessment is used frequently as a tool to evaluate students, residents, and physicians on the basis of their professional behaviors. The Rochester peer assessment protocol (RPAP), a form consisting of 15 items, is an instrument

Table 10.1  Bloom’s taxonomy in the affective domain

Category: 1. Receiving
  Types of experience: Openness to experience and learning
  Behaviors: Attend, focus interest learning experience, take notes, devote time for learning experience, participate passively
  Verbs that describe the activity: Ask, listen, focus, attend, take part, discuss, acknowledge, hear, be open to, retain, follow, concentrate, read, do, feel

Category: 2. Responding
  Types of experience: React and participate actively
  Behaviors: Participate actively, interest in outcomes, enthusiasm for action, question and probe ideas, make interpretations
  Verbs that describe the activity: React, respond, seek clarification, interpret, clarify, show motivation, contribute, question, present, cite, help team, write, perform

Category: 3. Valuing
  Types of experience: Attach values and express personal opinions
  Behaviors: Evaluate worth and relevance of ideas, experiences; commit to course of action
  Verbs that describe the activity: Argue, confront, justify, persuade, criticize, challenge, debate, refute

Category: 4. Organizing values
  Types of experience: Reconcile internal conflicts; develop value system
  Behaviors: Refine personal views and reasons, state beliefs
  Verbs that describe the activity: Build, develop, formulate, defend, modify, relate, reconcile, compare, contrast

Category: 5. Internalizing values
  Types of experience: Adopt belief system and philosophy
  Behaviors: Self-reliant; consistency with personal value set
  Verbs that describe the activity: Act, display, influence, solve, practice


Table 10.2  Professional behavior evaluation

1. Self-confidence—demonstrates the ability to trust personal judgment, an awareness of strengths and limitations; exercises good personal judgment
2. Communications—speaks clearly; writes legibly; listens actively; adjusts communication as needed to various situations
3. Integrity—consistently honest, can be trusted with confidential information, complete and accurate documentation of patient care and learning activities
4. Empathy—shows compassion and respect for others, calm, compassionate, and helpful demeanor toward those in need, be supportive and reassuring
5. Respect—polite to others, does not use derogatory or demeaning terms
6. Self-motivation—initiative to complete assignments, improve and/or correct behavior, following through on tasks, enthusiasm for learning and improvement, accepting constructive feedback in a positive manner, taking advantage of learning opportunities
7. Appearance and personal hygiene—clothing is appropriate, neat, clean and well maintained, good personal hygiene and grooming
8. Time management—consistently punctual, completes tasks and assignments on time
9. Teamwork and collaboration—places the success of the team above self-interest; does not undermine the team; shows respect for all team members; remains flexible and open to change; communicates with others to resolve problems
10. Patient advocacy—free from personal bias or feelings that interfere with patient care; places the needs of patients above self-interest; protects and respects patient confidentiality and dignity

commonly used to assess professional behaviors and attitudes in medical students; it has been studied extensively for reliability and validity evidence.13 Additional research on peer assessment in medical students reveals a high correlation between pre- and post-test scores of the RPAP when it is administered longitudinally, and no bias in rater evaluation when the students being evaluated choose their own assessors.14 Table 10.3 contains typical affective-domain assessment data for both peer and self-ratings. A factor analysis was conducted for the peer questionnaire: the 15 items from the peer assessment were intercorrelated using Pearson product–moment correlations based on 120 students, the correlation matrix was decomposed into principal components, and the components were rotated to the normalized varimax criterion. Items were considered part of a factor if their primary loading was on that factor. The number of factors to extract was based on the Kaiser rule (i.e., eigenvalues > 1.0). The factor analysis showed that the peer questionnaire data decomposed into two factors that together accounted for 67.8% of the total variance: (1) work habits and (2) interpersonal habits.
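The analysis pipeline just described (Pearson intercorrelations, principal components, the Kaiser rule, and normalized varimax rotation) can be illustrated with a short script. The sketch below is a minimal illustration only: it uses a simulated 120 × 15 matrix of ratings in place of the study data, and the small varimax routine is a generic implementation rather than the analysis code used for the results reported here.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Normalized (Kaiser) varimax rotation of a loading matrix."""
    p, k = loadings.shape
    h = np.sqrt((loadings ** 2).sum(axis=1))       # square roots of communalities
    A = loadings / h[:, None]                      # Kaiser normalization of rows
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = A @ R
        u, s, vt = np.linalg.svd(
            A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_new = s.sum()
        if d_new < d_old * (1 + tol):              # converged
            break
        d_old = d_new
    return (A @ R) * h[:, None]                    # undo the normalization

rng = np.random.default_rng(0)
X = rng.normal(4.8, 0.4, size=(120, 15))           # simulated peer ratings (placeholder)

R_corr = np.corrcoef(X, rowvar=False)              # Pearson correlation matrix
eigvals, eigvecs = np.linalg.eigh(R_corr)
order = np.argsort(eigvals)[::-1]                  # sort components by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = int((eigvals > 1.0).sum())                     # Kaiser rule: retain eigenvalues > 1
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # unrotated principal-component loadings
rotated = varimax(loadings)

print("factors retained:", k)
print("% variance per factor:", np.round(100 * eigvals[:k] / R_corr.shape[0], 1))
print("rotated loadings (first 3 items):")
print(np.round(rotated[:3], 2))
```

With real data, the rotated loading matrix is then inspected to assign each item to the factor on which it loads most strongly, exactly as described in the text.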

Table 10.3  Peer and self-ratings on professionalism

Questionnaire item: Peer rating Mean (SD), Min–Max; Self-rating Mean (SD), Min–Max
1. Is prepared for sessions: Peer 4.82 (0.45), 1–5; Self 4.52 (0.57), 3–5
2. Identifies and solves problems using data: Peer 4.89 (0.37), 1–5; Self 4.72 (0.49), 3–5
3. Able to explain clearly: Peer 4.87 (0.39), 1–5; Self 4.56 (0.58), 3–5
4. Compassion and empathy: Peer 4.89 (0.39), 1–5; Self 4.85 (0.40), 3–5
5. Seeks to understand others' views: Peer 4.87 (0.39), 1–5; Self 4.85 (0.36), 3–5
6. Initiative and leadership: Peer 4.67 (0.58), 1–5; Self 4.40 (0.66), 3–5
7. Information or resource sharing: Peer 4.84 (0.43), 1–5; Self 4.59 (0.59), 2–5
8. Assumes responsibility: Peer 4.88 (0.39), 1–5; Self 4.73 (0.49), 3–5
9. Seeks and employs feedback: Peer 4.81 (0.48), 1–5; Self 4.59 (0.59), 3–5
10. Presents consistently to superiors and peers—Trustworthy: Peer 4.91 (0.37), 1–5; Self 4.90 (0.30), 4–5
11. Admits and corrects own mistakes—Truthful: Peer 4.90 (0.36), 1–5; Self 4.82 (0.41), 3–5
12. Dress and appearance appropriate: Peer 4.95 (0.31), 1–5; Self 4.89 (0.41), 2–5
13. Behavior is appropriate: Peer 4.90 (0.40), 1–5; Self 4.80 (0.50), 2–5
14. Directs own learning—Think and work independently: Peer 4.88 (0.38), 1–5; Self 4.71 (0.54), 3–5
15. I would refer family, patients, self to this future physician: Peer 4.81 (0.50), 1–5; Self not rated
Cronbach's α (full scale): Peer 0.95; Self 0.87
Peer rating factor structure: work habits (57.65% of variance, α = 0.81) and interpersonal habits (10.11% of variance, α = 0.85)

Work habits accounted for a much larger proportion of the variance than interpersonal habits (57.7% versus 10.1%—Table 10.3). Four items (#5—Seeks to understand others' views; #8—Assumes responsibility; #10—Presents consistently to superiors and peers—Trustworthy; #11—Admits and corrects own mistakes—Truthful) load on both factors (i.e., split loadings), as these items are relevant to both work and interpersonal habits. Cronbach's α for internal consistency indicated that the full scales of both the self and peer instruments had high reliability (α = 0.87 and 0.95, respectively). Both factors (subscales) of the peer questionnaire also showed high reliability (α = 0.81 and 0.85).
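Cronbach's α is straightforward to compute directly from an examinee-by-item score matrix: α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The snippet below is a generic sketch using simulated ratings, not the study data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Placeholder: 120 examinees x 15 peer-rating items on a 1-5 scale
rng = np.random.default_rng(1)
true_score = rng.normal(4.8, 0.3, size=(120, 1))
ratings = np.clip(true_score + rng.normal(0, 0.2, size=(120, 15)), 1, 5)
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```

The same function can be applied to a subscale by passing only the columns for the items that load on that factor.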

Personality and professionalism

One test that has been used to study medical student characteristics and performance is the NEO-PI, often referred to as the Big Five personality inventory. It measures five broad traits, domains, or dimensions in the affective domain: neuroticism (N), extraversion (E), openness (O), conscientiousness (C), and agreeableness (A). Conscientiousness has been found to be a predictor of performance in medical school and becomes increasingly important as students and residents advance through medical training. Other traits concerning sociability (i.e., extraversion, openness, and neuroticism) have also been found to be relevant for performance and adaptation in the healthcare environment.15 The traits of neuroticism and conscientiousness are related to stress in medical school: low extraversion, high neuroticism, and high conscientiousness result in highly stressed students (brooders), while high extraversion, low neuroticism, and low conscientiousness produce low-stressed students (hedonists).16

STRUCTURE OF PROFESSIONALISM, PERSONALITY, ACHIEVEMENT, AND APTITUDE

To explore the structure of personality, achievement, aptitude, and professionalism, a factor analysis was conducted employing the peer total score, the self total score, MCAT subtest scores, undergraduate grade point average (UGPA), and the five dimensions of the NEO-PI. One hundred and twenty medical students (64 men, 53.3%, and 56 women, 46.7%) participated in the study. Data were obtained on (1) the Rochester Peer Assessment Tool (RPAT); (2) a self-assessment survey containing the first 14 items of the peer assessment, adapted to the first person; (3) the NEO-PI, a 44-item personality inventory measuring the five personality dimensions; and (4) MCAT scores and undergraduate GPA. These variables were intercorrelated using Pearson product–moment correlations, and the resulting matrix was decomposed into principal components and then rotated to the normalized varimax criterion (converged in eight iterations). The number of factors extracted was based partly on the Kaiser rule (i.e., eigenvalues > 1.0) and partly on the theoretical meaning and cohesiveness of the factors. A close inspection of the results of this analysis in Table 10.4 shows


Table 10.4  Factor analysis of personality, peer assessment, self-assessment, aptitude and achievement (varimax-rotated factor loadings)

Aptitude for Science: MCAT BS score (0.801), MCAT PS score (0.559), Extraversion (0.601)
Aptitude for Medicine: Conscientiousness (0.829), Neuroticism (−0.660), MCAT VR score (split loading with Self-Awareness: 0.478 and 0.435)
Professionalism: Peer assessment (0.757), Openness (0.701)
General Achievement: UGPA (0.755), Agreeableness (0.785)
Self-Awareness: Self-assessment (0.901), MCAT VR score (split loading; see Aptitude for Medicine)
Percent of variance: 17.5, 15.9, 12.1, 10.3, and 9.2 across the five factors (64.9% in total)

five factors: (1) Aptitude for Science, (2) Aptitude for Medicine, (3) Peer Professionalism, (4) General Achievement, and (5) Self-Awareness, together accounting for 64.9% of the total variance.

STRUCTURE OF PERSONALITY, PROFESSIONALISM, AND COGNITIVE VARIABLES

The five factors for the structure of professionalism, aptitude, achievement, and personality accounted for nearly two-thirds of the total variance. These factors are cohesive and theoretically meaningful. The first factor, Aptitude for Science, consists of the MCAT Biological Sciences and Physical Sciences loadings as well as Extraversion, a characteristic of students with science aptitude who choose to work with people (e.g., in medicine). The second factor, Aptitude for Medicine, contains the MCAT Verbal score together with Conscientiousness and, inversely, Neuroticism—both important dimensions of personality consistent with high performance in medicine. The third factor, Professionalism, is composed of the Peer Assessment and Openness, which are theoretically cohesive characteristics. The fourth factor, General Achievement, is composed in almost equal parts of undergraduate GPA and Agreeableness, a trait that is important in achievement. The fifth factor, Self-Awareness, is composed of the Self-Assessment scores and the MCAT Verbal score. Interestingly, the MCAT Verbal score has almost equal split loadings on Aptitude for Medicine and Self-Awareness, possibly reflecting the relevance of verbal abilities to both of these dimensions.

Summary

The main findings are (1) the psychometric properties (response rates, descriptive statistics, reliability, validity evidence from factor analysis, feasibility) of the self–peer instrument are in accordance with theoretical expectations, (2) personality factors—Openness, Conscientiousness, Neuroticism—are significantly related to peer assessment of professionalism, and (3) factor analysis of personality, achievement, aptitude, and professionalism data results in five theoretically meaningful and cohesive factors.

PSYCHOMOTOR DOMAIN

The psychomotor domain has to do with doing or action. It was established to address skills development relating to the physical dimensions of accomplishing a task. Table 10.5 contains a classification of the psychomotor domain.

In health sciences education, psychomotor skills and technical skills—rarely defined in the health sciences literature—most commonly refer to the cognitive, motor, and dexterity skills associated with carrying out physical examinations, clinical procedures, and surgery using specialized medical equipment. They also frequently include skills associated with conducting physical examinations, using specialized non-surgical equipment, and administering medicines. A recent study18 set out to determine which instruments exist to directly assess psychomotor skills in medical trainees on live patients and to identify the data indicating their psychometric and edumetric properties. This was to address the Accreditation Council for Graduate Medical Education (ACGME) Milestone Project, which mandates that programs assess the attainment of training outcomes, including the psychomotor (surgical or procedural) skills of medical trainees. In a systematic review, the researchers identified a total of 30 instruments used to assess psychomotor skills such as the following:

● sigmoidoscopy: generic skills and specific skills
● colonoscopy: generic skills and specific skills
● laparoscopic skills (depth perception, bimanual dexterity, efficiency)
● epidural anesthesia
● laparoscopic cholecystectomy
● ophthalmic plastic surgical skills
● microsurgery skills

Construct validity was identified in 24 instruments, internal consistency in 14, test–retest reliability in five, and inter-rater reliability in 20. The modification of attitudes, knowledge, or skills was reported using five tools. Jelovsek et al.18 concluded that, while numerous instruments are available for the assessment of psychomotor skills in medical trainees, the evidence supporting their psychometric (i.e., reliability and validity) and edumetric (usability, training) properties is limited.


Table 10.5  Dave's psychomotor domain17

1. Imitation
Behaviors: Copy action of another; observe and replicate
Types of experience: Observe instructor and repeat action, process or activity
Verbs that describe the activity: Copy, follow, replicate, repeat, adhere, attempt, reproduce, organize, sketch, duplicate

2. Manipulation
Behaviors: Reproduce activity from instruction or memory
Types of experience: Follow written or verbal instruction
Verbs that describe the activity: Re-create, build, perform, execute, implement, acquire, conduct, operate

3. Precision
Behaviors: Execute skill reliably, independent of help; activity is quick, smooth, and accurate
Types of experience: Perform a task or activity with expertise and to high quality without assistance or instruction; can demonstrate an activity
Verbs that describe the activity: Demonstrate, complete, show, perfect, calibrate, control, achieve, accomplish, master, refine

4. Articulation
Behaviors: Adapt and integrate expertise to satisfy a new context or task
Types of experience: Combine elements to develop methods to meet varying, novel requirements
Verbs that describe the activity: Solve, adapt, combine, coordinate, integrate, develop, formulate, master

5. Naturalization
Behaviors: Instinctive, effortless, unconscious mastery of activity and related skills at strategic level
Types of experience: Define aim, approach and strategy for use of activities to meet strategic need
Verbs that describe the activity: Construct, compose, create, design, specify, manage, invent, project-manage, originate


MILLER'S PYRAMID OF CLINICAL COMPETENCE: BRINGING THE DOMAINS TOGETHER

Miller proposed a framework for assessing levels of clinical competence (Figure 2.1)19 that encompasses the cognitive, affective, and psychomotor domains. In the pyramid, the lowest two levels test cognition (knowledge), and this is the basis for initial learning of the area. Novice learners may know something about a neurological examination, for example, or they may know how to do a neurological examination. The upper two levels test behavior or competence to determine whether learners can apply what they know in practice: they can show how to do a neurological examination, or they actually do a neurological examination in practice. Doing such work includes both the affective domain (patient focus, physician empathy, etc.) and the psychomotor domain (physician skills in checking reflexes, muscle strength, etc.). Cognitive performance (knows or knows how) does not generally correlate well with behavioral performance (shows or does). That a trainee knows how to do something does not necessarily mean that they will be able to do it in practice. Nonetheless, the knowing is an important foundation for the "knows how" and "does" (Figure 10.2).

1. Knows: some knowledge
2. Knows how: to apply that knowledge
3. Shows: shows how to apply that knowledge
4. Does: applies that knowledge in practice

Miller's Pyramid model has been used to match assessment methods to the competency being tested. An MCQ test on students' examination of the shoulder might show that they know about it but not that they can actually do it.

Figure 10.2  Miller's Pyramid of clinical competence. Levels, from base to apex: 1. Knows (knowledge); 2. Knows how (competence); 3. Shows how (performance); 4. Does (action).


To test at the shows how level, an OSCE-type station to examine the shoulder might be used. Direct observation in the workplace itself would provide even better ecological validity. The Miller model can also help formulate objectives for a particular assessment session; this requires careful articulation of the objectives to be achieved. A test of a learner's ability to assess risk for a cardiac event, for example, might include the following objectives and the corresponding level of Miller's Pyramid:

● Understands what is meant by risk for a cardiac event and why it is important (knows)
● Knows what to do if the risk is high (knows how)
● Can demonstrate the use of the cardiac risk calculator (shows)

To test at the does level, a practical workplace-based session would be most appropriate. Other methods might include a review of video consultations with real patients. The model can also be used to create testing sessions at the appropriate level for learners. For instance, for assessing communication skills:

● Learners in the early stages of development may be tested on why it is important to solicit patients' ideas, concerns, and expectations and what the evidence says (knows)
● More experienced learners may be tested on how they might actually elicit ideas, concerns, and expectations: what phrases they might use (knows how)
● For advanced learners, direct observations in situ, such as reviewing videos of actual consultations with patients, may be employed (does)

Objective structured performance-related examination combining the domains

In a study of surgical residents, researchers20 assessed surgical skills together with communication and professionalism in an objective structured performance-related examination (OSPRE). In this seven-station OSPRE, they assessed skills in excision of a skin lesion, central line insertion, chest tube insertion, enterotomy closure, tracheostomy, a laparoscopic task, and acquiring informed consent (Table 10.6). All levels of Miller's Pyramid were therefore involved in this performance assessment of communication, professionalism, and surgical skills competencies. The internal consistency reliability of the checklists and global rating scales combined was adequate for communication (α = 0.75–0.92) and surgical skills (α = 0.86–0.96) but not for professionalism (α = 0). There was evidence of validity in that surgical skills performance improved as a function of PGY level, but not for the professionalism checklist. Surgical skills and communication correlated in the two stations where both were assessed (r = 0.55 and 0.57, p < 0.05). There is thus evidence for both reliability and validity for simultaneously assessing surgical skills and communication skills.


Table 10.6  Station name, content, and skills assessed for surgical residents

Station 1: Excision of a skin lesion; Assessed skills: Communication + surgical skills; Mean (SD): 82 (15.33); Cronbach's α: 0.87
Station 2: Central line; Assessed skills: Surgical skills; Mean (SD): 82 (16.20); Cronbach's α: 0.60
Station 3: Chest tube insertion; Assessed skills: Communication + surgical skills; Mean (SD): 83 (8.15); Cronbach's α: 0.19
Station 4: Enterotomy closure; Assessed skills: Surgical skills; Mean (SD): 80 (18.40); Cronbach's α: 0.58
Station 5: Tracheostomy; Assessed skills: Professionalism + surgical skills; Mean (SD): 80 (9.17); Cronbach's α: 0.68
Station 6: Laparoscopic task; Assessed skills: Professionalism + surgical skills; Mean (SD): 45 (19.74); Cronbach's α: 0.90
Station 7: Acquiring informed consent; Assessed skills: Communication in surgical context; Mean (SD): 61 (15.07); Cronbach's α: 0.69

Surgical skills performance

The performance of the residents (shows how—the third level of Miller's Pyramid) in the surgical tasks was good in the first five stations (excision of a skin lesion, insertion of a central line, chest tube insertion, enterotomy closure, and tracheostomy), with a minimum mean score of 80%. The laparoscopic task (station 6) was to tie a square knot using an intracorporeal technique on a laparoscopic simulator, a task that requires advanced technical skills. At this level of training, the task proved too difficult. The correlation of the checklists and global rating scales within stations was high for most of the stations, with the exception of station 3 (chest tube insertion). This was a very easy task for the residents, thus producing low variability and a low correlation. Other easy tasks were in station 1, excision of a skin lesion, and station 5, tracheostomy.

In the present study, there was construct validity evidence for the OSPRE for assessing surgical skills. The residents' year of training was related to performance in surgical skills: PGY-4 residents outperformed PGY-1 residents. This finding is evidence of construct validity because of the known group differences; there was a clear difference in performance between junior and senior residents. There were no statistically significant differences in communication skills across the four PGY levels. This provides evidence of divergent validity, as such differences would not be expected.

To assess at the highest level of Miller's Pyramid (does: applies knowledge and skills in practice), the surgical residents should be observed performing these tasks in the clinic with real patients. This would involve the use of a rating scale or checklist (e.g., the mini-CEX) while the resident performs the task. The resulting data allow for assessment and evaluation of the trainee at the highest levels of competence.
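A known-groups comparison of the kind described above can be run with a few lines of code. The example below is a hedged sketch using hypothetical PGY-1 and PGY-4 score vectors (simulated, not the study data), Welch's t-test from SciPy, and a Cohen's d effect size.

```python
import numpy as np
from scipy import stats

# Hypothetical (simulated) OSPRE surgical-skills percentage scores
rng = np.random.default_rng(2)
pgy1 = rng.normal(72, 10, size=12)     # junior residents
pgy4 = rng.normal(85, 8, size=12)      # senior residents

res = stats.ttest_ind(pgy4, pgy1, equal_var=False)        # Welch's t-test
pooled_sd = np.sqrt((pgy4.var(ddof=1) + pgy1.var(ddof=1)) / 2)
d = (pgy4.mean() - pgy1.mean()) / pooled_sd                # Cohen's d
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}, d = {d:.2f}")
```

A significant difference favoring senior residents supports construct validity for the surgical-skills scores, while the absence of a PGY effect on a scale where none is expected (e.g., communication) supports divergent validity.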


SUMMARY AND MAIN POINTS

The various formats of testing for cognition, affect, and psychomotor skills are viewed independently but are integrated in practice. These include selection-type items (multiple-choice questions), constructed-response items, checklists, and other item formats. Bloom's taxonomies of cognitive, affective, and psychomotor skills and Miller's Pyramid of clinical competence inform assessment practices across the domains.

● It is common practice to think of human behavior in three categories or domains: cognition (thinking), affect (feeling), and psychomotor (doing). Although we tend to discuss behavior in these separate ways, most behavior involves all three aspects or domains.
● Thurstone's negative exponential or hyperbolic model learning curve represents a principle of learning in a variety of disciplines and for a variety of learners; learning increases rapidly with practice but attains an upper limit quickly and then flattens out.
● Notwithstanding scientific advances in the study of thinking, perception, memory, and learning, much of assessment in medicine and the other health professions is based on lore, intuition, anecdotal evidence, and personal experience.
● The use of Bloom's taxonomies is helpful in systematizing assessment. Much of the testing in these domains involves selection items (e.g., multiple choice) or constructed-response items (e.g., essays).
● Jean Piaget's theory of intelligence is that we adapt to the world employing the dual cognitive processes of assimilation and accommodation. The theory helps to explain key ideas such as deliberate and interleaved practice, retrieval practice, and learning styles, and how to assess them.
● The SCT is used to assess the ability to interpret medical information under conditions of uncertainty, tapping Piaget's highest levels of cognitive functioning. There still remain a number of challenges in assessing the processes of problem-solving, abstract reasoning, and meta-cognition.
● Students should be tested continuously so as to capitalize on the testing effect. Using retrieval practice with testing—working memory to recall or retrieve facts or knowledge—is more effective than reviewing content or rereading text.
● Both the experimental work and the correlational-based research suggest the need for purposeful, direct teaching for integration and encapsulation of basic sciences into clinical reasoning and skills. This integration and encapsulation needs to be assessed and tested dynamically to determine the cognitive processes and outcomes of such pedagogy.
● Bloom is credited with developing a taxonomy in the affective (feeling) domain that provides a framework for teaching, training, assessing, and evaluating the effectiveness of training as well as curriculum design and delivery.
● In the affective domain, the structure of professionalism, aptitude, achievement, and personality is accounted for by five factors: Aptitude for Science (cognitive variables and Extraversion, a characteristic of those who choose to work with people, e.g., in medicine); Aptitude for Medicine (cognitive variables, Conscientiousness and, inversely, Neuroticism); Professionalism (the peer assessment and Openness); General Achievement (composed in almost equal parts of GPA and Agreeableness); and Self-Awareness.
● In health sciences education, psychomotor skills and technical skills—rarely defined in the health sciences literature—most commonly refer to motor and dexterity skills associated with carrying out physical examinations, clinical procedures, and surgery, and also using specialized medical equipment.
● Miller proposed a framework for assessing levels of clinical competence that encompasses the cognitive, affective, and psychomotor domains. In the pyramid, the lowest two levels test cognition (knowledge); this is the basis for initial learning of an area. The higher levels (shows, does) require more complex integration in the cognitive, affective, and psychomotor domains.
● Direct observations in situ or objective structured performance-related examinations (OSPREs) are methods to assess at the higher levels of Miller's Pyramid.

REFLECTIONS AND EXERCISES

Reflections

1. Compare and contrast the cognitive, affective, and psychomotor domains. (Maximum 500 words)
2. Critically evaluate Thurstone's negative exponential learning curve as a principle of learning in a variety of disciplines and for a variety of learners. (Maximum 500 words)
3. Respond to the statement "Notwithstanding scientific advances much of assessment in the health professions is based on lore, intuition, anecdotal evidence, and personal experience." (Maximum 500 words)
4. How does Jean Piaget's theory of intelligence explain key ideas such as deliberate and interleaved practice, retrieval practice, and learning styles and how to assess them? (Maximum 500 words)
5. Why is using retrieval practice with testing—working memory to recall or retrieve facts or knowledge—more effective than reviewing content or re-reading text? (Maximum 500 words)
6. Discuss direct observations in situ or objective structured performance-related examinations (OSPREs) as methods to assess at the higher levels of Miller's Pyramid. (Maximum 500 words)


Exercises

1. Create an assessment for the cognitive, affective, and psychomotor domains for surgery in the OR. (Maximum 250 words)
2. In a study, a large number of residents (n = 300) were assessed on multiple tests covering all three domains. Correlations (Pearson's r) among the tests ranged from 0.30 to 0.54. Do these results provide evidence of validity? Explain. (Maximum 250 words)
3. Interpret the reliabilities for each station in Table 10.6. (Maximum 250 words)
4. In Table 10.4, the factor loading for Neuroticism on Aptitude for Medicine was −0.660. Interpret this result. (Maximum 250 words)

REFERENCES

1. Ebbinghaus H. Memory: A Contribution to Experimental Psychology. New York: Dover, 1885.
2. Thurstone LL. The learning curve equation. Psychological Monographs 1919, 26, 114 pages.
3. Piaget J. The Psychology of Intelligence. Totowa, NJ: Littlefield, 1972.
4. Lubarsky S, Dory V, Duggan P, Gagnon R, Charlin B. Script concordance testing: From theory to practice: AMEE guide no. 75. Medical Teacher 2013, 35(3), 184–193.
5. Pascual-Leone A, Freitas C, Oberman L, Horvath JC, Halko M, Eldaief M, Bashir S, Vernet M, Shafi M, Westover B, Vahabzadeh-Hagh AM, Rotenberg A. Characterizing brain cortical plasticity and network dynamics across the age-span in health and disease with TMS-EEG and TMS-fMRI. Brain Topography 2011, 24, 302–315.
6. Woollett K, Maguire EA. Acquiring "the Knowledge" of London's layout drives structural brain changes. Current Biology 2011, 21, 2109–2114.
7. Larsen DP, Butler AC, Aung WY, Corboy JR, Friedman DI, Sperling MR. The effects of test-enhanced learning on long-term retention in AAN annual meeting courses. Neurology 2015, 84, 748–754.
8. Collin T, Violato C, Hecker K. Aptitude, achievement and competence in medicine: A latent variable path model. Advances in Health Science Education 2008, 14, 355–366.
9. Patel VL, Evans DA, Groen GJ. Reconciling basic science and clinical reasoning. Teaching and Learning in Medicine 1989, 1, 116–121.
10. Kulasegaram KM, Manzone JC, Ku C, Skye A, Wadey V, Woods NN. Cause and effect: Testing a mechanism and method for the cognitive integration of basic science. Academic Medicine 2015, 90(11 suppl), S63–S69.
11. Baghdady MT, Carnahan H, Lam E, Woods NN. Test-enhanced learning and its effect on comprehension and diagnostic accuracy. Medical Education 2014, 48, 181–188.
12. Bloom B, Krathwohl D, Masia C. Taxonomy of Educational Objectives, Vol. II: The Affective Domain. New York: David McKay Company, Inc., 1964.
13. Epstein RM, Dannefer EF, Nofziger AC, Hansen JT, Schultz SH, Jospe N, Connard LW, Meldrum SC, Henson LC. Comprehensive assessment of professional competence: The Rochester experiment. Teaching and Learning in Medicine 2004, 16(2), 186–196.
14. Lurie SJ, Nofziger AC, Meldrum S, Mooney C, Epstein RM. Effects of rater selection on peer assessment among medical students. Medical Education 2006, 40(11), 1088–1097.
15. Doherty EM, Nugent E. Personality factors and medical training: A review of the literature. Medical Education 2011, 45(2), 132–140.
16. Tyssen R, Dolatowski FC, Rovik JO, Thorkildsen RF, Ekeberg O, Hem E, Gude T, Gronvold NT, Vaglum P. Personality traits and types predict medical school stress: A six-year longitudinal and nationwide study. Medical Education 2007, 41(8), 781–787.
17. Dave RH. Psychomotor Domain. Berlin: International Conference of Educational Testing, 1967.
18. Jelovsek EJ, Kow N, Diwadkar GB. Tools for the direct observation and assessment of psychomotor skills in medical trainees: A systematic review. Medical Education 2013, 47, 650–673.
19. Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine 1990, 65, s63–s67.
20. Ponton-Carss A, Hutchison C, Violato C. Assessment of communication, professionalism, and surgical skills in an objective structured performance-related examination (OSPRE): A psychometric study. American Journal of Surgery 2011, 202(4), 433–440.

11
Constructing multiple-choice items

ADVANCED ORGANIZERS

• Multiple-choice items are objective tests because they can be scored routinely according to a predetermined key, eliminating the judgments of the scorers.
• The first step in constructing valid tests is to develop a test blueprint, as it serves as a tool to help ensure the content validity of an exam.
• The multiple-choice item consists of several parts: the stem, the keyed-response, and several distractors. The keyed-response is the "right" answer or the option indicated on the key. All of the possible alternative answers are called options.
• The stems should include verbs such as define, describe, identify (knowledge), defend, interpret, explain (comprehension), and predict, operate, compute, discover, apply (application).
• In a review of 46 authoritative textbooks and other sources in educational measurement, Haladyna et al. developed a taxonomy of 43 multiple-choice item-writing rules.
• In criterion-referenced testing, it is necessary to establish cutoff scores for pass/fail. One very common and simple way to do this, the minimum performance level (MPL), is established through discussion among experts about the minimally competent candidate.
• For each distractor, experts identify the probability that they believe a hypothetical minimally competent examinee could rule it out as incorrect. Quartile probabilities (e.g., 25%, 50%) are usually used, though deciles (10%, 20%, etc.) could also be used. The probabilities from each expert (2–3) for each option are averaged. The total test score MPL (passing score) is the sum of the item MPLs.



INTRODUCTION

The objectivity of a test refers to how it is scored. Tests are objectively scored to the extent that independent scorers can agree on the number of points answers should receive. If observers can agree on scoring criteria, so that subjective judgments and opinions are minimized, the test is said to be objective. Scoring procedures on some objective tests, such as multiple-choice questions (MCQs), have been so carefully planned that scoring can be automated on a computer. Objectivity in scoring is necessary if measurements are to be useful.

Multiple-choice items are objective tests because they can be scored routinely according to a predetermined key, eliminating the judgments of the scorers. Objective test items produce higher reliability than do subjective test items (e.g., essays) and are therefore more desirable. MCQs are efficient and effective for assessing the first three levels of Bloom's taxonomy (knowledge, comprehension, application—Table 5.1) and, arguably, some higher levels (e.g., analysis). There is consensus among testing experts that MCQs can and should be used to measure the first three levels of Bloom's taxonomy because they are objective, efficient, and effective. Open-ended items (e.g., essays) should be used to assess learning outcomes at the higher levels (analysis, synthesis, evaluation) because such assessment requires originality, elaboration, and divergent thinking, which cannot be captured by MCQs. Open-ended or essay tests are discussed in Chapter 12.
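Because MCQ scoring reduces to comparing responses with a predetermined key, it is easily automated. The sketch below is a minimal, hypothetical illustration; the item identifiers and the key are invented for the example.

```python
# Minimal automated scoring of MCQ responses against a predetermined key.
answer_key = {"Q1": "B", "Q2": "C", "Q3": "A", "Q4": "D"}    # hypothetical key

def score(responses: dict[str, str], key: dict[str, str]) -> int:
    """Return the number of items answered correctly."""
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)

candidate = {"Q1": "B", "Q2": "C", "Q3": "C", "Q4": "D"}     # one examinee's responses
print(score(candidate, answer_key), "of", len(answer_key))    # -> 3 of 4
```

Because the key is fixed in advance, two independent scorers (or two runs of the script) will always produce the same score, which is precisely what objectivity means here.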

BEFORE CONSTRUCTING ITEMS

The first step in constructing valid tests is to develop a table of specifications (TOS) or test blueprint (see Table 5.2, Chapter 5). The test blueprint serves as a tool to help ensure the content validity of an exam. As you study the blueprint presented in Table 5.2, you will notice that the content of the course has been subdivided into a number of subtopics. Moreover, each section of the course has been given an overall percentage of emphasis for the test. For example, the subtopic Preventive Cardiology contains 20 items, or 27% of the total test. By inference, the topic that receives the greatest emphasis on the test should also have received the greatest emphasis in the course learning activities: very close to 27% of the course should have focused on Preventive Cardiology. If there is a discrepancy, alter the exam emphasis to reflect the course.

In addition to the appropriate emphasis, test construction benefits from considering the appropriate cognitive level of assessment. The accompanying test blueprint identifies the first four levels of Bloom's taxonomy (i.e., knowledge, comprehension, application, higher) as appropriate levels for testing material related to the course (Table 5.1). The issue of cognitive levels for testing is elaborated below. In summary, using the test blueprint helps ensure that all required course topics are covered at an appropriate level of understanding, thereby enhancing content validity.
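The arithmetic of a blueprint (translating topic weights into item counts) is simple to check programmatically. The topics and weights below are illustrative placeholders, not the actual blueprint in Table 5.2.

```python
# Convert blueprint weights (percent emphasis) into item counts for a test form.
blueprint = {                       # hypothetical topic weights; must sum to 100
    "Preventive cardiology": 27,
    "Heart failure": 33,
    "Arrhythmias": 20,
    "Valvular disease": 20,
}
total_items = 75

assert sum(blueprint.values()) == 100, "weights should sum to 100%"
counts = {topic: round(total_items * pct / 100) for topic, pct in blueprint.items()}
print(counts, "->", sum(counts.values()), "items")
```

With a 75-item form, a 27% weight yields about 20 items, matching the Preventive Cardiology example in the text; after rounding, the counts should be checked against the intended test length.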


CONSTRUCTING MULTIPLE-CHOICE TEST ITEMS

The multiple-choice item consists of several parts: the stem, the keyed-response, and several distractors. The keyed-response is the "right" answer or the option indicated on the key. All of the possible alternative answers are called options.

Stem
1. Widely spaced eyes, thin upper lip, and short eyelid openings in the preschool child are typical of
Options
A. Down's syndrome
B. fetal alcohol syndrome*
C. rubella during the child's prenatal development
D. the effects of thalidomide

* = Keyed-response (other options A, C, and D are distractors).

The stem presents the problem. It may be in two formats: (1) an incomplete statement (as above) or (2) a complete statement (such as a full interrogative). The stem may be quite brief, as in the example above, or may be lengthy and include numbers, formulae, charts, photographs, and other material. The point is to present a problem in the stem that is best solved or answered by one option (the keyed-response). The stem should be clear, well written, and free from extraneous or irrelevant material.

Possibly the most difficult task in constructing multiple-choice items is to write plausible distractors. That is, distractors which may appear correct to the confused candidate or one with little or partial knowledge. "Schizophrenia," for example, would not be a plausible distractor in the above example, because even candidates with little or no knowledge could eliminate it as clearly incorrect. Following is an example of a more extensive MCQ:

2. A study of the etiology of breast cancer involves female volunteers recruited to the study by their physicians. All participants are screened for previous cancer or pre-cancerous or chronic diseases. In other words, all patients are healthy at the beginning of the study. The study will last 25 years, and women who become ill with breast cancer will be compared on social, psychological, biological, genetic, and familial factors to matched peers who have not become ill. This study design is best described as a
A. direct, clinical observation study
B. matched, cross-sectional/longitudinal study
C. prospective, case-comparison study*
D. retrospective, case-comparison study
E. population-based panel study

The stem of this item is much longer than that of the previous item (#1) and therefore requires much more reading time, although both receive an equal weight (1 point).


This is generally not a problem so long as the average reading time for each item is manageable within the allotted testing time. Both of the items above (1 and 2) end with incomplete statements. Generally, the use of incomplete statement-type items should be minimized, because they do not define the problem adequately. This type of question requires longer and more complex cognitive processing and therefore can add confusion for average or lower-performing students. This introduces unwanted measurement error by requiring "mental gymnastics" in addition to knowledge of the content area. The question "Which of the following best describes this study design?" is better.

CONCEPTUAL DEPICTION OF MCQs

Diagrams representing an easy, moderate, and difficult MCQ are depicted in Figure 11.1.

Figure 11.1  Conceptual depiction of an easy, moderate, and difficult MCQ item (three frames, each showing how much options A, C, and D overlap the problem presented in the stem relative to the keyed-response B).


The first frame shows an easy item: the keyed-response ("correct answer") B clearly overlaps the problem presented in the stem, while options A, C, and D either do not overlap (C, D) or overlap only marginally (A) with the stem or the problem presented there. This is typical of a factual knowledge item—if the student knows the fact, the item is very easy. The moderately difficult item in the second frame shows option B clearly overlapping with the stem, but options A, C, and D also have some overlap and are therefore plausible alternatives. A student must know the concept quite well in order to select the keyed-response; students with only superficial or partial knowledge will find this item somewhat challenging. Therefore, this item will "discriminate" (i.e., distinguish) between students who know the content and those who don't. The difficult item frame indicates even more overlap between the stem and the distractors (options A, C, and D). Students will require superior knowledge in order to distinguish between the correct answer (B) and the distractors. High achievers will select the correct response, while moderate or low achievers will likely be confused by this item. This item, therefore, will discriminate between those who know the content very well and those who have only superficial or partial knowledge. Moderate and low-achieving students frequently refer to this type of item as "multiple guess"—putting the blame on the item rather than on the real source of the issue, which is lack of knowledge.

Any given test can consist of a mix of easy, moderate, and difficult items. A good distribution is 20% easy, 40% moderate, and 40% difficult items.
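Empirically, the difficulty and discriminating power sketched in Figure 11.1 are usually estimated from response data: the difficulty index p is the proportion of examinees answering the item correctly, and discrimination can be indexed by the point-biserial correlation between the item score and the (corrected) total score. The sketch below uses simulated 0/1 response data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(size=200)                        # simulated examinee ability
difficulty = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])    # five items, easy to hard
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((200, 5)) < prob).astype(int) # simulated 0/1 responses

total = responses.sum(axis=1)
for j in range(responses.shape[1]):
    item = responses[:, j]
    p = item.mean()                                   # difficulty index
    rest = total - item                               # total score with item removed
    r_pb = np.corrcoef(item, rest)[0, 1]              # point-biserial discrimination
    print(f"item {j + 1}: p = {p:.2f}, r_pb = {r_pb:.2f}")
```

Items with p near 1.0 behave like the easy frame (almost everyone answers correctly, little discrimination), while items with moderate p and a healthy positive r_pb behave like the moderate and difficult frames.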

COGNITIVE LEVEL OF TEST ITEMS

The testing of simple recall has too frequently characterized the multiple-choice item format. This situation has been mistakenly attributed to some inherent weakness in the format itself, but this is not true. The multiple-choice format offers ample opportunity to construct items that are more complex than simple recall. Nevertheless, there are many situations where assessing the candidate's mastery of factual knowledge is a perfectly appropriate task (e.g., the definition of a medical term). Criticism is warranted when a simple recall test item is used for material that should be assessed at the comprehension, application, or even analysis level. Therefore, an important step in constructing a test item is to consider the learning objective underlying the learning of the material. In doing so, the level of cognitive complexity that the item should reflect can be determined.

In constructing test items at various cognitive levels, refer to Bloom's taxonomy (see Table 5.1). Remember from the table that there are six major cognitive categories, and the general objectives at each level provide the action verbs which are useful in constructing test items in each of these categories. While it is possible to construct items at the upper three levels of the taxonomy (i.e., analysis, synthesis, evaluation), this task is quite difficult. In general, items for high-stakes course exams and licensing exams are more likely to reflect the first three levels of the taxonomy (i.e., knowledge, comprehension, application). The actual distribution of cognitive categories for any particular exam will of course be influenced by the curriculum content being assessed. Consequently, it is not surprising that some exams are characterized primarily by knowledge- and comprehension-type questions, while other exams may have a larger proportion of application-type questions. The stems should include verbs such as define, describe, identify (knowledge), defend, interpret, explain (comprehension), and predict, operate, compute, discover, apply (application).

COMMON GUIDELINES FOR ITEM CONSTRUCTION

In a review of 46 authoritative textbooks and other sources in educational measurement, Haladyna et al. developed a taxonomy of 43 multiple-choice item-writing rules.1 The taxonomy, which is summarized here as Table 11.1, is exhaustive, with a great deal of detail. We will focus with particular attention on a subset of these rules. There are a variety of multiple-choice test formats to consider when constructing items; Table 11.2 contains six possible test item formats. In addition, format variation may also be found within the cognitive domains. For example, items developed at the knowledge level may be constructed to reflect different types of knowledge outcomes, as indicated in Table 11.3.

Table 11.1  Item-writing guidelines

General item-writing (procedural)
1. Use either the best answer or the correct answer format.
2. Avoid the complex multiple-choice (Type K) format.
3. Format the item vertically, not horizontally.
4. Allow time for editing and other types of item revisions.
5. Use good grammar, punctuation, and spelling consistently.
6. Minimize examinee reading time in phrasing each item.
7. Avoid trick items, those that mislead or deceive examinees into answering incorrectly.

General item-writing (content concerns)
8. Base each item on an educational or instructional objective.
9. Focus on a single problem.
10. Keep the vocabulary consistent with the examinees' level of understanding.
11. Avoid cuing one item with another; keep items independent of one another.
12. Use the author's examples as a basis for developing your items.
13. Avoid over-specific knowledge when developing the item.
14. Avoid verbatim textbook phrasing when developing the item.
15. Avoid items based on opinions.
16. Use multiple-choice to measure higher-level thinking.
17. Test for important or significant material; avoid trivial material.

Stem construction
18. State the stem in either question form or completion form; use the question form as much as possible.
19. When using the completion format, don't leave a blank for completion in the beginning or middle of the stem.
20. Ensure that the directions in the stem are clear and that wording lets the examinee know exactly what is being asked.
21. Avoid window dressing (excessive verbiage) in the stem.
22. Word the stem positively; avoid negative phrasing.
23. Include the central idea and most of the phrasing in the stem.

General option development
24. Use as many options as are feasible; aim to use four options for all items.
25. Place options in logical or numerical order.
26. Keep options independent; options should not be overlapping.
27. Keep all options in an item homogeneous in content.
28. Keep the length of options fairly consistent.
29. Avoid, or use sparingly, the phrase "all of the above."
30. Avoid, or use sparingly, the phrase "none of the above."
31. Avoid the use of the phrase "I don't know."
32. Phrase options positively, not negatively.
33. Avoid distractors that can clue test-wise examinees; e.g., avoid associations, absurd options, formal prompts, or semantic (overly specific or overly general) clues.
34. Avoid giving clues through the use of faulty grammatical construction.
35. Avoid specific determiners, such as "never" and "always."

Correct option development
36. Position the correct option so that it appears about the same number of times in each possible position for a set of items.
37. Make sure there is one and only one correct option.

Distractor development
38. Use plausible distractors; avoid illogical distractors.
39. Incorporate common errors of students in distractors.
40. Avoid technically phrased distractors.
41. Use familiar yet incorrect phrases as distractors.
42. Use true statements that do not correctly answer the item.
43. Avoid the use of humor when developing options.

Source: Adapted from Haladyna et al.1
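A few of the rules in Table 11.1 lend themselves to automated screening of draft items (e.g., rules 22, 24, 28, 29, and 30). The checker below is a rough, hypothetical sketch: the item representation and the specific heuristics are assumptions made for illustration, and such a script is no substitute for expert review.

```python
def check_item(stem: str, options: list[str]) -> list[str]:
    """Flag a few common item-writing problems (heuristic, not exhaustive)."""
    warnings = []
    lower_options = [o.lower() for o in options]
    if any(w in stem.lower().split() for w in ("not", "except")) and "NOT" not in stem:
        warnings.append("negative phrasing in stem; avoid or capitalize/bold it (rule 22)")
    if "all of the above" in lower_options or "none of the above" in lower_options:
        warnings.append("avoid 'all/none of the above' options (rules 29-30)")
    lengths = [len(o) for o in options]
    if max(lengths) > 2 * min(lengths):
        warnings.append("option lengths are very uneven (rule 28)")
    if len(options) != 4:
        warnings.append("aim for four options (rule 24)")
    return warnings

print(check_item(
    "Which of the following is not a continuous variable?",
    ["Height", "Religion", "Temperature", "Time to solve a problem", "None of the above"],
))
```

Run over a whole item bank, a screen of this kind can highlight items for the human reviewer, but judgments about plausibility, content relevance, and cognitive level remain expert work.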


Table 11.2  Sample test item formats

1. Question format
Acute intermittent porphyria is the result of a defect in the biosynthetic pathway for which of the following?
A. Corticosteroid
B. Fatty acid
C. Glucose
D. Heme*

2. Completion format
An inherited metabolic disorder of carbohydrate metabolism is characterized by an abnormally increased concentration of hepatic glycogen with normal structure and no detectable increase in serum glucose concentration after oral administration of fructose. These two observations suggest that the disease is the result of the absence of
A. Fructokinase
B. Glucokinase
C. Glucose 6-phosphatase*
D. Phosphoglucomutase

3. Clinical association format
A 12-year-old girl with sickle cell disease has pain in her right arm. An x-ray of the right upper extremity shows bony lesions consistent with osteomyelitis. Which of the following is the most likely causal organism?
A. Clostridium septicum
B. Enterococcus faecalis
C. Pseudomonas aeruginosa
D. Salmonella enteritidis*

4. Negative question format
Which of the following is NOT a continuous variable?
A. Height
B. Religion*
C. Temperature
D. Time to solve a problem

5. Statement format
Identify the 95% confidence interval for the population mean when the sample mean is 37 and the standard deviation is 5.
A. 37 ± 1.96 (15)
B. 37 ± 1.96 (5)*
C. 37 ± 0.95 (15)
D. 37 ± 0.95 (3)

6. Order/sequence format
The triad of the nephritic syndrome includes which of the following?
A. Hypotension, proteinuria, and haematuria
B. Hypertension, proteinuria, and edema*
C. Night sweats, edema, and haematuria
D. Urinary frequency, burning, and pain

Table 11.3  Measuring knowledge outcomes

Knowledge of terminology
Which one of the following best defines the word egress?
A. digress
B. enter
C. exit*
D. regress

Knowledge of specific facts
Who was the first US astronaut to orbit the earth in space?
A. Alan Shepard
B. John Glenn*
C. Scott Carpenter
D. Virgil Grissom

Knowledge of principles
Which of the following best describes the power of science?
A. Scientists are rational and apply logic in their thought processes
B. Science is public, self-correcting, and results are replicable*
C. Science provides an algorithm for solving problems
D. Scientists are well-educated in basic philosophical problems

Knowledge of methods and procedures
If you were conducting a scientific study of a problem, your first step should be to
A. Collect information about the problem*
B. Develop hypotheses to be tested
C. Design the experiment to be conducted
D. Select scientific equipment


CONSTRUCTING THE STEM

As we have seen, a multiple-choice item is composed of a stem and four or five options. One of the options is referred to as the keyed or correct response, while the remaining options serve as distractors. In a well-constructed multiple-choice item, the stem should present a self-contained question or problem. Moreover, there should be enough information in the stem to permit the test-taker to develop a possible answer without viewing the options. In the following example, Item 1 represents an incomplete stem, while Item 2 is a better format.

Item 1—Poor (problem is not adequately specified in the stem)
The bichrome test may fail for patients who have
A. deuteranopia
B. deuteranomaly
C. protanopia
D. tritanopia

Item 2—Better (problem is more clearly specified in the stem)
In which of the following patient conditions will the bichrome test, which is used to determine the spherical refractive error in clinical refraction, likely fail?
A. Deuteranopia
B. Deuteranomaly
C. Protanopia
D. Tritanopia

One of the most frequent but easily avoided errors in constructing the stem concerns the grammatical fit between the stem and all of the options. The item writer may simply have failed to proofread the item. To avoid such errors, it is useful to have a colleague read and edit the items; someone other than the item writer can frequently spot errors in items more readily.

Negatively stated items

Negatively stated (i.e., not, except, least) items should be avoided, but there are rare occasions when they are appropriate. In the case of treatment procedures, for example, there may be an action among several which should not be performed. It is quite reasonable to develop a negatively stated item to help determine if the candidate is aware of the danger of performing a particular action. As in the case of all negatively stated stems, capitalize and bold the negation component of the stem (i.e., which of the following is NOT an appropriate action when dealing with a head-injured patient?).


Example
You suspect placenta previa when a 26-year-old woman in her 23rd week of pregnancy comes to the emergency department with painless vaginal bleeding of bright red blood. Which of the following actions is NOT correct?
A. Conduct a digital pelvic exam*
B. Monitor the fetal heart rate
C. Order bed rest as a treatment
D. Order transvaginal ultrasonography as a diagnostic test
E. Order a complete blood count (CBC)

Constructing the options

How many options?

The four-option item is the most effective form to use (i.e., the stem and options a, b, c, d). The addition of a fifth option generally does not provide sufficient discrimination power and improvement in reliability to justify the effort required to construct it.2 Given this evidence and the importance of a consistent format style throughout the exam, item writers are advised to construct all items in the four-option format.

The main reason for increasing the number of options in an item is to reduce the probability of selecting the keyed-response by chance alone. These probabilities are illustrated in Table 11.4. Increasing the options beyond four does not decrease the probability of guessing very substantially: increasing the options from two to three decreases the probability of guessing by 17%, and from three to four by an additional 8%, while going from four options to five reduces the probability by only 5% (Table 11.4). Given the difficulty in constructing four plausible distractors (a five-option item), it is advisable to use three plausible distractors (a four-option item), since the reduction in guessing is small (5%) compared to the effort of constructing an additional distractor. A number of empirical studies have found little or no improvement in the reliability or discrimination of tests by using five options versus four.1

Table 11.4  Probabilities of guessing with various options

Number of options    Probability of guessing (%)    Decrease in probability (%)
2                    50                             —
3                    33                             17
4                    25                             8
5                    20                             5
6                    17                             3


Option length

Frequently, the correct option is longer than the distractors. This is usually because the longer option contains more information so as to make it the best answer. Many test-wise candidates, however, are aware of this test construction error and will select the longest option. When a test-taker selects a correct option on the basis of some factor other than having the necessary knowledge, the value of the item is reduced. The item below is typical of the length construction error. In general, try to keep the options approximately equal in length.

Option length error
Neurotics are more likely than psychotics to
A. be dangerous to society
B. be dangerous to themselves
C. have delusional symptoms
D. have insight into their own inappropriate behavior but nevertheless feel rather helpless in terms of dealing with their difficulties*

Location of the correct answer

The location of the keyed-response should be randomized so that it appears at approximately equal frequency in each of the four option positions. Usually this is not the case, particularly for novice item writers. For reasons that are not fully understood, these writers favor "c" as the correct option. Indeed, the practice is so widespread that test-wise students have developed the dictum, "When in doubt, pick 'c.'" Therefore, when reviewing your items, simply check to see that you have not favored "c" as the correct response. In a 100-item test, for example, the keyed-response should appear about 25 times in each of the four option positions.

In addition to determining the location of the correct response, consideration should be given to the order in which the options are presented. Whenever appropriate, the options should be placed in either ascending or descending order, usually ascending. When the options are numbers, for example, it is preferable to present them in serial order rather than random order.

Example of numbers
What proportion of a normal distribution falls between z = −1.16 and z = +1.16?
A. 0.2460
B. 0.3770
C. 0.6230
D. 0.7540*


Similarly, options should be placed in either ascending or descending alphabetical order; as with numbers, it is usual to arrange the options in ascending order.

Example of alphabetical order
A 22-year-old student comes in to the university health center with a mild fever (100°F). On careful physical examination, you notice four small vesicular lesions, each on an erythematous base. The most likely diagnosis is primary infection with
A. Cytomegalovirus (CMV)
B. Herpes simplex virus type 2 (HSV-2)*
C. Human immunodeficiency virus type 1 (HIV-1)
D. Human papilloma virus type 6 (HPV-6)
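A quick tally of keyed-response positions across a draft test form can reveal the "favoring c" problem described above before the exam is administered. The snippet below is a small illustrative sketch with an invented, randomly generated answer key.

```python
import random
from collections import Counter

random.seed(4)
answer_key = [random.choice("ABCD") for _ in range(100)]    # hypothetical 100-item key

counts = Counter(answer_key)
print({pos: counts.get(pos, 0) for pos in "ABCD"})           # expect roughly 25 each

# If one position is heavily over-used, reorder options item by item and re-key.
if max(counts.values()) > 35:
    print("keyed position is unbalanced; consider reordering options")
```

The same idea works on a real key exported from an item bank: count the keyed positions, and rebalance by reordering options (respecting the logical/alphabetical ordering rules) rather than by changing content.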

All of the above or none of the above

These options should be avoided, particularly when the item constructor plans to make them the correct response. It is very difficult to defend the position that a set of options is always the case or never the case. When "all of the above" or "none of the above" options serve as distractors, it is important for item constructors to recognize that many people will quickly eliminate these options. Test-wise candidates understand that the item constructor frequently uses these options as fillers owing to the inability to create a third valid distractor.

Example: None of the above
What effect would a thrombolytic drug administered to a heart attack patient arriving in the emergency department by ambulance have on the patient's systolic blood pressure?
A. Decrease it
B. Increase it
C. Neither increase nor decrease it*
D. None of the above

It is evident that option D, "none of the above," is nonsensical and was used only as a filler because it is very difficult to write a plausible distractor for that option. The problem in the stem should be re-worked to avoid this difficulty.

Example: All of the above (poor construction)
Which are characteristic of the nephritic syndrome?
A. Hypertension
B. Proteinuria
C. Edema
D. All of the above*


Example: All of the above (better construction)
Which are characteristic of the nephritic syndrome?
A. Hypertension, proteinuria, and edema*
B. Hypotension, deuteranopia, and edema
C. Hypertension, proteinuria, and deuteranopia
D. Hypotension, proteinuria, and edema

In the poorly constructed version, even a student with partial knowledge could select the "all of the above" option, D: recognizing any two of the options (say, B and C) as correct makes it follow that D must be the correct answer. Additionally, if "all of the above" is used only rarely, when it does appear it probably is the correct answer. Finally, either of these options (none/all of the above) is frequently not grammatically correct in relation to the stem. The best strategy is never to use "all of the above," "none of the above," or any other inclusive/exclusive options (e.g., A and B but not C).

Homogeneous distractors

It is important that all distractors be plausible and homogeneous. The rationale for this is simple: one of the major advantages of using the multiple-choice format is that it requires candidates to discriminate among what should be a set of compelling options, with only one representing the best choice. Any question that contains distractors that are easily eliminated reduces the overall value of the question. The following example represents this type of construction error.

Example: Poor
Who is most closely associated with the theory of natural selection?
A. Bell
B. Darwin*
C. Morse
D. Pasteur

Example: Better
Who is most closely associated with the theory of natural selection?
A. Darwin*
B. Fleming
C. Pasteur
D. Virchow

In the first example (poor), Darwin and Pasteur were both in the life sciences, while Bell and Morse were in the physical sciences. In the second example (better), the distractors are more homogeneous, as all four are now in the life sciences.

Constructing the clinical vignette question  271

Cueing the correct answer There are ways in which a cue in the stem gives away the correct answer by repeating a key word from the stem in one of the options. The following represents an example of this type of error. In general, this particular error is found more frequently among items which involve technical language as in the case of science or medical-type items. Providing obvious clues Which of the following diseases is caused by a virus? A. Gallstones B. Scarlet fever C. Typhoid fever D. Viral pneumonia*

CONSTRUCTING THE CLINICAL VIGNETTE QUESTION Questions can be written with a clinical vignette. The vignette allows testing at the comprehension or application level of knowledge and provides face validity for the item. The vignette commonly provides the patient’s age, sex, chief complaint, and site of care, followed by personal history, family history (if relevant), then physical examination information, then laboratory data (if provided). Depending upon the purpose of the set, vignettes can be brief, prototypic presentations, or fuller descriptions that challenge examinees to identify key information. A good stem provides sufficient information but not irrelevant or distracting information such as patient’s hair color if that is not relevant to the diagnosis. A good stem can be answered without referring to the options. Dos and don’ts for effective writing of vignettes Each item should focus on an important concept, typically a common or potentially acute clinical problem. 1. Don’t assess knowledge of trivial facts 2. Don’t include trivial, “tricky,” or overly complex questions 3. Do focus on problems that would be encountered in real life

●●

Each item should assess comprehension or application of knowledge, not recall an isolated fact. 1. Do use stems that may be relatively long 2. Do use short options 3. Do use the question format 4. Do use a presenting problem of a patient for the clinical sciences 5. Do follow with the history (including duration of signs and symptoms), physical findings, results of diagnostic studies, initial treatment, subsequent findings, etc.

●●

272  Constructing multiple-choice items

6. Do use vignettes that may include only a subset of this ­i nformation, but the information should be provided in this specified order 7. Don’t use long patient vignettes for the basic sciences; keep them very brief 8. Do use laboratory vignettes for the basic sciences The stem of the item must pose a clear question, and it should be possible to arrive at an answer with the options covered. 1. Do cover up the options to determine if the question is focused and clear; examinees should be able to pose an answer based only on the stem 2. Do rewrite the stem and/or options if they could not 3. Do write a vignette for each (or many) of the options in the list 4. Do begin with the presenting problem of a patient, followed by the history (including duration of signs and symptoms), physical findings, results of diagnostic studies, initial treatment, subsequent findings, etc. 5. Do pose a clear question in the lead-in of the stem so that the candidates can pose an answer without looking at the options 6. Do satisfy the “cover-the-options” rule as an essential component of a good question ●●

General vignette example* A 22-year-old student comes in to the university health center with a mild fever (100F) and a “continuous tingling sensation running down the inside of her thighs” that has disturbed her sleep. On careful physical examination, you notice four small vesicular lesions, each on an erythematous base, adjacent to the vaginal opening. There are no other recognizable skin abnormalities but the inguinal (groin) lymph nodes are noticeably swollen and tender. Which primary infection is the most likely diagnosis that accounts for this young women’s condition? A. Cytomegalovirus (CMV) B. Herpes simplex virus type 2 (HSV-2)* C. Human immunodeficiency virus type 1 (HIV-1) D. Human papilloma virus type 6 (HPV-6) E. Varicella zoster virus (VZV)

* Many thanks to Dr. Peter Southern, Professor at the University of Minnesota Medical School who contributed this and several other items in this chapter.

Constructing the clinical vignette question  273

Vignettes with x-rays example

This X-ray is from a male patient with date of birth: May 20, 1975; side of extremity/body: right; and date of X-ray: May 01, 2018. What type of fracture is shown in the X-ray above? A. Simple spiral B. Simple oblique* C. Simple transverse D. Simple wedge E. Simple fragmentary

274  Constructing multiple-choice items

A 63-year-old man presents to the emergency department complaining of coughing that produces phlegm. He has fever, shortness of breath and difficulty breathing, chills, fatigue, sweating, and chest pain. You order a chest CT with the results below. What is the most likely differential diagnosis? A. Cytomegalovirus of the lungs B. Lung cancer C. Pneumonia* D. Tuberculosis E . Emphysema

Vignettes with photographs example

This 4-year-old child is generally happy and healthy although she does seem to have inherited a family susceptibility to atopic dermatitis. Shortly after beginning pre-school classes, she experiences an episode of skin disruption around her mouth and across her cheeks that progresses to fluid secretion and crusting, suggestive of secondary bacterial infection. What combination of bacteria is most likely to be causing the secondary skin infections when the Diagnostic Laboratory isolates a beta-hemolytic, gram-positive, coagulase positive organism and a lactose non-fermenting, non-spore-forming rod? A. Escherichia coli and Streptococcus pyogenes B. Klebsiella pneumoniae and Streptococcus pyogenes C. Pseudomonas aeruginosa and Staphylococcus aureus* D. Staphylococcus epidermidis and Pseudomonas aeruginosa E . Streptococcus agalactiae and Pseudomonas aeruginosas

Constructing the clinical vignette question  275

Vignettes with lab data example A 68-year-old woman’s hospital discharge was delayed due to unavailability of a bed in a nursing home. She is bedridden and unable to attend to personal needs. During a 3-day period, her pulse increases from 79/min to 127/min, and blood pressure decreases from 128/76 mm Hg to 103/60 mm Hg. Laboratory values include: Day 1 Hemoglobin Serum Glucose Urea nitrogen Creatinine Na+

Day 3

17.4 g/dL

18.8 g/dL

97 mg/dL 20 mg/dL 1.1 mg/dL 133 mEq/L

86 mg/dL 58 mg/dL 1.2 mg/dL 148 mEq/L

Which of the following is the most likely diagnosis? A. Hepatic globular disease B. Dehydration* C. Diabetic ketoacidosis D. Duodenal hemorrhage E . Renal failure Table 11.5, which is adapted and modified based on Gronlund’s and Cameron’s text, is a very useful checklist to run through to check that you have avoided basic item writing errors that we have discussed throughout this chapter. Use this checklist frequently to review your item construction. Table 11.5  Checklist for reviewing multiple-choice items3 Yes

No

1. This is the most appropriate type of item for this assessment 2. The amount or reading time is appropriate for each item 3. As much as possible, the item stems are stated in positive terms 4. The item stem is free of irrelevant material 5. Negative wording (e.g., NOT) has been capitalized 6. Each item stem presents a meaningful problem 7. The alternatives are grammatically consistent with the item stem (Continued)

276  Constructing multiple-choice items

Table 11.5 (Continued)  Checklist for reviewing multiple-choice items3 Yes

No

8. You have avoided repetitive words or phrases in the alternatives 9. There is only one correct or clearly BEST answer 10. The distractors are plausible to non-achievers 11. The alternative answers are brief without unnecessary words 12. The alternatives are similar in length and form 13. The items are free of verbal clues to the answers 14. Verbal alternatives are in alphabetical order 15. Numerical alternatives are in numerical order 16. You have avoided "none of the above" and "all of the above" 17. You have avoided inclusive/exclusive items (e.g., “A and B but NOT C”) 18. A colleague has reviewed your items

SETTING PASS/FAIL SCORES FOR MCQs The MPL method In criterion-referenced testing, it is necessary to establish cutoff scores for pass/fail.4 One very common and simple way to do this is called a minimum performance level (MPL) employing the Nedelsky method. Here, we employ a modified version of the Nedelsky method. Experts (e.g., physicians, nurses, dentists, etc.) assign probabilities to multiple-choice test items based on the likelihood that a group of examinees should be able to rule out incorrect options.5 These reference groups are hypothetical test-takers on the borderline between inadequate and adequate levels of performance. These are the minimally competent candidates (not failures, not stars, but those that just got their toes in the door). These also refer to the borderline between mastery and non-mastery of some domain of knowledge, ability, or skill. A good way to proceed in setting the MPL is to have a discussion among experts on the minimally competent candidate. In this meeting, participants’ discuss, describe, and clarify the hypothetical “borderline examinee.” The participants can now leave the meeting and independently inspect each option in a MCQ of some of the items being calibrated. For each distractor, experts identify the probability that they believe a hypothetical minimally competent examinee could rule out as incorrect. Quartile probabilities (e.g., 25%, 50%, etc.) are usually used though deciles (10%, 20%, etc.) could also be used. We then average the probabilities for each expert (2–3) for each option. The total test score MPL (passing score) is the sum of each item MPL. We must, therefore, establish an MPL for each item. The MPL is the value ranging

Setting pass/fail scores for MCQs  277

between 0.25 and 1.0 which reflects the probability that even a minimally competent candidate can answer this item correctly. An MPL of 0.25 indicates a very difficult item with an MPL of 1.0 reflecting an easy one. Once we have averaged the expert judgment for each option assigning a value (one of 0, 0.25, 0.50, 0.75, 1.0) to each distractor, the following formula is then applied to calculate the MPL of each item.



MPL =

1 Op − ∑PD

where Op = the number of options in the item P D = the probability that a minimally knowledgeable candidate can eliminate that option as clearly incorrect MPL = minimum performance level for that item Examples illustrate the procedure 1. When did World War II end? A. 1650 B. 1859 C. 1945* D. 1984

A = 1.0 B = 1.0 C = * D = 1.0 In this example, each of the distractors, A, B, and D are assigned 1.0 because even a minimally competent candidate can almost certainly eliminate them as incorrect. The MPL, therefore, is



1 1 1 1 = = = = 1.0 Op − ΣPD 4 − (1.0 + 1.0 + 1.0 ) 4 − 3 1

The above item receives an MPL = 1.0, because it is very easy. We can increase the difficulty in item 2 by making the options overlap more closely with the stem. 2. When did World War II end? A. 1944 B. 1945* C. 1948 D. 1950

A = 0 B = * C = 0.50 D = 0.75

278  Constructing multiple-choice items

“A.” is assigned a value of 0 because a minimally competent candidate would be very unlikely to eliminate it, while “C.” is easier to eliminate as is “D.” The MPL is:



MPL =

1 1 = = 0.36 4 − ( 0 + 0.50 + 0.75 ) 2.75

This is a difficult item as reflected in the MPL = 0.36. We can increase the item difficulty even further by requiring even more fine-grained discrimination between the options and including more detail in the stem as in item 3. 3. When did World War II end in Europe? A. March 1945 B. April 1945* C. May 1945 D. June 1945

MPL =

1 1 = = 0.25 4−0 4

This is a very difficult item as reflected in the MPL = 0.25, which is equivalent to chance guessing for this item. All MCQs should then be referenced with their MPLs when stored in an item bank. As we indicated, the total test score MPL (passing score) is the sum of each item MPL. If we select 50 MCQs from an item bank, the passing score for that test is the sum of the 50 MPLs. Conceivably, the lowest MPL possible is 0.25 (25% to pass)—a very difficult test—to 1.00 (100% to pass). In practice, total test MPLs are usually in the range of 0.55–0.78. These are criterion-referenced tests because passing/failing is based on an absolute standard and not on the norm group performance.

SUMMARY AND MAIN POINTS The objectivity of a test refers to how it is scored. Scoring procedures on some objective tests, such as multiple-choice questions (MCQs), have been so carefully planned that scoring can be automated on a computer. Multiple-choice items are objective tests because they can be scored routinely according to a predetermined key, eliminating the judgments of the scorers. Objective test items produce higher reliability assessments than do subjective test items (e.g., essays) and are therefore more desirable. ●●

●●

The first step to constructing valid tests is to develop a test blueprint as it serves as a tool to help ensure the content validity of an exam. The multiple-choice item consists of several parts: the stem, the keyed-response, and several distractors. The keyed-response is the “right” answer or the option indicated on the key. All of the possible alternative answers are called options.

Summary and main points  279

●●

●●

●●

●●

The stems should include verbs such as define, describe, identify (knowledge), defend, explain, interpret, explain (comprehension), and predict, ­operate, compute, discover, apply (application). In a review of 46 authoritative textbooks and other sources in educational measurement, Haladyna et al developed a taxonomy of 43 multiple-choice item writing rules. In criterion-referenced testing, it is necessary to establish cutoff scores for pass/fail. One very common and simple way to do this is called a MPL in discussion among experts on the minimally competent candidate. For each distractor, experts identify the probability that they believe a hypothetical minimally competent examinee could rule out as incorrect. Quartile probabilities (e.g., 25%, 50%, etc.) are usually used though deciles (10%, 20%, etc.) could also be used. The probabilities for each expert (2–3) for each option are averaged. The total test score MPL (passing score) is the sum of each item MPL.

BOX 11.1: Advocatus Diaboli: Tricks for taking MCQ tests Many students, particularly those who are anxious or suffer from test anxiety, may be compromised on MCQ tests. These candidates frequently report that such tests cause them to become very anxious, and as a result, their performance suffers significantly. Observations of unsuccessful test-takers have revealed at least two sources of difficulties. First, poor test-takers do not generally spend sufficient time reading the question component or stem of the MCQ. The stem frequently contains information that the test-taker can use to eliminate the incorrect options. Reading the stem too quickly results in missing such information and thereby, adds unnecessarily to the difficulty of the item. Second, racing through the stem in order to choose an option creates a frenzied pace for the student which adds to their already high anxiety. The following strategy has been suggested6 to help students in slowing their pace and thus read the stem more carefully and thoroughly. The strategy also gives the test-taker a greater feeling of control, and accordingly, reduces their anxiety somewhat.

Specific instructions to students Stop reading at the end of the stem and cover the options with your hand. ●● Reflect on what information the stem provides and what you know about the specific content area of the question. ●●

280  Constructing multiple-choice items

●●

●●

Respond silently to the question with what you believe might be the correct answer. After completing this task, uncover the options. Confirm your tentative response with one of the options.

This is called the SRRC strategy. Many students, particularly those experiencing anxiety during an MCQ exam, have reported that this strategy has helped them to feel more in control during examinations and has helped to maximize their performance. There are several additional strategies for approaching an MCQ test item. ●●

●●

●●

●●

●●

Turn the stem into a question. For some students, a stem such as “The field of epidemiology is concerned with” is more difficult to respond to than the question, “What is epidemiology?” While the reasons for this are not completely clear, there is evidence that converting open stems into complete questions does increase test scores.7 One possibility is that converting the incomplete stem to a question more clearly focuses the problem and thus reduces the ambiguity of the question. Additionally, it probably increases the efficiency of cognitive processes such as decoding and memory searches. Try elimination. Students frequently perceive the task of selecting the correct answer in a four-option item as selecting one from four possibilities. Through the process of elimination, the choice can be reduced to one in two options. Even for questions which are very difficult, students usually have some knowledge about the item making the elimination of one or more options possible. Watch for qualifiers as specific determiners. Options that contain absolutes such as never, only, always, certainly, or all (specific determiners) are frequently incorrect responses. Test constructors sometimes use such options to easily render the option incorrect thus reducing the need to develop other plausible distractors. Therefore, they are rarely meant to be the correct responses and more often simply provide evidence of poor test construction. These are usually created as distractors. Conversely, options that equivocate with the use of words like sometimes, frequently, in most cases, usually (also specific determiners), and so on are generally correct options. Identifying these options and responding accordingly will allow students to capitalize on poorly constructed tests rather than be penalized by them. More words tend to be right. In any well-designed MCQ item, there is only one correct option. In order for this to be true, test writers construct the correct option to be complete and to account for all the possible information. On occasion, the need to construct a valid correct option results in too little attention to the construction of the distractors. Consequently, the distractors may be considerably shorter than the correct answer. Long complete options, therefore, frequently represent the correct answer. Look for cues from the stem to the options. These may be grammatical links (e.g., past tense), plural versus singular, repeated phrases or words, and so on.

Summary and main points  281

PART A: EXAMPLE OF A TABLE OF SPECIFICATIONS FOR A CELLULAR PROCESSES TEST

Levels of understanding Content area

Knowledge

Comprehension

Application

1. Nucleus

2

5

1

8 (32%)

2. Chromosomes

2

4

2

8 (32%)

3. Active transport

2

3

4

9 (36%)

6 (24%)

12 (48%)

7 (28%)

25

1. State the general purpose of the test, i.e., how will the test results be used? For example, are there diagnostic uses for the test? (½ mark) 2. State who will take the test. (½ mark) 3. The test should include a title, explicit instructions (i.e., where do students record their answers) and indicate the time allocated to write the exam. (3 marks) 4. Include a scoring key which includes a breakdown of the frequency of the correct answer (i.e., a = 7; b = 6; c = 6, d = 6). (1 mark)

PART B: WRITING THE TEST (25 MARKS—1 PER QUESTION) According to your blueprint developed in Part A of this assignment, write a 25 item multiple-choice exam. Each item should have four options—a keyedresponse and three distractors. Below is a list that will help you construct your test. First read through the criteria below and then, after your test is written, evaluate your questions using these criteria.

CRITERIA FOR MCQS: ONE MARK WILL BE DEDUCTED FOR EACH OF THE FOLLOWING NOT ATTENDED TO 1. Is the stem of the item meaningful by itself and does it present a definite problem? (Remember, the direct question (closed stem) tends to be easier for students to answer—who, what, where, when, why, how.) 2. Does the stem include as much of the item as possible and is it free of irrelevant material? 3. Have you used negatively stated stems only when significant learning outcomes require it? Have you clearly identified the negative qualifier (i.e., bold lettering, all capitals, underlining, etc.)? 4. Are all of the alternatives grammatically consistent with the stem? 5. Does each item contain only one correct or clearly best answer?

282  Constructing multiple-choice items

6 . Are all the distractors plausible? 7. Have you avoided verbal associations between the stem and the correct answer? 8. Does the relative length of the alternatives provide a clue to the answer? 9. Does the correct answer appear in each of the alternative positions approximately an equal number of times, but in random order? 10. Have you used special alternatives such as “none of the above” or “all of the above” sparingly and only when significant learning outcomes require it? 11. Have you left at least one (1) blank line between the stem and the first alternative? 12. Are all maps/diagrams clearly labelled to indicate which questions correspond to them? 13. Have you provided references for materials taken from other sources? 14. Is the question and all of the options presented on the same page? 15. Questions with answers requiring mathematical calculations must have a rationale for each distractor.

PART C: REFLECTIONS After you have completed this assignment, write a brief reflection (1-page typewritten maximum) about your reactions to multiple-choice exams and the construction of a multiple-choice exam. NOTE: the reflection is an important part of this assignment. Although the reflection is not marked, in order to receive a grade for the entire assignment, the reflection must be handed in. TO SUBMIT TO THE INSTRUCTOR 1. Part A—Questions 1–4. 2. The multiple-choice exam (which includes Part A—Question 5). 3. The scoring key (which includes Part A—Question 6). 4. Your reflection. All of the above should be typewritten.

REFERENCES



1. Haladyna TM, Downing SM, Rodriguez MC. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309–334; Haladyna TM. (1994). Developing and Validating MultipleChoice Test Items. Toronto, ON: Erlbaum; Downing SM, Haladyna TM. (2006). Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum. 2. Baghaei P, Amrahi N. (2011). The effects of the number of options on the psychometric characteristics of multiple choice items. Psychological Test and Assessment Modelling, 53(2), 192–211; Kilgour JM, Tayyaba S. (2016). An investigation into the

Summary and main points  283



optimal number of distractors in single-best answer exams. Advances in Health Science Education, 21, 571–585. 3. Gronlund NE, Cameron IJ. (2005). Assessment of Student Achievement, Canadian Edition 1st Edition. Toronto, ON: Pearson. 4. Yousuf N, Violato C, Zuberi RW. (2015). Standard setting methods for pass/fail decisions on high stakes objective structured clinical examinations: A validity study. Teaching and Learning in Medicine, 27(3), 280–291. 5. Cizek GJ, Bunch MB. (2007). The Nedelsky method. In Cizek GJ, Bunch MB, eds., Standard Setting (pp. 68–74). Thousand Oaks, CA: SAGE Publications Ltd. doi:10.4135/9781412985918. 6. Violato C, McDougall D, Marini A. (1992). Assessment of Classroom Learning. Calgary, AB: Detselig. 7. Violato C, Marini AE. (1989). Effects of stem orientation and completeness of multiple choice items on item difficulty and discrimination. Educational and Psychological Measurement, 49(1), 287–296.

12 Constructed response items ADVANCED ORGANIZERS • The constructed response or open-ended item is most commonly known as the essay test, but consists of any item format type where the examinee must construct a response rather than select one as in the multiple-choice format. • These item types are sometimes also called subjective tests because ­scoring involves subjective judgments of the scorer. By contrast, the MCQ is referred to as objective, because no judgments are required to score the items. • Constructed response questions are usually intended to assess students’ ability at the higher levels of Bloom’s taxonomy: application, analysis, synthesis, and evaluation. • There are at least two types of essay questions, the restricted and extended response forms. This classification refers to the amount of ­freedom allowed to candidates in composing their responses. The restricted response question limits the character and breadth of the student’s composition. • There are two basic approaches to the scoring of constructed response test questions: analytic scoring and holistic scoring. • Analytic scoring rubrics list specific elements of the response and specify the number of points to award each response. • The holistic rubric contains statements of a typical response at each score level so that actual responses written by test-takers provide examples of a 10-point response, an 8-point response, a 5-point response, and so on. • The total amount of effort in the use of either essay test or MCQs use is, in part, a function of the number of students to be tested. The greatest amount of effort for MCQ use is in the construction of the test items while for the essay it is the grading.

285

286  Constructed response items

• The idea of grading essays by computer has been around since the mid1960s. Researchers have recently made progress in using computers to score constructed item responses. Automated scoring reduces the time and cost of the scoring process.

INTRODUCTION The constructed-response or open-ended item is most commonly known as the essay test but consists of any item format type where the examinee must construct a response rather than select one as in the multiple-choice format. In addition to a written answer, the constructed response may consist of other performance types such as presenting a case study, an interpretive dance, or performing a physical exam on a patient. Assessing clinical skills such as conducting a physical exam is usually done in an objective structured clinical exam (OSCE) and is the subject of Chapter 13. Constructed-response items require students to apply knowledge, skills, and analytic thinking to authentic performance tasks. These are sometimes called open-response items, because there may be a number of ways to correctly answer the question; students are required to construct or develop their own answers without suggestions or choices. These item types are sometimes also called ­subjective tests, because scoring involves subjective judgments of the scorer. By ­contrast, the MCQ is referred to as objective because no judgments are required to score the items. Constructed-response items can be simple, requiring a phrase or sentence or two as answers. Or they can be complex, requiring students to read a prompt or a paragraph and write an essay or analysis of the information. Based on Bloom’s taxonomy of cognitive outcomes (Table 5.1), constructed-response questions are usually intended to assess students’ ability at the higher levels of the ­taxonomy: application, analysis, synthesis, and evaluation. The constructed-response ­formant then is intended to require students to apply, analyze, synthesize, and evaluate knowledge, reasoning, and skills. Illustrative verbs for the stems or questions for constructed-response items can include modify, operate, compute, discover, apply (application), identify, ­differentiate, discriminate, infer, relate, distinguish, detect, classify ­(analysis), combine, compose, design, plan, revise, deduce, produce (synthesis), and appraise, compare, contrast, criticize, support, justify (evaluation).

TABLE OF SPECIFICATIONS Before writing the constructed-response test (or any other test), we begin with a table of specifications. Table 12.1 is an example of a TOS for an essay test with ten questions, three levels of understanding and three topics of evidence-based medicine. The numbers in the cells represent the number of essay question for that topic and level of understanding. For example, there are two essay questions for validity of empirical evidence at the analysis level of understanding.

Types of essay items  287

Table 12.1  An example of a table of specifications for constructed response test Content area Evidence-based medicine

Levels of understanding Analysis Synthesis Evaluation

1. Need for empirical evidence for clinical application

0

1

1

2 (20%)

2. Validity of empirical evidence

2

1

1

4 (40%)

3. Translation sciences and clinical practice

1

2

1

4 (40%)

4 (40%)

4 (40%)

2 (20%) 10 (100%)

TYPES OF ESSAY ITEMS There are at least two types of essay questions, the restricted and extended response forms. This classification refers to the amount of freedom allowed to candidates in composing their responses. The restricted response question limits the character and breadth of the student’s composition. In this type of essay question, the following conditions are met: the student is directed toward a particular type of answer, the scope of the problem is limited, and the length of the essay is sometimes specified.

EXAMPLE 1 A restricted response essay: how might the principles of behaviorism help physicians to maintain motivation for their patients’ adherence to treatment plans? In answering this question, at least complete the following tasks: a. briefly describe the principles of behaviorism; b. illustrate the use of these in maintaining motivation for their patients’ adherence to treatment plans. The answer should be between one and a half and two pages long.

The essay test outlined in Table 12.1 is a restricted-response essay. There are ten questions for a 4 h examination for approximately 24 min per question. The answers should be all of equal value (10 points), a maximum of one page long.

288  Constructed response items

EXAMPLE 2 A restricted-response essay: a 30-year-old man is brought to the ­emergency department 30 min after being stung by several wasps. He is confused and has difficulty breathing. His temperature is 38°C (100.4°F), pulse is 122/min, respirations are 34/min, and blood pressure is 80/40 mm Hg. Physical examination shows dry skin and decreased capillary refill. There are multiple erythematous, inflamed marks on the back, and 1+ pitting edema of the ankles. The first treatment step is the administration of 0.9% saline solution. a. What should be administered as the most appropriate next step in management? b. Describe the pathophysiology in this patient that accounts for the clinical symptoms. c. How does the treatment function to alleviate the symptoms? The answer should be about two pages long. These examples of restricted response essay questions aim the student at the desired answer. Moreover, the scope of the essay has been circumscribed. As compared to the extended essays, restricted-response questions promote greater reliability in scoring. They may also reduce the student’s opportunity to apply, analyze, and synthesize disparate information into a coherent whole, however, thereby reducing divergent thinking, a major purpose of the constructed response. The extended-response question places fewer limitations on discussion and the form of the answer. An example of the extended-response essay question is a follows:

EXAMPLE 3 An extended-response essay question: how might humanistic psychology be used to maintain motivation for their patients’ adherence to treatment plans? Illustrate with appropriate examples the application of particular theoretical constructs from humanistic psychology in the maintenance of patients’ motivation for adherence to treatment plans.

In answering this essay question, students may select the particular theoretical constructs from humanistic psychology which they deem to be most important and relevant in maintaining motivation in their patients’. This is in contrast to the previous example of a restricted-response question. In addition, the form of the extended-response essay remains the responsibility of the student and the length is unspecified, thus, allowing for more flexibility and creativity in responses.

Guidelines for preparing essay items  289

EXAMPLE 4 An extended-response essay question: the main purpose of a clinical trial is to determine the efficacy and safety of a medical treatment such as a drug. Design a clinical trial to test the efficacy and safety of a medical treatment of your choice, adhering to the best practices and theory of clinical trials. Illustrate with appropriate discussion the application of particular theoretical and statistical principles of clinical trial design. Typical ­proposals of this sort are usually around 3,000 words of text. You may use tables, charts, appendices, etc. which are not included in the word count.

EXAMPLE 5 An extended-response essay question: a 25-year-old woman is brought to the emergency department 1 h after she fainted. She has had mild intermittent vaginal bleeding, sometimes associated with lower abdominal pain, during the past 3 days. She has had severe cramping pain in the right lower abdomen for 12 h. She has not had a menstrual period for 3 months; previously, menses occurred at regular 28-day intervals. Abdominal examination shows mild tenderness to palpation in the right lower quadrant. Bimanual pelvic examination shows a tender walnut-sized mass in the right parametrium. What is the most likely differential diagnosis? Based on this differential diagnosis, write a patient management plan, relevant treatments, prognosis, and follow-up plans.

GUIDELINES FOR PREPARING ESSAY ITEMS ●● ●●

●● ●●

●●

●●

Allow adequate time to write, re-write, and edit the essay questions. Use restricted response questions whenever possible unless extendedresponse format is clearly required for higher level measurement of cognitive levels. State the problem in the form of a question or statement. Do not use optional questions. Optional questions will undermine the content validity of the test. If an essay topic is important enough for one student, it is important enough for all students. If it is not important enough for all students, it is not important enough for any student. When asking student’s their opinions, require that they support them with knowledge and a rational argument. The essay examination should be given under standard conditions. The instructions, time, and place of writing should be uniform for all students.

290  Constructed response items

●●

●●

●●

Clearly indicate the time allotted for the total test (e.g., 2 h, a week for takehome essays—give precise due dates). Teach your students how to write essay tests with the following guidelines: ●● Ensure that all understand the general instructions. ●● Teach your students to first read and understand the question. ●● Before students begin writing, they should spend about 10% of the total question time in understanding the question and developing an outline for the answer. ●● Teach them to answer the question: if the question asks to “compare and contrast”, students should not simply describe. Teach students to use appropriate headings such as “How the two theories compare” and “How the two theories contrast”. ●● Teach them to develop an outline with key ideas. ●● Each essay should have an introduction, the body of the essay, and a conclusion. ●● Teach students to use all of the allotted time. Create a scoring key for each question.

GUIDELINES FOR SCORING ESSAY QUESTIONS ●●

●●

●●

There are two basic approaches to the scoring of constructed-response test questions: analytic scoring and holistic scoring. In both cases, develop an answer key or rubric to aid the scoring of the essay. Even so, judgment about the quality will be subjectively determined at the time of scoring. The rubric tells the scorer what aspects of the response to identify and how many points to award.

Analytic scoring ●●

●● ●●

Analytic scoring rubrics list specific elements of the response and specify the number of points to award each. On a diagnostic question, the scorer awards 2 points for providing a correct clinical impression, 2 points explanation of the pathophysiology, and 1 point for the correct treatment (total points = 5). Each question with the same value (e.g., 5 points, 10 points, etc.). Set realistic performance standards.

Holistic scoring ●●

Holistic scoring is different from analytic scoring. The holistic rubric contains statements of a typical response at each score level so that actual responses written by test-takers provide examples of a 10-point response, an 8-point response, 5-point, and so on. The student examples may also include borderline cases such as a response that just is a minimal pass (e.g., a score of 5) or a response that fails (e.g., receive less than a score of 5).

Guidelines for scoring essay questions  291

●●

●●

●●

●●

Holistic scoring is even less consistent that analytic scoring—the same response scored by two different scorers is less likely to receive the same score from both scorers than analytic scoring. For some types of constructed-response questions (e.g., measuring writing ability or moral reasoning), it may not be possible to identify specific elements of the essay. Responses to these can be scored holistically. The advantage of holistic scoring is that it allows for student creativity, ­divergent thinking, and novel responses. A good example of holistic scoring are the Gold Foundation annual essay contest that asks medical students to engage in a reflective ­w riting exercise on a topical theme related to humanism in medicine (www. gold-foundation.org/2013-humanism-in-medicine-essay-contest-winnersannounced/). First, second, and third place essays are chosen by a panel of physicians and noted writers. Winners receive a monetary award and their essays are published in Academic Medicine. Students write an essay or story of 1,000 words or less on themes such as: “Who is the ‘good’ ­doctor?” and “What do you think are the barriers to humanism in medicine today?”

General scoring guidelines ●●

Irrespective of analytical or holistic scoring, require proper grammatical responses with full sentence and paragraphs. Do not allow point form. ●● If there are multiple essay questions, grade one question at a time for all students. ●● Randomize the order of students and score all the answers to the first question. ●● Randomize again and score the second question, etc. until all are scored (this procedure will minimize item-to-item order effects as well as student-to-student order effects). ●● Score each test anonymously. Use numerical IDs only that mask student names. This will minimize the halo or horns effects. ●● Be wary of the dove and hawk effects, also called the generosity and severity effects. Some graders are doves and give easy marks, while others are hawks and mark very vigorously. Try to identify which one you are and guard against it. ●● Be wary of the central tendency effect or the average effect. That is, ­g rading everyone as in the middle without extremes of good and bad. ●● Guard against unconscious bias, another common type of error. This refers to attitudes or stereotypes that are outside our awareness but nonetheless affect our judgments and grading. These are both positive and negative automatic associations about other people based on ­characteristics such as race, ethnicity, sex, age, social class, and appearance. ●● Write comments on each paper.

292  Constructed response items

Computer scoring The greatest problems in constructed-response testing are the time and expense involved in scoring. The scorers need to be highly trained as in the Gold Foundation essays and require elaborate systems for monitoring the consistency (reliability) and accuracy (validity) of the scores. Researchers have recently made progress in using computers to score constructed item responses. Automated scoring reduces the time and cost of the scoring process. For most testing situations, constructed response items requiring human scoring are impractical and prohibitively expensive. While a number of essay scoring programs have been developed, this approach remains very controversial and not very widely used.1

COMPARING TOTAL EFFORT FOR MCQ AND ESSAYS TESTS The total amount of effort in the use of either MCQs or essay test use is, in part, a function of the number of students to be tested (Figure 12.1). The greatest amount of effort for MCQ use is in the construction of the test items. The effort in scoring is minimal, because these items can be scored by computers. The amount of effort in test construction for MCQs is large irrespective of the number of students to be tested, but the effort in scoring is relatively unaffected. By contrast, the essay test is relatively easy to construct but most of the effort is in scoring the test—creating the scoring rubric and applying to the student responses. The effort in scoring increases rapidly as the number of students increases. Figure 12.1 shows the relationships between MCQs and essay test total effort as a function of the number of candidates tested. Constructing a high-quality 100 MCQ test with clinical vignettes following the rules specified in Chapter 11 may take more than 33 h (approximately 20 min per Total Effort for MCQ and Essay Tests 120

Total Effort

100 80 60

MCQ Essays

40 20 0

10

15

20 50 Number of Students

75

100

Figure 12.1  Comparison of total effort for MCQ and essays tests.

Comparing total effort for MCQ and essays tests  293

item). Once this is done, however, the effort for scoring is small and doesn’t increase much even with a large number of candidates. Constructing a high-quality ten question essay test may take only about 2 h even strictly ­adhering to the rules as specified in this chapter. Scoring the results, however, may be quite effortful. For ten students, it may take about 5 h (30 min/student). For 20 ­students, it may take 10 h, for 50 students 25 h, for 100 students 50 h, and so on. Determining the testing format for any given situation is not dependent on the number of students but rather the levels of cognitive measurement. MCQs should be used for the first three levels of Bloom’s taxonomy (knowledge, comprehension, application), and constructed response should be used for the next three ­(analysis, synthesis, evaluation).

BOX 12.1: Advocatus Diaboli: Automated essay scoring Important skills such as critical thinking and problem solving are better assessed with constructed-response item formats, such as essays, than they are with MCQs and other selection items. Constructed-response assessments are very effortful to score (Figure 12.1), because they rely on human raters. The responses to constructed-response tasks are manually scored by content specialists who must be trained for the grading task, do the scoring, and monitored throughout to ensure that scores achieve adequate reliability. Automated essay scoring (AES) technology involves the process of scoring and evaluating written text using a computer program that builds a scoring based on a rubric provided by humans. Then by using machinelearning algorithms, it compares features to the rubric so that the computer can classify, score, or grade new text material submitted by a new group of students. AES phase can assess many different types of written responses, such as patient case management and restricted response essays. The benefits of AES include improved reliability of scoring, reducing the time required for scoring, reducing scoring costs, and immediate feedback on the results. The idea of grading essays by computer has been around since the mid-1960s; more than 50 years with the work of Ellis Page.2 At that time, researchers worked with punch cards and mainframe computers. Notwithstanding advances in test theory and computer technology since then, AES still remains controversial and has not been widely implemented in education. Some excitement was sparked in 2012 when the Hewlett Foundation sponsored a competition to determine if AES can be as reliable as human raters for scoring 22,029 essays. There were initial claims that the AES was as reliable as human scoring; this has since been criticized by a variety of commentators including the Educational Testing Service. The criticisms included3 the following: ●●

five of the eight datasets analyzed consisted of paragraphs rather than essays (Continued)

294  Constructed response items

BOX 12.1 (Continued): Advocatus Diaboli: Automated essay scoring ●●

●●

four of the eight datasets were graded by human readers for content only rather than for writing ability AES machines employed an artificial construct, the resolved score, rather than the true score thus giving AES an unfair advantage

Shermis and Hamner4 in an evaluation of the result of the Hewlett Foundation competition concluded that AES has developed to the point where it can be reliably applied in both low-stakes assessment (e.g., instructional evaluation of essays) and for high-stakes testing as well. They suggested that Page’s “fascinating inevitability,” for the potential for AES in educational assessment has been realized. AES systems, however, cannot understand written text. Rather, AES uses approximate variables that correlate with the intrinsic variables of interest to classify written-response scores. AES is best used for well-defined, precise answers to questions. In the field of medical education, Gierl et al.1 conducted a study of clinical decision making (CDM) constructed response items used by the Medical Council of Canada (MCC) in Part I of its Qualifying Examination using an AES system. This test assesses the competence of candidates who have obtained their medical degree for entry into supervised clinical practice in postgraduate training programs. The CDM component consists of short-menu and short-answer write-in questions; candidates are allowed 4 h to complete. Gierl et al. utilized responses from six CDM write-in questions from the MCC database. Physicians scored the typed CDM open-ended responses using an established scoring rubric for grading the CDM-constructed responses. Each response was scored independently by two physician raters. Data from the 2013 administration served as the validation dataset. The computer program Light Summarization Integrated Development Environment (LightSIDE), open source software, was used to analyze the data as AES. Exact agreement between the computer classification and the human raters on the validation dataset ranged from 94.6% to 98.2%. This confirmed, as Page predicted in 1966, that AES systems can classify scores at a rate as high as that of the agreement among human raters themselves. Notwithstanding, acceptance of AES has not been achieved among students, teachers, professors, and the public more generally. One criticism is that computers use a mechanistic process which is incapable of either understanding or appreciating written text. Because of this deficiency, scores are not considered valid. Another view is that the written responses of humans are best assessed using the knowledge, reasoning, and understanding of other humans. This is a criticism of machine learning and artificial intelligence generally. Page and Petersen5 called this the humanist objection. The AES criteria, critics claim, are trivial (e.g., the use (Continued)

Summary and main points  295

BOX 12.1 (Continued): Advocatus Diaboli: Automated essay scoring of the total number of words as for fluency, matching words for recognition, identified phrases for comprehension, etc.). Geirl et al. asserted that AES provides a method for efficiently scoring constructed response tasks and could complement the widespread use of selected response items (e.g., MCQs) used in medical education. AES could serve medical education in two ways. First, AES could be used to provide summative scores required for high-stakes testing. Human scoring would still be required for constructed response items on summative assessments but that AES could be the second scorer thus reducing costs and resources. Instead of using two human raters for each constructed response task, the initial score could be produced by one rater and the second score by the AES system. Second, AES could assist medical educators by helping them to provide students with rich formative feedback. This would be particularly useful in large classrooms where providing this kind of feedback by human graders would be onerous. AES could not only evaluate students but also provide them with immediate, detailed feedback on their constructed responses. AES could provide the ability to measure, monitor, and improve writing skills in their students. Current applications of AES require the judgments, expertise, and experiences of content specialists to produce the input and output required for machine learning. AES requires computer technology for the algorithmic task of extracting complex features from the essays scored by human raters. Although not yet widely accepted, AES still holds the promise of reliable and valid evaluations of written text thus automating the processing and evaluating written text. Page’s 1966 prediction that AES systems can do better evaluations of written text than human raters may yet come true.

SUMMARY AND MAIN POINTS Essay questions or constructed-response items can be used to measure at the higher levels of Bloom’s taxonomy: application, analysis, synthesis, and evaluation. There are two basic types of essay questions, the restricted and the extended response. These can be scored by either the analytic method or the holistic method, both of which require the development of a rubric. There are well-established guidelines for scoring essays that help to reduce subjectivity of scoring and error of measurement. The total amount of effort in the use of either MCQs or essay test use is, in part, a function of the number of students to be tested. The greatest amount of effort for MCQ use is in the construction of the test items, while for the essay it is the grading time. The idea of grading essays by computer has been around since

296  Constructed response items

the mid-1960s. Notwithstanding advances in test theory and computer technology since then, AES still remains controversial and has not been widely implemented. 1. The constructed-response or open-ended item is most commonly known as the essay test but consists of any item format type where the examinee must construct a response rather than select one as in the multiple-choice format. 2. These item types are sometimes also called subjective tests because scoring involves subjective judgments of the scorer. By contrast, the MCQ is referred to as objective because no judgments are required to score the items. 3. Constructed-response questions are usually intended to assess students’ ability at the higher levels of Bloom’s taxonomy: application, analysis, synthesis, and evaluation. 4. There are at least two types of essay questions, the restricted and extended response forms. This classification refers to the amount of freedom allowed to candidates in composing their responses. The restricted response question limits the character and breadth of the student’s composition 5. There are two basic approaches to the scoring of constructed-response test questions: analytic scoring and holistic scoring. 6. Analytic scoring rubrics list specific elements of the response and specify the number of points to award each response 7. The holistic rubric contains statements of a typical response at each score level so that actual responses written by test-takers provide examples of a 10-point response, an 8-point response, 5-point, and so on. 8. The total amount of effort in the use of either essay test or MCQs use is, in part, a function of the number of students to be tested. The greatest amount of effort for MCQ use is in the construction of the test items, while for the essay it is the grading. 9. The idea of grading essays by computer has been around since the mid-1960s. Researchers have recently made progress in using computers to score constructed item responses. Automated scoring reduces the time and cost of the scoring process.

REFLECTIONS AND EXERCISES Essay test construction 30 marks Purpose: to practice and develop skills in writing an essay exam. Directions:

PART A: PLANNING THE TEST (6 MARKS) Construct a four-item essay exam in a content/subject area in which you have some competence.

Summary and main points 297

Table 12.2  An example of a table of specifications for organ transplantation Content area Organ transplantation

Levels of understanding Synthesis

Evaluation

1. Need and quality of life

1

1

2 (50%)

2. Cost-benefit analysis

1

1

2 (50%)

2 (50%)

2 (50%)

4

1. Outline the content area upon which the test is based. (1 mark) 2. Construct a two-way TOS with four cells (Two Levels of Understanding by two subsections of the Content Area). This table should include the following: a. two (2) levels of understanding: synthesis and evaluation. (1 mark) b. two (2) subsections of the content area. (1 mark) c. the number of test items for each cell. (1 mark) d. the total number of items. (½ mark) e. percentage weights for each level of understanding and for each content element. (½ mark) 3. State the general purpose of the test, i.e., how will the test results be used? (½ mark) 4. State who will take the test (Table 12.2). (½ mark)

PART B: WRITING THE TEST (22 MARKS) According to your blueprint developed in Part A of this assignment: 1. Write a four-item, restricted response essay exam. (2 marks each; 8 total) 2. Write a sample answer for each essay question. Use full sentences and ­appropriate essay format (introduction, body, and conclusion). (2 marks each; 8 total) 3. Using an embedded scoring system within your sample answer, indicate how points are to be assigned for each answer. See the example below. (1 mark each; 4 total) 4. The test should include a title and explicit instructions (i.e., time for each question and value of each question). (2 marks)

EXAMPLE OF AN ESSAY QUESTION AND A RUBRIC WITH AN EMBEDDED SCORING SYSTEM VIGNETTE A number of tests by the family doctor and specialists confirm the diagnosis of astrocytoma, grade 4, brain cancer (i.e., advance stage of development) for a 15-year-old girl.6 There is no known cure for this type of cancer, but the d ­ octors’ suggest that they can slow the process if they begin ongoing chemotherapy immediately. After a few days, the mother and

298  Constructed response items

daughter decide to end the chemotherapy treatments to pursue a variety of alternative non-toxic therapies (e.g., herbology, nutritional modification, vitamin therapy). The girl’s father, however, is in direct conflict with his wife and daughter and wants them to return to the original chemotherapy treatment plan.

QUESTION In your opinion, what should the family doctor say to the family regarding their decision to pursue alternative therapies? Support your answer with (3) reasons for your opinion. You will receive (1) mark for your opinion and (3) marks for each supporting reason for a total of 10 marks. You have 20 min to complete this essay.

SAMPLE ANSWER WITH EMBEDDED MARKS In my opinion, it’s always really up to the patient to determine what course they want to go on. The doctor is a kind of the moderator on these and that it’s a horrible decision to have to make, especially for a 15-yearold girl (1). If the doctor is confident that the 15-year-old girl and her mother have made a sound decision he/she would have to defend them in their decision. And the doctor would probably need to outline the conflict to the governing body because that seems to be of pretty big importance to this case. (3) So, the family ­doctor’s obligations are to outline the options available, which are to be on chemotherapy or not be on chemotherapy. As a family doctor, I would try to work with the mother, father, and daughter to try and blend the two approaches and find a middle ground they might be comfortable with. (3) Also helping the father and mother realize that the social benefits to their daughter of good family relations are going to do way more for their daughter’s health than either of the two treatment modalities. (3) For these reasons, I feel that the doctor should be a moderator to help the whole family reach a decision on which they can be comfortable.

PART C: REFLECTION
After you have completed this assignment, write a brief reflection (one page typewritten maximum) about your reactions to essay exams and the construction of an essay exam. NOTE: the reflection is an important part of this assignment. Although the reflection is not marked, in order to receive a grade for the entire assignment, the reflection must be handed in.


To submit to the instructor
1. Part A—Questions 1–4.
2. The essay exam which includes Part B—Question 4.
3. The sample answers with an embedded scoring system.
4. Your reflection about the construction of an essay exam.
5. Two (2) marks will be awarded for professional appearance of the exam (spelling and grammar).
All of the above should be type-written.

REFERENCES
1. Gierl MJ, Latifi S, Lai H, Boulais A, Champlain A. Automated essay scoring and the future of educational assessment in medical education. Medical Education 2014, 48:950–962.
2. Page EB. The imminence of grading essays by computers. Phi Delta Kappan 1966, 47:238–243.
3. Perelman L. When "the state of the art" is counting words. Assessing Writing 2014, 21:104–111; Bennett RE. The changing nature of educational assessment. Review of Research in Education 2015, 39(1):370–407.
4. Shermis MD, Hamner B. Contrasting state-of-the-art automated scoring of essays. In: Shermis MD, Burstein J, eds. Handbook of Automated Essay Evaluation: Current Application and New Directions. New York, NY: Routledge 2013:313–346.
5. Page EB, Petersen NS. The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan 1995, 48:238–243.
6. Donnon T, Oddone-Paolucci E, Violato C. A predictive validity study of medical judgment vignettes to assess students' noncognitive attributes: A 3-year prospective longitudinal study. Medical Teacher 2009, 31(4):e148–e155. doi: 10.1080/01421590802512888.

13 Objective structured clinical exams

ADVANCED ORGANIZERS
• Due to the errors produced by traditional methods and the need to improve the reliability and validity of clinical competency measurement, the objective structured clinical examination (OSCE) has been introduced.
• The OSCE concentrates on skills, clinical reasoning, and attitudes and, to a lesser degree, basic knowledge. Content checklists and global rating scales are used to assess observed performance of specific tasks. Candidates circulate around a number of different stations containing various content areas from various health disciplines.
• Most OSCEs utilize standardized patients (SPs), who are typically actors trained to depict the clinical problems and presentations of real issues commonly taken from real patient cases.
• OSCE validity concerns are (1) content-based—defining the content to be assessed within the assessment; (2) concurrent-based—requiring evidence that OSCEs correlate with other competency tests that measure important areas of clinical ability; and (3) construct-based—one way of establishing construct validity with OSCEs is by differentiating the performance levels between groups of examinees at different points in their education.
• OSCE reliability concerns include which reliability coefficient to use (inter-rater, ICC, Ep2), how many raters are required, which people (e.g., SPs or MDs) make the best raters, and what the appropriate number of stations in an OSCE is.
• As OSCEs are used for both summative and formative purposes, it is important to determine the standards or score cutoff (such as pass/fail) that defines the level of expected performance for different groups of candidates.



• OSCEs are both time-consuming and resource-intensive. Administrative concerns such as test security and cost analysis play important roles in ensuring the quality of OSCEs.
• The OSCE has been adopted worldwide with very little scrutiny and study. There is not much systematic evidence of empirical validity for the OSCE.

INTRODUCTION

The traditional assessment of clinical competence has, since time immemorial, consisted of bedside oral examinations, ward observations, casual conversations between clinical teachers and students, and chart audits. More recently, constructed response formats such as essays and selection formats such as MCQs have also been used. There are several major limitations of these assessment formats. MCQs, which provide objective measurement of biomedical knowledge primarily at the lower levels of Bloom's taxonomy, are too superficial, lack ecological validity, and are disconnected from the real clinical context of the patient–physician encounter. Bedside observations, oral exams, hallway conversations, and chart reviews contain nearly all possible sources of assessment error, including (1) rater error (i.e., inconsistency due to subjectivity) such as the halo effect, hawk-dove effect, etc., (2) lack of standards for marking performance, (3) inadequate sampling of student interactions with patients and across content areas, (4) small numbers of real patients with whom to observe student, resident, or physician performance, and (5) contrived situations and threatening environments in which to test student performance. Due to the errors produced by these methods and the need to improve the psychometric aspects of clinical competency measurement, Harden and colleagues1 pioneered a systematic and objective strategy for observing and evaluating clinical skills—the objective structured clinical examination (OSCE).

Clinical competence and OSCEs

As discussed in Chapter 2 and throughout this book, clinical competence is a multidimensional construct composed of several complex components: knowledge, skills, attitudes, professionalism, and so on. Miller described a four-tier pyramid of clinical ability, identifying the base of the pyramid as knowledge (what someone knows), followed by competence (knows how), performance (shows how), and lastly, at the summit, action (does).2 Clinical competence involves a complex interplay between attributes displayed within the physician–patient encounter which enable physicians to effectively deliver care. Before the introduction of the OSCE, however, the assessment of these basic clinical skills lacked reliability and validity.

The OSCE concentrates on skills, clinical reasoning, and attitudes and, to a lesser degree, basic knowledge. The OSCE is not a test itself but rather a testing approach in which multiple assessment instruments, such as content checklists and global rating scales, are used to assess observed performance of specific tasks. The OSCE is a multiple-station format: candidates circulate around a number of different stations containing various content areas from various medical disciplines. Station lengths can range from 5 min to 1 h, attempting to simulate a real-time medical encounter depending upon the medical discipline. Family medicine physicians usually see patients for between 15 and 20 min, whereas palliative care physicians may spend 30–40 min with patients and their families. Content and principles from surgery, pediatrics, family medicine, orthopedics, and so forth are used within the OSCE. Most OSCEs utilize standardized patients (SPs), who are typically actors trained to depict the clinical problems and presentations of real issues commonly taken from real patient cases. A specified and defined number of tasks are required to be performed by examinees and recorded on structured behavioral checklists and/or rating forms by examiners (usually physician judges or SPs). The encounter can be video and audio recorded and assessed at a later time. This recording can also be used to provide specific formative feedback to the candidate. Frequently, there are two components to the OSCE format, typically referred to as couplet stations, where examinees first perform clinical tasks with an SP for a defined period of time (e.g., 10 min). Then they rotate to a post-encounter station containing x-ray or laboratory findings for the patient they have just seen. The post-encounter station offers a constructed-response format (i.e., various questions which require specific answers) related to the preceding patient encounter. The couplet stations are designed to measure both the application of practical skills and the examinee's ability to take and synthesize relevant patient information. The major strengths of the OSCE are objectivity and structure in scoring the examinees' performance, usually with a checklist or rating scale. Additionally, there is standardization of content, tasks, and processes, accomplished through the training of SPs, who present the same information in a standard fashion across examinees. The OSCE is a performance-based measure that satisfies two main assessment criteria: (1) reliability and validity can be determined and improved and (2) generalizations about the performance of examinees can be made. There can also be educational uses: (1) formative evaluation that provides feedback to clinical teachers and students and (2) summative evaluation that allows clinical teachers, regulatory bodies, and licensing bodies to make pass/fail decisions.

PSYCHOMETRIC ISSUES FOR OSCES

Traditional methods of clinical skill assessment were poor indicators of ability because they were confounded by unstandardized, uncontrolled, and subjective methods of evaluation. The major psychometric principles that underlie any measurement and evaluation apply to OSCEs as well. These include aspects of validity, reliability, standard setting, and sources of measurement error.


Standardized patient accuracy

Does the simulated SP environment reflect the real-life context? Are these simulated clinical encounters credible? How closely do they reflect the real context? Numerous studies have documented the accuracy of SP performance relative to the real-life context, including history taking, physical exam performance, communication, empathy, diagnoses, data collection and evaluation, prescribing, and so on.3 This accuracy and consistency of SP performance depends on the effectiveness of SP training. Considerable error variance (e.g., 25%) can be introduced by SPs when they have not been effectively trained. Tamblyn4 has studied the standardized patient method extensively. She has identified three parameters that are central to portrayal by the SP:
1. Present the same information to every candidate. If the candidate asks about diet, for example, all SPs should give the same response.
2. The information should be provided under the same conditions. Do some SPs but not others spontaneously volunteer information about their medications (e.g., "I'm taking Tylenol")? This type of variation is common and it has an impact on candidate performance.
3. The SP should not mention other problems that were not part of the official case or script (e.g., "I haven't been sleeping well"). Performance standards are designed to be case specific, so deviations in the content of the case bias performance measurement.

Other factors that contribute to error
• Sex differences:
  1. Women SPs are more likely to receive a prescription for medication than men presenting with the same problem.
  2. A woman SP presenting with precisely the same appendicitis symptoms as a man SP will receive irrelevant diagnoses of ectopic pregnancy and endometriosis.
• Body size: presenting a meningitis case, an overweight SP compared to an average-weight one will attract investigations about diabetes and hypertension.
• Consistency of portrayal: an SP portraying high fever with DVT sprays water on the forehead to simulate sweating. When the spray bottle runs dry, candidates receive a different portrayal than did the earlier ones.
• SPs that prompt a candidate who is faltering (e.g., "my pain level is 7/10").
• SPs that make evaluative comments (e.g., "good" or "that's wrong") during the exam.

Employing well-trained SPs with accuracy of case presentation reduces error. When properly trained, SPs can and do uniformly present the case in a way that reduces measurement errors arising from content presentation.


OSCE validity concerns

CONTENT-BASED

Defining the content to be assessed within the evaluation is one of the most important priorities of measurement design. Content validity is concerned with comprehensiveness and relies on the adequacy and representativeness of the items sampled within the domain of interest. As with any assessment, developing a table of specifications, or TOS (Chapter 5), that incorporates the sample of items and level of measurement is central to content validity. Many studies have documented the high content validity of the OSCE approach. Table 13.1 represents a TOS for a 12-station family medicine OSCE.

Table 13.1  Table of specifications of OSCE stations utilized to evaluate residents

1. Stuart Kim, 62-year-old male complaining of problems urinating
   Assessments: History taking; Physical assessment; Counselling
   Differential diagnoses and/or management: Benign prostatic hypertrophy
2. Pat Soffer, 39-year-old male arrives at rural emergency room after vomiting bright red blood
   Assessments: Immediate assessment and treatment; History taking; Physical assessment and emergency management dealing with ABCs
   Differential diagnoses and/or management: Peptic ulcer disease, Mallory-Weiss tear, esophageal varices, gastritis, and malignancy
3. Daisy Long, 75-year-old female presenting with fatigue
   Assessments: History taking with preliminary psychiatric assessment; Communication skills
   Differential diagnoses and/or management: Diuretic-induced hypokalemia, hypertension
4. Andrew Voight, 45-year-old male admitted 5 days previously with chest pain
   Assessments: History taking; Counselling; Risk assessment and management for non-Q-wave myocardial infarction
   Differential diagnoses and/or management: Advise on use of medication and warning signs for seeking healthcare
5. Nancy Posture, 58-year-old female presenting with difficulty sleeping, headaches, loss of appetite, and feeling as though "on an emotional roller coaster"
   Assessments: History taking; Counselling with special risk management for suicidal tendencies
   Differential diagnoses and/or management: Depression and excessive alcohol intake
6. Terry Spencer, 64-year-old female complaining of severe right leg pain
   Assessments: History taking; Physical assessment; Promptness of management and information sharing
   Differential diagnoses and/or management: Necrotizing fasciitis, gangrene
7. Terry Heinsen, 48-year-old male complaining of fever 3 days after an open cholecystectomy
   Assessments: History taking; Physical assessment of abdomen post-operatively
   Differential diagnoses and/or management: Deep vein thrombosis
8. Katherine Jones, 35-year-old female who wants to discuss her risk of getting breast cancer after her younger sister was just diagnosed
   Assessments: History taking; Risk assessment; Counselling
   Differential diagnoses and/or management: Addressing concerns of breast cancer in the family
9. Laura Zhang, 21-year-old female complaining of a severe headache
   Assessments: History taking; Physical assessment of the central nervous system
   Differential diagnoses and/or management: Meningitis
10. Jaden Bogart, 17-year-old male complaining of 'shortness of breath'
   Assessments: History taking; Physical assessment; Information sharing
   Differential diagnoses and/or management: Asthma; aggravating factors such as pets, grass, and perfumes; inhaler use; Ventolin; smoking
11. Sandy Monteiro, 15-year-old female presenting with "excessive weight loss"
   Assessments: History taking (weight loss); History taking (family, past); Interpersonal skills
   Differential diagnoses and/or management: Anorexia nervosa, anxiety, depression, obsessive compulsive disorder
12. Jason Browne, 23-year-old male wanting a note that he is 'sick' to avoid an exam
   Assessments: History taking; Ethical issue; Counselling
   Differential diagnoses and/or management: Discuss ethics; Non-judgemental and empathic counselling/actions


CONCURRENT-BASED

This requires evidence that OSCEs correlate with other competency tests that measure important areas of clinical ability (Chapter 6). OSCEs in pediatrics, for example, can explore the degree to which OSCE scores on various sub-components of skills correlate with final MCQ examinations. There might be correlations between patient management skills and final MCQ testing of knowledge. Other correlations could be between OSCE practical skills and clinical teachers' evaluations of performance on hospital wards. Other concurrent validity studies may involve determining correlations between OSCE total test scores and clerkship and residency ratings, as well as performance on USMLE exams. Correlations between OSCEs and other objective measurements of clinical ability, specifically in the skills and knowledge domains, can provide evidence of concurrent validity.
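As a concrete illustration of the kind of analysis such concurrent validity studies involve, the sketch below computes a Pearson correlation between OSCE total scores and MCQ scores. The data are simulated and the variable names are hypothetical; they are not drawn from any of the studies discussed here.

```python
# Hedged sketch: Pearson correlation between OSCE totals and MCQ scores.
# The data below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_examinees = 60

# Simulate a latent "clinical ability" and two imperfect measures of it.
ability = rng.normal(0, 1, n_examinees)
osce_total = 70 + 8 * ability + rng.normal(0, 5, n_examinees)   # OSCE total score
mcq_score = 65 + 10 * ability + rng.normal(0, 7, n_examinees)   # MCQ knowledge exam

# The correlation coefficient is the evidence bearing on concurrent validity.
r = np.corrcoef(osce_total, mcq_score)[0, 1]
print(f"r(OSCE total, MCQ) = {r:.2f}")
```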

CONSTRUCT-BASED

A construct is a hypothetical entity which cannot be directly observed but must be inferred from observable behaviors and responses to test items (Chapter 6). One of the ways of establishing construct validity with OSCEs is through differentiating the performance levels between groups of examinees at different points in their education. Evidence of construct validity, for example, is provided when performance differences are exhibited between medical students, clerks, and residents (Years 1, 2, 3, 4, or 5) on some task, as these groups have increasing levels of knowledge and expertise. This is the "known group difference effect." A study of 27 second-year surgery residents using a 19-station OSCE covering history taking and physical exam skills, management skills, problem-solving skills, and technical skills revealed significant between-group differences (p < 0.05). These results were corroborated by Stillman et al., who compared performance between three groups of residents (i.e., PGY1 (n = 63) vs. PGY2 (n = 98) vs. PGY3 and 4 (n = 80)) across 19 residency programs in the United States.5 The researchers were interested in examining differences in performance on cases related to content, communication, physical findings, and differential diagnosis by residency level. Using the mean scores attained for each of the four OSCE cases, results revealed significant between-group differences (p < 0.05) for the content and differential diagnosis cases between all years, as scores increased as a function of increasing year in residency. Significant differences for the communication and physical findings cases, however, were observed only among first-, third-, and fourth-year residents. The two studies provided as examples reflect stable findings observed across studies comparing various levels of undergraduate and postgraduate trainees for differences in proficiency levels across skills (i.e., history-taking, communication and interpersonal skills, problem-solving skills, technical skills) and across medical disciplines. These results taken together provide some evidence of construct validity for the use of OSCEs.


Sources of measurement error in validity

There are several types of factors that contribute error:
• Order effects (two types are typical): fatigue and practice effects. Examiner, candidate, and/or SP fatigue is a factor contributing to measurement error in the assessment of clinical competence using OSCEs. Similarly, practice effects for the examiner, candidate, and/or SP can introduce measurement error.
• Anxiety and motivation: increased performance related to OSCE order may be due to decreased student anxiety as they move through the stations, higher motivation, and an increase in comfort level as a result of increased experience with the OSCE format.
• SP factors: the SP's understanding of the clinical problem to be presented, the SP's previous experience with the simulation context and prior acting experience, and SPs with health-related problems similar to the cases portrayed.
• The best solution for reducing these types of errors is controlling the testing environment, particularly by ensuring that SPs are well trained.

RELIABILITY

In performance-based assessments, variance can be decomposed into several components: between raters (inter-rater reliability), between stations (inter-station reliability), between items (internal consistency), and between observed performance and an estimate of future performance (test reliability). Examinees' scores are typically decomposed into two main components: raters and stations (or cases), which refer to inter-rater and inter-station reliability. Generalizability theory can be applied to this type of problem (Chapter 9).

Multiple sources of measurement error

Generally, the standard reliability coefficient (inter-rater, ICC) is set at 0.80. How many raters are required to achieve this standard of 0.80? Which people (e.g., SPs or MDs) make the best raters? What is the appropriate test length or number of stations in an OSCE? In their review, van der Vleuten and Swanson6 found that (1) one rater provides as good an estimate of reliability for station scores and rater consistency as two or more raters, and increasing the number of stations is more important than increasing the number of raters; (2) there are comparable inter-rater reliabilities between SPs and physician raters; (3) inter-station reliability is the most variable among the sources of reliability estimates, as an examinee's performance on one station does not predict performance on another station well; and (4) an average of between 7 and 15-plus stations over a minimum of 3–6 h of testing time is typically required to achieve an Ep2 ≥ 0.80. Inter-rater reliability is generally higher than inter-station reliability.
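The trade-off between adding stations and adding raters in point (1) above can be made concrete with a generalizability decision (D) study. The sketch below uses the standard D-study formula for a fully crossed person x station x rater design; the variance components are hypothetical values chosen only for illustration, with a dominant person-by-station component as is typical for OSCEs.

```python
# Hedged sketch of a G-theory decision (D) study for a crossed p x s x r design.
# Variance components below are hypothetical, not estimates from a real OSCE.
var_p = 1.00      # person (candidate) variance
var_ps = 2.00     # person x station interaction (usually the largest error source)
var_pr = 0.05     # person x rater interaction (raters tend to agree reasonably well)
var_psr_e = 1.00  # three-way interaction confounded with residual error

def ep2(n_stations: int, n_raters: int) -> float:
    """Generalizability coefficient for relative decisions."""
    rel_error = (var_ps / n_stations
                 + var_pr / n_raters
                 + var_psr_e / (n_stations * n_raters))
    return var_p / (var_p + rel_error)

for ns, nr in [(6, 1), (6, 2), (12, 1), (12, 2), (18, 1)]:
    print(f"stations={ns:2d}, raters={nr}: Ep2 = {ep2(ns, nr):.2f}")
```

With these assumed components, doubling the number of stations raises Ep2 more than doubling the number of raters, which is the pattern van der Vleuten and Swanson describe.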


Inter-station reliability

Many studies have consistently shown that performance on one station is a poor predictor of performance on another station. Newble and Swanson7 indicated that inter-station reliability is a complex issue because it deals with two types of error: (1) the inconsistency of performance across stations and (2) inconsistency in rater agreement, both affecting the score. Other factors that have been shown to produce error in examinee scores are (1) heterogeneity of examinee group, (2) SP performance, (3) examiners, (4) scoring, and (5) integrity of the data. Another source of error in testing clinical competence involves the use of checklists versus global rating scales.

CHECKLIST VS. GLOBAL RATING SCALE

Checklists contain many items and itemize examinee performance into small units, whereas global ratings are broader and less exhaustive than checklists. Items in a checklist can range from a few to many items about the expected tasks to be performed within a station. Conversely, items in global rating forms are broader and range from a single overall item to fewer than a dozen items relating to more broad-based tasks or behaviors that are observed. The differences in reliability between checklists and global rating scales are marginal. Regehr et al. found that although there are differences between these two rating form types, the differences are attributable to chance fluctuation rather than to statistically significant effects.8 Global rating scales, though, have better validity than checklists, as they correlate better with end-of-course written examinations (concurrent validity) and are better at discriminating levels of performance (construct validity) than checklists.

GENERALIZABILITY THEORY

Generalizability theory, though underemployed, has an important place in performance-based measurement, particularly when OSCEs are employed. G theory permits the calculation of a dependability estimate and addresses how accurately a candidate's observed score can be generalized to the range of behaviors of interest. The goal is to ascertain how much of the within and between variability is accounted for by each facet (Chapter 9). Three variance components and Ep2 are usually calculated: (1) between raters (inter-rater), (2) from the same rater across performance on different stations (inter-station), and (3) across the whole test for all raters. In one study,9 the Ep2 for the two-facet nested design (SP nested into case) based on the two assessment formats was 0.84 and 0.74 (checklist and global rating, respectively—Table 13.2). The Ep2 indicated that higher reliability was achieved by checklist scores than by global scores. The latter may provide a better assessment of the coherence of clinical competency than checklists. Sources of error arise in the percent variance components for the SP and the examiner facets due to SP and examiner variability (approximately 20% of the variance). Generally, examiner inconsistency is the largest threat to the reliability of an assessment. It is therefore important to train the examiners on how to use the assessment instruments. Similarly, SP measurement error can be introduced via inconsistent and inaccurate SP performance, the choice of SP to portray the case, and/or the SP's portrayal of the case. Therefore, careful selection and training of SPs is important.

Table 13.2  The variance, percentage variance, and generalizability coefficients for the two-facet nested design (SP nested into case)

Checklist
  Candidates (p): variance 30.9609 (20.0%)
  Cases (c): variance 11.6028 (7.5%)
  SPs nested in case: variance 27.0727 (17.5%)
  p x c, res*: variance 85.0487 (55.0%)
  Total: variance 154.6851 (100%); Ep2 = 0.84
Global
  Candidates (p): variance 0.1474 (12.6%)
  Cases (c): variance 0.0162 (1.4%)
  SPs nested in case: variance 0.2698 (23.1%)
  p x c, res*: variance 0.7323 (62.8%)
  Total: variance 1.1657 (100%); Ep2 = 0.74

*The variability as a result of the interaction effect (p x c) also includes random error (res), which cannot be separated; the two are combined and labelled as the residual.
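The percentage-of-variance column in Table 13.2 is simply each estimated component divided by the total of the components. The minimal Python sketch below reproduces that column for the checklist facets reported in the table.

```python
# Sketch: percent of total variance for the checklist facets in Table 13.2.
components = {
    "Candidates (p)": 30.9609,
    "Cases (c)": 11.6028,
    "SPs nested in case": 27.0727,
    "p x c, residual": 85.0487,
}
total = sum(components.values())  # 154.6851

for facet, var in components.items():
    print(f"{facet:<20s} {var:9.4f}  {100 * var / total:5.1f}%")
print(f"{'Total':<20s} {total:9.4f}  100.0%")
```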

STANDARD SETTING

As OSCEs are used for both summative and formative purposes, it is important to determine the standards or score cut-point (such as pass/fail) that defines the level of expected performance for different groups of candidates. The absolute standard-setting method based on judgments about the items in a station, the Ebel method, is generally preferred in OSCE standard setting because it takes into account two dimensions of each item: relevance and difficulty. The Ebel method is a good approach in high-stakes exams. This method focuses on the content items on the checklists and rates items based on their difficulty and relevance, thus placing more emphasis on ensuring content validity of the checklist or rating scale itself (Table 13.3). Judges establish standards by reviewing each potential item and rating it by its difficulty relative to the level of ability of examinees and by its relative importance for the skills that must be executed to properly fulfill the case. The advantage of this method is that it is thorough and strengthens the content validity of the scoring method used within the station; however, the process of determining the specifics is time-consuming. Employing the nine-category grid in Table 13.3, each OSCE item is classified into one of the cells based on its difficulty and relevance.10


Table 13.3  Ebel procedure for setting MPLs of clinical skills, tasks, and procedures

Relevancy    Easy   Medium   Hard
Essential    EE     ME       HE
Important    EI     MI       HI
Marginal     EM     MM       HM

This method for setting MPLs has been shown to have empirical evidence of validity.11 For relevancy, three levels are used: essential, important, and marginal. Similarly, three levels are used for difficulty: easy, medium, and hard. Judges independently classify each task or clinical skill into one of the nine cells in the difficulty-by-relevancy table above. A clinical skill may be judged as easy and essential (EE), for example, while another may be judged as medium difficulty and important (MI). Through this process, an Ebel rating is given to each clinical skill, procedure, or task. The overall cutoff score for the station is the sum of the MPLs.

EXAMPLES
• Asking the patient about the intensity of pain ("On a scale of 1–10, how much does it hurt?"), for example, may be judged easy to do and important (EI) to the assessment.
• In some instances, listening to heart sounds may be judged as hard to do and marginal to the presenting problem (HM).

Judges are instructed to independently classify each item into one of the nine cells in Table 13.3. They are given the following instructions: do not agonize over each item, but provide your BEST clinical intuition of the difficulty and relevance of each skill relevant to the particular case, and write your classification (e.g., EM) in pencil on the left-hand side of each item on the checklist. The items may also be circulated to the judges electronically, such as in Excel worksheets, with the judges providing their ratings on the spreadsheet. At least two judges rate each item independently. Subsequently, these ratings are averaged to determine a weight for the item MPL.
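The bookkeeping behind this procedure can be sketched in a few lines of code. In the minimal Python sketch below, the mapping from the nine Ebel cells to expected proportions correct, the item labels, and the judges' classifications are all hypothetical placeholders (the actual weights are set by the judging panel), but the flow follows the procedure described above: average the judges' cell weights for each item, then sum the item MPLs to obtain the station cutoff.

```python
# Hedged sketch of Ebel scoring: average judges' cell classifications per item,
# then sum the item MPL weights to get the station cutoff (MPL).
# The cell-to-weight mapping and the ratings below are hypothetical.
CELL_WEIGHTS = {
    "EE": 0.9, "EI": 0.8, "EM": 0.6,   # easy items
    "ME": 0.7, "MI": 0.6, "MM": 0.4,   # medium items
    "HE": 0.5, "HI": 0.4, "HM": 0.2,   # hard items
}

# Two judges' independent classifications of three checklist items.
judge_ratings = {
    "Asks about pain intensity": ["EI", "EE"],
    "Listens to heart sounds":   ["HM", "HI"],
    "Explains CT results":       ["ME", "MI"],
}

item_mpls = {
    item: sum(CELL_WEIGHTS[cell] for cell in cells) / len(cells)
    for item, cells in judge_ratings.items()
}

station_mpl = sum(item_mpls.values())  # cutoff score for the station
for item, mpl in item_mpls.items():
    print(f"{item:<28s} MPL weight = {mpl:.2f}")
print(f"Station cutoff (sum of MPLs) = {station_mpl:.2f}")
```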

OSCE LOGISTICS

OSCEs are both time-consuming and resource-intensive. There are several theoretical and psychometric issues that arise from the use of OSCEs: test length, sample size, training time of SPs, and establishing inter-rater reliability. Additionally, administrative concerns such as test security and cost analysis play important roles in ensuring the quality of OSCEs. Test security becomes an issue particularly when dealing with large groups of candidates, such as those tested by national licensing bodies for entering practice at various stages of the educational spectrum (e.g., end of undergraduate medical education or post-graduate medical education). To manage the volume of candidates tested, whether at a local school or at a national level (i.e., MCC, NBME, National Council Licensure Examination-RN), parallel forms of tests are used to minimize the potential confounds influencing candidate performance. The major concern with repeatedly testing candidates with the same station content is that sharing of information between candidates may compromise the validity of the exams. OSCEs can be very costly, as SPs, examiners, and other personnel and administrative staff must be paid. Cost considerations can compromise psychometric criteria such as attaining high validity and reproducibility of scores. About 12 stations are required to achieve adequate reliability, particularly for total test reliability. If the stations are very well designed and executed, Ep2 coefficients > 0.70 can be achieved with as few as seven stations. These and other logistical concerns are important to consider when designing the TOS for the OSCE. Disadvantages include the extensive time commitment required to create the cases, the expense of resources needed (e.g., testing rooms), personnel costs (e.g., SPs and their trainers and physician examiners), and candidate testing time, all of which frequently compromise psychometric results.

STATISTICAL RESULTS OF THE FAMILY MEDICINE OSCE DESCRIBED IN TABLE 13.1

Fifty-one family medicine residents took the 12-station OSCE described in Table 13.1. The results of the exam are summarized in Table 13.4, which contains the overall results and the pass/fail for each of the OSCE stations. The mean number of stations passed overall was 9.45.

Table 13.4  Descriptive statistics and psychometric results of OSCE assessments

Station       Min    Max    Mean    SD     α
Station 1     11     18     14.06   2.32   0.63
Station 2     8      15     12.03   2.24   0.67
Station 3     14     24     20.68   3.07   0.60
Station 4     20     29     23.90   3.13   0.74
Station 5     15     29     23.87   4.75   0.78
Station 6     18     28     21.45   2.51   0.70
Station 7     13     16     14.84   1.21   0.73
Station 8     17     26     21.84   2.22   0.81
Station 9     19     30     23.13   3.24   0.70
Station 10    11     21     15.65   3.71   0.75
Station 11    16     24     20.26   3.26   0.68
Station 12    10     18     13.52   2.74   0.71

Overall ratings and communication
Overall examiner ratings   1.8    3.9    3.49   0.46   —
Communication              3.1    4.0    3.60   0.32   0.86

SD, Standard Deviation; α, Cronbach's alpha of internal consistency reliability.


The internal consistency reliabilities (Cronbach's alpha) are also summarized in Table 13.4. They range from 0.63 (Station 1) to 0.86 (Communication). The overall ratings on the stations and the communication scale are also summarized in Table 13.4. The G-theory analysis for this OSCE used a two-facet nested design (SP nested into case); based on the two assessment formats, Ep2 = 0.84, a value acceptable for high-stakes assessments.
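For readers who want to see how the internal-consistency figures are obtained, Cronbach's alpha can be computed directly from a station's examinee-by-item score matrix. The sketch below is a minimal illustration with simulated data; the matrix, its dimensions, and the resulting alpha are not the actual station data reported in Table 13.4.

```python
# Hedged sketch: Cronbach's alpha for one station's item-level scores.
# Rows = examinees, columns = checklist items; the data are simulated.
import numpy as np

rng = np.random.default_rng(1)
scores = rng.integers(0, 3, size=(51, 15)).astype(float)  # 51 residents, 15 items

def cronbach_alpha(x: np.ndarray) -> float:
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)        # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# With purely random (uncorrelated) items, alpha will be near zero;
# real station data with correlated items yield values like those in Table 13.4.
print(f"alpha = {cronbach_alpha(scores):.2f}")
```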

SUMMARY AND CONCLUSION

Harden et al. developed the OSCE in response to highly error-laden techniques such as chart audits, MCQs, unstructured hallway conversations, infrequent student observations, and the lack of structured protocols with which to establish rater consistency among clinical teachers. This work helped renew interest in the focus and application of the basic psychometric principles which underlie all measurement. Through standardization and proper training, SPs can accurately and reliably present case material with high ecological validity (indistinguishable between real and simulated performance). As well, SPs are comparable to physician examiners in reliably assessing the skills of examinees and providing feedback that is useful and helpful for examinees. Like all assessment tasks, OSCE development should begin with a TOS. This allows control over various testing components such as (1) ensuring that test items are specific to the objectives of real clinical tasks and are appropriate for different educational levels of examinees, (2) ensuring an adequate number of stations, content areas, and objectivity in rating performance, and (3) choosing the Ebel method of setting standards. Major limitations to designing and implementing OSCEs are practical issues such as SP training, test security, and costs. Confounding variables such as order effects, and error due to inadequate sampling of stations because of a lack of resources, can compromise reliability and validity. Although these are practical (as opposed to theoretical) aspects of the testing design, they can have a profound impact on the outcomes of the testing approach if not properly controlled and if they are incongruent with the goals and objectives of the OSCE. The challenges to using OSCEs include threats to basic measurement criteria such as inter-rater and inter-station reliability and aspects of validity. OSCEs can measure clinical skills, such as physical exams, history taking, communication, and technical procedures, in a uniform and systematic manner. With careful construction, training of both assessors and SPs, and controlled application, OSCEs can adequately and accurately capture important elements of clinical competence that are fundamental to medical and healthcare practice.

• The OSCE concentrates on skills, clinical reasoning, and attitudes; to a lesser degree, basic knowledge. Content checklists and global rating scales are used to assess observed performance of specific tasks.
• Most OSCEs utilize SPs.
• OSCE validity concerns are (1) content based, (2) concurrent based, and (3) construct based—differentiating the performance levels between groups of examinees at different points in their education.
• OSCE reliability concerns include how many raters are needed, which people make the best raters, and what the appropriate number of stations is.
• Cutoff scores (such as pass/fail) can be set with the Ebel method.
• OSCEs are both time-consuming and resource-intensive.
• The OSCE has been adopted worldwide notwithstanding the paucity of empirical validity evidence.

BOX 13.1: Advocatus Diaboli: The trouble with OSCEs

Most assessments in medical and health sciences education lack evidence of construct validity. This includes the OSCE, a performance-based assessment. It is second in popularity and use only to MCQ formats. The use of OSCE assessment methods has increased rapidly in medical, nursing, dental, and other health education worldwide. The major advantages of the OSCE format include controlling the complexity of the station based on the skill level of the candidates, the ability to sample a wide range of knowledge and skills, and clearly defining the knowledge, skills, and attitudes to be assessed. The disadvantages include the extensive time commitment required to create the cases, the expense in resources needed (e.g., testing rooms), and personnel costs (e.g., SPs and their trainers; physician examiners). Notwithstanding this widespread use, there is a need for further construct validity evidence for the use of the OSCE assessment format for assessing professionalism and competence, particularly for high-stakes outcomes (e.g., licensing or certification).

Construct validity focuses on whether a test successfully measures the characteristics it is designed to assess and whether the interpretation of the test score is meaningful. In order for an assessment to be valid, it must be reliable. For measurements of performance to produce evidence of reliability, there must be a sufficient sample of observations, and these observations should be gathered with a degree of structure and standardization. Few studies have investigated the empirical evidence of validity of OSCEs. Construct validity of an assessment method encompasses all other forms of validity and requires even more extensive research, such as the known-group-difference method, the MTMM approach, and exploratory and confirmatory factor analyses. Although rare, some studies have directly investigated the construct validity of OSCEs. Blaskiewicz et al.12 utilized multiple regression analysis to study the relationship between psychiatry OSCE scores from the clinical skills examination, an obstetrics/gynecology OSCE, and the NBME psychiatry subject examination. They found that the pattern and magnitude of convergence and discrimination of scores were indicative of inadequate construct validity for both the psychiatry checklist scores and global scores. Baig et al.13 employed a MTMM to study the construct validity of clinical competence—including aspects assessed by an OSCE—and identified both method and trait variance. With few exceptions, however, there is a scarcity of studies directly investigating the construct validity of measurements of clinical competence, particularly OSCEs.14 In a predictive validity study of a high-stakes OSCE used to select candidates for a 3-month clinical rotation to assess practice-readiness status, Vallevand and Violato9 provided evidence of predictive validity with a 100% correct classification rate against the pass/fail rotation results. Notwithstanding these examples, insufficient systematic research has been conducted to investigate the predictive and construct validity of OSCEs.

The principal function of any assessment protocol is to provide inferences about the competency, abilities, or traits of the candidates—inferences that extend beyond the sample of cases or stations included in the examination. Regardless of an assessment's intention (e.g., formative vs. summative examination), the assessment process must be reliable, valid, defensible, and feasible. The OSCE has been adopted worldwide with very little scrutiny and study. It is so popular that it has even made an appearance on the television show Seinfeld, although the depiction is not a good model of how to conduct the exam.15 There is insufficient systematic evidence of empirical validity for the OSCE. Sample sizes are frequently too small for running the complex statistical procedures, such as generalizability, factor, or discriminant analyses, required to establish evidence of construct validity. It is also typical to have only one rater per station, which compromises the determination of inter-rater reliability. Future research should be designed for Ep2 analyses with fully crossed designs. More work needs to be done on the empirical evidence (e.g., criterion-related and construct) for the OSCE, since this aspect has been neglected. The focus in past research has been primarily on face and content validity, and the OSCE has been accepted more or less uncritically.

REFLECTIONS AND EXERCISES

Reflections
1. Notwithstanding the scant validity evidence, the OSCE has gained worldwide acceptance and use. Critically evaluate the reasons for the ascendancy of the OSCE. (Maximum 500 word answer)
2. Compare and contrast the efficacy of various assessors in the OSCE (e.g., SPs, physicians, undergraduate students, etc.). What are the important characteristics of the assessors for improving the reliability and validity of the OSCE? (Maximum 500 word answer)
3. Design an ideal study to investigate the construct validity of the OSCE. (Maximum 500 word answer)

Exercises
1. Using the template in Table 13.1, develop a table of specifications for an eight-station OSCE in any area of your interest.
2. What major psychometric analyses would you conduct in an OSCE? (hint: descriptive statistics, reliability, validity, item analyses, etc.)
3. For a ten-station OSCE to assess 150 students, how would you design a generalizability (Ep2) study to address variance due to students, assessors, stations, etc.? Consider maximum efficiency and cost constraints.
4. Develop a point-by-point plan for adequately training SPs in an OSCE.

REFERENCES
1. Harden R, Gleeson F. Assessment of clinical competence using an objective structured clinical examination (OSCE). Medical Education, 1979; 13: 41–54.
2. Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine, 1990; 65: s63–s67.
3. Harden RM. Twelve tips for organizing an objective structured clinical examination (OSCE). Medical Teacher, 1990; 12: 259–264.
4. Tamblyn RM. Use of standardized patients in the assessment of medical practice. Canadian Medical Association Journal, 1998; 158: 205–207.
5. Stillman P, Swanson D, Regan MB, Philbin MM, Nelson V, Ebert T, Ley B, Parrino T, Shorey J, Stillman A, Hatem C, Kizirian J, Kopelman R, Levenson D, Levinson G, McCue J, Pohl H, Schiffman F, Schwartz J, Thane M, Wolf M. Assessment of clinical skills of residents utilizing standardized patients. Annals of Internal Medicine, 1991; 114: 393–401.
6. van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardized patients: state of the art. Teaching and Learning in Medicine, 1990; 2: 58–76.
7. Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Medical Education, 1988; 22: 325–334.
8. Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Academic Medicine, 1998; 73: 993–997.
9. Vallevand A, Violato C. A predictive and construct validity study of a high-stakes objective clinical examination for assessing the clinical competence of international medical graduates. Teaching and Learning in Medicine, 2012; 24: 168–176.
10. Ebel RL, Frisbee DA. Essentials of Educational Measurement (4th ed). Toronto, ON: Prentice Hall, 1986.
11. Violato C, Marini A, Lee C. A validity study of expert judgment procedures for setting cutoff scores on high-stakes credentialing examinations using cluster analysis. Evaluation & the Health Professions, 2003; 26: 59–72.
12. Blaskiewicz RJ, Park RS, Chibnall JT, Powell JK. The influence of testing context and clinical rotation order on students' OSCE performance. Academic Medicine, 2004; 79(6): 597–601.
13. Baig L, Violato C, Crutcher R. A construct validity study of clinical competence: A multitrait multimethod matrix approach. Journal of Continuing Education in the Health Professions, 2010; 30(1): 19–25.
14. Patricio M, Juliao M, Fareleira F, Young M, Norman G, Vaz Carneiro A. A comprehensive checklist for reporting the use of OSCEs. Medical Teacher, 2009; 31(2): 112–124.
15. www.youtube.com/watch?v=wLRlbsBJhio

14 Checklists, rating scales, rubrics, and questionnaires

ADVANCED ORGANIZERS
• Checklists, rating scales, rubrics, and questionnaires are instruments that state specific criteria to gather information and to make judgments about what students, trainees, and instructors know and can do in outcomes. They offer systematic ways of collecting data about specific behaviors, knowledge, skills, and attitudes and can be used at higher levels of Miller's Pyramid.
• A checklist is an instrument for identifying the presence or absence of knowledge, skills, behaviors, or attitudes, used for identifying whether tasks in a procedure, process, or activity have been done.
• Surgery has made extensive use of checklists, as have many other disciplines such as oncology and cardiology.
• A rating scale is an instrument used for assessing the performance of tasks, skill levels, procedures, processes, and products, but it indicates the degree of behavior rather than the dichotomous judgment of a checklist.
• Rating scales allow the degree or frequency of the behaviors, skills, and strategies to be rated. Rating scales state the criteria and provide selections to describe the quality or frequency of the performance. A five-point scale is preferred because it provides the best all-around psychometrics.
• There are four basic types of rating scales: (1) numeric rating scale, (2) graphic rating scale, (3) descriptive graphic rating scale, and (4) visual analogue scale.
• A rubric is a scoring guide used to evaluate the quality of constructed responses. Rubrics employ specific criteria to evaluate performance. They consist of a fixed measurement scale and detailed descriptions, which focus on the quality of performance with the intention of including the result in a grade.



• Questionnaires are instruments consisting of a series of questions (or prompts) for the purpose of gathering information from respondents. A main advantage of questionnaires is that a great deal of information can be gathered effectively and at little cost. Unlike telephone surveys, they require much less effort and often have standardized answers that make it simple to compile data. A disadvantage of standardized answers is that they do not permit alternative responses.

INTRODUCTION

Checklists, rating scales, rubrics, and questionnaires are instruments that state specific criteria to gather information and to make judgments about what students, trainees, and instructors know and can do in outcomes. They offer systematic ways of collecting data about specific behaviors, knowledge, skills, and attitudes and can be used at higher levels of Miller's Pyramid. The purpose of checklists, rating scales, rubrics, and questionnaires is to devise instruments for:
• systematic recording of observations
• self-assessment
• determining criteria for learners prior to collecting and evaluating data on their work
• recording the development of specific skills, strategies, attitudes, and behaviors that demonstrate learning
• clarifying learners' instructional needs by recording current accomplishments

WHAT IS A CHECKLIST?

A checklist is an instrument for identifying the presence or absence of knowledge, skills, behaviors, or attitudes. Checklists are used for identifying whether tasks in a procedure, process, or activity have been done. The activity may be a sequence of steps and include items to verify that the correct sequence was followed. The assessor usually observes the activity because it cannot be judged from the end product. Some attitudes, like showing respect for patients, can only be observed indirectly. A checklist itemizes task descriptions and provides a space beside each item to check off the completion of the task.

CHARACTERISTICS OF GOOD CHECKLISTS

Checklists should have the following characteristics:
• space for other information such as the trainees' name, date, case, assessor
• clear criteria for success based on outcomes
• tasks organized into logical sections or flow from start to finish
• highlight critical tasks
• clear with detailed wording to avoid confusion
• brief and practical (e.g., one sheet of paper)


Table 14.1 is a checklist to be used for an OSCE station with an SP portraying a patient with lung cancer. The assessor observes the trainee and checks off the behaviors as "attempted" or "done satisfactorily." If the performance was not attempted, the checks are left blank. The assessor will check item 2 "What patient has been taking for pain" and item 9 "Informs patient of results of CT scan," for example, as "attempted" or "done satisfactorily," or leave them blank. Table 14.1 has all of the characteristics of a good checklist: clear criteria for performance, one page, organized into logical sections and flow, clear with detailed wording, and spaces to record the trainee's name, assessor, date, etc.

Table 14.1  Checklist for a station of a patient with lung cancer
Candidate__________  Assessor________  Date_________  Time: ___begin ____end
(Each item is checked as "Attempted" or "Done satisfactorily.")

History of presenting complaint—enquires about
1. Other symptoms (nausea, headache)
2. What patient has been taking for pain
3. Further treatment after surgery
4. Relationship status
5. Nature of relationships with wife and children
6. Employment status
7. Asks about patient's feelings, displays empathy
8. Asks about coping mechanisms after initial diagnosis

Information sharing
9. Informs patient of results of CT scan
10. Indicates that tumors are metastases from lung cancer
11. Indicates that there is no cure
12. Reassures patient that failing to stop smoking did NOT cause cancer to spread
13. Indicates will make appointment with a specialist to discuss palliative treatment options
14. Inquires how patient will be getting home
15. Arranges follow-up with family members

Questions to be asked by examiner (at 10 min)
16. What sort of follow-up would you arrange for this patient?
17. Would follow-up regularly
18. Demonstrates knowledge of other community resources
19. How would you respond if the patient asks about alternative (non-conventional) treatments?
20. Ask which treatment in particular the patient was thinking of using and why
21. Reflect back patient's seeking alternative treatment as a reflection of patient's desperation and offer to work with patient to explore all options

Surgery has made extensive use of checklists, as they have been promoted in an attempt to prevent mistakes related to surgery. Checklists have been widely adopted, not only in Western countries but throughout the world, for increasing patient safety.1 They have been associated with improved health outcomes, including decreased surgical complications and surgical site infections. Surgical complications represent a significant cause of morbidity and mortality, with the rate of major complications after in-patient surgery estimated at 3–17% in industrialized countries. The World Health Organization (WHO) has developed a surgical safety checklist (Table 14.2) that has been widely disseminated and adopted. This checklist was explicitly designed for varying cultural and geographic contexts with a very clear and simple presentation. It contains most of the elements of good checklists: clear criteria for performance, one page, organized into logical sections and flow, clear with detailed wording. The statements are primarily framed in the interrogative as questions to which the answer must be either "yes" or "no." Some of the important questions, for example, are: "3. Is the anesthesia machine and medication checked completely?" and "9. Has antibiotic prophylaxis been given within the last 60 min?"

Table 14.2  World Health Organization surgical safety checklist2
1. Has the patient confirmed his/her identity, site, procedure, and consent?
2. Is the site marked?
3. Is the anesthesia machine and medication checked completely?
4. Is the pulse oximeter on the patient functioning?
5. Does the patient have a:
   • Known allergy?
   • Difficult airway or aspiration risk?
   • Yes, and equipment/assistance available
6. Risk of >500 ml blood loss (7 ml/kg in children)? Yes, and two IVs/central access and fluids planned
7. Confirm all team members have introduced themselves by name and role
8. Confirm the patient's name, procedure, and where the incision will be made
9. Has antibiotic prophylaxis been given within the last 60 min?
10. Anticipated critical events
11. To Surgeon:
   • What are the critical or non-routine steps?
   • How long will the case take?
   • What is the anticipated blood loss?
12. To Anesthetist:
   • Are there any patient-specific concerns?
13. To Nursing Team:
   • Has sterility (including indicator results) been confirmed?
   • Are there equipment issues or any concerns?
   • Is essential imaging displayed?
14. Nurse Verbally Confirms:
   • The name of the procedure
   • Completion of instrument, sponge, and needle counts
   • Specimen labelling (read specimen labels aloud, including patient name)
   • Whether there are any equipment problems to be addressed
15. To Surgeon, Anesthetist, and Nurse:
   • What are the key concerns for recovery and management of this patient?
Source: http://apps.who.int/iris/bitstream/10665/44186/2/9789241598590_eng_Checklist.pdf.

Surgical checklists are a simple and promising way of addressing surgical patient safety worldwide. Further research may be able to determine to what degree checklists improve clinical outcomes and whether improvements may be more pronounced in some settings than in others.

Another area in which checklists have been applied is scoring an electrocardiogram (ECG) examination protocol (Table 14.3). Cardiologist candidates who are applying for board certification are given a number of ECG examples and are asked to read them and record their answers on a 50-item answer sheet. The checklist in Table 14.3 is used by the assessor to check each item that the candidate recorded on their answer sheet.

Scoring and candidates' score

Table 14.4 summarizes the protocol for scoring each item together with the pertinent Ebel score, the item value, the score from each examiner (for an example candidate), and the average score for each item. The total score for each candidate is the sum of the average scores, and the overall MPL is the sum of the Ebel scores for each item. If a candidate's score is equal to or greater than the MPL, it is a "pass"; if less, it is a "fail." A total of 150 candidates wrote this test. Three examiners independently scored each protocol and the results were averaged over the three examiners (Table 14.4).


Table 14.3  Checklist for scoring ECG protocol (Check each item that was written by the candidates on the answer sheet)

Items 1–11: Sinus/Atrial/Supraventricular; PVC (± unifocal); Arrythmia—sinus/other atrial; Sinus Bradycardia; Anterior/Anteroseptal/Anterolateral; Recent/Acute; Sinus; 1° AV Block; LBBB; Junctional (supraventricular); Atrial (sinus) Bradycardia; Occasional atrial capture of ventricles; Anterior ischemia or definite repolarization changes; Pacemaker; LBBB (not required); Dual Chamber (DDD); Sinus/Atrial Beats; 1° AV Block; Sinus; PACs; LAH/LAD; Possible ant MI/Poor R wave prog; Repolarization changes; Sinus; 1° AV Block; RBBB; Long QT; Antero… MI; Sinus; Inferior MI; Left atrial abnormality; Atrial fibrillation; Rapid ventricular response; Nonspecific ST/T changes; Atrial Flutter; Atrial Fib-Flutter; 2:1 AV conduction; Inferior MI; Sinus; Long QT; Anterior or lateral T-wave changes

12. Pacemaker VVI (ventricular); 3° AV Block; Sinus; Atrial fibrillation
13. 3° AV Block; Junctional rhythm; ST&T changes/repolarization changes; Sinus
14. (Consider) Pericarditis; Inferior infarct (probable, possible, etc.); Ventricular rhythm
15. Tachycardia; Sinus
16. RBBB; Left Anterior Hemiblock/Left Axis Deviation; 1° AV Block; Long QT; Anterior MI; Sinus
17. RBBB; LAH/LAD; Anterior MI; PAC; Sinus
18. Supraventricular tachycardia; Junctional rhythm
19. Accelerated; Inferior MI; Posterior MI; Sinus
20. Second-degree AV Block Mobitz II; RBBB; Acute Ant MI; Left anterior hemiblock/LAD; Sinus
21. Inferior MI; Posterior MI; Lateral MI; Non-acute MI (or equivalent words)

Items 22–35: Sinus tachycardia; Inferior MI; Acute MI; Posterior MI; Sinus; Non-specific T-wave changes; Limb lead reversal (malposition); Non-sinus supraventricular rhythm (or equivalent words); Right ventric. conduction delay (or IRBBB or RV hypertrophy); Sinus; WPW; Sinus; Pericarditis (± consider); DELETED; Sinus; Left bundle branch block; Septal MI; Sinus; 2° AV Block (high grade AV Block) (2:1 AV Block); RAD; RBBB; Atrial fibrillation; Actual or controlled ventricular rate; Non-specific ST&T changes/repolarization changes; Sinus; Inferior MI; Acute or recent or indeterminate; Sinus; Left ventricular hypertrophy; Left atrial abnormality; Sinus; Lead reversal (malposition); Left atrial abnormality; Atrial fibrillation; Fast (uncontrolled) ventricular rate; RBBB; Sinus tachycardia; Left bundle branch block (possible); Anterior MI; Left atrial abnormality

Items 36–47: Normal (or no abnormality mentioned); Sinus (or just normal), i.e., Normal, sinus rhythm, or both; Atrial rhythm; Sinus bradycardia; Sinus; 3° or complete AV block; "Idioventricular rhythm" or "junctional rhythm with LBBB"; Sinus; 2° AV block Mobitz I; Acute or recent inferior MI; Acute or recent posterior MI; Atrial tachycardia or atrial flutter; 2:1 AV block; Anteroseptal MI (or anterior or septal); Non-diagnostic ST&T changes; Sinus bradycardia; 1° AV Block; Blocked PACs; Non-specific ST-T changes; Supraventricular tachycardia (or PAT); Non-specific ST-T changes; Sinus or sinus tachycardia; High grade/(2:1)/2° AV Block; RBBB; Non-acute anteroseptal or anterior MI; Non-acute inferior MI; DELETE; Sinus arrhythmia or PACs or sinus exit block; Acute inferior MI; Posterior MI; Atrial flutter; Inferior MI; Posterior MI; Possible acute/Recent MI; Sinus; Artifact; Acute anterior MI; Left atrial abnormality

Items 48–50: Sinus; Lead reversal (malposition); Acute anterior MI; Sinus; Right bundle branch block; Acute anteroseptal MI; Non-acute inferior MI; LAD/LAH; Sinus; WPW

Table 14.4  Selected results (items 1 and 2) of the ECG scoring protocol for candidate #2

Item #  Finding                                Ebel score  Item value  Examiner 1  Examiner 2  Examiner 3  Average score
1       Sinus/Atrial/Supraventricular             0.9          1           0           0           0           0
1       PVC (± unifocal)                          0.9          1           1           1           1           1
1       Arrhythmia—sinus/other atrial             0.7          1           0           0           0           0
2       Sinus                                     0.9          1           1           1           1           1
2       Bradycardia                               0.5          1           1           1           0           0.67
2       Anterior/Anteroseptal/Anterolateral       0.9          1           1           1           1           1
2       Recent/Acute                              0.9          1           1           1           1           1

Eρ2 (examiners) = 0.97.

crossed, one-facet design) of the results showed that the scores had very high dependability across the three examiners: variance component (rater) = 4340.642; variance component (residual) = 96.154; Eρ2 = 4340.642/(4340.642 + 96.154) = 4340.642/4436.796 = 0.97.
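The scoring arithmetic and the dependability coefficient just described can be sketched in a few lines of Python. The Ebel scores and examiner ratings below are the seven findings of items 1 and 2 shown in Table 14.4, and the variance components are those reported above; this is an illustration of the logic, not the software used for the examination.

# Sketch of the scoring logic: average each finding over the three examiners,
# sum the averages for the candidate's score, and compare it with the MPL
# (the sum of the Ebel scores). Data are the findings shown in Table 14.4.
ebel_scores = [0.9, 0.9, 0.7, 0.9, 0.5, 0.9, 0.9]
examiner_ratings = [
    [0, 0, 0],   # finding not credited by any examiner
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],   # credited by two of the three examiners -> average 0.67
    [1, 1, 1],
    [1, 1, 1],
]

item_averages = [sum(r) / len(r) for r in examiner_ratings]
candidate_score = sum(item_averages)   # about 4.67 on these two items alone
mpl = sum(ebel_scores)                 # 5.7 on these two items alone
# The real decision sums all 50 items; here only items 1-2 are illustrated.
print("pass" if candidate_score >= mpl else "fail")

# Dependability across examiners from the reported variance components.
var_rater = 4340.642      # variance component labelled "rater" in the text
var_residual = 96.154     # residual variance component
e_rho_sq = var_rater / (var_rater + var_residual)
print(round(e_rho_sq, 3))  # 0.978, reported as 0.97 above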

WHAT IS A RATING SCALE?

A rating scale is an instrument used for assessing the performance of tasks, skill levels, procedures, processes, and products. Rating scales are similar to checklists except that they indicate the degree of a behavior rather than a dichotomous judgment such as “yes” or “no” (Table 14.2). They are composed of a list of performance


statements and behaviors in descriptive words, usually with numbers. Two items from a rating scale of teacher effectiveness, for example, are:

1. Encourages students to participate in discussions

1 = never 2 = rarely 3 = sometimes 4 = frequently 5 = consistently

All rating scales can be classified into one of three types.

Numeric rating scale (examples are Items #1 above and #2 below)

2. Stimulates students to bring up problems

1 = never 2 = rarely 3 = sometimes 4 = frequently 5 = consistently

Graphic-rating scale (an example is below)

Rate your colleague based on the quality of work, neatness, and accuracy.

1. Poor: careless worker that repeats similar mistakes
2. Average: work is sometimes unsatisfactory due to messiness
3. Good: work is acceptable without many errors
4. Very Good: reliable worker producing good quality work
5. Excellent: work is of high quality; errors are rare and there is little wasted effort

Descriptive graphic-rating scale

This scale is intended to be simple and self-explanatory. It can be easily used with children and impaired respondents (Figure 14.1).

Visual analogue scale (VAS)

This scale is continuous (or analogue), unlike discrete scales such as the Likert scale. Visual analogue scales may give more precise values and superior metrics than discrete scales; thus they produce a true continuous ratio

Figure 14.1  A common example of a descriptive graphic-rating scale.


Figure 14.2  An example of a visual analogue scale.

scale (Figure 14.2). When this scale is presented electronically on modern computers, the respondent moves the indicator up and down the scale by dragging it with the mouse, and a value (e.g., 3.65) is calculated automatically. The low-tech version above shows how the value can be calculated from a pencil-and-paper administration. These ratings form the scale and can indicate a range of feeling (no pain to maximum possible pain) or of performance, such as poor to excellent, never to always, beginning to exemplary, never to consistently, or strongly disagree (SD) to strongly agree (SA). The last scale, SD to SA (Item #1 above), is a Likert scale. Any scale that is not dichotomous (i.e., with two options such as yes or no) is generally referred to as a multi-point scale. A common error is to refer to all five-point scales as Likert scales; strictly, the term refers only to SD-to-SA scales. How many points should a scale have? It is common to see scales ranging from 3 to 12 points. Considerable research has demonstrated that the five-point scale is generally preferred because it provides the best all-around psychometrics. McKelvie found in some experiments that the five-point scale was most reliable, and confidence judgments using continuous scales indicated that respondents were operating with five categories. Other evidence suggested that while there is no psychometric advantage in a large number of scale categories (greater than 9–12), there may be a loss of discriminant power and validity with fewer than five.3 The default, therefore, is five points.
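The conversion of a VAS mark to a numeric value is simple rescaling, as the short sketch below illustrates. It is an assumption-laden example rather than any particular instrument's software: the 100 mm line length, the 0–10 range, and the pixel width are assumed for illustration.

# Illustrative only: rescale the measured position of a VAS mark (or slider)
# to a value on the instrument's range.
def vas_value(mark_position, line_length=100.0, scale_max=10.0):
    """Rescale a mark's distance from the left anchor to a VAS value."""
    if not 0 <= mark_position <= line_length:
        raise ValueError("mark must lie on the line")
    return round(mark_position / line_length * scale_max, 2)

# Pencil-and-paper example: a mark measured 36.5 mm along a 100 mm line.
print(vas_value(36.5))                   # 3.65
# Electronic example: a slider dragged to pixel 292 on an 800-pixel track.
print(vas_value(292, line_length=800))   # 3.65, the kind of value mentioned above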

Characteristics of rating scales

Rating scales should have clearly defined, operationalized, detailed statements to be rated. On the teacher effectiveness rating scale (Table 14.5), for example, #7 is “Listens attentively to students.” This is a clearly defined statement. The range of numbers should always increase or always decrease. If the statements are positive (e.g., #8 is respectful toward students), then the highest


Table 14.5  A rating scale for teacher effectiveness in a clinical environment

Factor columns: student-centered; teaching skills; feedback; learning objectives; evaluation; motivation

1. Encourages students to participate in discussions
2. Stimulates students to bring up problems
3. Keeps to teaching goals; avoids digressions
4. Prepares well for teaching presentations
5. Teaches on ward rounds, at clinics, and in the OR
6. Covers all the topics which are in the curriculum
7. Listens attentively to students
8. Is respectful toward students
9. Is available regularly for the students
10. Is easily approachable for discussions
11. States learning goals clearly
12. Prioritizes learning goals and topics
13. Debriefs the learning goals periodically
14. Evaluates student's specialized knowledge
15. Evaluates student's analytical abilities
16. Evaluates student's application of knowledge to specific patients
17. Evaluates student's medical skills regularly
18. Evaluates student's communication and professionalism skills
19. Regularly gives constructive feedback
20. Explains why students are incorrect
21. Offers suggestions for improvement
22. Gives students a chance to reflect on feedback
23. Motivates students to study more
24. Stimulates students to keep up with the literature
25. Motivates students to learn independently

Factor loadings reported in the table, in order of appearance: 0.416, 0.674, 0.654, 0.653, 0.600, 0.447, 0.569, 0.494, 0.523, 0.713, 0.790, 0.699, 0.771, 0.697, 0.675, 0.689, 0.653, 0.676, 0.657, 0.500, 0.554, 0.536, 0.497, 0.648, 0.609, 0.663, 0.650, 0.671, 0.651, 0.699.

Response scale: 1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree.

number (5) should be the positive term (“strongly agree” or “always”), with the lowest number (1) the negative term (“strongly disagree” or “never”). The characteristics and descriptors listed should be clear, specific, and observable. Table 14.6 contains an adapted version of the mini-CEX used to directly assess clerkship students’ medical competence. The current version of the mini-CEX is an eight-item, global rating scale (1–5) designed to evaluate residents’ patient


Table 14.6  Rating scale (modified mini-CEX assessment form)

Assessor ___________________  Student _____________________  Date _____________
Patient Problem/Dx(s) _____________________  Patient Sex: M  F  Setting (I/O) ________  Patient Age ___
ENCOUNTER COMPLEXITY: ___ Low  ___ Moderate  ___ High

Rating scale for each dimension: N1, N2 = not yet meets criteria; M3 = meets criteria; E4, E5 = exceeds criteria; NO = not observed

1. Communication skills
2. Medical interviewing skills
3. Physical examination skills
4. Professionalism/humanistic qualities
5. Clinical reasoning
6. Management planning
7. Organization/efficacy of encounter
8. Oral presentation
9. Overall clinical competence

Mini-CEX time: Observing ___ min  Assessor providing feedback to student ___ min
Comments on Student's Performance: Areas of Strength / Areas for Improvement
________________              _________________
Student Signature             Assessor Signature

encounters, followed by open-ended comments. After a 15–20 min observation, an attending physician provides scale ratings on nine dimensions. This mini-CEX has been programmed for use on mobile devices (e.g., phones, iPads, laptops) with voice-activated input. The data can be uploaded automatically to a website by activating the “submit” button. Assessors can also record the patient’s problem, age, sex, setting (inpatient/outpatient), and encounter complexity (low, moderate, high). Ratings range from 1 to 5, where 1 = not close to meeting criterion; 2 = not yet meets criterion; 3 = meets criterion; 4 = just exceeds criterion; 5 = well exceeds criterion.
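A completed mini-CEX form of this kind is essentially a small structured record. The Python sketch below is purely illustrative of how such a record might be assembled and submitted from a mobile device; the field names, example values, and endpoint URL are assumptions, not the actual system described here.

# Illustrative only: a hypothetical mini-CEX record prepared for submission.
import json
from urllib import request

mini_cex_record = {
    "assessor": "Dr. A. Assessor",           # hypothetical names
    "student": "J. Student",
    "date": "2019-05-14",
    "patient": {"problem": "chest pain", "age": 58, "sex": "F", "setting": "inpatient"},
    "encounter_complexity": "moderate",
    "ratings": {                              # 1-5 scale; "NO" = not observed
        "communication_skills": 4,
        "medical_interviewing_skills": 3,
        "physical_examination_skills": "NO",
        "professionalism": 5,
        "clinical_reasoning": 4,
        "management_planning": 3,
        "organization_efficacy": 4,
        "oral_presentation": 4,
        "overall_clinical_competence": 4,
    },
    "observation_minutes": 20,
    "feedback_minutes": 5,
    "comments": {"strengths": "...", "areas_for_improvement": "..."},
}

payload = json.dumps(mini_cex_record).encode("utf-8")
req = request.Request("https://example.org/minicex/submit",   # hypothetical URL
                      data=payload, headers={"Content-Type": "application/json"})
# request.urlopen(req)  # uncomment to actually send the record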


Rating scales allow instructors or students to indicate the degree or frequency of the behaviors, skills, and strategies displayed by the person rated. Rating scales state the criteria and provide selections to describe the quality or frequency of the performance. The descriptive word is very important; the more precise the words for each scale point, the more reliable the scale. Effective rating scales use descriptors with clearly understood measures, such as frequency (e.g., never to always). Subjective descriptors of quality (e.g., fair, good, or excellent) introduce ambiguity because the adjective requires some subjective judgment.

The 25-item clinical-teacher-effectiveness scale (Table 14.5) was completed by clerkship students for 585 clinical rotations. The data were factor analyzed with principal components extraction and varimax rotation, resulting in six factors (student-centered, teaching skills, feedback, learning objectives, evaluation, and motivation). These six factors accounted for nearly 90% (89.3%) of the variance. It is evident that, in the clinical environment, students can distinguish the components of teaching effectiveness. These factors are cohesive and theoretically meaningful.
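A sketch of this type of analysis is shown below. It assumes the third-party factor_analyzer package and uses randomly generated ratings as a stand-in for the actual 585 evaluations, so the loadings it produces are not those in Table 14.5; it only illustrates the principal components extraction with varimax rotation described above.

# Illustrative sketch of principal components extraction with varimax rotation.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Stand-in data: 585 rotation evaluations x 25 items rated on a 1-5 scale.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(rng.integers(1, 6, size=(585, 25)),
                       columns=[f"item_{i}" for i in range(1, 26)])

# Six factors, principal components extraction, varimax rotation.
fa = FactorAnalyzer(n_factors=6, method="principal", rotation="varimax")
fa.fit(ratings)

loadings = pd.DataFrame(fa.loadings_, index=ratings.columns)
variance, proportion, cumulative = fa.get_factor_variance()
print(loadings.round(3))                        # item-by-factor loadings
print("Cumulative variance explained:", cumulative.round(3))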

WHAT IS A RUBRIC?

A rubric is a scoring guide used to evaluate the quality of constructed responses. Rubrics are usually presented in table format and can be used by assessors when grading and by students when planning their work. Rubrics employ specific criteria to evaluate performance. They may be used to assess individuals or groups and, as with rating scales, results may be compared over time. A rubric consists of a fixed measurement scale and a detailed description of the characteristics for each level of performance. These descriptions focus on the quality of performance with the intention of including the result in a grade. Rubrics can increase the consistency and reliability of scoring.

Developing rubrics

The scoring criteria for a rubric communicate expectations for each level of performance and are frequently used to delineate consistent standards for grading. Scoring criteria allow assessors and trainees alike to evaluate work against criteria that can be intricate and subjective. A rubric can also support self-assessment, reflection, and peer review. This integration of performance and feedback provides formative as well as summative assessment. Rubrics are increasingly recognized as a way both to assess learning effectively and to communicate expectations directly, clearly, and concisely to students. The inclusion of rubrics provides opportunities to consider what demonstrations of learning look like and to describe stages in the development and growth of knowledge, comprehension, and application. To be most effective, rubrics should allow students to see the progression of mastery in the development of comprehension and skills. Table 14.7 contains a rubric to assess participation and discussion performance in case-based small group learning.

Table 14.7  Rubric to assess participation and discussion performance in case-based small group learning

Conduct
  Proficient: Shows respect for members of the group. Does not dominate discussion; challenges ideas respectfully.
  Competent: Shows respect for members of the group. Sometimes has difficulty accepting challenges to ideas or maintaining respect.
  Not yet competent: Shows little respect for the group. Sometimes resorts to ad hominem attacks when disagreeing.
  Unacceptable: Shows lack of respect for members of the group. Argues or is dismissive; resorts to ad hominem attacks.

Leadership
  Proficient: Takes responsibility for maintaining the discussion whenever needed. Gives constructive feedback.
  Competent: Will take on responsibility for maintaining the discussion. Sometimes encourages others to participate.
  Not yet competent: Rarely takes an active role in maintaining the discussion. Constrains or biases the discussion.
  Unacceptable: Does not play an active role in maintaining the discussion, or undermines efforts to facilitate discussion.

Reasoning
  Proficient: Arguments are reasonable and supported with evidence. Provides analysis of complex ideas for the inquiry and furthers the conversation.
  Competent: Arguments are reasonable and mostly supported by evidence. Comments and ideas contribute to understanding of the material and concepts.
  Not yet competent: Contributions are often based on opinion or unclear views. Comments suggest a difficulty in following complex lines of argument.
  Unacceptable: Comments are frequently illogical or without substantiation. Resorts to ad hominem attacks on the author instead.

Listening
  Proficient: Always actively attends to what others say, as evidenced by regularly building on others' comments.
  Competent: Usually listens well and takes steps to check comprehension by asking clarifying and probing questions.
  Not yet competent: Does not regularly listen well, as indicated by the repetition of comments or questions; frequent non sequiturs.
  Unacceptable: Fails to listen to or attend the discussion, as indicated by repetition of comments or questions; non sequiturs; off-task.

Preparation
  Proficient: Has carefully read and understood the readings. Comes prepared with questions and critiques of the readings.
  Competent: Has read and understood the readings, though some interpretations are questionable. Comes prepared with questions.
  Not yet competent: Has read the material but didn't read or think carefully about it. Inconsistent commitment to preparation.
  Unacceptable: Doesn't understand or interpret the material; comes unprepared; can't answer basic questions or join the discussion.


To develop a rubric, the quality of performance based on the learning outcomes needs to be defined. Examples of high and low performance should be used to demonstrate what excellent or acceptable achievement is; this provides a model of high-quality work for learners. Once the standard is established, exemplary levels and less-than-satisfactory levels of performance can be articulated. Like rating scales, rubrics should have four- to five-point descriptive levels to allow for discrimination in the evaluation of the task or performance. Table 14.7 contains several components or traits of performance: conduct, leadership, reasoning, listening, and preparation. Each participant is rated in one of four categories on each component: proficient, competent, not yet competent, and unacceptable. Each cell contains a description of participant behavior that exemplifies that rating. For a proficient rating on conduct, for example, the student “Shows respect for members of the group; does not dominate discussion; challenges ideas respectfully,” but for an unacceptable rating, “Shows lack of respect for members of the group. Argues or is dismissive; resorts to ad hominem attacks.” Similarly, a not yet competent rating for reasoning is “Contributions are often based on opinion or unclear views. Comments suggest a difficulty in following complex lines of argument.”
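Because a rubric is a fixed grid of components and levels, it is straightforward to represent for scoring. The sketch below is illustrative only: the numeric values attached to each level (3 down to 0) are an assumption made for the example, since the chapter does not assign points to the levels in Table 14.7.

# Illustrative only: the Table 14.7 rubric as a simple scoring structure.
LEVELS = {"proficient": 3, "competent": 2, "not yet competent": 1, "unacceptable": 0}
COMPONENTS = ["conduct", "leadership", "reasoning", "listening", "preparation"]

def score_rubric(ratings):
    """Convert per-component level ratings into numeric scores and a total."""
    scores = {c: LEVELS[ratings[c]] for c in COMPONENTS}
    scores["total"] = sum(scores.values())
    return scores

# Example: one student rated by a facilitator on one small-group session.
example = {
    "conduct": "proficient",
    "leadership": "competent",
    "reasoning": "competent",
    "listening": "proficient",
    "preparation": "not yet competent",
}
print(score_rubric(example))   # {'conduct': 3, ..., 'total': 11}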

QUESTIONNAIRES

Questionnaires are instruments consisting of a series of questions (or prompts) for the purpose of gathering information from respondents. A main advantage of questionnaires is that a great deal of information can be gathered efficiently and at little cost. Unlike telephone surveys, they require much less effort and often have standardized answers that make it simple to compile data. A disadvantage of standardized answers is that they do not permit alternative responses. Respondents must be able to read to answer the questions, so questionnaires may not be suitable for some groups (e.g., children, impaired patients, foreign-language speakers). The questionnaire survey is possibly the most widely used research method in the social, health, and educational sciences.

A patient questionnaire for encounters with medical radiation technologists is summarized in Table 14.8. This brief 12-item questionnaire is intended to be completed by the patient immediately after the treatment or procedure. It employs Likert-type items. Questionnaires are commonly employed to conduct surveys, as they are very flexible in the content and formats employed. The information collected from individual questions or items becomes data that can be analyzed statistically; these analyses can range from descriptive statistics to factor analysis, multiple regression, and latent variable path analyses. A single survey may focus on various topics such as opinions (e.g., should assisted suicide be legal?), preferences (e.g., car vs. public transportation), behavior (e.g., eating and exercise), or factual information (e.g., illnesses), depending on its purpose. The success of survey research depends on the quality of the questionnaire and the representativeness of the sample from target populations of interest.
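Before any of these analyses, the responses are arranged as a respondents-by-items data matrix. The sketch below, which assumes pandas and uses a few invented responses, shows the first descriptive step for Likert-type questionnaire data of the kind in Table 14.8, producing the same min/max/mean/SD layout used later in Table 14.9.

# Illustrative only: item-level descriptive statistics for Likert-type data.
import pandas as pd

# Each row is one patient's responses to three questionnaire items
# (1 = strongly agree ... 5 = strongly disagree); values are invented.
responses = pd.DataFrame(
    {
        "explained_treatment": [1, 2, 1, 3, 2],
        "treats_with_respect": [1, 1, 2, 1, 1],
        "concern_for_safety": [2, 1, 1, 2, 3],
    }
)

# Min, max, mean, and SD per item, with items as rows.
summary = responses.agg(["min", "max", "mean", "std"]).T.round(2)
print(summary)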


Table 14.8  Patient questionnaire

Medical Radiation Technologist's (MRT's) Name: ____________________  Specialty: ____________________
Patient age ___________  Sex (circle): Man  Woman  Treatment or Procedure ______________________

Respond to each statement about this MRT using the following scale:
1 = strongly agree; 2 = agree; 3 = neutral; 4 = disagree; 5 = strongly disagree

1. Adequately explained my treatment or procedures
2. Explained if and when to return for follow-up care
3. Shows interest in me as a patient
4. Treats me with respect
5. Helps me with my fears and worries
6. Provides adequate privacy
7. Prevents other patients from hearing confidential information about me
8. I would go back to this MRT
9. Answers my questions well
10. I would send a friend to this MRT
11. Shows concern for my comfort
12. Shows concern for my safety

An example of questionnaire survey research: Leadership

Leadership is a complex, multifaceted phenomenon that is widely observed but poorly understood. Empirical work on leadership—particularly in medical education—is required, as leadership is associated with student achievement and successful team functioning. A recent study employed a survey questionnaire that was sent to medical education leaders to identify the primary competencies of medical education leadership.4 Survey participants, drawn from medical schools


Table 14.9  Descriptive statistics for 20 of the 63 items from the leadership questionnaire

Item                                       Min  Max  Mean   SD
1. Maintain quality                         1    5   4.16  0.77
2. Succession planning/recruiting           1    5   4.31  0.82
3. Personnel decision quality               1    5   4.54  0.69
4. Maintaining safety                       1    5   4.13  0.87
5. Enhancing task knowledge                 2    5   4.24  0.70
6. Eliminating barriers to performance      1    5   4.26  0.72
7. Strategic task management                1    5   4.43  0.70
8. Communication with community             1    5   3.85  0.90
9. Providing a good example                 2    5   4.58  0.67
10. Knowledge of organizational justice     2    5   4.44  0.72
11. Legal regulations                       1    5   4.29  0.75
12. Open-door policy                        2    5   4.41  0.78
13. Explaining decisions respect            1    5   4.55  0.66
14. Servant leadership                      1    5   4.30  0.77
15. Distributing rewards fairly             1    5   4.28  0.85
16. Responsibility for others               1    5   4.24  0.81
17. Financial ethics                        1    5   4.17  0.93
18. Honesty and integrity                   1    5   4.76  0.59
19. Being accountable                       1    5   4.71  0.61
20. Time management                         1    5   4.24  0.71

in several countries—Austria, Canada, Germany, Switzerland, the United Kingdom, and the USA—provided demographic data. A five-point survey of 63 competencies (1 = not required, 2 = of very little importance, 3 = somewhat important, 4 = important, 5 = very important) was sent by email to professors, associate professors, and assistant professors. Most of the participants rated each item as important (4) or very important (5), but for many items the entire scale (1–5) was utilized. As examples of the items and responses, the first 20 competencies are summarized in Table 14.9. The means of the items ranged from 3.85 (#8: Communication with community) to 4.76 (#18: Honesty and integrity). The standard deviations are typical (