Modern Psychometrics: The Science of Psychological Assessment [4 ed.] 9781317268772, 1317268776



English, 180 [195] pages, 2020


Table of contents:
Cover
Half Title
Title Page
Copyright Page
Contents
Preface to the fourth edition
1. The history and evolution of psychometric testing
Introduction
What is psychometrics?
Psychometrics in the 21st century
History of assessment
Chinese origins
The ability to learn
The 19th century
Beginnings of psychometrics as a science
Intelligence testing
Eugenics and the dark decades
Psychometric testing of ability
The dark ages come to an end
An abundance of abilities
Tests of other psychological constructs
Personality
Integrity
Interests
Motivation
Values
Temperament
Attitudes
Beliefs
Summary
2. Constructing your own psychometric questionnaire
The purpose of the questionnaire
Making a blueprint
Content areas
Manifestations
Writing items
Alternate-choice items
Advantages
Disadvantages
Multiple-choice items
Advantages
Disadvantages
Rating-scale items
Advantages
Disadvantages
All questionnaires
Knowledge-based questionnaires
Person-based questionnaires
Acquiescence
Social desirability
Indecisiveness
Extreme response
Designing the questionnaire
Background information
Instructions
Layout
Piloting the questionnaire
Item analysis
Facility
Discrimination
Distractors
Obtaining reliability
Cronbach's alpha
Split-half reliability
Assessing validity
Face validity
Content validity
Standardization
3. The psychometric principles
Reliability
Test-retest reliability
Parallel-forms reliability
Split-half reliability
Interrater reliability
Internal consistency
Standard error of measurement (SEM)
Comparing test reliabilities
Restriction of range
Validity
Face validity
Content validity
Predictive validity
Concurrent validity
Construct validity
Differential validity
Standardization
Norm referencing
Types of measurement
Using interval data
Standard scores and standardized scores
T scores
Stanine scores
Sten scores
IQ scores
Normalization
Algebraic normalization
Percentile-equivalent normalization
Criterion referencing
Testing for competencies
Equivalence
Differential item functioning
Measurement invariance
Adverse impact
Summary
4. Psychometric measurement
True-score theory
Identification of latent traits with factor analysis
Spearman's two-factor theory
Vector algebra and factor rotation
Moving into more dimensions
Multidimensional scaling
Application of factor analysis to test construction
Eigenvalues
Identifying the number of factors to extract using the Kaiser criterion
Identifying the number of factors to extract using the Cattell scree test
Other techniques for identifying the number of factors to extract
Factor rotation
Rotation to simple structure
Orthogonal rotation
Oblique rotation
Limitations of the classical factor-analytic approach
Criticisms of psychometric measurement theory
The Platonic true score
Psychological vs. physical true scores
Functional assessment and competency testing
Machine learning and the black box
Summary
5. Item response theory and computer adaptive testing
Introduction
Item banks
The Rasch model
Assessment of educational standards
The Birnbaum model
The evolution of modern psychometrics
Computer adaptive testing
Test equating
Polytomous IRT
An intuitive graphical description of item response theory
Limitations of classical test theory
Estimation accuracy differs with the level of the latent trait
The score is sample dependent
All items are scored the same
A graphical introduction to item response theory
The logistic curve
3PL model: difficulty parameter
3PL model: discrimination parameter
3PL model: guessing parameter
The Fisher information function
The test information function and its relationship to the standard error of measurement
How to score an IRT test
Principles of computer adaptive testing
Summary of item response theory
Confirmatory factor analysis
6. Personality theory
Theories of personality
Psychoanalytic theory
Humanistic theory
Social learning theory
Behavioral genetics
Type and trait theories
Different approaches to personality assessment
Self-report techniques and personality profiles
Reports by others
Online digital footprints
Situational assessments
Projective measures
Observations of behavior
Task performance methods
Polygraph methods
Repertory grids
Sources and management of bias
Self-report techniques and personality profiles
Reports by others
Online digital footprints
Situational assessments
Projective measures
Observations of behavior
Task performance methods
Polygraph methods
Repertory grids
Informal methods of personality assessment
State vs. trait measures
Ipsative scaling
Spurious validity and the Barnum effect
Summary
7. Personality assessment in the workplace
Prediction of successful employment outcomes
Validation of personality questionnaires previously used in employment
Historical antecedents to the five-factor model
Stability of the five-factor model
Cross-cultural aspects of the five-factor model
Scale independence and the role of facets
Challenges to scale construction for the five-factor model
Impression management
Acquiescence
Response bias and factor structure
Development of the five OBPI personality scales
Assessing counterproductive behavior at work
The impact of behaviorism
Prepsychological theories of integrity
Modern integrity testing
Psychiatry and the medical model
The dysfunctional tendencies
The dark triad
Assessing integrity at work
The OBPI integrity scales
Conclusion
8. Employing digital footprints in psychometrics
Introduction
Types of digital footprints
Usage logs
Language data
Mobile sensors
Images and audiovisual data
Typical applications of digital footprints in psychometrics
Replacing and complementing traditional measures
New contexts and new constructs
Predicting future behavior
Studying human behavior
Supporting the development of traditional measures
Advantages and challenges of employing digital footprints in psychometrics
High ecological validity
Greater detail and longitude
Less control over the assessment environment
Greater speed and unobtrusiveness
Less privacy and control
No anonymity
Bias
Enrichment of existing constructs
Developing digital-footprint-based psychometric measures
Collecting digital footprints
How much data is needed?
Preparing digital footprints for analysis
Respondent-footprint matrix
Data sparsity
Reducing the dimensionality of the respondent-footprint matrix
Singular value decomposition
Selecting the number of singular vectors to retain
Interpreting singular vectors
Rotating singular vectors
Latent Dirichlet allocation
Dirichlet distribution parameters
Choosing the number of LDA clusters
Interpreting the LDA clusters
Building prediction models
Summary
9. Psychometrics in the era of the intelligent machine
History of computerization in psychometrics
Computerized statistics
Computerized item banks
Computerized item generation
Automated advice and report systems
The evolution of AI in psychometrics
Expert systems
Neural networks (machine learning)
Parallel processing
Predicting with statistics and machine learning
Explainability
Psychometrics in cyberspace
What and where is cyberspace?
The medium is the message
Moral development in AI
Kohlberg's theory of moral development
Do machines have morals?
The laws of robotics
Artificial general intelligence
Conclusion
References
Index


Modern Psychometrics

This popular text introduces the reader to all aspects of psychometric assessment, including its history, the construction and administration of traditional tests, and the latest techniques for psychometric assessment online. Rust, Kosinski, and Stillwell begin with a comprehensive introduction to the increased sophistication in psychometric methods and regulation that took place during the 20th century, including the many benefits to governments, businesses, and customers. In this new edition, the authors explore the increasing influence of the internet, wherein everything we do on the internet is available for psychometric analysis, often by AI systems operating at scale and in real time. The intended and unintended consequences of this paradigm shift are examined in detail, and key controversies, such as privacy and the psychographic microtargeting of online messages, are addressed. Furthermore, this new edition includes brand-new chapters on item response theory, computer adaptive testing, and the psychometric analysis of the digital traces we all leave online. Modern Psychometrics combines an up-to-date scientific approach with full consideration of the political and ethical issues involved in the implementation of psychometric testing in today’s society. It will be invaluable to both undergraduate and postgraduate students, as well as practitioners who are seeking an introduction to modern psychometric methods.

John Rust is the founder of The Psychometrics Centre at the University of Cambridge, UK. He is a Senior Member of Darwin College, UK, and an Associate Fellow of the Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK.

Michal Kosinski is an Associate Professor of organizational behavior at the Stanford Graduate School of Business, USA.

David Stillwell is the Academic Director of the Psychometrics Centre at the University of Cambridge, UK. He is also a reader in computational social science at the Cambridge Judge Business School, UK.

Modern Psychometrics: The Science of Psychological Assessment

Fourth Edition

John Rust, Michal Kosinski, and David Stillwell

Fourth edition published 2021
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2021 John Rust, Michal Kosinski and David Stillwell

The right of John Rust, Michal Kosinski, and David Stillwell to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

First edition published in 1989
Third edition published by Routledge 2009

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
Names: Rust, John, 1943- author. | Kosinski, Michal, author. | Stillwell, David, author.
Title: Modern psychometrics : the science of psychological assessment / John Rust, Michal Kosinski and David Stillwell.
Description: Fourth edition. | Milton Park, Abingdon, Oxon ; New York, NY : Routledge, 2021. | Includes bibliographical references and index.
Identifiers: LCCN 2020034344 (print) | LCCN 2020034345 (ebook) | ISBN 9781138638631 (hardback) | ISBN 9781138638655 (paperback) | ISBN 9781315637686 (ebook)
Subjects: LCSH: Psychometrics.
Classification: LCC BF39 .R85 2009 (print) | LCC BF39 (ebook) | DDC 150.28/7--dc23
LC record available at https://lccn.loc.gov/2020034344
LC ebook record available at https://lccn.loc.gov/2020034345

ISBN: 978-1-138-63863-1 (hbk)
ISBN: 978-1-138-63865-5 (pbk)
ISBN: 978-1-315-63768-6 (ebk)

Typeset in Bembo by MPS Limited, Dehradun


Preface to the fourth edition

It is now 30 years since the first edition of Modern Psychometrics was published, and in that time the science has continued to make great strides. Many of the future possibilities tentatively discussed in the first and second editions are now accepted realities. Since the publication of the third edition in 2009, the internet has completely revolutionized our lives. Psychometrics has played a major part in this, much of it good, some not so good. Psychometric profiles derived from our online digital activity are the subject of constant AI scrutiny, providing corporations, political parties, and governments with tools to nudge us for their own benefit and not always in our best interest. But also, psychographic microtargeting of information based on these profiles enables individualized learning, information retrieval, and purchasing of preferred products on a scale previously undreamed of. It is the engine that drives the big tech money machine and the digital economy.

At the same time, psychometrics continues to play a central role in improving examination systems in our schools and universities, recruitment and staff development in human-resources management, and development of research tools for academic projects. This book is intended to provide both a theoretical underpinning to psychometrics and a practical guide for professionals and scholars working in all these fields. In this new edition we outline the history and discuss central issues such as IQ and personality testing and the impact of computer technology. It is increasingly recognized that modern psychometricians, because their role is so central to fair assessment and selection, must not only continue to take a stand on issues of racism and injustice but also contribute to debates concerning privacy and the regulation of corporate and state power.

The book includes a practical step-by-step guide to the development of a psychometric test. This enables anyone who wishes to create their own test to plan, design, construct, and validate it to a professional standard. Knowledge-based tests of ability, aptitude, and achievement are considered, as well as person-based tests of personality, integrity, motivation, mood, attitudes, and clinical symptoms. There is extensive coverage of the psychometric principles of reliability, validity, standardization, and bias, knowledge of which is essential for the testing practitioner, whether in school examination boards, human-resources departments, or academic research. The fourth edition has been extensively updated and expanded to take into account recent developments in the field, making it the ideal companion for those wishing to achieve qualifications of professional competence in testing.

But today, no psychometrics text would be complete without extensive coverage of key issues in testing in the online environment made possible by advances in internet technology. Computer adaptive testing and real-time item generation are now available to any psychometrician with the necessary know-how and access to the relevant software, much of it open source, such as Concerto (The Psychometrics Centre, 2019). Psychometric skills are currently in enormous demand, not just from classic markets but also for new online applications that require understanding and measuring the unique traits of individuals, such as the provision of personalized health advice, market research, online recommendations, and persuasion. The fourth edition extends coverage of these fields to provide advice to computer scientists and AI specialists on how to develop and understand computer adaptive tests and online digital-footprint analysis.

The groundwork for the collaboration that led to this fourth edition was established in The Psychometrics Centre at the University of Cambridge. We were very fortunate to have been supported by an amazing team, without whose enthusiasm, creativity, ambition, and drive many of the revolutionary developments in psychometrics would not have been possible. Among them are Iva Cek, Fiona Chan, Tanvi Chaturvedi, Kalifa Damani, Bartosz Kielczewski, Shining Li, Przemyslaw Lis, Aiden Loe, Vaishali Mahalingam, Sandra Matz, Igor Menezes, Tomoya Okubo, Vesselin Popov, Luning Sun, Ning Wang, and Youyou Wu. Many have now dispersed to all corners of the world, but their work continues, and there is much more yet to do. Finally, thanks are due to Peter Hiscocks and Christoph Loch, who facilitated the move of The Psychometrics Centre to the Judge Business School, and to Susan Golombok, the original coauthor, for her generosity and support in the preparation of this new edition.

1. The history and evolution of psychometric testing

Introduction

People have always judged each other in terms of their skills, potential, character, motives, mood, and expected behavior. Since the beginning of time, skill in this practice has been passed on from generation to generation. Being able to evaluate our friends, family, colleagues, and enemies in terms of these attributes is fundamental to us as human beings. Since the introduction of the written word, these opinions and evaluations have been recorded, and our techniques for classifying, analyzing, and improving them have become not just an art but also a technology that has played an increasing role in how societies are governed. Triumphing in this field has been the secret of success in war, business, and politics. As with all technologies, it has been driven by science—in this case, the science of human behavior, the psychology of individual differences, and, when applied to psychological assessment, psychometrics. In the 20th century, early psychometricians played a key role in the development of related disciplines such as statistics and biometrics. They also revolutionized education by introducing increasingly refined testing procedures that enabled individuals to demonstrate their potential from an early age. Psychometrics required statistical and computational know-how, as well as data on a large scale, before its impact could be felt. Today, we think of big data in terms of the technological revolution, but large-scale programs implementing the analysis of human data on millions of individuals date back to over 100 years ago in the form of early national censuses and military recruitment. These early scientists did not see their subject as just an interesting academic discipline; they were also fascinated by its potential to improve all our lives. And indeed, in most ways it has, but it has been a long and rocky road—with many false starts, and indeed disasters, on the way. In this chapter, we start with definitions, followed by an evaluation of future potential, a history, a warning about past missteps, applause for current successes, and an invitation to learn from history’s lessons. It is said that those who cannot remember the past are condemned to repeat it. Let us all make sure that this does not happen, but rather that we can bring about a future that sees human potential expand to the stars.

What is psychometrics?

Psychometrics is the science of psychological assessment, and has traditionally been seen as an aspect of psychology. But its impact has been much broader. The scientific principles that underpin psychometrics apply equally to other forms of assessment such as educational examinations, clinical diagnoses, crime detection, credit ratings, and staff
recruitment. The early psychometricians were equally at home in all of these fields. Since then, paths have often diverged, but they have generally reunited as the importance of advances made in one context comes to the attention of workers in other areas. Currently, great strides are being made in the application of machine-learning techniques and big-data analytics—particularly in the analysis of the digital traces we all leave online—and these are beginning to have a significant impact across a broad range of applications. These are both exciting and disturbing times. We experience psychometric assessment in many of our activities, for example:

• We are tested throughout our education to inform us, our parents, teachers, and policy makers about our progress (and the efficiency of teaching).
• We are assessed at the end of each stage of education to provide us with academic credentials and inform future schools, colleges, or employers about our strengths and weaknesses.
• We must pass a driving test before we are allowed to drive a car.
• Many of us need to pass a know-how or skills test to be able to practice our professions.
• We are assessed in order to gain special provisions (e.g., for learning difficulties) or to obtain prizes.
• When we borrow money or apply for a mortgage, we must complete credit scoring forms to assess our ability to repay the debt.
• We are tested at work when we apply for a promotion and when we seek another job.
• Our playlists are analyzed to assess our music tastes and recommend new songs.
• Our social media profiles are analyzed—sometimes without our consent—to estimate our personality and choose the advertisements that we are most likely to click.

Assessment can take many forms: job interviews, school examinations, multiple-choice aptitude tests, clinical diagnoses, continuous assessment, or analysis of our online footprints. But despite the wide variety of applications and manifestations, all assessments should share a common set of fundamental characteristics: they should strive to be accurate, measure what they intend to measure, produce scores that can be meaningfully compared between people, and be free from bias against members of certain groups. There are good assessments and bad assessments, and psychometrics is the science of how to maximize the quality of the assessments that we use.
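
One of these characteristics, scores that can be meaningfully compared between people, is usually achieved by expressing raw scores relative to a norm group, a topic the book develops in its later chapters on standardization. The short Python sketch below is purely illustrative; the norm-group numbers are invented for the example and are not taken from the text.

# Illustrative sketch (invented numbers): raw scores become comparable once
# they are expressed relative to the same norm group, here as z-scores.
import statistics

norm_group_scores = [12, 15, 17, 18, 20, 21, 23, 25, 27, 30]  # hypothetical norm sample
mean = statistics.mean(norm_group_scores)
sd = statistics.stdev(norm_group_scores)

def z_score(raw):
    """Express a raw test score in standard-deviation units from the norm mean."""
    return (raw - mean) / sd

# Two candidates who took the same test can now be compared on a common scale.
print(round(z_score(27), 2), round(z_score(15), 2))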

Psychometrics in the 21st century

Psychometrics depends on the availability of data on a large scale, and so it is no surprise that the advent of the internet has massively boosted its influence. If we had to date the internet, we would probably start at CERN, the European Organization for Nuclear Research, in Geneva, with Tim Berners-Lee’s invention of the World Wide Web in 1990; he linked the newly developed hypertext markup language (HTML) to a graphic user interface (GUI), thereby creating the first web pages. Since then, the web has expanded to make Marshall McLuhan’s “global village” a reality (McLuhan, 1964). The population of this global village grew from a handful of academics in the early 1990s to a diverse and vibrant community of one billion users in 2005, and to over four billion users
(representing more than 50% of the world’s population) in 2020. Thus, within less than 20 years, the new medium of cyberspace came into existence, creating a completely new science with new disciplines, new experts, and, of course, new problems. Some aspects of this new science are exceptional. While the science of biology is only 300 years old, and that of psychology considerably younger, both their subjects of study—humans and life itself—have existed for millions of years. Not so the internet. Hence the cyberworld is unique, and it is hard to predict what to expect of its future. It is also a serious disruptor; it has completely changed the nature of its adjacent disciplines, especially computer and information sciences, but also psychology and its progeny, psychometrics. By the year 2000, the migration of psychometrics into the online world was well underway, producing both new opportunities and new challenges, particularly for global examination organizations such as the Educational Testing Service (ETS) at Princeton and Cambridge Assessment in the UK. On the positive side, gone were the massive logistical problems involved in securely delivering and recovering huge numbers of examination papers by road, rail, and air from remote parts of the world. But the downside was that examinations needed to take place at fixed times during the school or working day, and it became possible for candidates in, say, Singapore to contact their friends in, say, Mexico with advance knowledge of forthcoming questions. Opportunities for cheating were rife. To counter these challenges, the major examination boards and test publishers turned to the advantages offered by large item banks and computer adaptive testing, the psychometricians’ own version of machine learning. However, it was the development of the app—an abbreviation of “application” used to describe a piece of software that can be run through a web browser or on a mobile phone—that was to prove the most disruptive to traditional ways of thinking about psychometric assessment. One such app was David Stillwell’s myPersonality, published on Facebook in 2007 (Stillwell, 2007; Kosinski, Stillwell, & Graepel, 2013; Youyou, Kosinski, & Stillwell, 2015). It offered its users a chance to take a personality test, receive feedback on their scores, and share those scores—if they were so inclined—with their Facebook friends. It was similar to countless other quizzes widely shared on Facebook around that time, yet it employed an established and well-validated personality test taken from the International Personality Item Pool (IPIP), an opensource repository established in the 1990s for academic use as a reaction to test publishers’ domination of the testing world. The huge popularity of myPersonality was unforeseen. Within a few years, the app had collected over six million personality profiles, generated by enthusiasts who were interested to see the sort of results and feedback about themselves that had previously only been available to psychology professionals. It was one of psychometrics’ first encounters with the big-data revolution. But the availability of psychometric data on such a grand scale was to have unexpected consequences. Many saw opportunities for emulating the procedure in online advertising, destined to become the major source of revenue for the digital industry. Once the World Wide Web existed, it could be searched or trawled by search engines, the most ubiquitous of which is Google. 
In the mid-1990s, search engines simply provided information. By 2010 they did so with a scope and accuracy that exceeded all previous expectations; information on anything or anyone was ripe for the picking. But those who wished to be found soon became active players on the scene—it was the advertising industry’s new paradise. The battle to reach the top in search league
tables—or, at the very least, the first results page—began in earnest. Once online advertising entered the fray, it became a new war zone. The battle for the keywords had begun. Marketing was no longer about putting up a board on the high street; it was about building a digital presence in cyberspace that would bring customers to you in droves. By the early 2000s, no company or organization could afford not to have a presence in cyberspace. For a high proportion of customers, companies without some digital presence simply ceased to exist. While web pages were the first universally available data source in cyberspace, social networks soon followed, and these opened a whole new world of individualized personal information about their users that was available for exploitation. Not only was standard demographic information such as age, marital status, gender, occupation, and education available, but there were also troves of new data such as the words being used in status updates and tweets, images, music preferences, and Facebook Likes. And these data sources soon became delicious morsels in a new informational feeding frenzy. They were mined extensively by tech companies and the marketing industry to hone their ability to target advertisements to the most relevant audiences—or, to put it another way, to those who might be most vulnerable to persuasion. The prediction techniques used were the same as those that had been used by psychometricians for decades: principal component analysis, cluster analysis, machine learning, and regression analysis. These were able to predict a person’s character and future behavior with far more accuracy than simple demographics. Cross-correlating demographics with traditional psychometric data, such as personality traits, showed that internet users were giving away much more information about their most intimate secrets than they realized. Thus, online psychographic targeting was born. This new methodology, creating clickbait and directing news feeds using psychological as well as demographic data, was soon considered to be far too powerful to exist in an unregulated world. But this will prove one day to have been just the midpoint in a journey that began many centuries ago.
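
The pipeline named above (reduce a large person-by-footprint matrix to a few components, then use regression to predict a psychological characteristic) can be sketched in a few lines. The following Python example uses synthetic data and the scikit-learn library purely as an illustration of the technique; the matrix sizes, the simulated trait, and the component count are invented and are not taken from the studies mentioned.

# Minimal sketch (synthetic data): principal component analysis on a
# user-by-footprint matrix, followed by regression on the components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 500)).astype(float)   # e.g., which of 500 pages each user liked
trait = X[:, :50].sum(axis=1) + rng.normal(0, 2, size=1000)  # simulated trait score

X_train, X_test, y_train, y_test = train_test_split(X, trait, random_state=0)

pca = PCA(n_components=40).fit(X_train)                        # dimensionality reduction
model = LinearRegression().fit(pca.transform(X_train), y_train)  # regression on components

print("held-out R^2:", model.score(pca.transform(X_test), y_test))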

History of assessment

Chinese origins

Employers have assessed prospective employees since the beginnings of civilization, and have generated consistent and replicable techniques for doing this. China was the first country to use testing for the selection of talents (Jin, 2001; Qui, 2003). Earlier than 500 BCE, Confucius had argued that people were different from each other. In his words, “their nature might be similar, but behaviors are far apart,” and he differentiated between “the superior and intelligent” and “the inferior and dim” (Lun Yu, Chapter Yang Huo). Mencius (372–289 BCE) believed that these differences were measurable. He advised: “assess, to tell light from heavy; evaluate, to know long from short” (Mencius, Chapter Liang Hui Wang). Xunzi (310–238 BCE) built upon this theory and advocated the idea that we should “measure a candidate’s ability to determine his position [in the court]” (Xun Zi, Chapter Jun Dao). Thus, over 2,000 years ago, much of the fundamental thinking that today underpins psychometric testing was already in place, as were systems that used this in the selection of talents. In fact, there is evidence that talent selection systems appeared in China even before Confucius. In the Xia Dynasty (c. 2070–1600 BCE), the tradition of selecting officers by competition placed heavy emphasis on physical strength and skills, but by the time of the Zhou Dynasty (1046–256 BCE) the content of the tests had changed. The
emperor assessed candidates not only based on their shooting skills but also in terms of their courteous conduct and good manners. From then on, the criteria used for the selection of talent grew to include the “Six Skills”: arithmetic, writing, music, archery, horsemanship, and skills in the performance of rituals and ceremonies; the “Six Conducts”: filial piety, friendship, harmony, love, responsibility, and compassion; and the “Six Virtues”: insight, kindness, judgment, courage, loyalty, and concord. During the Warring States period (475–221 BCE), oral exams became more prominent. In the Qin Dynasty, from 221 BCE, the main test syllabus primarily consisted of the ability to recite historical and legal texts, calligraphy, and the ability to write official letters and reports. The Sui (581–618 CE) and Tang Dynasties (618–907 CE) saw the introduction of the imperial examinations, a nationwide testing system that became the main method of selecting imperial officials. Formal procedures required—then as they do now—that candidates’ names should be concealed, independent assessments by two or more assessors should be made, and conditions of examination should be standardized. The general framework of assessment set down then—including a “syllabus” of material that should be learned and rules governing an efficient and fair “examination” of candidates’ knowledge—has not changed for 3,000 years. While similar but less sophisticated frameworks may have existed in other ancient civilizations, it was models based on the Chinese system that were to become the template for the modern examination system. The British East India Company, active in Shanghai, introduced the Chinese system to its occupied territories in Bengal in the early 19th century. Once the company was abolished in 1858, the system was adopted by the British for the Indian Civil Service. It subsequently became the template for civil service examinations in England, France, the USA, and much of the rest of the world.

The ability to learn

It has long been recognized by teachers that some students are more capable of learning than others. In Europe in 375 BCE, for example, Socrates asked his student Glaucon:

When you spoke of a nature gifted or not gifted in any respect, did you mean to say that one man will acquire a thing hastily, another with difficulty; a little learning will lead the one to discover a great deal; whereas the other, after much study and application, no sooner learns than he forgets; or again, did you mean that the one has a body that is a good servant of his mind, while the body of the other is a hindrance to him? – Would not these be the sort of differences which distinguish the man gifted by nature from the one who is ungifted?
(Plato, Respublica V, 449a–480a)

This view of the ability to learn, generally referred to as intelligence, was very familiar to European scientists in the 19th century—almost all would have studied Greek at school and university. Intelligence was not education but educability, and represented an important distinction between the educated person and the intelligent person. An educated person is not necessarily intelligent, and an uneducated person is not necessarily unintelligent. In medieval Europe, the number of people entitled to receive an education was very small. However, the Reformation and then the Industrial Revolution were transformative. In Europe, the importance of being able to read the Bible in a native
language, and the need to learn how to operate machines, led to popular support for a movement to provide an education for all—regardless of social background. One consequence was that more attention was drawn to those who continued to find it difficult to integrate into everyday society. The 19th century saw the introduction of the asylum system, in which “madhouses” were replaced by “lunatic asylums” for the mentally ill and “imbecile asylums” for those with learning difficulties. These words, obviously distasteful today, were then in common usage: the term “lunatic” in law and the term “imbecile” in the classification system used by psychiatrists. Indeed, the term “asylum” was intended to be positive, denoting a place of refuge (as in “political asylum” today). The need to offer provision for those who had difficulty with the learning process focused attention on how such people could be identified and how their needs best accommodated. In pre-Victorian England, Edward Jenner (the advocate of vaccination) proposed a four-stage hierarchy of human intellect, in which he confounded intelligence and social class. Jenner (1807) summarized the attitude of the time. In “Classes of the Human Powers of Intellect,” published in the popular magazine The Artist, he wrote:

“I propose therefore to offer you some thought on the various degrees of power which appear in the human intellect; Or, to speak more correctly, of the various degrees of intellectual power that distinguish the human animal. For though all men are, as we trust and believe, capable of the divine faculty of reason, yet it is not to all that the heavenly beam is disclosed in all its splendour.

1. In the first and lowest order I place the idiot: the mere vegetative being, totally destitute of intellect.
2. In the second rank I shall mention that description of Being just lifted into intellectuality, but too weak and imperfect to acquire judgement; who can perform some of the minor offices of life – can shut a door – light a fire – express sensations of pain, etc., and, although faintly endowed with perceptions of comparative good, is yet too feeble to discriminate with accuracy. A being of this degree may, with sufficient propriety, be denominated the silly poor creature, the dolt.
3. The third class is best described by the general term of Mediocrity and includes the large mass of mankind. These crowd our streets, these line our queues, these cover our seas. It is with this class that the world is peopled. These are they who move constantly in the beaten path; these support the general order which they do not direct; these uphold the tumult which they do not stir; these echo the censure or the praise of that which they are neither capable of criticising nor admiring.
4. The highest level is Mental Perfection; the happy union of all the faculties of the mind, which conduce to promote present and future good; all of the energies of genius, valour and judgement. In this class are found men who, surveying truth in all her loveliness, defend her from assault, and unveil her charms to the world; who rule mankind by their wisdom, and contemplate glory, as the Eagle fixes his view on the Sun, undazzled by the rays that surround it.”

The 19th century

The era of European colonialism saw the spread of the Western education system to the colonized world, but uptake was slow, and ideas taken from perceptions of social status in Europe were often transferred to the subjected populations. Charles Darwin, for example—a giant figure of that age, whose theory of evolution had considerable implications for how differences both between and within species would be understood—was among those who held this Eurocentric approach, something that perturbed the development of evolutionary science in a way that would increasingly be recognized as racist. In The Descent of Man, first published in 1871, Darwin argued that the intellectual and moral faculties had been gradually perfected through natural selection, stating as evidence that “at the present day, civilized nations are everywhere supplanting barbarous nations.” Darwin’s view that natural selection in humans was an ongoing process, with “the savage” and “the lower races” being evolutionarily inferior to “the civilized nations,” had considerable influence. But it was not a view that was shared by every scientist of that time. Others chose to differ, including Alfred Wallace, Darwin’s copresenter of papers to the seminal 1858 meeting of the Linnean Society of London that introduced the idea of natural selection through survival of the fittest (Wallace, 1858). Wallace took issue with his former colleague, believing Darwin’s arguments for differences between the races to be fundamentally flawed. His observations in Southeast Asia, South America, and elsewhere convinced him that so-called “primitive” peoples all exhibited a high moral sense. He also drew attention to the ability of children from these groups to learn advanced mathematics, having taught five-year-olds in Borneo how to solve simultaneous equations, and pointed out that no evolutionary pressure could have ever been exerted on their ancestors in this direction by the natural environment. In Wallace’s view, the evolutionary factors that had led to the development of intelligence and morality in humanity had happened in the distant evolutionary past and were shared by all humans.

Beginnings of psychometrics as a science

The evolution of the human intellect was also of particular interest to Darwin’s cousin, Sir Francis Galton, who in 1869 published Hereditary Genius: An Inquiry into Its Laws and Consequences. He had first carried out a study of the genealogy of famous families in 1865, based on a compendium by Sir Thomas Phillipps entitled The Million of Facts (Galton, 1865). Galton argued that genius, genetic in origin, was to be found in these families. But when he spoke of genius, he was considering much more than mere intellect. He believed that these people were superior in many other respects, be it the ability to appreciate music or art, performance in sport, or even simply physical appearance. In 1883, in order to collect data to validate his idea, he established his Anthropometric Laboratory at the International Health Exhibition at South Kensington, London, in which people attending the exhibition could have their characteristics measured for three pence (about two US dollars at today’s prices). In the late 1880s, the American psychophysicist James McKeen Cattell, recently arrived in Cambridge from Wundt’s psychophysics laboratory in Germany, introduced Galton to many of Wundt’s psychological testing instruments, which he added to his repertoire. Hence mental testing—psychometrics—was born. The data generated from Galton’s studies provided the raw material for his development of many key statistical methods such as standard deviation and correlation. Karl Pearson, an acolyte of Galton, added partial and multiple correlation coefficients, as well as the chi-square test, to the techniques available. In 1904, Charles Spearman, an army officer turned psychologist,
introduced procedures for the analysis of more complex correlation matrices and laid down the foundations of factor analysis. Thus by the end of the first decade of the 20th century, the fundamentals of psychometric theory were in place. The groundwork had also been laid for the subsequent development of many new scientific endeavors: the statistical sciences, biometrics, latent variable modeling, machine learning, and artificial intelligence.

Intelligence testing

The main impetus to provide an intelligence test specifically for educational selection arose in France in 1904, when the minister of public instruction in Paris appointed a committee to find a method that could identify children with learning difficulties. It was urged that “children who failed to respond to normal schooling be examined before dismissal and, if considered educable, be assigned to special classes.” (Binet and Simon, 1916). Drawing from item types already developed, the psychologist Alfred Binet and his colleague Théodore Simon put together a standard set of 30 scales that were quick and easy to administer. These were found to be very successful at differentiating between children who were seen as bright and children who were seen as dull (by teachers), and between children in institutions for special educational needs and children in mainstream schools. Furthermore, the scores of each child’s scales could be compared with those of other children of the same or similar age, thus freeing the assessment from teacher bias. The results of Binet’s testing program not only provided guidance on the education of children at an individual level but also influenced educational policy. The first version of the Binet–Simon Scale was published in 1905, and an updated version followed in 1908 when the concept of “mental age” was introduced—this being the age for which a child’s score was most typical, regardless of their chronological age. In 1911, further amendments were made to improve the ability of the test to differentiate between education and educability. Scales of reading, writing, and knowledge that had been incidentally acquired were eliminated. The English-language derivative of the Binet–Simon test, the Stanford–Binet, is still in widespread use today as one of the primary assessment methods for the identification of learning difficulties in children. Binet’s tests emphasized what he called the higher mental processes that he believed underpinned the capacity to learn: the execution of simple commands, coordination, recognition, verbal knowledge, definitions, picture recognition, suggestibility, and the completion of sentences. In their book The Development of Intelligence in Children, first published in 1916, Binet and Simon, using the language of the time, stated their belief that good judgment was the key to intelligence: It seems to us that in intelligence there is a fundamental faculty, the alteration or the lack of which, is of the utmost importance for practical life. This faculty is judgment, otherwise called good sense, practical sense, initiative, the faculty of adapting one’s self to circumstances. To judge well, to comprehend well, to reason well, these are the essential activities of intelligence. A person may be a moron or an imbecile if he is lacking in judgment; but with good judgment he can never be either. Indeed, the rest of the intellectual faculties seem of little importance in comparison with judgment. (Binet and Simon, 1916) The potential of intelligence testing to identify intellectual capacity in adults as well as children was soon recognized, and the First World War saw the introduction of such a
program in the US on a grand scale. A committee under the chairmanship of Robert Yerkes, president of the American Psychological Association, was established. It was tasked with devising tests of intelligence that would be positively correlated with Binet’s scales but adapted for group use and simple to administer and score. The tests should measure a wide range of abilities, be resilient to malingering and cheating, be independent of school training, and have a minimal need for writing. In seven working days, they constructed 10 subtests with sufficient items for 10 different forms. These were piloted with 500 participants from a broad range of backgrounds, including institutions for people with special educational needs, patients at a psychiatric hospital, army recruits, officer trainees, and high school students. The entire process was completed in less than six months. By the end of the war, these tests, known as Army Alpha and Army Beta, had been administered at the rate of 200,000 per month to nearly two million American recruits.

Following from the popularity and success of the army tests, the mantle for mass testing for an Intelligence Quotient (IQ) was taken up by the US College Board, which in 1926 introduced the Scholastic Aptitude Test (SAT), designed to facilitate entry into colleges throughout the US. Adopted first by Harvard and then by the University of California, by 1940 the SAT had become the standard admission test for practically all US universities. The aim was to create a level playing field on which every child leaving high school with an IQ above a certain level could have the opportunity to benefit from tertiary education. The development of meritocratic education rapidly spread around the world, and the old world order—in which education depended on birth and economic privilege—began its decline.

However, the success of the new system was not universal. While more gifted individuals benefited, particularly among ethnic minorities and the working class, the effect on the majority in such groups was counterproductive. Large differences in average group scores on these tests remained, resulting in considerable group differences in admission rates, particularly to elite schools and universities, and consequently in employment and entry to professions. The impact of social background factors on test scores went largely unrecognized, and one form of elite was replaced by another. The beneficiaries were predominantly middle class and white. It seemed that IQ testing, while perhaps a panacea, was not a panacea for all. Early attempts to address the discriminating consequences of group differences in IQ test scores when used for selection purposes focused on several strategies. Predominant was a shift away from a single measure, referred to as general intelligence or “g,” toward separate measures for different IQ intellectual abilities such as numerical or verbal, tailored more specifically to the requirements of the training course or employment position in question. There was also increased attention in law to the extent to which any psychometric testing procedures would impact constitutional rights, particularly under the Bill of Rights and the 14th Amendment.

Eugenics and the dark decades

But despite the enormous success of Alfred Binet and his successors in addressing the requirements of the education system, for many this was just a sideshow. The originator of the science was Galton, not Binet, and Galton’s true interest was not psychometrics—or even anthropometrics per se; rather, his concern was that the quality of the human race, particularly its intelligence, was degenerating as those of
lower intelligence were having more children and passing on more and more of their “inferior” genes through succeeding generations. In 1883 he coined the terms “eugenics” (defining it as “the conditions under which men of high type are produced”) and “dysgenics” (the opposite of eugenics). Galton’s academic influence was considerable, culminating in the establishment of a eugenics department at University College London. But it was much more than just a theory. In many countries, its ambitions were soon implemented in policy. In 1907, the USA was the first country to undertake compulsory sterilization programs for eugenics purposes. The principal targets were the “feebleminded” and the mentally ill, but also included under many state laws were people who were deaf, blind, epileptic, and physically disabled. Native Americans were sterilized against their will in many states, often without their knowledge, while they were in hospital for some other reason (e.g., after giving birth). Some sterilizations also took place in prisons and other penal institutions, targeting criminality. Over 65,000 individuals were sterilized in 33 states under state compulsory sterilization programs. These were last carried out in 1981. Assessment of intelligence played a key part in many of these programs. Indeed, many early intelligence tests were designed with a eugenics agenda in mind. By 1913, Henry Goddard had introduced them to the evaluation process for potential immigrants at Ellis Island in New York, and in 1919 Lewis Terman stated in his introduction to the first edition of the Stanford–Binet Intelligence Scales (his own translations of Binet’s scales into English): It is safe to predict that in the near future intelligence tests will bring tens of thousands of … high-grade defectives under the surveillance and protection of society. This will ultimately result in the curtailing of the reproduction of feeblemindedness and in the elimination of enormous amounts of crime, pauperism, and industrial inefficiency. It is hardly necessary to emphasize that the high-grade cases, of the type now so frequently overlooked, are precisely the ones whose guardianship it is most important for the state to assume. (Terman, 1919) In 1927, the Kaiser Wilhelm Institute of Anthropology, Human Heredity, and Eugenics was the first to advocate sterilization in Germany. The year 1934 saw the introduction of the country's Law for the Prevention of Offspring with Hereditary Defects; and in 1935 this law was amended to allow abortion for the “hereditarily ill,” including the “social feebleminded” and “asocial persons.” Two years later came the introduction of Sonderkommission 3 (Special Commission Number 3), under which all local authorities in Germany were required to submit a list of all children of African descent. All such children were to be medically sterilized. We all know what followed. In 1939, euthanasia was legalized for psychiatric patients (including homosexual people) in psychiatric hospitals; and in 1942, the same methods were extended to Roma and Jewish people in concentration camps in what became the Holocaust. Debates about the implementation of eugenics generally ended with the Second World War. However, widespread beliefs among white communities about differences between races did not end. The decolonization process had yet to begin, and many people—not just in the USA but also in Africa and elsewhere—continued to attribute African and African-American academic underachievement to inherited IQ differences. 
Martin Luther King Jr. did not have his dream until the early 1960s, and


it was not until 1994 that apartheid was finally abolished in South Africa. Until that time, arguments over whether inherited differences in intelligence might be a cause of group differences continued to stir up controversy among academics in both the USA and Europe, culminating in the publication of Herrnstein and Murray’s The Bell Curve in 1994, in which the authors argued that poor and ethnically diverse inner-city communities were a “cognitive underclass” formed by the interbreeding of people of poorer genetic stock. But there was light at the end of the tunnel.

Psychometric testing of ability

The dark ages come to an end

The year 1984 saw the publication of James Flynn’s (Flynn, 1984) first report on what subsequently became known as the “Flynn effect,” the term used to describe the now well-documented year-by-year rise in IQ scores that dates back to the early 20th century, when the practice of IQ testing first began. By examining these changes, it was possible to extrapolate across the entire lifetime of the tests. Flynn showed that, on average, IQ scores increased by 0.3 to 0.4 IQ point every year and had been doing so for at least the past 100 years. Various theories have been put forward to explain this phenomenon, including improved nutrition and increasing familiarity with the testing process. For his part, Flynn (2007, 2016) argued that the change was due to the way in which the scientific method had influenced education. Today, we are more rational thinkers than were our ancestors, because the requirements of an industrialized and increasingly technical world force us to be just that.

It is interesting that Flynn does not believe that we are actually more intelligent today, in the traditional sense of the word. If that were so, the logic of the finding would be that almost half of our great-grandparents’ generation would be diagnosed as having severe learning difficulties by today’s standards. Rather, it is what we need to comprehend scientifically that has changed. Today, most primary school children understand that the correct answer to the question “What do dogs and rabbits have in common?” is that they are both mammals. It is unlikely that our great-grandparents would have thought this was a matter of any consequence. They might choose the wrong answer—for example, “Dogs chase rabbits”—and fail to understand why this might be considered wrong.

Flynn’s (2007, 2016) work has increasingly focused on the interplay between intelligence test scores and education—not just across time, but also between the level of education provided by each country’s schools and colleges, which has historically varied enormously. As well as being an interesting phenomenon in its own right, the Flynn effect carried a striking implication: the average IQ of African-Americans in the 2000s was higher than the average IQ of white Americans in the 1970s. Further research by Flynn extended this work beyond the USA and found similar results in many other countries across all the continents. This evidence of the equally enormous impact of environmental factors on IQ scores of all groups rendered unnecessary any need to explain group differences in terms of genetics.

Today, the controversy has been largely forgotten. The world has become increasingly multicultural and global in outlook. Moreover, in both the USA and Europe, the enormous professional success of immigrant groups that had previously been excluded speaks for itself.

An abundance of abilities

Today, intelligence testing is conceptualized more broadly and assessed in a variety of more positive ways. A popular view is exemplified by the work of Robert Sternberg (1990) in his triarchic model. Sternberg suggested three major forms of intelligence: analytic, creative, and practical. Those with analytic intelligence do well on IQ tests, appear “clever,” are able to solve preset problems, learn quickly, appear knowledgeable, are able to automate their expertise, and normally pass exams easily. While analytic intelligence—measured by classical IQ tests—is important, it is not always the most relevant. Sternberg gives an apocryphal example of a brilliant mathematician who appeared to have no insight at all into human relationships, to such an extent that when his wife left him unexpectedly, he was completely unable to come up with any meaningful explanation.

Doubts about the sufficiency of the classical notion of IQ have often been expressed in the business world. It has been pointed out that to do well on a test of analytic intelligence, a candidate must demonstrate the ability to solve problems set by others, such as in a test or examination. A successful business entrepreneur, however, needs to ask good questions rather than supply good answers. It is the ability to come up with good questions that characterizes the creative form of intelligence. People with practical intelligence are “streetwise.” They want to understand, can stand back and deconstruct a proposal or idea, will question the reasons for wanting to know, and can integrate their knowledge for a deeper purpose. Hence, they are often the most successful in life. They know the right people, avoid making powerful enemies, and are able to work out the unspoken rules of the game.

Howard Gardner (1983) also argued that there were multiple intelligences, each linked with an independent and separate system within the human brain. These were linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, and intrapersonal intelligence. Gardner emphasized the distinctive nature of some of these intelligences and their ability to operate independently of each other. For example, bodily-kinesthetic intelligence, representing excellence in sporting activities, is a new concept within the field, as are interpersonal intelligence and intrapersonal intelligence—representing the ability to understand other people and the ability to have insight into one’s own feelings, respectively.

The MSCEIT (Mayer–Salovey–Caruso Emotional Intelligence Test) is an ability-based test designed to measure emotional intelligence. Most tests that today carry the name of emotional intelligence are not in fact tests of intelligence at all, but rather assess personality traits associated with sensitivity to the feelings of others. The MSCEIT stands in contrast to these others in that there are predefined right and wrong answers in terms of, say, whether a person’s recognition of anger in someone else is correct or incorrect.

It has been argued that many of Sternberg’s and Gardner’s diverse forms of intelligence are not really new. Many correspond with traditional notions that were already tested within IQ tests. Aristotle himself emphasized the importance of distinguishing between wisdom and intelligence, while creativity tests have been around for about 70 years. Nevertheless, their approaches have had an important role in deconstructing the idea that academic success is the sole form of intellectual merit.
One result is that today, intelligence testing is conceptualized more broadly and is assessed in a more strategic manner. There is increased emphasis on a diversity of talents, enabling all to identify their strengths as well as areas that they are more likely to find challenging. Tests of specific forms of intelligence—such as numeracy, verbal proficiency,


and critical thinking—remain central to assessment for entry to specific professional training programs that depend on skills in these areas. In addition, broad-spectrum assessments of the many fundamental cognitive skills that underpin key learning processes play an increasing role in targeting specific remedial learning programs to those who most need them. For example, the Wechsler Intelligence Scale for Children, Fifth Edition (WISC-V), contains specific subtests of:

• Similarities, vocabulary, information, and comprehension (to assess verbal concept formation)
• Block design and visual puzzles (to assess visual spatial processing)
• Matrix reasoning, figure weights, picture concepts, and arithmetic (to assess inductive and quantitative reasoning)
• Digit span, picture span, and letter-number sequencing (to assess working memory)
• Coding, symbol search, and cancellation (to assess processing speed)

The WISC-V is in widespread use around the world by educational psychologists who work with children with learning difficulties and developmental conditions such as dyslexia, dyscalculia, and autism.

But we should not forget that while the idea of “g,” a single score of general intelligence, has fallen out of favor, the underlying concept of IQ has captured the popular imagination (as have astrological terms before it), and wishing it away is unlikely to be successful. It has entered everyday language, and it is surprising how many people think they know their own IQ, even though they are generally wrong. The idea is particularly likely to remain popular with those who have attained high scores on IQ tests. Societies such as MENSA have grown throughout the world, and perhaps we should not begrudge the pleasure and pride felt by their members.

Tests of other psychological constructs

Psychometric intelligence tests assess optimal performance, as do all tests of intelligence, ability, competence, or achievement. What these all have in common is that candidates are expected to achieve the highest score that they can and need to be sufficiently motivated to do so. Typically, in a classical psychometric test of this type, it is a simple question of how many of the questions they can answer correctly. The higher the number of correct answers, the higher the score. However, psychometric methods can also be applied to the assessment of other psychological characteristics in which people differ from each other, such as personality, integrity, interests, motivation, values, temperament, attitudes, and beliefs.

Personality

Psychometric personality testing, like psychometric intelligence testing, can trace its roots back to Galton—not just the statistical methods, but also his lexical hypothesis (Galton, 1884). Galton proposed that consequential differences between people should be reflected in language, and that the more important the difference, the more likely it is to be encoded as a single word: if we differ in some way, and such a difference matters, we will—sooner or later—come up with a word capturing this difference. This is why most languages have terms such as respectfulness, gregariousness, or intelligence. If, on the other hand, a particular trait does not vary much between people, or it varies yet is inconsequential, it is less likely to be


captured with a single term. This is why most languages do not have terms describing “the ability to count” (nearly everyone can do it) or “not having a favorite color” (perhaps surprisingly, that seems not to be an important individual characteristic). Galton identified about 1,000 such terms describing differences, based largely on his knowledge of English, German, and other European languages. Later, Allport and Odbert (1936) carried out a systematic survey of the English language and listed about 18,000 personality-descriptive terms, or words that could be used to describe a characteristic of a person. They grouped them into four categories: personal traits, temporary moods or activities, judgments of personal conduct, and capacities and talents. After laboriously omitting words that were rarely used or little understood, obvious synonyms, and “non-neutral” words such as “good” and “bad,” they were left with a list of approximately 4,500 “neutral” words that they then categorized according to their meaning to produce a smaller number of personality constructs.

Following this early work based on researchers’ intuition, psychologists began to apply a data-driven approach to grouping those words into categories. They employed factor analysis, a statistical procedure in which correlated variables are combined into factors, originally developed in the context of intelligence assessment (factor analysis is discussed in Chapter 4). One of the most influential factor-analytic theorists in this field, Raymond Cattell, asked people to rate their friends—and themselves—on 200 personality-descriptive terms from the Allport and Odbert list. By analyzing those ratings using factor analysis, Cattell discovered that people were not described by random sets of personality-descriptive terms, but that there were clear patterns. For example, people rated as “warm” were often rated as “easygoing” and “gregarious,” and rarely by words such as “reserved,” “cool,” or “impersonal.” Overall, the factor analysis conducted by Cattell produced the 16 personality factors (also known as the 16PF) shown in Table 1.1.

Hans Eysenck (1967), a prominent personality theorist, also used factor analysis, but instead of the 16 factors favored by Cattell, he argued that the structure of personality is more usefully described by two dimensions that he called neuroticism and extraversion. The dimension of neuroticism represents the difference between people who are anxious and moody at one end and calm and carefree at the other, whereas extraversion distinguishes between those who are sociable and like parties (extroverts) and those who are quiet, introspective, and reserved (introverts). Eysenck’s model did not contradict Cattell’s: in fact, Cattell’s 16PF can be further reduced to produce Eysenck’s two factors. It is largely a matter of personal preference whether a researcher will opt for a larger number of factors, to give a wider description of personality, or a smaller number of more robust factors. However, in recent years, psychologists have come to favor five personality factors: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. These have become ubiquitous in the literature and are generally referred to as the “OCEAN” or “five-factor” model. This model is discussed in more detail in Chapter 7.

Integrity

In many occupational settings, it is the integrity of the job applicant, rather than their personality per se, that is of interest. A lack of integrity in this context might refer to drug taking, previous criminal activity, or barefaced lying concerning a previous work record or the possession of a qualification. The subject of truth-telling has long been a primary concern of legal, police, and security systems all over the world, and during the past century, lie-detector tests using a polygraph entered widespread use among these services.

Table 1.1 Most highly related personality-descriptive terms from Cattell’s 16PF

Warmth
  Negatively related: Impersonal, distant, cool, reserved, detached, formal, aloof
  Positively related: Warm, outgoing, attentive to others, kindly, easygoing, participative, likes people

Reasoning
  Negatively related: Concrete-thinking, less intelligent, lower general mental capacity, unable to handle abstract problems
  Positively related: Abstract-thinking, more intelligent, bright, higher general mental capacity, fast learner

Emotional Stability
  Negatively related: Reactive emotionally, changeable, affected by feelings, emotionally less stable, easily upset
  Positively related: Emotionally stable, adaptive, mature, faces reality calmly

Dominance
  Negatively related: Deferential, cooperative, avoids conflict, submissive, humble, obedient, easily led, docile, accommodating
  Positively related: Dominant, forceful, assertive, aggressive, competitive, stubborn, bossy

Liveliness
  Negatively related: Serious, restrained, prudent, taciturn, introspective, silent
  Positively related: Lively, animated, spontaneous, enthusiastic, happy-go-lucky, cheerful, expressive, impulsive

Rule Consciousness
  Negatively related: Expedient, nonconforming, disregards rules, self-indulgent
  Positively related: Rule-conscious, dutiful, conscientious, conforming, moralistic, staid, rule-bound

Social Boldness
  Negatively related: Shy, threat-sensitive, timid, hesitant, intimidated
  Positively related: Socially bold, venturesome, thick-skinned, uninhibited

Sensitivity
  Negatively related: Utilitarian, objective, unsentimental, tough-minded, self-reliant, no-nonsense, rough
  Positively related: Sensitive, aesthetic, sentimental, tender-minded, intuitive, refined

Vigilance
  Negatively related: Trusting, unsuspecting, accepting, unconditional, easy
  Positively related: Vigilant, suspicious, skeptical, distrustful, oppositional

Abstractedness
  Negatively related: Grounded, practical, prosaic, solution-oriented, steady, conventional
  Positively related: Abstract, imaginative, absent-minded, impractical, absorbed in ideas

Privateness
  Negatively related: Forthright, genuine, artless, open, guileless, naive, unpretentious, involved
  Positively related: Private, discreet, nondisclosing, shrewd, polished, worldly, astute, diplomatic

Apprehension
  Negatively related: Self-assured, unworried, complacent, secure, free of guilt, confident, self-satisfied
  Positively related: Apprehensive, self-doubting, worried, guilt-prone, insecure, worrying, self-blaming

Openness to Change
  Negatively related: Traditional, attached to the familiar, conservative, respecting traditional ideas
  Positively related: Open to change, experimental, liberal, analytical, critical, freethinking, flexible

Self-Reliance
  Negatively related: Group-oriented, affiliative, a joiner and follower, dependent
  Positively related: Self-reliant, solitary, resourceful, individualistic, self-sufficient

Perfectionism
  Negatively related: Tolerates disorder, unexacting, flexible, undisciplined, lax, self-conflict, impulsive, careless of social rules, uncontrolled
  Positively related: Perfectionistic, organized, compulsive, self-disciplined, socially precise, exacting will power, control, self-sentimental

Tension
  Negatively related: Relaxed, placid, tranquil, torpid, patient, composed, low drive
  Positively related: Tense, high-energy, impatient, driven, frustrated, overwrought, time-driven


Such devices assess physiological activity such as heart rate, blood pressure, respiration rate, and galvanic skin response during an interrogation. However, the clear ethical and privacy concerns involved in the use of these techniques have led to them being made illegal in most countries, at least within employment settings. Where they were previously in use, they have generally been replaced by psychometric integrity tests based on self-report data. Such tests are themselves particularly vulnerable to dishonesty; however, several strategies can keep the effects of this to a minimum (see Chapter 7).

Interests

Psychometric tests of interests were first introduced for use in career guidance in 1927 by Edward Kellogg Strong Jr., then at the Carnegie Institute of Technology. He believed that “the relationship among abilities, interests and achievements may be likened to a motorboat with a motor and a rudder. The motor (abilities) determine how fast the boat can go, the rudder (interests) determine which way the boat goes” (Strong, 1943). His Strong Interest Inventory and variants thereof are still in use today, assessing interest in occupations, subject areas, activities, leisure activities, people, and personal characteristics.

In 1956, John Holland introduced a system based on the idea that occupational preference is a veiled expression of underlying character. This more personality-based framework was updated in the 1990s, when the Holland Codes, sometimes called Holland Occupational Themes, became standard. The Holland Codes (RIASEC) refer to his personality types of realistic, investigative, artistic, social, enterprising, and conventional.

Motivation

What motivates both humans and animals to act as they do is a central question in psychology. We may be motivated by basic needs such as the need to quench our thirst or relieve our hunger. However, when it comes to the psychometric assessment of motivation, we are nearly always talking about motivation in the work environment. In an occupational setting, some people are more motivated by a need for interesting work than by a need for money. For others, their primary motivation is a need for recognition of their achievements and the possibility of promotion.

Several theories have been drawn on in support of assessments of employee motivation, Abraham Maslow’s hierarchy of needs being one of them. It is suggested that the basic need of the employee is to survive, for which they need money, so that cash in this case (rather than food and drink as in Maslow’s original model) comes first. Once this is satisfied, they need safety (perhaps adequate accommodation), and only after these two have been satisfied do friendship, belonging, and career ambition come into play. The Atkinson and McClelland model of achievement motivation, first introduced in 1953, specifies motivational needs at work as the need for achievement, the need for authority, and the need for affiliation. In the field of industrial and organizational psychology, there are many different motivational questionnaires available.

Values

A value is an enduring belief about what should be important in our lives and how people should behave. Our values influence our choice of occupation. For example, a


person who is deeply concerned about the effects of climate change is unlikely to apply for a job in a coal-fired power station. Values questionnaires were first developed in a cross-cultural framework by Geert Hofstede, who between 1967 and 1973 examined international differences among IBM employees, identifying the four dimensions of individualism vs. collectivism, uncertainty avoidance, strength of social hierarchy, and task vs. person orientation. Later, the dimensions of long-term vs. short-term orientation and indulgence vs. restraint were added. Subsequently, Shalom Schwartz developed his Theory of Basic Human Values, first published in 1992, focusing on values that he suggested may be universal: self-direction, stimulation, hedonism, achievement, power, security, conformity, tradition, benevolence, and universalism. These can be assessed with the Schwartz Value Survey. It is noteworthy that there is considerable conceptual overlap between the idea of a value and that of a motivator.

Temperament

Although the term “temperament” was formerly used by theorists such as Gordon Allport to refer to those aspects of adult personality that were considered to be largely inherited (such as impulsiveness), it is more commonly used in relation to infants and young children, and refers to their tendency to behave in consistent ways across situations. Developmental psychologists such as Jerome Kagan saw temperament as a genetic predisposition because differences in characteristics such as activity level and emotionality can be observed in newborn infants. Beginning in the 1950s, Alexander Thomas, Stella Chess, and their colleagues initiated the New York Longitudinal Study, which identified nine temperament characteristics that affect how well a child fits in with school, friends, and family. They later suggested three types of temperament in children: the “easy child” has a positive mood, can quickly establish regular routines from infancy, and easily adapts to new experiences; the “difficult child” reacts negatively and frequently cries, is irregular in daily routines, and is slow to accept new experiences; and the “slow-to-warm-up child” has a low activity level, is somewhat negative in mood, and is slow to adapt. These are assessed through attention to individual differences in emotion, motor reactivity, self-regulation, and consistency in behavior.

Attitudes

The term “attitude” refers to how much a person likes or dislikes an object, person, or idea—or, to put it another way, our positive or negative feelings toward an attitude object. As there are so many different objects, people, and ideas toward which people can have attitudes, there are an enormous number of possible attitude scales. Hence discussion on attitude questionnaires focuses on methodology rather than the object of measurement. Usually there is a set of questions, each with response options along the agree/disagree continuum, a system based on the work of Rensis Likert in the 1930s. Today, these are referred to as Likert scales. In the 1950s, Charles Osgood introduced the semantic differential as an alternative method of assessing attitudes, based on George Kelly’s repertory-grid technique for the assessment of personal constructs. These days, attitude measurement is a mass industry, with opinion polls playing a major role in fields ranging from politics to economic policy. However, when


measuring attitude, you will usually be starting from scratch and need to develop the questionnaire yourself.

Beliefs

The term “belief” refers to the cognitive component of an attitude, i.e., what a person assumes to be true. Examples are the belief in religion or the belief that a placebo drug can cure an illness. Beliefs may be held with varying degrees of certainty. They are distinct from attitudes in that they do not include an affective component, i.e., they do not include liking or disliking. In the 1980s, Albert Bandura, already well known for his work on social learning theory, introduced the concept of self-efficacy—belief in one’s own effectiveness, or, put another way, belief in oneself—and this has become an important concept in a variety of domains. People with high self-efficacy belief are more likely to persevere with and complete tasks, whether they be work-performance-related or self-directed, such as following a health regime. Those with low self-efficacy are more likely to give up and less likely to prepare effectively. Today there are many different self-belief scales available, particularly around health beliefs, but again these tend to be targeted at specific applications.

Summary

In the 20th century, the reputation of psychometrics suffered considerable damage twice over: first, directly from the enthusiasm with which its testing procedures were taken up by the eugenics movement, and second, from that movement’s successors, who continued to argue that political and policy decisions should be made under the assumption that an intelligence level was part of each race’s genetic heritage—a position that the Flynn effect so soundly showed to be unwarranted. Psychometrics is both a pure and an applied science, but it is also a science that can all too easily be misapplied.

Today, the use of other forms of psychometric testing in education, recruitment, and marketing—involving the whole person rather than just skills and character—is increasing in society, and it is important that all issues associated with this phenomenon be properly and objectively evaluated if we wish procedures to be efficient and fair. Many believe that there should be no testing at all, but this is based on a misunderstanding that has arisen from attempting to separate the process and instrumentation of testing from its function. It is the function of testing that determines its use, and this function derives from the need in any society to select and assess individuals within it for a job or for entry into a university. Given that selection and assessment exist, it is important that they be carried out as accurately as possible and that they be studied and understood. Psychometrics can be defined as the science of this selection and evaluation process in human beings.

But now we must realize that the ethics, ideology, and politics of this selection and assessment are as much an integral part of psychometrics as are data science and psychology. Any science dealing with selection is also, by default, dealing with rejection, and is therefore intrinsically political. It is essential that psychometrics be understood in this way if we wish to apply proper standards and controls to the current expansion in test use, and thus develop a more equitable society. It is one matter to question testing; it is quite another to question the psychometric concepts of reliability and validity on the grounds that these were developed by eugenicists. It would be just as irrational to dismiss the theory of


evolution and consequently most of modern biology! The techniques developed by Darwin, Galton, Spearman, Cattell, and their followers have made a major contribution to our understanding of the principles of psychometrics.

In this third decade of the 21st century, given the rise of cyberspace and the enormous power of psychographics as a tool for control within it, we must ensure that we fully anticipate the possibility of unintended consequences. It is time for urgent remedial action in the form of regulation. But what should be regulated? The right to privacy, of course, but this is only part of the problem. In cyberspace, our wishes, desires, and psychological vulnerabilities can be targeted all too readily on the basis of the data trail that we leave behind. Broadcast messages are filtered through our digital footprint so that we receive only those that are deemed relevant—perhaps to our own interests, but more likely to the interests of others. All this can be done at scale and in real time, with no one else knowing the microtargeted messages that we receive or are likely to receive. This filtering, determined by machine-learning algorithms trained on our input, determines what we see or hear in the online world; artificial intelligence builds models that enable the information that we receive to become increasingly targeted, so that our filter bubble becomes increasingly controlling.

Proper regulation of the application of psychometrics for psychographic purposes clearly needs to be established. But how? One thing is certain: the world will never be the same again. Let us not fear the future, but rather seek ways in which we can influence it for the better. For now, the machines should be our servants, not our masters … and maybe, one day, they’ll be our colleagues.

2 Constructing your own psychometric questionnaire

When you see a questionnaire in a popular magazine or newspaper, it is often no more than a series of items that are not necessarily related to each other and that are scored and interpreted individually. This chapter is a guide to the construction of psychometric questionnaires, where items can be combined to produce an overall scale. Questionnaires are used to measure a wide variety of attributes and characteristics. The most common examples are knowledge-based questionnaires—i.e., questionnaires of ability, aptitude, and achievement—and person-based questionnaires, i.e., questionnaires of personality, clinical symptoms, mood, and attitudes. Whatever type of questionnaire you wish to develop, this guide will take you through the main stages of construction and show you how to tailor your questionnaire to its particular purpose. Throughout the guide, the construction of the Golombok Rust Inventory of Marital State (GRIMS; Rust et al. 1988) will be described (in italics) as a practical example.

The purpose of the questionnaire

The first step in developing a questionnaire is to ask yourself: “What is it for?” Unless you have a clear and precise answer to this question, your questionnaire will not tell you what you want to know.

With the GRIMS, we wanted to develop a questionnaire to assess the quality of the relationship of couples who are married or living together. We intended that the GRIMS would be of use in research, to help therapists or counselors either evaluate the effectiveness of therapy for couples with relationship problems or investigate the impact of social, psychological, medical, or other factors on a relationship. In addition, we hoped that it would be used clinically as a quick and easy-to-administer technique for identifying the severity of a problem, for finding out which partner perceives a problem in the relationship, and for identifying any improvement or lack of improvement in either partner or both partners over time.

Write down clearly and precisely the purpose of your questionnaire.

Making a blueprint

A blueprint, sometimes known as the test specification, is a framework for developing the questionnaire. A grid structure is generally used with content areas along the horizontal axis and manifestations (ways in which the content areas may manifest) along the vertical axis (see Table 2.1). For practical reasons, either four or five categories are usually employed along each axis. Fewer often results in too narrow a questionnaire, and more can be too cumbersome to deal with.

Table 2.1 A test blueprint with four content areas and four manifestations

                             Content Areas
                        A       B       C       D
Manifestations    A
                  B
                  C
                  D

Content areas

A clear purpose will enable you to specify the content of your questionnaire. The content areas should cover everything that is relevant to the purpose of the questionnaire.

The many different ideas about what constitutes a good or bad relationship posed a problem when we tried to specify the content areas of the GRIMS. For this reason, we used the expertise of relationship therapists/counselors and their clients. The therapists/counselors were asked to identify areas that they believed to be important in marital harmony as well as the areas that they would assess during initial interviews. Information from clients was obtained by asking them to identify their targets for change. The views of these experts were collated to provide the following content areas that were generally considered to be important for assessing the state of a relationship: (i) interests shared (work, politics, friends, etc.) and degree of dependence and independence; (ii) communication (verbal and nonverbal); (iii) gender; (iv) warmth, love, and hostility; (v) trust and respect; (vi) roles, expectations, and goals; (vii) decision-making; and (viii) coping with problems and crises.

Write down the content areas to be covered by your questionnaire. If these are not clear-cut, consult experts in the field.

Manifestations

The ways in which the content areas may manifest themselves will vary according to the type of questionnaire under construction. For example, questionnaires designed to measure educational attainment may use Bloom’s (1956) taxonomy of educational objectives to tap into different forms of knowledge. For questionnaires that are more psychological in nature, behavioral, cognitive, and affective manifestations of the content areas may be more appropriate. For personality questionnaires, you will need to balance socially desirable and socially undesirable aspects of the trait, as well as control for acquiescence. The latter is achieved by allowing half of the items to manifest positively (e.g., “I am outgoing” in an extraversion scale) and half to manifest negatively (e.g., “I am shy” in an extraversion scale). In specifying manifestations, it is important to ensure that different aspects of the content areas will be elicited.

In constructing the GRIMS, we again took account of the experts’ information to obtain the following manifestations: (i) beliefs about, insight into, and understanding of the nature of dyadic relationships; (ii) behavior within the actual relationship; (iii) attitudes and feelings about relationships; (iv) motivation for change, understanding the possibility of change, and commitment to a future together; and (v) extent of agreement within the couple.


As you can see from the GRIMS blueprint, what is described as a content area and what is described as a manifestation may not always be clear-cut.

Write down ways in which the content areas of your questionnaire may become manifest. You will now be able to construct your blueprint. The number of cells will be the number of content areas multiplied by the number of manifestations. Between 16 and 25 cells (i.e., 4 × 4, 4 × 5, 5 × 4, or 5 × 5) is generally considered ideal for sufficient breadth while maintaining manageability. Draw your blueprint, labeling each content area (column) and each manifestation (row).

Each cell in the blueprint represents the interaction of a content area with a manifestation of that content area. By writing items for your questionnaire that correspond to each cell of the blueprint, you will ensure that all aspects that are relevant to the purpose of your questionnaire are covered. A decision that has to be made when designing the blueprint is whether to give different weightings to the cells, i.e., whether to write more items for some cells than for others. This will depend on whether or not you feel that some content areas or manifestations are more important than others. In the blueprint in Table 2.2, it has been decided that content area A should receive a weighting of 40%, content area B a weighting of 40%, content area C a weighting of 10%, and content area D a weighting of 10%. For the manifestations, a weighting of 25% has been allocated to each. For the GRIMS, equal weightings were assigned to each cell, as we had no reason to believe that any of the content areas or manifestations were more important than the others.

Assign percentages to each content area of your blueprint so that the total of the percentages across the content areas adds up to 100%. Assign percentages to each manifestation in your blueprint so that the total of the percentages down the manifestations adds up to 100%. Insert these percentages into your blueprint.

Assigning weightings will tell you what proportion of all items in the questionnaire should be written for each cell. The next step is to decide on the total number of items to include. You must consider factors such as the size of your blueprint (a large blueprint with many content areas and manifestations will need a greater number of items than a small one) and the amount of time available for administering the questionnaire. There is no point in asking people with little time to spare to complete a lengthy inventory, as the quality of their responses will be poor and items may be omitted. The characteristics of the respondents are also important. Children and people who are older or have a physical or mental illness may be slow and unable to maintain concentration.

Table 2.2 Assignment of percentages of items to content areas (columns) and manifestations (rows)

                             Content Areas
                        A (40%)   B (40%)   C (10%)   D (10%)
Manifestations    A (25%)
                  B (25%)
                  C (25%)
                  D (25%)


Although it is important to include a sufficient number of items to ensure high reliability, compliance among respondents is crucial, and a balance must be struck between the two. A minimum of 12 items per scale is usually required to achieve adequate reliability. In the plan, however, a minimum of 20 items should be aimed for, and a fairly straightforward questionnaire of this length should take an average respondent no longer than six minutes to complete. As it is necessary to construct a pilot version of your questionnaire in the first instance, you must remember to allow for at least 50% more items in the blueprint than you intend to include in the final version.

The GRIMS was intended as a short questionnaire for use with both distressed and nondistressed couples. As we hoped to achieve a final scale of about 30 items, we planned a pilot version with 100 items.

Decide on how many items to include in the pilot version of your questionnaire by taking into account the desired number of items in the final version, the size of your blueprint, the time available for testing, and the characteristics of the respondents.

Once you have assigned weightings to the cells and decided on the total number of items that you require for your pilot questionnaire, you will be able to work out how many items to write for each cell. The blueprint in Table 2.3, with given weightings, shows the number of items that have to be written for each cell to obtain a pilot questionnaire with 80 items. The first step is to work out the total number of items for each content area and for each manifestation. The blueprint specifies that 40% of the items (32 items) should be on content area A, 40% (32 items) on content area B, 10% (eight items) on content area C, and 10% (eight items) on content area D. These numbers are entered into the bottom row of the blueprint. Similarly, the blueprint specifies that 25% of the items (20 items) should concern each of the manifestations, and this is entered into the right-hand column of the blueprint.

To calculate the number of items in each cell of the blueprint, multiply the total number of items in a content area by the percentage assigned to the manifestation in each row. For example, the number of items for the top left-hand cell (content area A/manifestation A) is 25% of 32 items, which is eight items. The number of items to be written for each cell is calculated in the same way. If you do not obtain an exact number of items for a cell, approximate to the number above or below while trying to maintain the same total number of items as you had originally planned. The 100 items in the equally weighted 40-cell GRIMS blueprint allowed two or three items per cell.

Enter the number of items to be written for each cell into your blueprint. A short script for carrying out this calculation is sketched after Table 2.3.

Table 2.3 Assignment of number of items per cell, per content area, and per manifestation to a test blueprint

                                  Content Areas
                        A (40%)  B (40%)  C (10%)  D (10%)   No. of Items
Manifestations   A (25%)    8        8        2        2          20
                 B (25%)    8        8        2        2          20
                 C (25%)    8        8        2        2          20
                 D (25%)    8        8        2        2          20
No. of Items               32       32        8        8          80
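The cell counts in Table 2.3 can also be computed rather than worked out by hand. The following sketch is a minimal illustration in Python, using the weightings and the 80-item pilot total from Table 2.3; the variable names are purely illustrative, and the use of simple rounding is an assumption (the text above only asks you to approximate up or down while keeping the planned total).

# Minimal sketch: allocate pilot items to blueprint cells from the weightings.
# Weights follow Table 2.3; the rounding rule is an assumption.

content_weights = {"A": 0.40, "B": 0.40, "C": 0.10, "D": 0.10}
manifestation_weights = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
total_items = 80  # planned size of the pilot questionnaire

# Column totals (items per content area) and row totals (items per manifestation)
column_totals = {c: round(w * total_items) for c, w in content_weights.items()}
row_totals = {m: round(w * total_items) for m, w in manifestation_weights.items()}

# Items per cell: the content-area total multiplied by the manifestation weighting
cells = {
    (m, c): round(column_totals[c] * manifestation_weights[m])
    for m in manifestation_weights
    for c in content_weights
}

print("Column totals:", column_totals)
print("Row totals:", row_totals)
for (m, c), n in sorted(cells.items()):
    print(f"Manifestation {m} x Content area {c}: {n} items")
print("Planned total:", sum(cells.values()))

Run on the Table 2.3 weightings, this reproduces the eight/eight/two/two pattern in each row and a planned total of 80 items.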


Writing items

There are several types of items that are used in questionnaires, the most common of which are alternate-choice, multiple-choice, and rating-scale items. Different item types are suitable for different purposes, and consideration of the attribute or characteristic that you wish your questionnaire to measure will guide you toward an appropriate choice.

Alternate-choice items

An item for which the respondent is given two choices from which to select a response, e.g., true or false, yes or no. Most commonly used in knowledge-based questionnaires, e.g., “Bogotá is the capital of Colombia: true or false?” Sometimes used in personality questionnaires, e.g., “I never use a lucky charm: yes or no?” Generally considered inappropriate for clinical-symptom, mood, or attitude questionnaires—but used occasionally.

Advantages

Good for assessing knowledge of facts and comprehension of materials presented in the question. Fast and easy to use.

Disadvantages

For ability, aptitude, and achievement items, the correct response is often not clear-cut, i.e., completely true or completely false. Another problem is that the respondent has a 50% chance of obtaining the correct response by guessing. For personality, clinical-symptom, mood, and attitude questionnaires, there are no right or wrong answers. However, respondents often consider the narrow range of possible responses to be too restrictive.

Multiple-choice items

An item for which the respondent is given more than two choices from which to select a response. It consists of two parts: (i) the stem—a statement or question that contains the problem; and (ii) the options—a list of possible responses, of which one is correct or the best and the others are distractors. Often four or five possible responses are used to reduce the probability of guessing the answer. The most widely used item type in knowledge-based questionnaires, for example:

What is the capital of Colombia?
A. La Paz
B. Bogotá
C. Lima
D. Santiago

Not used in person-based questionnaires.

Advantages

Well suited to the wide variety of materials that may be presented in ability, aptitude, and achievement questionnaires. Challenging items that are easy to administer and score can


be constructed. The effects of guessing are also reduced with multiple-choice items. For example, an item with five options gives someone a 20% chance of guessing the correct answer, compared with 50% in alternate-choice items.

Disadvantages

Time and skill are needed to write good multiple-choice items. A common problem is that not all of the options are effective, i.e., they are so unlikely to be correct that they do not function as possible options. This can reduce what is intended as a five-choice item to a three- or four-choice item, or even to an alternate-choice item.

Rating-scale items

An item for which the possible responses lie along a continuum, e.g., yes, don’t know, no; true, uncertain, false; strongly disagree, disagree, agree, strongly agree; always, sometimes, occasionally, hardly ever, never. Up to seven options are generally used, as it is difficult for respondents to differentiate meaningfully among more than that number. Although rating-scale items are similar to multiple-choice items in giving several response options, the options in rating scales are ranked, whereas multiple-choice item options are independent of each other. Not used in knowledge-based questionnaires. The most widely used item type in person-based questionnaires, for example:

I am not a superstitious person
A. strongly disagree
B. disagree
C. agree
D. strongly agree

Advantages

Respondents feel able to express themselves more precisely with rating-scale items than with alternate-choice items.

Disadvantages

Respondents differ in their interpretations of the response options, e.g., “frequently” has a different meaning to different individuals. Some respondents tend always to choose the most extreme options. When an odd number of response options is used, many respondents tend to choose the middle one, e.g., “don’t know” or “occasionally.”

The type of options should be chosen to suit the material to be presented in the questionnaire. There are no fixed rules about which type of options is best. A personality or mood questionnaire might require responses in terms of the options “not at all,” “somewhat,” and “very much.” Attitude questionnaires generally consist of statements about an attitude object followed by the options “strongly agree,” “agree,” “uncertain,” “disagree,” and “strongly disagree.” For clinical-symptom questionnaires, you might find that options relating to the frequency of occurrence—such as “always,” “sometimes,” “occasionally,” “hardly ever,” and “never”—are the most suitable.


The most appropriate number of options to choose from will also depend on the nature of the questionnaire. It is important to provide a sufficient number for respondents to feel able to express themselves adequately while ensuring that there are not so many that they have to make meaningless discriminations. In questionnaires using rating-scale items, where strength of response should be reflected in the respondent’s score, it is common for at least four options to be used. It is sometimes necessary to use different types of items in a questionnaire because of the nature of the material to be included. However, it is preferable to use only one item type wherever possible, to produce a neatly presented questionnaire.

Rating-scale items are the most appropriate for a scale of relationship state. The GRIMS items are presented as statements to which the respondents are asked to respond with “strongly agree,” “agree,” “disagree,” or “strongly disagree.” This spread of options allows strength of feeling to affect scores. The items are forced choice, i.e., there is no “don’t know” category.

Decide which item type is most appropriate for your questionnaire. In general, multiple-choice items are best for knowledge-based questionnaires, and rating-scale items are best for person-based questionnaires unless you have good reason, such as speed or simplicity, for choosing alternate-choice items. A good method for deciding which to choose is to try to construct items of each type using different options. The most appropriate choice for your questionnaire will soon become clear. Before beginning to write items for your questionnaire, read the following summary of important points to remember.
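If you are drafting items on a computer, it can help to keep each item’s wording, response options, and blueprint cell together in a simple machine-readable form from the start. The Python sketch below is one possible way of doing this; the field names, the blueprint-cell labels, and the way the example items are assigned to cells are purely illustrative and not taken from any published instrument.

# Minimal sketch: draft items kept in machine-readable form.
# Field names, cell labels, and item assignments are illustrative only.

from dataclasses import dataclass, field

@dataclass
class RatingItem:
    text: str
    blueprint_cell: tuple            # (content area, manifestation)
    reverse_scored: bool = False
    options: list = field(default_factory=lambda: [
        "strongly disagree", "disagree", "agree", "strongly agree"])

@dataclass
class MultipleChoiceItem:
    stem: str
    options: list
    correct: int                     # index of the correct option
    blueprint_cell: tuple

draft_items = [
    RatingItem("I am not a superstitious person", ("A", "B")),
    MultipleChoiceItem("What is the capital of Colombia?",
                       ["La Paz", "Bogotá", "Lima", "Santiago"],
                       correct=1, blueprint_cell=("C", "A")),
]

# The chance of a correct guess falls as options are added: 50% for an
# alternate-choice item, 25% for the four-option item above, 20% with five options.
for item in draft_items:
    if isinstance(item, MultipleChoiceItem):
        print(item.stem,
              f"(chance of guessing correctly: {1 / len(item.options):.0%})")

Keeping items in this form also makes the later steps, such as checking keying balance and scoring, straightforward to automate.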

All questionnaires

Make sure that your items match your blueprint. The allocation of items to specific cells may become a bit fuzzy, as some items may be appropriate for more than one cell. If you find that some cells are inappropriate and you decide to omit them, do not do so without proper consideration. Remember, however, that the blueprint is a guide and not a straitjacket.

Write each item clearly and simply. Avoid irrelevant material, and keep the options as short as possible. Each item should ask only one question or make only one statement. Where possible, avoid subjective words such as “frequently,” as these may be interpreted differently by different respondents. It is also important that all options function as feasible responses, i.e., that none be clearly wrong or irrelevant and therefore unlikely to be chosen. After writing your items, read them again a few days later. Also, ask a colleague to look at them to ensure that they are easily understood and unambiguous.

Knowledge-based questionnaires

Make sure that alternate-choice items can unambiguously be classified as true or false; otherwise some respondents will think of exceptions to the rule. For multiple-choice items, ensure that each item has only one correct or best response. Ideally, each distractor option should be used equally by respondents who do not choose the correct response. Remember that the more similar the options, the more difficult the item.

Person-based questionnaires

Sometimes respondents will complete a questionnaire in a certain way irrespective of the content of the items.

Acquiescence

Acquiescence is the tendency to agree with items regardless of their content. This can be reduced by ensuring that an equal or almost equal number of items is scored in each direction. To do this, it is usually necessary to reverse some of the items. For example, the item “I am satisfied with our relationship” can be reversed to “I am dissatisfied with our relationship.” When reversing items, it is important to check that the reversed item really does mean the opposite of the original item. It is best to avoid double-negative statements, as these cause confusion. Acquiescence is less likely to occur with items that are clear, unambiguous, and specific.

Social desirability

Social desirability is the tendency to respond to an item in a socially acceptable manner. This can be reduced by excluding items that are clearly socially desirable or undesirable. If this is unavoidable due to the nature of your questionnaire, try to ask the question indirectly to evoke a response that is not simply a reflection of how the respondent wishes to present themselves. For example, an item to measure paranoia may be subtly phrased as “there are some people whom I trust completely” rather than “people are plotting against me.” Social desirability can also be reduced by asking respondents to give an immediate response rather than a careful consideration of each item.

Indecisiveness

Indecisiveness is the tendency to use the “don’t know” or “uncertain” option. This is a common problem that can easily be eliminated by omitting the middle category. It is advisable to do so unless respondents are likely to become irritated by items that they feel are unanswerable.

Extreme response

Extreme response is the tendency to choose an extreme option regardless of direction. Some respondents will use one direction for a series of items and then switch to the other direction, and so on. Again, this can be reduced by the use of clear, unambiguous, and specific items. It is important to bear in mind these habitual ways of responding when writing items. However, a careful item analysis will eliminate items that are biased toward a particular response.

Examples of GRIMS items: “We both seem to like the same things” was written for the blueprint cell representing content area (i) and manifestation (ii).


“I wish there was more warmth and affection between us” was written for the blueprint cell representing content area (iv) and manifestation (iv).

Write each of your items on a small card so that you can easily make changes to wording and ordering. To order the items for your questionnaire, pick an interesting and unthreatening item to start with, and then shuffle the cards to randomize the rest. Make adjustments if too many similar-looking items occur together. For knowledge-based questionnaires that have items of increasing difficulty, order the items from easy to hard.
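If your draft items are already held in machine-readable form, the keying balance recommended under “Acquiescence” and the ordering rule above can be checked automatically. The following Python sketch is illustrative only: the items are examples quoted earlier in this chapter, and the reverse-scoring flags attached to them are assumptions made purely for the demonstration.

import random

# Illustrative draft items; the reverse_scored flags are assumptions.
draft_items = [
    {"text": "We both seem to like the same things", "reverse_scored": False},
    {"text": "I am satisfied with our relationship", "reverse_scored": False},
    {"text": "I am dissatisfied with our relationship", "reverse_scored": True},
    {"text": "I wish there was more warmth and affection between us",
     "reverse_scored": True},
]

# Acquiescence check: roughly half of the items should be keyed in each direction.
n_reversed = sum(item["reverse_scored"] for item in draft_items)
print(f"{n_reversed} of {len(draft_items)} items are reverse scored")

# Ordering: keep an interesting, unthreatening opener first and randomize the rest.
opener, rest = draft_items[0], draft_items[1:]
random.shuffle(rest)
for position, item in enumerate([opener] + rest, start=1):
    print(position, item["text"])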

Designing the questionnaire

Good design is crucial for producing a reliable and valid questionnaire. Respondents feel less intimidated by a questionnaire that has a clear layout and is easy to understand, and take their task of completing the questionnaire more seriously.

Background information

Include headings and sufficient space for the respondent to fill out their name, age, gender, or whatever other background information you require. It is often useful to obtain the date on which the questionnaire is completed, especially if it is to be administered again.

Instructions

The instructions must be clear and unambiguous. They should tell the respondent how to choose a response and how to indicate the chosen response in the questionnaire. Other relevant instructions should be given, e.g., respond as quickly as possible, respond to every item, or respond as honestly as possible. Information that is likely to increase compliance—e.g., regarding confidentiality—should be stressed.

Sample instructions for a knowledge-based multiple-choice questionnaire:

INSTRUCTIONS: Each item is followed by a choice of possible responses: A, B, C, D, or E. Read each item carefully and decide which choice best answers the question. Indicate your answer by circling the letter corresponding to your choice. Your score will be the number of correct answers, so respond to each question even if you are unsure of the correct answer.

Sample instructions for a person-based rating-scale questionnaire:

INSTRUCTIONS: Each statement is followed by a series of possible responses: strongly disagree, disagree, agree, or strongly agree. Read each statement carefully and decide which response best describes how you feel. Then put a tick over the corresponding response. Please respond to every statement. If you are not completely sure which response is most accurate, tick the response that you feel is most appropriate. Do not spend too long on each statement. It is important that you answer each question as honestly as possible. ALL INFORMATION WILL BE TREATED WITH THE STRICTEST CONFIDENCE.

Layout

The following tips will help you to arrange items on the page so that they are easy to read:

(a) Number each item.
(b) Keep each line short, with no more than 10 or 12 words per line.
(c) Ensure that the items produce a straight vertical margin down the left-hand side of the page.
(d) Arrange the response options to produce a straight vertical margin down the right-hand side of the page. Insert headings at the top and symbols next to each item. There should be a clear visual relationship between each item and its response options. This can be done by inserting a dotted line from the item stem to its response options, for example:

    1. ______________________   SD   D   A   SA
    2. ______________________   SD   D   A   SA
    3. ______________________   SD   D   A   SA
    4. ______________________   SD   D   A   SA
    5. ______________________   SD   D   A   SA

    (SD = strongly disagree, D = disagree, A = agree, SA = strongly agree)

(e) Separate each item with a space rather than a horizontal line. If your items, instructions, and background information all fit on one page, that is good. However, it is better to produce a neat two- or three-page questionnaire than one page that looks cramped.
(f) If using more than one type of item, group similar items together. Each type will need different instructions and response options.
(g) Remember that different PCs, laptops, and smartphones have different layouts, and it is important for your questionnaire to look good on all of them. Having the questionnaire printed using a high-quality printer is a sensible practice in case anyone still wants to use pencil-and-paper administration. Ensure that, whatever the medium, the type is large enough to be read easily. Use your computing skills creatively to plan the layout. Experiment with different fonts, colors, type sizes, and spacings to see which look best.
(h) You can use design as a tool to portray or disguise the purpose of your questionnaire. For example, small, closely set type can make a questionnaire look very formal, while larger type with items spaced well apart on colored paper is friendlier. Design can set an atmosphere, so use it!

The GRIMS was designed with simplicity of administration in mind. The respondent must answer 28 questions on one scrollable page, with the same response options for each question. This makes it quick and uncomplicated to complete.


Try different layouts of your questionnaire using different media until the arrangement looks logical. Then experiment with font, color, type size, spacing, and number of pages to see what looks best.

To score your questionnaire, allocate a score to each response option, and then add up the scores for all the items to give a total score for the questionnaire. For knowledge-based questionnaires, it is common to give the correct or best option for each item a score of 1 and the distractor options a score of 0. The higher the total score, the better the performance. For person-based questionnaires, scores should be allocated to response options according to a continuous scale, e.g., always = 5, usually = 4, occasionally = 3, hardly ever = 2, never = 1; yes = 2, uncertain = 1, no = 0; true = 1, false = 0. For reversed items, it is necessary to reverse the scoring (e.g., always = 1, usually = 2, occasionally = 3, hardly ever = 4, never = 5), so that each item is scored in the right direction. After reversing the scores for reversed items, add up the scores for all items to obtain the total score for the questionnaire. Depending on the way in which you have allocated scores to response options, the higher the total score, the greater or lesser the presence of the characteristic being measured. A scoring key that fits over the questionnaire to identify which option the respondent has chosen for each item and its score can be useful for quick and easy scoring. In the following example, the respondent has obtained a total score of 12 (2 + 2 + 3 + 5):

      ALWAYS (A)   USUALLY (U)   OCCASIONALLY (O)   HARDLY EVER (HE)   NEVER (N)

   1. ____________________ ..........  A(5)   U(4)   O(3)   HE(2)   N(1)
   2. (reversed item) ____ ..........  A(1)   U(2)   O(3)   HE(4)   N(5)
   3. ____________________ ..........  A(5)   U(4)   O(3)   HE(2)   N(1)
   4. (reversed item) ____ ..........  A(1)   U(2)   O(3)   HE(4)   N(5)

Simple scripts can also be written for scoring questionnaires, although it is good practice to keep a backup of the item-level data in case you ever decide to change the scoring process.
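To make this concrete, here is a minimal Python sketch of such a scoring script. It assumes that responses have been recorded with the option codes used in the example above and that items 2 and 4 are the reversed items; all names in the sketch are illustrative rather than part of any published scoring software.

# Minimal scoring sketch for a person-based rating-scale questionnaire.
# Responses are assumed to be recorded as the codes used in the example above.
FORWARD = {"A": 5, "U": 4, "O": 3, "HE": 2, "N": 1}   # always ... never
REVERSE = {"A": 1, "U": 2, "O": 3, "HE": 4, "N": 5}   # scoring for reversed items
REVERSED_ITEMS = {2, 4}                               # hypothetical reversed items

def score_respondent(responses):
    # responses: dict mapping item number -> response code, e.g. {1: "HE", 2: "U", ...}
    total = 0
    for item, code in responses.items():
        key = REVERSE if item in REVERSED_ITEMS else FORWARD
        total += key[code]
    return total

# The respondent from the example: HE, U, O, N gives 2 + 2 + 3 + 5 = 12.
print(score_respondent({1: "HE", 2: "U", 3: "O", 4: "N"}))   # -> 12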

Piloting the questionnaire

The next stage in constructing your questionnaire is the pilot study. This involves having the questionnaire completed by people who are similar to those for whom the questionnaire is intended. Analysis of these data will help you to select the best items for the final version of your questionnaire. If, for example, your questionnaire is intended for women with preschool-age children, you might carry out the pilot study at a baby clinic or a mothers-and-toddlers club. If it is for use with the general population, you would need to find a group of people who are representative of the population at large. This is often more difficult than finding a more specific group. You could make use of the electoral register, but this is usually too time-consuming to be worthwhile for a pilot study.


When a truly representative group is impossible to find, an approximation is usually good enough. A common strategy is to hand out questionnaires in public places such as shopping centers, train and bus stations, airport lounges, doctors’ waiting rooms, or cafeterias of large organizations. The respondents who take part in the pilot study should vary in terms of demographic characteristics such as age, gender, and social class. There is little point in piloting a questionnaire intended for any gender only with men, or a questionnaire to be used throughout an industry only with managers and not manual workers. It is important to obtain relevant demographic information from the respondents in the pilot study to help with the validation of your questionnaire at a later stage. The pilot version of your questionnaire should be administered to as many people as possible. The minimum number of respondents required is one more than the number of items. If it is not possible to obtain this many, it is better to use fewer people than to omit the piloting stage altogether. The pilot version of the GRIMS was administered to both partners in 60 client couples from relationship therapy and relationship guidance clinics throughout the country. Administer your questionnaire and obtain relevant demographic information from a group of people who are similar to those for whom the final questionnaire is intended.

Item analysis

Item analysis of the data collected in the pilot study to select the best items for the final version of your questionnaire involves an examination of the facility and the discrimination of each item. For knowledge-based multiple-choice items, it is also important to look at distractors. The first step is to create an item-analysis table with each column (a, b, c, d, e, etc.) representing an item and each row (1, 2, 3, 4, 5, etc.) representing a respondent. For knowledge-based items, insert “1” in each cell for which the respondent gave the correct answer and “0” for each incorrect answer. Add up the scores to give total scores for each row (i.e., each respondent) and each column (i.e., each item). Table 2.4 shows a sample item-analysis table for a knowledge-based questionnaire.

Facility

Most questionnaires are designed to differentiate between respondents according to whatever knowledge or characteristic is being measured (see discussion of standardization in Chapter 3). A good item, therefore, is one for which different respondents give different responses. The facility index gives an indication of the extent to which all respondents answer an item in the same way. If they do, then these items are redundant, and it is important to get rid of them. For example, if every respondent gives the correct response to a particular item, this simply has the effect of adding one point to the total score for each respondent and does not discriminate among them. For knowledge-based questionnaires, the facility index is calculated by dividing the number of respondents who obtain the correct response for an item by the total number of respondents. Ideally, the facility index for each item should lie between .25 and .75, averaging .5 for the entire questionnaire. A facility index of less than .25 indicates that the item is too difficult, as very few respondents obtain the correct response; and a facility index of more than .75 shows that the item is too easy, as most respondents obtain the correct response.

Table 2.4 A sample item-analysis table for a knowledge-based questionnaire

                          Items
Respondents          a      b      c      d      e      Sum
1                    1      1      0      1      1       4
2                    0      1      0      0      1       2
3                    1      0      0      1      1       3
4                    1      0      0      0      1       2
5                    1      0      0      1      1       3
Sum                  4      2      0      3      5
Facility            .8     .4     .0     .6    1.0
Discrimination     .13   −.48  UNDEF    .67  UNDEF

In Table 2.4, we would want to eliminate items c and e from the final questionnaire, as everyone has responded to these items in the same way. If it is a person-based questionnaire, then items may have values of more than 1. For example, if the response options for each item are “strongly agree,” “agree,” “disagree,” and “strongly disagree,” then the item values may be 1, 2, 3, or 4. Insert the actual score for each item into the item-analysis table, remembering to ensure that reversed items are scored in the opposite direction to nonreversed items. The facility index for person-based items is calculated by summing the scores for the item for each respondent, then dividing this total by the total number of respondents. An item with a facility index that is equal to or approaching either of the extreme scores for the item should not be included in the final version of the questionnaire. It is also important to ensure by looking at the scores in the item-analysis table that a good facility index—i.e., one lying somewhere between the extreme scores—does not simply mean that everyone has chosen the middle option.

Discrimination

This is the ability of each item to discriminate among respondents according to whatever the questionnaire is measuring, i.e., respondents who perform well on a knowledge-based questionnaire or who exhibit the characteristic being measured by a person-based questionnaire should respond to each item in a particular way. Items should be selected for the final version of the questionnaire only if they measure the same knowledge or characteristic as the other items in the questionnaire. In a knowledge-based questionnaire, this means that for each and every item, those with higher total scores on the questionnaire should be more likely to get the item correct than those with lower scores on the questionnaire. Items that fail to do this are said not to discriminate between high and low scorers, and hence should be removed. More usually, discrimination is measured by correlating each item with the total score from summing all the other items in the questionnaire (i.e., the total score with the item that was removed). You can use a spreadsheet program such as Excel to do this, although any statistical analysis package will do. In Table 2.4, the Pearson product-moment correlation coefficient was used (CORREL in Excel), but some prefer biserial correlations or point-biserial correlations. However, it is the relative size of the correlations rather than the actual size that matters, and these remain in the same order whichever is used.


The higher the correlation, the more discriminating the item. A minimum correlation of .2 is generally required. Items with negative or zero correlations are always excluded. In Table 2.4, only item d fully meets these criteria. Items b, c, and e would have to be removed, as they have either a negative or an undefined correlation (undefined because the formula would have led to a division by zero). There are no hard-and-fast rules about inclusion criteria for items in the final questionnaire. It is common to choose 70%–80% of the original items. The higher this discrimination index for the item, the better. In Table 2.4, however, we might still keep item a, as it is the only item left! The same procedure is used whether the data are from a knowledge-based or a person-based test.
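Both indices are simple to compute. The following Python sketch calculates the facility index and the corrected item-total discrimination (the correlation of each item with the sum of the remaining items) from a small invented item-analysis table; the data and names are illustrative only and are not taken from any published test.

# Item-analysis sketch: facility and corrected item-total discrimination.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / denom if denom else float("nan")   # undefined when an item has no variance

data = [          # rows = respondents, columns = items (invented 0/1 scores)
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]

for j in range(len(data[0])):
    item = [row[j] for row in data]
    rest = [sum(row) - row[j] for row in data]      # total score with the item removed
    print(f"item {j + 1}: facility = {mean(item):.2f}, discrimination = {pearson(item, rest):.2f}")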

Distractors

An examination of the use of distractor options by respondents who do not choose the correct or best option should be carried out for each item to ensure that each distractor is endorsed by a similar proportion of respondents. This can be done for each item, counting the number of times that each of its distractors has been endorsed. The number of endorsements should be similar for all these distractors. Items for which distractor options are not functioning properly should be considered for exclusion from the final questionnaire. When deciding which items to include in the final version of your questionnaire, you will have to take many factors into account and balance them against each other. In addition to facility, discrimination, and distractors, you will need to consider the number of items that you require for the final version (at least 12, and more usually 20, are necessary for a reliable questionnaire) and how well the items fit the blueprint. For example, you might include an item with fairly poor discrimination if you have very few items from that area of the blueprint, or you might include an item with poor facility if it has reasonable discrimination. In a personality questionnaire, it is also important to ensure that there are approximately equal numbers of reversed and nonreversed items. Ways of improving items may become clear at this stage. For example, changing the wording of an item may improve facility, or a distractor may be made more realistic. However, it is not a good idea to change many items, as you will not know how these changes affect the reliability and validity of the questionnaire. The procedures of item analysis will inform you about the characteristics of each item. It is then up to you to decide which criteria are most important for the purpose of your particular questionnaire. Decide which items from the pilot version of your questionnaire to include in the final version—taking account of facility, discrimination, and, if appropriate, distractors. Order the items and design the questionnaire as before.
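The distractor check lends itself to the same kind of scripting. The short Python sketch below simply counts, for one hypothetical item, how often each incorrect option was endorsed; markedly uneven counts suggest a distractor that is not doing its job.

# Distractor-analysis sketch: count endorsements of each incorrect option for one item.
from collections import Counter

responses = ["B", "C", "B", "A", "B", "D", "E", "B", "C", "B"]   # invented answers to one item
key = "B"                                                        # the correct option
distractor_counts = Counter(r for r in responses if r != key)
print(distractor_counts)   # e.g. Counter({'C': 2, 'A': 1, 'D': 1, 'E': 1})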

Obtaining reliability

Reliability is an estimate of the accuracy of a questionnaire, and is discussed in more detail in Chapter 3 (the next chapter). For example, a questionnaire is reliable if a respondent obtains a similar score on different occasions, provided that the respondent has not changed in a way that affects their response to the questionnaire. When you publish your questionnaire, you will be expected to report its reliability. Hence, you will want to have some information about the impact of your particular item selection on this. You have only so far given your respondents the questionnaire once.


However, it is possible to estimate the reliability from the data you already have. There are two ways of doing this. Although there are many arguments over which is the most appropriate, they both generally (and rather surprisingly) deliver very similar results.

Cronbach’s alpha

The first method is calculating a statistic called Cronbach’s alpha, which is a measure of the internal consistency of the questionnaire. Cronbach’s alpha is widely accepted as a surrogate for reliability. Most statistical packages allow you to do this quite easily from the data in an item-analysis table. The second is a method called split-half reliability.
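For readers who prefer to see the arithmetic, here is a minimal Python sketch of coefficient alpha computed directly from an item-analysis table; the scores are invented and the variable names are purely illustrative.

# Cronbach's alpha sketch: internal consistency from an item-analysis table.
from statistics import pvariance

data = [            # rows = respondents, columns = items (invented rating-scale scores)
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 4],
    [3, 3, 2, 3],
    [1, 2, 2, 1],
]

k = len(data[0])                                    # number of items
totals = [sum(row) for row in data]                 # total score per respondent
item_vars = [pvariance([row[j] for row in data]) for j in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))
print(round(alpha, 2))                              # about 0.92 for these invented scores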

Split-half reliability

The second method is called split-half reliability. Here the questionnaire is divided into two halves (usually odd and even items), and the correlation between the halves is used to produce an estimate of reliability for the whole questionnaire. For split-half reliability, the Pearson product-moment correlation coefficient between the two halves of the questionnaire is used in the Spearman–Brown formula to give an estimate of reliability for the whole questionnaire:

r11 = 2r½½ / (1 + r½½)

where r11 is the estimated reliability for the whole questionnaire and r½½ is the correlation between the two halves of the questionnaire. For example, if the Pearson product-moment correlation coefficient between two halves of a questionnaire is .80:

r11 = (2 × 0.80) / (1 + 0.80) = 1.60 / 1.80 = 0.89

The greater the number of respondents, the better the estimate of reliability. If fewer than 50 respondents were included in the pilot study, it is necessary to have the final version of the questionnaire completed by more people, ensuring once again that they are similar to those for whom the questionnaire is intended. The dual use of the pilot-study data for item selection and reliability estimation will mean that reliabilities are overestimated. Ideally, data from at least 200 respondents who were not part of the pilot study should be used in calculating reliability. Where the questionnaire is intended for different types of respondents, it is common to show that it is reliable for each type. In this case, a total of 200 respondents would be needed altogether. Whatever measure of reliability is used, a coefficient of at least .7 is generally required for person-based questionnaires, and at least .8 for knowledge-based questionnaires. For the GRIMS, split-half reliabilities were obtained for men and women separately for respondents in the pilot study, relationship therapy clients, and a general population group. Reliabilities ranged from .81 to .94. Calculate the split-half reliability for the final version of your questionnaire using data from the relevant items from all of the respondents in the pilot study, plus additional respondents if necessary. For each respondent, calculate the total score for the even items in the final version of your questionnaire and the total score for the odd items.


Correlate the odd items with the even items using the Pearson product-moment correlation. Use this correlation coefficient in the Spearman–Brown formula to obtain an estimate of reliability for the whole questionnaire.
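A hedged sketch of this calculation in Python is given below; it assumes the item scores have already been reverse-scored where necessary, and the data are invented for the example.

# Split-half reliability sketch: correlate odd-item and even-item totals,
# then step the correlation up with the Spearman-Brown formula.
from statistics import mean

data = [            # rows = respondents, columns = item scores (invented)
    [4, 3, 4, 5, 3, 4],
    [2, 2, 3, 2, 2, 1],
    [5, 4, 4, 4, 5, 5],
    [3, 3, 2, 3, 2, 3],
    [1, 2, 2, 1, 2, 2],
]

odd = [sum(row[0::2]) for row in data]    # items 1, 3, 5, ...
even = [sum(row[1::2]) for row in data]   # items 2, 4, 6, ...

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) ** 0.5 * sum((b - my) ** 2 for b in y) ** 0.5)

r_half = pearson(odd, even)
r_whole = 2 * r_half / (1 + r_half)       # Spearman-Brown step-up
print(round(r_half, 2), round(r_whole, 2))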

Assessing validity

The validity of a questionnaire is the extent to which it measures what it is intended to measure. Validity is discussed in Chapter 3 (the next chapter), but at this point there are only two of the many forms of validity that you should apply.

Face validity

This describes the appearance of the questionnaire to respondents, i.e., whether or not it looks as if it is measuring what it claims to measure. If not, respondents may not take the questionnaire seriously. Look carefully at your selection of items and the general layout of the questionnaire with this in mind.

Content validity

This is the relationship between the content and the purpose of the questionnaire, i.e., whether or not there is a good match between the test specification and the task specification. For example, the blueprint for a questionnaire used in job selection should match the job description. Content validity is generally taken care of in constructing the blueprint and in the item analysis. However, it is important to check that the balance of items in the final version of your questionnaire matches the original blueprint. Content validity of the GRIMS is high with respect to its specification, and good face validity has been incorporated into the item selection. It is also important for the GRIMS to have good diagnostic validity. This was established by determining that couples who presented at marriage guidance clinics had significantly higher scores than a matched sample from the general population. Moreover, couples presenting for marital therapy had significantly higher scores than couples presenting for sex therapy. Because the GRIMS was intended as a measure of improvement after therapy, it was important to obtain a rating of the validity of the GRIMS as an estimator of change. Couples were asked to complete the GRIMS before and after therapy; and the therapists, who were unaware of their clients’ GRIMS scores, were asked to rate the couple on a five-point scale ranging from 0 (“improved a great deal”) to 4 (“got worse”). The GRIMS scores for the male and female partners were averaged for each couple. The average score before therapy was subtracted from the average score after therapy to give a change score representing change during therapy. The change scores were correlated with the therapists’ ratings of change, producing a correlation coefficient of .77. This is firm evidence for the validity of change in the GRIMS score as an estimate of change in the quality of the relationship(s) or in the effectiveness of therapy. Ensure that your questionnaire has good face validity and content validity. Consider carefully what other forms of validity will be required at a later stage, and draft plans for any necessary data collection.


Standardization Standardization involves obtaining scores on the final version of your questionnaire from appropriate groups of respondents (see Chapter 3). These scores are called norms. Large numbers of respondents must be carefully selected for a standardization group according to clearly specified criteria in order for norms to be meaningful. Norms can be obtained from the data in the pilot study, but this is not the preferred method. With good norms, it is possible to interpret the score of an individual respondent, i.e., whether or not their score on the questionnaire is typical. This is useful if, for example, you wish to know how an individual child performs on an ability test compared with other children of the same age, or if you wish to determine how a person with a suspected clinical disorder compares with people who have been diagnosed as having that disorder. It is not always necessary to produce norms. If your questionnaire has been developed for research, which involves comparing groups of respondents, norms can be useful in interpreting the performance of a group as a whole, but they are not crucial. If, however, you wish to interpret the score of an individual, it is necessary to have good norms against which to compare an individual score. It is important to include as many respondents as possible in the standardization group, and to ensure that they are truly representative. A minimum of several hundred is generally required, but this depends to a large extent on the nature of the respondents. Some are easier to find than others, and it is often better to obtain a smaller group of very appropriate respondents than a larger but less appropriate one. In some cases, it is necessary to obtain several standardization groups or to stratify the standardization group according to relevant variables such as age, gender, or social class. Ideally, there should be several hundred respondents in each group or stratification. Norms should be presented in terms of the mean and standard deviation for each group or stratification. The mean score for the standardization group is simply the average of the scores for the respondents in that group. The standard deviation is a measure of the amount of variation in the standardization group. (It is the square root of the average of the squared deviations from the mean.) If you have all of the scores in an Excel spreadsheet, it can easily be calculated using the STDEV function. Alternatively, use any statistical package. Once you know the mean and the standard deviation of the standardization group, also frequently called the norm group, you can calculate for each person by how many standard deviations their score differs from the mean. This figure ranges between about −3.00 and about +3.00, and is referred to as a standard score or z score. One advantage of the standard score is that anyone who understands how they are calculated can immediately interpret someone’s standard score in terms of how they compare with everyone in the standardization sample. If their z score is 0, they are right at the average. If their z score is 1.00, they are one standard deviation above the mean. If their z score is −1.50, they are one and a half standard deviations below the mean, and so on. However, it would not be an easy score to give as feedback to someone regarding their result. Hence, there are various techniques available for rescaling standard scores to make them more presentable. These are called standardized scores. 
The most frequent of these as far as knowledge-based tests are concerned is the T score. To obtain a T score, you simply multiply the standard score by 10, add 50, and round to the nearest whole number. For person-based tests, it is more common to multiply the z score by 2, add 5, then round the answer off. This produces a score between 1 and 9, called a stanine.

Table 2.5 Example of standardized scores obtained for all the individuals in a standardization sample of seven people

Person        Raw score        z        T     Stanine      IQ     Percentile
1                 44         −1.28     37        2         81       10.03
2                 48         −1.04     40        3         84       14.92
3                 57          −.49     45        4         93       31.21
4                 66           .05     51        5        101       52.00
5                 75           .60     56        6        109       72.57
6                 76           .66     57        6        110       74.54
7                 90          1.50     65        8        123       93.32
Mean            65.14         0.00     50        5        100
S.D.            16.54         1.00     10        2         15

Nearly all personality tests report stanine scores. Even IQ scores today are normally standardized in the same way, but this time by multiplying by 15 and adding 100. Table 2.5 gives an example of standardized scores obtained for all the individuals in a standardization sample of seven people. It also has an additional column containing the percentile equivalent—that is, the percentage of people in the standardization sample who obtained a score at this level or less. The GRIMS was standardized using two groups: (i) a random sample of people consulting their family doctor with the usual variety of medical problems (a general-population group); and (ii) clients attending relationship guidance clinics and sexual therapy clinics (a relationship-problems group). Standardize your questionnaire using a relevant group or groups of as many respondents as possible. Present the norms in terms of the mean and standard deviation for each group or stratification.
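The transformations behind Table 2.5 are straightforward to script. The Python sketch below (illustrative only) converts the raw scores of the table into z scores, T scores, stanines, IQ-style scores, and normal-curve percentile equivalents; it uses the sample standard deviation, as the table does, and clips stanines to the conventional 1 to 9 range.

# Standardized-scores sketch: z, T, stanine, IQ-style scores, and percentile equivalents.
from statistics import mean, stdev, NormalDist

raw = [44, 48, 57, 66, 75, 76, 90]               # raw scores from Table 2.5
m, sd = mean(raw), stdev(raw)                    # approximately 65.14 and 16.54

for x in raw:
    z = (x - m) / sd
    t = round(10 * z + 50)                       # T score
    stanine = min(9, max(1, round(2 * z + 5)))   # stanine, clipped to 1-9
    iq = round(15 * z + 100)                     # IQ-style score
    pct = 100 * NormalDist().cdf(z)              # percentile equivalent
    print(f"{x}: z={z:.2f} T={t} stanine={stanine} IQ={iq} percentile={pct:.2f}")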

3

The psychometric principles

The four principles of classical psychometrics are reliability, validity, standardization, and equivalence. Reliability has been defined as the extent to which the test is free from error, and validity as the extent to which it is measuring what it is purported to measure. If a test is unreliable, it is impossible for it to be valid, so it would be a logical impossibility to have a test that is valid but completely unreliable. On the other hand, a test can be reliable but invalid. Standardization represents the method by which a score on a test is interpreted, either in comparison with other people who have taken the same test or by saying what skills or attributes a person with that particular score is likely to have. Equivalence is the requirement that a test be free from bias.

Reliability Reliability is often explained in terms of the measuring process in physics or engineering. When we measure the length of a table, for example, we assume that our measurements are reasonably reliable. We could check this by taking several length measurements and comparing them. Even if we aim to be precise, it is unlikely that we will get exactly the same length each time. On one occasion, the length of the table may be 1.989 meters, and on another 1.985 meters. But in most circumstances, we would still be reasonably happy if the errors were only this small. In the social sciences, on the other hand, the unreliability of our measures can be a major problem. We may find that, for example, a student scored 73 on a geography test on one occasion, and then two weeks later, scored 68 on the same test, and we would probably feel that these figures were as close as we could reasonably expect in a classroom test. However, if the test in question were for entrance to university, and the scoring system gave a grade of B for 70 and above and C for below 70, and if the student were a university applicant who required B grades, we would certainly have cause for concern. It is because apparently reasonable levels of reliability for a test can have such devastating effects that we need first to make tests that are as reliable as possible, and second to take account of any unreliability in interpreting test results. All published tests are required to report details of reliability and how it was calculated, and whenever a test is constructed and used, information on reliability must be included. Test–retest reliability

There are several techniques for estimating the reliability of a test. The most straightforward is test–retest reliability, which involves administering the test twice to the same


group of respondents, with an interval between the two administrations of, say, one month. This would yield two measurements for each person: the score on the first occasion and the score on the second occasion. A correlation coefficient calculated on these data would give us a reliability coefficient directly. If the correlation were 1.00, there would be perfect reliability, indicating that the respondents obtained the same score on both occasions. This never happens (except, perhaps, by fluke) in psychological or educational settings. If the correlation between the two occasions is 0.00, then the test has no reliability at all: whatever score was obtained on the first occasion bears no relationship whatsoever to the score on the second occasion; and by implication, if the respondents were tested again, they would come up with another completely different set of scores. In these circumstances, the scores are quite meaningless. If the correlation between the two occasions is negative, this implies that the higher the respondent’s score the first time, the lower it would be the second time (and vice versa). This never occurs except by accident, and if it does, a reliability of 0.00 is assumed. Thus, all tests can be expected to have a test–retest reliability between 0.00 and 1.00, but the higher the better. One advantage of using the correlation coefficient to calculate test–retest reliability is that it takes account of any differences in mean score between the first and second occasion. Thus, if every respondent’s score increased by five points on the second occasion but was otherwise unchanged, the reliability would still be 1.00. Only changes in relative ordering or in the number of points between the scores can affect the result. Test–retest reliability is sometimes referred to as test stability. Parallel-forms reliability

Although the test–retest method is the most straightforward, there are many circumstances in which it is inappropriate. This is particularly true in knowledge-based tests that involve some calculation in order to arrive at the answer. For these tests, it is very likely that the skills learned on the first administration will transfer to the second, so that tasks on the two occasions are not really equivalent. Differences in motivation and memory may also affect the results. A respondent’s approach to a test is often completely different in a second administration (e.g., they might be bored, or less anxious). For these reasons, an alternative technique for estimating reliability is the parallel-forms method. Here, we have not one version of the test, but two versions linked in a systematic manner. For each cell in the test specification, two alternative sets of items will have been generated, which are intended to measure the same construct but are different (e.g., 2 + 7 in the first version of an arithmetic test and 3 + 6 in the second). Two tests constructed in this way are said to be parallel. To obtain the parallel-forms reliability, each person is given both versions of the test to complete, and we calculate the correlation between the scores for the two forms. Many consider the parallel-forms method to be the best form of reliability; however, there are pragmatic reasons why it is rarely used. When a test is constructed, our main aim is to obtain the best possible items, and if we wish to develop parallel forms, not only is there twice the amount of work but there is also the possibility of obtaining a better test by taking the better items from each version and combining them into a “super test.” This is generally a more desirable outcome, and frequently where parallel forms have been generated in the initial life of a test, they have later been combined in this way, as for example in the later versions of the Stanford-Binet Intelligence Scales.

Split-half reliability

For this technique, a test is split into two to make two half-size versions. If this is done in random fashion, a sort of pseudo-parallel form is obtained, where—although there are not necessarily parallel items within each cell of the test specification—there is no systematic bias in the way in which items from the two halves are distributed with respect to the specification. It is a common convention to construct the two forms from the odd and even items of the questionnaire, respectively, so long as this does indeed give a random spread with respect to the actual content of the items. For each individual, two scores are obtained—one for each half of the test—and these are correlated with each other (again using the correlation coefficient). The resultant correlation itself is not a reliability. It is, if anything, the reliability of half of the test. This is of no immediate use, as it is the whole test with which we have to deal. However, we can obtain the reliability of the whole test by applying the Spearman–Brown formula to this correlation:

rtest = 2 × rhalf / (1 + rhalf)

where rtest is the reliability of the test and rhalf is the correlation obtained between the two halves of the test. This tells us that the reliability is equal to twice the correlation between the two halves, divided by one plus this correlation. Thus, if the two halves correlate 0.60 with each other, then reliability = (2 × 0.60)/(1 + 0.60) = 1.20/1.60 = 0.75. It is worth noting that the reliability is always larger than the correlation between the two halves. This illustrates the general rule that the longer the test, the more reliable it is. This makes sense, because the more questions we ask, the more information we obtain; and it is for this reason that we will generally want our tests to be as long as possible, provided that there is time for administration and the respondents are cooperative. Of course, this only applies so long as the items are discriminating—that is, they are making a real contribution to the overall score.

Interrater reliability

All these types of reliability particularly relate to objective tests, i.e., tests in which the scoring is completely objective. However, there are additional forms of reliability that are applicable where the degree of objectivity is reduced. For example, different markers of the same essay tend to give different marks, or different interviewers may make different ratings of the same interviewee within a structured interview. Reliability here can be found by correlating the two sets of marks or the two sets of ratings, respectively, using the correlation between the scores of the two raters. This form of reliability is known as intermarker or interrater reliability.

Internal consistency

The internal consistency of a test represents the extent to which all the items within it correlate with each other. Sometimes called Cronbach’s alpha or coefficient alpha, it is frequently used as a surrogate for reliability and is an extension of the logic behind split-half reliability, but this time the test is being deconstructed right down to the item level rather than just into two halves.


In many ways, it is a functional average of all possible splits; however, this only applies to the extent that such splits are random. Many have argued that it is not a true reliability, and there is a good case for this argument. However, in practice, the value obtained for coefficient alpha is generally very similar to that obtained for split-half reliability on the same test, so this argument is in many ways moot. The internal consistency can be open to abuse by unscrupulous test developers, as it can be inflated by repeating the same or almost identical items several times. This ruse can be used to increase a disappointing reliability in a test under development above some critical value (normally 0.70). In fact, if all the items in a test were identical—which is clearly not desirable—then the coefficient alpha should be 1.00. An overpedantic devotion to statistical methodology can lead to the oddest conclusions. The moral? Do not aim for a reliability of 1.00; aim for the expected level for a suitably broad set of items for the construct under measurement. This would sensibly be between 0.70 and 0.80 for a personality test and between 0.80 and 0.90 for an ability test.

Standard error of measurement (SEM)

From True Score Theory (see Chapter 4) we know that the observed score on a test is a summation of two components: the true score, which is unknown, and an error associated with its measurement. We also know that if we average the score over repeated observations, this average is an increasingly accurate estimation of the true score. The standard deviation of these repeated observations is known as the standard error of measurement (SEM), and it gives us important information on how accurately the observed score reflects the true score. If the SEM is very small, we can have a high degree of confidence that our observation is accurate; with a wide-ranging SEM, much less so. In order to calculate the SEM, we need two pieces of information: the standard deviation of the test and its reliability. Given this information, the error of measurement is equal to the standard deviation of the test multiplied by the square root of the result when the reliability is subtracted from 1. For example, if a test has a known reliability of 0.90 and a standard deviation of 15, then the SEM

= 15 × √(1 − 0.90) = 15 × √0.10 = 15 × 0.32 = 5 (approximately).

This SEM is the standard deviation of errors associated with individual scores on the test. From this, we have some idea of the distribution of error about the observed score. If it is assumed that errors of measurement are normally distributed, this then enables us to calculate a confidence interval (CI) for the observation. A CI sets upper and lower limits within which we can have a certain amount of confidence that the true score truly lies. Confidence intervals vary depending on the amount of certainty that is required. It may be important that certainty be high: we may want to risk being wrong only one time in 1,000, and this would mean we want the 99.9% CI. Or we may want just a rough-and-ready estimate such that the risk of being wrong is as high as one in 10 (the 90% CI). Although it is good to be accurate, the more accuracy that is required, the wider the confidence interval, and thus the greater the general uncertainty.


For example, if a person had an observed score of 43 on a general-knowledge test and we knew the SEM, we could obtain upper and lower limits for various levels of confidence. The 95% level may give us the information that we can be 95% certain that the person’s true score lies between 40 and 46, for example. If we need to be 99.9% certain, we may have to say only that the true score lies between 35 and 50—perhaps too wide for useful application. The usual convention is to use the 95% confidence interval for most purposes, so that there is a one-in-twenty chance of being wrong. Looked at in percentile terms, this means that anything between the 2.5 percentile and the 97.5 percentile is “within range.” The excluded 5 percentiles (that is, 2.5 at the lower end plus 2.5 at the upper end) represent the one-in-twenty probability that we are mistaken in placing the true score within this range. From the properties of the normal curve, we can easily find that this percentile range lies between 1.96 standard error units above and below the score in question. The 95% limits themselves are thus obtained by multiplying the SEM by 1.96. With an error of measurement of 5 (as in the example, obtained from a reliability of 0.90 and a standard deviation of 15), an observed score of 90, and a 95% CI, we can say that the true score lies between 90 plus or minus 5 multiplied by 1.96—that is, between about 80 and about 100. We could tell from this that another person with a score of 85 would not be significantly different, given the known amount of error that we have obtained from our knowledge of the test’s reliability. This might be important if we needed to decide between these two people. In fact, this example could easily have been obtained from scores on the Wechsler Intelligence Scale for Children (WISC), which does indeed have a reliability of about 0.9 and a standard deviation of 15. We can see from this why so many psychologists are unhappy about using intelligence tests on their own when making decisions about individual children.
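As a worked illustration, the Python sketch below computes the SEM and a 95% confidence interval from the reliability and standard deviation quoted above; the variable names are illustrative and the exact SEM is kept unrounded rather than taken as 5.

# SEM and 95% confidence interval sketch for an observed test score.
reliability = 0.90
sd = 15
observed = 90

sem = sd * (1 - reliability) ** 0.5              # about 4.74
lower = observed - 1.96 * sem
upper = observed + 1.96 * sem
print(f"SEM = {sem:.2f}, 95% CI = {lower:.1f} to {upper:.1f}")   # roughly 81 to 99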

Comparing test reliabilities

One of the major uses of the reliability coefficient is in the evaluation of a test. Generally, different types of tests have different acceptable reliabilities. Thus, individual IQ tests generally report reliabilities in excess of 0.9 and tend to average about 0.92. With personality tests, reliabilities of greater than 0.7 are expected. Essay marking tends to produce notoriously low interrater reliabilities of about 0.6, even when complex agreed-upon marking schemes are worked out in advance between the examiners. Creativity tests are notoriously even less reliable. For example, the Alternative Uses Test—which contains items such as “How many uses can you think of for a brick?”—rarely has a reliability higher than 0.5. The problems with this type of test are intrinsic to its design. If the number of reported uses is to simply represent the score, then how do we define “different”? Thus “to build a shop” and “to build a church” should probably count as one response, but it is not possible to define very precisely what counts as the same response. The lowest reliabilities are found in projective tests, such as the Rorschach inkblot test, where reliabilities of 0.2 and lower are not unusual. Here again, it is very difficult to be completely objective in the scoring. A test with such a low reliability is useless for psychometric purposes, but can be of use in clinical settings to provide a diagnostic framework.

Restriction of range

When interpreting reliability coefficients, the spread of the scores of the sample under scrutiny must also be considered. This can only really be done by ensuring that the reliability has been calculated on a similar group to the one to which it is being applied.


If a sample selected on some narrow criterion is used, such as university students, then reliability coefficients will be much smaller than for the general population. This phenomenon is called the restriction-of-range effect. Generally, the larger the standard deviation of the group, the higher the expected reliability. However, when calculating the SEM, the lower standard deviation of the restricted group will be balanced by its lower reliability (when calculated on this group alone), so that the SEM is particularly resilient to the restriction-of-range effect.
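A quick simulation makes the restriction-of-range effect easy to see. The Python sketch below (illustrative only, with invented parameters) generates test and retest scores under the classical true-score model and compares the test-retest correlation in the full sample with the correlation in a subsample restricted to the top half of the first-occasion scores.

# Restriction-of-range sketch: reliability shrinks when the sample is range-restricted.
import random
from statistics import mean

random.seed(1)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) ** 0.5 * sum((b - my) ** 2 for b in y) ** 0.5)

# True scores plus independent error on each occasion (classical true-score model).
true = [random.gauss(100, 15) for _ in range(5000)]
test = [t + random.gauss(0, 7) for t in true]
retest = [t + random.gauss(0, 7) for t in true]

full = pearson(test, retest)
cut = sorted(test)[len(test) // 2]                   # select on the first testing
selected = [(a, b) for a, b in zip(test, retest) if a >= cut]
restricted = pearson([a for a, _ in selected], [b for _, b in selected])
print(f"full sample r = {full:.2f}, restricted (top half) r = {restricted:.2f}")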

Validity

The validity of a test also has many different forms. There are several categorization systems used, but the major groupings are face validity, content validity, predictive validity, concurrent validity, and construct validity.

Face validity

Face validity concerns the acceptability of the test items, to both test user and respondent, for the operation being carried out. This should never be treated as trivial. If respondents fail to take the test seriously, the results may be meaningless. For example, some adults with cognitive impairment may be expected to have the same overall score on intelligence tests as eight-year-old children, but they may well object to the use of childish materials in a test designed for them. Similarly, applicants for a job may be disenchanted if presented with a test designed primarily for the detection of psychiatric symptoms. Evaluation of the suitability of a test must include consideration of the style and appropriateness of the items for the purpose at hand, in addition to any other formal test characteristics.

Content validity

The content validity of a test examines the extent to which the test specification under which the test was constructed reflects the particular purpose for which the test was developed. In an educational setting, content validation will generally involve a comparison between the curriculum design and the test design. In the use of a selective test for employment, the content validity will be the extent to which the job specification matches the test specification. Content validity is thus the principal form of validity for the functional approach to psychometrics, and it has sometimes been described as criterion-related or domain-referenced validity in circumstances where the test designer is using the criterion-referenced framework for skills learning and curriculum evaluation. Content validity is fundamental to psychometrics and is the main basis by which any test construction program is judged. Content validity must be judged qualitatively more often than quantitatively, as the form of any deviation from validity is usually more important than the degree. Essentially, if the test specification is not reflecting the task specification, it must be reflecting something else, and all else is a potential source of bias.

Predictive validity

Predictive validity is the major form of statistical validity and is used wherever tests are used to make predictions, for example, for job selection or for a program of instruction where the test is intended to predict eventual success in these areas.


Predictive validity is represented as a correlation between the test score itself and a score of the degree of success in the selected field, usually called success on the criterion. Thus, for example, in the use in England and Wales of A-level grades from secondary schools to select candidates for university, it might reasonably be assumed that the number and grades of the A-levels obtained by a candidate are related to their likelihood of success at university. We could generate scores on the test by combining A-level grades in some way (e.g., for each person, grade A++ = 6, A+ = 5, A = 4, B = 3, and so on, the scores for all examinations being summed). Similarly, a score of success at university could be generated by assigning 0 to a fail, 1 to a pass or third-class degree, 2 to an upper second-class degree, and 3 to a first-class degree. A simple correlation coefficient between A-level scores and degree-class scores would give a measure of the predictive validity of the A-level selection system. If the correlation were high—say, over 0.50—we might feel justified in selecting in this way, but if it turned out to be 0, then we would certainly have misgivings. This would mean that students’ success at university had nothing to do with A-level scores, so that many people with only one B grade, for example, could have had as good a chance of getting a first-class degree as those with three A grades. The A-level selection procedure would then have no validity. One common problem with predictive validity is that individuals who are not selected do not go on to produce a score on the criterion (people who do not go to university have no scorable degree class), so that the data are always incomplete. In these circumstances, the calculated predictive validity will nearly always be an underestimate. It is normal practice to use the data available and then justify extrapolation downward. Thus, if individuals selected with three B grades do worse than individuals selected with three A grades, it would be extrapolated that those with three Cs would have fared even less well. However, there must always be some uncertainty here.

Concurrent validity

Concurrent validity is also statistical in conception and describes the correlation of a new test with existing tests that purport to measure the same construct. Thus, a new intelligence test ought to correlate with existing intelligence tests. This is a rather weak criterion on its own, as the old and new tests may well both correlate and yet neither be measuring intelligence. Indeed, this has been one of the major criticisms directed against validation procedures for intelligence tests, particularly when the conceptualization of intelligence in the later test is derivative of the conceptualization in the first, thus producing a “bootstrap” effect. However, concurrent validity—although never enough on its own—is important. If old and new tests of the same construct fail to correlate with each other, then something is probably seriously wrong.

Construct validity

Construct validity is the primary form of validation underlying the trait-related approach to psychometrics. The entity that the test is measuring is normally not measurable directly, and we are only able to evaluate its usefulness by looking at the relationship between the test and the various phenomena that theory predicts. A good demonstration of construct validation is provided by Eysenck’s validation of the Eysenck Personality Inventory, which measures extraversion/introversion and neuroticism. It was not possible


for Eysenck to validate this scale by correlating respondents’ scores on the extraversion scale with their “actual” amount of extraversion. After all, if this were known, then there would be no need for the scale. However, he was able to suggest many ways in which extraverts might be expected to differ from introverts in their behavior. Based on his theory that extraverts had a less-aroused central nervous system, he postulated that they should be less easy to condition, and this led to a series of experiments on individual differences in conditionability. For example, it was shown that in an eyeblink conditioning experiment—with a tone heard through headphones as the conditioned stimulus and a puff of air to the eye as the unconditioned—extraverts developed the conditioned eyeblink response to the tone on its own more slowly than introverts did. He suggested that extraverts should also be less tolerant of sensory deprivation, and that the balance between excitation and inhibition in the brain would be different between extraverts and introverts, which led to a series of experiments. He also suggested that the electroencephalogram (EEG) would vary, with extraverts showing a less-aroused EEG, and this again could be tested. Finally, Eysenck was able to point to some simulations of extravert and introvert behavior, for example, the effect of alcohol that produces extraverted behavior by inhibiting cortical arousal. The validation of the construct of extraversion consists of a whole matrix of interrelated experiments. From this, Eysenck concluded that extraverts condition more slowly, are less tolerant to sensory deprivation, are less sensitive to inhibiting drugs, and are generally different from introverts on a whole variety of other psychophysiological and psychophysical tests. He claimed that his theory that extraversion had some biological basis was supported because it was able to provide a unified explanation for all these findings. Construct validation is never complete but is cumulative over the number of studies available. Differential validity

Construct validity demands not only that a test correlate highly with some other measures that it resembles but also that it not correlate with measures from which it should differ. Thus, a test of mathematics reasoning that had a correlation of 0.6 with numerical reasoning but of 0.7 with reading comprehension would be of dubious validity. It is not enough that the test and the criterion simply be correlated. We need to exclude the possibility that the correlation has come about because of some wider underlying trait such as general intelligence before we can make the claim that a specific form of ability has been assessed. Differential validity refers to the difference between the correlation between a test and its anticipated construct (convergent validity) and the correlation between the test and its possible confound (divergent validity). Classically, differential validity is assessed using the multitrait–multimethod approach, which involves the assessment of three or more traits using three or more different methods. Thus, in a personality-testing situation, extraversion, emotionality, and conscientiousness could be assessed using self-report, projective techniques, and peer ratings. High predicted correlations between the same personality trait measured by whatever method should be accompanied by low correlations elsewhere.

Standardization

Simply knowing someone’s raw score on a test tells us nothing unless we know something about the test’s standardization characteristics.


For example, a person may be delighted if you tell them that their score on a test is 78, but the delight will fade if you tell them that everyone else who took the test scored over 425. There are two types of standardization: norm referenced and criterion referenced. Criterion-referenced tests tell us what a person with a certain score or higher can or cannot be expected to know or do. Norm-referenced tests tell us how someone’s score compares with the score of a sample of others who took the test.

Norm referencing

The easiest way to norm-reference a test is simply to place everyone’s scores in order and find out the rank order of the person in question. Peter, for example, may come 30th out of 60. It is more usual to report this sort of ordering as a percentile, as this takes account of the sample size. Thus, Peter’s rank of 30th out of 60 becomes 50%, or the 50th percentile. These percentile scores are ordinal (ranked data), and so it is easy to convert them into an easily interpretable framework. The 50th percentile is the median (this is the average for ordinal data). The 25th and 75th percentiles are the first and third quartiles respectively, while the 10th, 20th, and 30th percentiles become the first, second, and third deciles. This form of reporting norm-referenced data is quite common for some occupational tests. However, a disadvantage of this approach is that it throws away information coming from the actual sizes of the differences between raw scores, something that is lost when data are ranked. For example, if candidates with scores of 45, 47, and 76 are ranked, these data become first, second, and third—and the meaning of the small difference between the first two and the large difference between the second and the third is lost. Our ability to make use of the distance between adjacent scores depends on our assumptions about the nature of the underlying measurement.

Types of measurement

Classical measurement theory tells us to differentiate three types of measurement: nominal, ordinal, and interval. Nominal data are simply categorized. For example, the numerical country code for the UK is 44, while that for France is 33, but these are just labels. We can use data categorized in this way within a statistical analysis, but clearly need to be cautious about how they are interpreted. It makes no sense to add, subtract, multiply or divide with them. Ordinal data are ranked, as when an interview panel arranges the order in which they see the suitability of candidates for a job. Although numbers are assigned—e.g., ranks of 1 to 10 where there are 10 candidates—this does not mean that the intervals between these categories have any special significance. For example, candidates 1 to 4 may have all seemed like very plausible candidates, and the panel was hard-pressed to rank them, while candidates 5 to 10 may have been felt to be of far lower caliber and were essentially seen as nonstarters. With interval data, the size of the difference between scores becomes significant. Thus, with candidates scoring 45, 50, and 60 on a test, it would make sense to say that the difference between the second and third candidates was twice as large as the difference between the first and second candidates. We need interval data in order to apply the more powerful parametric model. The average and variation parameters of a set of data vary according to its type of measurement. For nominal data, the average is the mode, defined as the most frequently occurring category. For ordinal data, the average is the median, the value such that half the scores are below it and half above it. For interval scales, the average is the mean, the sum of all the scores divided by the number of scores.


If data have a normal distribution, then the three forms of average—the mode, the median, and the mean—should be the same. If, however, the data are not normal, then they will differ. Looking at national salary income, for example, the data are heavily skewed so that most workers earn a lower amount, with a few earning a great deal. With positively skewed data of this type, the mode will be much lower than the mean, with the median lying in between. (A fact frequently used selectively by politically biased commentators, each putting their chosen slant on the interpretation of the “average” income and its implications for the state of the nation.) The variation parameters for the three types of measurement are, respectively, the range of frequencies, the interquartile range, and the standard deviation. Statistical models vary depending on the measurement level of the data. Those for interval data that also have a normal distribution are described as parametric, and those for nominal and ordinal data as nonparametric. Classical psychometricians have striven for the more powerful parametric approach—so much so that wherever possible, data that at first seem to be nominal or ordinal in nature have been treated as interval or converted to some simulation of interval. Thus, giving the correct or incorrect answer to a question generates two nominal categories: right and wrong. But if we assume a latent trait of ability, make a few other assumptions, and have enough items, we can treat these as interval data with a certain degree of plausibility. Similarly, the ranked data from an item in a personality questionnaire with the response categories “never,” “hardly ever,” “sometimes,” “often,” and “very often” can be treated in the same way.

Using interval data

Using interval data

If the scale is truly interval and its data normally distributed, then we can use the more powerful parametric statistical procedures in the analysis and interpretation. But we must first know something about the distribution of the data. The first step in obtaining this is to plot the frequency distribution of the raw data in the form of a histogram. To do this, we simply assign the ordered data to bins. If, for example, we have 100 people, each with a score out of 100 on a test, we can create bins of 0 to 9, 10 to 19, 20 to 29, and so forth—up to 90 to 100—and count how many people have scores within each bin. This information is then plotted, with frequency as the vertical y axis and the ordered bin categories as the horizontal x axis. The shape of the frequency histogram we obtain tells us whether the data are normally distributed. If they are, then norm referencing is relatively straightforward.

The statistic for the average in an interval scale is the mean, obtained by summing all the scores and dividing them by the number of scores. With this alone, we would be able to say whether a person's score was above or below average—but a more important question is "How much above or below average?" and the answer will depend on how much variation there is in the data. If the mean of a set of scores is 55, then a person with a score of 60 may be only a little above average if the range of scores is large, or very much above average if the range of scores is small. The statistic used to measure the amount of variation in a normal distribution is the standard deviation. This then becomes the unit by which deviations from the mean are measured. A person may, for example, be one standard deviation above the mean, or two, or 1.7. Their score may also be below the mean by a similar number of standard deviation units.

Norm-referenced standardization for interval data thus has two elements: first, the need to obtain information on the test scores of a population by taking appropriate samples; and second, the need to obtain a set of principles by which raw data from the test can be transformed to give a set of data that have a normal distribution. The latter becomes particularly important if the scores will need to be subjected to statistical analysis, as is usually the case. Parametric statistical tests, factor analysis, and most of the other advanced techniques available assume data that are normally distributed.

If the only information available after the administration of a test is one respondent's score, then it will tell us nothing about that person. For example, suppose we are told that Bernard scores 22 on an extraversion test. Is this high or low? Before we can make an interpretation, we need to know what a score of 22 represents. This may be norm-referenced information, given by knowledge of the general-population mean and standard deviation for extraversion scores. With this extra information, we can say how extraverted Bernard is in comparison with everyone else. Or it may be criterion related; for example, there may be information in the test handbook to tell us that people with extraversion scores of 22 and higher like going out all the time or have difficulty settling down for a quiet read. Comparison or criterion information of this type must be made available when a test is published. Norm-referencing procedures are much more common than criterion referencing, largely because they are easier to carry out and because for many tests it is extremely difficult to give clear and specific criteria.

In order to obtain comparison data for a norm-referenced test, a population must be specified that is directly comparable to the population of intended use. Thus, a test of potential to do well in business—used to select persons for admission to business school—must be standardized with business-school applicants. If the potential sample is large, then the information can be obtained by a process of random or stratified random sampling. For information on the whole population of adults in a country, a random sample might be taken from the electoral register. Comparison data may be presented in raw form, as for example in the Eysenck Personality Inventory, where we are able to read in the handbook that the average extraversion score is about 12, with a standard deviation of about 4. We can tell from this that a person with an extraversion score of 22 is 2.5 standard deviations above the mean. As extraversion scores in the population are known to be approximately normally distributed, we can quickly derive that less than 1% of the population has a score this high. Normative data may sometimes be given separately for different groups, and this often enables a more precise interpretation. A situation where this might be useful could be in assessing mathematics ability among applicants for university places in mathematics courses. Here, we would be interested in how applicants compared with each other, and the fact that they all performed in the top 50% of ability in mathematics in the population at large would be relatively uninformative.
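As a concrete illustration of the binning step described at the start of this section, the sketch below groups 100 hypothetical test scores into bins of ten and prints a rough text histogram; the scores are randomly generated stand-ins, not real data.

```python
# Build a frequency histogram for 100 scores out of 100, using bins 0-9, 10-19, ..., 90-100.
import random
from collections import Counter

random.seed(1)
scores = [min(100, max(0, round(random.gauss(55, 12)))) for _ in range(100)]

bins = Counter(min(s // 10, 9) * 10 for s in scores)   # scores of 100 fall in the top bin
for lower in range(0, 100, 10):
    upper = lower + 9 if lower < 90 else 100
    print(f"{lower:3d}-{upper:3d}: {'#' * bins.get(lower, 0)}")
```

If the resulting shape is roughly bell-shaped, norm referencing can proceed along the lines described above.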

Standard scores and standardized scores

The interpretation of scores in terms of percentages within the general population is easy to understand, and it maps very well onto the pattern of the normal curve. Thus, we know that a person who is above average is in the top 50%, a person who is one standard deviation above the mean is in the top 16%, a person who is more than two standard deviations below the mean is in the bottom 2%, and so on. For this reason, the comparison of an individual's score with the norm is often given in terms of the number of standard deviations by which it differs from the mean. This is called a standard score or z score. In the earlier example of Bernard with an extraversion score of 22, it is clear that with a mean of 12 and a standard deviation of 4, his score is 2.5 standard deviations above the mean. This score of 2.5 is referred to as the standard score and is given more formally by the formula z = (score − mean score)/(standard deviation); in this case, z = (22 − 12)/4 = 2.5. The process of determining the population data and using them to provide a method for transforming raw scores to standard scores is called test standardization (see Table 2.5).
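A minimal sketch of the standardization formula just given, applied to the Bernard example; the use of the normal distribution to read off the proportion scoring higher assumes, as the text does, that extraversion scores are approximately normal.

```python
# z = (score - mean) / SD, plus the proportion of the population scoring higher.
from statistics import NormalDist

def z_score(score, mean, sd):
    return (score - mean) / sd

z = z_score(22, 12, 4)                 # 2.5, as in the worked example above
pct_above = 1 - NormalDist().cdf(z)    # roughly 0.006, i.e., less than 1% score higher
print(z, round(pct_above, 4))
```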

T SCORES

Standard scores—that is, scores in standard-deviation units on a test—normally range between −3.0 and +3.0 and have a mean of 0. This is not a very convenient way to present an individual's score; for example, schoolchildren might tend to object if told that their score on a test was −1.3! There is therefore a set of conventions that are applied to these standard scores to make them more presentable. The most common of these are the T score, the stanine, the sten, and the IQ formats. For T scores, we multiply the z score by 10 and add 50. Thus, a standard score of −1.3 becomes (−1.3 × 10) + 50 = 37; much more respectable. The advantage of this format is that it turns the scores into something that resembles a traditional classroom mark, which normally has a mean of about 50 with most scores lying between 20 and 80. Unlike most classroom marks, however, it is very informative. If we were told that someone had scored 70 on a classroom mathematics test, we would not have any information unless we knew the marking scheme—it might be that the teacher involved always gives high marks. However, if the scores were transformed into T scores, then because we already know that T scores have a mean of 50 and a standard deviation of 10, it is immediately clear that a score of 70 is two standard deviations above the mean. This is equivalent to a z score of 2, which is in the top 2% of scores on that test.

STANINE SCORES

The stanine technique transforms the standard scores into a scale running from 1 to 9, with a mean of 5 and a standard deviation of 2. This standardization is widely used, as a set of scores from 1 to 9 (rather like marking out of 10) has much intuitive appeal. There are no negatives and no decimals (by convention, scores are rounded to the nearest whole number). The advantage of the stanine over the T score is that it is sufficiently imprecise not to be misleading. Most tests have only a limited precision. A T score of 43, for example, would be equivalent to a stanine score of 4, as would a T score of 44. In a personality test, the difference between T scores of 43 and 44 would be much too small to be of any real significance, but its bold statement within the T score format does give a misleading impression of precision.

STEN SCORES

The sten score is a standardization to a scale running from 1 to 10, with a mean of 5.5 and a standard deviation of 2. Again, as with the stanine, there are no negatives and no decimals. The sten and stanine seem very similar, but there is one important difference. With the stanine, a score of 5 represents an average band (ranging before rounding from 4.5 to 5.5, which is from 0.25 standard deviation below the mean to 0.25 standard deviation above the mean). With the sten, on the other hand, there is no average as such; a score of 5 is below average, and a score of 6 is above average.

IQ SCORES

The IQ format originated when the definition of an IQ score in the Stanford–Binet tests of intelligence was changed from one based on the ratio of mental age to chronological age (the original meaning of "intelligence quotient") to the now widely used standardization approach. The IQ transformation is based on a mean of 100 and a standard deviation of 15; thus, a standard score of −1.3 becomes an IQ score of (−1.3 × 15) + 100 = 80.5 (or 80 when rounded). An IQ score of 130—that is, 100 + (2 × 15)—is two standard deviations above the mean, and an IQ score of this value or higher would be obtained by less than 2% of the population. Some IQ tests use different standard deviations; for example, the Cattell scale uses 24 points rather than 15.

IQ-style scores are best avoided by psychometricians today. They have become something of a cult, and their extrapolations bear very little relation to normal scientific processes. IQs of 160, for example, often appear as news items in the media, yet 160 would be four standard deviations above the mean. A score this high would occur in the general population only about three times in 100,000, and we would need to have had a norm group of about one million individuals to obtain the relevant comparisons. Usually, tests are standardized with fewer than 1,000. Even the WISC, with its standardization group of 2,000, had relatively few respondents at each comparison age. The behavior of individuals at probability levels of less than three in 100,000 is not something that can meaningfully be summarized in an IQ score. The whole conception of a unitary trait of intelligence breaks down at these extremes.
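The conversions described in the preceding subsections all follow the same pattern of multiplying the z score by the target standard deviation and adding the target mean. A minimal sketch, with rounding and clipping conventions that are reasonable assumptions rather than prescriptions from the text:

```python
# Convert a standard (z) score into the common reporting formats.
def t_score(z):
    return z * 10 + 50                          # mean 50, SD 10

def stanine(z):
    return min(9, max(1, round(z * 2 + 5)))     # mean 5, SD 2, whole numbers 1-9

def sten(z):
    return min(10, max(1, round(z * 2 + 5.5)))  # mean 5.5, SD 2, whole numbers 1-10

def iq(z):
    return round(z * 15 + 100)                  # mean 100, SD 15

z = -1.3
print(t_score(z), stanine(z), sten(z), iq(z))   # 37.0 2 3 80
```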

Normalization

All these standardization techniques (z score, T score, stanine, sten, and IQ format) assume that scores for the general population already have a distribution that is normal, or at least approximately normal. There are often good reasons why a normal distribution might be expected, and it is common practice in classical test development to carry out item analysis in such a way that only items that contribute to normality are selected. However, there are occasions when sets of test scores have different distributions (perhaps with a positive or negative skew, or having more than one mode), and here alternative techniques for standardization are required. Statistical techniques are available to test for the existence or otherwise of normality. Perhaps the most straightforward of these is to split the data into equal interval categories (say, about five), and to compare (using the chi-square goodness-of-fit test) the actual number of scores in each with the number that would be expected if the data had a normal distribution. However, tests of normality are not particularly powerful, so if there is doubt, it is best to attempt to normalize the data in one of the following ways.
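The chi-square check mentioned above can be sketched as follows; the choice of five equal-width bands and the use of the sample mean and standard deviation for the expected counts are assumptions about the detail, and the resulting statistic would be compared with the appropriate chi-square critical value.

```python
# Rough goodness-of-fit check: observed vs. normally expected counts in k equal-width bands.
from statistics import NormalDist, mean, stdev

def normality_chi_square(scores, k=5):
    m, s = mean(scores), stdev(scores)
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / k
    dist = NormalDist(m, s)
    chi2 = 0.0
    for i in range(k):
        a, b = lo + i * width, lo + (i + 1) * width
        if i == k - 1:
            observed = sum(x >= a for x in scores)      # top band includes the maximum
        else:
            observed = sum(a <= x < b for x in scores)
        expected = (dist.cdf(b) - dist.cdf(a)) * len(scores)
        chi2 += (observed - expected) ** 2 / expected
    return chi2
```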

ALGEBRAIC NORMALIZATION

The easiest technique for normalization is algebraic transformation. If the distribution of scores is very positively skewed, for example, with most of the scores being at the lower end of the scale, then taking the square root of each score will usually produce a set of transformed scores with a distribution closer to normal. The square-rooting process has a stronger effect on the large, more extreme figures. With extreme positive skew, log transformations can be used instead. All these techniques are particularly important if the data are to be analyzed statistically, as most statistical tests assume a normal distribution.

It has been argued that this transformation procedure is unnatural and does not do full credit to the true nature of the data. However, this view is based on a misunderstanding. For norm-referenced tests, the data really only have a true nature in relation to the scores of other individuals, while even in criterion-referenced tests, the data are rarely at the interval level of measurement—the minimum level required for distribution effects to have much meaning. Furthermore, the data only have functional meaning in relation to the tasks to which they are put (t tests, correlations, etc.), and these generally require normally distributed data. If the data "make more sense" when untransformed, then they can still be reported in this way, with the caveat that tests of statistical significance were carried out on the transformed scores.
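A minimal sketch of the transformations described above, using a simple moment-based skewness index; the data are invented and serve only to show that the square-root and log transformations pull in a long upper tail.

```python
# Effect of square-root and log transforms on a positively skewed set of scores.
from math import sqrt, log
from statistics import mean, stdev

def skewness(xs):
    m, s = mean(xs), stdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)

raw = [1, 2, 2, 3, 3, 3, 4, 5, 7, 10, 18, 40]        # hypothetical positively skewed scores
print(round(skewness(raw), 2))                        # strongly positive
print(round(skewness([sqrt(x) for x in raw]), 2))     # smaller
print(round(skewness([log(x) for x in raw]), 2))      # smaller still
```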

PERCENTILE-EQUIVALENT NORMALIZATION

For some data samples (particularly where there is more than one mode), the deviations from normality of the raw population scores are too complicated to be eliminated by a straightforward algebraic transformation. These situations can generally be dealt with by constructing a standardization transformation based on the standard score expected for each percentile within a normal distribution. To standardize in this way, the scores are first ranked, and then cumulative percentages are found for each point on the rank. Thus, for a sample with 100 respondents in which the top two respondents score 93 and 87, respectively, we mark 99% at 93 and 98% at 87, and so on. With different sample sizes, it will not be quite so straightforward, but appropriate scaling can still achieve the desired result. We can then use the known relationship between the z score and the percentile to give us standard-score equivalents for each raw score: z = 0 is equivalent to the 50% cumulative percentile, z = 1 to the 84% cumulative percentile, z = −1 to the 16% cumulative percentile, and so on. In the current example, a raw score of 93 would be equivalent to a z score of 2.33, and a raw score of 87 to a z score of 2.05. These would then normally be transformed into standardized scores (e.g., T scores).
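A minimal sketch of this rank-based procedure; the midpoint convention used here for the cumulative percentile is an assumption (it avoids infinite z values at the extremes), so it gives values close to, though not identical with, the 2.33 and 2.05 quoted above.

```python
# Percentile-equivalent normalization: rank the raw scores, convert ranks to
# cumulative percentiles, then read off the corresponding z under the normal curve.
from statistics import NormalDist

def percentile_normalize(scores):
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])   # ties ignored for simplicity
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        p = (rank - 0.5) / n                 # cumulative percentile (midpoint convention)
        z[i] = NormalDist().inv_cdf(p)
    return z
```

The resulting z scores would then be converted to T scores or another standardized format in the usual way.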

Criterion referencing

Although comparison against the norm can be useful, there are many situations where such comparisons are irrelevant and where the performance of the respondent would be more appropriately measured against some outside criterion. When a test has been constructed in this way, it is said to be criterion referenced, although the terms content referenced, domain referenced, and objective referenced are used by some authors. This contrasts with the norm-referencing situation, where the major characteristic is that the score of a respondent on a test is compared with those of the whole population of respondents. However, attempts to contrast norm- and criterion-referenced testing too strongly can be misleading, as the two approaches have much in common. First, all items must be related to some criterion. Indeed, given that there is an intention that tests be valid, each item must relate to the purpose of the test itself. This purpose can only be judged in terms of criteria, and thus criterion referencing is a necessary aspect of validity for all tests, whether criterion or norm referenced.

In fact, the situations in which it is possible to lay down a strict single criterion for a task are extremely limited, and in practice the circumstances in which we are pleased if everyone answers all the items correctly on a test are unusual. This is not because people are less than delighted at success, but rather because the only occasions on which this occurs seem to be those in which the information could probably have been known with no testing at all. If we do take on the considerable task of constructing a test, we hope to gain some information from the results, and the conclusion from a 100% success rate is more often that the test was probably rather too easy for the group in question. If all the many applicants for a single job were tested and shown to be equally good, we would of necessity need to question our criterion. Of course, there are many instances in which all we wish to know is whether a particular person can do a particular task or not. But this situation is not one that stands apart from traditional psychometrics; it is rather a special case within it.

Testing for competencies

The widespread skepticism around psychometrics during the latter half of the last century saw the emergence of the competency testing movement, within which it was argued that ability testing was completely ineffective as a tool for selection in employment settings. The view was promulgated that neither ability tests nor grades in school were effective in predicting occupational success. This was consistent with the antitesting zeitgeist of the time, and its influence is still present to this day. Its proponents also put forward the view that such tests were intrinsically unfair to minorities, and that any relationships that had been found between ability and performance were spurious, resulting from social-class differences. The belief that criterion-referenced "competencies" would be better able to predict important behaviors than would "more traditional" norm-referenced tests became widespread. Indeed, the emphasis on competencies rather than ability was central to the model on which the National Vocational Qualification (NVQ) assessment system was founded. The more extreme proponents of this approach argued that the traditional psychometric requirements of reliability and validity no longer applied to these new types of test—for them, anything associated with traditional testing was ideologically suspect. However, once we dismiss this view and insist on the same level of psychometric rigor for both types of test, it becomes apparent that the distinction between competencies and other test items is largely semantic, and that none of the tests designated as competency tests is fundamentally different from traditional tests. Whether to norm-reference or criterion-reference was always a matter of choice in both cases.

Equivalence

The fourth psychometric principle demands that psychometric tests be as fair and free from bias as possible. In psychometric circles, this is normally formulated as a requirement that tests be equivalent. This puts it in line with the other positive requirements that tests be reliable, valid, and standardized; it would seem somewhat incongruous for a principle to require the absence of something (in this case, bias). Test equivalence is not simply desirable in a fair society; it is a matter of law in many countries.

However, we should distinguish between fairness and bias. A selection procedure that is free from bias may still be perceived as unfair by candidates. Also, bias in a test may go unrecognized where it is hidden from sight, so that neither the candidate nor society at large sees the results as unfair. Cases of bias, as opposed to perceived unfairness, nearly always involve issues of group membership. Indeed, we can say that bias in a test exists if the testing procedure is unfair to a group of individuals who can be defined in some way. The major issues of test bias involve ethnicity and gender, linguistic minorities, religious and cultural differences, and disability—although other categorizations are important in particular circumstances (educational level, height, age, sexual orientation, physical attractiveness, and so on).

Of course, it is important that all tests be fair and be seen to be fair in their use. However, fairness only makes sense when viewed within a wider social and psychological perspective. Taken from an individual perspective, unfairness occurs whenever a wrong decision is made about an individual based on a test result. Yet wrong decisions are made all the time, particularly when an individual's score is near the cutoff. If, for example, a student needs an A grade to gain entrance to a degree course, the examiners have decreed that an A is equivalent to 80% on the test, and the student has obtained 79%, then there must be a good statistical chance that a mistake has been made simply because of error of measurement. From the point of view of the examiners, who will be looking at the overall perspective, the best that can be hoped for is that mistakes will be kept to a minimum. Given the limited number of places available and the lack of definitive knowledge of what makes for success in a course, it would be unrealistic to hope for any high degree of accuracy near the cutoff point. Yet it is unlikely that the unsuccessful student will find this argument acceptable.

A consideration of this issue rapidly shows us that there exists a series of social conventions within society concerning what is considered fair and unfair in these circumstances; these conventions are informative, as they often do not agree with conceptions of bias as viewed by the statistician. For example, consider a situation where half of the students taking an examination go, of their own volition, to a rather lively party the night before. Suppose that these students score on average five points lower on the examination than the other students. In these circumstances, it could be argued from a statistical point of view that the scores of the partygoers are subject to statistical bias, and it would be reasonably easy to adjust for this bias—say, by adding five points to the score of any person who was at the party. It is clear that this procedure, which may give a statistically better estimate of the true score, would be extremely unlikely to be accepted by the examinees or by society, and would even make the partygoers feel a little uncomfortable. There is a general set of conventions about what examinees will accept as fair. If an examinee studies a set of possible examination questions that, although plausible, do not in fact occur in the examination, then a low mark will tend to be put down to bad luck rather than unfairness, even though a case for the latter could certainly be made. The ideology of these conventions is complex, as has been found by examination boards and employers who have to make allowance for dyslexia. Exceptions for dyslexia used to be allowed so long as it could be diagnosed as a medical condition by an appropriately qualified professional.
Today, equal-opportunity legislation requires that any claim to dyslexia by a candidate be taken at face value, even in the absence of supporting evidence.

There are also conventions regarding corrections for guessing. These are often applied where all the items in a test are statements that have to be endorsed as either true or false; students who guess have a 50% chance of getting the correct answer for each. If there are, say, 100 items, then a student who knows nothing but guesses will have an expected score of 50. One way of reducing this effect is to use multiple-choice items rather than true/false; however, even here there can be a guessing artifact. If there are five alternative responses within the multiple-choice questions, then the student still has a 20% chance of getting each item right by guessing, giving them an expected score of 20 on a 100-item test. A technique for tackling this problem is to apply a correction for guessing. There are several formulae for this in the literature, but the most common is C = R − W/(N − 1), where C is the corrected score, R is the number of correct responses, W is the number of incorrect (wrong) responses, and N is the number of alternatives available. With true/false-style items, the number of alternatives is two, so the formula reduces to number right minus number wrong. Despite this logic, such corrections are not popular. Many guesses are not random but inspired, and hence have a larger chance of being correct than the formula gives credit for. Also, to guess correctly is often seen as more than just luck. Children, in particular, do not take kindly to having their correct guesses thrown away by this procedure. From their perspective, if an item is guessed and the right answer is achieved, there is an entitlement to the mark obtained.

While bias can exist anywhere that persons are divided by group, countries differ as to their priorities. Most countries have regulations requiring that selection procedures be free from ethnic bias, making it illegal to select individuals based on racial characteristics alone. For example, the Equality Act in the UK and equivalent legislation in a number of US states lay down the law concerning racial discrimination in employment, not just in terms of job selection but in promotion as well. They also forbid most instances of sex discrimination that result from the inappropriate use of psychometric tests. Both direct and indirect discrimination are illegal under these acts. (Indirect discrimination occurs when, even in the absence of overt discrimination, the chance of obtaining a job or a promotion is dependent on a requirement that is more likely to be met by one group than by another.) These acts also cover disability, age, and other forms of discrimination. In addition to acts of Parliament or Congress, statutory bodies in many countries have been established to oversee policy generally. They act by supporting test cases, investigating employers suspected of discrimination, and providing guidelines for good practice.

All this legislation has had a profound influence on how psychometric tests are used. In the US, many of the older tests are unable to meet the stringent requirements for fairness defined in the Constitution or put in place by the courts. Indeed, many intelligence tests, including the earlier versions of the Stanford–Binet Tests of Intelligence and the Wechsler Intelligence Scale for Children (WISC), were outlawed last century in many US states. Test development since then has been far more systematic in its attempts to eliminate bias, and new versions of tests, including the fourth and fifth revisions of the WISC, have adopted stringent procedures to achieve fairness in use. Traditionally, there were three major categories of bias: item bias, now generally referred to as differential item functioning (DIF); intrinsic test bias, now generally referred to as lack of test invariance; and extrinsic test bias, now generally referred to as adverse impact.
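Returning to the correction-for-guessing formula quoted earlier in this section, a minimal sketch shows how it removes the expected benefit of blind guessing in the two cases discussed above.

```python
# C = R - W/(N - 1): corrected score from rights, wrongs, and number of alternatives.
def corrected_score(right, wrong, n_alternatives):
    return right - wrong / (n_alternatives - 1)

print(corrected_score(50, 50, 2))   # pure guessing on 100 true/false items -> 0.0
print(corrected_score(20, 80, 5))   # pure guessing on 100 five-option items -> 0.0
```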
The presence or absence of bias can never be a yes-or-no matter, as groups will always differ to some degree, given a large enough sample size. The important question is: What amount of difference is acceptable? This is particularly important when tests are used for selection. As the alternative is not selecting at all, the question is not "Is this test biased?" but rather "How biased is this test in comparison to the possible alternative modes of selection?" Thus, the focus shifts away from the demonstration of statistical significance per se and toward the analysis and estimation of effect sizes. It is prudent to note that with huge sample sizes, such as those achieved in educational testing, statistical power is so high that it will always be possible to detect bias at some level. This contrasts with legal requirements for tests to be "bias free." Clearly, freedom from bias can only be demonstrated to a certain degree; there will always be some level of bias that should be considered ignorable. The same or similar issues and problems affect tests used in medicine, education, special-needs diagnosis, and recruitment testing in the business world.

Differential item functioning

Item bias (differential item functioning [DIF]) is the most straightforward form of bias, insofar as it is easy to identify and therefore to remedy. It describes a situation where the bias exists within the individual items of which the test is composed. For example, items developed in the US that ask about money in terms of dollars and cents would not be appropriate for use in other countries. Linguistic forms of DIF are the most common, especially where idiomatic usage of language is involved. This is a difficulty with a diverse and widely used language such as English. If a person learns international scientific English rather than idiomatic English, they may well be perfectly fluent within a work situation but still be disadvantaged by items generated by users of English as a first language. Similarly, dialects that use different grammatical structures in, for example, the use of double negatives may again be sites of bias. In most languages, repeating a negative represents emphasis; in standard English, the negatives cancel each other out, resulting in a positive. Hence expressions such as "I ain't got no time for breakfast" are grammatically ambiguous and demand contextual information to interpret. Also, embedded-figures items of the kind found in many nonverbal pattern-recognition tests may benefit those whose written language makes them more familiar with ideographic character forms (for example, Chinese).

An issue related to DIF or item bias is that of item offensiveness. This is not the same thing, as many offensive items can be unbiased and many biased items may appear inoffensive. For example, in a well-known Stanford–Binet item dating from 1938, the child is asked to say which of two pictures of a girl or boy is ugly and which is attractive. Offensive items may interfere with performance on the subsequent items in the test. But regardless of this, offensive items should not be included, for simple reasons of politeness. Items that may be sacrilegious should obviously be avoided, as should items with irrelevant sexual connotations. Racism and sexism in items are normally illegal, and it should be remembered that items that draw attention to prejudice can often be as disturbing as the items that exhibit it. Examples here might be the use of anti-Semitic passages from Dickens or racist passages from Shakespeare within English-literature examinations. The use of stereotypes, for example men and women within traditional male and female roles, conveys expectations about what is normal and should also be avoided.

Measurement invariance

Intrinsic test bias (lack of test invariance, or lack of measurement invariance) exists where a test shows differences in the mean scores of two groups that are due to the characteristics of the test. This may be independent of, or in addition to, any difference between the groups in the trait or function being measured. Absence of measurement invariance can be due to the test having different reliabilities for the two groups or to group differences in the validity of the test (e.g., the same trait being measured in different proportions in the two groups, the measurement of an additional trait in one group, the measurement of unique traits in each group, or the measurement of nothing in common when applied to the two groups). Thus, for example, if a general-knowledge test were administered in English to two groups—one of speakers of the language as a mother tongue and the other of second-language learners—then while the test may be measuring general knowledge in one group, it would be highly contaminated by a measure of competency in English in the other group. The validities in the two groups would thus be different.

Differential content validity is often a primary cause of lack of measurement invariance. Thus a test that has been constructed to match the characteristics of successful applicants from one particular group may not be so valid when applied to another, and hence it is particularly likely to produce lower test scores for that group. Items that might in fact have favored the second group would simply not have been included in the test. One way to reduce the impact of this lack of invariance is to apply different cutoff scores for different groups—normally a lower pass mark for the groups that have been disadvantaged—an approach that goes under the general term of positive discrimination. However, over the decades this has had a controversial history, particularly in the US, where recourse to the Constitution has frequently led to such procedures being declared illegal, most specifically in terms of university entrance requirements. In the UK, it is illegal under the 2010 Equality Act, as it breaches the principle of equal treatment for all. All of this has been in spite of attempts by many psychometricians to model the parameters involved statistically. These approaches generally fell out of favor when it was recognized that in most real-world situations, the reasons for differential selection at the group level rested on differences in society rather than psychometric inadequacies in the tests themselves. For this reason, positive-discrimination programs have largely been replaced by systems based on proportionally representative quotas designed to address the adverse impact of all the factors contributing to inequality.

Adverse impact

Adverse impact is found when decisions leading to inequality are made following the use of the test, regardless of whether the test has measurement invariance. This can occur when two different groups have different scores on a test due to actual differences between the groups. For example, children of an immigrant group that lives in a deprived inner-city area—where the schools are of poor quality—are less likely to succeed at school and achieve the qualifications necessary for university entrance, and consequently to gain access to more desired occupations. Where the community suffers deprivation for several generations, the lack of opportunity is reflected in a lack of encouragement by parents, and a cycle of deprivation can easily be established.

Most societies have legislation in place to address the concerns of groups that are underrepresented in important employment positions or have a high rate of unemployment. Such groups are designated as having "protected characteristics." In the US, these include race, gender, age (40 and over), religion, disability status, and veteran status. In the UK, they additionally include marriage and civil partnerships, pregnancy and maternity, belief, sexual orientation, and gender reassignment. Discrimination of this type can be defined as either direct (roughly translated as deliberate) or indirect—that is, coming about from some incidental cause. The reasons for indirect discrimination are complex, varied, and often hard to pin down. Evidence for such discrimination is very often obscure or unclear, and hence adverse impact is minimally defined by the existence of an adverse selection ratio, described as the four-fifths rule. This occurs where the selection rate for a protected group is less than 80% (that is, four-fifths) of that for the group with the highest rate. Where breaches occur, the US Equal Employment Opportunity Commission has issued guidelines for selectors in situations prone to adverse impact. They argue that affirmative-action programs should in general de-emphasize race and emphasize instead the educational level of the parents, the relative educational level of the peer group, the examination level, the quality of schooling, and the availability of compensatory facilities. They further recommend job respecification, the introduction of specific training programs, and the equalization of numbers of applications by changing advertising strategies.
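As an illustration of the four-fifths rule just described, the following sketch compares hypothetical selection rates; the figures are invented and not drawn from any real case.

```python
# Flag groups whose selection rate falls below 80% of the highest group's rate.
def impact_ratios(selection_rates):
    highest = max(selection_rates.values())
    return {group: rate / highest for group, rate in selection_rates.items()}

rates = {"group A": 30 / 60, "group B": 12 / 40}      # hired / applicants
for group, ratio in impact_ratios(rates).items():
    verdict = "below the four-fifths threshold" if ratio < 0.8 else "within the threshold"
    print(f"{group}: ratio {ratio:.2f} ({verdict})")
```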

Summary

Once a test has been constructed, it is necessary to be able to describe some of its general properties and to present the test in a form in which it can be easily used. The reliability of the test must be found and quoted in a clear and precise form, such that future users are able to obtain some idea of the amount of error that might be associated with the use of the test for their own purpose. Data on the validity of the test must be given for as broad a range of applications as possible. Data must be provided on the population norms, both generally and for subgroups that may be of interest. Information should be given on the way the data should be normalized and transformed into a common format to enable comparisons between the results of different studies. Finally, the test should be as free as possible from sources of bias: differential item functioning, lack of test invariance, and adverse impact. DIF has proved to be the most amenable to direct amendment, particularly given recent advances in the application of advanced statistical modeling techniques. Lack of measurement invariance has received very wide consideration by psychometricians and policy makers alike, with the result that today, the use of any test in the US must be accompanied by positive proof that it is not biased in this way. Issues of adverse impact have led to a much closer collaboration between psychometricians and policy makers generally, and to a recognition that ideological issues are of fundamental importance to the theoretical basis of psychometrics. The political links between psychometrics and popular conceptions of human rights continue to develop through the legal structures at both legislative and individual-case levels in many countries.

4

Psychometric measurement

In the late 19th century, the physicist William Thomson, also known as Lord Kelvin—arch-measurer and originator of the Kelvin scale of absolute temperature—opined, "Whatever exists at all exists in some amount, and can therefore be measured" (Thomson, 1891). Toward the end of that century and somewhat in his dotage, he made the additional claim that "there is nothing new to be discovered in physics now. All that remains is more and more precise measurement." Of course, after quantum theory, we now know that he was wrong. But no one can deny that more and more precise measurement of, for example, the speed of light or even time itself has played a key part in scientific advancement. Could the same be true for psychometric measurement?

In this chapter, we address several of the mathematical concepts central to psychometrics. These come from several sources:

1 True-score theory, originally conceived to make sense of issues underlying disagreement between examiners in the awarding of school grades
2 Factor analysis
3 Vector algebra, a mathematical technique designed to understand physical systems involving both force and direction, such as gravity
4 Functional psychometrics, the black box, and explainability

Despite the huge advances made in the 20th century, psychological measurement was not without its detractors. Originally most, but not all, came from those with an ideological predisposition to oppose testing in schools. Others were more persuaded by a functional “black box” approach, today much favored within machine learning. These proposed alternatives, such as criterion-referenced testing and predictive analytics, also had their impact on psychological assessment. Modern psychometrics is a synergy of all these ideas, interests, and methods.

True-score theory

True-score theory was at first an attempt to introduce more scientific rigor to grading exams. Essay graders often disagreed with each other, and there was confusion about what a grader's mark on an essay actually meant. What was it a measure of? In 1888, Francis Ysidro Edgeworth thought he had an answer. Each examiner was aspiring to obtain a mark that represented the candidate's "true score" on the essay, but with varying levels of success. He wrote:

"If we tabulate the marks given by the different examiners they will tend to be disposed after the fashion of a gendarme's hat. I think it is intelligible to speak of the mean judgment of competent critics as the true judgment, and deviations from that mean as errors. This central figure which is, or may be supposed to be, assigned by the greatest number of equally competent judges, is to be regarded as the true value, just as the true weight of a body is determined by taking the mean of several discrepant measurements."

This is the first known definition of what is now called true-score theory, which later became known as latent-trait theory. It is fundamental to classical test theory. It states simply that any score on an item or a test by a respondent can be represented by two component parts: the respondent's true score on the item or test and some error of measurement. This is traditionally stated as X = T + E, where X symbolizes the observed score, T the true score, and E the error.

If all one knows about a test is that a particular respondent obtained a score of X, then one knows nothing at all. In these circumstances, the error and true score are inextricably mixed. For example, X may be 5, yet this could be the case if T = 3 and E = 2, or equally if T = 110 and E = −105. Thus, an observed score X on its own is of no use whatsoever. It is the true score T that we are interested in, and we need additional data to estimate this; primarily, we need some idea of the expected size of the error term E. To put this another way, we cannot know how accurate a score is unless we have some idea of how inaccurate it is likely to be.

The theory of true scores takes us through various techniques for obtaining an estimate of the size of the error. This is typically done through the process of replication, either by obtaining several scores from the same respondent or by obtaining scores from many different respondents. To successfully estimate true scores from such data, three assumptions must be made.

First, all the error terms E associated with observed scores X are assumed to be random and normally distributed. In the context of grading an exam, for example, graders are assumed to randomly overestimate or underestimate the true score T when giving their grades (or observed scores). This means that their errors E are random. Moreover, they are more likely to make small errors than large errors—or, in other words, their errors are normally distributed with a mean equal to 0. This is the same assumption made when we argue that an unweighted coin tossed a number of times, with an equal chance of heads or tails on each toss, is very unlikely to land the same way every time; it is much more likely to land as a head about half the time and as a tail about half the time. The normal curve is itself derived from this theory of random error, known as probability theory.

Second, true scores are assumed to be uncorrelated with the errors. That is, the distribution of errors is approximately the same, regardless of whether a high, medium, or low score has been observed. In the context of grading an exam, for example, this can be rather more problematic. There are circumstances under which this assumption fails, for example when too many of those in a sample obtain very high or very low scores, perhaps because the test is too easy or too difficult. But these deviations are all adjustable (in principle at least) by various algebraic transformations of the raw data.


Third, it is assumed that the observed scores X from the same respondent are statistically independent of each other. In the context of grading an exam, for example, if the second examiner had already seen the mark of the first examiner before making their own mark, then the two marks would not be statistically independent.

If the three assumptions of true-score theory are made, then a series of very simple equations falls into our lap. First, we can define the true score T statistically as an average of a very large number of observed scores X from the same person. As the number of observations approaches infinity, the error terms E, being random, cancel each other out and leave us with a pure measure of the true score. Of course, it is not possible to take an infinite number of measures on the same person—or indeed, even more than a few such measures—without changing the measuring process itself because of the respondent's fatigue, practice effects, and so on. But this is unimportant from the point of view of the statistical definition, which states that the true score is the score that we would obtain were this possible. Second, from true-score theory and its assumptions, we are able to derive the amount of error and hence get an idea of a test's accuracy.

Although true-score theory has been widely criticized and many attempts have been made to improve it, the alternatives are generally complicated and usually turn out to have flaws of their own. For most of the last century, true-score theory continued to provide the backbone for psychometrics. While the assumptions of true-score theory can never be perfectly met, they are a good enough approximation in most situations and have stood the test of time.
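A small simulation (not from the text) makes the definition of the true score as a long-run average concrete: with random, normally distributed error, the mean of repeated observed scores drifts toward the true score as the number of replications grows.

```python
# X = T + E with random normal error: averaging many observations recovers T.
import random
from statistics import mean

random.seed(0)
TRUE_SCORE = 50

def observe():
    return TRUE_SCORE + random.gauss(0, 5)   # error with mean 0 and SD 5

for n in (1, 10, 100, 10_000):
    print(n, round(mean(observe() for _ in range(n)), 2))
```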

Identification of latent traits with factor analysis

True-score theory introduced the idea of obtaining greater accuracy by increasing the amount of information available for the estimation of the true score, whether this be from more examiners, more respondents, or more items. In a multi-item test, each item is there because it is believed to be related to the same latent trait, and thus provides some information about the true score on this latent trait. The true score on a classical test is the sum of the true scores on each of its items. But for every item that is endorsed, there will be some degree of error—as well as, perhaps, something specific about that item that differentiates it from all the others. This was the insight that inspired Charles Spearman's discovery of factor analysis in 1905.

Spearman's two-factor theory

Spearman was addressing the problem of how to interpret a uniformity of structure that he observed in correlation matrices, such as those consisting of correlations between various intelligence subtests (verbal, numerical, and so on) that he believed could be combined to generate an overall score on an intelligence test, which he called general intelligence, or "g." Each subtest will have a correlation with all the other subtests, so there can be a fair number of correlations involved. For example, with only five subtests, we have 10 correlations (4 + 3 + 2 + 1); with 20 subtests, it would be 190 (19 + 18 + 17 + … + 1). These can be tabulated in a matrix (see Table 4.1), where rows and columns represent subtests and the cells represent the correlation between the row subtests and column subtests. Such matrices can be based on scores on different tests, grades on different subtests, or even scores on individual items.

Table 4.1 A correlation matrix, representing correlations between five subtests (a, b, c, d, and e) in a psychometric test

         ("g")    a       b       c       d       e
("g")    (1.0)    (.9)    (.8)    (.7)    (.6)    (.5)
a        (.9)     (.81)   .72     .63     .54     .45
b        (.8)     .72     (.64)   .56     .48     .40
c        (.7)     .63     .56     (.49)   .42     .35
d        (.6)     .54     .48     .42     (.36)   .30
e        (.5)     .45     .40     .35     .30     (.25)

The example given in Table 4.1 illustrates his approach. Ignore for the moment all the figures in parentheses. First, Spearman arranged all the variables in what he called hierarchical order, with the variable that showed the highest general level of intercorrelation with other variables on the left, and the variable with the least correlation on the right. He then drew attention to an algebraic pattern in the relationship between the variables. He noted that the product of the correlation between subtests a and b and the correlation between subtests c and d tended to be equal to the product of the correlation between subtests a and c and the correlation between subtests b and d:

rab × rcd = rac × rbd.

Moreover, he observed that this could be extended to all the sets of four subtests (tetrads); thus, for example,

rbc × rde = rbe × rcd;
rac × rbe = rae × rbc;

and so on. He measured the extent to which this rule held as the "tetrad difference." If the four corners of the tetrad are called A, B, C, and D, then the tetrad difference is AD − BC. Thus, in Table 4.1, the tetrad differences are:

.72 × .42 − .63 × .48 = 0;
.56 × .30 − .42 × .40 = 0; and
.63 × .40 − .56 × .45 = 0.

Spearman noted that such a pattern of relationships would be expected if a, b, c, d, and e were subtests of intelligence and each represented a combination of two elements: general intelligence ("g") that contributed to each of the subtests and specific intelligence that was unique to each. Thus, if subtest a were a test of arithmetic, then the components of subtest a would be "g" and a specific ability in arithmetic. If subtest b were verbal ability, this would be composed of "g" and a specific verbal intelligence. He argued that it was the common existence of "g" in all subtests that caused the correlation. He called this his two-factor theory because each subtest was composed of two elements: "g" and something specific to that subtest. Each subtest was composed of scores on just two factors. The general factor was common to all subtests, but the specific factors were all unique to each.

By including the parenthetical components in Table 4.1 within the calculation of the tetrad difference, he developed the first-ever technique for factor analysis. For example, if x is the unknown value where column a and row a cross in Table 4.1, it can be obtained from the tetrad difference formula:

x × rbc = rab × rac;

that is,

x × .56 = .63 × .72;

thus,

x = .81.

He called this value the saturation value of "g" on a, and deduced that the square root of this value would give the correlation of a with "g," general intelligence. Thus, in Table 4.1, the column of figures under "g" represents the factor loadings of each of the five subtests on general intelligence. We see that subtest a is very highly saturated, while subtest e is less so. Of course, the example in Table 4.1 is artificial. In practice, the arithmetic would never be this neat, and the tetrad differences would never come to exactly zero. However, we can calculate such differences, find the average of the estimated values for each saturation, and use this to estimate factor loadings on "g."

Spearman took the process a step further. If his two-factor theory were true, the loadings could be used to estimate expected values for each correlation. By comparing these expected values with the actual correlations, he could find the goodness of fit of his theory. He was also able to subtract the expected values from the observed correlations and carry out the process again on the residuals, leading to the extraction of a second factor. This might occur if some of the specifics were not in fact unique. Spearman's insight was brilliant, and it was many decades before statisticians were able to confirm his theory statistically and catch up with his intuition.
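The arithmetic of Table 4.1 can be reproduced in a few lines; the sketch below is not Spearman's own procedure beyond the steps just described, but it shows a tetrad difference coming out at zero and recovers the saturation of subtest a and its loading on "g".

```python
# Tetrad differences and the "g" saturation of subtest a, from the Table 4.1 correlations.
r = {
    ("a", "b"): .72, ("a", "c"): .63, ("a", "d"): .54, ("a", "e"): .45,
    ("b", "c"): .56, ("b", "d"): .48, ("b", "e"): .40,
    ("c", "d"): .42, ("c", "e"): .35, ("d", "e"): .30,
}

def corr(x, y):
    return r[(x, y)] if (x, y) in r else r[(y, x)]

# Tetrad difference r_ab*r_cd - r_ac*r_bd, expected to be approximately 0 under a single "g".
print(round(corr("a", "b") * corr("c", "d") - corr("a", "c") * corr("b", "d"), 4))

saturation_a = corr("a", "b") * corr("a", "c") / corr("b", "c")   # x = r_ab * r_ac / r_bc
print(round(saturation_a, 2), round(saturation_a ** 0.5, 2))      # 0.81 and loading 0.9
```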

Vector algebra and factor rotation

Spearman achieved his factor-analytic technique using numbers alone. However, it was difficult to visualize when more factors were involved. Improved understanding became possible when graphical techniques were used to represent the data. These techniques (see Figure 4.1), which have produced visual representations of the process, have had a major impact on the development of psychometrics.

Visualizing the relationships between variables made the conceptualization of psychometric issues much easier. Graphical representations of ideas have emerged again and again in psychology, from multidimensional scaling in psychophysics to the interpretation of repertory grids in social and clinical psychology. They are fundamentally based on models provided by vector algebra, in which two values are ascribed to a variable: force and direction. Within factor-analytic models, variables are represented by the force element of the vector, which is held constant at a value of 1, while the angle between two variables represents the correlation between them, in such a manner that the cosine of this angle is equal to the correlation coefficient.

Figure 4.1 Spatial representation of the correlation coefficient. A correlation of 0.50 between two variables can be graphically represented by two lines of the same length that have an angle between them whose cosine is 0.50 (60°).

Thus, a correlation of .50 between the variables a and b (as in Figure 4.1) is represented by two lines oa and ob of equal length, with an angle between them whose cosine is .50, that is, 60°. A correlation of .71 would be represented by an angle whose cosine is .71 (45°).

There are many useful characteristics that follow from this visual representation of the correlation. In Figure 4.1, we can see that one of the vectors, ob, is drawn horizontally, and the other, oa, above it is drawn at angle aob, equal to 60°. A perpendicular ad is then dropped onto ob from point a to point d. If we assume that ob and oa both have a length of 1, the distance od will then be equal to the cosine of the angle between the vectors, and therefore to the correlation itself. Also in the figure, we see that a vertical axis oe at a right angle to ob has been drawn, and a has been projected onto it by the horizontal line ae. Through Pythagoras, we know that as oa has a length of 1, then od² + oe² = 1. This gives us a graphical restatement of the statistical formula r² + (1 − r²) = 1, which tells us how we can use the correlation coefficient to partition variance. To give an example, if the correlation between age and a measure of reading ability is .50, then we can say that .50²—that is, .25 or 25%—of the variance of reading ability is accounted for by age. It also follows that 75% of the variance in reading ability is not accounted for by age. This is represented graphically in the figure: the cosine of the angle between oa and ob (60°) is .50, and therefore the distance od is .5. What is the distance oe? Well, its square is .75 (1 − .25 by Pythagoras), hence oe must be the square root of this, i.e., .87. This number represents a correlation, but it is a correlation between reading ability and some hypothetical variable, as no vector oe was originally given by the data. However, we can give a name to this variable, which is itself a latent trait; we could call it "that part of reading ability that is independent of age." This value could be estimated by partial correlation analysis and used in experimental situations to eliminate age effects.

While considering the graphical representation of correlations, think about two special cases. If r = 0, the variables are uncorrelated. The angle between oa and ob is 90°; the cosine of 90° is 0. The variables are thus each represented by their own separate spatial dimensions; they are said to be orthogonal. If r = 1, then the angle between oa and ob is 0° (the cosine of 0° being 1), and they merge into a single vector. Thus, the variables are graphically as well as statistically identical and are measures of the same underlying latent trait. From this very simple conception, we can demonstrate a fundamental idea of factor analysis: while the two lines oa and ob represent real variables, there is an infinite number of hidden variables that can exist in the same space, represented by lines drawn in any direction from o. The hidden variable oe represents a latent variable: that part of oa that is independent of ob.

There are many applications of these models. If oa were, for example, humans' weight, and ob their height, then oe would be that part of the variation in the weight of human beings that is independent of height. It is thus a measure of obesity, not measured directly but obtained by measuring weight and height and applying an appropriate algorithm. Another example relevant to psychometrics may have ob as the score on a psychological test and oa as a measure of the criterion against which it is to be validated; thus the angle between oa and ob, representing the correlation between a (the test score) and b (the criterion), becomes a measure of validity, and oe becomes the aspect of the criterion that is not measured by the test.
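The trigonometric relationships described above are easy to verify directly; this minimal sketch, using the text's age and reading-ability example, recovers the 60° angle, the 25% shared variance, and the .87 correlation with the age-independent part of reading ability.

```python
# Correlation as the cosine of the angle between unit vectors, and the variance partition.
from math import acos, degrees, sqrt

r = 0.50                          # correlation between age and reading ability
angle = degrees(acos(r))          # 60.0 degrees
shared = r ** 2                   # 0.25: variance in reading ability accounted for by age
independent = sqrt(1 - r ** 2)    # about 0.87: loading on "reading ability independent of age"
print(round(angle), shared, round(independent, 2))
```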

Moving into more dimensions

A flat piece of paper, being two-dimensional, can accurately represent at most two totally independent sources of variation in any one figure. To extend this model to factor analysis, we need to conceive not of one correlation but of a correlation matrix in which each variable is represented by a line of unit length from a common origin and the correlations between the variables are represented by the angles between the lines. Taking first the simple case of three variables x, y, and z (see Figure 4.2) and the three correlations between them, if the angle between ox and oy is 30°, between oy and oz is 30°, and between ox and oz is 60°, then it is quite easy to draw this situation on our piece of paper. This represents correlations of .87 (the cosine of 30°) between x and y and between y and z, and a correlation of .5 (the cosine of 60°) between x and z. However, if all these angles between ox, oy, and oz were 30°, it would not be possible to represent this graphically in two dimensions. It would be necessary to conceive of one of the lines projecting into a third dimension to represent such a matrix.

Figure 4.2 Figural representation of the correlations between three variables. The variables represented are x, y, and z, where the correlations between x and y and between y and z are .87 (cosine of 30°) and that between x and z is .50 (cosine of 60°).


With more than three variables, it may be that as many dimensions as variables are required to “draw” the full matrix, or it may be that some reduction is possible. Factor analysis seeks to find the minimum number of dimensions required to satisfactorily describe all the data from the matrix. Sometimes matrices can be reduced to one dimension—sometimes two, three, four, five, and so on. Of course, there is always a certain amount of error in any measurement, so this reduction will always be a matter of degree. However, the models used will seek solutions that can describe as much variance as possible and will then assume that all else is error. In 1931, Louis Leon Thurstone developed a technique for carrying out factor analysis within the vector model that has just been described. He extracted the first factor by the process of vector addition, which is in effect the same process as that for finding the center of gravity in another application of vector algebra, and is thus called the centroid technique. The centroid is a latent variable, but it has the property that it describes more variance than any other dimension drawn through the multidimensional space, and we can in fact calculate this amount of variance by summing the squares of all the projections onto this vector from all the observed variables. This sum is called the eigenvalue of the factor, which we will discuss in more detail later in this chapter. The position of the first factor can be described by reporting the correlations between it and each of the observed variables, which are functions of the angles involved. The first factor within the centroid technique describes a basic fixing dimension, and when its position has been found, it can be extracted from the multidimensional space so that further factors are sought only in regions at right angles to it. The cosine of an angle of 90°, that is, a right angle, is 0; and thus factors represented by lines drawn at right angles to each other are independent and uncorrelated. It is for this reason that factors are sometimes called dimensions, as figuratively they behave like the dimensions of space. A unidimensional scale is one that requires only one factor to describe it in this way. If further factors exist, a unidimensional scale will not be adequate to fully describe the data, which will be described by further consecutive dimensions.

Multidimensional scaling

There are many similarities between factor analysis and the process known as multi­ dimensional scaling. Multidimensional scaling originally achieved popularity among psy­ chophysicists and proved particularly useful in practice for defining psychophysical variables. It was by this technique, for example, that the idea of there being only three types of color receptors in the retina of the eye was generated, as it was found that people required only three color dimensions to describe all colors. Multidimensional scaling also provided a useful model for the behavior of the hidden values in parallel distributed processing machines (Rumelhart & McClelland, 1986). These models of parallel-processing computation show similarities with the actual arrangement of the system of connections between neurons in the human brain, and played an important part in the development of machine learning. It is likely that representational analogies of the type used in multidimensional scaling may turn out not just to be a convenient tool but also to tell us something about how the networks of neurons in the brain function. In much the same way that multidimensional scaling models have provided a con­ ceptual underpinning for psychophysics, factor analysis fulfills a similar role for psy­ chometrics. Its success may be due to more than mere statistical convenience: it could be


that the figural representation of factor analysis is so powerful because it mirrors the cognitive processes whereby human beings make judgments about differences between objects (or persons). In fact, a particular neural architecture found within the human brain has been shown to carry out factor analysis when emulated within a computer. There is therefore the possibility that the brain itself uses factor analysis when trying to make sense of large amounts of data.

Application of factor analysis to test construction

In psychometric test construction, the factor analysis of the intercorrelations between test items has provided an alternative to traditional item analysis and proven particularly useful in examining a test specification's conceptual structure and the bias in its scales.

Eigenvalues

The original factor-analytic transformation generates as many factors as there are variables and calculates a value, called an eigenvalue, for each. The original set of variables defines the total amount of variance in the matrix, with each variable contributing one unit. Therefore, with a factor analysis of data on 10 variables, the total amount of variance present will be 10 units. The factor analysis rearranges this variance and allocates a certain amount to each factor while conserving the total amount. The quantity allocated to each factor is its eigenvalue, such that the sum of the eigenvalues of all the original factors adds up to the total number of variables. With 10 variables, for example, there will be 10 original factors, and the sum of the eigenvalues of these factors will be 10. The larger the eigenvalue of a factor, the greater the amount of the total variance it accounts for, and the more important it is. The first factor typically accumulates a fair amount of this common variance, and subsequent factors progressively less. At some point, they start having eigenvalues of less than 1, indicating that they account for less variance than is accounted for by a single variable.

Identifying the number of factors to extract using the Kaiser criterion

As the purpose of factor analysis is to extract information spread across many variables and represent it with fewer dimensions, factors with eigenvalues of less than 1 are typically discarded. This intuitive rule is sometimes called the Kaiser criterion after its original advocate, Henry Kaiser. Sometimes, however, the situation is more complicated. For example, there may be too many unreliable variables or uncorrelated variables, leading to solutions containing many factors with eigenvalues greater than 1. Also, sometimes, there are many factors with eigenvalues around the cutoff threshold of 1. It would make little sense to use an eigenvalue criterion level of 1 where the eigenvalues for the first seven factors were 2.10, 1.80, 1.50, 1.10, .90, .60, and .50. Although the Kaiser criterion would suggest retaining the first four factors, it might make more sense to inspect the three- and the five-factor solutions as well: there seems to be a major drop in variance explained between the third and fourth factors, and there seems to be little difference in variance explained by the fourth and fifth factors.
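The bookkeeping described here is easy to verify numerically. The sketch below (Python with NumPy; the correlation matrix is invented purely for illustration) shows that the eigenvalues of a correlation matrix sum to the number of variables and how the Kaiser criterion would then be applied:

```python
import numpy as np

# A small, made-up correlation matrix for four variables.
R = np.array([
    [1.00, 0.60, 0.50, 0.10],
    [0.60, 1.00, 0.55, 0.05],
    [0.50, 0.55, 1.00, 0.15],
    [0.10, 0.05, 0.15, 1.00],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]       # largest eigenvalue first
print(eigenvalues)
print(eigenvalues.sum())                        # 4.0 -- one unit of variance per variable
print(eigenvalues / len(R))                     # proportion of total variance per factor

kaiser_retained = int((eigenvalues > 1).sum())  # Kaiser criterion: keep eigenvalues above 1
print(kaiser_retained)
```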


Figure 4.3 Plot of eigenvalue vs. factor number demonstrating a Cattell scree.

Identifying the number of factors to extract using the Cattell scree test

An alternative to the Kaiser criterion is provided by the so-called Cattell scree test, which uses the metaphor of the pattern of pebbles on a seashore for the shape of a plot of eigenvalues vs. factor numbers. Cattell suggested that a scree might be expected just at the point that divides important factors from noise. Figure 4.3 presents eigenvalues of factors extracted from hypothetical data. The scree is clearly visible here, and the scree test would suggest extracting five factors. Other techniques for identifying the number of factors to extract

However, some data produce no scree, so alternatives are needed to decide on the number of factors to extract. In fact, the best guide is typically given by examination and interpretation of the meanings of the solutions containing different numbers of factors. Generally, it is best to retain as many factors as can be reasonably interpreted by their correlations with original variables. For example, the first factor could represent a general factor, the second factor age, the third factor bias, the fourth factor a potential subscale factor, and so on. Eventually, factors will be reached that are uninterpretable, and this is a good place to stop. An additional technique, particularly useful when there are large samples, is to break down the sample into subgroups and investigate the extent of si­ milarity between the separate factor analyses. The first factors that look similar across subgroups are good candidates to be retained. Factor rotation

The idea of rotating factors has been around for some time and was of particular interest to Thurstone in the 1930s. It is useful when multiple factors are required to adequately


Figure 4.4 Rotation of orthogonal factors.

describe the data. Factor rotation can most easily be explained using a simple two-factor solution. As discussed before, the first factor extracted accounts for most of the variance, and the second represents the remaining variance unexplained by the first factor. However, the actual position of these factors with respect to the underlying variables is rarely easily interpretable. In fact, there are any number of different ways in which we could define such latent factors within the two-factor space. An example two-factor solution is presented in Figure 4.4. Consider as an example a situation where the loadings of six subtests of ability (arithmetic, calculation, science, reading, spelling, and writing) on a factor analysis are, respectively, .76, .68, .62, .64, .59, and .51 on factor I, and .72, .67, .63, −.65, −.67, and −.55 on factor II. If a graph is drawn plotting these loadings on the two factors (I as the y axis and II as the x axis), then they form two clusters: on the top right-hand side we have numerical abilities, while verbal abilities are on the top left-hand side. In this situation, we could interpret factor I as a general ability factor, while factor II would contrast people who are good at mathematics and bad at language with people who are good at language and bad at mathematics. However, if we draw two new lines on this graph, both through the origin and at 45° to I and II, we note that one of these lines passes very close to the mathematics cluster with hardly any loadings on language, while the other passes very close to the language cluster with hardly any loadings on mathematics. This could be in­ terpreted as reflecting the existence of two independent ability factors, one of mathematics and one of language. Both these solutions are compatible with the same set of data! Interpretation in factor analysis is never straightforward, and to fully understand the results, it is necessary to be familiar with its underlying conception.
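The hand rotation just described can also be carried out numerically. The following Python sketch applies a 45° rotation to the illustrative loadings quoted above; the numbers are the hypothetical ones from the example, not real data:

```python
import numpy as np

subtests = ["arithmetic", "calculation", "science", "reading", "spelling", "writing"]
# Loadings on factor I and factor II from the hypothetical example above.
loadings = np.array([
    [0.76,  0.72],
    [0.68,  0.67],
    [0.62,  0.63],
    [0.64, -0.65],
    [0.59, -0.67],
    [0.51, -0.55],
])

angle = np.radians(45)                     # rotate the axes by 45 degrees
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
rotated = loadings @ rotation

for name, row in zip(subtests, rotated):
    print(f"{name:12s} {row[0]:6.2f} {row[1]:6.2f}")
# After rotation, the three number subtests load almost entirely on one rotated
# factor and the three language subtests on the other (up to sign), with
# near-zero cross-loadings: the simple-structure interpretation of two
# independent ability factors. (One rotated loading comes out slightly above 1
# only because these illustrative loadings imply a communality just over 1.)
```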

Rotation to simple structure

At one time, rotations of factor-analytic data were carried out by hand, in the manner just described, and solutions sought by drawing lines on graphs that gave easily in­ terpretable solutions. However, there was some criticism of this approach (it is valid, merely open to abuse), claiming that it was too subjective. Thurstone therefore in­ troduced a set of rules that specified standard procedures for rotation. The main one of these was rotation to simple structure. The rotation carried out in the previous ex­ ample on the mathematics and language data is such a rotation, and it involves at­ tempting to draw the rotated factors in such a position that they pass through the major dot clusters. In practice, the easiest way to do this algebraically is to require that as many of the variables as possible have loadings on the factors that are close to 0 (a loading is like a correlation between the variable and the factor). Rotation is then defined in terms of the minimization of loadings on other factors, rather than the maximization of loadings on the factor in question. This is the process that is carried out by the varimax procedure offered by most factor-analysis software, and is by far the most popular factor rotation technique. In practice, data rarely behave quite as nicely as in our example, and the software may find it difficult to decide which fit of many poor fits is best. There are other rotation solutions that can be tried in these situations, selecting solutions based on other criteria. For example, if it is impossible to find a solution where the lines pass through one set of variables while other sets of variables have low loadings, priority can be given to one or another. Orthogonal rotation

Generally, in classical factor analysis, the derived factors are independent of each other—that is, they are drawn at right angles (orthogonal) to each other. There are good reasons for this. The factor structure—because it lies in “possible” space, rather than the real space of the original correlations between the variables—needs to be constrained, as there would otherwise be far too many possible solutions. This was after all one of the reasons why the rotation to simple structure was introduced as an algorithmic alternative to the subjective drawing of rotations. There is a further advantage of orthogonal factors in that they are relatively easy to interpret. Oblique rotation

However, there are situations where the data do not easily fit an orthogonal solution, and further situations where such a solution is artificial. It might be felt that there are good reasons why two particular factors would be expected to relate to each other. An ex­ ample here might be anger and hostility as personality variables. In these situations, an oblique solution may be more appropriate. These are more difficult to interpret, as one of the main constraints has disappeared, and the factors found are not independent. The extent to which the orthogonality criteria can be relaxed can vary, and it is often found that different degrees of relaxation produce quite different solutions, so that a great deal of experience is required to interpret rotations of this type. They are best avoided by people without experience of the technique.


Limitations of the classical factor-analytic approach Factor analysis is a confusing technique that can easily produce contradictory results when used by the unwary. Generally, the analysis is particularly unsuited to testing hypotheses within the hypothetico-deductive model, as it can so easily be used to support many different hypotheses from the same set of data. A common error is the assumption that if two variables have loadings on the same factor, then they must be related. This is nonsense, as can be demonstrated by drawing two lines at right angles to each other, representing two uncorrelated variables, and then drawing a line between them at 45° to each, to represent a factor with a .71 loading (the cosine of 45° is .71) from each variable! In early research, major theoretical battles were often carried out based on the results of different factor-analytic rotations. For example, was there one factor of intelligence or many? These disputes were in the end seen as pure speculations; either position could be supported depending on how the data were interpreted. An important debate between Eysenck and Cattell about whether personality could be explained best in terms of two or 16 factors turned out to be dependent on whether orthogonal or oblique rotations were used on the same data. Thus, there could be two personality factors or there could be 16—depending on how the situation was viewed. A general dissatisfaction with factor analysis was widespread in the second half of the 20th century because of the apparent ability of the technique to fit almost any solution, and at this time, stringent criteria were recommended concerning its use. It was felt that sample sizes had to be very large before the analysis was contemplated. In fact, the recommended samples were often so large as to render the use of factor analysis im­ practical in these early days. There were further constraints introduced in terms of the assumptions of the model, and the requirement that the variables in the correlation matrix have equivalent variance. This again produces a considerable practical problem, as binary data in particular often fall short of this requirement, and item scores on psy­ chometric tests are frequently binary. It was not until modern psychometric methods—such as logistic and confirmatory factor analysis (see Chapter 5)—were in­ troduced that these issues were resolved. Looking back, it seems amazing that so much was achieved using the approximate solutions of earlier days.

Criticisms of psychometric measurement theory The last century saw many debates about whether psychometric traits, and intelligences in particular, were the sort of thing that could be measured at all, and the facetious definition of intelligence as being merely that which is measured by intelligence tests received widespread acclaim (Boring, 1957). This came particularly from those who were opposed to psychological testing more generally, but these critics have all made contributions to modern psychometrics, each in their own way. Their arguments are worthy of consideration. Psychological and educational tests carry out a form of measurement, but unlike physical measures such as length or weight, there is considerable confusion over what they measure and how they can do so. But to ask “Do latent traits actually exist?” begs a number of questions. One problem is that what is measured is not a physical object but an intervening construct or a hypothetical entity. For example, in assessing whether a test of creativity is really measuring creativity, we cannot compare a person’s score on the test directly with their actual creativity. We are restricted to seeing how the test scores


differentiate between creative and noncreative individuals according to some other ideas about how creative people should behave. The measurement of concepts like creativity, extraversion, and intelligence is limited by the clarity with which we can define the meaning of these constructs. The aim is to identify these latent traits and to obtain as good a measure as possible on each of them for each individual: the true score on each latent trait. It is limited by the inability of not just the psychometric community but also of society more widely to agree on what constitutes the essential character of each trait. Major criticisms of true-score theory were directed against the concept of the true score itself by those who felt that the statistical definition was misleading. It was argued that we cannot deduce from a score on a test that anything whatsoever “exists” in the brain, as intelligence is merely a construct arising from the use of the test. The true score is seen as being an abstraction and therefore of no theoretical importance. The essence of this view is that psychological measurement is fundamentally different from the way in which measurement is used in science more generally. To address this issue, we can have recourse to an alternative definition of a true score, based on Plato’s theory of reality and truth in his Metaphysics. The Platonic true score

The Platonic concept of a true score is based on Plato’s theory of truth. He believed that if anything can be thought about, then even if it does not exist in the physical world, it must exist somewhere, perhaps in some sort of Platonic heaven where imaginary things exist. The unicorn is often given as an example of such an object. Nonexistence is reserved for objects about which we cannot even think—perhaps Donald Rumsfeld’s “unknown unknowns.” Many argue that the Platonic idea of the true score is a mistake, and this comes from both those for and those against psychometric measurement in principle. Those who are for criticize the Platonic approach as unnecessary and misleading, feeling perfectly sa­ tisfied with the statistical definition. Those who are against believe that those who argue for the existence of a construct from the existence of reliable test scores make a category error. Just as behaviorists argue that there is no mind (only behavior), so it is said that there is no true score, only a series of observed scores and deductions. However, this is an oversimplification. There are many abstract nouns in use that, although not attached directly to objects, certainly exist: for example, justice. Certainly we might agree that in one sense, justice does not physically exist, but we would probably not see this as being equivalent to agreeing with the statement “There is no justice in the world” or “There is no such thing as justice.” Just because an abstract object has no physical existence, this does not mean that it cannot “exist in some quantity and therefore be measured.” For example, some have suggested that love cannot be measured. But are they really trying to say that expressions such as “Do you still love me?” or “I love you more than ever” are meaningless? No, love can be rated, and it can therefore be measured—probably even by questionnaire. Psychological vs. physical true scores

Is there something special about physical measurements that sets them apart from psy­ chological ones, making physical measurement immune from the stratagems of true-score theory? Not necessarily. While normally most people are quite happy with the idea of the


length of common objects being fixed, this is in itself a somewhat Platonic approach. If we take, for example, the length of a physical object such as a table, we can never be wholly accurate in its measurement—no two people taking measurements down to a fine degree are going to agree exactly. And even if they did, unfortunately by the next day, both would be proved wrong. A little damp may cause the table to expand, or heat may cause it to either contract or expand depending on its material. It might perhaps be said that the table did indeed have a length, but these were different at different times. But even then, a table can never be completely rectangular (this is itself a Platonic concept), so which of several measurements counts as the “true” length? The length of the table to which they aspire is a true score. It sets a standard to which no real-world table can ever conform. And what is true for the measurement of tables is even more so for medical measurements such as blood pressure, which varies not only with every measurement but also second by second and across different parts of the body. Again, blood pressure is more than an observed measurement; it is a true score to which we aspire. In practice, it is as accurate as it needs to be. It might be hoped that some entities at least may be measured perfectly, but this is a forlorn hope. Even the speed of light is not known with complete accuracy, and mathematical fundamentals such as π (the ratio of the circumference of a circle to its diameter) or e (the growth constant) can only be known up to a certain number of decimal places. Many Platonic entities have indeed turned out to be unicorns, but without imagination there is no science. True-score theory may have started its existence as a bit of a rhinoceros, but it has not ended as a unicorn. Rather, it is today a model for deep learning in a wildlife park for AI.

Functional assessment and competency testing

During the second half of the 20th century, an alternative approach to psychological assessment was promulgated. This was functional, and focused on the assessment of competencies rather than latent traits. In the trait approach, the test is there to measure an underlying psychological construct, such as intelligence or extraversion. In the functional approach, on the other hand, the test is there simply to achieve a purpose: that of successfully separating respondents in terms of some target application, such as likely success at a job. Within the functionalist approach, the design of a test is completely determined by its use, and “what it measures” has no meaning other than this application. Two examples are the work of David McClelland, an advocate of competency testing in the workplace, and W. James Popham (1999), an enthusiast for criterion-referenced testing in educa­ tion. McClelland argued that trait-based assessment was completely ineffective as a tool for selection in employment settings, and that neither ability tests nor grades in school predict occupational success. He concluded that criterion-referenced competencies are better able to predict important behaviors than more traditional norm-referenced tests. The influence of his early work remains today in the popularity of the competency-based approach for the assessment of vocational qualifications. Popham argued that there had been too much emphasis on normative factors in testing. He pointed out that if, for example, we were interested in whether someone could ride a bicycle, then the performance of other people on their bicycles was irrelevant. Indeed, we should be particularly delighted if we found out that all were able to do so, and not in the least concerned that we did not have a wider spread of abilities. For him, it is only performance on the criterion that matters, even if all individuals obtain the same score.


More recently, the functional model has been the basis for most psychometric or psy­ chographic systems built by machine-learning algorithms that simply learn to dis­ criminate between predefined groups. There is no consideration of how or why these particular groups were chosen in the first place. The functional approach can produce tests for many practical circumstances, but it has several weaknesses. First, we cannot assume that a test developed with one par­ ticular purpose in mind will necessarily be of any use for another. In many areas of application, however, this has been a strength of the model rather than a weakness. In education, for example, the separation of the function of formative assessment—where tests are used to identify areas of the curriculum that need to be developed by both the teacher and student during the remainder of the educational session—and sum­ mative assessment—where a final indication of the student’s attainment is given—has been generally well received. The way in which summative examinations control the curriculum has been widely criticized, and the formative assessment process has been welcomed as an approach that not only limits this control but also introduces feedback at a time when something can be done about it rather than when it is too late. However, it should be recognized that the actual content of both types of examination will be broadly similar, and in practice there will be considerable overlap between the content of each. Second, the functional model insists, almost as a point of principle, that no psycho­ logical intervening variables or traits can be relevant. As with early 20th-century be­ haviorism, the only interesting aspects of traits are the behavior to which they lead, and as this is measured and defined directly and functionally, the traits are redundant. Within functionalism, there is no such thing as, for example, an ability in mathematics; there is only the performance of individuals on various mathematics items. The pursuit of such an approach is, however, somewhat idealistic and certainly does not reflect existing practice in the real world. People do tend to use concepts such as “an ability in mathematics” and frequently apply them. Indeed, it is normally on the basis of such concepts that generalization from a test score to an actual decision is made, whether justified or not. How else could a GCSE in mathematics, for example, be used by an employer in selecting a person for a job? Certainly, it is unlikely that the mathematics syllabus was constructed with any knowledge of this employer’s particular job in mind. Neither is it likely that solving simultaneous equations will be a skill called for in the job in question. Indeed, how many people who have a GCSE in mathematics have ever “found x” since leaving school? No, the criteria used in practice here are not functional ones but involve the use of commonsense notions about the nature of individual dif­ ferences in human ability. Thus, we see that in spite of the superficial advantages and objectives of the func­ tionalist approach, trait psychology remains essential because it so closely represents the way in which people actually make decisions in the real world. While some have argued that all such trait-related processes are wrong and must be replaced by func­ tionalism, this represents an unreasonable and unwarranted idealism. It is really no good trying to prescribe human thought processes. 
To an extent, much of psychology is no more than an attempt to be objective and consistent in predicting how people behave. If this can be achieved by assuming the existence of traits, then so be it. Examples of the success of the approach abound, particularly in clinical psychology. A test of depression such as the Beck Depression Inventory (BDI)—although originally constructed around a framework defined by the functional model that identifies a


blueprint of depressive behaviors and thoughts—would be of little use if it had to be reconstructed with each application of the concept of depression in different cir­ cumstances. Solely functional tests on their own can only be specific to a situation; they cannot easily be generalized. If we wish to generalize, then, we need a concept, a trait of depression, to provide justification for saying that the depression scale might be applicable in changed situations—for example, with children, or with reactive as well as endogenous depression. To function in this way, the BDI needs to have construct validity, and this cannot exist without presupposing the construct and trait of de­ pression itself. The BDI relates to a wide range of mood changes, behaviors, thoughts, and bodily symptoms that psychologists, psychiatrists, and therapists consider to be part of depression. Machine learning and the black box

In spite of this, the functional approach has seen a recent resurgence within AI, where it is argued that so long as the machine is able to learn how to identify the key elements that differentiate groups, then how this is achieved is irrelevant. Generally, such systems are designed to maximize particular outcomes—usually profit. Attempts by insurance companies to base premiums on post codes were made illegal by the EU in 2013. Such premiums would necessarily discriminate not just against the poor but also against any group that was more likely to experience poverty, and hence they would potentially be in breach of equal-opportunity legislation. AI algorithms are trained using vast amounts of data collected over years; if the data include past racial, gender, or other biases, the predictions of these AI algorithms will reflect these biases. With no requirement to explain how decisions are reached, the internal workings of the algorithm become a black box. This is a serious issue in the use of AI by courts and correction departments to assist in bail, sentencing, and parole decisions, as well as in areas like predictive policing. In a review of the “techlash,” Atkinson et al. (2019) concluded that in order to reduce the potential for algorithmic bias to cause harm, regulators should “ensure that companies using AI comply with existing laws in areas that are already regulated to prevent bias.” However, the regulation of this presents difficulties, some of which may be insurmountable. While the EU is in the process of passing legislation that will insist that all recommendations made by AI be explainable, the sophistication of these systems is often far beyond anything that can be explained in human terms.

Summary

Accurate measurement is the key to success in many sciences, and this is as true in psychology as it is in physics or chemistry. And all sciences have their unicorns, whether they be the ether, phlogiston, or animal magnetism. But in spite of these dead ends, scientists have pursued their dreams and achieved what would once have been seen as miracles. The early application of factor-analytic techniques for the identification of true scores has evolved through path diagrams and latent-variable analysis to the hidden layers within deep-learning neural nets that are so essential to modern AI. If these hidden layers remain hidden, they will continue to be black boxes. If AI one day becomes truly intelligent, it too may want to know what lies within these black boxes. Maybe it will do so before humans actually discover how their own brains work.


We can see that from an ethical perspective, both the trait and functionalist approaches have their advantages and disadvantages. Neither can be said to be wholly right or wholly wrong. What is important is that psychometricians and data scientists realize which set of assumptions they are using in a particular situation and be prepared to justify this use. Regarding the use of AI for any psychometric purposes, we are still awaiting regulation, which is becoming increasingly necessary.

5

Item response theory and computer adaptive testing

Introduction

Data from alternate-choice (right or wrong) items generated by traditional ability tests are binary and hence nonparametric, as they are not normally distributed at the item level. However, classical item analysis can approximate these nonparametric data to the parametric by using several very successful ruses. For example, the point-biserial correlation between binary data and interval data assumes that the binary data were obtained from an underlying normal distribution, making it possible to correlate an item with the total test score and generate item-level discrimination scores. In practical applications, such as the item analysis used in classical test construction to choose which items to use in a test, the precise details of how correlations were calculated mattered little so long as the ranking of the items in terms of quality remained the same. Purists could argue about the strict statistical properties of the parameters, but if more complex methods led to the same ranking and subsequent decision of which items to include and which to delete, what did it matter? Increased computing power soon meant that far more precise and exact statistical parameters for item analysis could be calculated using exact probabilities. The advent of logit and logistic models, which allowed exact probability calculations for binary and ordinal data, respectively, has led to a much better understanding of the whole process of psychometric testing. But this said, the classical approach has stood the test of time and has the advantage of being easy to understand. It is essential for those who are developing tests for the first time if they are to gain a practical insight into what they are doing.
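As an illustration of the classical procedure just described, the point-biserial discrimination index is simply the Pearson correlation between a binary item and the total score (here the item is excluded from the total, a common refinement). A minimal sketch in Python with NumPy, using made-up response data:

```python
import numpy as np

# Made-up binary responses: 8 respondents x 5 items (1 = correct, 0 = wrong).
responses = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

total = responses.sum(axis=1)

for item in range(responses.shape[1]):
    rest = total - responses[:, item]        # corrected item-total: exclude the item itself
    r_pb = np.corrcoef(responses[:, item], rest)[0, 1]
    facility = responses[:, item].mean()     # classical item facility (proportion correct)
    print(f"item {item + 1}: facility = {facility:.2f}, discrimination = {r_pb:.2f}")
```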

Item banks

Even before the arrival of the modern psychometrics era, big data was already on the horizon, as databases began to accumulate more and more data from the application of examination systems in a digital form. Psychometric test items were, after all, a reusable commodity, and all examination boards had not only a collection of their previous items but also information on their effectiveness. An item bank, in its widest definition, is merely a collection of items accumulated over time and stored along with some information about their effectiveness. Such a bank would be built up wherever there was a repeated need for testing of a similar sort—for example, the many multiple-choice items used in British medical examinations or in the Hong Kong school selection system. These stored banks could accumulate thousands of items over the years, and certain styles of items would begin to stand out and be available to be used over and over again.


When an item bank contains far more items than will actually be required on any one occasion, there is the temptation to simply draw a few items at random each time testing is required. However, classical test theory still demands that a new norm group be collected in order to recalculate item difficulty and item discrimination in this new setting. Within the classical model, it is the test (rather than the item) that is the basic unit. But was that really necessary? Attempts were made to use existing information from the item bank to predict likely performance on the new test in the absence of any new norm group. In order to do this, more information was needed about how scores on items behaved in relation to whatever latent trait the test purported to measure. The Rasch model

A possible solution came in 1960 when Georg Rasch, a Danish mathematician with an interest in how examinations were scored, proposed that the probability of getting an answer right on a test could be modeled with a sigmoid function that related this probability to a latent trait of ability. He was able to show that in the case where all the items in the test had equal discrimination, it was possible to produce a single-item statistic (parameter) that was independent of the respondents used to pilot it and of the other items included in the pilot. For this reason, he called his technique both respondent-free and item-free. The Rasch model became popular in the United Kingdom and is still widely used because of its sim­ plicity. Rasch was able to show that if it is assumed that any guessing element is held constant (as is the case if all items are of the same type), and if only items with equal discrimination are accepted, then some of the restrictions of classical psychometrics are loosened. The Rasch model could generate a single statistic for each item, which enabled that item to be used in a wide variety of different situations regardless of which other items were included in the test and of the particular respondents involved. This approach showed considerable promise in the calibration of item banks, because if accurate, it would be exceptionally useful to edu­ cational testers, particularly those dealing with very large populations. For example, it would also enable the scores of individuals who had taken different versions of the same test to be equated, so long as there were enough items in common between versions. Assessment of educational standards

In the 1970s, the Rasch model proved particularly attractive in the United Kingdom to a group set up by the Department of Education and Science to monitor academic stan­ dards, which became known as the Assessment of Performance Unit (APU). It seemed that if an item bank of appropriate items for measuring (for example) mathematics achievement at age 11 were available, then by taking random sets of items from the bank for different groups of 11-year-olds, it would be possible to compare schools, teaching methods, and even—with the lapse of years between testing—the rise or fall of standards. However, this was not to be. Some critics pointed out that the sets of items identified as suitable for the finished test seemed to be more or less the same whether classical item analysis or the Rasch technique was used in their selection. If the items in a classical test and in the Rasch scaled test were the same, how could it be that Rasch’s claim that the test was item-free and respondent-free was justified in the one case but not in the other? This cast doubt on the claim of equal discrimination between the items. It was parti­ cularly important for the claim of being respondent-free that the items have the same discrimination as each other. Therefore, if we find the two techniques accepting the


same items, there is the implication that the test of equality of discrimination of the items is in fact insufficiently powerful to eliminate items that are atypical in their dis­ crimination. This indeed turns out to be the case. The test applied for parallelism within the Rasch technique was a test of equivalent slope, with acceptance based on proving the null hypothesis—a notoriously nonpowerful statistic. Doubts like this led to widespread skepticism over whether this could be a suitable basis for public policy in schools, and the methodology fell into disfavor, particularly when it came to issues such as the assessment of falls or gains in national standards. The Birnbaum model

Birnbaum (1968) added a chapter to the seminal Lord and Novick text on mental testing in which he suggested a statistical model that could mathematically represent the behavior of items, scored as right or wrong, within a conventional ability test. This became more generally known as item response theory (IRT). This approach is based on the concept of an item characteristic curve (ICC) that, for each item, plots the probability of getting the answer right against the respondent’s ability. The ICC takes a form like that of a normal ogive, that is, a cumulative normal distribution. There should be nothing surprising about this, because the ogive is just another version of the normal distribution and is therefore to be expected where the effects of random error are included in observed measurements, as in all models based on the theory of true scores. It quickly became apparent that the Rasch model was a simple case of the same general insight, the only significant difference being Rasch’s use of the simpler but very similar sigmoid function rather than the ogive. The aim of IRT is to look at the fundamental algebraic characteristics of these ICC curves and try to model them in such a way that a set of equations can be extracted that can predict the curve from some basic data. This modeling is the same as that involved in the more familiar process of obtaining the equation that describes a straight line. This, you will remember, is based on the equation y = a + bx, so that if a and b are known, then so is the position of the straight line itself within the xy frame. When the ICC is modeled in this way, the situation unfortunately turns out to be somewhat more complex. For Birnbaum, four variables are required to define an ICC. These are one respondent variable (the respondent’s ability) and three item variables (which are called the item parameters). The three item parameters map onto the traditional notions of classical item analysis, i.e., one of them describes item facility, another item dis­ crimination, and a third the guessing effect. If the values of the three parameters are known, then it is possible to reconstruct the item characteristic curve. This is known as the three-parameter model. If the guessing parameter is left out—as might happen, for example, if it is assumed that the chance of guessing is the same for all items—then we have the two-parameter model. And if the discrimination parameter is assumed to be approximately the same for all items, then it is a one-parameter model. The person parameter is not enumerated in these models, although it is of course always present. However, at the time that Birnbaum first made his proposal, computing power was far too undeveloped to take advantage of the opportunities that it appeared to offer.
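The item characteristic curve itself is straightforward to write down. Below is a minimal Python sketch of the three-parameter logistic curve, with the two-parameter and one-parameter (Rasch-type) models as special cases; all parameter values are invented for illustration:

```python
import numpy as np

def icc(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response under a 3PL item characteristic curve.

    Setting guessing=0 gives a two-parameter model; fixing discrimination to a
    common value as well gives a one-parameter (Rasch-type) model.
    """
    logistic = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic

theta = np.linspace(-3, 3, 7)                                          # a range of ability levels
print(icc(theta, difficulty=0.0))                                      # Rasch-type item
print(icc(theta, difficulty=0.0, discrimination=2.0))                  # steeper, more discriminating item
print(icc(theta, difficulty=1.0, discrimination=1.5, guessing=0.25))   # multiple-choice style item
```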

The evolution of modern psychometrics In spite of the initial skepticism encountered in the UK concerning the utility of models based on either Rasch or IRT, the worldwide trend still continued toward an increase in


their use. This time the USA was in the lead, where the approach was sponsored by the Educational Testing Service in Princeton. Many psychometricians increasingly came to feel that the early reaction against IRT had been due to a premature exaggeration of its possibilities. But increasingly the complexity, duration, and expense of the algorithms became less of an issue. Times had changed, and programs for carrying out even complex analyses were soon available on PCs and laptops. Even the three-parameter model no longer presented the challenge that it once had. Computer adaptive testing

The technique proved to be particularly useful within computerized testing, where the items to be presented can be chosen based on responses already given, enabling dependable results from the administration of 50% fewer items. In a computer adaptive test, as each item is presented to the respondent on screen and responded to, a calculation is made as to which of the remaining items in the bank should be given next in order to obtain the maximum information. The respondent’s score is an ongoing calculation, based on their most likely ability considering the combined probability generated from the responses to all items presented so far. As more items are presented, this estimate becomes more accurate, and testing is terminated once a predetermined level of precision is achieved.
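The item-selection step can be sketched in a few lines. The toy example below (Python; the item bank, the two-parameter information function, and the current ability estimate are all invented for illustration) picks the unused item with the greatest Fisher information at the current estimate, which is the logic described above:

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct answer."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# A tiny invented bank: (discrimination a, difficulty b) for each item.
bank = [(1.0, -1.5), (1.8, -0.5), (0.7, 0.0), (2.0, 0.4), (1.2, 1.5)]
administered = set()   # indices of items already presented
theta_hat = 0.0        # current ability estimate after the items answered so far

# Pick the not-yet-used item that is most informative at the current estimate.
best = max(
    (i for i in range(len(bank)) if i not in administered),
    key=lambda i: item_information(theta_hat, *bank[i]),
)
print(best, item_information(theta_hat, *bank[best]))
```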

Test equating

The use of IRT models for comparing the scores of respondents who have taken tests at different levels of difficulty (related to Rasch’s claim of being respondent-free) has additionally proved very useful. This might apply, for instance, when an easy and a more difficult version of the same examination system (such as within the GCSE in England and Wales) need to be combined to generate a common set of marks. The public need for this type of comparison—for example, when an employer wishes to compare two applicants for a job with different types of qualifications—cannot be ignored. As with many areas in selection, it is not so much a question of saying what ought to happen as of ensuring that what does happen is done fairly and efficiently. It is important to point out that most of the criticisms of the Rasch model do not apply to the two- and three-parameter models. These make no assumptions about the equality of discriminability of the items, and the three-parameter model additionally takes into account the effects of guessing.

IRT approaches have been revolutionized in the 21st century by advances in computational statistics and the wide availability of software that is capable of carrying out probabilistic and nonparametric analyses of discrete data sets using maximum-likelihood algorithms. This has enabled the approach to inform the assessment of personality as well as ability. These models allow an extension of IRT to multiple response data. Hence it is now possible to examine the characteristics not just of correct vs. incorrect responses but also of the behavior of intervening response categories. For example, in a personality test with response options of “strongly agree,” “agree,” “uncertain,” “disagree,” and “strongly disagree,” it is of interest to know what people


actually choose from these options. Item analysis using IRT may tell us whether some of these categories are redundant or may suggest how they could be improved.
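One widely used polytomous model of this kind is the graded response model, in which ordered thresholds govern the probability of choosing each category or a higher one. A minimal Python sketch with invented parameters for a five-option personality item:

```python
import numpy as np

def graded_response_probs(theta, a, thresholds):
    """Category probabilities under a graded response model.

    Each threshold is the ability level at which a respondent becomes more
    likely than not to choose that category or a higher one.
    """
    # P(response >= k) for each threshold, bracketed by 1 and 0.
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]          # probability of each individual category

labels = ["strongly disagree", "disagree", "uncertain", "agree", "strongly agree"]
probs = graded_response_probs(theta=0.5, a=1.5, thresholds=[-2.0, -0.8, 0.2, 1.4])
for label, p in zip(labels, probs):
    print(f"{label:18s} {p:.2f}")
print(probs.sum())                      # category probabilities sum to 1
```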

An intuitive graphical description of item response theory

Item response theory can be explained through its mathematical equations, but for many learners the equations obscure rather than enlighten. Instead, it can be understood more intuitively using graphs. This chapter will present a graphical explanation of one particular IRT model, the three-parameter logistic (3PL) model. Readers interested in mathematics or other IRT models are encouraged to refer to comprehensive books such as Embretson and Reise (2000). At the highest level, IRT is distinguishable from classical test theory (CTT) by its emphasis on the properties of the individual items that make up a test, as opposed to treating the test as a whole unit. While this may sound like a difference that would only concern mathematicians, it represents a profound shift in the way of thinking about a person taking a test. A CTT test sets a challenge to the test taker to see whether they can meet it or not; a person who completes the test gets a score that is then compared against those of other people who have taken the same test. But an IRT test is more like a series of experiments to find out what the test taker is like. Each item represents a hypothesis from the test creator about the ability of the test taker on the latent trait. Every time the test taker responds to an item, the test creator refines their estimate of the test taker’s ability. If the test taker gets the right answer, the ability estimate is increased; and if they get the wrong answer, then the ability estimate is decreased. Normally, as the test taker completes more items, the confidence interval around the ability estimate narrows. Through this process of analyzing items individually rather than as a group, IRT overcomes various limitations of CTT’s methodology.

Limitations of classical test theory

Estimation accuracy differs with the level of the latent trait

In CTT, there is a single reliability estimate for any person who takes the test, no matter what their ability is. The standard error of measurement is constant, so each person’s test score is plus or minus a certain level of error, no matter what the score is. This uniform reliability estimate assumption does not hold. In fact, the reliability of an individual’s test score depends on the latent trait level (Feldt & Brennan, 1989). Imagine a test that measures the mathematical ability of typical 10-year-olds. If the test is taken by a typical 18-year-old, you would expect the test to not be very accurate at measuring their level of mathematical ability, because it does not have questions that are hard enough—it would be preposterous to even give that test to an 18-year-old. But now imagine that a particularly mathematically talented 10-year-old is taking the test. If they get every question correct, like the 18-year-old, can we really say that we have an accurate estimation of their mathematical ability? The only valid conclusion is that the test was not hard enough. The same applies when a test taker gets all or nearly all of the questions wrong—we can only really say that the test was too hard. To get an accurate estimate of ability, we need to know both the upper bound and the lower bound of the ability of the test taker. To do this we need to give them both questions that they find too difficult and get wrong and questions that they find easy and


get right. So test scores are most accurate for average-ability participants who get approximately half of the test items correct (allowing us to be confident of their lower bound of ability) and half of them incorrect (allowing us to be confident of their upper bound of ability). But CTT does not take this into account. Since there is only one reliability estimate for all test takers, the CTT test creation process is biased by the aim of increasing reliability for the average test taker. Items that are too hard or too easy for the average test taker are removed, even if they are good at measuring the latent trait, because they are only useful in distinguishing between test takers who are very high or very low on the trait—and that will be a small percentage of individuals. The score is sample dependent

In a CTT test, to measure how successfully a test taker did on the test, their score needs to be compared to those of other people who have taken the test: the norm group. This means that a test creator has to choose the norm group based on the purpose of the test; a verbal-reasoning test that is to be used for selecting lawyers ideally needs a norm group of lawyers, while a verbal-reasoning test that is to be used to compare school-leavers needs a representative norm group of school-leavers. Collecting data from norm groups is ex­ pensive and time-consuming—worse still if a test has multiple purposes and so needs multiple norm groups. If even a single item in the test changes, perhaps because the item has been leaked, the new test needs to be renormed again. This makes it troublesome to compare test scores between people who have taken different tests, even if the tests are measuring the same latent trait. If one person takes a critical-thinking test and scores in the top 5%, and another person takes a different critical-thinking test and scores in the top 10%, which individual is best at critical thinking? The answer is that it depends on what the norm groups are for the two tests. If the first test is normed against the general population while the second is normed against Nobel Prize–winning scientists, then perhaps the second individual’s score is better, but we still cannot be sure. This drawback makes comparisons over long timescales pro­ blematic because tests often change. In medicine, for example, it is common to change tests of psychological symptoms every decade or so because culture and language change, and we have a better understanding of disease. Given that different individuals have taken different tests over the decades, it is hard to say whether the prevalence of certain psychological symptoms has changed. In computer adaptive testing (CAT), every respondent takes a different test from every other respondent. It is essentially impossible to use CTT for CAT, because there is no way to compare scores. All items are scored the same

Normally, getting any item correct in a test gets one point, and then all items are added together to get the total score. In other words, CTT treats all items equally. But not all items are created equal: some items are inevitably better at distinguishing on the latent trait than others. Picture an unimaginative item creator who is struggling to think of the final item on their 50-item extraversion test, and finally comes up with “I prefer to be outside rather than inside.” Their final item has only a tenuous link to extraversion, but


CTT would treat the response to this item as just as important as the response to the best item in the test. In CAT, test takers who get questions right are given more difficult items. It seems intuitive that getting a more difficult question right should give more points than getting an easier question right, but CTT has no obvious mechanism for deciding how to score questions differently.

A graphical introduction to item response theory IRT models are grounded in assumptions about how people respond to the items in a test. Imagine a test that has been taken by 100,000 people. There are 20 questions, and so respondents get a score of anywhere between 0 and 20. But instead of just looking at the total score, we graph the probability of getting three of the items correct against the respondent’s total score (Figure 5.1). When a respondent gets 20 out of 20, they must have answered all items correctly, and so the proportion of correct answers for each item is 100%, and the three curves meet at the top right of the graph. Similarly, when a respondent gets 0 out of 20, they must have answered all items incorrectly, so the three curves meet at 0% at the bottom left. You can see that of the respondents who got a total score of 13, 92% got item 4 correct, 75% got item 10 correct, and 20% got item 18 correct. Indeed, item 4’s curve is always above item 18’s, showing that a reasonable proportion of respondents get item 4 correct even when their total score on the test is low. This indicates that item 4 is the easiest of the three items, and item 18 is the hardest. The logistic curve

Figure 5.1 Total score on the test vs. proportion of correct answers for three specific items.

The logistic curve

In order to draw Figure 5.1, you need an extremely large sample of people to take the test; otherwise the curves will be too noisy to give an overall picture of each item’s difficulty. In the graph you can see that sometimes an item curve even dips downward—for example, it suggests that people who score 12 on the test are less likely to get item 4 correct than people who score 11. IRT overcomes this problem by using a logistic curve to represent how test takers of different abilities respond to the item (Figure 5.2). This requires much less data to approximate a curve that describes responses to the item, and it can also be defined mathematically with only a few numbers. In this case, we will use the 3PL IRT model, which means that each item is described by three parameters. Notice in Figure 5.2 that the axes have changed. The x axis is now the ability of the respondent in whatever trait is being measured (known as theta, Θ), and the y axis is the probability of getting the item correct. So for any level of theta, the curve describes the probability of getting that single item correct. Note how the theta scale is defined. The 0 point on the scale indicates someone with an average level of ability, while the other numbers indicate standard deviations (SDs) from the mean. So a 1 on the theta scale indicates an ability level that is 1 SD above the mean (a score higher than 84% of others), whereas −2 on the theta scale indicates an ability level that is 2 SDs below the mean (a score higher than only 2% of others).

3PL model: difficulty parameter

As we saw in Figure 5.1, some items are more difficult than others: the more difficult an item, the higher the level of ability on the latent trait a respondent needs in order to have a good chance of getting it correct. In IRT, an item’s difficulty is defined as the point on the x axis (the theta scale) where its logistic curve is steepest. In the 3PL model, if the guessing parameter is set to 0, this is the point where the probability of getting the item correct is .5. Figure 5.3 shows three items of varying difficulty. The leftmost curve describes the easiest item, and the rightmost the hardest item. The difficulty values reflect this: −1 for the easiest item (a respondent with a theta of −1 has a .5 chance of getting the item correct), and 1 for the hardest item (a respondent with a theta of 1 has a .5 chance of getting the item correct).
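For readers who prefer a formula, the 3PL curve is usually written as P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))), where b is the difficulty, a the discrimination, and c the guessing parameter. Below is a minimal sketch in Python; the parameter values are illustrative rather than those behind Figure 5.3.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: P(correct | theta).
    a = discrimination, b = difficulty, c = guessing parameter."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.array([-1.0, 0.0, 1.0])
print(icc_3pl(theta, a=1.0, b=-1.0, c=0.0))  # easy item: .5 chance at theta = -1
print(icc_3pl(theta, a=1.0, b=1.0, c=0.0))   # hard item: .5 chance at theta = +1
```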

Figure 5.2 The logistic curve approximates responses to an item.


Figure 5.3 Three items of varying difficulty.

3PL model: discrimination parameter

The discrimination parameter reflects the steepness of the logistic curve. Items with a steep curve are good at discriminating between test takers within a narrow range of the latent trait. Items with flatter curves are poor at discriminating at any level of the latent trait. Figure 5.4 shows two items with differing discrimination parameters but the same other 3PL model parameters. If you look at the most and least discriminating items, notice what happens when theta goes from −.5 to .5. The probability of getting the most discriminating item correct increases from .25 to .75 (a large difference), whereas the probability of getting the least discriminating item correct barely increases, from .45 to .55. Thus, if you wanted to know whether the respondent has a theta of −.5 or .5, their response to the most discriminating item would allow you to strongly discriminate between the two: if they got it correct, they would be far more likely to have a theta of .5 than −.5. Conversely, their response to the least discriminating item would not allow you to discriminate very well between the two: whether they answered correctly or incorrectly, it would not provide strong evidence for either hypothesis. In fact, the test creator would probably delete the least discriminating item here, because it is not very useful for measuring the latent trait at any level of theta.
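The worked numbers above imply discrimination values of roughly 2.2 and 0.4 for the two curves; these values are back-calculated for illustration, not read from Figure 5.4. A short sketch, reusing the icc_3pl function defined earlier:

```python
import numpy as np

def icc_3pl(theta, a, b, c):   # same 3PL curve as in the earlier sketch
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Both items share b = 0 and c = 0; only the discrimination differs.
for a, label in [(2.2, "most discriminating"), (0.4, "least discriminating")]:
    lo, hi = icc_3pl(np.array([-0.5, 0.5]), a=a, b=0.0, c=0.0)
    print(f"{label}: P(correct) rises from {lo:.2f} to {hi:.2f} "
          f"as theta goes from -.5 to +.5")
```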

3PL model: guessing parameter

In a multiple-choice test, if there are five response options, then even a random guess would have a .20 chance of getting the right answer.


Figure 5.4 Two items with differing discrimination parameters.

This can be represented by the lowest point on the logistic curve. So far, all of the curves have gone down to 0, but Figure 5.5 shows an item with a .20 guessing parameter. Note that this is why the difficulty parameter is defined by the point where the logistic curve is steepest, rather than by the .5 probability point. In a case where the guessing parameter is .2, the steepest point on the curve will be at the .6 probability point, i.e., the midpoint between the guessing parameter and 1.
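The midpoint rule is easy to check numerically with the 3PL curve from the earlier sketches; the discrimination and difficulty values here are arbitrary, chosen only to show that the curve passes through (1 + c) / 2 = .6 at theta = b when c = .2.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# At theta = b the logistic part equals .5, so P = c + (1 - c) * .5 = (1 + c) / 2.
print(icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.2))  # -> 0.6
```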

The Fisher information function

Items provide information about different levels of the latent trait. They do not do this uniformly across the distribution (where an item is most informative is governed by its difficulty parameter), and some items distinguish better than others (governed by the discrimination parameter). The Fisher information function quantifies the amount of information provided by an item at each level of theta. Figure 5.6 shows the information provided by three different items. You can see that items with high discrimination provide more information about a narrower range of the latent trait, and that the most information for each item is provided at the point where the item characteristic curve is steepest (its difficulty). Indeed, an item’s information at a given theta is driven by the steepness (slope) of its item characteristic curve at that point: the steeper the curve, the more information.
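For the 3PL model, the standard (Birnbaum) expression for item information is I(θ) = a² · [(P − c)² / (1 − c)²] · (Q / P), where P is the item characteristic curve and Q = 1 − P. A minimal sketch with illustrative parameter values, not those of Figure 5.6:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a single 3PL item at each theta."""
    p = icc_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-3, 3, 7)
print(np.round(info_3pl(theta, a=2.0, b=0.5, c=0.0), 3))  # peaks near theta = b
```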


Figure 5.5 An item with a .2 guessing parameter.

The test information function and its relationship to the standard error of measurement

The information provided by the whole test is obtained by adding up the information provided by each of the items individually. Figure 5.7 shows the information provided at different levels of the latent trait by a test made up of the three items from Figure 5.6. In this case, the test is good at distinguishing between test takers with theta from around −1 to 1.8. Test information is inversely related to the standard error of measurement: the standard error at a given theta is 1 divided by the square root of the test information at that theta, so the more information, the smaller the error. Figure 5.8 shows a test that is good at distinguishing test-taker thetas from −2 to 2. Note that the standard error of measurement increases as theta goes below −2 or above 2. Therefore, an average test taker with a theta of 0 will have a smaller error bar around their theta estimate than a test taker who scores 2.5, whose estimate will have a larger error bar. This is because the test does not measure very well in that range of theta.

Take the use case of a test that is intended to select students for entry into a gifted program, where only students in the top 10 percent are eligible. Intuitively, you can see that if the goal of a test is to determine whether a candidate is above or below that top-10-percent boundary, it is necessary to obtain as much information as possible around the boundary (where theta ≈ 1.25). Thus, candidates who score near the cutoff point will have the lowest standard error of measurement, so you can make the most accurate judgments as to whether their theta is above or below the cutoff. In this case, the test does not need to accurately measure the theta of average or below-average test takers, nor does it even need to accurately measure the theta of the top 5% of test takers, as long as it can confirm they are above the cutoff. The boundary point is where the test needs to provide information to help the decision-making process, so items should be selected to maximize information at this point.
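In code, the test information function is just the sum of the item information functions, and the standard error of measurement follows as its reciprocal square root. The three (a, b, c) triples below form an illustrative mini item bank, not the items plotted in Figures 5.7 and 5.8.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    p = icc_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

items = [(1.8, -1.0, 0.0), (1.2, 0.3, 0.0), (2.0, 1.5, 0.2)]   # (a, b, c) per item

theta = np.linspace(-3, 3, 13)
test_info = sum(info_3pl(theta, a, b, c) for a, b, c in items)
sem = 1 / np.sqrt(test_info)      # standard error of measurement at each theta
for t, i, s in zip(theta, test_info, sem):
    print(f"theta {t:+.1f}: information {i:5.2f}, SEM {s:5.2f}")
```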


Figure 5.6 Three Fisher information functions.

Figure 5.7 Test information function for a test with three items.


Figure 5.8 Standard error is inversely related to test information.

How to score an IRT test

Scoring an IRT test is more involved than just adding up the points for each correct answer, as in CTT. Instead, it is a process of multiplying together the 3PL logistic probability curves that we have already encountered (the ICCs). But this is more intuitive than you may expect. Before the test taker has answered any questions, we do not have any evidence about what their score should be, except we can assume that they come from the general population. That is, they come from a normally distributed group and so are more likely to have a score close to 0 than a score of +3 or −3. This is called the prior.

When the test taker answers a question, they get it either correct or incorrect. If they get it correct, then IRT multiplies the prior with the ICC of the item they just answered (Figure 5.9). You can see that the resulting probability is now weighted further toward the high end of the scale. That is, it is more likely that the test taker’s theta is high, which is intuitive since they just got the item correct. Now, imagine that for the next item, the test taker gets it incorrect. In this case, IRT multiplies by the complement of the ICC curve, i.e., 1 − ICC (Figure 5.10). You can see that the resulting theta probability estimation moves toward the left as the score estimation decreases. Importantly, the probability estimation distribution also narrows somewhat, which indicates that the standard error of measurement has decreased and we can be more certain about the test taker’s likely theta. This whole process continues until the test ends. At this point, the aim is that the width of the theta estimation distribution be as narrow as possible, which indicates that the standard error of measurement is small.
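A minimal sketch of this updating process on a discretized theta scale is shown below. The two (a, b, c) items and the correct/incorrect pattern are invented for illustration; real scoring engines work the same way but with the test’s calibrated item parameters.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

grid = np.linspace(-4, 4, 161)           # discretized theta scale
posterior = np.exp(-grid**2 / 2)         # standard-normal prior (unnormalized)
posterior /= posterior.sum()

# Each tuple is (a, b, c, answered_correctly) for one administered item.
answers = [(1.5, 0.0, 0.2, True), (1.8, 0.8, 0.2, False)]
for a, b, c, correct in answers:
    p = icc_3pl(grid, a, b, c)
    posterior *= p if correct else (1 - p)   # multiply by ICC or 1 - ICC
    posterior /= posterior.sum()             # renormalize over the grid

theta_hat = (grid * posterior).sum()                           # posterior mean
sem = np.sqrt(((grid - theta_hat) ** 2 * posterior).sum())     # posterior SD
print(f"estimated theta {theta_hat:.2f}, standard error {sem:.2f}")
```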


Figure 5.9 Most likely score after the first item, which was answered correctly.

The previous examples have all assumed that the test is fixed. But because IRT allows the items to be mixed and matched, we can instead choose which item to give to the test taker based on their answers to the previous items. This is called computer adaptive testing.

Principles of computer adaptive testing

The following description is a very brief overview. Readers who would like to read more about computer adaptive testing (CAT) are encouraged to read Wainer, Dorans, Flaugher, Green, and Mislevy (2000). A fixed linear test normally has to cover the full range of ability, so it is likely to contain questions that are too easy or too difficult for an individual test taker. If items are arranged from easiest to hardest, as is the case in most paper-based exams, a test taker with a high level of ability will have to answer lots of easy questions before they are challenged. On the other hand, a test taker with a low level of ability who struggles to answer the first few questions will be aware that the rest of the test is even harder. In the former case, the respondent wastes time answering easy questions when it is already clear that they can answer them. In the latter case, the respondent becomes frustrated, and perhaps despondent, by attempting to answer questions that they have little hope of answering correctly. CAT takes the hypothesis-testing approach of IRT to its logical conclusion. Normally the CAT test starts with an item that is of average difficulty, and then for each correct answer, the test keeps presenting harder items.


Figure 5.10 Most likely score after the second item, which was answered incorrectly.

Conversely, if a respondent gets questions incorrect, then the items become easier and easier until the respondent can answer correctly. Ultimately, the goal is that the test asks questions around the respondent’s estimated level of ability, because these questions provide the most information and give the most accurate theta estimate (i.e., they most efficiently reduce the standard error of measurement). No matter what the test taker’s level of ability is, the test is calibrated so that all test takers get approximately 50% of the items correct and 50% of the items incorrect.

Figure 5.11 shows the steps for CAT. At the beginning, the prior normal distribution indicates that the test taker’s most likely theta is 0 (average), so the item that is chosen from the item bank has a difficulty as close to 0 as possible. The test taker then answers the item, and their new theta is calculated—it will increase if they get the item correct and decrease if they get it incorrect. At this point, CAT checks whether any of the stopping-rule conditions have been met. A stopping rule might be the number of questions that the test taker has completed, the total time that the test taker has spent on the test, or a cutoff point for the standard error of measurement. Or it might be a combination of rules, so that the test ends either when the test taker has completed 10 items or when the standard error of measurement of their theta is below .4. If the stopping rule is not met, then the next item is selected from the item bank with a difficulty parameter as close as possible to the test taker’s currently estimated theta. If the stopping rule is met, then the test ends. A minimal sketch of this loop in code is given after the list of advantages below.

Figure 5.11 Flow of a computer-adaptive test.

CAT has three major advantages over a fixed linear test:

1 It increases the accuracy of the estimated theta (i.e., reduces the standard error of measurement), because items are chosen to maximize the information provided for that specific test taker.
2 It saves time, because the test taker needs to complete fewer items to get the same accuracy.
3 It prevents frustration, because test takers do not have to answer questions that are too hard or too easy.
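The following is a compact, illustrative sketch of the loop in Figure 5.11, combining the grid-based scoring shown earlier with a simple "closest difficulty" item-selection rule and a two-part stopping rule (maximum number of items, or target standard error). The item bank, parameter values, and simulated test taker are all invented for the example; production CAT engines typically select items by maximum information and use more refined estimators.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def run_cat(item_bank, answer_fn, max_items=10, target_sem=0.4):
    """Administer items adaptively until a stopping rule is met."""
    grid = np.linspace(-4, 4, 161)
    posterior = np.exp(-grid**2 / 2)          # standard-normal prior
    posterior /= posterior.sum()
    unused = list(range(len(item_bank)))
    for _ in range(max_items):
        theta_hat = (grid * posterior).sum()
        sem = np.sqrt(((grid - theta_hat) ** 2 * posterior).sum())
        if sem < target_sem or not unused:    # stopping rules
            break
        # pick the unused item whose difficulty (b) is closest to the estimate
        idx = min(unused, key=lambda i: abs(item_bank[i][1] - theta_hat))
        unused.remove(idx)
        a, b, c = item_bank[idx]
        p = icc_3pl(grid, a, b, c)
        posterior *= p if answer_fn(idx) else (1 - p)   # update on the response
        posterior /= posterior.sum()
    theta_hat = (grid * posterior).sum()
    sem = np.sqrt(((grid - theta_hat) ** 2 * posterior).sum())
    return theta_hat, sem

# Illustrative use: a 30-item bank and a simulated test taker with theta = 1.0.
rng = np.random.default_rng(1)
bank = [(1.5, b, 0.2) for b in np.linspace(-2, 2, 30)]

def simulate_answer(i):
    return rng.random() < icc_3pl(1.0, *bank[i])

print(run_cat(bank, simulate_answer))
```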

The improvement in test takers’ experience has been the main driver of the deployment of CAT in practice. In occupational testing, employers do not want to discourage candidates with long and frustrating tests, because even rejected candidates may end up working for other firms in the same industry that the employer has alliances with. The most poignant use of CAT for improving test takers’ experience is in health testing. For a person who has just been diagnosed as being in the early stages of a neurodegenerative disease, it is distressing to complete a full quality-of-life assessment that might have questions that are only relevant to later stages of the disease, such as “Can you feed yourself using a spoon?” Using CAT, people who currently experience a high quality of life would never be confronted with this question.

Summary of item response theory

Using IRT modeling to treat a psychometric test as a collection of items instead of a single whole unit presents multiple advantages for test creators:

• Reliability estimation depends on the test taker’s level of the latent trait, as opposed to being constant. This means that if there are a lot of difficult questions in a test, the standard error of measurement will be smaller for high levels of the latent trait and larger for low levels of the latent trait, where there are not enough questions to get an accurate measurement.
• Item parameters are modeled on the item level. This means that if a test is changed, only the new items have to be piloted to get their parameters, rather than the whole test. Test development and evolution is therefore cheaper and faster.
• Ability score (theta) is independent of the specific items used. This means that the outcomes of two different tests measuring the same latent trait can be compared.
• Item parameters are invariant across the ability levels of the test takers in the norm group. So items can be piloted on a subsection of the population, and the calculated parameter estimates will apply to the whole population. This makes test development easier, as it is not necessary to get a representative sample.
• Adaptive testing is possible, because it is possible to mix and match test items so that different candidates take different tests, but their scores can still be compared.


Confirmatory factor analysis

IRT addresses the assessment of a single trait, while factor analysis looks at the relationship between multiple traits. Classical forms of factor analysis are now described as exploratory factor analysis (EFA), to distinguish them from latent-variable modeling approaches such as confirmatory factor analysis (CFA) (Brown, 2006). With EFA, there is an unstated assumption that we start off with some data and explore them to identify their hidden structure. But in practice, it is impossible to interpret the results of a factor analysis without applying some theoretical presuppositions to our interpretations, and so this formulation can be rather misleading. Let us take, as an example, one of the bits of information extracted by the exploration: the number of factors. Once factors are identified, they need to make sense; and as we have seen, it is not enough to depend on purely numerical criteria, such as the Kaiser criterion or the Cattell scree test. We need to assign a meaning to each factor, e.g., extraversion, numerical ability, or even simply acquiescence. But as soon as we do this, our model is no longer atheoretical. We have suggested a model of the latent structure of the set of variables. It could be argued that we should then go on to carry out a separate experiment to confirm that the hypotheses generated by the first were supported or verified. And this indeed is what CFA does.

CFA first came into its own with the advent of structural equation modeling in the 1980s. Essentially, there needs to be an advance specification of various parameters, such as the number of factors, the items that are expected to load on each factor, or the direction in which the items are to be scored. The data are then fitted to this latent-variable model by maximum-likelihood techniques, and the model is then tested for goodness of fit to the data. Several different models can be attempted, and the best identified. Normally, there is then a process of elimination where different constraints are removed one by one to generate the simplest solution, which is then justified by the principle of parsimony (all else being equal, the simplest explanations are to be preferred).

CFA can analyze either parametric or nonparametric data, making it particularly appropriate for use in item analysis. Tests with binary items (e.g., right vs. wrong, or yes vs. no) can be analyzed using logit models. Tests with categorical multiple-choice items can be analyzed using logistic models, as can tests with ordinal (ranked) data in a personality questionnaire (e.g., the categories of “strongly agree,” “agree,” “uncertain,” “disagree,” and “strongly disagree”). Item analysis in classical psychometrics using EFA on tests with binary or ordinal data treated the items as parametric, which breached some fundamental assumptions and led to consequent errors, particularly for items of very high or very low difficulty. However, we should not abandon EFA approaches altogether. They remain valuable at the exploratory stage, and developments in structural equation modeling techniques now allow logit and logistic variants.

6

Personality theory

Although the term “personality” is widely used in everyday conversation, defining its meaning is not a simple task. We commonly talk about television personalities, or we may describe someone as being a personality or as having a lot of personality—suggesting the kind of person who is typically lively and loquacious and who tends not to be ignored. We also use the term “personality” in relation to a person’s most striking characteristics. For example, we may describe someone as having a jovial personality or an aggressive personality, meaning that these are their most salient characteristics and that they tend to respond to a variety of situations in a certain way. It is hard to imagine, for example, someone who is usually shy suddenly becoming the life and soul of the party, or someone who is easily angered not rising to the bait when insulted. So when we describe the personality of someone we know, we assume their characteristics to be stable—not only in different situations but also over time. Despite personality having different meanings in different contexts, it is a term that is generally well understood in everyday language. It may seem surprising, therefore, that psychologists cannot agree on a definition. Gordon Allport, one of the earliest personality theorists, identified almost 50 different definitions of personality in the literature as early as 1937. Since then, many others have been put forward:

• “the most adequate conceptualization of a person’s behavior in all its detail” (Karl Menninger)
• “a person’s unique pattern of traits” (J. P. Guilford)
• “the dynamic organization within the individual of those psychophysical systems that determine his characteristic behavior and thought” (Gordon Allport)
• “those structural and dynamic properties of an individual as they reflect themselves in characteristic responses to situations” (Lawrence Pervin)
• “the distinctive patterns of behavior (including thoughts and emotions) that characterize each individual’s adaptation to the situations in his or her life” (Walter Mischel)

The general conclusion must be that no substantive theory of personality can be applied with any generality, hence a more pragmatic approach is needed: adopting a definition of personality that suits the purpose for which it is being used. One reason for the lack of consensus is that definitions of personality are based on the theoretical perspective of the psychologist formulating the definition. A single, all-encompassing theory, and associated definition, of personality does not exist, nor is one ever likely to. Instead, the many different theories, each focusing on different but related aspects of human behavior, contribute more or less to our understanding of what personality is.

Theories of personality

The major psychological theories of personality are described in this chapter. Each theory has produced its own form of assessment—for example, questionnaires are derived from the psychometric approach, projective techniques from the psychoanalytic approach, behavioral rating scales from the social learning approach, and repertory grids from the humanistic approach. The various theoretical approaches differ fundamentally in their conceptualization of human personality. The psychoanalytic approach views personality as formed within the first few years of life, relatively unchangeable thereafter, and determined by the need to control sexual and aggressive instincts. In contrast, humanistic psychologists emphasize the person’s active role in shaping their own experiences and believe that a psychologically healthy person is one who is striving for self-actualization rather than simply controlling their instincts. Social learning theorists are also deterministic in their approach, like psychoanalytic theorists, but focus instead on environmental influences on behavior. Followers of the psychometric approach, discussed in Chapter 1, see themselves as scientists searching for facts, and criticize both psychoanalysts and humanists as being unscientific.

Psychoanalytic theory

Sigmund Freud’s psychoanalytic theory, based on the notion that an individual’s personality is a manifestation of their underlying unconscious processes, has had a profound effect on the way in which human beings think about themselves. Although it is not a scientific theory, in the sense that it cannot be tested empirically, Freud’s work has led us to accept the idea of the unconscious and that much of our behavior is motivated by impulses of which we are largely unaware. According to Freud, our adult personality is shaped by our thoughts and experiences in early childhood. But because these early thoughts are often unacceptable due to their sexual and aggressive nature, they are repressed. Although we cannot remember these early experiences as adults, they are assumed to play a critical role in our behavior.

Freud conceptualized the structure of personality as comprising three parts: the id, ego, and superego. Although these parts interact, each has a different function. The id, which is entirely unconscious, is the innermost core of personality, from which the ego and superego later develop. Freud argued that infants are born as pure biological creatures, motivated by intense erotic and aggressive desires, and that the psychological locus of these desires is the id. According to Freud, these desires build up in the id and produce tension, causing the id to function in an irrational and impulsive way. The seeking of immediate tension reduction is known as the pleasure principle. It is postulated that in order to reduce tension, the id forms an internal image or fantasy of the desired object. For example, a hungry infant may fantasize about the mother’s breast. But because the reduction of tension cannot be wholly achieved through fantasy alone, the ego develops to take over this function. Soon after birth, part of the id develops into the ego, or conscious self. The formation of the ego allows the infant to cope with the desires of the id through rational thought.


It must differentiate between internal desires and the reality of the external world. The ego is governed by the reality principle, which requires it to test reality and delay the discharge of tension until the appropriate object is found. For example, the ego allows the infant to reason that although the breast is not available now, it will be available again at some point in the future. This is the early beginning of the ability to delay gratification, i.e., the ability to wait for something that we want very much. Thus, the id seeks the immediate fulfillment of desires, whereas the ego mediates between the id and the external world, testing reality and delaying impulses for immediate gratification until the appropriate conditions arise. A major function of the ego is to channel erotic and aggressive desires into more culturally acceptable activities.

The superego develops from the ego and internalizes the moral standards set by parents and, indirectly, by society. By producing feelings of guilt, the superego restricts the attempts of the id and ego to obtain morally unacceptable gratification. According to Freud, a person with a well-developed superego will not succumb to immoral acts such as violence or theft, even when no witnesses are present. Essentially, the superego is the conscience, differentiating good from bad and right from wrong.

The id, ego, and superego are in perpetual conflict, with the id trying to express instinctual desires, the superego trying to impose moral standards, and the ego trying to keep a balance between the two. Freud argues that an imbalance in this system will result in anxiety, and that in order to reduce this anxiety, defense mechanisms come into play. Defense mechanisms are unconscious processes that do reduce anxiety, but only at the expense of a distorted reality. They include denial, repression, and regression. Denial occurs when a person does not acknowledge the existence of some aspect of reality. Repression occurs when a person keeps anxiety-inducing thoughts from consciousness. Regression occurs when a person reverts to behaviors that are characteristic of an earlier stage of development.

From a psychoanalytic perspective, an individual’s personality derives partly from the way in which conflict is resolved between the id, ego, and superego, and partly from the way in which they coped in childhood with problems at different stages of development. Freud believed that we proceed through five stages of psychosexual development: oral, anal, phallic, latency, and genital. During the oral stage (first year of life), infants derive pleasure from sucking. In the anal stage (second year of life), toddlers derive pleasure from withholding and expelling feces. During the phallic stage (age three to six years), children derive pleasure from touching their genitals. It is at this stage that boys are believed to experience the Oedipal conflict—i.e., the conflict between their sexual desire for their mother and their resultant fear of castration by their father—a conflict that is resolved by identifying with their father. A parallel process is purported to occur for girls, resulting in identification with their mother. Ages seven to 12 represent the latency period, during which children are less concerned with sexual desires. The final stage is the genital stage, when children reach puberty and begin to experience adult sexual feelings.
A central tenet of Freud’s theory is that a child who successfully progresses through each stage will reach the genital stage and become a mature adult. But if conflict is not resolved, development may be arrested or fixated at an earlier stage, thus influencing their adult personality. For example, a person fixated at the oral stage may engage in excess—not just in eating and drinking, but indulging in excess more generally (the oral personality)—and a person fixated at the anal stage may be overly concerned with tidiness and saving money (the anal-retentive personality). Freud uses his theory to explain
a wide range of psychological phenomena, such as the content of dreams, slips of the tongue, and popular figures of speech. For example, the popular expressions “filthy rich” and “rolling in it” are seen as expressions of an underlying recognition that money and feces are related in this way. Therefore, according to psychoanalytic theory, our personality is largely determined by our experiences in the first five years of life. Although Freud had many followers, he was very intolerant of dissenting views. While agreeing with many of his theories, many of his closest colleagues took issue with some of Freud’s ideas. Two of the most well-known dissenters in Freud’s day were Alfred Adler and Carl Jung. Contemporary Freudians, or neo-Freudians as they are known, have adopted an object-relations approach to understanding human personality—i.e., they place more emphasis on the role of the ego and its independence from the id, as well as on the processes involved in attachment to and autonomy from parents.

Humanistic theory

The humanistic approach to the study of personality focuses on the individual’s subjective experience. Unlike their psychoanalytic colleagues, humanistic psychologists start with the premise that people are basically well-intentioned, not products of sexual and aggressive drives, and that they have a need to develop their potential (self-actualization). From a humanistic perspective, self-actualization—rather than the need to control undesirable instincts—is the major influence on personality development. One of the most influential humanistic psychologists was Carl Rogers, who, like Freud, developed his theoretical ideas through work with patients—or, as Rogers called them, clients. Rogers was the founding father of client-centered therapy, an approach to therapy and to the understanding of personality which expounded the fundamental nature of the individual’s tendency to fulfill their potential. According to Rogers, the tendency to fulfill one’s potential, or self-actualize, is a basic motivation of all human beings. The aim of client-centered therapy is therefore to help individuals change in a positive direction. Instead of prescribing a course of action, the role of client-centered therapists is to act as a sounding board to help individuals decide for themselves which direction to take.

Fundamental to Rogers’ theory is the notion of the self-concept. All experiences are believed to be evaluated in relation to a person’s self-concept, and a person’s perception of the self is believed to have a profound effect upon their thoughts, feelings, and behaviors. For example, a man who believes himself to be attractive to women will behave differently in female company than a man who does not. However, a person’s self-concept does not always reflect reality. Just because a man thinks that women find him attractive does not mean that this is necessarily the case. According to Rogers, a person whose self-concept is consistent with reality will be well-adjusted, whereas emotional problems such as anxiety will result when the two do not match. It was also Rogers who put forward the concept of the ideal self, i.e., the person we would like to be. A close correspondence between the real self and the ideal self is thought to lead to emotional well-being, whereas a large discrepancy between the two is likely to lead to psychological distress.

Abraham Maslow was also a key figure in the development of humanistic psychology. Although similar in approach to Rogers, Maslow is probably best known for his hierarchy of needs. According to Maslow, individuals are motivated by a series of needs
beginning with physiological needs such as hunger and thirst, moving up through a hierarchy including the need to feel safe and the need to feel loved, and culminating in the need to find self-fulfillment and realize their potential. It is only when needs at the bottom of the hierarchy are at least partly met that those higher up begin to motivate us to action. Maslow argues that we are unlikely to be motivated by a desire to understand our wider environment, or by a desire for beauty, if our energies are being consumed by the search for food. It is only when our basic needs are satisfied that we can turn our attention to higher levels. Maslow was particularly interested in those who had reached the highest level, self-actualization, and carried out studies to establish what distinguished these people from others. He found that college students who had reached the stage of self-actualization were exceptionally well-adjusted. He also studied eminent people such as Albert Einstein and Eleanor Roosevelt, and found them to have the following characteristics:

• They perceived reality efficiently and could tolerate uncertainty.
• They accepted themselves and others for what they were.
• They were spontaneous in thought and behavior.
• They were problem-centered rather than self-centered.
• They had a good sense of humor.
• They were highly creative.
• They were resistant to enculturation, although not purposely unconventional.
• They were concerned for the welfare of humanity.
• They were capable of deep appreciation of the basic experiences of life.
• They established deep, satisfying interpersonal relationships with a few, rather than many, people.
• They could look at life from an objective viewpoint.

Although humanistic psychologists would not deny that biological and environmental influences might contribute to personality development, their main emphasis is on the active role played by the individual. For psychologists such as Rogers and Maslow, self-actualization is the key to psychological health. A humanistic theorist who played a key role in the development of assessment techniques was George Kelly, whose personal-construct theory led to the idea of the repertory grid.

Social learning theory

The social learning approach to understanding personality formation emphasizes the importance of the social environment in determining an individual’s behavior—that is, behavior is learned through our experiences in interacting with the environment. Individual differences in behavior are viewed as resulting from past differences in learning experiences. According to social learning theory, behavior is acquired in two ways: reinforcement and modeling. The process of reinforcement is based on the principle that behavior is modified by its consequences; behavior that has favorable consequences is more likely to be repeated, while behavior that is not rewarded or is punished is less likely to be performed again. For example, gender differences in behavior are thought by social learning theorists to result from differential reinforcement of boys and girls. The consequences of many behaviors for children depend on gender: girls will generally receive a
much more positive response than boys for playing with dolls, while boys are more likely than girls to be reinforced for playing with cars and trucks. And because these behaviors produce different outcomes according to the child’s gender, they come to be performed with different frequency by boys and girls; girls more often than boys play with dolls, and boys more often than girls play with cars and trucks.

Although reinforcement has a powerful influence on shaping behavior, social learning also occurs through the observation and imitation of others in the absence of reinforcement. This process is known as modeling or observational learning. To take gender differences in behavior as an example once again, the modeling of individuals of the same gender is considered to be important for the process of gender-role development. Children learn about both male and female gender role behavior through observational learning. But they are more likely to imitate models of the same gender as themselves, not only because this is expected to yield more favorable consequences, but also because they come to value behavior that is considered to be appropriate for their own gender.

While modeling is now generally viewed as an important aspect of social learning, the mechanisms through which this process operates appear to be rather more complex than previously thought. Cognitive social learning theorists such as Albert Bandura believe that cognitive skills play a fundamental role in modeling. These include the ability to classify people into distinct groups, to recognize personal similarity to one of these groups, and to store that group’s behavior patterns in memory as the ones to be used to guide behavior. The influential social learning theorist Walter Mischel pointed to cognitive processes that influence behavior, such as selective attention to information in the social environment, and expectations about the consequences of different behaviors. According to Mischel, it is individual differences in cognitive processing that cause individual differences in behavior among different people in the same situation. For Bandura, self-efficacy—our personal judgment of our capabilities—is a fundamental cognitive process that determines our behavior.

Because social learning theory stresses the importance of the social context in determining whether a person will behave in a certain way, these theorists do not subscribe to the idea of personality traits, i.e., characteristics of an individual that are shown across all situations. Instead, individuals who are shy in one social setting (e.g., at work) are not necessarily expected to be shy in another (e.g., at the gym). Although it is generally acknowledged that a person’s social environment will influence their behavior, the idea that there is likely to be little consistency in behavior from one setting to another is now considered to be too extreme by most personality theorists and researchers.

Behavioral genetics

While behavioral genetics is not a theory of personality, its impact on the scientific study of individual differences in personality has been considerable. This controversial approach has been adopted with enthusiasm by evolutionary psychologists and sociobiologists alike, and has focused on studies comparing the personality differences found between identical twins with those between nonidentical twins. A pair of identical twins share identical genes with each other. For each pair of nonidentical twins, the similarity between their genes is no different than that between nontwin siblings. These studies collect large samples of twin pairs of each type. For each psychometrically assessed
personality trait, a comparison of the average difference between each pair of identical twins and the average difference between each pair of nonidentical twins can give an estimate of how much that trait is genetically determined. Several studies have demonstrated that characteristics such as extraversion and neuroticism do follow this principle. However, it is now generally accepted that what we inherit is a predisposition to behave in particular ways, and that our experiences in our environment will either minimize or maximize our inherited potential. The question that psychologists are asking now is not “Is characteristic X genetically or environmentally determined?” but instead “How does the environment interact with the genetic predisposition to enhance or diminish characteristic X?”

To illustrate this point, let us take the example of differences between males and females. It has often been assumed that many boys are born with a predisposition to be more active and aggressive than girls, as well as more interested in toys such as trucks and guns; and that many girls are less interested in rough play and more interested in dolls, jewelry, and dressing up. Behavioral geneticists point to the principal genetic difference between the sexes, the presence or absence of a Y chromosome, which plays a role in the production of the androgens (male sex hormones). Thus, any predisposition toward male gender-role behavior in boys may be associated with the higher levels of testosterone to which boys are exposed. Girls’ behavior may be attributed to the lower levels of androgens to which they are exposed in the womb. However, we also know that differences exist between the way in which parents treat their sons and daughters, encouraging them toward sex-stereotyped behavior.

So what is the cause of the sex differences in behavior that are apparent between boys and girls? It seems likely that both genes and the environment—as well as, most importantly, the interaction between them—have a role. While boys and girls may be born with predispositions to behave in gender-stereotyped ways, parents and others act on any difference between the sexes to encourage boys and girls to follow different developmental paths, resulting in even larger gender differences in some aspects of behavior. However, this is not inevitable: at some point in their lives, they may choose to do otherwise. Not all boys and girls behave in gender-stereotyped ways—there is some overlap between the sexes. There are also cultural differences in the expression of gender-role behavior (what is considered a male activity in one culture can be viewed as a female activity in another), and these cultural differences demonstrate the important influence of the environment in shaping the behavioral characteristics that an individual will exhibit. The attitudes and stereotypes to which we are exposed in our everyday life have a profound influence on how we behave, as do our own choices.

Another characteristic that has received attention from behavior genetics is aggression. Whereas there is evidence from twin studies that a tendency to aggression is inherited, the extent and way in which a person will exhibit aggressive behavior largely depend on that person’s social environment. The family is thought to inhibit or exacerbate aggressive behavior in childhood and adolescence according to the reaction of parents. Studies have also shown that the wider social environment plays a key role.
For example, boys who are raised in a subculture where aggression is acceptable are more likely to engage in aggressive behavior themselves. With the huge advances that have been made in identifying specific genes that are responsible for specific disorders such as Huntington disease or cystic fibrosis, attempts have also been made to identify single genes that determine specific behavioral characteristics. It has been claimed, for example, that there may be a single gene for
sensation-seeking, while emotionality is polygenic. These claims await replication before conclusions can be drawn regarding their significance. Nevertheless, it is increasingly being realized that most behavioral characteristics, to the extent that they are genetically determined at all, result from the interaction of many genes and not just a single gene pair.

Type and trait theories

Type theorists propose that everyone can be divided into discrete categories that are qualitatively different from each other. Classifying people according to types dates back as far as 400 BCE; Hippocrates proposed that there were four personality types (melancholic, sanguine, choleric, and phlegmatic), associated with a predominance of each of the four bodily humors: black bile produced the melancholic (morose, gloomy, moody) type, blood produced the sanguine (confident, cheerful, hopeful) type, yellow bile produced the choleric (hot-tempered, irritable) type, and phlegm produced the phlegmatic (placid, indifferent, unconcerned) type.

One of the most influential type theories was proposed by Jung and forms the basis of the Myers–Briggs Type Indicator (MBTI). In this theory, there are four ways in which people can be categorized: they can be either an extravert or an introvert, either sensing or intuitive, either a thinking or a feeling type, and either a judging or a perceptive type. Respondents are assigned to either one or the other of each type, and their combination of types determines their overall personality. For example, those of the ENFP type (extraverted, intuitive, feeling, and perceiving) are described as enthusiastic innovators who are skillful at handling people. Theodore Millon also used a type approach in the Millon Index of Personality Styles to classify people as, for example, retiring or outgoing, individualizing or nurturing, and complaining or agreeing.

The origins of trait theory can be traced back to the development of the IQ testing movement, particularly to the work of Galton and Spearman. From the perspective of trait theory, variation in personality is viewed as continuous, i.e., a specific personality characteristic will vary in strength along a continuum. The advantage of a trait approach is that a person can be described according to the extent to which they show a particular set of characteristics. In contrast, assignment to a type is all-or-nothing; a person is categorized as either belonging or not belonging to a specific category.

In fact, the distinction between types and traits is not as clear-cut as it may seem. To illustrate the relationship between the two, Eysenck (1970) has demonstrated that the four personality types of Hippocrates can be represented by the two independent traits of extraversion and neuroticism, and that a person can exhibit each trait to a greater or lesser degree. Thus, people can be extraverted to a greater or lesser extent, as well as neurotic to a greater or lesser extent, and can be categorized as melancholic, sanguine, choleric, or phlegmatic according to their position along these two dimensions.

Despite appearing conceptually different, trait and type models are very similar in practice, and the same test can be treated as a test of either traits or types. This is because test scores are open to two possible interpretations: we can assume either that the score on the test represents a person’s actual score on a continuous trait (i.e., the extent to which the person exhibits the trait) or that it represents a measure of the probability that the person fits into one type or the other, depending on whether their score is above or below a specific cutoff point.


For example, if a person obtains a score of 3 on a 24-point scale of introversion/ extraversion, with a low score representing introversion, a high score representing extraversion, and a cutoff point of 12 that divides them, then there is a high probability that they actually are an introvert type, and a low probability that they actually are an extravert type. A score of 12 shows that they have an equal probability of being either an introvert type or an extravert type. For those with a score of 11, there is a slightly higher probability that they are an introvert type than an extravert type. Thus, a person’s score on any extraversion scale may be used to represent their position along an introversion–extraversion dimension (trait) or as an indication of the probability that the person is an extravert or an introvert (type). In practice, however, scores on type measures are collapsed so that people are categorized as belonging to either the introvert or the extravert type.

Different approaches to personality assessment

Self-report techniques and personality profiles

Self-report inventories comprise a series of items to which individuals respond according to how they view themselves. Questionnaires such as the 16PF (16 personality factor questionnaire, Cattell, 1957) and the OPP (Occupational Personality Profile, Saville et al., 1984) are the most widely used self-report measures and are available for online administration, although computerized and paper-and-pencil versions are sometimes preferred. The Orpheus Business Personality Inventory (OBPI; Rust, 2019), a work-based assessment of five personality traits and seven integrity traits, is another example. The advantages of self-report inventories are:

• They are quick and easy to administer.
• Scoring is objective.
• Responses are obtained directly from the person being assessed.

The limitations are:

• The respondents may not have good insight about themselves.
• They may try to present themselves in the best possible light.
• They may try to present themselves according to what they think is expected or desired.
• It is difficult to know whether the questionnaire has been completed with due care and attention.

The results of self-report psychometric personality tests are normally reported as a personality profile. Within a profile (rather than one test score), many subtest scores are presented, but in such a way that they can be compared with each other. There can often be items from as many as 20 subscales existing in one personality test, with their items randomly interspersed—the items only being brought together for scoring purposes. The raw item responses and raw scores are then standardized; the standardized subscale scores are illustrated as a profile, which will have a subscale score as one axis—usually with stanine scores ranging from 1 to 9, with 5 as the midpoint—and the various subscales laid out along the other axis. An example of a profile is given in Figure 6.1.


Figure 6.1 An example of a diagnostic profile taken from the Golombok Rust Inventory of Sexual Satisfaction (Rust and Golombok, 2020).

One of the earliest profile systems developed was the Minnesota Multiphasic Personality Inventory (MMPI), which is still in use today and provides a good general example of the technique. The MMPI was developed as a broad-spectrum personality test to be used with psychiatric patients, usually on admission to a psychiatric hospital. It consists of over 400 personality-test-style items (“Are you a nervous person?” “Do you sometimes hear voices?” “Are there enemies who are out to get you?” etc.), which form a set of subscales within the overall questionnaire. Within the MMPI there are subscales of hypochondriasis, depression, hysteria, psychopathy, paranoia, psychasthenia, schizophrenia, hypomania, and masculinity–femininity. These subscales are each scored and standardized separately. An individual’s MMPI subscale scores are then presented as a profile. On such a profile, areas that are particularly problematic will stand out—the higher the peak, the greater the disturbance. Psychiatrists who see large numbers of such profiles will soon begin to recognize common patterns representing well-known conditions, such as paranoid schizophrenia.

Expert use of MMPI profiles can save a clinician a great deal of time. Obvious disorders stand out easily, difficulties are immediately made apparent, and the data are summarized in a standard and accepted way for all clinicians. While the ways in which profiles are used by professionals depend on the use of judgment—with the instrument as a tool for helping an essentially human process—the same principles of reliability, validity, and proper standardization apply to each of the subscales as would apply to a single longer scale. Thus, the proper psychometric construction of a profiling system is a much more complex and time-consuming process than for a single test.


Reports by others

Reports by others involve an evaluation by someone who is familiar with the person being assessed, for example, an appraisal from a supervisor or more senior colleague. Rating scales are often used for this purpose. The advantages are:

• The reports are independent of the impression that the respondent may wish to convey.
• The report is based on a person’s actual behavior at work.

Limitations are:

• The relevant criteria for judging successful performance are difficult to define.
• The report depends on how well the assessor knows the respondent.
• The assessor’s report may be influenced by how much they like or dislike the respondent, or by personal knowledge of the respondent (such as regarding previous qualifications).
• The assessor may have a vested interest in the respondent obtaining a positive or negative evaluation.
• The assessor may not be competent to evaluate the respondent.
• The assessor may have a tendency to give positive, neutral, or negative evaluations of all respondents.
• The assessor may hold stereotypes (for example, about male or female roles in the workplace) that may influence the report.
• The assessor may be reluctant to commit negative evaluations to a database.
• The assessor may be biased against respondents who are unpopular, even if they are good at their job.

Online digital footprints

Personality analyses based on people’s online digital footprints are also a form of report by others, and have become increasingly controversial since the introduction of privacy laws such as the GDPR in the EU. These can also be biased. For example:

• They are still based on the online digital footprint created by the person themselves, and with social networks in particular, these will generally have been designed to give a positive impression.
• The data may be incomplete and hence subject to large variations in the amount of error present in a prediction.
• The language a person uses in a familial situation may not have any relevance to how they behave at work.
• In cases where consent for a particular use has not been previously obtained, it is an invasion of privacy.

It is common practice for recruiters to search through the online traces of applicants—and some social media networks, such as LinkedIn, are deliberately set up for this purpose. However, the process is controversial, and it is difficult to legislate on this practice, particularly where no record of such intrusion is kept or where AI and phone apps are used to interpret the skimmed data in question.


Situational assessments

Situational assessments refer to the evaluation of a person’s behavior in circumstances that are like the work environment. The situation may be real-life or simulated for the purpose of assessment. For example, an essential element of the selection process for astronauts involves subjecting potential recruits to simulated spacecraft conditions to assess their behavioral tolerance for the stressors that they would experience during an actual mission. Another example of the use of performance methods in organizations is the digital “in-basket” technique. The person is presented with an in-basket filled with a variety of letters, memos, etc., to deal with in a limited period. They are then assessed according to criteria such as decision-making and problem-solving ability. The advantage is that it gives a sample of a person’s actual behavior in a relevant situation. The limitations are that it is time-consuming and costly.

Projective measures

Projective tests involve the presentation of ambiguous stimuli, such as inkblots or pictures, on which individuals can project their inner needs and feelings. It is assumed that whatever structure is attributed to the stimulus reflects the respondent’s personality. Probably the best-known projective measure is the Rorschach inkblot test, comprising 10 inkblots to which individuals are asked to respond according to what they see. The test is scored by taking account of factors such as the content of the response, the part of the inkblot that forms the focus of the response, and the time taken to respond. The Thematic Apperception Test (TAT) uses pictures as projective stimuli, and respondents are asked to say what events might have led up to the scene in the picture, what is happening in the picture, and what the outcome will be. The advantages of projective tests are:

• They are an indirect assessment of personality, and therefore it is less obvious what a socially desirable response should be or what might be expected.
• Individuals are free to respond in any way they wish (they are not constrained by standard response options).
• The responses are thought to represent unconscious as well as conscious processes.

Limitations are:

• No clear consensus exists about the meaning of responses.
• The procedure lacks validity against external criteria.
• Responses are open to situational influences such as the characteristics of the test administrator.
• Scoring depends on the expertise of the test administrator.
• Scores are unreliable because responses are situation-dependent.

Observations of behavior

Some assessments involve the systematic observation of a person’s behavior, often with a focus on the antecedents and consequences of the behaviors of interest and with a view toward developing an intervention program. Behavioral observation is most often used in clinical settings to assess behavioral problems. For example, a person with agoraphobia may be observed at the supermarket or on public transport to assess factors associated with the onset of anxiety and the consequences of the person’s response to an anxiety-provoking situation. In the workplace, observation of behavior might be used to assess how well a person can communicate information or give feedback. The advantage is that it gives a direct assessment of actual behavior and thus provides information on whether a person can do something rather than just say they can. The limitations are:

• It is time-consuming.
• It only assesses samples of behavior.
• Behavior may change as a result of being observed, i.e., it may show what a person can do, not what they will do when not observed.

Task performance methods

Task performance assessment involves assessing a person’s ability to conduct a relevant piece of work when carried out in a group setting, and is like a situational assessment. A widely used example in the assessment of leadership ability is the leaderless group, whereby a group of people are required to carry out a task without anyone being put in charge. The behavior of each person—and of the group as a whole—is monitored by an observer and rated according to a set of criteria of interest that may include a person’s ability to work in a team, to communicate, or to take control. The advantage is that it gives a sample of a person’s actual behavior in relevant work tasks. The limitations are:

• It is time-consuming and costly.
• Respondents’ behavior may be influenced by the knowledge that they are being assessed and by who else is in the group.
• Respondents may perform differently in an artificial situation.
• Stress produced by being assessed may interfere with optimal performance.
• It depends on the skills and integrity of the observer.
• Interrater reliability can be poor.

Polygraph methods

These are largely the classic “lie detector” tests that are now mostly illegal for use other than in forensic settings—and even then, only by state organizations or those registered with them. Their role is generally limited to the identification of dishonesty during the assessment process. The physiological measures used in these circumstances include pulse rate, blood pressure, respiration, brain waves, and galvanic skin response (changes in electrical resistance of the skin associated with sweating). Most commonly, these indices have been used to assess reactivity to stressful situations. The lie detector, polygraph, or digital record produces a graphical representation of a person’s physiological state (e.g., pulse rate and galvanic skin response) in response to detailed questioning. It is claimed that lying causes changes in physiological responses that can be detected from the graphical output of the machine. The advantage is that the respondents cannot easily fake their responses. The limitations are:

• There is no clear evidence that what is detected is lying as opposed to changes in emotional state for other reasons.
• There is a risk of the respondent being accused of lying even when telling the truth, due to false positive results.
• It is dependent on the expertise of the assessor.
• It is dependent on the psychological state of the respondent.
• The respondent’s responses may be influenced by the behavior of the assessor.
• It is an invasion of personal privacy.

Repertory grids

The repertory grid technique stems from personal-construct theory, a framework for understanding and assessing personality developed by George Kelly. This approach assumes that everyone interprets (or “constructs”) events differently, and aims to elicit the specific constructs that are commonly used by individuals to understand their world. A construct is deduced by asking a person to state the ways in which two people (elements) are alike but different from a third. For example, from a list of significant people in a person’s life, they may be asked how their father and brother are alike but different from their sister. As this is repeated for different groups of three people, the constructs they use to organize information about others (i.e., the way they see their own social environments) begin to emerge. These may be tense vs. relaxed, loud vs. quiet, or hardworking vs. lazy. The dimension along which people are seen to vary—for example, how tense or relaxed they are—would then be considered one of that person’s constructs. The Role Construct Repertory Test (the Rep Test) is an example of a measure that has been designed to identify personal constructs in relation to a variety of roles such as “boss,” “disliked coworker,” and “liked acquaintance” (a minimal sketch of how the resulting grid can be laid out follows the lists below). The advantages are:

• It can be tailored to a specific person and a specific situation.
• It is not immediately obvious what constitutes an acceptable response.

Limitations are:

• Expertise is needed to use it effectively.
• There is always a need to make new versions for different purposes.
• Reliability and validity need to be determined for each version.
• It requires collaboration from the client in selecting relevant elements.
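In practice, the outcome of the elicitation procedure described above is a grid: the elements along one axis, the elicited bipolar constructs along the other, and a rating of each element on each construct in the cells. The following is a minimal sketch of such a layout in Python; the elements, constructs, and ratings are all invented for illustration and are not drawn from the Rep Test or any published grid.

```python
# Minimal sketch of a repertory grid as a data structure: elements (significant
# people) as columns, elicited bipolar constructs as rows, and a rating of each
# element on each construct in the cells. All values are invented for illustration.
elements = ["father", "brother", "sister", "boss", "liked acquaintance"]
constructs = [
    ("tense", "relaxed"),
    ("loud", "quiet"),
    ("hardworking", "lazy"),
]

# ratings[c][e]: position of element e on construct c (1 = left pole, 5 = right pole)
ratings = [
    [2, 3, 4, 1, 5],
    [4, 2, 3, 1, 4],
    [1, 2, 4, 2, 3],
]

header = " " * 26 + "  ".join(f"{e[:6]:>6}" for e in elements)
print(header)
for (left, right), row in zip(constructs, ratings):
    label = f"{left} (1) - {right} (5)"
    print(f"{label:<26}" + "  ".join(f"{r:>6}" for r in row))
```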

Sources and management of bias

Sabotage or bias in personality assessment can be either deliberate or unconscious. For example, a person may agree with the lie-scale item “Throughout my life I have always returned everything I have borrowed” either because they are lying or because they genuinely (but mistakenly) believe this to be true. Management of sabotage can be considered from the point of view of test construction, test administration, and scoring. Test manuals should always be consulted to establish the steps that have been taken to minimize the possibility of sabotage or distortion. Because of the dangers of sabotage, evidence from outside the testing situation, such as corroboration by others, should also be considered. Sources and management of sabotage are described in the following.


Self-report techniques and personality profiles

Sabotage can take place through the respondent’s enthusiasm to present themselves in a favorable light. They may also engage in barefaced lying, by responding either with what they think is the “right” answer or, somewhat more subtly, according to how they think a successful candidate would respond. They may also not take the questionnaire seriously but just go through the options to get to the end. This can be achieved by responding randomly, or answering in the same way (e.g., “disagree”) to long sections of the questionnaire. These issues can be managed by:

• using a questionnaire with a built-in lie scale;
• using a questionnaire that has eliminated items to which people commonly lie;
• informing respondents that the test is able to detect lying;
• using a questionnaire that has balanced favorable with nonfavorable aspects of each personality characteristic being measured;
• using a questionnaire that has items for which the preferred responses are not obvious;
• using a questionnaire for which random responses generate a neutral score or profile;
• informing respondents that the test is able to detect random responding;
• using a questionnaire with a scoring system that rejects completed questionnaires with the same response throughout and readministers the questionnaire (see the screening sketch after this list); and
• using a questionnaire for which items likely to be wrongly answered have been screened out during development.
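Checks of this kind, such as flagging respondents who give the same response throughout or whose responses show implausibly little variability, are straightforward to automate at the scoring stage. The sketch below is a minimal illustration only; the run-length and variability thresholds are invented assumptions and would need to be set empirically for any real questionnaire.

```python
# Minimal sketch of automated response-quality screening for a rating-scale
# questionnaire. The thresholds are illustrative assumptions, not values taken
# from any published instrument.
import statistics

def screen(responses, long_string_limit=10, min_sd=0.3):
    """Return a list of quality flags for one respondent's raw responses."""
    flags = []

    # Straight-lining: a long run of identical responses.
    run, longest = 1, 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    if longest >= long_string_limit:
        flags.append("possible straight-lining")

    # Near-zero variability across the whole form.
    if statistics.pstdev(responses) < min_sd:
        flags.append("implausibly low response variability")

    return flags

example = [3] * 12 + [4, 2, 3, 3]     # invented responses on a 1-5 format
print(screen(example))                # ['possible straight-lining']
```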

Reports by others

Assessors may present respondents in a favorable light because this reflects positively on themselves, e.g., if they have been responsible for training the respondent. There may also be collusion between the assessor and respondent—for example, asking respondents to assess themselves. Furthermore, the assessor may be tempted to give a negative evaluation of someone they do not like, or a particularly good evaluation of someone they do like. Sometimes the assessor may simply not follow the instructions on the assessment forms, e.g., they omit sections. These issues can be managed by:

• developing a specific assessment criterion and thoroughly training assessors in its accurate and fair application;
• carefully monitoring assessments;
• obtaining assessments from more than one person;
• using independent assessors;
• taking steps to ensure that assessors understand the purpose of the assessment process and their responsibility within it;
• ensuring that the assessor is informed that self-assessment is not acceptable;
• allowing respondents to vet assessors; and
• monitoring the integrity of assessors, and ensuring that assessment forms are constructed and piloted by experts.


Online digital footprints

Digital footprints may sometimes be fake, but even when this is not the case, they will normally have been carefully constructed by their self-reporting author to create a positive impression and get others to interact with them in a favorable way. Other problems arise because they are generally assessed by machine-learning algorithms that have no human insight and are particularly prone to biases that a human could spot immediately. For example, a person’s age may be underestimated if they have many younger “friends” who are actually their students or children. As machines become more sophisticated, this may be ameliorated to some extent, although it is likely that they will still often be wrong. The only difference is that then humans will not find it so easy to spot errors, as they may be the result of the machine using complex deep learning strategies beyond the capacity of the human mind to comprehend. At present, the use of these systems is unregulated except by privacy laws, and although this is likely to change, it is also likely that AI systems will always be many steps ahead, for example with the use of derived data.

Situational assessments

The respondent may not treat the assessment seriously. While this can be managed by independent monitoring of their attitudes toward assessment, their behavior may also be influenced either positively or negatively by the knowledge that they are being assessed. It is not surprising that respondents may perform differently in artificial situations; their motivations will be different, and the stress of engaging in a situational assessment may interfere with optimal performance. The success of this approach is very dependent on the skills and integrity of observers. While interrater reliability can often be poor, it can certainly be improved by good training to agreed guidelines.

Projective measures

One of the advantages of projective techniques is that there are no obviously correct answers in most cases. Although a respondent may lie about what they see in an inkblot or an ambiguous figure, it is far from obvious what a possible “right” answer would be. Respondents are often wary of what they may be giving away (by focusing on sexual symbolism, for example), but such content is rarely part of any actual scoring system. Respondents may simply not cooperate with the task or may see it as a joke. This can be managed by using experienced test administrators who can identify noncooperation and by training test administrators in how to build rapport with the respondent to facilitate open responding.

Observations of behavior

If the observations are carried out without the knowledge of the respondent, then this is in principle an excellent way of gaining information. However, there are clear ethical questions here, and it is likely that most assessment situations will be legally required to have previously obtained the respondent’s informed consent. In these circumstances, the person being observed will often be on their best behavior. This can be managed to some extent by observing the respondent for longer periods of time, or at random intervals.


Task performance methods

Respondents may present themselves in the expected role (e.g., giving instructions in the leaderless group) rather than in the role to which they may be most suited (e.g., following instructions). They may also deliberately undermine others who are competing in the task, or they may do what they think is required (e.g., make tough decisions during the in-basket procedure) rather than what they would do in practice (e.g., fail to make tough decisions when necessary). In general, it gives the respondent the opportunity to present what they can do (e.g., be industrious during the assessment) rather than what they do in practice (e.g., be lazy). These issues can be managed to some extent by:

• making the task as real as possible;
• using experienced and well-trained assessors;
• monitoring the respondent’s contribution to the functioning of the group rather than just leadership;
• ensuring that the preferred responses are not obvious; and
• lengthening the task and ensuring that it is sufficiently challenging.

Polygraph methods

Respondents in the know can learn to control psychophysiological responses, for example by clenching their fists to produce a galvanic skin response to each question or by generating deceptive thoughts to mislead the machine, e.g., thinking relaxing thoughts rather than concentrating on the questions. This can be managed by careful design of the interview to enable the detection of deliberate manipulation, for example to see whether an increase in galvanic skin response occurs in response to a neutral question.

Repertory grids

Repertory grids are basically self-reports, so they can also be sabotaged by the respondent presenting themselves in a favorable light or responding according to how they think a successful candidate would respond. This can be managed by careful piloting involving the simulation of attempts at distortion so that these can be recognized, and by ensuring that the test is devised by experts trained in the identification of misleading responses.

Informal methods of personality assessment

Informal methods such as unstructured interviews are particularly subject to distortion by the assessor rather than the respondent. These types of distortion can include ethnic bias, gender bias, and bias against those with disabilities—as well as biases such as ageism, heterosexism, and discrimination based on social class. For more formal methods of assessment, carefully constructed assessment procedures should take account of these forms of potential bias during the development process, eliminating their effects or reducing them to a minimum. It is a good idea to check whether this has been done by referring to the test handbook or the relevant research literature. The possibility of bias resulting directly from the assessor should always be actively considered, and assessors should be screened and trained, ideally within the context of the organization’s equal-opportunity policy. Assessment procedures should also be monitored—e.g., gender monitoring in selection for senior posts—so that inequalities can be identified and rectified.

It is important to recognize that bias is not unique to psychological assessment; it is a part of social interaction in everyday life. Nevertheless, there is a tendency to evaluate bias in an assessment instrument against a perfect world where bias does not exist. This is unrealistic. Instead, the bias in any assessment instrument should be compared to the bias that exists with other possible assessment procedures that might be used instead. The question should not be “Is this test biased?” but “Is this test more or less biased than the alternatives?”

In fact, formal assessment methods often offer greater opportunity to reduce bias than less-formal procedures such as interviews, as questions that produce differential responding—for example, between men and women—can be identified and excluded. During an interview, however, male and female candidates may be questioned differently due to the underlying stereotypes of men and women held by the person doing the interviewing. These stereotypes are pervasive in our social world, and although they may evolve and change, they are unlikely to disappear. A major strength of objective assessment techniques is that they provide a mechanism whereby the biases that result from social stereotypes can be monitored and reduced.
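This kind of monitoring can be partly automated. One widely used screen in selection contexts is the “four-fifths” rule of thumb, under which a selection rate for any group that falls below 80% of the rate for the most-selected group is flagged for closer investigation. The sketch below is illustrative only: the counts are invented, and the rule is a screening heuristic rather than a definitive test of bias.

```python
# Minimal sketch of selection-rate monitoring using the four-fifths rule of
# thumb. All counts are invented for illustration.
applicants = {"women": 120, "men": 150}
selected   = {"women": 18,  "men": 36}

rates = {group: selected[group] / applicants[group] for group in applicants}
highest = max(rates.values())

for group, rate in rates.items():
    impact_ratio = rate / highest
    status = "flag for review" if impact_ratio < 0.8 else "no flag"
    print(f"{group}: selection rate {rate:.2f}, impact ratio {impact_ratio:.2f} -> {status}")
```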

State vs. trait measures

Some measures assess a person’s current psychological state at the time of testing (state measures), whereas others assess their general pattern of functioning, i.e., how they usually are (trait measures). A polygraph measure is an example of the former, and personality questionnaires such as the 16PF and the OBPI are both examples of the latter. Some assessments comprise both state and trait measures, allowing a person’s current state to be interpreted in light of their general trait. An example is the Spielberger State-Trait Anxiety Inventory; in this measure of anxiety, one form (the State Anxiety Inventory) asks respondents to respond to statements such as “I feel calm” according to how they feel at that very moment, and the other form (the Trait Anxiety Inventory) asks them to answer questions according to how they feel generally. Another, similar example is the Beck Depression Inventory (BDI): it assesses whether a person is generally in a depressed mood most of the time, whereas the corresponding state measure is designed to pick up on mood during a particular day or week.

Ipsative scaling

Ipsative scaling involves the use of items for which the respondent is required to choose one of two options. In a vocational preference scale, a person may be asked if they want to be an engineer (nonipsative) or, alternatively, if they would prefer to be an engineer or a chemist (ipsative). This approach is particularly useful in differentiating between career possibilities within the same general area, e.g., architect vs. interior designer. The danger of interpreting an ipsative test as if it is normative is illustrated by the following example. Person A who obtains a score of 8 for interest in engineering and 12 for interest in chemistry would be advised to pursue chemistry, whereas person B who obtains a score of 15 for chemistry and 20 for engineering would be advised to follow an engineering route. The fact that person B obtained a higher score than person A for interest in chemistry is obscured by the use of an ipsative procedure. Ipsative testing is widely used in career guidance. The Jackson Vocational Interest Survey requires respondents to choose between options such as “Acting in a school play” and “Teaching children how to write,” and produces scales such as job security, dominant leadership, and stamina. Some personality and integrity tests also use an ipsative approach. With Giotto, an ipsative test of integrity for use in occupational settings published by Pearson Assessment, respondents are presented with alternative adjective pairs such as “tolerant” vs. “secure” and asked to select the one that most applies to them. Giotto’s scale scores are designed to identify the strongest and weakest aspects of the respondent’s character. A major benefit of the ipsative format is to reduce “faking good” by forcing individuals to choose between options that are similar with respect to social desirability.

With ipsative tests, the endorsement of an item relating to one scale necessarily means that the comparison scale is not endorsed, and thus the total score obtained for each scale is not independent of the scores obtained for the other scales. Scores on ipsative tests are best interpreted in relation to scores on the other scales in the same test for the same person, rather than as absolute measures that can be compared between people (normative tests). The strength of a normative test is that all scales are independent of each other. Normative tests are favored by statisticians because the independence of the scales allows a wide range of statistical procedures to be used in data analysis, such as correlation analysis for the assessment of reliability, and because the dimensions can be interpreted independently of each other. However, with a normative test, it is possible for respondents to obtain a high score for all the scales. The advantage of an ipsative test is that it forces the respondent to rank some characteristics as more or less important than others.
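The forced-choice logic can be made concrete with a small scoring sketch. The example below is purely illustrative: the scales, items, and choices are invented and are not taken from the Jackson Vocational Interest Survey, Giotto, or any other published inventory. It shows the defining property of ipsative scores: every respondent’s scale scores sum to the same constant, so a high score on one scale can only be bought at the expense of the others, and comparing a single scale across people is not meaningful.

```python
# Minimal sketch of ipsative (forced-choice) scoring. All scales, items, and
# responses are invented for illustration.

# Each forced-choice item offers two options, each keyed to one scale; the
# respondent must pick exactly one option per item. Here the chosen scale is
# recorded directly.
def ipsative_scores(choices, scales=("engineering", "chemistry", "art")):
    """choices[i] is the scale keyed to the option chosen on item i."""
    scores = {scale: 0 for scale in scales}
    for scale in choices:
        scores[scale] += 1
    return scores

person_a = ipsative_scores(["chemistry", "engineering", "chemistry", "chemistry"])
person_b = ipsative_scores(["engineering", "engineering", "art", "engineering"])

print(person_a)   # {'engineering': 1, 'chemistry': 3, 'art': 0}
print(person_b)   # {'engineering': 3, 'chemistry': 0, 'art': 1}

# Both profiles sum to the number of items, so the scales cannot all be high
# at once, and A's "chemistry" score cannot be compared with B's: each profile
# only ranks interests within one person.
print(sum(person_a.values()), sum(person_b.values()))   # 4 4
```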

Spurious validity and the Barnum effect

Face validity is more important for personality assessment than it is for the assessment of ability. If questions are not viewed by respondents as measuring what they are purported to measure, then the respondents may not cooperate with the test. For example, if asked about their family, sexual orientation, or religion as part of a questionnaire designed to select them for a job, many people would feel this to be inappropriate and not answer the question. Employers have been sued for using questionnaires that include such questions. One test of the face validity of an instrument is whether its name and the names of the scales are acceptable to those who are required to complete it. For example, a scale labeled “neurotic” is more likely to be offensive to respondents than a scale labeled “emotional.” For this reason, an attempt is generally made to label scales in as positive a way as possible. Similarly, feedback and online-generated narrative reports for respondents (interpretive descriptions in everyday language of what a person’s scores mean) generally emphasize their positive characteristics rather than their negative ones.

Although the process of presenting a test positively does not necessarily detract from its true validity, often this is exactly the outcome. For example, the interpretations given to respondents may be so vague as to be meaningless—although very positively received! This phenomenon is known as the Barnum effect, after the famous saying by circus owner P. T. Barnum: “There is a sucker born every minute.” The statement “You are the kind of person who may have successfully overcome difficulties in the past” is a good example of the Barnum effect. It gives the impression of being meaningful but in fact is true of everyone who reads it.


It is the Barnum effect that is responsible for the success of fortune tellers and horoscopes, and many people believe that the popularity of graphology also depends on this phenomenon. Just as people recognize themselves in horoscopes, so they can recognize themselves in feedback from personality tests. Therefore, it is important to be aware that a respondent’s belief in the accuracy of their feedback may bear little relationship to the validity of the test. Similarly, just because people like the feedback that they receive from an instrument does not mean that it is a valid measure. A good instrument should be capable of producing home truths that are challenging to the respondent, whereas a test may be adopted enthusiastically by managers and candidates alike because the feedback is appealing—even in the absence of evidence of other types of validity. Test developers may also be influenced by the Barnum effect in the choice of scale labels. Consequently, assessors must be trained to avoid taking these labels at face value. It is always necessary to read the handbook carefully to find out what the scale is actually measuring, i.e., aspects of personality that are not obvious from the label, as well as aspects that are implied by the label but that have not actually been assessed. For example, some scales labeled “neurotic” include questions relating to anger, whereas others specifically attempt to exclude anger items.

Summary

The large number of conceptual frameworks available for the discussion of personality differences swamps our ability to summarize the field in any concise way. Different theories focus our attention on different aspects of our lives. Even once a particular trait of interest has been identified, there are myriad ways in which it can be assessed, and each has its own strengths and weaknesses. However, recent years have seen the ascendance of two particular frameworks. Assessment of the positive aspects of personality has come to focus on five personality traits, known as the Big Five. At the same time, the advent of big data analytics and its potential for the identification of insider threat and other forms of disruptive behavior have led to several new techniques for the assessment of personal integrity. Both will be discussed and illustrated in Chapter 7.

7

Personality assessment in the workplace

Prior to the 1990s, the use of personality testing in the workplace was widespread but controversial. Most questionnaires had been developed in academic institutions for student populations, and there were doubts about whether the same constructs would be applicable in occupational settings. Others had evolved for use in clinical contexts, such as psychiatric diagnosis, or for forensic purposes. Some occupational psychologists argued that there was no evidence that assessment by personality questionnaire had anything to offer beyond the assessment of specific competencies using, for example, work sample tests within assessment centers. However, at a more general level, they understood that interviewers are very interested in the personality of interviewees, and often find themselves working from job specifications that are very specific in their demand for certain personality types. Furthermore, some posts, such as those at the front end of sales, seem to cry out for some form of personality assessment. This was an issue that could only be settled by research validating personality tests against specified successful or unsuccessful outcomes among employees in the workforce.

Two frameworks for personality testing in occupational settings emerged. The first saw an increased focus on the characteristics that might indicate successful employment outcomes. In this context, just five traits became dominant. The second focused on the identification of potentially disruptive behavior as previously assessed by integrity tests for forensic purposes, such as the detection of insider threat and cybercrime. At the same time, there was increased concern about risk factors in the character of senior executives that might lead them to spin out of control. The underlying model for these tests was generally grounded in personality disorders rather than personality per se. These gained favor as indicators of what became known as “the dark side” of personality. In this chapter, we will review both models. We will also look at an instrument that uses both.

In the late 1990s, John Rust was commissioned by the Psychological Corporation to create two tests: one for the five personality traits, based on the five-factor model but specifically designed for the workplace; and another for workplace integrity. The resultant integrity test, Giotto, is today published by Pearson Assessment. The work-based personality test, Orpheus, which included the five personality traits, was also extended to include the same seven integrity traits as in Giotto. It is today published as the Orpheus Business Personality Inventory (OBPI) by the Psychometrics Centre, part of the Cambridge Judge Business School at the University of Cambridge. The process underlying the construction of personality and integrity scales will also be addressed.


Prediction of successful employment outcomes

Validation of personality questionnaires previously used in employment

Personality tests are more complex to validate than ability tests, because nearly all contain a multitude of different traits, each needing to be evaluated against different criteria in different work settings. Also, questionnaires associated with trait models of personality vary enormously in the number of scales they generate. Some, such as the Eysenck Personality Questionnaire (EPQ), target very few highly stable traits; others, such as Cattell’s 16PF, generate a plethora of interrelated and relatively unstable measures. This overabundance of scales presented a problem for occupational psychologists in their attempts to make comparisons and choices among different models and instruments, because researchers had been inconsistent in their choices of questionnaire. The issue was finally addressed in a series of meta-analytic studies carried out at the beginning of the 1990s (Barrick & Mount, 1991; Tett, Jackson, & Rothstein, 1991), all of which converged on a model involving five factors: openness to experience (O), conscientiousness (C), extraversion (E), agreeableness (A), and neuroticism (N). These scales, derived from the five-factor model, are often referred to as the Big Five (a term coined by Lewis Goldberg) or by their acronym, OCEAN. The five-factor model is not the only one in occupational use. Others, such as the 16PF, MBTI, and OPQ—all of which became popular earlier—still retain their champions. But today, it is the five-factor model that stands out as the model of preference when personality is assessed psychometrically, both in the workplace and in other online applications.

Historical antecedents to the five-factor model

The psychometrician Louis Leon Thurstone, using the factor-analytic approach in the 1930s, was probably the first to suggest a specific five-factor model. However, most attribute its origin to Donald Fiske, who in 1949 published a paper reporting that with five factors it was possible to obtain similar factor definitions when different assessment techniques were used, such as self-ratings, peer ratings, and observer ratings. Subsequent research confirmed that similar results were found whether the research was carried out at work, in colleges, in military training, or in clinical assessment. In many of these cases, five strong and recurrent factors were a common outcome.

Knowledge of the field was further advanced at the University of Michigan by Warren Norman, who returned in the 1960s to Galton’s lexical hypothesis: the idea that personality could be best understood in terms of the many thousands of words used to describe it in the world’s languages. Norman argued that the development of earlier personality questionnaires had been flawed because computers had previously not been sufficiently powerful to analyze all these descriptors—of which he identified 1,710—at the same time. In the 1980s, advances in computer power saw Norman’s initially very extensive work followed up by Goldberg and colleagues at the University of Oregon; they factor-analyzed all of Norman’s trait adjectives. In several different studies, it was found that across a variety of samples, there was very considerable consistency for a five-factor solution, even when different methods of item extraction, rotations, and factor numbers were used.

The 1990s saw work by John Digman and Oliver John, among others, continuing to support a fundamental role for the five-factor model. Paul Costa and Robert McCrae, authors of the NEO Personality Inventory, were probably the most prolific writers in the field at the time. However, many academic psychometricians felt uncomfortable with Costa and McCrae’s commercialization of this research. In one of those many ironies of the internet age, it was this discomfort that led to the development of a massively successful open-source item bank: the International Personality Item Pool. These freely available high-quality questionnaires, including many based on the five-factor model, were destined to change the perception of personality testing and to have a massive impact on the development of psychographic approaches to commercial online advertising.

Stability of the five-factor model

While considerable evidence has accumulated to show that the five-factor solution is more stable than any other number of factors, it should always be remembered that this is essentially a consensus. A substantial minority of studies, under different circumstances with different items and different populations, have found that other solutions can provide a better fit. It is therefore not the case that for every data set the five-factor solution will emerge. Rather, the five-factor solution is merely the most frequent and ubiquitous. Although it represents the highest level of agreement among experts, there remain a minority of situations that seem to be better suited to models with more or less than five factors. Similarly, the intrinsic nature of each of the five traits also represents a consensus. Again, there is wide agreement concerning the general area covered by each trait. But the specific names given to each trait, and the slant placed on them by the researcher, vary from study to study. Table 7.1 lists some of the names that have been suggested from time to time in the literature as epitomizing each of the five factors, either by the original author or by subsequent literature reviews by other personality psychologists. Hence, for many of the five factors, there is no common consensus concerning the meaning of the fundamental construct being measured.

Why are some of these factors so difficult to tie down? There are several reasons. The first arises from the grounding of the five-factor model in the statistical procedure of factor analysis. Despite the widespread use of factor analysis in psychometrics, there are important differences between the factor-analytic model and more traditional forms of item analysis. The classical personality test is based on classical test theory, derived from the idea that a score on a test represents the number of correct responses. It identifies a unique set of items that are used to construct each scale. Factor analysis, on the other hand, provides factor scores that contain, for each factor, a weighted contribution from each and every item in the questionnaire. While it could reasonably be argued that the factor scores are superior measures of each trait, there are good reasons why the classical model is normally preferred. In test construction, psychometricians generally use factor analysis as a tool to identify the best possible items for each scale. You could see this, perhaps, as giving a weight of 1 to the chosen items and 0 to the items not chosen for each scale. It is these scales, not the factor model, that will need to be defended in terms of standardization, reliability, validity, bias, and other psychometric properties. Factor score solutions, on the other hand, would have different weightings for each item in each different situation, and each would need to be defended separately.

Cross-cultural aspects of the five-factor model

Although the overall evidence in support of the five-factor structure is impressive, cross-cultural generalizations still require caution.


Table 7.1 Names of scales that have been associated with each of the factors in the five-factor model

Openness: Openness-to-experience, creativity, divergent thinking, understanding, change-sentience, autonomy, experience-seeking, flexibility, achievement-via-independence, artistic interests, private self-consciousness, culture, intellectual, cultured, polished, differentiated emotions, aesthetic sensitivity, need-for-variety, unconventional values, intuition, conformity (reversed), dogmatism (reversed), sensing (reversed)

Conscientiousness: Detail, attention to detail, achievement, order, endurance, persistence, competence, will-to-achieve, superego strength, locus of control, character, achievement-motive, dependability, conscientious, responsible, orderly, perceiving, tactical, judging (reversed), strategic (reversed)

Extraversion: Fellowship, extroversion, extrovert, extravert, surgency, talkative, assertive, energetic, impulsiveness, sociability, introversion (reversed)

Agreeableness: Abasement, nurturance, trust, self-monitoring, good-natured, cooperative, trustful, altruism, caring, emotional-support, friendly-compliance, friendliness, tender-mindedness, feeling, aggression (reversed), tough-mindedness (reversed), moving-against tendency (reversed), inhibition (reversed), narcissism (reversed), hostility (reversed), coronary proneness (reversed), Type A behavior (reversed), indifference-to-others (reversed), self-centredness (reversed), spitefulness (reversed), jealousy (reversed), authoritarianism (reversed), hostile non-compliance (reversed), antagonism (reversed), thinking (reversed)

Neuroticism: Emotion, anxiety, neurotic, emotional, emotional stability (reversed), calm (reversed), not neurotic (reversed), not easily upset (reversed), stress tolerance (reversed)

Many cross-cultural psychologists have argued that differences between cultures lead to differences in personality, as the influences on people in different cultures are different. Thus, while psychologists should explore the common characteristics of personality across cultures, they should also acknowledge the ways in which cultures may be unique. While Western personality inventories such as the 16PF and NEO have been translated into many languages and used throughout the world, cross-cultural studies of personality have shown cultural similarities and differences in the manifestation of personality traits. Bond (1997) carried out a study in China that, while replicating four factors of the Western Big Five model, failed to find an openness factor. Cheung et al. (2001), using the Chinese Personality Assessment Inventory (CPAI), tested participants from mainland China and Hong Kong and found the factor structures to be similar in two of the five dimensions. However, the Chinese-tradition scale in the CPAI was not related to any factor in the Big Five model, while openness failed to manifest as a trait in China. Hence, it does seem that the optimal number of personality factors might vary from culture to culture.

Another example is the Big Seven Chinese personality model developed by Wang and Cui (2005) and validated through lexical research on Chinese personality trait adjectives. This seven-factor model has also displayed significant differences from the five-factor model of Western personality structure. The Big Seven factors are extraversion, kindness, behavior style, talents, emotionality, human relations, and ways of life. Wang and Cui report a comparison of a Chinese version of the NEO, a test that measures the Big Five, and the QZPS (Big Seven Chinese Personality Scale) with a meta-analysis of the data from China. The results indicated that although the NEO and QZPS were interrelated, the dimensions did not show one-to-one correspondence, and the scores of the dimensions of the NEO could not accurately reflect the dimensions of Chinese personality.

Scale independence and the role of facets

Not all were convinced that five factors were sufficient to represent the full complexity of personality. Raymond Cattell, for example, continued to support the case for more extensive models, such as his 16PF. Others, such as Costa and McCrae, saw the potential in promoting questionnaires that, with more traits, appear to be more comprehensive, and suggested that each major trait may be subdivided into correlated facets. One argument against having many factors or facets concerns the large degree of association among the traits in these models, which creates a great deal of redundancy within the personality profile. If two traits are highly correlated, they are also conjoined. That is, when one is high, then so, invariably, is the other, and vice versa. We could obtain the same information more reliably by simply combining them. Hence, we may just as well have measured only one.

With independent scales, on the other hand, our ability to interpret profiles is maximized. For example, even Eysenck’s two-factor solution with only extraversion and neuroticism can produce four basic profiles: high extraversion/high neuroticism, high extraversion/low neuroticism, low extraversion/high neuroticism, and low extraversion/low neuroticism; these are consistent with the four traits identified in classical Greece: the choleric, phlegmatic, melancholic, and sanguine personalities, respectively. With three traits, we would have more possible combinations (eight, in fact, if we use all three dimensions). We can also generate further interpretations by taking the three combinations of two traits with the other held constant—producing 14 interpretations in all (the three traits alone, the three traits in combinations of two, and the eight ways in which all three can be combined). With five factors there are, in principle, 74 different profiles, each of which could receive an interpretation. Thus, if true independence among the primary scales can be attained, the five-factor model can provide more information than a questionnaire that claims to assess 70 oblique factors!

Challenges to scale construction for the five-factor model

But how independent are the traits based on the five-factor model? It is often assumed that because the traits are obtained from an orthogonal factor analysis, any scales constructed to measure the five factors will be independent of each other. This is not the case, and independence can only be achieved with considerable difficulty. A failure to recognize the difference between psychometric scales and factor scores underlies much of the confusion concerning the degree of independence of scales found in tests designed to assess the Big Five. The five-factor solution on which the Big Five model is based is indeed an orthogonal one—that is, it produces five independent factors. If factor scores, calculated from the factor loadings of all the items on each factor, were used in place of scales, they would necessarily be independent—if only for the population on which they were based. However, this is achieved by using each item five times; each will have a loading on each of the five factors, although the size of these loadings will depend on the nature of the item in question. Independence in this orthogonal solution is achieved by a very careful balance between the various aspects of all five scales. However, if only those items that had high loadings on each factor were selected for each scale, as is the normal procedure in test construction, the delicate balance achieved by counterweighting across the other scales would be lost, and the Big Five scales so derived would no longer be independent of one another. In fact, some of them would have a significant degree of intercorrelation. Attempts to redress this within the factor analysis itself are more likely than not to make matters worse. This is because the five-factor solution, by its very nature, balances the different aspects of each of its component traits to achieve independence. Simply because the factor analysis generates orthogonal factors, it does not follow that the primary traits suggested by each factor will be independent from each other when they are measured separately.
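This distinction between orthogonal factors and the unit-weighted scales built from them can be shown with a short simulation. The sketch below is illustrative only and does not reanalyze any real Big Five data, nor does it run a factor analysis; the loadings, cross-loadings, and sample size are invented assumptions. It simply simulates the mechanism described above: items generated from five independent latent factors, each with a modest loading on the other factors, produce noticeably intercorrelated sum scores when only the items keyed to each factor are summed with weights of 1 and 0.

```python
# Minimal simulation: independent latent factors vs. correlated unit-weighted
# scales. All loadings, cross-loadings, and sample sizes are invented.
import numpy as np

rng = np.random.default_rng(0)
n, n_factors, items_per_factor = 2000, 5, 6

# Five orthogonal (independent) latent factors.
F = rng.standard_normal((n, n_factors))

# Each item loads 0.70 on its own factor and 0.15 on every other factor.
loadings = np.full((n_factors * items_per_factor, n_factors), 0.15)
for j in range(n_factors):
    loadings[j * items_per_factor:(j + 1) * items_per_factor, j] = 0.70

items = F @ loadings.T + 0.6 * rng.standard_normal((n, n_factors * items_per_factor))

# Unit-weighted scales: sum only the items keyed to each factor (weights 1/0).
scales = np.column_stack(
    [items[:, j * items_per_factor:(j + 1) * items_per_factor].sum(axis=1)
     for j in range(n_factors)]
)

print(np.round(np.corrcoef(F, rowvar=False), 2))       # approximately an identity matrix
print(np.round(np.corrcoef(scales, rowvar=False), 2))  # sizeable off-diagonal correlations
```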

Impression management

Impression management (sometimes called social desirability bias) occurs when a respondent fails to resist the temptation to either fake their response or give themselves the benefit of the doubt if they are uncertain. Job applicants in occupational settings will often succumb to the temptation to be economical with the truth to some degree. This is not as reprehensible as it may sound, as all of us have at some time probably been encouraged by career counselors and others to “make the most of ourselves” when applying for a job. In clinical settings, patients who wish to see a doctor to obtain a sick-leave certificate or cash in on an insurance policy will often bias their responding in the other direction, and fake “bad.” The effect of impression management is ubiquitous, affecting responses to most of the items in one way or another. Consequently, it will always have some effect on the data set and will always influence the factor analysis. This phenomenon is well known to psychometricians, who are constantly struggling to eliminate or neutralize these effects—for instance, by constructing an independent social desirability scale. Unfortunately, it is impossible to do so completely, and consequently test users always need to be made aware of how impression management effects have influenced results.

Acquiescence

Another form of response bias is acquiescence: the tendency of some people to agree with every question or statement, and of others to disagree. This can affect every item in the questionnaire and will often emerge as the first factor in a factor analysis. Again, psychometricians have techniques for reducing its influence, usually by balancing the number of positively and negatively scored items for each trait being measured. However, for the factor analysis itself, both impression management and acquiescence will always be present to some extent.
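Balanced keying only protects a scale if the negatively keyed items are reverse-scored before they are summed. The sketch below is a minimal illustration with invented items and an assumed 1-5 response format; it is not taken from any particular questionnaire.

```python
# Minimal sketch of reverse-scoring negatively keyed items before summing a
# balanced scale. Item keys and responses are invented; a 1-5 format is assumed.
SCALE_MIN, SCALE_MAX = 1, 5

# +1: agreement indicates more of the trait; -1: the item is reverse-keyed.
item_keys = {"e1": +1, "e2": -1, "e3": +1, "e4": -1}

def scale_score(responses):
    total = 0
    for item, raw in responses.items():
        if item_keys[item] == -1:
            raw = (SCALE_MIN + SCALE_MAX) - raw   # e.g. 5 -> 1, 4 -> 2
        total += raw
    return total

acquiescent = {"e1": 5, "e2": 5, "e3": 5, "e4": 5}   # agrees with everything
print(scale_score(acquiescent))   # 12: yea-saying lands at the scale midpoint
                                  # rather than the maximum of 20
```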

Response bias and factor structure

The effects of response bias on factor structure are not consistent but vary from sample to sample. Thus, students filling out a questionnaire for a research project are less motivated to lie than are job applicants in a very competitive market. Some forms of bias will be affected by the nature of the job being applied for. Thus, those applying for junior positions would be likely to view different responses as being socially desirable than those looking for managerial responsibility. For example, applicants for sales positions are more likely to bias their responses toward extraversion. All forms of response bias affect factor structure, and consequently the nature of the factors themselves will vary depending on the nature of the data set and the respondents’ motives in agreeing to participate. This is another reason why it makes sense to treat the five-factor model as a consensus rather than a scientific discovery concerning the nature of human personality. It describes the most frequent solution under most circumstances. The loadings that occur form a complex family of related items, but one that is bound to be chaotic to some degree and cannot be completely determined.

Many of the existing questionnaires that measure the Big Five have quite substantial correlations between their scales. These correlations tend to be underreported, perhaps because the belief that the five-factor solution produces five independent scales (as opposed to five independent factors) is so widespread. While the reason for the intercorrelations between Big Five scales has been explained above, the existence of these intercorrelations is undesirable, as one of the main arguments put forward in favor of the five-trait model has been the independence of its domains. This is important, as there are many reasons to suppose that models with smaller numbers of factors are superior, from both a theoretical and a practical point of view, particularly when it comes to the explanation of trait combinations.

Development of the five OBPI personality scales

While the 1990s saw an increased unity of approach to personality testing in the workplace, there remained one difficulty: existing scales for the Big Five traits had largely been developed in clinical settings or with student samples. This was problematic. For example, the clinical trait label “neuroticism” is not suitable as a personality descriptor. No one at work wants to be described as neurotic. Similarly, the word “agreeable” implies that anyone with a low score is disagreeable—again, not really appropriate. And finally, openness to experience is of interest, but more to assess its opposite: conformity. Hence, this scale would do better if it were renamed and scored in the opposite direction. Part of the brief for the development of the OBPI was that it should be specifically about work, be piloted in the workplace, and use language appropriate for the workforce. To do this, the five factors were reconceptualized as a domain theory of personality, i.e., each of the five is unique to a particular psychological domain. The five domains are the social, organizational, intellectual, emotional, and perceptual—all of which are essential parts of our psychological life. Once we know a person’s position with respect to each of these five domains, we can begin to describe the functioning of their personality at work. Within the OBPI, the five factors are:

1. Fellowship, which assesses the Big Five trait of extraversion. In the social world, high fellowship scorers are generally happier when working with others or in a team. Low fellowship scorers generally prefer work that requires a degree of independence.
2. Authority, which assesses the Big Five trait of agreeableness, but in the reverse direction. In the organizational world, high authority scorers are tough-minded; they can make tough decisions. Low authority scorers are tender-minded and generally adopt a more cooperative approach.
3. Conformity, which assesses the Big Five trait of openness to experience, but in the reverse direction. In the intellectual world, high conformity scorers are likely to respect established values. Low conformity scorers tend to seek out alternative solutions to problems.
4. Emotion, which assesses the Big Five trait of neuroticism. In the emotional world, high emotion scorers, while of a nervous disposition, are likely to be sensitive to the feelings of others. Low emotion scorers are more able to perform under stressful conditions.
5. Detail, which assesses the Big Five trait of conscientiousness. In the perceptual world, high detail scorers excel at mundane tasks that require particular care. Low detail scorers have less patience for routine tasks and prefer to see the wider view.

The OBPI personality scales were constructed to keep the influence of social desirability to a minimum, and were balanced for positive and negative items to avoid acquiescence effects. Scale intercorrelations were also continuously monitored during construction, as were item/total correlations and alpha coefficients, to ensure both breadth and diversity in trait coverage. The five personality scales were modeled around the five-factor template, with the additional criteria that no two Big Five traits should have intercorrelations above .3 and that the correlation with a previously constructed interim impression management (lie) scale should be as low as possible. The OBPI personality scales have reliability between .73 and .81, and have been successfully validated against supervisors’ ratings for team skills; ability to work independently, make friends with colleagues, make tough decisions, or generate new ideas; obedience to company policy; level of self-confidence; tendency to worry; attention to detail; and breadth of vision.
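The item statistics referred to here are easy to compute directly. The sketch below is a generic illustration with simulated data rather than OBPI data: it computes Cronbach’s alpha and corrected item-total correlations for a single scale, two of the figures most commonly monitored during item selection.

```python
# Generic sketch of item statistics monitored during scale construction:
# Cronbach's alpha and corrected item-total correlations. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 8                                  # respondents, items in one scale
trait = rng.standard_normal(n)
items = 0.6 * trait[:, None] + 0.8 * rng.standard_normal((n, k))

def cronbach_alpha(x):
    k_items = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return (k_items / (k_items - 1)) * (1 - item_vars.sum() / total_var)

def corrected_item_total(x):
    # Correlate each item with the sum of the remaining items.
    return np.array([
        np.corrcoef(x[:, j], np.delete(x, j, axis=1).sum(axis=1))[0, 1]
        for j in range(x.shape[1])
    ])

print(round(cronbach_alpha(items), 2))
print(np.round(corrected_item_total(items), 2))
```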

Assessing counterproductive behavior at work

Not all assessment in the workplace concerns recruitment and staff development. In many cases, the aim is to weed out staff exhibiting disruptive behavior, for example by identifying rogue trading or other forms of insider threat. However, this is a purpose for which classical personality tests are not particularly effective. One aspect that almost all such tests, including those for the Big Five, have in common is that they are designed to focus on the positive. The caring professional would like you to focus on what you can do, not on what you cannot. This is particularly true among psychologists in occupational settings, who rightly believe in the overarching need to be as helpful to candidates and clients as they possibly can. Hence, many shortcomings are generally referred to euphemistically as training needs. How did such a situation arise?

The impact of behaviorism

One factor was the influence of behaviorism. Over 100 years ago, early behaviorists were adamant that psychology needed to break free from its prescientific foundations in moral philosophy. They asserted that ethics was a religious concern and had no place in a true science. Early psychometricians were particularly influenced by this view, and hence ethical words such as “good,” “bad,” and all their variants are noticeably absent from the “natural language descriptors” chosen by Cattell and Norman. Another aspect of this early behaviorist approach is its determinism. Without free will, we are not responsible for our actions. Hence it is not the business of a true science to attribute blame. It is within this context that respondents are told in many personality questionnaire preambles: “There are no right or wrong answers—just answer each question as honestly as you can.” The implicit message is: Whatever the outcome, it is not your fault; you may not have secured the job, but this is not because you gave a “wrong answer.” Just find another position to which you are more suited, or alternatively go and focus on your training needs.

However, it is not the case that most employers are only interested in what their staff can do. They would also like to know whether, if appointed, applicants are likely to be lazy, careless, unreliable, or dishonest, or to damage the firm in any way. Natural language contains many terms that indicate a lack of integrity and that might describe people with such potential shortcomings, but very few of these appear in traditional personality tests. What is the history of these words?

Prepsychological theories of integrity

Contemporary personality traits show an evolutionary path from the four humors: the melancholic, phlegmatic, sanguine, and choleric personalities defined by Galen in ancient Greece. Similarly, many aspects of integrity were discussed in works from the classical period. Probably the most influential of these was the Psychomachia of Prudentius. In the fourth century CE, Prudentius was a Roman civil servant in the city of Caesar Augusta, today Zaragoza in Spain. His model was later adapted by Christian theologians and became known as the seven vices and virtues (sometimes called the passions and the sentiments). In the 13th century CE, Giotto di Bondone, an Italian artist, portrayed the virtues and vices as Prudence/Folly, Fortitude/Inconstancy, Temperance/Anger, Justice/Injustice, Faith/Idolatry, Charity/Envy, and Hope/Despair. The Psychomachia also provided the inspiration for much of the folk psychology of the Middle Ages, including Dante Alighieri’s Divine Comedy and, in 17th-century England, John Bunyan’s Pilgrim’s Progress.

Prudentius saw human development as a lifelong striving for rationality. During this process, the individual must tackle various challenges. For example, greed gets you what you want yet is irrational, as a society in which personal gain is the only social motive would be untenable. Anger achieves immediate ends yet is irrational, as it allows emotion to overrule common sense. Despair enables one to cease striving yet is irrational, as it makes nonsense of human motivation. Indulgence obtains pleasure yet is irrational, as it prevents humankind from achieving its destiny. In the Psychomachia, these challenges are represented by warriors in a battle. The animal vices (the passions) are eventually overcome by the human virtues (the sentiments). The idea was not without influence on modern psychology. We can recognize its influence in the self-actualizing tendency, theories of multiple intelligences, cognitive therapy, and libido in psychoanalysis. The Prudentius model contrasts with that of Galen in that certain paths of action are recognized as desirable and others as undesirable. For Galen, if we fail to attend to detail it is because of our constitution; for Prudentius, it is because we are slovenly.

Modern integrity testing

By the beginning of the 1990s, counterproductive behavior at work was psychometrically assessed using integrity testing. There were two types. Overt or clear-purpose integrity tests were direct and to the point, containing questions such as “Have you ever taken drugs?” or “Do you have a criminal record?” Personality-based or disguised-purpose tests attempted to take a more roundabout approach, identifying ways
of eliciting support for undesirable behavior in a way that was not too obvious and hence less likely to encourage dishonest responding. Common constructs assessed by integrity tests were theft, tardiness, and drug abuse, but also included were lack of a sense of responsibility, poor moral reasoning or work ethic, disciplinary problems, proneness to violence, and absence of long-term job commitment. The range of severity of the traits was very broad, including not only the rather infrequent criminal activities of theft of money and major fraud but also activities such as “time theft.” The latter may involve absenteeism but may in some organizations amount to little more than too-frequent tea breaks. Most of the early integrity tests have been reviewed in the Mental Measurements Yearbooks, published online by the Buros Center for Testing. However, there was little agreement among the various publishers on either the behaviors assessed or a precise definition of integrity. The small amount of data they made available generally failed to convince the psychometric community, who felt the concept to be overly broad and ill defined, and concluded that there was insufficient evidence to reach clear conclusions regarding its value. General criticisms of integrity testing were made by Jane Loevinger and David Lykken in the 1990s, while the case in favor was made by Deniz Ones and her colleagues. The latter reported a series of meta-analytic studies that reviewed the evidence for its validity. On the basis of results from 650 criterion-related validity coefficients from over 500,000 subjects, they concluded not only that the evidence for the validity of integrity tests was substantial but also that the broad construct of integrity is probably as good a predictor of overall job performance as the five-factor model, or a better one. It was further argued that many of the existing integrity tests were to a large extent merely assessing the Big Five trait of conscientiousness. But then both the World Health Organization and the American Psychiatric Association revised their models of personality disorders, thus introducing a new element into the assessment of negative constructs.

Psychiatry and the medical model

Psychiatry and the legal profession have jointly had to tackle problems of responsibility, particularly in dealing with crime. Many people may break the law, but if they have a mental illness they may not know that what they did was wrong. This may be due to an intellectual disability, a delusion, or any number of other reasons. In these cases, it is often argued that they were not responsible for their actions. Any necessary restriction on their activities would then be not a punishment but a protection for both themselves and society. But there is no clear dividing line, and each case is different. For some crimes, a common concern in sentencing was whether the person felt remorse for what they did and intended to change their behavior in future. Hervey Cleckley (1941), in his book The Mask of Sanity, drew attention to a group of repeat offenders who habitually showed no remorse, had no intention of changing their behavior, and were returning to prison time and time again. He applied the term “psychopaths” to such people. Cleckley believed that while psychopaths could be superficially charming when they wanted to, they were constitutionally unable to feel remorse or shame, or to experience the emotions normally associated with ethical behavior. Many were likely to become recidivists, continually repeating the same or similar crimes and eventually becoming institutionalized in prisons, at great expense to the rest of society. Today, psychiatrists refer to this condition as antisocial personality disorder, one of several dysfunctional personality conditions that exist at the borderline between health and mental illness. Ten
personality disorders were defined in the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), published in 2013: paranoid, schizoid, schizotypal, antisocial, borderline, histrionic, narcissistic, avoidant, dependent, and obsessive-compulsive personality disorders. Many psychiatrists disagree with the identification of these conditions as illnesses at all. Rather, they are seen as different ways in which people with particular personalities can react when faced with challenging situations such as extreme stress, popularity, temptation, or even success. Candidates for jobs should not be discriminated against on the basis of any form of illness—mental or otherwise. Indeed, it is illegal to do so. But what might be of concern are personal vulnerabilities that affect performance in the workplace. Such vulnerabilities have been the concern of security services in addressing the perils of online recruitment into extremist organizations. They have also been something that banks and investment companies need to consider in attending to risk from rogue traders and other forms of insider threat. Additionally, there have been many reported instances of senior executives who have reached a tipping point and become not just impossible to manage but sometimes downright dangerous. Given the potential for damage, it is unsurprising that organizational psychologists have expressed considerable interest in the assessment of these behaviors.

The dysfunctional tendencies

Jean-Pierre Rolland in France and Robert Hogan in the US both focused their work on the development of questionnaires to assess these potentially disruptive behaviors. Rolland’s TD-12 (Inventaire des tendances dysfonctionnelles), today published by Pearson Assessment, describes the following 12 such traits:

1 Distrusting/Vigilant personality. High scorers generally exhibit a wary suspicion of others. They expect people to betray them, harm them, or take advantage of them, and often see such threats in insignificant events. When they feel that they have been humiliated, hurt, ignored, or looked down upon, they tend to hold grudges. They react to these perceived menaces or supposed attacks with anger, and counterattack promptly.

2 Distant/Introverted personality. High scorers are normally disengaged, impassive, and emotionally cold in interpersonal relationships. They appreciate solitude, do not seek the company of others, and do not easily form personal attachments or self-disclose. Generally, they are not interested in others’ feelings or do not even perceive them. Their range of emotional expressions and life experiences are both restricted.

3 Bizarre/Original personality. High scorers tend to be considered eccentrics. Their ideas, representations, and beliefs are often irrational, but can at the same time be extremely novel. Angels, fairies, ghosts, cults, and aliens often feature in their beliefs, although the purported actions and intentions of such entities often seem very odd. Their language, behavior, and dressing style are frequently bizarre and strange.

4 Impulsive/Nonconformist personality. High scorers tend not to respect other people’s rights and to transgress rules and conventions. They like to take risks and see where the limits can be pushed. They can be charming, manipulative, sly, and dishonest—using these skills to take advantage of others. They lie and have no qualms about not conforming to rules or conventions. They cannot make long-term plans.


5 Affective/Inconsistent personality. High scorers are emotionally changeable, ranging in mood from exaltation to depression or from satisfaction to dissatisfaction, and in attitude toward others from idealization to belittlement. Hence, they can be difficult to get on with and can be hard to please. They have marked impulsivity, with brief but intense enthusiasm for novel projects or new people, followed by rapid disappointment.

6 Theatrical/Expressive personality. People with high scores tend to be showy, theatrical, expressive, and colorful. They tend toward dramatization and excess to fulfill their need to be the center of attention. Their desire to be noticed may lead to excessive emotionality and exaggerated emotional expressions. These skills may frequently be used to manipulate the behavior of others, with no clear-cut aim in mind.

7 Egocentric/Self-Confident personality. High scorers are egocentric with unwarranted self-confidence, associated with an overall need for recognition and admiration. They feel exceptional and out of the ordinary, and hence deserving of special consideration and privileges. They lack empathy and react with frustration, irritation, and anger when they do not obtain privileges to which they believe they are entitled.

8 Cautious/Timid personality. High scorers have feelings of inadequacy and are uneasy in social situations that risk them being judged. They are hypersensitive to criticism and live with a constant fear of rejection. They will rarely be willing to take risks that might lead to their advancement or to their ideas being rejected. When suggestions are put to them, they will often clam up for fear of giving an answer that might be considered unacceptable.

9 Docile/Dependent personality. Those with high scores have an excessive need for supervision, approval, support, and advice from others. Their wish to please in order to obtain this support and advice often leads to submissive and “clingy” behavior. They find it difficult to make everyday decisions without advice or support, and have a hard time expressing disagreement, for fear of losing support or approval.

10 Meticulous/Perfectionist personality. High scorers are preoccupied with order, perfection, and control, to the extent of forgetting the point of any activity. They are perseverant, stubborn, and inflexible about rules and procedures. They find it difficult to delegate, due to fear of others not following procedures, but their exaggerated attention to detail makes it hard for them to make decisions themselves.

11 Susceptible/Independent personality. Those with high scores tend to have negative attitudes and passive resistance to demands formulated by others. They will act with irritation, resistance, and indirect passive hostility when asked to do something that they do not want to do, and will often criticize and passively resist authority figures. They believe they are misunderstood, insufficiently recognized, and mistreated.

12 Depressive/Pessimistic personality. High scorers have persistent and invading feelings of discouragement, gloominess, boredom, sadness, and bitterness. Even when opportunities occur, they will instead focus on what could go wrong and exaggerate the consequences. They generally feel inadequate, useless, and worthless without being able to explain why, and do not enjoy situations that most people consider pleasant.

Robert Hogan’s psychometric instrument also assesses subclinical features of the DSM-5 personality disorders.
First published in 1997, it was designed specifically to identify
potential derailers within an organization, focusing on what he called the “dark side” of personality. It became particularly popular in the assessment of staff at senior levels in organizations. In many cases, a moderate degree of the trait in question may be beneficial. Senior executives and traders need to have a high degree of, say, self-confidence to function effectively. But if they become overconfident, they may reach a tipping point. This may be brought on by extreme work pressure, unanticipated threats, or simply a string of random successes. Once this dark side begins to emerge, if it goes undetected it has the potential to cause an enormous amount of damage to the organization in terms of performance, workplace relationships, productivity, and reputation.

The dark triad

Much of the work on the dark side has focused on the three traits of psychopathy, narcissism, and Machiavellianism, now frequently referred to as the dark triad. Psychometric assessments of all three traits predate the DSM model. However, the last 15 years have seen a surge of interest in the potential of these traits to detect potentially disruptive behavior, particularly in the online environment. Two of the three are clearly associated with the egocentric/self-confident and the impulsive/nonconformist personalities described above, while the third, Machiavellianism, seems to combine the characteristics of several. This work is summarized by LeBreton, Shiverdecker, and Grimaldi (2018). The attempts to validate the many instruments available have produced mixed results. It has also been noted that scores on all three elements of the dark triad can be accurately predicted by existing Big Five scales, either alone or in combination. For example, psychopathy has a strong negative correlation with agreeableness, as does Machiavellianism. Narcissism generally correlates positively with extraversion and negatively with agreeableness.

Assessing integrity at work

The word “integrity” is derived from the word “integer,” meaning wholeness. Systems with integrity are balanced, in step, and working toward a common purpose. Signs of integrity—whether at the personal, organizational, or even national level—are reliability, dependability, openness, transparency, confidence, and optimism. Our assessment of integrity plays a key part in whom we trust, where we bank, what we buy, and how we vote. Integrity tests have been used by organizations to address common concerns about the personal integrity of potential appointees, such as:

• Failure to take sufficient care or follow safety instructions
• Self-centeredness leading to a likelihood of habitual lateness or absenteeism
• Aggressive tendencies leading to hostility, intimidation, or racist and sexist attitudes
• A sense of grievance that might lead to disciplinary problems
• Excessive pride, with disrespect for seniors or overbearing behavior toward juniors
• Greediness at such a level that they should not be trusted, particularly with money
• An unwillingness to cope with change, even when it becomes essential

These characteristics represent the seven traditional vices of old: sloth, indulgence, anger, envy, pride, greed, and despair. As with all integrity tests, we cannot expect respondents to be particularly honest when answering questions about these types of behavior directly. Indeed, it would be rather surprising if they did so if they genuinely wanted the
job for which they had applied. Giotto (Rust, 1999), an integrity test published by Pearson Assessment, addresses this challenge by using an ipsative approach, as discussed in Chapter 6. One additional advantage of the ipsative approach is that at the same time as dealing with social desirability issues, it ranks the traits being assessed in order of severity. In Giotto, the assumption is made that we are all prone to each of these vices to a greater or lesser degree; hence, rather than identifying problems of bad behavior per se, it merely establishes the bad behaviors to which we are most or least prone. None of us is perfect, and our vices are as significant as our virtues (sometimes even more so). Who wants an entrepreneur who never takes risks, someone who lacks the capacity to recognize the importance of work–life balance, a sergeant major who is unable to authoritatively instill discipline, a shop steward who can see no wrong in management, a CEO who always concedes to others’ opinions, or a law enforcer who makes up the rules as they go along? Because of its ipsativity, Giotto can respect the good as well as the bad in everyone.

The OBPI integrity scales

The OBPI integrity scales were also designed to bring the same insights to each of us by showing us how we are, rather than how we would like to be. The OBPI, however, is not ipsative, and hence had to depend on an intrinsic social desirability scale to weed out items that were particularly vulnerable to impression management. This was generally successful; the correlations between the OBPI integrity traits and social desirability, although generally above .3, were within a tolerable range for tests of this type. The exception was greed, the scale that correlated so highly with the social desirability scale that the two became essentially identical. Hence, unlike Giotto, the OBPI is not effective at identifying failures of trust. The equivalent scale instead assesses impression management, described as a low score on the trait of disclosure. The seven OBPI integrity traits are proficiency, work-orientation, patience, fair-mindedness, loyalty, disclosure, and initiative:

1 Proficiency assesses the degree of care that is likely to be taken in carrying out a task. It is of relevance to occupations where safety is of primary concern and in which mistakes can have particularly severe consequences. Those with lower scores may be careless and more likely to indulge their whims, but at the same time are willing to make mistakes and may indeed do so deliberately to learn from them.

2 Work-orientation assesses work ethic and is important in occupations where staff are of necessity required to work long hours or under duress. Low scorers may be more prone to occasionally being late for work or even to absenteeism when they believe the situation requires it, but they have a better understanding of the importance of work–life balance. In the greater scheme of things, they can get their priorities right.

3 Patience assesses the ability to control one’s own aggression, whether physical, verbal, or simply of attitude. It is of relevance to work environments where feelings can run high and where bullying has been a concern. Low scorers are more likely to thrive in hostile environments that require some form of competitive aggression to ensure that everyone complies with their instructions, so long as this is balanced and fair.

4 Fair-mindedness assesses fairness in judging the actions of others. It is of relevance to work environments that are intrinsically competitive, and where sound judgment is required. Wherever the workplace is beset with strife, low scorers may be more likely to make mistakes through allowing their emotions to cloud their judgment. However, they may also be able to see opportunities in situations that are about to spin out of control.

5 Loyalty assesses the sense of obedience to company policy, particularly important among junior members of the workforce in situations where following preset rules is essential. Low scorers are more suited to positions in which there is a need for independent leadership and where existing rules give insufficient clarity as to the best way forward. Their skills will be in demand in situations where the rules need to change.

6 Disclosure assesses the extent to which a person is likely to share their thoughts and genuine beliefs in an honest and open manner, particularly important when establishing a level of trust. Low scorers are prone to disguise their true thoughts and feelings, a requirement within many professions. This may extend to answers on this questionnaire, something to be considered when interpreting scores on other scales.

7 Initiative assesses a sense of purpose and a forward-looking approach. It is of relevance to organizational settings that require major change or are about to undergo restructuring. Low scorers value tradition, and they function best in organizations that are embedded in the status quo. They are naturally skeptical of demands for what they see as change for change’s sake, and have a high acceptance of long-standing traditions.

The interpretation of integrity traits is more challenging than it is for personality traits. Most important, however, is the need to take the work context into consideration. No one is perfect, so we need to look more at the balance between the seven traits rather than their absolute scores. In this context, it is a question of which traits are more and less important to the job in question. Most people are aware of their own weaknesses, at least to some extent, and feedback from a questionnaire—as well as a shared interpretation with a professional who is experienced in these matters—gives them the opportunity to put these in context and even turn them into strengths. The OBPI integrity scales all have reliabilities ranging between .70 and .76 and have been successfully validated against both supervisors’ ratings of relevant attributes and the seven equivalent scales in Giotto, with cross-correlations ranging from .40 for proficiency to .61 for initiative. The OBPI is available in English, Mandarin Chinese, Brazilian Portuguese, Bahasa Indonesian, and Turkish.

Conclusion

The assessment of both personality and integrity is a crucial part of organizational psychology today, playing a role in recruitment, staff development, and performance monitoring. Consequently, professional bodies worldwide have specified codes of conduct for their use. These include recommended procedures for informed consent, privacy, and provision of feedback on outcomes. Today the sharing of results with candidates has become accepted practice, despite some earlier misgivings concerning its appropriateness as far as integrity testing was concerned. However, there are clear distinctions between assessment and feedback when given at the individual level within an organization and when assessment is carried out en masse over the internet. The latter has become commonplace, as scores on any personality or integrity trait can easily be predicted from a person’s online footprint—whether this be social networks,
emails, websites, or any other trace that can be linked to them. Such data are gold dust to security services, the advertising industry, social strategists, and political campaigners; and following several major scandals, the development of databases of this type is now generally illegal in most countries. However, much more is at stake than merely personal privacy, and much current legislation is closing the stable door after the horse has bolted. Machine-learning algorithms already have enough information to build models enabling them to make behavioral predictions as soon as they have access to your cookies or other avatars that you present online. The messages, news feeds, and advertisements that you receive are no longer dependent on knowing who you are. This is a magical world for the tech giants, but not so for those trying to preserve traditions in policing, the media, and social democracy.

8

Employing digital footprints in psychometrics

Introduction

Decades of research and practical applications have shown that well-made tests and self-reported questionnaires can be reliable, practical, and accurate. They have been successfully applied across diverse contexts, ranging from recruitment and high-stakes educational assessments to clinical diagnosis. New methodologies such as computer adaptive testing (see Chapter 5) and item generators (see Chapter 9) are becoming more widespread, further driving the quality of such assessments.

At the same time, however, there are major flaws inherent to self-reported questions and test items. First is their momentary character and low ecological validity: assessments offer only a brief window into respondents’ opinions and performance, and they are often administered in an artificial environment, such as in an assessment center and under time pressure. During the brief interaction with a questionnaire or test, a respondent may be affected by factors such as the testing environment, stress, fatigue, or even the weather. Consequently, their scores reflect not only the traits being measured but also these external factors, decreasing the measurement’s validity and reliability.

Second, traditional assessments are limited to capturing respondents’ explicit, conscious, and motivated opinions and behaviors. Consequently, they are vulnerable to cheating and misrepresentation, particularly when much depends on the scores, such as in the context of recruitment or entrance exams. Misrepresentation is often unintentional, driven by a wide range of unconscious cognitive biases. Availability bias, for instance, leads to overestimating the frequency of thoughts or behaviors that are easily accessible in one’s memory. For example, after weeks spent preparing for a job interview, job candidates are likely to underestimate how social they normally are. Another common bias, the reference-group bias, describes the difficulty of comparing one’s trait levels with the average in a general population. Instead, we tend to compare ourselves with those around us, or a reference group. An extraverted actor, for instance, might genuinely believe themselves to be introverted if they are surrounded by even more extraverted peers. Consequently, even widely used and well-validated assessments are often relatively poor predictors of many basic real-life outcomes such as performance at work, well-being, or physical activity.

How can we solve these and other limitations of traditional assessments? One approach would be to replace tests and questionnaires—narrow snapshots of respondents’ self-reported behaviors—with long-term observations of actual behaviors, preferences, and performance in the natural environment. One could follow the respondents around for, let us say, a full year, meticulously recording all the times when they expressed
happiness or sadness, measuring the time spent socializing, or assessing their performance in real-life situations requiring mathematical skills. Such a measurement would be more reliable—and more predictive of future behavior and performance—than asking people to recall the frequencies of such behaviors or measuring their performance on made-up tasks. Why, then, are we still asking respondents to agree or disagree with items such as “I like going to parties,” rather than counting the number of times they actually go to a party or the percentage of the time that they decline a party invitation? Or why do we ask respondents to count apples and oranges in unrealistic math test tasks rather than record their performance at solving life’s numerical challenges? We do it because recording real-life behaviors over long time periods has been prohibitively difficult, expensive, and time-consuming. Not to mention that it used to be very challenging—if not impossible—to record respondents’ behaviors unobtrusively without altering them in the process. Would you be able to behave naturally if you were followed around by a psychometrician meticulously writing down everything that you do?

Our ongoing migration to the digital environment has opened up a myriad of ways in which our behaviors, preferences, and performance can be recorded in an unobtrusive, cheap, and convenient way. We are increasingly surrounded by digital products and services that mediate our activities, communication, and social interactions. Consequently, a growing fraction of human thoughts, behaviors, and preferences leave digital traces (or digital footprints) that can be relatively easily recorded, stored, and analyzed. Such footprints include web-browsing logs, records of transactions from online and off-line marketplaces, photos and videos, GPS location logs, media playlists, voice and video call logs, language used in tweets or emails, and much more.

Consider one of the most pervasive digital devices: smartphones. They are used for a broad range of activities such as communication, entertainment, and information-seeking. Many people begin and end their days with a smartphone-mediated activity, and spend many hours glued to their screens. Some keep using smartphones even when asleep—for instance, to monitor their sleeping patterns. Smartphones are packed with accurate sensors continuously tracking the behaviors and surroundings of their users. These include sensors aimed at physical movements and location, such as a GPS or accelerometer, and at audible (microphone) or visual (camera) stimuli. External sensors like smartwatches track physiological states such as pulse or body temperature. Furthermore, smartphone apps facilitate and track users’ communication (e.g., email and texting), socializing (e.g., social networks and dating apps), physical exercise (e.g., health and workout apps), purchases (e.g., shopping and banking apps), and even eating patterns (e.g., calorie counters). Increasingly, apps are used to mediate or support even the most personal aspects of our lives. Our emails, web searches, and online dating activities produce digital footprints describing our most personal and intimate behaviors, thoughts, and preferences.

The amount of data produced by the use of such devices is staggering. Back in 2012, IBM estimated that people produce 2.5 quintillion bytes (i.e., 2.5 billion gigabytes) of data every single day. That is about 350 megabytes for every living person on Earth.
To illustrate how much that is, imagine printing 2.5 quintillion bytes on A4-sized paper as zeroes and ones using 12-point type—perhaps to save it for future historians. The resulting stack of paper (recycled, hopefully) would have a height of about 400 million kilometers—nearly three times the distance between the Earth and the Sun! Moreover, the amount of data that we produce grows each year. By 2025, our daily data output is
estimated to be 200 times the amount it was in 2012—that is, over 62 gigabytes per person. Combine this with the ever-growing population, and we are on track to be producing 1 zettabyte every two days (a zettabyte is 10²¹ bytes)!

Digital footprints represent a great fraction of this data. Combined with ever-increasing computing power and modern statistical tools, such vast amounts of data are radically changing the potential of psychometrics. This chapter briefly introduces a few categories of digital footprints that are proving particularly useful.
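Before moving on, readers who want to verify the figures quoted above can reproduce them with a quick back-of-the-envelope calculation in Python. The world-population values used here are rough assumptions, and the 200-fold growth factor is simply the projection cited in the text.

    # Rough check of the data-volume figures quoted above.
    # Population values are approximate assumptions.
    BYTES_PER_DAY_2012 = 2.5e18      # 2.5 quintillion bytes per day (IBM estimate)
    POPULATION_2012 = 7.1e9
    POPULATION_2025 = 8.0e9

    per_person_2012 = BYTES_PER_DAY_2012 / POPULATION_2012
    print(f"2012: about {per_person_2012 / 1e6:.0f} MB per person per day")    # ~350 MB

    bytes_per_day_2025 = BYTES_PER_DAY_2012 * 200                              # assumed 200-fold growth
    per_person_2025 = bytes_per_day_2025 / POPULATION_2025
    print(f"2025: about {per_person_2025 / 1e9:.0f} GB per person per day")    # ~62 GB

    days_per_zettabyte = 1e21 / bytes_per_day_2025
    print(f"One zettabyte every {days_per_zettabyte:.0f} days")                # ~2 days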

Types of digital footprints

Usage logs

Many off-line and online behaviors are now enabled, mediated, or tracked by digital devices, producing an enormous volume of usage logs, including social media activity records, web-browsing and web-searching histories, multimedia playlists, bank statements, and more. They provide detailed, diary-like records documenting users’ behaviors in the online and off-line world. (A typical entry produced by a location-tracking app may look like: “John visited Palo Alto Whole Foods at 3:00 PM on May 8, 2020.”) Usage logs are widely used in predicting psychological traits and states. For example, one of the most generic types of usage logs—records of Facebook users’ Likes—has been successfully utilized to measure personality, political and religious views, well-being, and intelligence (Kosinski, Stillwell, & Graepel, 2013).
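As a minimal sketch of how such logs are typically prepared for analysis, the snippet below arranges made-up Like records into a sparse user-by-item matrix, the form of data used in the Like-based studies cited above. The user names and page labels are invented purely for illustration.

    # Turning raw usage logs (user, item) into a sparse user-by-item matrix.
    # The log entries below are invented for illustration.
    import numpy as np
    from scipy.sparse import csr_matrix

    log = [
        ("anna", "page:astronomy"),
        ("anna", "page:hiking"),
        ("ben", "page:astronomy"),
        ("ben", "page:poker"),
        ("carla", "page:hiking"),
    ]

    users = sorted({user for user, _ in log})
    items = sorted({item for _, item in log})
    user_index = {user: k for k, user in enumerate(users)}
    item_index = {item: k for k, item in enumerate(items)}

    rows = [user_index[user] for user, item in log]
    cols = [item_index[item] for user, item in log]
    values = np.ones(len(log))

    # One row per user, one column per Like; 1 where the user Liked the page.
    likes_matrix = csr_matrix((values, (rows, cols)), shape=(len(users), len(items)))
    print(likes_matrix.toarray())

Matrices of this kind, with millions of rows and columns, are then passed on to the dimensionality-reduction and prediction steps discussed later in the chapter.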
Language data
Starting with Freud, many psychologists have noted that an individual’s language use is strongly related to their psychological states and traits. Yet the use of language in psychological measurement has been historically limited by the difficulties posed by recording it in naturalistic settings and the lack of efficient tools well-suited to analyzing it. Progress in natural language processing and the availability of large samples of language have radically changed this situation. Most of the modern languages for statistical programming (e.g., Python, R, and MATLAB) offer a range of powerful yet easy-to-use tools aimed at analyzing language data. People across the globe share their thoughts on social networks, exchange emails, collaborate on text documents, and talk on the phone, creating an opportunity to generate samples of language that can be used in psychological measurement.
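As a minimal, hedged illustration of what such tools do, the Python sketch below turns two invented text samples into simple word-count features and a crude dictionary-based “social words” score, in the spirit of closed-vocabulary approaches such as LIWC. The texts and the word list are made up, and real language-based measures use far richer features.

    # Turning short text samples into simple word-frequency features.
    # Texts and the word list are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer

    texts = [
        "Had a great night out with friends, so much fun!",
        "Spent the evening alone reading; peaceful and quiet.",
    ]

    # Bag-of-words representation: one row per text, one column per word.
    vectorizer = CountVectorizer(lowercase=True)
    word_counts = vectorizer.fit_transform(texts)
    print(vectorizer.get_feature_names_out())
    print(word_counts.toarray())

    # A crude closed-vocabulary score: the proportion of "social" words.
    social_words = {"friends", "party", "night", "out", "fun"}
    for text in texts:
        tokens = [token.strip(",.;!?").lower() for token in text.split()]
        score = sum(token in social_words for token in tokens) / len(tokens)
        print(f"social-word proportion {score:.2f}: {text}")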
Mobile sensors
Mobile devices—such as smartwatches, fitness trackers, and smartphones—are packed with precise sensors that track their users’ behavior and surroundings. They record a wide range of behaviors including physical movement (accelerometers), mobility patterns (GPS sensors), social interactions (Bluetooth sensors), and events such as sneezing, throat-clearing, coughing, smoking, vacuuming, or brushing the teeth (microphones). Digital footprints generated by mobile sensors are increasingly used in the measurement of psychological states and traits, ranging from depression, schizophrenia, and bipolar disorder to mood, well-being, and cognitive ability.


Images and audiovisual data

Another widespread type of digitally mediated activity involves creating and sharing images, videos, and sound clips. The resulting digital footprints provide a rich source of data on human behavior, thoughts, and communication. Historically, the computational resources necessary to employ such data in measurement were beyond the reach of most individuals and organizations, but this is changing quickly. New computational tools can automatically extract information from such data, saving large amounts of time and effort. Speech-to-text technology can be used to extract text from audio files, visual object-recognition algorithms can detect and label objects within images, and widely available emotion-detection software (e.g., Face++, IBM Watson, and Microsoft Cognitive Services) can label people’s emotions based on their facial expressions or voice recordings. Those developments enable psychometricians to build measures employing this new and robust source of data.

Typical applications of digital footprints in psychometrics

Digital footprints are increasingly used in psychometrics. Here we list several areas where they are proving to be particularly useful.

Replacing and complementing traditional measures

Digital footprints are increasingly used to measure psychological constructs that were traditionally measured using tests and questionnaires. For example, studies have shown that personality tests employing Facebook Likes, Facebook status updates, or tweets have reliability and validity comparable with—or even superior to—well-established personality questionnaires. Academic and industry researchers around the world are rolling out tools employing those and other digital footprints to measure a wide range of traits and states, ranging from attitudes to intelligence to well-being. While traditional tests and questionnaires have many strengths and will likely maintain their popularity, new digital-footprint-based tools are increasingly used to complement them in some contexts and replace them in others.

New contexts and new constructs

Digital-footprint-based measures extend the applicability of psychometrics to contexts where traditional measures were impractical and to psychological constructs that were traditionally difficult to measure. Take the field of consumer behavior, for instance. While decades of research in this field have revealed that psychological traits and states are good predictors of consumer preferences, the practical applications of those findings were limited. Few consumers would be willing to invest time and effort into filling out a consumer-preferences questionnaire or a personality test upon entering a department store. (Not to mention that few would be willing to share their personality scores with a salesperson.) The migration to online retail spaces changed that. Such platforms commonly employ recommender systems crunching consumers’ digital footprints (e.g., their past purchases) to measure their product preferences and use those estimates to provide them with product recommendations. (Somewhat surprisingly, the same consumers who would be highly reluctant to share their questionnaire scores with traditional retailers
seem much less opposed to sharing their digital-footprint-based scores with digital platforms.) Digital-footprint-based measures are adopted in many other, previously untapped, contexts. Music and movie recommender systems, such as those powering Spotify and YouTube, use their users’ digital footprints to predict their music or movie tastes. Dating websites use similar engines to match people with each other. Facebook’s psychometric algorithms profile its users to predict which piece of content is most likely to keep them engaged and which ad they are most likely to click on.

Predicting future behavior

The psychometric tools discussed in the previous paragraph are so seamless and unobtrusive (especially when compared with traditional tests and questionnaires) that it is easy to miss how pervasive they have become. This is exacerbated by the fact that they are typically used differently from traditional measures: they usually consist of hundreds or thousands of scales, producing psychological profiles that are too complex to be interpreted by a human. Instead, they are typically employed in models aimed at predicting future behavior. While this approach may at first sound unfamiliar to traditional psychometricians, predicting future behavior is one of the main goals of psychometrics. Due to the complexity of such predictions, a traditional psychometric measure is first used to estimate respondents’ scores on a handful of easily interpretable latent dimensions (e.g., Big Five personality traits), and then such estimates are used to predict respondents’ future behaviors (e.g., probability of excelling at a given job). Such predictions are often made based on intuition or experience, rather than by statistical models, which greatly reduces their accuracy. This is exacerbated by the need to limit the number of traits considered, in order not to overwhelm the respondent (with the number of scales) and the psychometrician (with the complexity of the resulting psychological profile). As we argue in this chapter, those limitations do not apply to digital-footprint-based tools, which can be used to extract hundreds or thousands of dimensions from respondents’ digital footprints and employ them to directly predict future behavior.
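The contrast can be made concrete with a small simulation. The sketch below, which uses entirely synthetic data, compresses a wide binary footprint matrix into a hundred dimensions and predicts an outcome directly from them, evaluating the result by cross-validation. None of the numbers refer to a real data set or to any published model.

    # Synthetic illustration: predicting an outcome directly from a
    # high-dimensional footprint matrix, rather than via a handful of
    # interpretable trait scores. All data below are simulated.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    n_users, n_footprints = 2000, 5000

    # Binary matrix: which of 5,000 possible footprints each user produced.
    X = (rng.random((n_users, n_footprints)) < 0.02).astype(float)

    # Simulated outcome (say, a job-performance rating) driven by a small
    # subset of the footprints plus noise.
    weights = rng.normal(size=n_footprints) * (rng.random(n_footprints) < 0.01)
    y = X @ weights + rng.normal(scale=1.0, size=n_users)

    # Reduce thousands of footprint columns to 100 dimensions, then predict
    # the outcome directly from those dimensions.
    model = make_pipeline(TruncatedSVD(n_components=100, random_state=0),
                          Ridge(alpha=1.0))
    print("Cross-validated R^2:", cross_val_score(model, X, y, cv=5).mean().round(2))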
Studying human behavior
Digital footprints are used to facilitate the discovery of novel psychological constructs and mechanisms. Many of the well-known psychological traits were derived from studying behavioral footprints collected in a traditional way. The patterns discovered in school grade records, for instance, facilitated the discovery of the general intelligence factor (see Chapter 4), while samples of self-reported behaviors and preferences were instrumental in the development of the five-factor model of personality (see Chapter 7). Similarly, the application of modern computational methods to large samples of digital footprints could facilitate the discovery of novel psychological constructs that might not be apparent in smaller, traditional samples.

Supporting the development of traditional measures

Digital footprints can also be used to support the development of traditional questionnaires. Studying the links between digital footprints and conventional traits and states can provide the developers of conventional measures with ideas for new items. Studies of
Facebook Likes, for example, revealed that extraversion is positively correlated with Liking “theater,” “socializing,” and “cheerleading,” and negatively correlated with “playing video games” and “watching anime.” While some of these associations are unsurprising and have been employed in past personality questionnaires (e.g., “socializing”), other correlates of the introversion–extraversion scale (e.g., “playing video games”) could inspire items characterized by low face validity, useful in contexts where misrepresentation is an issue (e.g., recruitment). While respondents can relatively easily decode “I am skilled in handling social situations” as an item loading on the introversion–extraversion scale, it would be much less obvious to them that “I like video games” does so as well.
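A minimal sketch of the kind of analysis behind such findings is given below: it ranks individual Likes by their correlation with a questionnaire extraversion score. The labels and effect sizes are fabricated purely to show the mechanics and do not reproduce the published results.

    # Ranking individual footprints by their correlation with a trait score.
    # All data are simulated; labels are invented.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    likes = {                                   # binary columns: Liked or not
        "theater": rng.random(n) < 0.30,
        "socializing": rng.random(n) < 0.40,
        "playing video games": rng.random(n) < 0.35,
    }
    extraversion = rng.normal(size=n)           # simulated questionnaire scores
    # Inject illustrative associations into the simulated data.
    extraversion += 0.8 * likes["socializing"] - 0.6 * likes["playing video games"]

    for label, column in likes.items():
        r = np.corrcoef(column.astype(float), extraversion)[0, 1]
        print(f"{label:22s} r = {r:+.2f}")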

Advantages and challenges of employing digital footprints in psychometrics

Digital footprints are revolutionizing psychometrics, offering a broad range of advantages. However, they do not come without new challenges. In fact, often the very same qualities that make them so desirable in some contexts are their largest drawbacks in others. Thus, to take full advantage of the promise of digital footprints, one must consider—and address—their dark side as well.

High ecological validity

Digital footprints are usually generated in natural environments (such as online social networks) by people going about their business, rather than in the artificial setting of a laboratory. Moreover, such footprints can be (and typically are) recorded retrospectively: days, weeks, or even years after they were left behind. As a result, they often capture natural and spontaneous behaviors expressed in natural environments by people who, at the time when the data are recorded, are unaware that their behaviors are going to be analyzed. In other words, they have high ecological validity. This stands in stark contrast to conventional measures, which record respondents’ performance on artificial ability items, their (often biased and imperfect!) recollections of past behaviors, or (subjective and motivated) opinions. Moreover, their responses are often recorded in an unusual—and stressful—context, such as an assessment center, psychological lab, or psychologist’s office.

Let us illustrate this with an example. Consider a job candidate filling out a personality questionnaire while applying for a job. They are pondering how to respond to an extraversion item—“I make friends easily”—using a Likert scale ranging from “strongly disagree” to “strongly agree.” How does a person know whether they make friends easily? Some respondents may approach this by recalling times when they tried making new friends. But what if they have recently managed to inadvertently offend a stranger? Availability bias implies that such recent experiences will have an outsize influence on their response. Also, what does “easily” mean? Compared with whom? We all have reference-group bias: as we cannot know how good the average person is at making friends, we must compare ourselves with a salient reference group, such as our friends or coworkers. However, such a group could be far from representative, biasing our opinion on what “easily” means. Even a relatively socially apt person, for example, might think of themselves as average at making friends if they are surrounded by even more socially apt peers. Moreover, the candidate is likely wondering how to answer this question to maximize their chances of getting the
job—in other words, they are likely to misrepresent their true behaviors due to social desirability bias. They may have even been cheating by looking up the test on the internet beforehand and memorizing the responses that are most likely to get them hired. Finally, their behavior and recollections are, to some extent, altered by the unusual environment of the assessment center, recruiters’ attention, and other factors unrelated to their actual level of extraversion (e.g., reactivity).

Digital footprints’ high ecological validity and retrospective character render them relatively immune to those issues. As they represent the records of actual—rather than self-reported—behaviors, they are not limited by respondents’ attention span, memory, energy, motivation, or subjectivity. As they are recorded retrospectively, they are unlikely to be affected by testing situations. Also, they are less likely to be affected by cheating and misrepresentation. While it is relatively easy to strategically adjust one’s responses to a self-reported questionnaire, it is much more challenging (if at all possible) to consistently adjust one’s behavior over a time span of weeks, months, or years—not to mention that the respondent would have to anticipate that such adjustment is desirable and necessary a long time before the assessment takes place (as the data are often recorded retrospectively). In the example discussed above, an introverted job candidate could easily alter their questionnaire responses to get a high extraversion score. It would be much more challenging, however, to generate months’ or years’ worth of digital footprints indicating an extraverted lifestyle. Similarly, while sufficiently motivated and unscrupulous individuals may successfully manage to cheat on an ability test, it would be much more difficult to artificially elevate their performance in real-life tasks relying on a given ability over a long period of time.

Greater detail and longitude

Not only do digital footprints tend to be more ecologically valid, they are also more detailed and extend further into the past, facilitating the observation of changes in traits or behaviors over time: a stark contrast to a momentary interaction with a test or a questionnaire. Consider, for example, how difficult it would be to record the structure and change over time of an individual’s network of friends and acquaintances (i.e., their egocentric social network). Even the most motivated respondent could not accurately re-create their friendship or professional network, as such a network typically contains hundreds of actors and tens of thousands of interpersonal connections. However, a reliable image of an individual’s social network structure—and its evolution—can be easily and quickly extracted from their digital footprints left behind in digital spaces such as Facebook, LinkedIn, or instant messaging apps. Data obtained in this way would not be complete—no source of data can achieve that—but would most certainly be more complete than anything that a respondent could retrieve from their memory in the limited time span of a traditional assessment session.
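To make the idea tangible, the sketch below rebuilds a tiny egocentric network from a handful of invented messaging-log entries using the networkx library. Real applications would start from millions of such records, but the principle is the same.

    # Rebuilding an egocentric network from messaging logs.
    # The log entries and names are invented for illustration.
    import networkx as nx

    messages = [
        ("ego", "ana"), ("ego", "ben"), ("ego", "carl"),
        ("ana", "ben"), ("ben", "carl"), ("ego", "ana"),
    ]

    G = nx.Graph()
    for sender, recipient in messages:
        if G.has_edge(sender, recipient):
            G[sender][recipient]["weight"] += 1       # repeated contact = stronger tie
        else:
            G.add_edge(sender, recipient, weight=1)

    ego_net = nx.ego_graph(G, "ego")                  # the person plus their direct contacts
    print("Contacts:", sorted(n for n in ego_net if n != "ego"))
    print("Density of the ego network:", round(nx.density(ego_net), 2))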
Less control over the assessment environment
Digital footprints’ high ecological validity has serious downsides. Although real-world environments are excellent sources of ecologically valid data, they were not designed with the intention to generate data that can be conveniently used in assessment. One of the main issues stems from real-world environments’ constantly evolving functionality, which alters users’ behavior and the meaning of the digital footprints that they are leaving behind.


In Facebook’s early days, for example, the status update entry box was preceded by the “[User’s name] is” statement, forcing users to talk about themselves in the third person (“… is studying psychometrics”). In December 2007, the prompt was replaced with placeholder text asking, “What are you doing right now?” which disappeared once a user started typing. It was later replaced by “What’s on your mind?” These seemingly minor changes resulted in a major shift in the style of the status updates that users write. A psychometric measure employing Facebook status updates would need to be updated to reflect such a change.

Similar changes in the functioning of digital platforms and devices are very common, often occur unannounced, and are rarely as obvious as the example mentioned above. Moreover, they often stem from processes inherent to digital platforms’ functioning. Consider the following example: the Google search engine attempts to automatically complete users’ search queries as they are typing. The suggestions of this autocomplete function keep changing as the search engine evolves and search trends change. This, in turn, changes users’ behavior and the digital footprints that they leave behind.

These problems are not entirely new: traditional tests and questionnaires are also affected by changes in culture, language, and technology, and need to be occasionally updated. Not knowing that “postage stamp is to letter as sticker is to bumper” might have been a sign of low cognitive ability in the year 2000, but it soon might be merely a sign of being too young to have ever seen a postage stamp. Luckily, the development of digital-footprint-based measures could be largely automated, reducing the effort necessary to keep them up to date.

Greater speed and unobtrusiveness

Another advantage of digital-footprint-based assessments is that they are fast and unobtrusive to the respondents. Traditional assessment can be quite burdensome to respondents in terms of time, stress, and effort. For instance, one of the most popular personality inventories, the NEO-PI-R, consists of 240 questions and might take 40 minutes to fill out. Thus, obtaining personality scores for 3,000 respondents would take about 2,000 hours of their combined time, which is more than an average American spends at work in an entire year (and as much as an average German works in 18 months). Compare this with the few milliseconds that a computer algorithm needs to estimate a personality profile from retrospectively recorded digital footprints. Such an assessment could be simultaneously applied to millions of respondents, and their scores computed in mere seconds or minutes.

Less privacy and control

High ecological validity, unobtrusiveness, and relative immunity to cheating and misrepresentation are some of the largest advantages of digital-footprint-based measures. The same features, however, pose serious challenges to people’s privacy, as their digital footprints may be used without their knowledge and consent. A broad range of companies and institutions record and store digital footprints. While national laws normally require users to consent to this, in practice the average person does not have the time to read and understand the dense legalese in privacy policies. People’s web-browsing logs, social network activities, communication, and purchases are recorded by governments, internet service providers, web browsers, social network
platforms, online merchants, online marketing providers, credit-card companies, and a myriad of other institutions. Furthermore, many digital footprints are publicly available, such as blog posts, tweets, and LinkedIn profiles. Thus, digital-footprint-based tools may be used to measure intimate psychodemographic traits of large groups of people without their knowledge or consent.

Moreover, even those who opt in to have their digital footprints used in assessment are facing risks. One of the downsides of traditional assessment—and one of its strengths in the context of protecting respondents’ privacy—is the control that respondents have over what information they reveal, by adjusting their responses or skipping questions entirely. One of the upsides of digital-footprint-based tests—and their weakness in the context of protecting respondents’ privacy—is that respondents have little control over what information they reveal. Moreover, a typical source of a digital footprint contains enough data to prevent respondents from manually reviewing it prior to submission. Imagine, for example, reviewing years’ worth of tweets, Facebook status updates, or credit-card purchases before submitting them to estimate your scores. This would be much more cumbersome than taking even the longest of the traditional questionnaires! Additionally, it is often difficult to anticipate what a particular footprint reveals about you. While it is relatively easy to see that Liking “dancing” or “acting” on Facebook reveals that you are an extravert, would you be able to anticipate that the same pertains to liking “Michael Jordan”?

Yet another issue stems from blurry boundaries between digital footprints belonging to a respondent and to other people. Do others’ comments on a respondent’s profile belong to the respondent, and can they be used in assessment? What if the respondent quoted some of those comments in their own writing? What about digital footprints revealing that the respondent knows, dates, or works with another individual? Can such information be used in assessment, without the other person’s consent? Those and other similar questions are neither trivial nor easy to answer.

These risks must be carefully managed. Respondents (and policy makers) are fully justified in being anxious about digital-footprint-based assessment tools and their implications for privacy. One approach that could alleviate these risks is to offer respondents more control over the assessment process. As they often cannot be expected to review the data entered into the model, they might be offered a chance to receive their scores and feedback before they decide whether to share the results with whoever is conducting the assessment (e.g., a recruiter or a psychologist). Additionally, respondents’ consent should be sought for storing their digital footprints and reusing them in the future (e.g., for calibrating the models). As digital-footprint-based models could be misused more easily than traditional ones, with the potential damage being far greater, test authors and publishers must closely monitor the ways in which their tools are used.

No anonymity

Another challenge pertains to protecting respondents’ identity. It is difficult if not impossible to anonymize digital-footprint samples; in fact, the growing availability of public databases and search engines makes it difficult to anonymize even small traditional samples. Research has shown, for example, that date of birth, gender, and ZIP code allow identification of nearly 90% of the US population when matched with large publicly available data sets. The problem is exacerbated in the context of digital footprints—such as bank records, Facebook Likes, or tweets—that are virtually guaranteed to be entirely unique,
rendering them personally identifiable. It turns out that merely four data points suffice to uniquely identify 90% of individuals in three months’ worth of credit-card records. In the context of many footprints, a quick Google search suffices to de-anonymize a respondent: for instance, take a person whose tweet contains: “The new edition of Modern Psychometrics appeared on Amazon today. Thank you, John and David!” There are methods that can help in decreasing the chances of de-anonymizing data, such as spreading it across multiple data sets or adding noise to it. However, one should assume that with access to respondents’ digital footprints and enough time and effort, it is likely that some or all of them can be identified. Thus, it is of paramount importance to protect respondents’ data, even if it is anonymized, or avoid storing it in the first place.

Moreover, somewhat counterintuitively, the speed, unobtrusiveness, and accuracy of digital-footprint-based measures offer a chance to reduce the need and incentives for invading people’s privacy. If a person’s traits can be promptly measured at the time when they are needed, there is no need to try to identify them (to link them with their previously extracted scores). Consider the example of newspaper websites, such as The New York Times or The Guardian. A major source of their income comes from showing ads to their readers, and ads customized to fit a reader’s psychological traits are most profitable. To achieve that, the newspapers attempt to identify their readers and match them with data purchased from data brokers, including mailing address, gender, age, household income, current location, employment and educational history, Likes, comments, shares, reposts, search and browsing history, and much more. This, obviously, is “information about you that can personally identify you,” as The New York Times describes it in their privacy policy. If, however, The Guardian could accurately and consensually measure visitors’ traits based merely on the footprints they left behind while interacting with The Guardian’s website, they would have fewer incentives to deprive such visitors of their anonymity.

Bias

Very much like traditional forms of assessment, digital-footprint-based measures are subject to bias. The training samples used to develop such measures are generated by humans who have prejudices and limited self-knowledge and who occupy environments that are far from fair. Given that digital-footprint-based measures are being used in the service of increasingly consequential ends, such as determining whether to grant a defendant bail, it is imperative that their authors limit the extent to which they perpetuate human biases. The algorithms and equations underlying psychometric measures are not offended by accusations of bias and do not resist change. There are a number of approaches that can be used to estimate measurement bias and reduce it (see Chapter 3), and many can be applied to digital-footprint measures as well. Moreover, while striving to build measures that are as free from biases as possible, we should not forget that even a biased psychometric measure is often fairer than the human decision-making process that it is designed to aid or replace. A growing body of research indicates that complementing or replacing human judgments with psychometric measures offers the promise of reducing discrimination. Kleinberg, Lakkaraju, Leskovec, Ludwig, and Mullainathan (2018) examined judges’ decisions of whether a defendant would await trial at home or in jail in a sample of 758,027 defendants arrested in New York City. This decision is consequential to both the defendant and society; while a case may take several months to resolve, some defendants may fail to reappear in court or

Digital footprints in psychometrics

139

commit more crimes while awaiting trial. The results showed that replacing judges’ decisions with an algorithmic one would result in the reduction of crime rates by up to a quarter with no change in jailing rates and a significant reduction in racial bias. Enrichment of existing constructs

Digital-footprint-based prediction models could also help to operationalize psychological constructs in a more robust way. Consider the following example: psychologists have long argued that gender is not a dichotomous but rather a continuous variable. Identifying as male or female is just a proxy for someone's location on the masculinity–femininity scale: some men are more feminine than some women, and vice versa. Questionnaire measures of masculinity–femininity exist, but they are cumbersome and thus rarely used. Instead, gender is usually measured using a dichotomous scale, which limits our ability to study gender and promotes its dichotomous operationalization. This limitation can be circumvented by employing digital-footprint-based models. Gender-prediction models do not simply categorize people as male or female but produce a masculinity–femininity score instead. Such scores can be used to categorize people into dichotomous gender categories, but could also be employed directly as a continuous masculinity–femininity scale. A similar approach may be applied to other dichotomous or categorical variables, such as ethnicity, political views, or religiosity.

Developing digital-footprint-based psychometric measures

Let us now introduce the typical steps involved in developing psychometric tools based on digital footprints. First, we will discuss the issues pertaining to collecting digital footprints and preparing them for analysis. Second, we will focus on exploring the patterns of behaviors encoded in large samples of digital footprints using dimensionality-reduction techniques. Finally, we will turn our attention to building predictive models employing digital footprints as an input.

Collecting digital footprints

There is a large variety of digital footprints that could be employed in assessment. These include records of language, web browsing and web searching, purchase and financial-transaction records, social networking data, geographical location, smartphone logs, and more. The applicability of digital footprints may vary depending on the outcomes that are to be predicted and the target population to which the tool is to be applied. Footprints generated on a social networking platform, for example, might be particularly informative about respondents' personality, attitudes, and values, as those traits heavily influence behavior in a social media environment. Similarly, footprints generated while playing online games that depend heavily on cognitive ability, such as chess or Minecraft, are likely to be revealing of respondents' cognitive abilities and are particularly useful in the assessment of younger people who are more likely to play such games.

Collecting data is usually a costly and time-consuming affair. Thus, before embarking on a data-collection spree, it is worth checking if the desired data have already been collected by someone else. Many organizations collect digital footprints of their employees or customers, along with consent to use such data in research. There is a growing number of publicly available data sets containing digital footprints and outcome variables for thousands or millions of anonymous people. These and other data sources could be used to build and test the performance of both prototypes and fully fledged psychometric tools.

If the desired data are not readily available, one can purchase them from data brokers or collect them directly from individuals. Digital platforms often enable their users to export their data, and those users may be willing to either volunteer their data for your project or exchange them for money or other incentives. Note, however, that financial incentives, often used in traditional research, might be too expensive in the context of the huge samples typically required in projects employing digital footprints. Moreover, financial incentives do not incentivize people to respond honestly or behave naturally, merely to participate in the study. They may thus encourage dishonest or random responding and attract professional participants.

Rewarding respondents with an enjoyable experience or interesting feedback can achieve a much better alignment of respondent and researcher interests. In one of our past projects, for example, we published a personality test online and offered respondents extensive feedback on their scores. The respondents could opt in to donate their responses and digital footprints exported from Facebook to our research, and about a third of them generously decided to do so. The use of feedback, rather than financial rewards, incentivized respondents to answer honestly—as otherwise their time spent filling out the questionnaires would be wasted. The resulting data were of a very high quality according to a wide range of psychometric indicators (e.g., test–retest reliability and predictive validity). Moreover, we could afford to collect data from a huge number of participants (over two million), which would not have been possible if we had offered even a modest financial reward.

Additionally, motivating respondents with engaging feedback (or other nonfinancial incentives) enables researchers to promote their data-collection project using a viral approach (or snowball sampling, as it is known among social scientists): encouraging respondents to recruit others to participate. If the participation is engaging enough, the number of respondents may grow very quickly, providing the researchers with much data to build the digital-footprint-based tool. However, it must be noted that while the viral approach is relatively inexpensive and can produce huge data sets, it may introduce biases to the data. The first respondents to join the project and invite others are likely to disproportionately affect the composition of the sample, since people tend to be friends with similar others. Furthermore, popular people are more likely to be recruited into the sample, as they are the most widely connected. Thus, researchers employing this approach should carefully examine the representativeness of their sample and correct it by weighting and other statistical techniques.

Another approach that can be used to decrease the sample bias, or to recruit rare respondents, is to use online advertising platforms, such as the ones offered by Google and Facebook. Such platforms can be used to attract respondents based on a wide range of preferences (such as Liking "getting up early in the morning" on Facebook), behaviors (e.g., searching for "How to self-diagnose depression"), and demographic variables including location, education, language, political views, ethnicity, and income.
This approach can be used to obtain representative samples or to reach respondents typically underrepresented in social science research, such as those stigmatized in the off-line world or those who are hesitant to meet researchers face-to-face.

Importantly, while collecting data you need to consider respondents' privacy. While many respondents may be comfortable with volunteering their public tweets or playlists to be used in measuring their psychological traits, fewer would be comfortable with sharing their bank records. Moreover, few test developers and administrators would be willing or legally able to receive, analyze, and store such sensitive data, even if respondents were willing to share it. As is the case in many other research contexts, the fact that you can obtain the data does not imply that it is legal or ethically correct to use it for research or commercial purposes.

How much data is needed?

As is usually the case in data-driven projects, the more data one can obtain, the better. However, as collecting data is usually time-consuming, challenging, and expensive, it would be good to know how much data is necessary. The amount of data needed depends on several factors, including the quality of the data, how revealing a given digital footprint is of a given outcome variable, the desired accuracy of the measure, and many others. Moreover, as we discuss in more detail later, a substantial fraction of the collected data may be of little use and will have to be discarded. Consequently, estimating a priori the exact amount of data needed is often difficult, if not impossible. Instead, one can extrapolate from similar past projects (many are described in detail in the academic literature) or estimate empirically by collecting some data (i.e., a pilot sample), designing and testing a prototype of the measure, and deciding how much more data is needed to reach a desired level of quality. In our experience, data sets used in developing digital-footprint-based psychometric measures commonly contain tens or hundreds of thousands of respondents.

Preparing digital footprints for analysis

Once you have collected all (or some) of the digital footprints, you can start preparing them for analysis.

Respondent-footprint matrix

Most types of digital footprints (such as text, web-browsing logs, purchase records, playlists from online radios, or Facebook Likes) can be conveniently represented as a respondent-footprint matrix. A hypothetical respondent-footprint matrix, representing websites visited by several respondents, is displayed in Figure 8.1. The rows of this matrix represent respondents, columns represent websites, and cells record the number of times a given respondent visited a given website. Jason, for example, visited Google.com 13 times and Facebook.com nine times.

Cells of the respondent-footprint matrix may represent a variety of respondent-footprint associations. They often contain frequencies, such as the number of times a given website was visited or a given word was used in tweets or emails. Sometimes, the associations are dichotomous. For instance, as Facebook respondents can Like a given object only once, the cells of the respondent-Like matrix can take only two values: 1 if a given respondent liked a given object and 0 otherwise. In other contexts, one could use the cells to represent other values, such as the total (or average) time spent on a given website or the total amount of money spent on a given category of products.


Figure 8.1 A hypothetical respondent-footprint matrix representing the frequencies of website visits and its trimmed version (see text for details). Cells represent the number of times a given respondent visited a given website. Shading based on the frequency was added to enhance readability. Zeros were removed for clarity.

Data sparsity

A common phenomenon in digital-footprint samples is that a large fraction of both footprints and respondents appear only once or a few times in the data. Let us illustrate this with an example of web-browsing data. Even a keen internet user, for instance, can only visit a small fraction of all websites. Similarly, while a few of the most popular websites (such as Google.com or Facebook.com) may have been visited by a significant fraction of internet users, a large fraction of websites would have been visited by only one or a few people. As a result, the respondent-footprint matrices tend to be sparse, or mostly empty (i.e., most of the cells have a value of 0). As respondent-footprint matrices are also often extremely large, it is recommended to store them in a sparse format that saves computer memory by retaining only nonzero values. Most of the modern data-analysis tools—such as R, MATLAB, and Python—allow for the construction of sparse matrices.

To illustrate the benefits of storing data in a sparse matrix format, consider a data set containing the visits of 100,000 respondents to two million unique websites. The resulting respondent-footprint matrix is huge: it has 100,000 rows (representing the respondents), two million columns (representing the websites), and 200 billion (100,000 × 2,000,000) cells. Let us assume that an average respondent visited 400 unique websites. In such a case, only 40 million (100,000 × 400), or 0.02%, of the 200 billion cells contain nonzero values. Representing this matrix in a nonsparse format (i.e., storing all the values, including the 0s) would require a whopping 1.5 terabytes of memory, which—at the time of writing—was available only on computers used by spies and rocket scientists. The same matrix in a sparse format would take up a mere 500 megabytes of memory, an amount easily handled by even the most modest of modern computers.
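To make this concrete, here is a minimal sketch in Python (one of the languages mentioned above) of building a respondent-footprint count matrix directly in a sparse format; the visit log, respondent names, and websites are invented for illustration, and SciPy's sparse matrices are only one possible implementation.

import numpy as np
from scipy import sparse

# Hypothetical raw log: one (respondent, website) pair per recorded visit.
visits = [
    ("Jason", "Google.com"), ("Jason", "Google.com"), ("Jason", "Facebook.com"),
    ("Susan", "Etsy.com"), ("Susan", "Deviantart.com"), ("David", "Etsy.com"),
]

respondents = sorted({r for r, _ in visits})
websites = sorted({w for _, w in visits})
row_index = {r: i for i, r in enumerate(respondents)}
col_index = {w: j for j, w in enumerate(websites)}

rows = [row_index[r] for r, _ in visits]
cols = [col_index[w] for _, w in visits]
counts = np.ones(len(visits))

# Duplicate (row, column) pairs are summed when the COO matrix is converted
# to CSR, which turns the raw log into visit counts per respondent and website.
X = sparse.coo_matrix((counts, (rows, cols)),
                      shape=(len(respondents), len(websites))).tocsr()

print(X.nnz, "nonzero cells out of", X.shape[0] * X.shape[1])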


Moreover, as respondents and footprints that appear once or a few times in the data set are typically of low value in subsequent analyses, they can be removed to reduce the size of the data set and the time required to conduct statistical analysis. This is relatively straightforward and can be achieved by discarding rows and columns of the respondent-footprint matrix with fewer nonzero entries than a chosen threshold. Just remember that removing respondents can push some footprints below the threshold (and vice versa), so this process should be repeated until all respondents and footprints in the matrix are above the corresponding thresholds. Consider the respondent-footprint matrix presented in Figure 8.1. Let us assume that each website needs to be visited by at least two unique respondents, and each respondent needs to visit at least two unique websites to be retained. Initially, only Sara falls below this threshold, as she has visited only one website. Removing Sara, however, pushes Vimeo.com below the threshold (as it now has only one unique visitor: Khalifa). Removing Vimeo.com, in turn, would result in Khalifa falling below the threshold and being removed.

How does one select the minimum frequencies below which a given footprint or a given respondent should be removed? Set the thresholds too high and you may start removing records that would be useful in the analysis. Set them too low, and you may greatly increase the time and computational resources necessary to conduct the analyses. As in the context of other decisions that you may need to make in the process, a data-driven approach is advisable: conduct planned analyses while changing the threshold, and measure the changes in the quality of the outcomes and computational time. Start with a high threshold (e.g., one that removes a large fraction of the sample), and iteratively reduce it until the accuracy of the model (or other indicators of its quality) ceases to improve significantly, or until the computational resources required to conduct the analyses become excessive.
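A minimal sketch of this iterative trimming, assuming X is a sparse respondent-footprint matrix as in the previous sketch; the thresholds of two nonzero entries per row and column mirror the toy example above and are not recommendations.

import numpy as np

def trim(X, min_row_nnz=2, min_col_nnz=2):
    # Iteratively drop respondents (rows) and footprints (columns) with too few
    # nonzero entries; removing one can push others below the threshold, so the
    # process repeats until every remaining row and column passes.
    X = X.tocsr()
    row_keep = np.arange(X.shape[0])
    col_keep = np.arange(X.shape[1])
    while True:
        rows_ok = X.getnnz(axis=1) >= min_row_nnz
        cols_ok = X.getnnz(axis=0) >= min_col_nnz
        if rows_ok.all() and cols_ok.all():
            return X, row_keep, col_keep
        X = X[rows_ok][:, cols_ok]
        row_keep = row_keep[rows_ok]
        col_keep = col_keep[cols_ok]

X_trimmed, kept_respondents, kept_websites = trim(X)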

Reducing the dimensionality of the respondent-footprint matrix

Now that we have constructed the respondent-footprint matrix, we can proceed to the next step: reducing its dimensionality (see also Chapters 4 and 7). Dimensionality reduction is usually unsupervised, which means that the dimensionality-reduction algorithms are not given information about the outcome variables (e.g., personality), but rely solely on the patterns discovered within the respondent-footprint matrix.

Examination of the respondent-footprint matrix presented in Figure 8.1 reveals several patterns in respondents' browsing behavior. Respondents who frequently visit Google tend to visit Facebook as well. The same applies to those interested in art-related websites (Etsy.com and Deviantart.com) and movies (IMDB.com and Rottentomatoes.com). Similar behavioral patterns occur in real-life respondent-footprint matrices. This is because human behaviors and the digital footprints that they leave behind are not random but form patterns. An individual expressing behavior linked with high extraversion is likely to express other such behaviors (and be extraverted). An individual expressing some symptoms of depression is likely to express other symptoms (and be depressed). The existence of such patterns means that one can reduce the dimensionality of the respondent-footprint matrix, or subsume it using a smaller number of dimensions (in the same way as the dimensionality of individuals' responses to personality questions was reduced in Chapters 4 and 7).

Dimensionality reduction is the end goal of many projects employing digital footprints. Exploring dimensions emerging from digital-footprint data sets can facilitate the discovery of novel psychological constructs and mechanisms (as we wrote earlier, dimensionality reduction of traditional data sets led to the discovery of constructs such as general intelligence and the five-factor model of personality). For example, studying dimensions extracted from the footprints generated by the users of a music-streaming platform could boost our understanding of music preferences. Moreover, even if the resulting dimensions are not interpretable or not novel, the ability to measure them can be highly beneficial. For instance, even if the dimensions underlying footprints from the music-streaming platform do not offer novel insights, they can be used to recommend songs and provide interesting feedback to the users.

There are many methods that can be used to reduce data's dimensionality. In Chapters 4 and 7, we discussed several factor-analytical approaches typically applied to data collected through questionnaires. Here we focus on two methods commonly used in the context of big data sets: singular value decomposition (SVD), representing eigendecomposition-based methods, and latent Dirichlet allocation (LDA), representing cluster-analytical approaches. In the following sections, we take a closer look at these two approaches.

Singular value decomposition

SVD is a popular dimensionality-reduction technique similar to principal component analysis (PCA), a mathematical technique widely used in psychometrics and the social sciences, and similar in many ways to factor analysis. SVD performed on a centered matrix is equivalent to PCA, and thus PCA can be considered a special case of SVD. SVD is more computationally efficient, as it does not require multiplying a matrix by its transpose—a computationally costly procedure in the context of huge matrices. As SVD is fast and computationally efficient, it is widely employed with large data sets in a wide variety of fields, spanning computational social sciences, machine learning, signal processing, natural language processing, and computer vision. Popular languages for statistical programming (e.g., Python, R, and MATLAB) provide off-the-shelf functions that allow for reducing matrix dimensionality using SVD. To preserve computational resources, make sure to use a sparse SVD function, or a function that can take a sparse matrix as an input without converting it into a nonsparse format.

SVD decomposes a matrix into three matrices (U, V, and Σ) exposing its underlying structure. Matrices U and V contain singular vectors subsuming the patterns present in the original matrix. Diagonal matrix Σ contains singular values representing the importance of each of the singular vectors. (A diagonal matrix is a matrix where only the diagonal cells are filled with values.) The product of U, Σ, and transposed V (U Σ Vᵀ) is equal to the original matrix. The first singular vector subsumes the most prominent pattern in the matrix, and the subsequent vectors represent patterns of decreasing importance. Thus, the dimensionality of the matrix can be reduced by discarding some of the less important singular vectors. The product of the resulting trimmed matrices U, Σ, and Vᵀ does not represent the original matrix exactly, but provides its approximation.

It is a common practice to center the data (i.e., to decrease the entries in matrix columns by column means) before conducting SVD. Otherwise, the first SVD vector correlates strongly with the frequencies of the objects in rows and columns. However, in the context of large matrices, centering is typically impossible because it does not preserve matrix sparsity (most of the 0s, skipped in sparse matrices, would become nonzero values).
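A minimal sketch of a sparse SVD, assuming X is the (trimmed) sparse respondent-footprint matrix from the earlier sketches; scikit-learn's TruncatedSVD is only one of several sparse-friendly implementations and, like the approach described above, does not center the data.

from sklearn.decomposition import TruncatedSVD

k = 3  # number of singular vectors to retain (illustrative)
svd = TruncatedSVD(n_components=k, random_state=0)

# Rows of U_scores: respondents' scores on the k retained singular vectors.
U_scores = svd.fit_transform(X)

# Rows of V_scores: websites' (footprints') scores on the same vectors.
V_scores = svd.components_.T

# Proportion of variance captured by each retained vector, useful for the
# scree-plot-style decision discussed in the next subsection.
print(svd.explained_variance_ratio_)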

SELECTING THE NUMBER OF SINGULAR VECTORS TO RETAIN

One of the main considerations when conducting dimensionality reduction is choosing the number of dimensions to retain (see Chapter 4). This choice is not trivial, and the ideal number of dimensions depends not only on the properties of data stored in a given matrix but also on the intended application. A small number of dimensions is easier to interpret and visualize; thus, studies aimed at exploring and interpreting the structure of data usually rely on few dimensions. Retaining a larger number of dimensions preserves more information from the original matrix, which is helpful in building predictive models. Retain too many, however, and the benefits of dimensionality reduction discussed earlier are lost, and the prediction accuracy may even decrease (e.g., due to overfitting).

When selecting the number of dimensions to retain, it is useful to consult the amount of variance in the original matrix explained by each consecutive singular vector. (The squares of the singular values—stored in matrix Σ—are proportional to the variance of the original matrix explained by a given singular vector.) Figure 8.2 presents the variance explained by the consecutive singular vectors extracted from the trimmed respondent-footprint matrix presented in Figure 8.1. Due to its shape, this plot is referred to as a scree plot. A popular rule of thumb is to retain singular vectors that together account for about 70% of the variance in the original data. Here, this would result in retaining the first three singular vectors, explaining about 77% of the variance in the original matrix. Another approach suggests selecting singular vectors above the "elbow" of the scree plot. Applying this rule would also result in retaining the top three singular vectors: the scree plot has a pronounced elbow at the fourth singular vector.

Figure 8.2 Variance explained by consecutive singular vectors in the trimmed respondent-footprint matrix presented in Figure 8.1 (right panel).

Another approach to estimating the optimum number of dimensions to retain is data driven: one can test how the accuracy of the psychometric measure changes with the number of retained dimensions. In other words, one can follow the steps described in the later part of this chapter using one, two, three, four (and so on) dimensions, each time estimating the quality of the resulting measure. Typically, the accuracy grows fast with the number of retained dimensions and then flattens as the amount of additional information provided by each consecutive dimension decreases. Picking the number of dimensions that marks the end of the rapid growth of the model's accuracy might offer a good compromise between the amount of information retained, speed, and interpretability of the model. Note that conducting such analyses on the entire data set might be computationally expensive; thus, it might be worth conducting them on a randomly selected subsample (or, even better, several randomly selected subsamples) from the entire data set.

Figure 8.3 Respondents' (matrix U) and websites' (matrix V) scores on three singular vectors extracted from the trimmed respondent-footprint matrix presented in Figure 8.1.

INTERPRETING SINGULAR VECTORS

Let us inspect the first three singular vectors extracted from the trimmed respondent-footprint matrix presented in Figure 8.1. The results presented in Figure 8.3 include matrices U and V, containing singular vector scores of respondents and websites, respectively. Cells are shaded based on the absolute values: the further the value is from 0, the darker the shading.

Matrices U and V expose several of the patterns in the original matrix. As mentioned before, when SVD is applied to noncentered matrices, the first SVD vector (SVD1) correlates strongly with the frequencies of the objects in rows and columns. This is true here: SVD1 correlates highly (r = .93) with the popularity of the websites in the respondent-footprint matrix: Deviantart.com, the most visited website, has the highest score on SVD1, while Rottentomatoes.com, the least popular website, has the lowest score.

The singular vectors can also be interpreted by exploring the commonalities between the websites most strongly (positively or negatively) associated with a given vector. Two art-related websites—Etsy.com and Deviantart.com—score high on SVD1, suggesting that SVD1 seems to be capturing interest in art. This is mirrored by respondents' SVD1 scores (matrix U): people with the highest SVD1 scores (Susan, Michal, and David) had visited many websites and were interested in art-related websites. Analogously, SVD2 seems to be capturing interest in Google.com and Facebook.com; these websites and their most active followers score highly on SVD2. Finally, SVD3 captures the interest in movie-related websites (Rottentomatoes.com and Imdb.com).

ROTATING SINGULAR VECTORS

Examination of matrices U and V (Figure 8.3) highlights a typical issue pertaining to the interpretation of singular vectors. As SVD aims to maximize the variance accounted for by the first and subsequent singular vectors, top singular vectors relate highly to many respondents and many footprints, making the SVD results difficult to interpret. Most of the websites and respondents score relatively high (either positively or negatively) on all three singular vectors. To simplify the structure of the singular vectors and improve their interpretability, one can employ factor rotation techniques. Factor rotations are applied to simplify the structure of dimensions extracted from the data, such as the singular vectors employed here. Factor rotation linearly transforms the original multidimensional space into a new, rotated space. Rotation approaches can be orthogonal (i.e., producing uncorrelated dimensions) or oblique (i.e., allowing for correlations between rotated dimensions). Commonly used rotation techniques include varimax, quartimax, equimax, direct oblimin, and promax. Popular languages for statistical programming (e.g., Python, R, and MATLAB) provide functions aimed at rotating singular vectors. Here, we use varimax rotation, which minimizes both the number of dimensions related to each variable and the number of variables related to each dimension, thus improving the interpretability of the results. Figure 8.4 presents the varimax-rotated singular vectors taken from Figure 8.3.

Figure 8.4 Varimax-rotated singular vectors from Figure 8.3.

The interpretation of varimax-rotated singular vectors is much easier. Varimax-rotated matrix V shows that the first rotated singular vector (SVDrot1) represents the art-related websites: Etsy.com and Deviantart.com score high on it, while other websites have scores close to 0. SVDrot2 represents Google.com and Facebook.com, and SVDrot3 groups the movie-related websites. Varimax-rotated matrix U is similarly clear. For example, Susan, David, Stan, and Michal—who had a clear preference for art-related websites (see Figure 8.1)—score highly on SVDrot1. In fact, David scores highly on both SVDrot1 and SVDrot2, which reflects his preference for Google.com, Facebook.com, and art-related websites.

The interpretation of these examples focused on the similarities between websites scoring high on a given singular vector. For example, we referred to Etsy.com and Deviantart.com as "art-related websites." The vectors could also have been interpreted by exploring the similarities between the respondents with similar scores. If, for example, respondents' extraversion correlated highly with one of the singular vectors, this would indicate that this vector captures some of the behavioral correlates of this trait.
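A minimal varimax sketch in plain NumPy (dedicated packages provide equivalent rotation functions); it assumes U_scores and V_scores from the SVD sketch above and omits Kaiser normalization for brevity.

import numpy as np

def varimax(loadings, tol=1e-8, max_iter=500):
    # Classic varimax: find an orthogonal rotation R that simplifies the
    # column structure of the loadings matrix.
    p, k = loadings.shape
    R = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        grad = loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return loadings @ R, R

# Rotate the websites' loadings, then apply the same rotation to respondents'
# scores so that the two matrices stay aligned (R is orthogonal).
V_rot, R = varimax(V_scores)
U_rot = U_scores @ R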

Latent Dirichlet allocation

We now turn our attention to LDA, a cluster-analytical technique commonly used in the context of large data sets, most often to study patterns in language.



Hence, in LDA nomenclature, the respondents are referred to as documents, footprints are referred to as words, and clusters are referred to as topics. However, LDA can be readily applied to nontextual data, as long as the data comprise counts (nonnegative integers), such as counts of products purchased by consumers or numbers of visits to a given website. LDA is one of the most readily interpretable clustering techniques, as it produces probabilities that unambiguously quantify the associations between respondents, footprints, and the underlying clusters. LDA libraries are available for all popular statistical programming languages (e.g., R, Python, and MATLAB).

DIRICHLET DISTRIBUTION PARAMETERS

Let us apply the LDA analysis to the respondent-footprint matrix presented in Figure 8.1. An important decision in conducting LDA analysis is the selection of the concentration parameters α and δ of the Dirichlet distribution. For symmetric Dirichlet distributions (used by most LDA implementations), α regulates the number of clusters that respondents will belong to, while δ regulates the number of clusters that each footprint will belong to. Adopting high values of α and δ increases the number of clusters that each respondent and each footprint can belong to, imposing fewer constraints on the resulting cluster structure. Consequently, adopting high values of α and δ may enhance the amount of information retained. Lower values of α and δ, on the other hand, produce more distinct and easily interpretable clusters, where each respondent and each footprint relate to a smaller number of clusters. A common approach is to adopt α = 50/k (where k is the number of clusters to be extracted) and δ = 0.1 or 200/n (where n is the number of columns in the respondent-footprint matrix).
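A minimal sketch, assuming X is the sparse respondent-footprint count matrix from the earlier sketches; scikit-learn's LatentDirichletAllocation is one of several implementations, and its doc_topic_prior and topic_word_prior arguments play the roles of α and δ above.

from sklearn.decomposition import LatentDirichletAllocation

k = 3  # number of clusters (topics) to extract, illustrative
lda = LatentDirichletAllocation(n_components=k,
                                doc_topic_prior=50 / k,   # alpha
                                topic_word_prior=0.1,     # delta
                                random_state=0)

# Analogue of matrix gamma: each row is a respondent's distribution over clusters.
respondent_clusters = lda.fit_transform(X)

# Analogue of matrix beta: normalizing each row of components_ gives, for each
# cluster, a probability distribution over footprints (websites).
website_clusters = lda.components_ / lda.components_.sum(axis=1, keepdims=True)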

CHOOSING THE NUMBER OF LDA CLUSTERS

Next we choose the number of LDA clusters to extract. As in the context of SVD, the desirable number of dimensions depends on the intended application. Few clusters are easier to interpret and visualize, but many clusters retain more information from the original matrix and thus allow, up to a point, for more accurate prediction models.

Figure 8.5 AIC and the number of LDA clusters extracted from the trimmed respondent-footprint matrix presented in Figure 8.1. (Note that the minimum number of clusters that can be extracted is k = 2.)


A widespread approach relies on examining the model's fit for different numbers of LDA clusters. Figure 8.5 plots the Akaike information criterion (AIC; the lower the value, the better the fit)—a model-fit estimate commonly reported by LDA functions—against the number of clusters extracted from the respondent-footprint matrix presented in Figure 8.1. At first, the AIC decreases rapidly, indicating the substantial growth in model fit with each additional cluster. Once there are enough clusters to represent the patterns in the data well (here, around k = 3 or 4 clusters), the line flattens, as adding additional clusters does not substantially improve the model fit. Note that estimating the LDA model's fit for a given number of clusters requires conducting the LDA analysis. As this might be very slow on large data sets, it is advisable to use a subset of the data and nonconsecutive numbers of clusters (e.g., five, 10, 20, and 50 clusters, instead of all consecutive values from two to 50).
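A minimal sketch of such a comparison, assuming X_train and X_held_out are sparse count matrices obtained by randomly splitting the respondents; scikit-learn reports log-likelihood and perplexity rather than AIC, so held-out perplexity (the lower, the better) is used here as the fit index.

from sklearn.decomposition import LatentDirichletAllocation

for k in (2, 5, 10, 20, 50):   # nonconsecutive numbers of clusters, as suggested above
    lda = LatentDirichletAllocation(n_components=k,
                                    doc_topic_prior=50 / k,
                                    topic_word_prior=0.1,
                                    random_state=0)
    lda.fit(X_train)
    print(k, lda.perplexity(X_held_out))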

INTERPRETING THE LDA CLUSTERS

We use LDA to extract three clusters from the respondent-footprint matrix presented in Figure 8.1. Due to the unrealistically small size of the respondent-footprint matrix, we do not apply the recommended values of α and δ, and set them both to 0.1 instead. Figure 8.6 presents two matrices produced by LDA expressing the associations between clusters and websites (matrix β) and clusters and respondents (matrix γ).

Figure 8.6 Three LDA topics extracted from the trimmed respondent-footprint matrix presented in Figure 8.1. Matrix γ shows the associations between respondents and clusters; matrix β shows the associations between websites and clusters.

Matrix β contains, for each cluster, the probabilities of particular websites being visited. For example, take cluster LDA1, which groups art-related websites. When a participant visits a website in this cluster, they will pick Deviantart.com with a probability of .53, Etsy.com with a probability of .46, and Google.com with a probability of .01. Note that the probabilities sum to 1 in each column. This is because we are dealing with mutually exclusive events encompassing all possible outcomes: if a participant visits a given cluster, they must choose one of the websites in the matrix.

Matrix γ contains the probabilities of a respondent visiting the websites in a given cluster. For example, David's probability of visiting websites in cluster LDA1 (Etsy.com and Deviantart.com) equals .6, while his probability of visiting websites in cluster LDA2 (Google.com and Facebook.com) equals .4. Compare this with the respondent-footprint matrix (Figure 8.1), showing that David visited exclusively websites in those clusters, and that he was more likely to visit those belonging to LDA1. In matrix γ, the probabilities sum to 1 in each row: each of the websites visited by a participant must belong to one of the LDA clusters. Note how similar the results of the LDA analysis are to those produced by SVD (with varimax rotation).

Building prediction models

As we discussed before, patterns extracted from respondent-footprint matrices can offer useful insights or facilitate the discovery of new psychological constructs and mechanisms. They can also be used to build models predicting future behaviors, real-life outcomes, and psychodemographic traits. The resulting models could be used to replace or supplement traditional assessment tools. The development (training) of prediction models is relatively straightforward, and popular languages for statistical programming (e.g., Python, R, and MATLAB) include functions that allow for conducting each of the following steps.

Step 1. Respondents' digital footprints are matched with the outcome variable that the predictive model is supposed to measure. Outcome variables can include personality scores from traditional questionnaires, intelligence test scores, or real-life outcomes and behaviors, such as job performance ratings or school grades. Typically, outcome variables are collected using traditional tests and questionnaires at the same time as the digital footprints. Sometimes they are extracted directly from digital footprints. For example, in building a tool aimed at predicting student performance in a given course from their performance in other courses, both the predictors and outcome variables are part of the same data set.

Step 2. The sample is split into training and test sets. The training set (usually comprising about 80%–90% of the data) will be used to train the prediction model, while the test set will be used to estimate the model's quality indicators, such as its accuracy.

Step 3. The dimensionality of the digital footprints in the training set is reduced following the steps detailed earlier. While some predictive models can use raw respondent-footprint matrices as an input, dimensionality reduction offers several benefits. First, there are often more footprints than respondents in respondent-footprint matrices. In such cases, reducing dimensionality is essential, as many statistical analyses and prediction models require that there be more (and preferably many more) respondents than variables. Second, reducing the number of unique digital footprints may increase the statistical power of the results and decrease the risk of overfitting (i.e., generating a model that fits a particular set of data so well that it may fail to reliably describe future observations or similar data sets). Third, reducing dimensionality removes multicollinearity and redundancy from the data by grouping correlated variables into a single dimension or a single cluster. Finally, reducing dimensionality decreases the size of the data, and thus the computational resources necessary to analyze it.

Step 4. Respondents' singular value scores (or their LDA cluster memberships) extracted in the previous step are used as predictors to train a predictive model aimed at the outcome variable. There is an abundance of models that can be used, ranging from relatively sophisticated approaches—such as deep neural networks, probabilistic graphical models, or support vector machines—to much simpler approaches, such as linear and logistic regressions. In practice, it is sensible to start with the simpler prediction methods before moving on to more sophisticated approaches. Not only are they computationally faster, less prone to bugs, and easier to implement, but they also provide a good baseline for judging the gains in accuracy (or the lack thereof) offered by more sophisticated approaches.

Step 5. The prediction model developed in Steps 3 and 4 is applied to the respondents in the test set (set aside in Step 2). Respondents' singular value scores are estimated using the dimensionality-reduction schema developed on the training set in Step 3 and entered into the predictive model trained in Step 4 to compute the predicted values of the outcome variable.

Step 6. The resulting scores of the respondents are examined in a way similar to scores obtained from newly developed traditional tests and questionnaires. In particular, one needs to estimate their reliability and validity, check if they are free from biases, and develop norms (see Chapter 3). Most of the techniques used in the context of conventional measures apply to digital-footprint-based scores. Let us explore a few of them. As in the context of traditional measures, the predictive validity of digital-footprint-based psychometric scales can be estimated by checking how predictive they are of outcomes that a particular trait should predict. The neuroticism score, for example, should be a good predictor of life satisfaction or a history of depression. A scale's concurrent validity describes its correlation with other scales aimed at the same construct. The concurrent validity of the scale trained in Step 5 could be checked by correlating it with the outcome variable. Test–retest reliability could be estimated by correlating digital-footprint-based scores computed from two samples of digital footprints produced by the same respondents at two different points in time. Similarly, split-half reliability can be estimated by correlating two scores estimated from the respondent-footprint matrix split into two random halves (by columns).

A hands-on tutorial introducing the reader to extracting dimensions from digital footprints and building predictive models can be accessed through the companion website at www.modernpsychometrics.com.
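A compact sketch of Steps 2 to 5, assuming X is the sparse respondent-footprint matrix and y holds a single outcome variable (e.g., a questionnaire-based trait score); the 90/10 split, the 50 retained dimensions, and ridge regression are illustrative choices rather than recommendations.

from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

# Step 2: split respondents into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=0)

# Step 3: reduce dimensionality on the training set only.
svd = TruncatedSVD(n_components=50, random_state=0)
Z_train = svd.fit_transform(X_train)

# Step 4: train a simple (linear) prediction model on the reduced footprints.
model = Ridge(alpha=1.0).fit(Z_train, y_train)

# Step 5: apply the training-set dimensionality-reduction schema and the trained
# model to the held-out respondents.
Z_test = svd.transform(X_test)
y_pred = model.predict(Z_test)

# One possible accuracy indicator: the correlation between predicted and
# questionnaire-based scores in the test set.
print(pearsonr(y_test, y_pred)[0])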

Summary

We are increasingly surrounded by digital products and services that mediate our activities, communication, and social interactions. Consequently, a growing fraction of human thoughts, behaviors, and preferences leave digital footprints that can be easily recorded, stored, and analyzed. Combined with ever-increasing computing power and modern statistical tools, such vast amounts of data are radically changing the potential of psychometrics. Digital footprints are increasingly used to measure psychological constructs that were traditionally measured using tests and questionnaires, as well as new constructs that were traditionally difficult to measure. Digital-footprint-based measures have many advantages over traditional tools, including high ecological validity, a higher capacity to record behavior (also in a longitudinal manner), and greater speed and unobtrusiveness. Yet they also have significant drawbacks, including the potential for bias and for reducing respondents' control over the assessment process. In the wrong hands, digital-footprint-based assessments could be used to invade people's privacy without their consent or knowledge.

9

Psychometrics in the era of the intelligent machine

History of computerization in psychometrics

Psychometric procedures have always proved to be particularly amenable to computerization. However, because tests usually contain information about an individual's psychological makeup, are typically scored over the internet or by computer, and produce data that can easily be transferred to databases, this field is very sensitive, particularly in terms of the right to privacy. Furthermore, machine learning and AI can be used to extract personal information from often seemingly superficial data sets. Such information has always been of interest to personnel and credit rating agencies, the insurance and marketing industries, law enforcement agencies, and the security services, as it enables them to make ever more accurate predictions about our future behavior. Thus special care should be taken, as AI combined with big data can extract personal and even intimate information in ever more accurate ways.

The contribution of information technology to testing for college entrance, professional licensure tests, standardized achievement batteries, and scored clinical instruments has been increasing for some time, although this has not always been apparent to the large number of people who have been affected by it. At the same time, computerization of scoring, of test design, and of reliability and validity estimation is leading to significant improvements in the accuracy of the results of testing, and this is proceeding at an ever-increasing rate.

For psychometrics, large data sets are not new. The US Army Alpha IQ tests developed during the First World War were administered to more than one million recruits. But 100 years ago, the statistical challenges of analyzing a database of this size were insurmountable. This number of completed paper-and-pencil tests would have filled one kilometer of filing cabinets stacked side by side.

Artificial Intelligence (AI) originally evolved within the context of expert systems that combined an inference engine with a knowledge base to provide professional medical, legal, or economic advice. Today, AI is increasingly dependent on modern machine-learning (ML) algorithms that themselves depend on the availability of big data, so much so that the terms ML and AI have become more or less synonymous. These algorithms had their origins in neuropsychological studies of neural networks. Hence, while today it might seem that psychometrics and ML are strange bedfellows, they both have common origins at the interface between human psychology and statistics. While early psychometricians did develop new techniques such as factor analysis and rose to the challenge as best they could in a computer-free era, recent times have found the discipline at the intersection between big data analytics, superfast data handling, and AI. What has been the wind of change for electronic communication has been a hurricane for our discipline. How did this come about?


Computerized statistics

The impact of the computer itself has taken place at several levels. The first major development was the computerized statistical package, first on mainframe computers and later on personal computers. Most problems in psychometrics are extensions of matrix and vector algebra, fields of analysis that benefited considerably from these packages. Before the 1960s, the main restrictions on the science of testing were time limitations arising from the need to carry out large numbers of calculations. In factor analysis, matrix inversion, while essential, was time consuming. Further, because iteration is required for the solution of complex equations, each major algorithm needs to be repeatedly looped. In the 1960s computers provided, for the first time, an easy—yet still time-consuming—solution to this problem. By the 1980s factor analysis programs were available on microcomputers, and by the 1990s even complex modeling was readily available to every PC user.

One difficulty with the ready availability of statistical packages is that users often failed to understand what they were doing. These programs could produce almost endless alternative ways of analysis, different forms of significance tests, and rotations of factor structure, while offering very little guidance on what was actually important among all these many pages of figures. The use of knowledge engineering techniques (expert systems containing a knowledge base and an inference engine) has certainly changed this situation. Today, AIs using system-analytic expert systems can guide doctors, lawyers, and accountants through all the necessary steps in the analysis of results, from the initial assumptions to the final conclusions.

Computerized item banks

The next most important area in which computerization has had an impact is in the development and administration of item banks. As we saw in Chapter 5, the development of item response theory (IRT), together with item-level data from which item parameters could be estimated, enables a great deal of housekeeping in the bank to be done automatically. The models used to derive the parameters involve complex iteration, and during the 1970s, analysis of the two- and three-parameter models was too expensive in computer time, so item banks were almost exclusively based on the one-parameter Rasch model, even though it was known to be inadequate. However, by the 1980s the more complex models were no longer so challenging to implement and were beginning to become widely available.

Computer adaptive testing (CAT) using these IRT models allows responses from the items administered at the beginning of a testing session to be used to obtain provisional estimates of ability. This information is then used to select items of an appropriate difficulty level for the rest of the testing session. These methods, when fully developed, can save more than 50% on the number of items required, thus either saving time or increasing accuracy.

One important example of the use of IRT models in test construction was the Differential Ability Scales (DAS), published by Pearson Assessment in the USA in 1990. For a while in the late 1970s, at a time when the Rasch model had fallen into disfavor, it was felt that the use of this model in the construction of the scales had been a mistake. However, the test itself proved to have been so well constructed overall, and so useful in practice, that the use of the Rasch-constructed subscales of numerical and other computational abilities continued to be recommended, albeit with caution.


In fact, they have proved to be particularly robust in their use for clinical assessment of children, and the generalization across ability levels from different subsets of items has, despite many misgivings, been found to be informative. Colin Elliott, the original author of the scales as the British Ability Scales, argued that many of the doubts about the Rasch model had arisen from situations where it had been applied to preexisting data, or to test data that had not been specifically designed to fit the model. If the Rasch model was used carefully and with a full knowledge of its limitations, as in the development of the DAS, then it was indeed possible to make use of its subject-free and item-free characteristics.

Computerized item generation

Computerized psychometric tests do not always need to depend on a preexisting item bank. They also have the potential to create new items, or to tailor existing items, during the test administration. The utility of the computer in designing test items is based largely on the existence of a series of standard item formats. Many items have a frame, which is held constant, and elements, which can vary. Take the object-relations format, "a is to b as x is to y," such as "glove is to hand as sock is to ____" (foot). There are many millions of possible sets of semantic relations which could be used to build such items, and they can be assembled automatically. The same is true for many other item types. In fact, the number of basic item types is quite limited; it is the enormous number of possible elements they can contain that provides for variation. There are many circumstances where sets of possible elements can be stored and automatically inserted into a fixed format to generate a large series of new items. This applies particularly in memory tests, but also in perceptual and numeric tests. Indeed, the possibility of some computerization applies to almost all item types. The advantages of this are several, but lie particularly in the ability to create novel items in circumstances where respondents need to be tested several times. This form of item generation was used by the British Army Recruitment Battery (BARB) in the 1990s and today forms the backbone of the open-source International Cognitive Ability Resource (International Cognitive Ability Resource, 2018; Sun, Liu & Loe, 2019).
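A toy sketch of frame-and-elements item generation in Python; the frame and the relation sets are invented for illustration, and a real system would also control item difficulty and screen the generated items.

import random

FRAME = "{a} is to {b} as {c} is to ____"

# Each list holds element pairs sharing the same semantic relation.
RELATIONS = {
    "covering": [("glove", "hand"), ("sock", "foot"), ("hat", "head")],
    "young of": [("puppy", "dog"), ("kitten", "cat"), ("calf", "cow")],
}

def generate_item(rng=random):
    relation = rng.choice(list(RELATIONS))
    (a, b), (c, answer) = rng.sample(RELATIONS[relation], 2)
    return FRAME.format(a=a, b=b, c=c), answer

stem, key = generate_item()
print(stem, "->", key)   # e.g., "glove is to hand as sock is to ____ -> foot"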

Automated advice and report systems

A further way in which the use of the computer has been extended is in the generation of expert advice and reports. Most computerized testing or scoring programs no longer report mere numbers to be interpreted by experts, but produce narrative reports in a form suitable for the respondent or other end user. Where the test contains a profile of many subtests, as with the Orpheus Business Personality Inventory (see Chapter 7), the computer can identify extremes, interpret these in light of other subscale scores, and make recommendations. In these cases it is not just the subtest scores that need to be validated but also any narratives that have been derived from these scores. However, it does mean that a much more rigid series of justification rules is required.

Consider the similar case of computerized medical diagnosis. If, within this procedure, an error is made and the wrong medication is prescribed and the patient dies as a result, who is accountable for this mistake? Is it the computer, the clinician, or the person who wrote the diagnostic program? When diagnosis was carried out by the clinical expert alone there was at least a clear knowledge of who was accountable in the event of error.


As computers obviously cannot yet be held accountable in law, the use of computer recommendations might appear to pass the responsibility to the test constructor; however, it is more likely the end user who will be left with de facto responsibility. It has been strongly argued that publishers of such computer programs should also publish material that clearly sets out the decision rules followed by the program. Thus, a professional user should be able to understand the full implications of the program's recommendations. It was once thought that it should never be the case that a person would be in the position where they could say, "This is the computer's decision"—a situation that could only be eradicated if advisors were taken through all the steps required to back up the decision. However, the advent of deep-learning, multilayered ML systems such as Google DeepMind and IBM Watson, as well as self-driving cars, has rendered much of this wishful thinking somewhat obsolete.

The ethical issues associated with computerized advice of this type were first recognized in the 1980s, when four major concerns were pointed out that arise from the use of computers in test interpretation. First, it was questioned whether computers are any better than human experts. The other objections were, second, that computerized reports may reach the hands of inexperienced or unqualified people; third, that decision rules may not be public; and finally, that computerized reports may not be sufficiently validated. Many of these arguments are in principle still valid today and can equally be directed toward AI; but although they all include an element of truth, it is already too late to argue for abolition. The problems, both ethical and social, that they present to society will have to be tackled one way or another, for better or worse.

A code of conduct is required for each application that specifies where records are kept, who is the data controller, who shall have access, the purposes for which they shall be used, the validation techniques, and the procedures for making the decision rules public. The problems are like those encountered in the use of AI for any decision-making process involving human beings, where thought must be given to the consequences of a wrong decision that may have to be justified in a court of law. This said, advances in ML, particularly deep-learning systems, increasingly make such explanation either difficult or impossible.

The evolution of AI in psychometrics

The idea of artificial intelligence itself pre-dates the electronic era—the original intelligent machines were mechanical. Calculating machines such as the abacus and astrolabe are centuries old and easily predate Charles Babbage's 1820 plan for a "Difference Engine" to aid in the construction of mathematical tables. He subsequently designed, but never built, an "Analytic Engine" that combined both instructions and memory functions, using a combination of mechanical parts and punched cards to achieve this purpose. In 1843, the mathematician Ada Lovelace (daughter of Lord Byron) published the first algorithm intended to be carried out on such a "computing machine" and hence arguably became the world's first computer programmer. The early 20th century saw the evolution of various analog devices designed to instantiate Babbage's engine, but it was not until Alan Turing's midcentury insight into the underlying digital nature of the computational world that the first real computers were built. It is his "Turing test" that epitomizes the way in which AI is perceived even today. Modern AI can be broadly divided into two major fields: expert systems and machine learning.


Expert systems

Expert systems offer a procedure whereby the decision rules of an assessment can be explicated and incorporated into a software routine. Recent expert systems can be very sophisticated and offer the advantage of being able to make decisions quickly and consistently. In the early years, psychologists were often at the forefront of the development of expert systems, and indeed many CAT programs and narrative-report generators are simple expert systems.

Early expert systems concentrated on simulating "experts" such as accountants, bank managers, and doctors as they went about their business. They encoded if-then branches into a hierarchical decision-tree framework, combining a knowledge database with an inference engine that systematized the various decisions that needed to be made. These systems were mostly diagnostic programs used in medicine, such as those introduced by Feigenbaum's Stanford Heuristic Programming Project in the 1960s. Instead of the patient being questioned by a doctor, they answer questions suggested by a computer that has been programmed to emulate an ideal and fully informed consultant. The expertise is based on rules (e.g., "If the patient complains of a pain in the big toe, then check the level of uric acid in the blood") and on data (e.g., all the possible causes of pain in the big toe). Because the memory of a computer is enormous and easily accessible, such a computerized expert system should in principle be able to outperform any consultant, so long as it has had a good role model. Even before the end of the last century, there were many areas in which artificial expert systems had already supplanted their human analogs, e.g., the role of bank managers in allowing credit to customers.

For psychometrics, the relevant expert is the professional interviewer. Indeed, it is possible to consider even the classical psychometric test as a rudimentary expert system. Most of the questions in the type of questionnaires used in human-resources departments are, after all, similar in many ways to the sort of questions a job interviewer might ask. However, until the advent of artificial intelligence, the human interviewer has always had the ultimate edge in that they are able to explore different pathways to the same end point. They can recognize that the same skill can be obtained through many different routes. A human can use conditionals when making decisions: the questions asked will depend on how the respondent has replied to previous questions. Classical psychometric tests do not have this ability; they can only be scored by asking the same questions of all applicants, and by weighting each question in the same way, regardless of the level of its application to a particular individual. While this type of data is ideal for statistical analysis, it lacks the flexibility of the true expert. Statistical models that combine weighted scores in this way are called linear. If the decision depends on conditionals, e.g., "only utilize the response to item x if there is a certain response to item y," then the model is nonlinear. While statistics did have some models for simple forms of nonlinearity, the true complexity of these models was beyond its scope. Despite their many advantages, expert systems initially had limited impact on psychometrics and the examination or recruitment testing systems that were dependent on psychometric principles.
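The rule-plus-knowledge-base arrangement described above can be sketched in a few lines of Python; the rules and facts here are invented stand-ins for a knowledge base, and the small loop plays the role of the inference engine.

RULES = [
    # (conditions that must all be in working memory, conclusion to add)
    ({"pain in big toe"}, "check uric acid"),
    ({"check uric acid", "high uric acid"}, "consider gout"),
]

def infer(facts):
    # Forward chaining: a rule fires when all of its conditions are known, and
    # its conclusion becomes a new fact that may let further rules fire.
    memory = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in RULES:
            if conditions <= memory and conclusion not in memory:
                memory.add(conclusion)
                changed = True
    return memory - set(facts)   # newly derived conclusions only

print(infer({"pain in big toe", "high uric acid"}))
# -> {'check uric acid', 'consider gout'}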
Classical test theory required that every individual take the same standardized test, and the introduction of conditionals (if-then) posed a major problem for test equating—one that was not resolved until the acceptance of IRT and modern psychometric methods. Today, expert systems are not generally seen as AI per se (the intelligence in them is of human origin), but they were the origin of many of the automated systems used by accountants, doctors, and other professionals that have transformed their lives. True AI needed something more. It needed to be able to learn to improve these systems beyond the human level, and that capability was being developed by another group of scientists who were studying and replicating in machines the mechanisms by which neurons work together in the human brain.

Neural networks (machine learning)

While an expert system is a form of AI, it is not the only form. We now know that machines can learn as well as follow rules. The thinking behind today's ML algorithms began in neuropsychology with D. O. Hebb, who in the 1940s introduced the idea of the neural network as a learning mechanism. ML systems are computer programs that can learn from experience. The inner workings of the artificial neural network are a set of nodes held within the software, each of which behaves in many ways like an individual neuron in the human brain. By increasing the number of nodes, allowing signals to flow freely between them, and making the size of these signals depend on the nodes' history, it is possible to emulate the forms of learning of which previously only living organisms were capable. One area in which such models have been particularly successful is that of pattern recognition. Thus, an artificial neural network can be trained to recognize car number plates seen from different angles in a murky environment under different lighting conditions. In most circumstances such ML networks consistently outperform recognition programs based on classical expert systems or linear statistics.

The ability of ML to recognize patterns makes it of interest to psychometricians in that it provides an alternative model of the interviewer as expert. Perhaps the good interviewer is best emulated not as a person who identifies and follows sets of rules of the type found in expert systems but rather as a person adept at pattern recognition. The interviewer is required to recognize true potential amidst the complexity of behaviors, moods, and motives that make up the person applying for the job. Can these models be used in the development of psychometric assessments? In principle, yes. For example, there is a great deal of similarity between psychometrics and econometrics, in which ML systems have also found wide application (e.g., stock-market prediction or credit risk assessment). Indeed, a credit rating is de facto scored biodata. Similarly, there are links between predictive validity and the actuarial estimation of insurance risk carried out by underwriters.

There are differences, however. Classical psychometrics has evolved alongside mathematical statistics and has established almost all of its performance standards within that framework. This has consistently been a problem for CAT and other AI systems in psychometrics, in that new paradigms have demanded new and untried approaches to reliability, validity, standardization, and bias. These have generally lacked the simplicity and clarity of the old formula and have not always met with widespread acceptance, particularly where legal case law has been established on the basis of tradition, where the rule is that the score is based simply on the number of questions that have been answered correctly. This having been said, there can be no doubt that the new techniques do in principle offer many advantages over the traditional method. However, any new system does need to address two important questions. First, is the gain in benefit sufficiently substantial to justify the implementation of a relatively untried process? And second, can we obtain a set of standards for its use which will meet the stringent requirements of the human-resources professional?


Parallel processing

Mid-20th-century psychology struggled at the interface of conceptual categories. As Wittgenstein observed at the time: "The confusion and barrenness of psychology is not to be explained by calling it a 'young science.' Its status is not the same as that of physics, for instance, in its beginnings. … For in psychology there are experimental methods and conceptual confusion. The existence of the experimental method makes us think we have the means of solving the problems which trouble us; though problems and method pass one another by" (Wittgenstein, 1958, p. 232).

On the one hand stood an idea of mind modeled on the telephone exchange (the "experimental method" in Wittgenstein's case). Thoughts took place in order, one after another, as we spoke or typed or wrote—or, as in the Rubaiyat of Omar Khayyam: "The Moving Finger writes; and, having writ, / Moves on." This is tangible in terms of the stream of consciousness favored by the early-20th-century introspectionists, the midcentury single-channel hypothesis of memory researchers, signal detection theory favored by the code breakers, the stimulus-response chain of the behaviorists, and the receiver operating characteristics of the information-processing enthusiasts. But it stands in sharp contrast to the obvious complexity and interconnectedness of the brain itself, which contains no physical or biological single channel but a multiplexity of interconnections between neurons. According to Wikipedia, the human brain has some 8.6 × 10¹⁰ (86 billion) neurons. Each neuron has on average 7,000 synaptic connections to other neurons. It has been estimated that the brain of a three-year-old child has about 10¹⁵ synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for adults, ranging from 10¹⁴ to 5 × 10¹⁴ synapses (100 to 500 trillion) (Human Brain, n.d.; Herculano-Houzel, 2012).

The breakthrough came with the publication in 1949 of Hebb's The Organization of Behavior, which introduced associative learning. When two neurons are in proximity and fire together, the activity of one affects the activity of the other, and this could be the case with any number of neurons. Learning among neurons takes place not in series but in parallel—it is an intrinsic activity of a network of neurons: a "neural network." Today it is widely accepted in psychology, and in neuroscience more generally, that learning in the brain must take place within neural networks, although the many possible mechanisms for these are far from understood.

However, it is within the computing world that the idea is having its greatest impact. The central processing units (CPUs) and multiprocessors within all our computers and phones embody the concept of a single or very few channels, and do indeed resonate in many ways with the idea of a stream of consciousness. But if the brain contains learning systems that operate in parallel, why not computers too? This is not a new idea—designs for machines using parallel distributed processing (PDP) go back to the 1960s, and the idea is intrinsic to the design of the first machine-learning algorithms, perceptrons, so designated by Frank Rosenblatt at Cornell in 1958. These follow Hebbian-style principles and mimic the behavior of regression models in statistics. In practice, however, the problems of coordinating the timings between many different CPUs proved far more complicated than was first hoped, and today almost all ML code runs its "parallel" processes in series through each of the CPU's multiprocessors, which, although more time-consuming, is less problematic.

Predicting with statistics and machine learning

Classical psychometrics is dependent upon linear statistics, and at a simple level there are many similarities between ML procedures and many statistical methods. A comparison is instructive between one of the earliest ML algorithms—the single-layer perceptron—and a simple multiple regression equation from classical statistics as applied to basic psychometric test item data. Suppose we have two groups of managers in an organization, with similar length of service. Group 1 consists of 500 managers who have been promoted, and group 2 consists of 500 who have not been promoted. Then for all 1,000 individuals, suppose we have data from a 10-item Big Five psychometric personality test such as the Ten-Item Personality Inventory (Gosling, Rentfrow, & Swann, 2003). A multiple regression equation could be calculated to enable the prediction, on the basis of their item responses, of which group a particular individual was likely to be in, the beta weights being the weighting for each item. Alternatively, suppose that we apply a single-layer perceptron ML algorithm, with 10 input nodes (one for each item) and two output nodes (one for each group)—single nodes represent single neurons in neural networks. In order to learn, the perceptron is supplied with data from each individual, the order of individuals having been previously randomized. This allows the algorithm to train to distinguish the probability of any individual being in either one group or the other. In these circumstances, we would find that the ratios between the strengths of the links between each input node and the output node were more or less identical to the beta weights in the regression equation.

Logistic regression is basically a one-layer neural network with a sigmoid activation function, and hence is a very simple perceptron—too simple to deserve the name, really. To take it further we need to add some deep structure, first perhaps a second stage where one single intermediate (or "hidden") node is inserted between the input and the output nodes, making it a very simple multilayered perceptron. As soon as we do this, the two models (regression and ML) really begin to diverge.

It is now recognized that traditional statistics and ML are not in theoretical conflict as approaches to data analysis; rather, one can be seen as a subset of the other. It can be demonstrated that all classical statistical procedures can be formulated as special cases of simple ML networks (in most instances a single-layer perceptron). In some cases the ML solution is not simply similar to the solution of classical statistics, it is an algebraic identity. Thus, their relationship is rather like that between Newtonian and Einsteinian mechanics. One can be reduced to the other when certain simple principles hold. The simple principle in this case is that of linearity. One consequence of this is that if we use ML programming to carry out a psychometric functional test validation, we can, if we use a single-layer network, at the very least duplicate the classical item-analytic procedures. Furthermore, if we add many hidden layers (or "deep learning") to the network we may increase its predictive power by including true nonlinear relationships within the model. If no increase is obtained, we can rest comfortable in the knowledge that our linear solution, with its easily understood explanatory statistics, is adequate and probably the best available. But more importantly, a new issue emerges: explainability.
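This comparison can be made concrete with a short simulation, given below as a hedged sketch rather than the text's own analysis. It fits an ordinary logistic regression and a single sigmoid-output "perceptron" trained by gradient descent to the same simulated item data; because both are fitting essentially the same linear model, their weights should closely agree. All names and numbers (the items, groups, and learning rate) are invented for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_items = 1000, 10                                    # 1,000 managers, 10 personality items
X = rng.integers(1, 8, size=(n, n_items)).astype(float)  # 7-point item responses
X -= X.mean(axis=0)                                      # center the items
true_w = 0.15 * rng.normal(size=n_items)                 # weak "true" item weights
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_w)))       # 1 = promoted, 0 = not promoted

# Classical statistics: an (almost) unregularized logistic regression
logit = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)

# A single-layer "perceptron": one sigmoid output unit, full-batch gradient descent
w, b = np.zeros(n_items), 0.0
for _ in range(20_000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / n                         # gradient of the cross-entropy loss
    b -= 0.1 * np.mean(p - y)

print(np.round(logit.coef_.ravel(), 2))
print(np.round(w, 2))                                    # the two weight vectors should closely agree
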
So far, both models provide evidence for an explanation of the relationship between the items and their ability to predict group membership. For example, in the regression case it seems likely that there might be five intervening variables—the Big Five personality dimensions—given the nature of the original questionnaire. But there is a very significant human intervention in this, one that pre-dates the data themselves and lies in the design of the questionnaire and the stages gone through in its psychometric development. The ML algorithm, on the other hand, is very much on its own—it seems a breach of procedural protocol to interfere with the machine's dependence on what might be seen as "pure data."

Furthermore, once we allow multiple layers, the perceptron diverges in other ways. Unlike regression, it no longer needs to be linear. Why this is the case requires some more explanation. Classical logic defines a few simple relationships between two statements; let's represent them as X and Y. Both may be true (X and Y), neither may be true (neither X nor Y), and one may be true and the other not (X or Y). These all have the property of being linearly separable—that is, if the four possible combinations of X and Y being true or false are plotted in a graph, a straight line can divide the cases where the relationship holds from those where it does not. But classical logic also describes another relationship between X and Y, one in which either X is true or Y is true, but not both and not neither. This logical condition is called "exclusive or," or XOR, and here no such linear divide is possible. Each of the logical relationships X and Y, X or Y, and XOR can be represented in switches such as those found in a simple radio valve or computer program, and hence all can be involved in computing and utilized by ML algorithms. However, regression in statistics is derived from linear statistics (even so-called nonlinear regression and logistic regression depend on linear transformation in order to approximate the classical linear solution). This difference has opened up a chasm between the two approaches. It was Marvin Minsky and Seymour Papert who, in their 1969 book Perceptrons, demonstrated to an initially skeptical audience that a single-layer perceptron cannot solve the XOR conundrum, although a multilayer perceptron with just two intervening nodes can—the first step in differentiating ML from classical statistical analytic approaches.

But it was still explainable, or at least it was with just a very few intervening nodes. The machine, however, is not constrained in the same way as the human mind. It does not need to explain anything. So why stop there? Why not simply allow the algorithm to have many intervening nodes and layers—what is now called deep learning—as many as are required to make the prediction the best possible? If all that mattered was the accuracy of the prediction, then money was to be made. Why not use postcodes to predict insurance risk or creditworthiness, use biographical data to predict probability of reoffending, use online clicks to predict likelihood of increased return on investment, or use psychological profiles derived from digital footprints to predict voting intention? Machine learning had evolved to a new level, one where it could dramatically increase its ability to predict the future, but in a manner that would for ever elude the limited processing power of human beings. The availability of a predictive technology that addresses nonlinearity had important implications for several long-standing problems in psychometrics.
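Returning to the XOR point above, the following is a hand-built illustration (weights chosen by hand rather than learned, and not taken from the book): no single linear threshold on X and Y reproduces exclusive-or, but a perceptron with just two intervening nodes does.

import numpy as np

def step(z):                                   # a simple threshold "neuron"
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

h_or = step(X @ np.array([1, 1]) - 0.5)        # hidden node 1: fires for "X or Y"
h_and = step(X @ np.array([1, 1]) - 1.5)       # hidden node 2: fires for "X and Y"
out = step(h_or - h_and - 0.5)                 # output: "or, but not and", i.e. XOR

print(out, bool((out == xor).all()))           # [0 1 1 0] True
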
For example, it has often been noted that a limitation of the use of psychometric tests is their tendency to produce "clone workers." If a particular personality profile is identified as that of, for example, the ideal salesperson, then the use of the test will tend to select a team of individuals who all have this profile. This is in principle undesirable, as any effective team depends not on sameness but on balanced diversity among its members. However, we know that there are many different combinations of traits that may be equally effective in producing an ideal salesperson. Different individuals arrive at their particular set of marketing skills through myriad pathways. An ML algorithm trained to recognize the possibility of diverse pathways to the same standards of excellence could potentially outperform any paradigm from classical psychometrics that is by its nature restricted to linear prediction. When nonlinearity adds significant power to the prediction, we have the option of promulgating a new design for the test and often the inspiration for a new theory. If we choose this route, however, we are still left with the second question—can we justify our procedures at a sufficiently precise level to be understood and accepted by our community of fellow professionals?

Explainability

Our legal and social systems demand much more of decision-making than mere predictability, however accurate or profitable it may be. In the previous examples, it may be that in the past men were more likely to be promoted than women. Given a simple multilayer perceptron, we might well be able to discover that one of the intermediate nodes was simply using the questionnaire data to identify a candidate as male or female, and hence enhancing its ability to make a more accurate prediction based on data already contaminated by human prejudice. Similarly, we may find nodes utilizing poverty, ethnicity, age, and so on, that have been leveraged in a similar manner. Any organization or psychometrician who did this would immediately be in breach of equal-opportunity legislation. It is an increasing recognition of this that has led to the focus on explainability in AI. But is it practical to demand that AI only be used where these decisions can be explained, given that explanation becomes impossible even at what for the machine is quite a simple level of computation?

Given that one of the main purposes of psychometrics is to develop tests that predict the future, to what extent can ML algorithms be incorporated into the discipline? One of the main principles of "the science of psychological assessment" is validity, and predictive validity is crucial to successful diagnosis or employment. If ML can improve the accuracy of such predictions, then is the role of psychometrician one of the many that are destined to be replaced by an AI?

Explainability is about the question "why," and there are many ways of answering it. In science, the question is largely about causes—when I ask "why did this happen?" I mean "what caused this to happen?" The relationship between two coincidental facts, for example postcodes and crime, does not need to be causal, but if such a relationship is found then from a human perspective it does immediately lead us to ask "why is this so?" While it has been a common refrain in statistics that "correlation isn't causation"—a simple significant relationship between two variables, as exemplified in a correlation coefficient, does not imply causality—it has often been pointed out that in many cases such a causal relationship does exist. Evidence for this might be that the cause is present when the effect occurs, that the cause precedes the effect, or that alternative explanations other than the direct causal one can be ruled out. But big data has challenged our traditional ways of investigating this. With enough data, any relationship, however minute, between two variables will become statistically significant, and when it comes to databases of the size we have today, there is no longer any real possibility of nonsignificance. Strictly speaking, we know that infinity plus one is still infinity, but poetically speaking this fails to do justice to the enormous number of possibilities involved. However many objects there are in the universe, the number of possible relationships between them is combinatorially greater: every pair, every triple, and so on. Hence even the most enormous AI in the universe would need to have some way of reducing explanatory pathways; it's not just the human brain that is going to require shortcuts.
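The point about sample size can be illustrated with a small calculation, offered here as an assumption-laden sketch rather than anything from the text: a correlation as trivial as r = 0.01 becomes overwhelmingly "significant" once a million cases are available.

from math import sqrt
from scipy import stats

r, n = 0.01, 1_000_000
t = r * sqrt((n - 2) / (1 - r ** 2))       # t statistic for a Pearson correlation
p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-tailed p value
print(round(t, 1), p)                      # t is about 10; p is about 1.5e-23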


Any analysis of big data demands much more than simple tests of statistical significance, and attempts to reduce links between variables have been a problem shared by statisticians and computer scientists alike. In statistics, path diagrams and structural equation modeling became popular in the 1960s and proved to be a useful way of limiting the number of possible connections between variables. Particularly following the work of the geneticist Sewall Wright, it became important that some pathways be identified as direct and others as indirect, that functions be assigned to psychometric latent variables such as intelligence, and that the different significance of confounder and collider variables be taken into account. A collider is a variable that is influenced by two others; it blocks the association between them unless it is conditioned on. A confounder is a common cause of two variables, and it induces a correlation between them even when neither affects the other.

While causal explanations were initially somewhat frowned upon, they eventually became accepted, because causal assumptions as to the direction of effects made sense and introduced much-desired elements of simplicity. Designers of neural networks avidly adopted Occam's principle of parsimony, so that today causal networks are very much in vogue (Pearl & Mackenzie, 2018). Explainable AI, sometimes called XAI, is now receiving considerable attention in many jurisdictions. How the results of this will integrate into our preexisting traditions within psychometrics has yet to be seen.
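The confounder/collider distinction can be illustrated with a small simulation (again, only a sketch; the variables are invented). A common cause produces a correlation between two otherwise unrelated variables, while selecting on a common effect, a collider, creates an association where none existed.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Confounder: C causes both X and Y, so X and Y correlate despite no direct link
C = rng.normal(size=n)
X = C + rng.normal(size=n)
Y = C + rng.normal(size=n)
print(round(np.corrcoef(X, Y)[0, 1], 2))                      # about 0.5

# Collider: A and B independently cause Z; selecting on Z induces an association
A = rng.normal(size=n)
B = rng.normal(size=n)
Z = A + B + rng.normal(size=n)
selected = Z > 1                                              # "controlling for" the collider
print(round(np.corrcoef(A, B)[0, 1], 2))                      # about 0.0 in the full sample
print(round(np.corrcoef(A[selected], B[selected])[0, 1], 2))  # clearly negative once we condition on Z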

Psychometrics in cyberspace

The use of latent variables in psychometrics dates from Edgeworth's (1888) theory of true scores (see Chapter 4), which was the first application of vector algebra in the discipline. Other spatial analogies, not yet cyber, followed: Pearson's principal components analysis, published in 1901; Spearman's factorial analysis of IQ data in 1905; and Thurstone's rotation of factors to simple structure in 1934. None of this would have been possible without the use of dimensional concepts that could enable visualization of the latent variable in space. These methods all use vector or matrix algebra, which is calculation intensive, and hence computing power was widely sought after. The introduction of the world's first minicomputer, the PDP-1, in 1960 saw an explosion of activity in all the computational sciences, psychometrics among them. Atlas, the world's first supercomputer, soon became an essential acquisition for leading university campuses. The 1970s brought the introduction of packet-switched networks and hypertext. The first made supercomputers widely accessible, enabled electronic communication such as email through online networks, and ultimately gave rise to the internet (a network of networks); the second enabled the World Wide Web. But it was the commodification of networked personal computers toward the end of the last century that set the trend for psychometricians. It was possible for early-career psychometricians and computer scientists delving into the possibilities of these new communications systems to escape from the computer lab and launch themselves and their applications into the new medium they had created. But they were not alone. Hackers, virus designers, and computer gamers were in there too—cyberspace had been born, but it was not yet intelligent.

What and where is cyberspace?

To understand what has happened, is happening, and will happen in this new realm it is necessary to investigate the history, properties, and evolution of cyberspace. It is new. Just as there could have been no science of biology without life, and no science of psychology without humans, there can be no true understanding of modern online communication without cyberscience. Cyberscience is the science of cyberspace, but does cyberspace really exist? The philosopher Andrea Monti (2001), in "Souls Writing on the Net", argues: "There is no need for cyberspace. All internet related things can be handled with existing conceptual categories. The internet has nothing to do with technology. It is just a person talking to another person, using a different technology." In 2011, however, the UK Government published a new Cyber Security Strategy document that defined it rather differently: "Cyberspace is an interactive domain made up of digital networks that are used to store, modify and communicate information. It includes the internet, but also other information systems that support our businesses, infrastructure and services." In 2016 the same group identified threats within cyberspace as a Tier 1 threat to the nation.

The medium is the message

Marshall McLuhan (1964) had the legendary insight that "the medium is the message." That is, the personal and social consequences of messages are transformed by their method of communication, whether they be speech, writing, publishing, telegraph, telephone, art, sculpture, architecture, radio, or TV. Today, his catchphrase has never been more relevant. But the existence of cyberspace remains a conundrum. Where is it, what is it, how is it, why is it? After all, most media of communication, such as the book, telephone, or television, sit quite comfortably in the physical world. But the real world of our daily lives, like the flat earth that our ancestors knew, differs greatly from that of modern astrophysics. Without a completely different conception of space than that involved in, for example, going to the shop, our modern understanding of the universe, its galaxies, their stars, their planets, would not be possible. Perhaps we need a new science that serves for the mental realm as geometry does for the astronomical one. Psychology itself is no longer adequate; since the dawn of the machine, this is no longer about consciousness but about intelligence. Cyberspace is the realm within which humans and intelligent machines, whether ML or AI, interact.

Before machines could be intelligent, this was rather a dull environment. After all, humans can communicate perfectly easily with each other without it. But the ability of machines to learn to discriminate between different human digital footprints was a game changer in so many ways. It made online marketing possible in real time and to many individuals simultaneously, each exquisitely microtargeted to their own unique digital footprints. The income from these activities has been almost the sole source of funding for the massive expansion of the internet, and much of this was reinvested into making the machines better at learning about human foibles. At the same time, humans increasingly exposed their hopes and fears to each other, providing more fodder for the learning algorithms. After this, humanity was never going to be the same.

In the 1980s, the medium through which our online messages passed did not appear to be of any particular importance. But this is no longer the case. As soon as we pick up a phone or a keyboard, our digital presence and its footprint is registered—we become a target. In what has become a dark forest, digital machine presences try to sell to us, influence us, learn from us, target us, persuade us, trick us, and sometimes even eliminate us. It is a world reminiscent of Dante:

"In the midst of the journey of life I found myself in a dark forest, for the straight path was lost." (Dante Alighieri, Inferno, 1320)

Or more recently:

"The universe is a dark forest. Every civilization is an armed hunter, stalking through the trees like a ghost, gently pushing aside branches that block the path and trying to tread without sound. Even breathing is done with care. The hunter must be careful, because everywhere in the forest are hunters like him. If he finds other life—another hunter, an angel, a demon, a delicate infant or a tottering old man, a fairy or a demigod—there's only one thing he can do. Open fire and eliminate them. In this forest, hell is other people." (Liu, 2015)

For Liu Cixin, this is the explanation of the Fermi paradox (the failure of the search for extraterrestrial intelligence), but the dark space that can potentially exist between humans when they communicate is not new: "For now, we see through a glass darkly, but then face to face; now I know in part; but then shall I know even as I am known." (St Paul, The First Letter to the Corinthians 13:12–13)

Moral development in AI

Causality is not the only way of dealing with the question why. Human actions, in particular, can be explained in terms of their reason or purpose—answers to the question "why did you do that?" or "why did that happen?" Purposes, however, are somewhat poor substitutes for causes within science. Teleological explanations that depend on future causes, such as the biblical "It came to pass in order that the words of the prophets might be fulfilled," seem almost alien today. In terms of present actions, however, humans will often refer to ethical reasons rather than causal ones: "Even though it did not have the best outcome, it just seemed to me to be the right thing to do." This could be important, because many of the solutions that AIs suggest to us fail on grounds of ethics rather than other signifiers of success, such as personal gain or financial profit. Which raises the question "can machines override the search for advantage with a respect for ethical values?" And not just "can they be taught to do it?"—"could they learn to do it?"

One of the first things we learn as children is the ability to defer gratification—to put off immediate benefits today for greater gains tomorrow. We also learn the principles of quid pro quo: "If we treat them with respect, they will (hopefully) do the same to us." In the classical world this was seen as a very rational way of behaving. Indeed, the contrast between moral vices and virtues was seen as very much an evolutionary process, from the passion of the animal or irascible child to the sentiments of the rational being and ultimately the divine. It is a moot point whether humans could learn to adopt these moral values without language, but if we assume that a linguistic community that transcends the individual is a necessary part of the process, perhaps the continuous interaction of artificial beings in cyberspace may present for them the same opportunities to learn.

Kohlberg's theory of moral development

The best-known psychological theories of moral development in humans are those of Piaget and Kohlberg, with Kohlberg perhaps having the edge in terms of the sophistication of his theory. He argues that moral development progresses through three levels: pre-conventional (up to the age of 9), conventional (most adolescents and adults), and post-conventional (10%–15% of over-20s). Each level has two stages.

Level 1: Pre-conventional

• Stage 1: Obedience. Wrong behavior is defined by what receives punishment. If a child is told off for stealing, then stealing is automatically wrong.
• Stage 2: Self-interest. Right behavior is determined by what serves one's own interests; at this stage, any concern for what others want is motivated only by selfishness.

Level 2: Conventional

• Stage 1: Conformity. Being good is whatever pleases others—the child adopts an attitude to morality in which right and wrong are determined by the majority.
• Stage 2: Law and order. Being good means doing your duty to society. At this level, importantly, no distinction is made between moral and legal principles. What is right is what is handed down by authority, and disobeying rules is always bad by definition. To this end, we obey laws without question and show a respect for authority.

Level 3: Post-conventional

• Stage 1: Social contract orientation. Right and wrong are now determined by personal values, although these can be overridden by democratically agreed laws. When laws infringe our own sense of justice, we can choose to ignore them.
• Stage 2: Universal ethical principles. We live in accordance with deeply held moral principles that are seen as more important than the laws of the land.

Originally, Kohlberg—very much following Piaget’s stages of cognitive development—put level 1 as up to age 5, with stage 1 being infancy and stage 2 being preschool. Level 2 occurred largely in school-age children, and level 3 among teenagers (many of whom were still at school) and most adults. Later, he believed that very few adults reached any of the level 3 stages. While Kohlberg’s approach has come in for considerable criticism as being without a multicultural perspective, somewhat elitist, and philosophically circular, it does provide a framework within which we can compare the behavior of humans and machines from a moral perspective.


Do machines have morals?

Suppose we consider the applications of AI to advertising, primarily the maximization of return on investment through various forms of online clickbait. This is about nothing more than reward and punishment: more clicks are rewarded, fewer clicks the opposite. This looks very much like Kohlberg's level 1 stages. Hence, if we were to match the AI algorithm to a human, we could say that it had equivalent moral development to a child of about three years of age. The only difference is that it is not parents but corporations that are doing the training.

But could it ever go further than this? If it were to follow the development of human psychology, then to reach the next of Kohlberg's levels the machine would need at least some conception of the other. In humans this requires a theory of mind—being able to recognize intentions and beliefs in others and recognize that they also have emotions, desires, knowledge, and strategies. While this may seem a tall order for machines, we can at least assume that for children the internal working model that forms the basis of such thinking is based on some form of imitation. By mirroring the behavior of others, we develop the capacity to recognize similar patterns in ourselves. It might at present seem difficult to imagine how a machine could be designed to do this. However, we already have machines that talk and interact with us in the form of smart functions such as Siri, Google Assistant, and Alexa. Issues have also been raised about how children interact with these devices. This has become increasingly important as children, young children in particular, tend to treat them as people, and hence we need to ensure that suitable levels of politeness and respect are adhered to. But in educating our children in this way, are we not also training the machines in reverse—that is, to mirror the behavior of others? Certainly, they are learning to behave as if this is the case, which may be all that is needed. If this is so, then we may reasonably expect that in the near future the more complex AIs and smartbots will have the moral capacity of a seven- or eight-year-old child. But of course, an AI is not an eight-year-old child; it is far more powerful, and we are very careful to limit the boundaries of children's ability to influence events, both for their own safety and for the safety of others. Without some form of discipline here, in the form of regulation, our AI entities may begin to behave like the animated brooms in Disney's Fantasia.

What happens when a child does grow into an adult yet fails to develop morally? Within psychiatry there are long-standing disputes about the diagnosis of such a condition. Cleckley (1941), in The Mask of Sanity, described it as psychopathy, a developmental disorder characterized by a lack of empathy, remorse, or shame, and often associated with superficial charm. While it is easy to see a correspondence between such a concept and a failure to progress along Kohlberg's evolutionary path for moral development, there is contradictory evidence here. It does seem that psychopaths are able to give the "right" answers when asked moral questions. It seems that they do have an understanding of what these terms mean at the cognitive level. Yet although they know what others believe to be the right thing to do, they lack any imperative to follow these values themselves. For whatever reason, their moral and emotional development, if it happens at all, follows a divergent path from that of most teenagers and adults.
The most precise definition of the condition can be found in the American Psychiatric Association's Diagnostic and Statistical Manual (see Chapter 7). The principal diagnostic criteria of what has been renamed antisocial personality disorder are failure to conform to social norms with respect to lawful behaviors, deceitfulness and cunning, reckless disregard for the safety of others, and consistent irresponsibility. AI machines were once considered to be purely cognitive. We know now that they can not only recognize human emotions very easily but can also, like psychopaths, behave as if they experience them. When it comes to morality, they have yet to evolve beyond the level of the psychopath.

The laws of robotics

If we continue to allow AI to evolve without any capacity to follow the principles of Kohlberg's level 3, then we are in serious trouble. The danger was recognized by Isaac Asimov (1950) in his book I, Robot. He envisaged three Laws of Robotics that might address the problem:

1 A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2 A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
3 A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Would this work? Almost certainly not, as it is rather too similar to Kohlberg's levels 1 and 2. No, somehow our modern-day AI-driven robots need to develop their own morality—to come up with these values themselves, rather than having them prescribed by sources of authority. For Asimov, this was just fiction. But not any longer. Probably the closest we have today to one of Asimov's robots is the self-driving car. However, current attempts to instill some principles into these devices, for example in making judgments on competing risks to either the driver or a child who has unexpectedly run out into the road, are only at level 2, comparable to teaching our children not to be rude to smartbots. Current driverless cars cannot be deceitful, or express remorse or shame, even though they can still cause accidents. They have no theory of mind. Current AI-enabled military drones are also incapable of being responsible for their actions, however many people they kill.

Artificial general intelligence

So there is clearly some way to go. Although AI may be able to learn to behave as if it had intentions, it seems odd to attribute such motives to a machine. It might perhaps one day be the case within AGI (artificial general intelligence), but we're not there yet. But does it actually matter whether such behavior is driven in this way? If a machine could pass the Turing test in terms of complex moral issues, why should it matter whether the machine were generalizing moral principles, or whether it were conscious or not? Many believe that the idea of conscious AI is an absurdity, but they may be sleepwalking into an unwanted future. Much of our interaction with other humans has nothing to do with ascriptions of consciousness. If I ask the price of something in a shop, I am not in the least concerned over issues of whether the shopkeeper is a conscious individual or not—I just want to know the answer. Similarly, when I get help from my accountant on my tax return, the question of whether the software he used exhibits consciousness or not is far from my mind. Increasingly, our professional advisors are depending on the support of AI, which itself increasingly depends on deep-learning algorithms that are unexplainable (see Chapter 4). Increasingly, it is corporations, rather than individual professionals, that provide this advice. These corporations are legal entities in their own right, and increasingly they depend on machine-enabled information analysis that is beyond the comprehension of any of their human directors. Are even the tech giants of today still truly human? Or has the era of artificial general intelligence already arrived, unheralded?

Conclusion

Psychometrics has come a long way since the 19th century. We have seen it evolve through IQ testing, the testing of multiple intelligences, personality, attitudes, beliefs, integrity, motivation, and a plethora of other human attributes. Information technology has increased its capacity manyfold, so that modern techniques are able to diagnose our abilities, personality, desires, intentions, and likely behaviors with even more accuracy. Tech companies have turned this capacity into a money machine that continues to feed their global growth at an ever-increasing rate. At present these corporations are still accountable to the law of the land in which they operate, so that regulation is still possible. But national governments too are increasingly turning to AI solutions to regulate and control their populace, giving them powers that were previously unimaginable. The major challenge of our time is to ensure that these AI-empowered governments remain accountable to us, their citizens.

References

Alighieri, Dante (1935). The Divine Comedy of Dante Alighieri: Inferno, Purgatory, Paradise (pp. 1265–1321). New York: The Union Library Association.
Allport, G., & Odbert, H. (1936). Trait names: A psycholexical study. Psychological Monographs, 47(1), 1–171.
Asimov, I. (1950). I, Robot. New York: Gnome Press.
Atkinson, R. D., Brake, D., Castro, D., Cunliff, C., Kennedy, J., McLaughlin, M., McQuinn, A., & New, J. (2019). Guide to the "Techlash": What it is and why it's a threat to growth and progress. Information Technology and Innovation Foundation, October 28, 2019.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44(1), 1–26.
Binet, A., & Simon, T. (1916). The Development of Intelligence in Children. Baltimore: Williams & Wilkins.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (eds.), Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Bloom, B. (1956). Taxonomy of Educational Objectives. New York: Longmans.
Bond, M. H. (1997). Working at the Interface of Culture: Eighteen Lives in Social Sciences. London: Routledge.
Boring, E. G. (1957). A History of Experimental Psychology, 2nd edition. New York: Appleton-Century-Crofts.
Brown, T. A. (2006). Confirmatory Factor Analysis for Applied Research. New York: Guilford Press.
Cattell, R. B. (1957). Personality and Motivation Structure and Measurement. New York: World Book.
Cheung, F., Leung, K., & Zhang, J. (2001). Indigenous Chinese personality constructs: Is the five factor model complete? Journal of Cross-Cultural Psychology, 32, 407–433. doi: 10.1177/0022022101032004003.
Cleckley, H. M. (1976). The Mask of Sanity (5th edition). London: Mosby.
Cleckley, H. (1941). The Mask of Sanity: An Attempt to Reinterpret the So-Called Psychopathic Personality. St Louis: C. V. Mosby.
Concerto (2019). Cambridge, UK: The Psychometrics Centre, University of Cambridge. Retrieved from https://www.psychometrics.cam.ac.uk/newconcerto.
Darwin, C. (1871). The Descent of Man, and Selection in Relation to Sex. London: John Murray.
Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray.
Darwin, C. R., & Wallace, A. R. (1858). On the tendency of species to form varieties; and on the perpetuation of varieties and species by natural means of selection. Journal of the Proceedings of the Linnean Society of London, Zoology, 3(9), 45–62.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51(3), 599–635.
Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Multivariate Applications Books Series. Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Eysenck, H. J. (1967). The Biological Basis of Personality. Springfield, IL: Thomas Press.
Eysenck, H. J. (1970). The Structure of Human Personality (3rd edition). London: Methuen.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (ed.), Educational Measurement. The American Council on Education/Macmillan Series on Higher Education (pp. 105–146). New York: Macmillan Publishing Co., Inc.; American Council on Education.
Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological Bulletin, 95, 29–51.
Flynn, J. R. (2007). What is Intelligence? Beyond the Flynn Effect. Cambridge, UK: Cambridge University Press.
Flynn, J. R. (2016). Does Your Family Make You Smarter? Nature, Nurture, and Human Autonomy. Cambridge, UK: Cambridge University Press.
Galton, F. (1865). Hereditary talent and character. Macmillan's Magazine, 12, 157–166. Note: Galton's misspelling of Phillipps as Phillips.
Galton, F. (1884). Measurement of character. Fortnightly Review, 42, 179–182.
Gardner, H. (1983). Frames of Mind: The Theory of Multiple Intelligences. New York: Basic Books.
Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of the big five personality domains. Journal of Research in Personality, 37, 504–528.
Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York, NY: John Wiley & Sons.
Herculano-Houzel, S. (2012). The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proceedings of the National Academy of Sciences (PNAS), 109(Supplement 1), 10661–10668. First published June 20, 2012. https://doi.org/10.1073/pnas.1201895109.
International Cognitive Ability Resource (2018). Retrieved from icar-project.com.
International Personality Item Pool (IPIP) (2006). Retrieved from https://ipip.ori.org/.
Jenner, E. (1807). Classes of the human power of intellect. The Artist, 19, 1–7.
Jin, Y. (Ed.) (2001). Psychometrics (pp. 2–9). Shanghai: Hua Dong Normal University Press (in Chinese).
Kleinberg, J., Lakkaraju, H., Leskovec, J. H., Ludwig, J., & Mullainathan, S. (2018). Human decisions and machine predictions. Quarterly Journal of Economics, 133, 237–293.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the USA, April 2013.
LeBreton, J. M., Shiverdecker, L. K., & Grimaldi, E. M. (2018). The dark triad and workplace behavior. Annual Review of Organizational Psychology and Organizational Behavior, 5, 387–414.
Liu, C. (2015). The Dark Forest (The Three-Body Problem). New York: Tor Books.
McLuhan, M. (1964). Understanding Media: The Extensions of Man. New York: McGraw Hill.
Monti, A. (2001). Does Cyberspace Exist? APC Europe Internet Rights Workshop, Prague, February 2001. https://blog.andreamonti.eu/?p=38.
Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. New York: Basic Books.
Popham, W. J. (1999). Why standardized tests don't measure educational quality. Educational Leadership, 56(6), 8–15.
Qi, S. Q. (2003). Applying Modern Psychometric Theory in Examination (pp. 2–5). WuHan: HuaZhong Normal University Press (in Chinese).
Rumelhart, D., & McClelland, J. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. London: MIT Press.
Rust, J. (1997). Giotto Manual. London, UK: Pearson Assessment.
Rust, J. (2019). The Orpheus Business Personality Inventory (OBPI). Cambridge, UK: The Psychometrics Centre, University of Cambridge.
Rust, J., & Golombok, S. (2020). The Golombok Rust Inventory of Sexual Satisfaction (GRISS). Cambridge, UK: The Psychometrics Centre, University of Cambridge.
Rust, J., & Golombok, S. (2020). The Golombok Rust Inventory of Marital State (GRIMS). Cambridge, UK: The Psychometrics Centre, University of Cambridge.

Rust, J., Golombok, S., & Collier, J. (1988). Marital problems and sexual dysfunction: How are they related? British Journal of Psychiatry, 152, 629–631.
Saville, P., Holdsworth, R., Nyfield, G., Cramp, L., & Mabey, W. (1984). Occupational Personality Questionnaires Manual. London: Saville & Holdsworth Ltd.
Sternberg, R. (1990). Wisdom: Its Nature, Origin and Development. Cambridge, MA: MIT Press.
Strong, E. K. (1943). Vocational Interests of Men and Women. Stanford, CA: Stanford University Press.
Stillwell, D. (2007). myPersonality. Retrieved from https://sites.google.com/michalkosinski.com/mypersonality.
Sun, L., Liu, Y., & Luo, F. (2019). Automatic generation of number series reasoning items of high difficulty. Frontiers in Psychology, 10, 884.
Terman, L. M. (1919). Measurement of Intelligence. London: Harrap.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44(4), 703–742.
The UK Cyber Security Strategy: Protecting and promoting the UK in a digital world (November 2011). London: HMSO. https://www.gov.uk/government/publications/cyber-security-strategy.
Thomson, W. (1891). Popular Lectures and Addresses. London: Macmillan.
Thorndike, R. M., & Thorndike-Christ, T. M. (2014). Measurement and Evaluation in Psychology and Education (8th edition). New York, NY: Pearson Education.
Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized Adaptive Testing: A Primer. London, UK: Routledge.
Wang, D. F., & Chiu, H. (2005). Measuring the personality of Chinese: QZPS vs. NEO PI-R. Asian Journal of Social Psychology, 8(1), 97–122.
Wittgenstein, L. (1958). Philosophical Investigations (3rd edition; G. E. M. Anscombe, trans.). Englewood Cliffs, NJ: Prentice Hall.
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences of the USA (PNAS), January 2015.

Index

Note - Page numbers in italic indicate a figure or table on the corresponding page

ability 5–7, 73; ability questionnaires 20, 24–5; Chinese assessment of 4–5; psychometric testing for 11–3 absenteeism 122, 125, 126 abstraction 15, 71 acquiescence effects 21, 27, 92, 118, 120 Adler, Alfred 96 administrative procedures 30, 48, 79, 101, 104; data collection concerns 141; item banks, administration of 153–54; length of test as a factor 40; management of bias 106–9; penciland-paper administration 29; test-retest reliability 38–9 adverse impact 56–7 affirmative-action programs 57 aggression 15, 93, 99, 116, 126; humanistic theory on 96; integrity tests as addressing 125; psychoanalytic approach to 94–5 agreeableness 14, 114, 116, 119, 125 Akaike information criterion (AIC) 148 algebraic normalization 50–1 algebraic transformation 50–1, 59 Allport, Gordon 13, 14, 17, 93 alternate-choice items 24, 25, 26, 76 American Psychiatric Association 122, 123, 166 anal stage in psychosexual development 95 analytic intelligence 11–2 anonymity 137–38 anthropometrics 7 anxiety 95, 96, 105, 110, 116 Aristotle 12 arithmetic testing 5, 13, 39, 61–2 Army testing 9, 152, 154 artificial general intelligence (AGI) 167–68 artificial intelligence (AI) 19, 74, 75, 156, 163; computerized reports, objections to 155; explainability, focus on 161–62; ML algorithms, as dependent on 152; moral development in AI 164–68; reliability,

approaches to 157; system-analytic expert systems, use of 153; true-score theory and 72 Asimov, Isaac 167 assessment, history of 4–7 Assessment of Performance Unit (APU) 77 Atkinson, Robert D. 16, 74 attitude 21, 99, 125, 126, 165; assessment, attitudes towards 108; attitude questionnaires 24, 25; beliefs as distinct from attitudes 17–8; digital footprints revealing 132, 139; negative attitudes 124 authority 16, 119, 124, 126, 165, 167 Babbage, Charles 155 background information in questionnaire design 28 Bandura, Albert 18, 98 Barnum, P. T. 111 Barnum effect 111–12 Basic Human Values Theory 17–8 Beck Depression Inventory (BDI) 73–4, 110 behavioral rating scales 94 behaviorism 120–21 belief 16, 17–8, 56, 112, 123, 127 The Bell Curve (Herrnstein/Murray) 11 Berners-Lee, Tim 2 bias 67, 134, 138, 140, 157; adjusting for 53, 74; assessments of 2, 110, 151; bias-free testing, legal requirements for 55; content validity, checking for potential sources of 43; in CTT test creation 81; equivalence as requiring biasfree testing 38; ethnic and racial bias 109, 139; intrinsic test bias 55–6; item analysis, eliminating bias with 27; political bias 47; as a psychometric principle 52; in rating scales 66, 103, 115; response bias and factor structure 118–19; social desirability bias 118, 135; sources of, managing 57, 106–9; split-half reliability, no systematic bias in 40; teacher bias 8; three major categories of 54; unconscious bias 129

Index big data 1, 2, 3, 76, 112, 144, 152, 161, 162 Big Five traits 14, 119–20, 122, 130, 133 Big Seven Chinese Personality Scale (QZPS) 116–17 binary data 70, 76, 92 Binet, Alfred 8, 9 biometrics 1, 7 Birnbaum, Allen 78 black box 58, 74 Bloom’s Taxonomy 21 blueprint building 26, 33, 74; content validity 35; GRIMS item examples 27–8; making a blueprint 20–3 bodily-kinesthetic intelligence 12 Bond, Michael H. 116 Bondone, Giotto di 121 brain functioning 12, 45, 66; neuron connections 65, 157, 158; polygraph tests, brain waves measured in 105 British Ability Scales 154 British Army Recruitment Battery (BARB) 154 British East India Company 5 Bunyan, John 121 Cambridge Assessment 3 Cambridge Judge Business School 113 Cattell, James McKeen 7 Cattell, Raymond B. 18, 120; 16PF model, as developer of 14, 70, 101, 114, 117; 24 scale points, using for IQ test scores 50; Cattell scree test 67 Cattell scree test 67, 92 central processing units (CPUs) 158–59 centroid technique 65 charity, virtue of 121 Chess, Stella 17 Cheung, Fanny 116 China 4, 116–17 Chinese Personality Assessment Inventory (CPAI) 116 chi-square test 7, 50 “Classes of the Human Powers of the Intellect” (Jenner) 6 classical test theory (CTT) 80–2, 88 Cleckley, Hervey 122, 166 clickbait 4, 166 client-centered therapy 96 clinical psychology 62, 73 clone workers 160 codes of conduct 155 cognitive skills 12, 98 colonialism 6 competency measures 52, 56, 72–4, 113 computer adaptive testing (CAT) 3, 79, 129; CTT scoring incompatibility 81, 82; IRT models, use of 153; principles of 89–91 computerization, history of 152–54


concurrent validity 43, 44, 151. see also validity confidence interval (CI) 41–2, 80 confirmatory factor analysis (CFA) 92 conformity 17, 116, 119, 120, 165 Confucius 4 conscience. see superego conscientiousness 14, 15, 45, 114, 116, 120, 122 construct validity 43, 44–5, 74 continuous assessment 2, 30, 100, 120, 130 conventionality 15, 16, 97, 116, 123, 165 correlation: correlation coefficient 7, 39, 40, 44, 63, 161; correlation matrices 7, 60, 61, 64, 70; product-moment correlation coefficient 32–3, 34, 35 cosine 62–4, 65, 70 Costa, Paul 114–15, 117 creative intelligence 11, 12 creativity tests 12, 42, 70–1 criminality 10, 14, 121, 122 criterion-referencing 43, 46, 48, 51–2, 58, 72 Cronbach’s alpha 34, 40 CTT. see classical test theory Cui, Lizhen 116 cultural differences 53, 99 cyberscience 163 cyberspace 3, 4, 18–9, 162–63, 165. see also internet Dante Alighieri 121, 164 dark side of personality 125 Darwin, Charles 6–7, 18 deep learning 72, 74, 108, 159, 160, 167 defense mechanisms 95 depression 102, 140, 151; in Affective/ Inconsistent personality 124; Beck Depression Inventory (BDI) 73–4, 110; digital footprints as measuring 131, 143 Descent of Man (Darwin) 7 detail, attention to 116, 120, 121, 124 determinism 94, 120 Development of Intelligence in Children (Binet/ Simon) 8 diagnosing 2, 42, 91, 102, 113; antisocial personality, diagnostic criteria of 166–67; bias in assessment 55; computerized diagnoses 129, 154–55; diagnostic validity 35, 161, 168; expert systems used for 156; of learning disabilities 11, 53; self-diagnosing 140; standardization in 36 Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) 123, 124, 166 diagnostic validity. see validity Differential Ability Scales (DAS) 153–54 differential item functioning (DIF) 54, 55, 57 differential validity 45, 56 digital footprints 19, 160, 163; advantages and challenges of employing 134–38; analysis,



preparing for 141; collecting digital footprints 139–41; data sparsity 142–43; online digital footprints 103, 108; prediction models, building 149–51; types of 130, 131; typical applications of 132–34 Digman, John M. 114 dimensionality reduction 65, 143–44, 145, 150, 151 disability 10, 56, 122; bias and 53, 109; learning disabilities 13; legislation regarding 54 discrimination 6, 40, 54, 74, 123; attempts to address 9; discrimination parameters 84, 85; indirect discrimination 56–7; item discrimination 31, 76, 77–8; in knowledgebased questionnaires 31–2, 32–3; machines, learning to discriminate 73, 163; psychometric measures, reducing with 138; in Rasch model 79; rating-scale items and 26; social class, discrimination based on 109 dishonesty 14, 105, 121, 122, 123, 140 distractors 24, 31, 33 divergent validity 45 domain theory of personality 119 dyslexia 13, 53 Edgeworth, Francis Ysidro 58–9 educability 5, 8 education 2, 39, 43, 57, 70; bias in educational testing 53, 55; digital footprint, educational history as part of 138, 140; educational standards, assessment of 77–8; functional approach in educational testing 73; in medieval Europe 5–6; meritocratic education, development of 9; psychometric testing in education 1, 11, 18; taxonomy of educational objectives, use of 21; WISC-V, use by educational psychologists 13 Educational Testing Service (ETS) 3, 79 ego 94–5, 96 eigenvalues 66, 67 Einstein, Albert 97 electroencephalogram (EEG) 45 Elliott, Colin D. 154 emotional intelligence 12 emotional stability 15, 116 employment testing 43, 52, 72, 113, 114 environment 11, 99, 106; assessment environment, control over 135–36; online environment 125, 130, 134, 139, 163; in social learning theory 94, 97–8; testing environment 129, 135; work environment 16, 104, 126 error 92, 103, 154; in factor analysis 65, 70, 144; fairness in testing and 53; human inability to spot errors 108; random error 59, 78; reliability in testing and 38, 57, 71; standard error of measurement 41–2, 43, 80, 86, 91; in truescore theory 60

essay tests 40, 42, 58–9 ethical concerns 14, 18, 75, 108, 120, 141, 155, 164 ethnicity. see race and ethnicity eugenics 9–10, 18 European Organization for Nuclear Research (CERN) 2 euthanasia 10 evolutionary theory 7, 98, 121, 165, 166 examinations 3, 5, 44; item banks and 76; item offensiveness factor 55; in Rasch model 77; summative examinations control 73 expert systems 152, 153, 156–57 explainability 58, 159–60, 161 explainable AI (XAI) 162 exploratory factor analysis (EFA) 92 extraversion 71, 129, 137, 143, 147; in Big Five model 114, 116, 119; CTT scoring example 81–2; in differential validity assessment 45; EPI as measuring 44–5, 48, 117; in four personality types of Hippocrates 100; introversion-extraversion scale 21, 101, 134; narcissism, correlating with 125; neuroticism, comparing to 14; in personality questionnaires 134–35; predisposition to 99; in trait approach 72 extreme response 27–8 extrinsic test bias 54 Eysenck, Hans J. 14, 44–5, 70, 100, 117 Eysenck Personality Inventory (EPI) 44–5, 48, 114, 117 Facebook 133, 135, 142, 143, 149; Facebook Likes, analysis of 4, 131, 132, 134, 137, 141; Facebook status updates 132, 136; myPersonality app, as published on 3; online advertising platforms, use on 140; singular vectors study example 146–47 Face++ emotion-detection software 132 face validity 35, 43, 53, 111 facility 31–2, 33, 78 factor analysis 7, 14, 48, 144, 152; confirmatory factor analysis 92; five-factor model, as grounded in 114, 115, 117; latent traits, identifying 60–2; matrix inversion as time consuming 15; orthogonal factors, as generating 118; test construction, applying to 66–70; true scores, attempting to identify 74; vector model and 64–5 fairness 52–3, 54, 107, 126 faith 121 feedback 36, 105, 127, 137, 140; in formative assessment 73; from music-streaming platforms 144; from myPersonality Facebook test 3; positive characteristics, emphasizing 111–12 fellowship 116, 119 Fisher information function 85–7 Fiske, Donald W. 114

five factor model 113, 115, 119–20, 122 Flynn, James 11 Flynn effect 11, 18 folk psychology 121 formative assessment 73 fortitude 121 Freud, Sigmund 94–6, 131 functional psychometrics 58 future behavior, predicting 133 “g.” see general intelligence Galen of Pergamon 121 Galton, Francis 7, 9, 13, 18, 100, 114 Gardner, Howard 12 GDPR privacy law 103 gender 36, 56, 110; differential reinforcement 97–8; differential responding 110; gender bias 53, 74, 109; gender stereotypes 99; GRIMS scores and 21, 35; as a linear variable 139; as standard demographic information 4, 28, 31, 137, 138 general intelligence. see under intelligence genetics 10, 162; behavioral genetics 98–100; genetic predisposition 7, 17, 99; intelligence level and genetic heritage 10, 11, 18 genital stage of psychosexual development 95 genius, study of 7 Germany 7, 10 Giotto integrity test 111, 113, 126, 127 global positioning system (GPS) 130, 131 Goddard, Henry 10 Goldberg, Lewis 114 Golombok Rust Inventory of Marital State (GRIMS) 20–3, 27, 29, 31, 34, 35, 37 Golombok Rust Inventory of Sexual Satisfaction 102 goodness-of-fit test 59, 62, 92 Google 136, 138, 149, 155, 166; online advertising platforms offered by 140; popularity of 3, 142; in respondent-footprint matrix example 141, 143; singular vector study, Google.com in 146–47 graphical techniques 105, 150; dimensions, moving into more 64–5; graphical description of item response theory 80–2; graphical representation of ideas 62–4 graphic user interface (GUI) 2 Grimaldi, Elizabeth 125 guessing 77, 78, 86; corrections for guessing 53–4; guessing parameter 84–5; in questionnaires 24–5; in three-parameter model 79, 83 Guilford, Joy P. 93 headings in questionnaires 28, 29 Hebb, Donald Olding 157, 158 Hereditary Genius (Galton) 7 Herrnstein, Richard 10

hierarchy of needs 96–7 Hippocrates 100 histogram 47 Hofstede, Geert 16 Hogan, Robert 123, 124–25 Holland, John 16 Holland Codes (RIASEC) 16 honesty 28, 121, 125, 127, 140 Hong Kong 76, 116 hope 121, 163 human behavior, studying 133, 143, 165 human development 121 human insight 108, 133, 155, 160 humanistic theory 94, 96–7 human relations 12, 116 human-resources management 156, 157 hypertext markup language (HTML) 2 hypothesis-testing in IRT 80, 89 I, Robot (Asimov) 167 IBM Watson 132, 155 ICC. see item characteristic curve id 94–5, 96 identical twins study 98–9 ideology 18, 52, 53, 57, 58 images and audiovisual data 132 immigrant groups 10, 11, 56 impression management 118, 120, 126 “in-basket” technique 104, 109 indecisiveness 27 information processing 158 instructions 109, 125, 126; in “Analytic Engine” design 155; assessment forms, not following instructions on 107; questionnaires, designing instructions for 28, 29 integrity: Giotto integrity test 113, 126; ipsative approach 111; modern integrity testing 121–22; OBPI integrity scales 126–27; in occupational settings 14, 125–26 intellectual domain 7, 12, 116, 120, 122; intellectuality 6; intelligent testing of intellectual abilities 8, 9; personality, intellect as a domain of 119 intelligence 7, 8, 70, 71, 163; digital footprint, measuring via 131, 132; as educability 5; extraterrestrial intelligence 164; general intelligence 9, 13, 45, 60, 61, 62, 133, 144; inherited differences in 10, 18; as a latent variable 162; multiple intelligences 12, 121, 168; social class, confounding with 6; three major forms of 11–2; as an underlying psychological construct 72. see also artificial intelligence. see also IQ testing intelligence quotient. see IQ testing intelligence testing 11, 12, 13, 54, 70; concurrent validity with 44; early intelligence testing 8–9, 10; face validity with 43; general intelligence,

generating scores for 60; outcome variables, intelligence test scores as 149 International Health Exhibition 7 International Personality Item Pool (IPIP) 3, 115 internet 115, 135, 152; cyberspace 3, 4, 18–9, 162–63, 165; development of 2–3; digital data trails 136, 142; en masse assessment via 127–28 interpersonal intelligence 12 interrater reliability 40, 42, 105, 108 interval data 46, 47–8, 76 interviews 2, 21, 113; availability bias 129; interrater reliability 40; interviewer-as-expert model, alternative to 157; ordinal data, interview panels utilizing 46; professional interviewers 156 intrapersonal intelligence 12 intrinsic test bias 54, 55 introversion 100, 116, 129; digital footprint of introverted job candidates 135; Distant/Introverted personality 123; in Eysenck Personality Inventory 14, 44–5; introversion-extraversion scale 21, 101, 134 ipsative scaling 110–11, 126 IQ testing: Spearman’s factorial analysis of 162; trait theory as linked to development of 100; US Army Alpha IQ tests 152 IRT. see item response theory item analysis 27, 33, 80; CFA as appropriate for analysis 92; classical analysis 50, 76, 77, 78, 115, 159; content validity, checking on 35; factor analysis as alternative to 66; item-analysis tables 31–2, 34 item banks 3, 76–7, 90, 115, 153–54 item bias 54, 55 item characteristic curve (ICC) 78, 85, 88 item discrimination 77, 78 item facility 78 item generation 154 item offensiveness 55 item response theory (IRT) 92, 153, 159; aim of 78; difficulty parameter 83–4; graphical introduction to 82–3; guessing parameter 84–5; IRT scoring 88–9; multiple advantages of 91; polytomous IRT 79–80 item scores 70 Jackson Vocational Interest Survey 111 Jenner, Edward 6 job performance 122, 149 John, Oliver P. 114 judgment 5, 8, 13, 86, 98; AI, judgment capabilities of 167; factor analysis and 65–6; fair-mindedness assessment 126; mean judgment 59; profiles, depending on the use of 102; psychometric measures as complementing 138 Jung, Carl 96, 100

justice 71, 121, 165 Kagan, Jerome 17 Kaiser, Henry 66 Kaiser criterion 66, 67, 92 Kaiser Wilhelm Institute 10 Kelly, George 17, 97, 106 Kelvin scale 48 Khayyam, Omar 158 kindness 5, 15, 116 Kleinberg, Jon 138 knowledge-based questionnaires 20, 25, 33; coefficient requirements 34; inappropriate testing methods 39; item analysis 31, 32; multiple-choice options 24, 26; ordering of items 28; sample instructions 28; scoring 30, 36 Kohlberg, Lawrence 165, 166, 167 Lakkaraju, Himabindu 138 language 5, 13, 68, 111, 165; of bizarre/original personalities 123; as a demographic variable 140; idiomatic use of 55; language records as digital footprints 130, 139; in lexical hypothesis 114; natural language 120–21, 144; second-language learners 56; testing, language updates in 81, 103, 119, 136 latency, psychosexual state of 95 latent Dirichlet allocation (LDA) 144, 147–49 latent traits 70, 71, 72, 86, 133; in 3PL model 83, 84; of ability 47, 77; in CTT test creation process 81; factor analysis, identification with 60–2; Fisher information function and 85; in item response theory 80, 91 latent-variable modeling 7, 92 latent variables 65, 74, 162 law and legislation 57, 103, 128, 157, 168; equal-opportunity legislation 53, 74, 161; legal principles, conventional adherence to 165; legislation influence on use of psychometric tests 54; protected characteristics of designated groups 56 layout in questionnaires 29–30 leadership 105, 109, 111, 127 league tables 3–4 learning 43, 158; ability to learn 5–6; deep learning 72, 74, 108, 159, 160, 167; social learning theory 18, 94, 97–8. see also machine learning learning disabilities 6, 8, 11, 12–3 LeBreton, James 125 length factor in true-score theory 72 Leskovec, Jure 138 lexical hypothesis 13, 114 lie-detector tests 14, 105 Likert, Rensis 17 Likert scale 17, 134 linearity 156, 160, 161; factor rotation,

transforming 146–47; fixed linear testing 89, 90–1; gender as a linear variable 139; ML networks and 157, 159; multicollinearity 150 linguistic communities 53, 165 LinkedIn 103, 135, 137 Liu Cixin 164 liveliness, personality trait of 15, 93 Loevinger, Jane 122 logistical curves 82–3, 84, 85 log transformations 50 Lord, Frederic M. 78 Lovelace, Ada 155 loyalty 5, 126, 127 luck 53, 54 Ludwig, Jens 138 lying 14, 105–7, 108, 118, 120, 123 Lykken, David 122 Machiavellianism 125 machine learning (ML) 163; AI, as a field of 155, 166; behavioral predictions, ability to make 128; biases of machine-learning algorithms 108; black box approach, favoring 58, 74; computer adaptive testing and 3; cyberspace filtering, as determining 19; functional model as basis for 73; neural networks 157; parallel-processing computation and 65; PDP, utilizing 158; predicting with 159, 160; as a predictive technique 2, 4; psychometrics, common origins with 152; SVD use and 144 manifestation in blueprints 20–2, 23, 27–8 marital harmony 21, 35 marking schemes 42, 49 The Mask of Sanity (Cleckley) 122, 166 Maslow, Abraham 16, 96–7 mathematics 7, 69, 72, 130, 155; assessment of ability in 48, 49, 68, 73, 77, 80; construct validity example 45; in item response theory 78, 80, 83; logical-mathematical intelligence 12; mathematical statistics, psychometrics evolving alongside 157; singular value decomposition technique 144; vector algebra 58, 62–4, 65, 153, 162 MATLAB programming language 131, 142, 144, 147, 148, 149 McClelland, David 16, 72 McCrae, Robert 114–15, 117 McLuhan, Marshall 2, 163 mean scores 36, 37, 39, 46–50, 55, 59, 83 measurement theory 70–1 median as the average 46–7 memory 39, 129, 135; memory function in computers 142, 155, 156; memory tests 13, 154; single-channel hypothesis 158; in social learning theory 98 Mencius 4 Menninger, Karl 93

MENSA society 13 mental illness 6, 10, 22, 122–23 Mental Measurement Yearbooks (Buros Center for Testing) 122 Mental Perfection, level of 6 Microsoft Cognitive Services 132 The Million of Facts (Galton) 7 Millon, Theodore 100 Millon Index of Personality Styles 110 Minnesota Multiphasic Personality Inventory (MMPI) 102 Minsky, Marvin 160 Mischel, Walter 93, 98 mobile sensors 131 mode 46, 47, 50, 51, 54, 159 modeling 7, 57, 97; complex modeling 153; IRT modeling 78, 91; as observational learning 98; statistical modeling techniques 57; structural equation modeling 92, 162 Monti, Andrea 163 motivation 16 MSCEIT test 12 Mullainathan, Sendhil 138 multidimensional scaling 65–6 multiple choice 28, 31, 76; in 3PL model 84–5; guessing artifact issue 53–4; logistic models, analysis with 92; multiple choice items 24–6 multiple intelligences 12, 121, 168 Murray, Charles 10 Myers-Briggs Type Indicator (MBTI) 100 myPersonality app 3 narcissism 116, 123, 125 narrative reports 111, 154, 156 National Vocational Qualification (NVQ) 52 natural language descriptors 120–21 natural language processing 131, 144 natural selection 7 NEO personality inventories 114, 136 neural networks 150, 152, 157, 158, 159, 162 neuropsychological studies 152 neuroscience 158 neuroticism 14, 44, 99, 100, 111, 112, 114, 116, 117, 119, 120, 151 New York Longitudinal Study 17 nominal data 46–7 normal distribution 47–8, 50–1, 76, 78, 90 normalization 50, 57 Norman, Warren T. 114, 120 normative testing 111 norm referenced testing 46–9, 51, 52, 72 norms 36, 37, 57, 151, 166 Novick, Melvin R. 78 null hypothesis 78 numerical ability 9, 45, 60, 68, 92, 153 objective evaluation 18, 40, 42, 51, 73, 101, 110

objectivity as a personality trait 15, 97 oblique rotation 69, 70 observational learning 98 Occupational Personality Profile (OPP) 101 OCEAN. see five factor model Odbert, Henry S. 13–4 one-parameter model. see Rasch model Ones, Deniz 122 online psychographic targeting 4 openness as a personality trait 14, 15, 114, 116, 119, 120, 125 oral stage in psychosexual development 95 ordinal data 46–7, 76, 92 Organization of Behavior (Hebb) 158 Orpheus Business Personality Inventory (OBPI): integrity scales 126–27; personality scales 119–20; subsets in profile of 154; trait measures, assessing 110; as a work-based personality test 101, 113 orthogonal rotation 69, 70 Papert, Seymour 160 parallel distributed processing (PDP) 65, 158 parallel-forms of reliability 39 parametric data 46–7, 48, 76, 92 parsimony, laws of 92, 162 PDP1 minicomputer 162 Pearson, Karl 7, 162 Pearson Assessment 111, 113, 123, 126, 153 percentile-equivalent normalization 51 perceptual domain 119, 120 perfectionist personality 15, 124 personal construct theory 97, 106 personality: behavioral genetics 98–100; five-factor model 115–18; humanistic theory 96–7; impression management 118; informal methods of assessment 109–10; ipsative scaling 110–11; OBPI personality scales 119–20; observations of behavior 104–5, 108; personality assessment 101–2, 112; psychoanalytic theory 94–5; repertory grid, identifying through 106, 109; reports by others 103, 107; social learning theory 97–8; theories of 93–4; type and trait theories 100–1; in the workplace 113 personality tests 45, 79, 114, 119, 121; 10-Item Personality Inventory 159; classical personality test 115, 120; on Facebook 132; feedback 112, 140; International Personality Item Pool (IPIP) 3, 115; MMPI personality test 102; in occupational settings 113, 114; personality testing 13–4; reliability of 41, 42; self-report results 101; stanine scores, reporting 37; T scores in 49 person-based tests 20, 24, 25, 26, 27, 30, 32, 34 Pervin, Lawrence 93 phallic stage of psychosexual development 95

Phillipps, Thomas 7 physical true scores 71–2 Piaget 165 pilot studies 9, 33, 36, 77, 91; experts, assessment forms piloted by 107, 109; OBPI as piloted in the workplace 119; piloting a questionnaire 30–1; pilot sample, testing for data needed with 23, 141; split-half reliability, calculating 34–5 Plato 5 Platonic true score 71, 72 pleasure principle 94 polytomous IRT 79 Popham, W. James 72 positive discrimination 56 practical intelligence 11, 12 predictive validity 43, 140, 151, 157, 161 principal component analysis (PCA) 4, 144, 162 probability theory 59 product-moment correlation coefficient 32–3, 34, 35 profiling system, constructing 102 projective tests 42, 104 prudence 15, 121 psychiatry 6, 102, 122–23, 166 psychoanalytic theory 94–6 Psychomachia (Prudentius) 121 psychometrics: in 21st century 2–4; ability, testing of 11–3; AI, evolution in psychometrics 155–62; background 1–2; computerization in 152–55; criticisms of 70–4; in cyberspace 162–64; digital-footprint-based psychometric measures 139–51; digital footprints 132–34, 134–39; equivalence principle 52–7; evolution of modern psychometrics 78–80; reliability principle 38–43; as a science 7–10; standardization principle 45–52; validity principle 43–5 psychopathy 102, 122, 125, 166, 167 psychophysics 7, 62, 65 psychosexual development 95 Python programming language 131, 142, 144, 147, 148, 149 quality-of-life 91 questionnaires 16, 29, 37, 132, 137; 16PF model 14, 15, 101, 110, 114, 116, 117; attitude questionnaires 17, 24, 25; Big Five, measuring 119; blame, avoiding attributing 121–22; content areas 21, 22; discrimination use 32–3; disruptive behaviors, assessing 123–24; explainability in 160–61; external factors in test taking 129; factor analysis approach 115, 118, 144; feedback on answers 127, 140; human-resources tests 156; job applicants and personality tests 134–35; masculinity-femininity, measuring 139; MMPI, subscales in 102; norms, role in 36; person-based

questionnaires 20, 24, 25, 26, 27, 30, 32, 34; piloting the questionnaire 30–1; psychometric approach, as derived from 94; ranked data in personality questionnaires 47, 92; sabotage, handling 107; split-half reliability 34–5, 40; student populations, tests designed for 113; traditional tests and questionnaires 133–34, 136, 149–50, 151. see also knowledge-based questionnaires quotas 56 QZPS. see Big Seven Chinese Personality Scale R programming language 131, 142, 144, 147, 148, 149 race and ethnicity 7, 9, 10, 139, 161; affirmative action as de-emphasizing 57; ethnic bias 54, 109; ethnicity as a variable 139, 140; intelligence level, early linking with 18; as a protected characteristic 56; racial data, including in algorithms 74; test bias, race as a factor in 53 random sampling 48 rapport-building 108 Rasch, Georg 77 Rasch model 77–8, 79, 153–54 rating-scale items 24, 25–6, 28, 103 reading comprehension 45 reading scales 8 record tracking 130–31, 135, 139 reinforcement, process of 97–8 reliability 57, 106, 129, 152, 157; correlation analysis for assessing 111; in CTT testing 80–1; distractors and 33; estimate of reliability 34–5, 91; Facebook Likes, reliability in tracking 132; high and low reliabilities 23, 42; integrity, as a sign of 125; internal consistency 40–1; interrater reliability 40, 42, 105, 108; in OBPI personality scales 120; as a psychometric principle 18, 38, 52, 102, 115; restriction of range 42–3; test-retest reliability 38–9, 140, 151 religion 53, 56, 131; ethics as a religious concern 120; religiosity as a variable 139; religious beliefs 17, 111 repertory grids 17, 62, 94, 97, 106, 109 response bias 118–19 restriction of range effect 42–3 Rogers, Carl 96, 97 Role Construct Repertory Test (Rep Test) 106 Rolland, Jean-Pierre 123 Roosevelt, Eleanor 97 Rorschach inkblot test 104 Rosenblatt, Frank 158 rotation 69, 70, 114; factor rotation, linear transformation of 146–47; Thurstone’s rotation of factors 67–8, 162; varimax rotation 147, 149; vector algebra and factor rotation 62–5 Rubaiyat of Omar Khayyam 158 rule consciousness 15

Rust, John 113 sabotage 106, 107, 109 Scholastic Aptitude Test (SAT) 9 Schwartz, Shalom 16–7 science 3, 158, 161; measurement in 71, 74; psychometrics as a science 1–2, 7–10, 18; testing, restrictions on the science of 153; true science 120 scoring 2, 30, 46; computerization of 152, 154; IRT test, how to score 88–9; objectivity in 40, 101; reliability in 38, 42; sabotage, management of 106, 107, 108; singular vectors, interpreting 147; of thematic apperception tests 104 scree test 92 selection 7, 9, 34, 54, 110; China, selection of talents in 4–5; educational selection 8, 76; equivalence factor 52; face validity, checking for 35; in LDA analysis 148; predictive validity and 43–4; psychometrics as the science of 18; in Rasch model 77, 79; in situational assessments 104; trait-based assessment as a tool for 72 self-actualization 94, 96–7, 107, 121 self-concept, notion of 96 self-confidence 120, 124 self-reporting 14, 45, 129; online digital footprints and 108, 135; of personality profiles 101–2, 107; repertory grids as self-reports 109. see also five factor model semantic differential method 17 sensitivity 12, 15, 116, 120, 124 sensory deprivation 45 sex and sexuality 37, 96, 99, 102, 108; in Freudian theory 94–5; homosexuals, euthanasia legalized for 10; sexism 54, 55, 109, 125; sexual orientation 53, 56, 111 Shiverdecker, Levi 125 Simon, Théodore 8 singular value decomposition 144 singular vectors (SVDs) 64, 65, 144, 145, 146–47 situational assessment 104, 105, 108 Sixteen Personality Factor (16PF) model 14, 15, 101, 110, 114, 116, 117 smartphones 29, 130, 131, 139 social background, impact on test scores 9 social class 6, 31, 36, 52, 109 social desirability 27, 111, 118, 120, 126, 135 social learning theory 18, 94, 97–8 social sciences 38, 140, 144 Socrates 5 Sonderkommission 3 law of Germany 10 “Souls Writing on the Net” (Monti) 163 Spearman, Charles 7, 18, 60–2, 100, 162 Spearman-Brown formula 34–5, 40 special needs 55 Spielberger State-Trait Anxiety Inventory 110

split-half reliability 34–5, 40, 41 standard deviation 7, 50, 83; interval data, using 47–8; in test standardization 36–7, 49, 51 standard error of measurement (SEM) 41–2, 43, 80, 86, 91 standardization 36, 37, 50; criterion referencing 51–2; norm referencing 46–9; as a psychometric principle 38, 102, 115, 157; of test characteristics 45–6 Stanford-Binet testing 8, 10, 39, 50, 54, 55 Stanford Heuristic Programming Project 156 State Anxiety Inventory 110 stanine scores 37, 49, 50, 101 statistical models 47, 57, 78, 133, 156 statistical packages 34, 36, 153 statistical programming languages 131, 144, 147, 148, 149 statistical significance 51, 54, 162 sten scores 49, 50 stereotyping 55, 99, 103, 110 sterilization 10 Sternberg, Robert 12 Stillwell, David 3 stopping-rule conditions 90 Strong, Edward Kellogg, Jr. 16 Strong Interest Inventory 16 structural equation modeling 92, 162 subtests 9, 12–3, 60–2, 68, 101, 154 superego 94–5 survival of the fittest 7 T scores 36, 49, 50, 51 talent 4–5, 12, 13, 116 task performance methods 105, 109 TAT. see Thematic Apperception Test TD-12 questionnaire (Inventaire des tendances dysfonctionnelles) 123–24 temperament 13, 17 temperance 121 Terman, Lewis M. 10 test design 12, 43, 152 test-retest reliability 38–9, 140, 151 tetrad difference 61–2 Thematic Apperception Test (TAT) 104 Thomas, Alexander 17 Thomson, William 58 Three PL (3PL) model 83–5 Thurstone, Louis Leon 65, 67, 69, 114, 162 tough-mindedness 15, 116, 119 Trait Anxiety Inventory 110 true-score theory 58–60, 71, 74, 78, 162. see also latent traits

Turing, Alan 155, 167 twin studies 98–9 two-factor solution (Eysenck) 14, 117 two-factor theory (Spearman) 60–2, 68 unidimensional scale 65 United Kingdom (UK) 3, 46, 54, 56, 77–8, 163 United States (US): African Americans, research on 11; eugenics, stance towards 9–10; IRT models and 79, 153; test development in 54, 55, 56, 57 university candidates 9, 18, 38, 44, 48, 56 usage logs 130, 131, 136–37, 139, 141 validity 57, 64, 104, 132, 152; construct validity 43, 44–5, 74; criterion-related validity 51, 122; differential validity 45, 56; ecological validity 129, 134–35, 136; face validity 35, 43, 111, 134; limitations in testing 104, 106; predictive validity 43–4, 140, 151, 157, 161; as a psychometric principle 18, 38, 52, 102, 115; questionnaire changes affecting 33; spurious validity 111–12 values 13, 16–7, 164–65, 166, 167 variance 63, 65, 68; eigenvalues and total variance calculation 66; equivalent variance in correlation matrix 70; measurement invariance 55–6, 57; singular vectors and 145, 146; test invariance 54, 55, 56, 57 vector algebra 58, 62–4, 65, 153, 162 vigilance, personality trait of 15, 123 virtues 5, 121, 126, 165 vocation 52, 72, 110–11 Wallace, Alfred 7 Wang, Haiyang 116 warmth, personality trait of 15, 21, 28 way of life as a Big Seven factor 116 Wechsler Intelligence Scale for Children (WISC) 12–3, 42, 50, 54 wisdom 6, 12 Wittgenstein, Ludwig 158 women 30, 34, 96, 110, 139, 161 World Wide Web 2, 3, 162 Wright, Sewall 162 Wundt, Wilhelm 7 Xunzi 4 Yerkes, Robert 8 z scores 36, 48, 49, 50, 51