Computational Aspects of Psychometric Methods With R 2022059024, 2022059025, 9780367515386, 9780367515393, 9781003054313


English Pages [348] Year 2023



Table of contents :
Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
Notation
Acronyms
Author bios
1. Introduction
1.1. Brief history of psychometrics
1.2. Measurement in social sciences
1.2.1. Educational measurement
1.2.2. Psychological assessment
1.2.3. Health-related outcome measures
1.2.4. Other areas of measurement
1.3. Measurement data in this book
1.4. Psychometrics with R
1.5. Exploring measurement data
1.5.1. Item scores
1.5.2. Test scores
1.5.3. Covariates and more complex data structures
1.6. Modeling measurement data
1.7. ShinyItemAnalysis interactive application
1.8. Summary
2. Validity
2.1. Introduction
2.2. Sources of validity-supporting evidence
2.2.1. Evidence based on test content
2.2.2. Evidence based on relations to other variables
2.2.3. Evidence based on internal structure
2.3. Statistical methods in test validation
2.3.1. Inferences based on ratios
2.3.2. One sample and a paired t test
2.3.3. Two sample t test
2.3.4. More samples - ANOVA
2.3.5. Correlation coefficients
2.3.6. Regression models
2.3.6.1. Simple linear regression
2.3.6.2. Multiple regression
2.3.6.3. More complex designs
2.4. Further issues
2.4.1. Estimation of model parameters
2.4.1.1. Ordinary least squares method
2.4.1.2. Maximum likelihood method
2.4.2. Model selection and model fit
2.4.3. Correction for range restriction
2.5. Validity in interactive application
2.6. Summary
3. Internal structure of the test and factor analysis
3.1. Introduction
3.2. Correlation structure
3.3. Cluster analysis
3.4. Factor analysis
3.4.1. Exploratory factor analysis
3.4.1.1. The single factor model
3.4.1.2. More factors
3.4.2. Factor rotation
3.4.3. Factor scores
3.4.4. The number of factors
3.4.5. Model selection and model fit
3.4.6. Confirmatory factor analysis
3.4.7. Hierarchical and more complex structures
3.5. Internal structure and FA in interactive application
3.6. Summary
4. Reliability
4.1. Introduction
4.2. Formal definition and properties in CTT
4.2.1. Definition of reliability
4.2.2. Reliability as correlation between measurements
4.2.3. Implications of low reliability
4.2.4. Rules of thumb
4.2.5. Reliability of composites
4.2.5.1. Spearman-Brown prophecy formula
4.2.6. Increasing reliability
4.3. Estimation of reliability
4.3.1. Reliability estimation with correlation coefficients
4.3.1.1. Test-retest reliability
4.3.1.2. Parallel forms
4.3.1.3. Split-half coefficient
4.3.2. Cronbach's alpha
4.3.2.1. Cronbach's alpha and inter-item correlations
4.4. Estimation of reliability with variance components
4.4.1. ANOVA method of estimation
4.4.1.1. One-way ANOVA
4.4.1.2. Two-way ANOVA and Cronbach's alpha
4.4.2. Maximum likelihood
4.4.3. Restricted maximum likelihood
4.4.4. Bootstrap confidence intervals
4.4.5. Bayesian estimation
4.5. More sources of error and G-theory
4.5.1. A one-facet study
4.5.2. A two-facet study
4.6. Other estimates of reliability
4.7. Reliability in interactive application
4.8. Summary
5. Traditional item analysis
5.1. Introduction
5.2. Item difficulty
5.2.1. Difficulty in binary items
5.2.2. Difficulty in ordinal items
5.3. Item discrimination
5.3.1. Correlation between item and total score
5.3.2. Difference between upper and lower group
5.4. Item characteristic curve
5.5. Distractor analysis
5.6. Reliability if an item is dropped
5.7. Item validity
5.8. Missed items
5.9. Item analysis in interactive application
5.10. Summary
6. Item analysis with regression models
6.1. Introduction
6.2. Model specification
6.3. Models for continuous items
6.3.1. Linear regression model
6.4. Models for binary items
6.4.1. Logistic regression model
6.4.2. Other link functions, probit regression model
6.4.3. IRT parametrization
6.4.4. Nonlinear regression models
6.5. Estimation of item parameters
6.5.1. Nonlinear least squares
6.5.2. Maximum likelihood method
6.6. Model selection
6.6.1. Likelihood-ratio test
6.6.2. Akaike information criterion
6.6.3. Bayesian information criterion
6.7. Models for polytomous items
6.7.1. Ordinal regression models
6.7.1.1. Cumulative logit model
6.7.1.2. Adjacent-categories logit model
6.7.2. Multinomial regression models
6.8. Joint model
6.8.1. Person-item map
6.9. Regression models in interactive application
6.10. Summary
7. Item response theory models
7.1. Introduction
7.2. General concepts and assumptions
7.2.1. IRT model assumptions
7.2.2. IRT models for binary data
7.2.2.1. Rasch or 1PL IRT model
7.2.2.2. 2PL IRT model
7.2.2.3. Normal-ogive model
7.2.2.4. 3PL IRT model
7.2.2.5. 4PL IRT model
7.3. Estimation methods for IRT models
7.3.1. Heuristic methods and starting values
7.3.2. Joint maximum likelihood
7.3.3. Conditional maximum likelihood
7.3.4. Marginal maximum likelihood
7.3.4.1. Estimation of person abilities in MML
7.3.5. Bayesian IRT models
7.3.6. Item and test information
7.3.7. Model selection and model fit
7.4. Binary IRT models in R
7.4.1. The mirt package
7.4.2. The ltm package
7.4.3. The eRm package
7.4.4. Other IRT packages
7.4.4.1. The TAM package
7.4.4.2. The ShinyItemAnalysis package
7.4.5. The lme4 and nlme packages
7.4.6. Bayesian IRT with the brms package
7.5. Relationship between IRT and factor analysis
7.6. IRT models in interactive application
7.7. Summary
8. More complex IRT models
8.1. Introduction
8.2. IRT models for polytomous items
8.2.1. Cumulative logit IRT models
8.2.1.1. Graded response model
8.2.1.2. Graded rating scale model
8.2.2. Adjacent-categories logit IRT models
8.2.2.1. Generalized partial credit model
8.2.2.2. Partial credit model
8.2.2.3. Rating scale model
8.2.3. Baseline-category logit IRT models
8.2.3.1. Nominal response model
8.2.4. Other IRT models for polytomous data
8.2.5. Item-specific IRT models
8.3. Multidimensional IRT models
8.3.1. Multidimensional 2PL model
8.3.2. Multidimensional graded response model
8.3.3. Confirmatory multidimensional IRT models
8.4. Estimation in more complex IRT models
8.4.1. Maximum likelihood methods
8.4.2. Regularization methods
8.4.3. Bayesian methods with MCMC and MH-RM
8.4.4. Model selection and model fit
8.5. More complex IRT models in interactive application
8.6. Summary
9. Differential item functioning
9.1. Introduction
9.2. Definition
9.2.1. DIF examples and interpretations
9.2.2. Matching criterion
9.3. Traditional DIF detection methods
9.3.1. Delta plot method
9.3.2. Mantel-Haenszel test
9.3.3. SIBTEST
9.4. DIF detection based on regression models
9.4.1. Logistic regression
9.4.1.1. Testing for DIF
9.4.2. Generalized logistic regression models
9.4.3. Group-specific cumulative logit model
9.4.4. Group-specific adjacent category logit model
9.4.5. Group-specific multinomial regression model
9.5. IRT-based DIF detection methods
9.5.1. Group-specific IRT models
9.5.2. Lord's test
9.5.3. Likelihood-ratio test
9.5.4. Raju's test
9.6. Other methods
9.6.1. Iterative hybrid ordinal logistic regression with IRT
9.6.2. Regularization approach for DIF detection
9.6.3. Measurement invariance: Factor analytic approach
9.7. DIF detection in interactive application
9.8. Summary
10. Outlook on applications and more advanced psychometric topics
10.1. Introduction
10.2. Computerized adaptive testing
10.2.1. Item bank
10.2.2. Ability estimation
10.2.3. Item selection algorithms
10.2.4. Stopping rules
10.2.5. CAT implementation in interactive application
10.2.6. Post-hoc analysis
10.2.7. CAT simulation study with MCMC
10.3. Test equating and linking
10.4. Generalizing latent variable models
10.5. Big data and computational psychometrics
10.6. Interactive psychometric modules
10.7. Summary
A: Introduction to R
A.1. Obtaining and running R and RStudio
A.2. Starting with R
A.3. Installation of R packages
A.4. Data handling in R
A.4.1. Data types in R
A.4.2. Wide and long data format
A.4.3. Data handling with tidyverse
A.5. Graphics in R
A.5.1. Graphics in base R
A.5.2. Graphics with the ggplot2 package
A.5.3. Trellis graphics with the lattice package
A.6. Interactive Shiny applications
B: Descriptive statistics
C: Distributions of random variables
C.1. Discrete random variables
C.2. Continuous random variables
D: Measurement data in ShinyItemAnalysis
E: Exercises
E.1. Exercises for Chapter 1
E.2. Exercises for Chapter 2
E.3. Exercises for Chapter 3
E.4. Exercises for Chapter 4
E.5. Exercises for Chapter 5
E.6. Exercises for Chapter 6
E.7. Exercises for Chapter 7
E.8. Exercises for Chapter 8
E.9. Exercises for Chapter 9
E.10. Exercises for Chapter 10
References
Index


Computational Aspects of Psychometric Methods

This book covers the computational aspects of psychometric methods involved in developing measurement instruments and analyzing measurement data in social sciences. It covers the main topics of psychometrics such as validity, reliability, item analysis, item response theory models, and computerized adaptive testing. The computational aspects comprise the statistical theory and models, comparison of estimation methods and algorithms, as well as an implementation with practical data examples in R and also in an interactive ShinyItemAnalysis application.

Key Features:
• Statistical models and estimation methods involved in psychometric research
• Includes reproducible R code and examples with real datasets
• Interactive implementation in ShinyItemAnalysis application

The book is targeted toward a wide range of researchers in the field of educational, psychological, and health-related measurements. It is also intended for those developing measurement instruments and for those collecting and analyzing data from behavioral measurements, who are searching for a deeper understanding of underlying models and further development of their analytical skills.

Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences Series

Series Editors
Jeff Gill, Steven Heeringa, Wim J. van der Linden, Tom Snijders

Recently Published Titles

Big Data and Social Science: Data Science Methods and Tools for Research and Practice, Second Edition
Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane

Understanding Elections through Statistics: Polling, Prediction, and Testing
Ole J. Forsberg

Analyzing Spatial Models of Choice and Judgment, Second Edition
David A. Armstrong II, Ryan Bakker, Royce Carroll, Christopher Hare, Keith T. Poole and Howard Rosenthal

Introduction to R for Social Scientists: A Tidy Programming Approach
Ryan Kennedy and Philip Waggoner

Linear Regression Models: Applications in R
John P. Hoffman

Mixed-Mode Surveys: Design and Analysis
Jan van den Brakel, Bart Buelens, Madelon Cremers, Annemieke Luiten, Vivian Meertens, Barry Schouten and Rachel Vis-Visschers

Applied Regularization Methods for the Social Sciences
Holmes Finch

An Introduction to the Rasch Model with Examples in R
Rudolf Debelak, Carolin Strobl and Matthew D. Zeigenfuse

Regression Analysis in R: A Comprehensive View for the Social Sciences
Jocelyn H. Bolin

Intensive Longitudinal Analysis of Human Processes
Kathleen M. Gates, Sy-Min Chow, and Peter C. M. Molenaar

Applied Regression Modeling: Bayesian and Frequentist Analysis of Categorical and Limited Response Variables with R and Stan
Jun Xu

The Psychometrics of Standard Setting: Connecting Policy and Test Scores
Mark Reckase

Crime Mapping and Spatial Data Analysis using R
Juanjo Medina and Reka Solymosi

Computational Aspects of Psychometric Methods: With R
Patrícia Martinková and Adéla Hladká

For more information about this series, please visit: https://www.routledge.com/Chapman--HallCRC-Statistics-in-the-Social-and-Behavioral-Sciences/book-series/CHSTSOBESCI

Computational Aspects of Psychometric Methods With R

Patrícia Martinková and Adéla Hladká

Designed cover image: Patrícia Martinková and Adéla Hladká

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Martinková, Patrícia, author. | Hladká, Adéla, author.
Title: Computational aspects of psychometric methods : with R / Patrícia Martinková, Adéla Hladká.
Description: First Edition. | Boca Raton, FL : Taylor & Francis, [2023] | Series: Chapman & Hall/CRC statistics in the social & behavioral sciences | Includes bibliographical references and index.
Identifiers: LCCN 2022059024 (print) | LCCN 2022059025 (ebook) | ISBN 9780367515386 (hardback) | ISBN 9780367515393 (paperback) | ISBN 9781003054313 (ebook)
Subjects: LCSH: Psychometrics. | Social sciences--Evaluation.
Classification: LCC BF39 .F278 2023 (print) | LCC BF39 (ebook) | DDC 150.1/5195--dc23/eng/20230405
LC record available at https://lccn.loc.gov/2022059024
LC ebook record available at https://lccn.loc.gov/2022059025

ISBN: 978-0-367-51538-6 (hbk)
ISBN: 978-0-367-51539-3 (pbk)
ISBN: 978-1-003-05431-3 (ebk)

DOI: 10.1201/9781003054313

Typeset in LM Roman by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

To Igor, Šimon, Filip, Lukáš, and to my parents
P.M.

To Miroslav, Jindřich, and to my parents
A.H.

Contents

Preface

xiii

Notation

xvii

Acronyms

xix

Author bios

xxiii

1 Introduction 1.1 Brief history of psychometrics . . . . . . . . . . . . 1.2 Measurement in social sciences . . . . . . . . . . . . 1.2.1 Educational measurement . . . . . . . . . . . 1.2.2 Psychological assessment . . . . . . . . . . . 1.2.3 Health-related outcome measures . . . . . . . 1.2.4 Other areas of measurement . . . . . . . . . . 1.3 Measurement data in this book . . . . . . . . . . . . 1.4 Psychometrics with R . . . . . . . . . . . . . . . . . 1.5 Exploring measurement data . . . . . . . . . . . . . 1.5.1 Item scores . . . . . . . . . . . . . . . . . . . 1.5.2 Test scores . . . . . . . . . . . . . . . . . . . 1.5.3 Covariates and more complex data structures 1.6 Modeling measurement data . . . . . . . . . . . . . 1.7 ShinyItemAnalysis interactive application . . . . . 1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

1 1 3 4 6 6 7 7 10 12 12 16 18 20 22 23

2 Validity 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 2.2 Sources of validity-supporting evidence . . . . . . . 2.2.1 Evidence based on test content . . . . . . . . 2.2.2 Evidence based on relations to other variables 2.2.3 Evidence based on internal structure . . . . . 2.3 Statistical methods in test validation . . . . . . . . 2.3.1 Inferences based on ratios . . . . . . . . . . . 2.3.2 One sample and a paired t test . . . . . . . . 2.3.3 Two sample t test . . . . . . . . . . . . . . . 2.3.4 More samples – ANOVA . . . . . . . . . . . . 2.3.5 Correlation coefficients . . . . . . . . . . . . . 2.3.6 Regression models . . . . . . . . . . . . . . . 2.3.6.1 Simple linear regression . . . . . . . 2.3.6.2 Multiple regression . . . . . . . . . . 2.3.6.3 More complex designs . . . . . . . . 2.4 Further issues . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

25 25 25 26 26 27 27 27 29 32 34 37 41 41 45 47 48 vii

Contents

viii

2.4.1

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

48 48 49 50 51 53 54

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

55 55 55 59 61 62 62 64 65 67 68 69 70 74 75 76

4 Reliability 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Formal definition and properties in CTT . . . . . . . . . . . 4.2.1 Definition of reliability . . . . . . . . . . . . . . . . . . 4.2.2 Reliability as correlation between measurements . . . 4.2.3 Implications of low reliability . . . . . . . . . . . . . . 4.2.4 Rules of thumb . . . . . . . . . . . . . . . . . . . . . . 4.2.5 Reliability of composites . . . . . . . . . . . . . . . . . 4.2.5.1 Spearman-Brown prophecy formula . . . . . 4.2.6 Increasing reliability . . . . . . . . . . . . . . . . . . . 4.3 Estimation of reliability . . . . . . . . . . . . . . . . . . . . . 4.3.1 Reliability estimation with correlation coefficients . . . 4.3.1.1 Test-retest reliability . . . . . . . . . . . . . 4.3.1.2 Parallel forms . . . . . . . . . . . . . . . . . 4.3.1.3 Split-half coefficient . . . . . . . . . . . . . . 4.3.2 Cronbach’s alpha . . . . . . . . . . . . . . . . . . . . . 4.3.2.1 Cronbach’s alpha and inter-item correlations 4.4 Estimation of reliability with variance components . . . . . . 4.4.1 ANOVA method of estimation . . . . . . . . . . . . . 4.4.1.1 One-way ANOVA . . . . . . . . . . . . . . . 4.4.1.2 Two-way ANOVA and Cronbach’s alpha . . . 4.4.2 Maximum likelihood . . . . . . . . . . . . . . . . . . . 4.4.3 Restricted maximum likelihood . . . . . . . . . . . . . 4.4.4 Bootstrap confidence intervals . . . . . . . . . . . . . . 4.4.5 Bayesian estimation . . . . . . . . . . . . . . . . . . . 4.5 More sources of error and G-theory . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

77 77 78 78 79 80 80 81 81 83 84 84 84 85 85 87 88 89 89 89 92 96 97 99 100 101

2.5 2.6

Estimation of model parameters . . . . 2.4.1.1 Ordinary least squares method 2.4.1.2 Maximum likelihood method . 2.4.2 Model selection and model fit . . . . . . 2.4.3 Correction for range restriction . . . . . Validity in interactive application . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

3 Internal structure of the test and factor analysis 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . 3.2 Correlation structure . . . . . . . . . . . . . . . . 3.3 Cluster analysis . . . . . . . . . . . . . . . . . . . 3.4 Factor analysis . . . . . . . . . . . . . . . . . . . . 3.4.1 Exploratory factor analysis . . . . . . . . . 3.4.1.1 The single factor model . . . . . . 3.4.1.2 More factors . . . . . . . . . . . . 3.4.2 Factor rotation . . . . . . . . . . . . . . . . 3.4.3 Factor scores . . . . . . . . . . . . . . . . . 3.4.4 The number of factors . . . . . . . . . . . . 3.4.5 Model selection and model fit . . . . . . . . 3.4.6 Confirmatory factor analysis . . . . . . . . 3.4.7 Hierarchical and more complex structures . 3.5 Internal structure and FA in interactive application 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

Contents . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

103 107 107 108 109

5 Traditional item analysis 5.1 Introduction . . . . . . . . . . . . . . . . . . . . 5.2 Item difficulty . . . . . . . . . . . . . . . . . . . 5.2.1 Difficulty in binary items . . . . . . . . . 5.2.2 Difficulty in ordinal items . . . . . . . . . 5.3 Item discrimination . . . . . . . . . . . . . . . . 5.3.1 Correlation between item and total score 5.3.2 Difference between upper and lower group 5.4 Item characteristic curve . . . . . . . . . . . . . 5.5 Distractor analysis . . . . . . . . . . . . . . . . . 5.6 Reliability if an item is dropped . . . . . . . . . 5.7 Item validity . . . . . . . . . . . . . . . . . . . . 5.8 Missed items . . . . . . . . . . . . . . . . . . . . 5.9 Item analysis in interactive application . . . . . 5.10 Summary . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

111 111 111 112 113 114 115 116 117 119 121 121 123 124 125

6 Item analysis with regression models 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 6.2 Model specification . . . . . . . . . . . . . . . . . . 6.3 Models for continuous items . . . . . . . . . . . . . 6.3.1 Linear regression model . . . . . . . . . . . . 6.4 Models for binary items . . . . . . . . . . . . . . . . 6.4.1 Logistic regression model . . . . . . . . . . . 6.4.2 Other link functions, probit regression model 6.4.3 IRT parametrization . . . . . . . . . . . . . . 6.4.4 Nonlinear regression models . . . . . . . . . . 6.5 Estimation of item parameters . . . . . . . . . . . . 6.5.1 Nonlinear least squares . . . . . . . . . . . . 6.5.2 Maximum likelihood method . . . . . . . . . 6.6 Model selection . . . . . . . . . . . . . . . . . . . . 6.6.1 Likelihood-ratio test . . . . . . . . . . . . . . 6.6.2 Akaike information criterion . . . . . . . . . . 6.6.3 Bayesian information criterion . . . . . . . . 6.7 Models for polytomous items . . . . . . . . . . . . . 6.7.1 Ordinal regression models . . . . . . . . . . . 6.7.1.1 Cumulative logit model . . . . . . . 6.7.1.2 Adjacent-categories logit model . . . 6.7.2 Multinomial regression models . . . . . . . . 6.8 Joint model . . . . . . . . . . . . . . . . . . . . . . . 6.8.1 Person-item map . . . . . . . . . . . . . . . . 6.9 Regression models in interactive application . . . . 6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

127 127 127 129 129 131 132 133 135 136 139 139 140 140 141 141 141 142 142 142 145 146 150 152 153 154

4.6 4.7 4.8

4.5.1 A one-facet study . . . . . . . 4.5.2 A two-facet study . . . . . . Other estimates of reliability . . . . Reliability in interactive application Summary . . . . . . . . . . . . . . .

ix

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

Contents

x

7 Item response theory models 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . 7.2 General concepts and assumptions . . . . . . . . . . . 7.2.1 IRT model assumptions . . . . . . . . . . . . . 7.2.2 IRT models for binary data . . . . . . . . . . . 7.2.2.1 Rasch or 1PL IRT model . . . . . . . 7.2.2.2 2PL IRT model . . . . . . . . . . . . 7.2.2.3 Normal-ogive model . . . . . . . . . . 7.2.2.4 3PL IRT model . . . . . . . . . . . . 7.2.2.5 4PL IRT model . . . . . . . . . . . . 7.3 Estimation methods for IRT models . . . . . . . . . . 7.3.1 Heuristic methods and starting values . . . . . 7.3.2 Joint maximum likelihood . . . . . . . . . . . . 7.3.3 Conditional maximum likelihood . . . . . . . . 7.3.4 Marginal maximum likelihood . . . . . . . . . . 7.3.4.1 Estimation of person abilities in MML 7.3.5 Bayesian IRT models . . . . . . . . . . . . . . . 7.3.6 Item and test information . . . . . . . . . . . . 7.3.7 Model selection and model fit . . . . . . . . . . 7.4 Binary IRT models in R . . . . . . . . . . . . . . . . . 7.4.1 The mirt package . . . . . . . . . . . . . . . . 7.4.2 The ltm package . . . . . . . . . . . . . . . . . 7.4.3 The eRm package . . . . . . . . . . . . . . . . . 7.4.4 Other IRT packages . . . . . . . . . . . . . . . 7.4.4.1 The TAM package . . . . . . . . . . . . 7.4.4.2 The ShinyItemAnalysis package . . 7.4.5 The lme4 and nlme packages . . . . . . . . . . 7.4.6 Bayesian IRT with the brms package . . . . . . 7.5 Relationship between IRT and factor analysis . . . . . 7.6 IRT models in interactive application . . . . . . . . . 7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

155 155 156 156 158 158 159 159 160 161 161 162 162 165 166 167 168 168 169 171 171 176 178 179 179 179 180 181 181 183 186

8 More complex IRT models 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . 8.2 IRT models for polytomous items . . . . . . . . . 8.2.1 Cumulative logit IRT models . . . . . . . . 8.2.1.1 Graded response model . . . . . . 8.2.1.2 Graded rating scale model . . . . 8.2.2 Adjacent-categories logit IRT models . . . . 8.2.2.1 Generalized partial credit model . 8.2.2.2 Partial credit model . . . . . . . . 8.2.2.3 Rating scale model . . . . . . . . . 8.2.3 Baseline-category logit IRT models . . . . . 8.2.3.1 Nominal response model . . . . . 8.2.4 Other IRT models for polytomous data . . 8.2.5 Item-specific IRT models . . . . . . . . . . 8.3 Multidimensional IRT models . . . . . . . . . . . 8.3.1 Multidimensional 2PL model . . . . . . . . 8.3.2 Multidimensional graded response model . . 8.3.3 Confirmatory multidimensional IRT models 8.4 Estimation in more complex IRT models . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

187 187 187 187 187 192 193 193 195 196 198 198 201 201 202 202 204 206 208

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

Contents . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

208 208 208 209 209 210

8.4.1 Maximum likelihood methods
8.4.2 Regularization methods
8.4.3 Bayesian methods with MCMC and MH-RM
8.4.4 Model selection and model fit
8.5 More complex IRT models in interactive application
8.6 Summary

9 Differential item functioning
9.1 Introduction
9.2 Definition
9.2.1 DIF examples and interpretations
9.2.2 Matching criterion
9.3 Traditional DIF detection methods
9.3.1 Delta plot method
9.3.2 Mantel-Haenszel test
9.3.3 SIBTEST
9.4 DIF detection based on regression models
9.4.1 Logistic regression
9.4.1.1 Testing for DIF
9.4.2 Generalized logistic regression models
9.4.3 Group-specific cumulative logit model
9.4.4 Group-specific adjacent category logit model
9.4.5 Group-specific multinomial regression model
9.5 IRT-based DIF detection methods
9.5.1 Group-specific IRT models
9.5.2 Lord’s test
9.5.3 Likelihood-ratio test
9.5.4 Raju’s test
9.6 Other methods
9.6.1 Iterative hybrid ordinal logistic regression with IRT
9.6.2 Regularization approach for DIF detection
9.6.3 Measurement invariance: Factor analytic approach
9.7 DIF detection in interactive application
9.8 Summary

10 Outlook on applications and more advanced psychometric topics
10.1 Introduction
10.2 Computerized adaptive testing
10.2.1 Item bank
10.2.2 Ability estimation
10.2.3 Item selection algorithms
10.2.4 Stopping rules
10.2.5 CAT implementation in interactive application
10.2.6 Post-hoc analysis
10.2.7 CAT simulation study with MCMC
10.3 Test equating and linking
10.4 Generalizing latent variable models
10.5 Big data and computational psychometrics
10.6 Interactive psychometric modules
10.7 Summary

A Introduction to R
A.1 Obtaining and running R and RStudio
A.2 Starting with R
A.3 Installation of R packages
A.4 Data handling in R
A.4.1 Data types in R
A.4.2 Wide and long data format
A.4.3 Data handling with tidyverse
A.5 Graphics in R
A.5.1 Graphics in base R
A.5.2 Graphics with the ggplot2 package
A.5.3 Trellis graphics with the lattice package
A.6 Interactive Shiny applications

B Descriptive statistics

C Distributions of random variables
C.1 Discrete random variables
C.2 Continuous random variables

D Measurement data in ShinyItemAnalysis

E Exercises
E.1 Exercises for Chapter 1
E.2 Exercises for Chapter 2
E.3 Exercises for Chapter 3
E.4 Exercises for Chapter 4
E.5 Exercises for Chapter 5
E.6 Exercises for Chapter 6
E.7 Exercises for Chapter 7
E.8 Exercises for Chapter 8
E.9 Exercises for Chapter 9
E.10 Exercises for Chapter 10

References

Index

Preface

In many cases, the submission of measurement or testing results is a key moment in people’s lives with far-reaching consequences, such as with university admissions or the hiring of new employees. In other cases, a quick and precise assessment may help diagnose a patient’s condition. No matter the field or scope, all assessments that are used to measure latent traits, such as knowledge, attitudes and opinions, emotions, cognitions and personality, fatigue, or pain, are required to produce valid, reliable, and fair scores, as is outlined in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014).

In order to foster adoption of the recommendations provided by the Standards, this book covers the computational aspects of psychometric methods involved in developing measurement instruments and analyzing measurement data in social sciences. The computational aspects comprise both the statistical theory and models and their implementation with practical data examples in R (R Core Team, 2022). R is a freely available statistical software environment that has been in use for many years, and many R packages have been developed to cover complex psychometric analysis. The R code-based environment is a perfect fit for making both research and practice reproducible and easily shared. To overcome the initial apprehension which the command-line nature of the software may cause to those new to R, we also include examples in the interactive application of the ShinyItemAnalysis package (Martinková & Drabinová, 2018).

The book is targeted toward a wide range of readers. It is intended for researchers in the field of educational, psychological, and health-related measurement. It is also intended for those developing measurement instruments and for those collecting and analyzing data from behavioral measurements who are searching for a deeper understanding of underlying models and the further development of their analytical skills.
Researchers and practitioners who are familiar with psychometric methods and would like to learn how to perform the analyses in R may appreciate the practical examples with selected R code and interpretations of the output created by the software. On the other hand, readers with a statistical background may benefit from the demonstration of the various uses of statistical tools in measurement practice. The book may be used as study material in courses and workshops on measurement, psychometrics, and test development, especially when using R or the ShinyItemAnalysis application.

There are a number of books on educational and psychological measurement and psychometric analysis available, including Brennan (2006), R. M. Thorndike and Thorndike-Christ (2010), Lane, Raymond, and Haladyna (2015), or a more statistically oriented handbook by Rao and Sinharay (2007). There is also a growing body of books focusing on measurement with practical examples in R. These include Y. Li and Baron (2011), Revelle (2015), and Baker and Kim (2017), or the more recent books of J. D. Brown (2018), Desjardins and Bulut (2018), Mair (2018), T. Albano (2020), and Debelak, Strobl, and Zeigenfuse (2022).

There are several things which make this book different from the existing ones. First, this book outlines the measurement process in behavioral practice and links basic psychometric concepts, presented in individual chapters, to this process. Second, we focus on connecting measurement and psychometrics to statistical and computational aspects including the derivation of underlying models and estimation methods, as well as comparison of implementation within different software packages. The third main advantage of the book is the integration of examples in the interactive application of the ShinyItemAnalysis R package which is freely available and widely used by practitioners, students, and educators across the world. This also makes the concepts presented in this book accessible to readers who are new to R. The ShinyItemAnalysis application can then be used as a springboard for analysis in R as it is an easy-to-use interactive platform offering selected R code ready to be copied and pasted directly into R. No specific knowledge is expected from the audience of this book except for an introductory statistical background. Some experience in R is helpful, but not required.

I would like to acknowledge the many colleagues who made the publication of this book possible. First, I would like to thank Adéla Hladká for collaboration and assistance. Adéla wrote Chapter 9 on Differential Item Functioning, and she co-authored exercises for each chapter, as well as hints and solutions provided in the supplementary material. She also provided corrections and suggestions for other chapters of the book, and she assisted with the code and figures. I would also like to thank Karel Zvára and Marie Wiberg for reading a previous version of the book and for providing detailed feedback and valuable comments. I am thankful to Yves Rosseel, Carolin Strobl, David Kaplan, Stanislav Ježek, Eva Potužníková, Rebecca Price, and anonymous reviewers for reading various parts of the book and for providing helpful suggestions. New datasets were kindly provided by Hynek Cígler and Martina Hřebíčková; other datasets come from enriching collaborations including those with David Greger, Jenny McFarland, and Elena Erosheva. I am thankful to my colleagues at the Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences, and at the Institute for Research and Development in Education, Charles University, for creating an inspiring and supportive research environment. I would like to acknowledge the COMPS and CEMP group members for working on related psychometric topics, and particularly Jan Netík and Benjamín Šimsa for assistance with figure processing.
Students of the courses “Selected Topics of Psychometrics” and “Statistical Methods in Psychometrics” and participants of “Seminar in Psychometrics” at Charles University and at the Institute of Computer Science of the Czech Academy of Sciences during the years 2019–2022 shared helpful comments on earlier versions of the book. The work on the book was partly supported by the Charles University grant PRIMUS/17/HUM/11 and by the Czech Science Foundation grant 21-03658S. I am very grateful to John Kimmel and Lara Spieker, editors at Chapman & Hall/CRC Press, for their patience and helpful guidance over the years, and to the entire editorial team for their help. Finally, I would like to express deep thanks to Jon Kern for proofreading the book and all my family for their continuing support.

Thank you for reading the book and please report any suggestions for future editions to [email protected].

Patrícia Martinková, Prague, 2023


Layout of the book

The book is organized as follows: Chapter 1 provides an introduction to psychometrics, including a brief history of psychometrics, an introduction to R, and measurement data. Chapter 2 focuses on the concept of validity and statistical methods used in gathering validity evidence using external variables, while Chapter 3 further dives into gathering validity evidence based on the internal structure of the test and using various multivariate statistics methods. Chapter 4 focuses on variance decomposition and deals with reliability of measurement. Chapter 5 covers the methods for item analysis, including various traditional item characteristics and an empirical item characteristic curve. Chapter 6 demonstrates regression models for the description of item functioning, which provides modeling counterparts to empirical item characteristic curves and prerequisites for item response theory models. We then introduce item response theory models in Chapters 7 and 8. Chapter 9 covers the topic of differential item functioning. Finally, Chapter 10 provides an outlook on applications including computerized adaptive testing and test equating, and on more advanced topics including generalizing frameworks and methods for complex data structures. A separate section in each chapter presents examples of implementation in the interactive ShinyItemAnalysis application.

Appendices include an introduction to psychometrics with R with basic examples including data handling and graphics, exploration of measurement data with descriptive statistics, modeling measurement data with probabilistic models, a description of data upload in ShinyItemAnalysis, and a set of theoretical and practical problems and exercises for each chapter.
Materials related to the book are available on the book web page https://github.com/patriciamar/PsychometricsBook. The repository includes complete R code for each chapter and other supplementary code, all of the presented datasets, hints and solutions for some of the exercises, as well as other material. All datasets are also present in the ShinyItemAnalysis package.

Notation

a          item discrimination parameter
b          item difficulty/location parameter in IRT parametrization
c          item guessing parameter in IRT parametrization
d          item inattention parameter
β          regression parameter in intercept-slope parametrization
ξ          vector of item parameters
D          scaling parameter of about 1.7 transforming logit to probit
i          item or rater index
m          number of items or raters
p          person or respondent index
n          number of persons
k          category index in polytomous items
K_i        last category in item i
f          latent factor
q          dimension index
Q          number of dimensions
λ_i        vector of factor loadings for item i
Λ          matrix of factor loading vectors
Ψ          covariance matrix of factor loading vectors
Φ          covariance matrix of factors
X_p        score of person p
Y_pi       score of person p on item i
Ȳ_p•       mean item score of person p
Ȳ_•i       mean score in item i
E(·)       expected mean value
P(·)       probability
µ          mean
π          probability (of correct answer)
σ²_X       population variance of X
σ_XY       population covariance between X and Y
ρ_XY       population correlation between X and Y
V          population variance/covariance matrix
Σ          population correlation matrix
θ          latent ability/trait
s²_X       sample variance of X
s_XY       sample covariance between X and Y
S          sample variance/covariance matrix
r_XY       sample Pearson correlation between X and Y
R          sample correlation matrix
π̂          ratio, estimate of probability
∑          sum
∏          product
∇h(·)      gradient of function h(·)
Bi(n, π)   binomial distribution with n trials and probability π
N(µ, σ²)   normal distribution with mean µ and variance σ²

Acronyms

1PL One-Parameter Logistic.
2PL Two-Parameter Logistic.
3PL Three-Parameter Logistic.
4PL Four-Parameter Logistic.
AERA American Educational Research Association.
AIBS American Institute of Biological Sciences.
AIC Akaike Information Criterion.
ANOVA Analysis of Variance.
APA American Psychological Association.
BFI Big Five Inventory.
BIC Bayesian Information Criterion.
CAT Computerized Adaptive Test.
CBT Computer-Based Tests.
CDF Cumulative Distribution Function.
CFA Confirmatory Factor Analysis.
CFI Comparative Fit Index.
CLoSE Czech Longitudinal Study in Education.
CML Conditional Maximum Likelihood.
CRAN Comprehensive R Archive Network.
CTT Classical Test Theory.
CVR Content Validity Ratio.
DDF Differential Distractor Functioning.
DIF Differential Item Functioning.
D-study Dependability Study.
EAP Expected a Posteriori.
EB Empirical Bayes.
EFA Exploratory Factor Analysis.
EM Expectation-Maximization.
ETS Educational Testing Service.
FA Factor Analysis.
GLM Generalized Linear Model.
GLMM Generalized Linear Mixed Effect Model.
GPCM Generalized Partial Credit Model.
GRM Graded Response Model.
GRSM Graded Rating Scale Model.
G-study Generalizability Study.
G-theory Generalizability Theory.
GAMSAT Graduate Medical School Admission Test.
GMAT Graduate Management Admission Test.
GPA Grade Point Average.
HCI Homeostasis Concept Inventory.
HI Height Inventory.
ICC Item Characteristic Curve.
ICF Item Characteristic Function.
IDE Integrated Development Environment.
IIC Item Information Curve.
IIF Item Information Function.
IRC Item Response Curve.
IRF Item Response Function.
IRLS Iterative Re-weighted Least Squares.
IRR Inter-Rater Reliability.
IRT Item Response Theory.
JML Joint Maximum Likelihood.
LRT Likelihood-Ratio Test.
MAP Maximum a Posteriori.
MCAT Medical College Admission Test.
MCMC Markov Chain Monte Carlo.
MEI Maximum Expected Information.
MEPV Minimum Expected Posterior Variance.
MH Mantel-Haenszel.
MI Maximum Information.
ML Maximum Likelihood.
MLE Maximum Likelihood Estimate.
MLWI Maximum Likelihood Weighted Information.
MML Marginal Maximum Likelihood.
MPWC Maximum Posterior Weighted Criterion.
MS Mean Sum of Squares.
MST Multistage Test.
NCME National Council on Measurement in Education.
NRM Nominal Response Model.
OLS Ordinary Least Squares.
PCA Principal Component Analysis.
PCM Partial Credit Model.
PDF Probability Density Function.
PIRLS Progress in International Reading Literacy Study.
PISA Programme for International Student Assessment.
PROMIS Patient-Reported Outcomes Measurement Information System.
REML Restricted Maximum Likelihood.
RIR correlation between an item score and the sum of the rest of the test items.
RIT correlation between an item score and the total test score.
RMSEA Root Mean Square Error of Approximation.
RSM Rating Scale Model.
RSS Residual Sum of Squares.
SAT Scholastic Aptitude Test.
SEM Structural Equation Models.
SIBTEST Simultaneous Item Bias Test.
SS Sum of Squares.
TIC Test Information Curve.
TIF Test Information Function.
TIMSS Trends in International Mathematics and Science Study.
TLI Tucker-Lewis Index.
TOEFL Test of English as a Foreign Language.
TSC Test Score Curve.
TSF Test Score Function.
ULI Upper-Lower Index.
WHO World Health Organization.

Author bios

Patrícia Martinková is a Senior Researcher at the Institute of Computer Science of the Czech Academy of Sciences and an Academic Researcher at Charles University. Her research spans psychometrics and statistics with a focus on mathematical, statistical, and computational aspects of measurement. She has been teaching courses on psychometric methods since 2014. Adéla Hladká is a Postdoctoral Fellow at the Institute of Computer Science of the Czech Academy of Sciences. Her research interests include psychometrics and statistics, with a focus on differential item functioning, and software development with R.


1 Introduction

Measurement is present in many areas of our everyday life. In an educational context, these measurements include assessing academic achievement, certifying student qualifications and proficiency, high-stakes testing for school admission, as well as national and international large-scale assessment used to support policy decisions. In the area of psychology, measurements are used for assessing intelligence, personality traits, or attitudes. Health-related measurements include those of fatigue, depression, pain, quality of life, or well-being. Measurement is also present in other contexts, such as in hiring or promotion, peer review of articles, selection of grant proposals, and in numerous other areas.

What makes measurement in social sciences different from measuring height, weight, or blood pressure? The answer to this question is of key importance to understanding the specifics of the analytic and computational aspects involved. The main feature of measurement in social sciences is that it is typically not possible to directly measure the constructs of interest: The traits, such as math ability, leadership, level of depression, anxiety, or pain, are latent, which means unobservable, hidden. We have to use manifest variables, such as item responses on a questionnaire, to make inferences about these latent variables. Rather than a single manifest variable, we typically use a larger number of variables to measure the same latent construct, using multi-item measurement instruments. Measurement error is omnipresent and has to be quantified and accounted for.

Psychometrics, as a field concerning psychological, educational, health-related, and other measurement in social sciences, is the disciplinary home for a number of statistical models and methods solving the challenges mentioned above. Psychometrics has contributed to many areas of statistics for many years, bringing not only motivation and inspiration but also providing computational improvements and solutions. A number of psychometric methods have been developed to draw inferences from empirical data and to gain quantitative proofs of measurement reliability, validity, or the functioning of single items. While closely related to statistical research, psychometrics has developed field-specific methods and terminology for models, tests, and indices.

1.1 Brief history of psychometrics

The year 1935, when the Psychometric Society was founded, may be considered an important milestone for psychometrics. However, the origins of psychometrics date much earlier and involve scientists from various fields. For an overview, see Jones and Thissen (2006), or the web page of the Psychometric Society (www.psychometricsociety.org). We mention at least a few of the important moments and figures below.

Carl Friedrich Gauss (1777–1855), a German mathematician and physicist, studied measurement errors in the context of astronomy and orbital prediction. He employed the method of least squares and a statistical distribution which is now known as the Gaussian or normal distribution. In the context of recording the transit time of stars, another German astronomer, Friedrich Bessel (1784–1846), who corresponded regularly with Gauss, noticed that when several observers are employed simultaneously, they determine slightly different values. He further elaborated this concept into a "personal equation" enabling the estimation of, and correction for, rater biases employing the method of least squares. As a result, the measurement error and reaction times became a major topic in experimental psychology of the time.

Sir Francis Galton (1822–1911), a British scientist known for achievements in many scientific areas, is also often referred to as "the father of psychometrics". Influenced by Charles Darwin, he employed statistical methods to study human differences and inheritance on data measured in his Anthropometric Laboratory in London. He noticed that fathers of above-average height tended to have sons shorter than themselves, while shorter fathers tended to have sons taller than themselves, a concept he termed "regression towards mediocrity". This phenomenon, now termed "regression towards the mean", is present in many fields including psychological, educational, and health-related measurement. While Galton mainly studied linear relationships, and correlation and prediction, his initial input later gave a start to a wide range of regression models. His numerous papers and books include the Brain journal paper, "Psychometric Experiments", in which he describes psychometrics as the "art of imposing measurement and number upon operations of the mind" (Galton, 1879).

The British psychologist Charles Edward Spearman (1863–1945), influenced by Galton, worked on a general intelligence factor which he termed the g factor. It should be noted that Spearman was a University College colleague of Karl Pearson (1857–1936), who is credited with establishing the discipline of statistics. While Spearman’s main focus lay in experimental psychology, he was also well known for his work in statistics: for Spearman’s rank correlation coefficient (a nonparametric version of Pearson’s correlation coefficient), for articles on correction for attenuation, and for the initial development of factor analysis.

In France, Alfred Binet and Théodore Simon developed the Binet-Simon IQ test, with the aim of measuring higher cognitive abilities and identifying students who did not learn effectively from regular classroom instruction. In 1905, they introduced the concept of mental age. English translations of the test followed, notably by Lewis M. Terman in 1916, who defined the IQ score as a quotient of mental age divided by chronological age multiplied by 100.

During World War I, psychological examination of U.S. Army recruits was of interest. A statistical unit working on the development of the Army Alpha and Beta test batteries involved Edward Lee Thorndike, Arthur S. Otis, and Louis Leon Thurstone, the future first president of the Psychometric Society. Initial meetings of the founding fathers of the society concerned the creation of a new journal for publication of psychometric research. The first issue of the journal Psychometrika followed shortly after, in 1936.

During World War II, the Army General Classification test was developed, and psychometric research was concentrated on this test to assure its quality. Influenced by the work on army test batteries, the Stanford Achievement Test was delivered to measure student learning at school. Development of further educational tests followed. The Educational Testing Service (ETS), which produced the Scholastic Aptitude Test (SAT), was founded in 1947. Numerous psychological, educational, health-related, and other measurement instruments have been developed and established since those times.
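As a small worked illustration of the ratio definition of the IQ score described above, the quotient can be written out explicitly. The symbols MA (mental age) and CA (chronological age) are shorthand assumed here for illustration and are not part of the book's notation; the ages in the example are made up.

```latex
\[
  \mathrm{IQ} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100,
  \qquad \text{e.g.} \quad
  \mathrm{MA} = 10,\ \mathrm{CA} = 8
  \;\Longrightarrow\;
  \mathrm{IQ} = \frac{10}{8} \times 100 = 125 .
\]
```

Under this definition, a child whose mental age equals the chronological age scores exactly 100.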

1.2 Measurement in social sciences

The process of measurement in social sciences includes a number of steps. As a general principle, before considering any psychometric analysis, the data and measurement quality needs to be assured by the process of measurement development. The general guidelines for test development are provided in the Standards for Educational and Psychological Testing by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) (2014). As test development itself is not the focus of this book, we only briefly mention the main steps. For more details, the reader is referred to the Handbook of Test Development by Lane et al. (2015).

Defining the construct of interest is perhaps the most important part of measurement development. Without a very clear and precise idea about what is intended to be measured, the measurement scores will seldom be valid. Naming and defining the domain to be measured are followed by developing content specifications to guide item development. In the context of educational measurement, this step includes creating conceptual frameworks, or what is sometimes called blueprinting, i.e., the precise and detailed description of the foundation stones from which the measurement instrument should be composed.

Item development is the key component in multi-item measurement. Suitable item formats need to be selected and the items need to be written in such a way that they align with the construct and its components. High quality items are those which are reviewed by a number of reviewers for their quality and pretested on a sample of respondents from the specific interest group.

Rater training is important when a number of different raters are used to perform the measurement. This includes providing clear and detailed directions about what is to be rated and how it is to be rated, i.e., a detailed rubric. Rater training may be complemented by an assessment of the alignment of their ratings compared to a pre-specified gold standard, and formalized by a rater certification.

Once the measurement data are collected, the psychometric analyses, which are the main focus of this book, may take place. These analyses may provide empirical proofs for complex validation of the instrument, or assign scores to the respondents or ratees, but can also provide other useful evidence. We describe the validation process and the process of respondent scaling briefly here, while the later chapters of this book will provide more detailed information on psychometric tools for each single step, especially focusing on the computational aspects.

Collecting proofs of test validity. While an important part of validity evidence stems from the instrument development process, a number of different psychometric analyses may be performed to provide empirical evidence of instrument validity, i.e., to provide proofs that the instrument measures what it is intended to measure. These analyses include evidence based on external criteria, such as correlations with scores from other instruments. Other psychometric analyses can focus on between-item correlations and the internal structure of the test. See Chapter 2 and Chapter 3 for details on collecting proofs of test validity.


Collecting proofs of test reliability. Assessing measurement reliability is especially important, as this reliability is what limits the measurement validity. Is the measurement consistent when repeated in time, by different raters or on different occasions? How much error is due to external sources or due to random noise? Chapter 4 provides further details on analyses relevant for collecting proofs of test reliability.

Analyzing item functioning. The measurement instrument is typically composed of several items, and to address the quality of the entire instrument, it is imperative to take a look at these individual items. Do all items work properly and discriminate well between respondents? Is guessing an issue? How much information does an individual item carry for respondents of a certain ability level? We provide traditional item analysis in Chapter 5, we start building regression models for item description in Chapter 6, and we describe the Item Response Theory (IRT) models in Chapters 7 and 8.

Analyzing impact of group membership and other covariates. In order to ensure instrument fairness but also to understand strengths as well as weaknesses and misconceptions of different groups of respondents, the test and item functioning needs to be analyzed with respect to groups and other respondent or rater characteristics. This is especially important when considering that, oftentimes, only a relatively small selection of items is presented to the respondent, or a selection of raters is used to rate a subject or object of measurement. We focus on group effects in Chapter 9, while we also discuss covariates such as grouping variables in preceding chapters.

Selecting optimal items. In the case of larger item banks, only a portion of the items is offered at a time to the respondent. Detailed information on item functioning can then be used to design optimal selection of items in an adaptive setting. In other situations, when a number of test versions are used, information about item properties is used to equate resulting ability estimates or to link the test versions. We discuss adaptive tests in Chapter 10.

Scaling. Eventually, the measurement instrument is used to assign scores to respondents or other rated subjects/objects. We discuss traditional scaling methods in Chapter 1, the factor analytic approach in Chapter 3, and methods based on IRT models in Chapters 7 and 8.

While the general concepts are common among all areas of multi-item measurement or ratings from multiple raters, different fields have their specifics in the psychometric methods needed for measurement. We will now turn to a discussion of these specifics.

1.2.1 Educational measurement

Educational measurement typically aims at assessing knowledge, ability, or academic achievement. The status of this field, including many aspects of psychometrics, is summarized in the fourth edition of Educational Measurement, edited by Brennan (2006). In most cases the assessment instrument is developed to measure a one-dimensional construct, such as knowledge of a certain subject or concept.


Educational assessment may have a number of purposes, which may in turn affect the psychometric analyses used on data from these assessments. For example, formative tests are used throughout the course to aid learning. In contrast, summative assessment is typically used at the end of the course to assign students a grade. Various testing situations further differ in the stakes that are involved: So called highstakes tests have major consequences, such as awarding a high school diploma based upon the exit/matura examination, or entering a university based upon an admission test. Contrarily, the low-stakes tests include the classroom, national, or international testing used for monitoring student progress. The criterion-referenced tests compare the score with a predefined cut score or other criterion. In this type of test, the performance of other respondents does not affect the respondent, as is the case on a driving test. The norm-referenced tests, on the contrary, compare the respondent score with the score of other respondents. The final score in this type of tests provides the information about the percentage of the population of given characteristics (age group, etc.) which has a higher, or a lower score. Testing situations also differ by other aspects, such as the sample size, i.e., the number of students taking the test. This may greatly influence the psychometric analyses available for the given sample size. For example, some of the IRT models presented in Chapter 7 cannot be fitted for small sample sizes, and their use on data from classroom testing may not be possible. Conceptual tests are knowledge assessments focusing on a certain knowledge area or a concept. As an example, McFarland et al. (2017) described the development and validation of the Homeostasis Concept Inventory (HCI), a conceptual test assessing an undergraduate level of understanding of the concept of homeostasis, the way the body regulates physiological systems. 
The authors performed a thorough validation study including a number of psychometric analyses of the responses of a large number of undergraduate students to the test. We use the HCI dataset throughout the book for demonstration purposes.

Language tests such as the Test of English as a Foreign Language (TOEFL)² are specific in that they analyze respondents' speaking and writing abilities. The Duolingo English Test³ may be taken at any time. Principles of adaptive testing and text analysis with machine learning are used to automatically generate test items and to score test-taker item responses.

State matura examinations are high-stakes tests organized in many countries as high school exit examinations. In some countries, these tests are further used for university admissions; the typical example is Great Britain’s A-levels. In this book, we use data from the Czech Republic matura examination to demonstrate some of the psychometric analyses.

Admission tests to universities and high schools are yet another area of educational assessment of great importance. The entrance exams may be a more general form of aptitude test, such as the SAT⁴, or a more focused admission test, such as the Medical College Admission Test (MCAT)⁵, Graduate Medical School Admission Test (GAMSAT)⁶, Graduate Management Admission Test (GMAT)⁷, and others. With a high number of students and specific times when the exam may be taken, one of the key aspects is ensuring that the scores from different test dates are comparable.

National/state testing is implemented in many countries to provide information on school quality or for diagnostic purposes, such as early detection of failing students. In the United States, the "No Child Left Behind" act led to implementation of yearly state testing across the country. State testing exists in most countries, although many test only those students from the final grades of given educational stages. In Europe, state testing is typically low-stakes both for the students and for the schools. In some countries it is possible to link individual data from different years and perform a longitudinal analysis. Some practical examples will be demonstrated on the data from the Czech Longitudinal Study in Education (CLoSE), which administered tests in mathematics, reading, Czech language, and learning competence to the same students in Grades 4, 6, and 9 (Greger, Straková, & Martinková, 2022).

International large-scale assessment, abbreviated as ILSA⁸, is used for international educational policy development. The studies include the Programme for International Student Assessment (PISA), Trends in International Mathematics and Science Study (TIMSS), or Progress in International Reading Literacy Study (PIRLS). Large international teams of experts are involved in test development, the tests and analytic approaches are typically very well documented, and a great deal of data is publicly available.

² https://www.ets.org/toefl/
³ https://englishtest.duolingo.com/
⁴ https://collegeboard.org/sat/
⁵ https://www.aamc.org/mcat/
⁶ https://gamsat.acer.org/
⁷ https://www.mba.com/exams/gmat/
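The distinction between criterion-referenced and norm-referenced interpretation described earlier in this section can be illustrated with a minimal sketch in base R. The scores, the cut score of 12, and all object names are hypothetical and serve only as an illustration.

```r
# Hypothetical total scores of eight respondents (illustration only)
scores <- c(7, 9, 12, 12, 14, 15, 18, 19)

# Criterion-referenced: compare each score with a predefined cut score
cut_score <- 12  # assumed cut score, not taken from any real test
passed <- scores >= cut_score

# Norm-referenced: percentage of the sample with a lower score
pct_below <- sapply(scores, function(s) 100 * mean(scores < s))

data.frame(score = scores, passed = passed, pct_below = pct_below)
```

In the criterion-referenced column, a respondent's result does not change when other respondents are added or removed, whereas every value in the norm-referenced column depends on the reference sample.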

1.2.2 Psychological assessment

While educational measurements are typically unidimensional, the assessment instruments in psychology are very often multidimensional. For example, the big five model, also known as the five-factor model, is probably the most widely accepted personality theory model. It states that personality can be boiled down to five core factors, known by the acronym CANOE or OCEAN: Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism. Inventories used to measure these five factors of personality include the Big Five Inventory (BFI) and its revised version BFI-2 (Soto & John, 2017). As an example, we use the BFI2 dataset from the Czech adaptation of the revised version (Hřebíčková et al., 2020). Psychological tests are typically norm-referenced, and development of population norms is usually an important part of test development. Contrary to educational testing, the psychological testing typically uses ordinal scales such as the Likert scales, "strongly agree" – "agree" – "neutral" – "disagree" – "strongly disagree". Unlike in educational testing, there are usually no correct or wrong answers to an item. Psychological measurement also assesses cognitive functions (e.g., memory or intelligence), emotional states, attitudes, and often the diagnosis of psychological disorders is of interest. More information on psychological measurement can be found in Goldstein, Allen, and DeLuca (2019) or in R. M. Thorndike and Thorndike-Christ (2010).

1.2.3 Health-related outcome measures

While directly measured data are available in medical sciences, such as body temperature, blood pressure, or functional magnetic resonance imaging, health-related measurements also include ratings from caregivers (such as physical therapists), or self-reported ratings by the patient. Haigh et al. (2001) surveyed over 400 rehabilitation units, which, in total, reported usage of over 100 clinical tests. Řasová et al. (2020) provided a classification of a number of outcome measures according to the International Classification of Functioning, Disability and Health⁹. As an example, a study by Řasová, Martinková, Vyskotová, and Šedová (2012) analyzed psychometric properties of an assessment set for the evaluation of clinical outcomes in multiple sclerosis patients. The patient-reported outcomes include questionnaires on pain, fatigue, emotional distress, physical functioning, social-role participation, and others. This is connected to the interest of the World Health Organization (WHO) in the subjective well-being of the patients.

The Patient-Reported Outcomes Measurement Information System (PROMIS) project (Cella et al., 2010) has been producing a number of item banks which can be used for both linear and adaptive assessments. A number of articles have been devoted to different PROMIS health-related outcome measures¹⁰. As an example, we use the Anxiety dataset containing responses to the PROMIS Anxiety scale.

Health-related measurement instruments typically consist of binary or ordinal items, and as is the case in psychological measurement, there are no true or false answers. The algorithms involved are otherwise similar to those of other areas of behavioral measurement. For illustration, the PROMIS psychometric evaluation and calibration plan (Reeve et al., 2007) includes many topics covered in this book: Traditional descriptive statistics for items and scales, evaluation of the assumptions of the IRT model, fitting the IRT model to data and subsequently examining model fit, evaluation of between-group differences among key demographic and clinical groups, item calibration for item banking, and designing adaptive tests.

⁸ https://ilsa-gateway.org/
⁹ https://www.who.int/standards/classifications/

1.2.4 Other areas of measurement

Measurement of latent traits with multi-item instruments or multiple raters is also present in other areas. In organizational research, raters are typically involved in hiring of new employees (Martinková, Goldhaber, & Erosheva, 2018). The validation of such hiring instruments and procedures involves gaining empirical evidence of their predictive power: To what extent is the rater able to predict the future value of the teacher to their students? In addition, inter-rater reliability is of interest as a way to assess the consistency of the raters. Functioning of the individual criteria being rated may provide information on possible further improvements of the instrument. As an example of another area, in grant proposal selection, peer review is an important part of the process. Raters and committees are involved in selection of the top grant proposals to be funded. In this book, we make use of peer review data from the American Institute of Biological Sciences (AIBS) (Erosheva, Martinková, & Lee, 2021) to demonstrate aspects of the inter-rater reliability analysis. Other areas of measurement include journal article peer review, and customer satisfaction surveys.
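As a rough illustration of what consistency among raters means, the following base R sketch compares three hypothetical reviewers scoring the same five proposals. The ratings are made up, and pairwise correlations are used here only as a crude stand-in for a proper inter-rater reliability coefficient.

```r
# Hypothetical scores (1 = best, 5 = worst) given by three reviewers
# to five proposals; the values are made up for illustration
ratings <- data.frame(
  reviewer1 = c(1.5, 2.0, 3.5, 4.0, 2.5),
  reviewer2 = c(1.0, 2.5, 3.0, 4.5, 2.0),
  reviewer3 = c(2.0, 2.0, 4.0, 3.5, 3.0)
)

# Pairwise Pearson correlations between reviewers
rater_cor <- cor(ratings)
round(rater_cor, 2)

# Disagreement within each proposal: highest minus lowest score
spread <- apply(ratings, 1, function(x) diff(range(x)))
spread
```

High correlations together with small within-proposal spreads would suggest that the reviewers rank and score the proposals similarly.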

1.3 Measurement data in this book

The datasets used in this book cover all above-mentioned areas of measurement in social sciences. The datasets are available in the ShinyItemAnalysis package and also on the book web page.

AIBS Grant Peer Review Scoring. The AIBS Grant Peer Review Scoring dataset (Erosheva et al., 2021) comes from a scientific peer review of biomedical applications from a collaborative biomedical research funding program of the AIBS (2014–2018). Three assigned individual reviewers were asked to provide scores and commentary for the following application criteria: Innovation, Approach/Feasibility, Investigator, and Significance (Impact was added as a scored criterion in 2014). Each of these criteria, as well as the overall score, is rated on a scale from 1.0 (best) to 5.0 (worst) with a 0.1 gradation. The dataset further contains information on the review year, proposal anonymized ID and type, investigator’s anonymized ID, organization type, gender, rank, and degree, as well as each reviewer’s anonymized ID, institution type, gender, rank, and degree.

Anxiety. Anxiety is a real dataset originally from the lordif package. It contains responses from 766 respondents sampled from a general population to the PROMIS anxiety scale composed of 29 Likert-type questions with a common rating scale. Besides item responses, this dataset also contains the age, gender, and education category as further characteristics of respondents.

The Next Big Five Inventory (BFI2). The BFI2 dataset (Hřebíčková et al., 2020) consists of responses by 1,733 Czech respondents (730 males, 1,003 females) to the Next Big Five Inventory BFI-2 (Soto & John, 2017) measuring personality in five domains: Extroversion, agreeableness, openness, conscientiousness, and neuroticism. It contains responses to 60 ordinal items from each respondent. Besides item responses, the dataset contains the age, education, and gender of each respondent.

Czech matura. The CZmaturaS dataset comes from the Czech matura exam in mathematics. The matura exam was administered in spring 2019 to students from Grade 13, at the end of their secondary education. The dataset consists of responses from a random sample of 2,000 out of 15,702 students. It includes 75 variables including item answers, scored items, the school type, and a binary variable on whether this was the first attempt at the matura exam, or if it was a repeated exam. The ShinyItemAnalysis package also includes the CZmatura dataset, which contains data from the complete set of 15,702 students.

Eysenck Personality Inventory – Impulsivity Subscale (EPI-A). The EPIA (Ferrando, 2002) is a real dataset originally available in the EstCRM package (Zopluoglu, 2015). The data capture the responses of 1,033 undergraduates to five items from the Spanish version of the EPI-A impulsivity subscale. Their responses were check marks on a 112 mm line segment with two end points: almost never and almost always. The item score was defined as the distance in mm of the check mark from the left end point (Ferrando, 2002).

Graduate Management Admission Test (GMAT). The GMAT is a simulated dataset originally available in the difNLR package. The dataset represents the responses of 2,000 subjects (1,000 males, 1,000 females) to a multiple-choice test of 20 items. The answers were generated using item parameters of the real GMAT exam (Kingston, Leary, & Wightman, 1985). The distribution of total scores was the same for both groups; however, the first two items were manipulated to function differently for the two groups (Martinková et al., 2017). The GMAT dataset also contains a simulated continuous criterion variable.

¹⁰ https://www.healthmeasures.net/


Homeostasis Concept Inventory (HCI). The HCI (McFarland et al., 2017) is a 20-item multiple-choice instrument that assesses how well undergraduates understand the critical physiological concept of homeostasis. The HCI was validated on a main sample of 669 undergraduate students and smaller samples of 45, 16, and 10 students. Several datasets from the original study are available in the ShinyItemAnalysis package and presented in this book:

• The HCIdata contains the answers from the main sample of 669 undergraduates to multiple-choice items of the HCI test (variables A1–A20), the scored items (variables QR1–QR20), the total score (total), and a number of contextual variables: gender, major, year of study, minority membership, and other.
• The HCItest contains answers and the HCI dataset contains scored items of a subsample of 651 undergraduate students (405 males, 246 females) who disclosed their gender.
• The HCIkey is a nominal vector with 20 values representing correct answers to items of the HCIdata and HCItest datasets.
• The HCIlong dataset contains the HCI data in the long format (see Section A.4.2).
• The HCIgrads dataset consists of the responses of 10 graduate students.
• The HCItestretest dataset consists of the responses to the HCI on two occasions (test and retest) of 45 students who took no courses on homeostasis between the two occasions.
• The HCIprepost dataset contains the pretest and posttest scores of 16 students who received instruction on homeostasis within a physiology course between the two test occasions.

Height Inventory. The HeightInventory dataset consists of the responses of 4,885 Czech respondents (1,479 males, 3,406 females) to a Height Inventory (Rečka & Cígler, 2020). The Height Inventory contains 26 ordinal items of self-perceived height, such as "A lot of trousers are too short for me" or "In a crowd of people, I still have a comfortable view", rated on a scale 1 – strongly disagree, 2 – disagree, 3 – agree, 4 – strongly agree. The dataset further contains self-reported height (in centimeters) and gender of each respondent.

Learning to Learn (LtL). The LearningToLearn (Martinková, Hladká, & Potužníková, 2020) is a real longitudinal dataset from the CLoSE study (Greger et al., 2022). Among other variables, it primarily contains the binary-scored responses of 782 secondary-school students to a (mostly) multiple-choice test consisting of 41 items within 7 subscales of the test of learning competencies. Item responses from Grade 6 and Grade 9 are available for each respondent. The school track (variable track_01 or track) is available, with 391 students attending a basic school and 391 attending a selective academic school. The LearningToLearn dataset was created from a larger sample of respondents using a matching algorithm to achieve similar characteristics for both tracks in Grade 6.

Medical School Admission Test. Several real datasets of an admission test to medical school (Martinková et al., 2017) are presented in this book:


• The dataMedical contains binary-scored items of an admission test to medical school for 2,392 applicants.
• The dataMedicalgraded is an ordinal version of the dataMedical dataset. Each item was graded from 0 to 4 points. A maximum of 4 points was given if all of the correct answers and none of the incorrect answers were selected. Similarly to the dataMedical dataset, this ordinal dataset also includes the gender variable and study status as the criterion variable.
• The MSATB dataset is a subset of a real medical school admission test for Biology from the difNLR package (Drabinová & Martinková, 2017; Hladká & Martinková, 2020). This dataset represents responses of 1,407 subjects (484 males, 923 females) to a selection of 20 multiple-choice items. The first item was previously detected as functioning differently for the two genders.

Test Anxiety. The TestAnxietyCor dataset contains between-item correlations for 20 items of the Test Anxiety inventory. The items cover both the psychological aspects of test anxiety, e.g., thinking about grades, thinking about getting through school, as well as physiological reactions of the nervous system, e.g., uneasy, upset feeling, freeze up, etc. See Bartholomew, Steele, Galbraith, and Moustaki (2008) for more details.
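A quick way to see which of the datasets described above are shipped with an installed copy of the ShinyItemAnalysis package is sketched below. The object names pkg and available are ours, and the guard around the call is an assumption that the package may not be installed on the reader's machine.

```r
pkg <- "ShinyItemAnalysis"
available <- character(0)

# List the names of datasets bundled with the package, if it is installed
if (requireNamespace(pkg, quietly = TRUE)) {
  available <- data(package = pkg)$results[, "Item"]
  print(head(available))
} else {
  message("Package ", pkg, " is not installed.")
}
```

The exact set of names returned depends on the installed package version.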

1.4 Psychometrics with R

A number of software packages are available for psychometric analysis, many of which are commercial software for specific purposes. The IRT models are offered with IRTPRO (Paek & Han, 2013), flexMIRT (Chung & Houts, 2020), Winsteps (Linacre, 2021), or ConQuest (Adams, Wu, Cloney, & Wilson, 2020). The factor analytic approach, path analysis, and more generally the structural equation models are available with LISREL (Jöreskog & Sörbom, 1996), EQS (Bentler, 2006), Amos (Arbuckle, 2011), Mplus (Muthén & Muthén, 2017), or gllamm which is implemented in Stata (Rabe-Hesketh, Skrondal, & Pickles, 2004).

In this book, we use R (R Core Team, 2022), which is a freely available statistical environment. There are a number of things that make R a great choice for psychometric analysis: Besides being free, it is also open source software, meaning that the code is available to the user, and it is possible to check what a given function does or how the methods are implemented. It offers up-to-date methods, as a number of statisticians and researchers from different fields implement newly published algorithms and methods in R. There is a huge community around R; thus finding an answer to a question is not a hard task. R also offers excellent graphical tools.

The basic functionality of the R language is enhanced by numerous packages authored by R contributors. Many R packages have been developed to cover general psychometric concepts as well as specific psychometric topics. For an overview, see the subject-specific Comprehensive R Archive Network (CRAN) Task View¹¹. In addition, some of the psychometric models can be implemented by using general packages such as lme4. A brief description of the packages used in this book, including the full citation, the title of the package, and the chapters in which the package is used, is provided in Table 1.1.

¹¹ https://cran.r-project.org/web/views/Psychometrics.html
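The openness of R mentioned above can be tried directly in the console. The following short sketch uses only base R; the choice of sd() as the function to inspect is arbitrary.

```r
# Version of R currently in use
R.version.string

# Typing a function name without parentheses prints its source code
sd

# Recommended citation for R itself
r_citation <- citation()
r_citation
```

The same idea applies to functions from contributed packages, whose source can be inspected once the package is loaded.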

TABLE 1.1: R packages used in this book.

| R package | Version (Year) | Citation | Title | Chapters |
|---|---|---|---|---|
| aod | 1.3.2 (2012) | Lesnoff and Lancelot (2012) | Analysis of Overdispersed Data | 9 |
| brms | 2.19.0 (2023) | Bürkner (2017) | Bayesian Regression Models using 'Stan' | 4, 7 |
| catR | 3.17 (2022) | Magis and Raîche (2012) | Generation of IRT Response Patterns under Computerized Adaptive Testing | 10 |
| deltaPlotR | 1.6 (2018) | Magis and Facon (2014) | Identification of Dichotomous Differential Item Functioning (DIF) using Angoff's Delta Plot Method | 9 |
| DFIT | 1.1 (2021) | Cervantes (2017) | Differential Functioning of Items and Tests | 9 |
| difNLR | 1.4.2 (2023) | Hladká and Martinková (2020) | DIF and DDF Detection by Non-Linear Regression Models | 9 |
| difR | 5.1 (2020) | Magis, Beland, Tuerlinckx, and De Boeck (2010) | Collection of Methods to Detect Dichotomous Differential Item Functioning (DIF) | 9 |
| equateIRT | 2.3.0 (2022) | Battauz (2015) | IRT Equating Methods | 9 |
| eRm | 1.0.2 (2021) | Mair, Hatzinger, and Maier (2020) | Extended Rasch Modeling | 7, 8 |
| ggdendro | 0.1.23 (2022) | de Vries and Ripley (2020) | Create Dendrograms and Tree Diagrams Using 'ggplot2' | 3 |
| ggplot2 | 3.4.1 (2023) | Wickham (2016) | Create Elegant Data Visualisations Using the Grammar of Graphics | 1–10 |
| GPArotation | 2022.10.2 (2022) | Bernaards and Jennrich (2005) | GPA Factor Rotation | 3 |
| gtheory | 0.1.2 (2016) | Moore (2016) | Apply Generalizability Theory with R | 4 |
| hemp | 0.1.0 (2023) | Bulut (2020) | Handbook of Educational Measurement and Psychometrics Using R Companion Package | 4 |
| lattice | 0.20.45 (2021) | Sarkar (2008) | Trellis Graphics for R | A |
| lavaan | 0.6.15 (2023) | Rosseel (2012) | Latent Variable Analysis | 3 |
| lme4 | 1.1.32 (2023) | Bates, Mächler, Bolker, and Walker (2015) | Linear Mixed-Effects Models using 'Eigen' and S4 | 4, 7 |
| lmtest | 0.9-40 (2022) | Zeileis and Hothorn (2002) | Testing Linear Regression Models | 2 |
| lordif | 0.3.3 (2016) | Choi, Gibbons, and Crane (2011) | Logistic Ordinal Regression Differential Item Functioning using IRT | 9 |
| ltm | 1.2.0 (2022) | Rizopoulos (2006) | Latent Trait Models under IRT | 7, 8 |
| mirt | 1.38.1 (2023) | Chalmers (2012) | Multidimensional Item Response Theory | 1, 7–10 |
| mirtCAT | 1.12.2 (2022) | Chalmers (2016) | Computerized Adaptive Testing with Multidimensional Item Response Theory | 10 |
| moments | 0.14.1 (2015) | Komsta and Novomestky (2015) | Moments, Cumulants, Skewness, Kurtosis and Related Tests | 1 |
| msm | 1.7 (2022) | C. H. Jackson (2011) | Multi-State Markov and Hidden Markov Models in Continuous Time | 6, 9 |
| nlme | 3.1.162 (2023) | Pinheiro, Bates, DebRoy, Sarkar, and R Core Team (2020) | Linear and Nonlinear Mixed Effects Models | 6 |
| nnet | 7.3.18 (2022) | Venables and Ripley (2002) | Feed-Forward Neural Networks and Multinomial Log-Linear Models | 1–5, 7 |
| psych | 2.3.3 (2023) | Revelle (2020) | Procedures for Psychological, Psychometric, and Personality Research | 4 |
| psychometric | 2.3 (2022) | Fletcher (2010) | Applied Psychometric Theory | 4 |
| remotes | 2.4.2 (2021) | Hester et al. (2021) | remotes: R Package Installation from Remote Repositories, Including 'GitHub' | A |
| semPlot | 1.1.6 (2022) | Epskamp (2019) | Path Diagrams and Visual Analysis of Various SEM Packages' Output | 3 |
| shiny | 1.7.3 (2022) | Chang et al. (2021) | Web Application Framework for R | A |
| ShinyItemAnalysis | 1.5.0 (2023) | Martinková and Drabinová (2018) | Test and Item Analysis via Shiny | 1–10 |
| SIAmodules | 0.1.0 (2023) | Martinková and Netík (2023) | Modules for 'ShinyItemAnalysis' | 4, 10 |
| TAM | 4.1.4 (2022) | Robitzsch, Kiefer, and Wu (2021) | Test Analysis Modules | 7 |
| tidyverse | 2.0.0 (2023) | Wickham et al. (2019) | Easily Install and Load the 'Tidyverse' | A |
| VGAM | 1.1.8 (2023) | Yee (2015) | Vector Generalized Linear and Additive Models | 6 |

In Appendix A (see also Supplementary R code), we provide a quick introduction to R, including handling the data or creating graphs and even interactive applications. For more details and a more thorough general introduction to R, the reader is referred to Paradis (2002), Verzani (2004), Dalgaard (2008), or Wickham and Grolemund (2016).

1.5 Exploring measurement data

1.5.1 Item scores

As was mentioned in previous sections, the measurement in social sciences is oftentimes composed of multiple items. We will now focus on exploring and describing the item scores. We will first need to elaborate on available item types in the context of the data typology (see Appendix B for more details).

Nominal items. Nominal items are typical in educational measurement, for example, in a form using multiple-choice items, where the task is to select the best answer to a question from a number of choices. Given the answer key (list of correct answers), the nominal data can be scored and yield binary true-false items. However, such dichotomization may lead to discarding valuable information on the functioning of alternative answer options, the distractors, which may be important for the diagnosis and further treatment of misconceptions.

Binary items. Binary items are categorical variables typical for educational, psychological, and health-related measurements. They include the true/false items for checking a respondent’s knowledge or agree/disagree items for assessing a respondent’s attitudes or health status. In binary items, responses are typically rated as 1 for a true/correct response or endorsement of the key response and 0 otherwise.

Ordinal items. Most of the scales used to measure psychological and health-related constructs are ordinal scales. As an example, the BFI-2 inventory (Soto & John, 2017; Hřebíčková et al., 2020) asked the respondents to write a number next to each statement to indicate the extent to which they agree or disagree with a statement about their personality. The same Likert scale (1. Disagree strongly – 2. Disagree a little – 3. Neutral; no opinion – 4. Agree a little – 5. Agree strongly) was offered for all items. Ordinal items are also oftentimes present in educational measurements, where they allow for partial credit in partially correct answers.
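One way to keep the ordering of a Likert scale intact in R is to store the item as an ordered factor. The sketch below uses the five BFI-2 response labels quoted above; the six responses and the object names are hypothetical.

```r
# Hypothetical responses of six respondents to one Likert item (1-5)
resp <- c(4, 5, 2, 3, 4, 1)

likert_labels <- c("Disagree strongly", "Disagree a little",
                   "Neutral; no opinion", "Agree a little",
                   "Agree strongly")

# Ordered factor: categories keep their position on the scale
item <- factor(resp, levels = 1:5, labels = likert_labels, ordered = TRUE)

# Counts per category, listed in scale order
table(item)

# Comparisons respect the ordering of the categories
agrees <- item >= "Agree a little"
agrees
```

Because the factor is ordered, categories with zero counts still appear in the table and comparisons such as the one above are meaningful.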
Note that it is also possible to dichotomize the ordinal data, for example, in the BFI-2 to assign 0 to categories 1 and 2, and to assign 1 to categories 3, 4, and 5 (neutral, agree a little, or agree strongly). However, this will discard information about the trait level required for a respondent to move from the category "Disagree strongly" to "Disagree a little", which may be an important piece of information in the context of psychological research. It is also possible to select different cut-points, while always discarding the distinction between some of the subsequent categories.

Continuous items. Continuous items are seldom used in social sciences. One example of an instrument which employed continuous items is the Eysenck personality inventory (Ferrando, 2002). The responses were check marks on a 112 mm line segment, with 0 corresponding to "Almost never" and 112 corresponding to "Almost always". Other examples of continuous item data include those provided by instruments measuring item response time.

Mixed item formats. Mixed item formats may be present in a single instrument. This is typical for educational tests, where a number of item formats are usually used, resulting in different item data types. The most common are the already mentioned true/false items yielding binary data, multiple-choice items resulting in nominal item data, and partial-credit items producing ordinal item data. Other item formats include open-ended questions, and others. For analysis of data with mixed formats, the nominal data are typically scored as true/false and treated as binary items. Binary and ordinal items can be jointly analyzed with ordinal models. However, some IRT models allow for item-level specification of the variable types, as we discuss in more detail in Chapter 8.

Item scores in R. To demonstrate the scoring of nominal items in R, let’s present the original unscored dataset of HCI, available under the name HCItest. As this is the first time in this book when we provide sample R code, we remind the reader that Supplementary R code files are provided on the book webpage. As described in Appendix A, prior to any analysis, it is advised to run the code provided in file A-InstallPackages.R, in order to ensure all needed packages are installed. Code for the appendix and for each individual chapter may then be run from respective files, e.g., A-IntroductionToR.R or Ch1-Introduction.R. We first load the data using the data() function. We then display the first two rows using the head() function.

data(HCItest, package = "ShinyItemAnalysis")
head(HCItest, n = 2)
##   Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 ...
## 1      D      B      A      D      B      B      B      C      D ...
## 2      D      B      A      D      B      C      B      C      D ...

The counts and ratios of response categories can be explored using the functions table() and proportions(), the latter also called prop.table() in older versions of R. The counts and ratios are displayed for item 1 of the HCI, which concerns how negative feedback mechanisms can achieve physiological equilibrium.

table(HCItest$"Item 1")
##   A   B   C   D
##  27  59 110 455
proportions(table(HCItest$"Item 1"))
# prop.table(table(HCItest$"Item 1"))  # equivalent call in older versions of R
##      A      B      C      D
## 0.0415 0.0906 0.1690 0.6989

We see that answer D was selected most often (about 70% of the cases), which corresponds to the proportion of correct answers (see below). Using the answer key HCIkey, the test can be scored to obtain binary items, that is, answers that are either correct or incorrect. We must first load the HCIkey data containing the vector of the answer key. The unlist() function is used to convert the data frame into a vector.


data("HCIkey", package = "ShinyItemAnalysis")
unlist(HCIkey)
##  key1  key2  key3  key4  key5  key6  key7  key8  key9 key10 key11
##     D     B     A     D     B     C     C     C     D     A     A
## key12 key13 key14 key15 key16 key17 key18 key19 key20
##     D     A     A     C     A     C     C     C     D
## Levels: A B C D
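The logic of scoring nominal answers against a key can be sketched in base R on toy data. The answers, the key, and the object names below are hypothetical and are not taken from the HCI; the sketch only illustrates the comparison of each item with its key entry.

```r
# Hypothetical answers of three respondents to four multiple-choice items
answers <- data.frame(
  item1 = c("D", "A", "D"),
  item2 = c("B", "B", "C"),
  item3 = c("A", "A", "A"),
  item4 = c("C", "C", "D")
)
key <- c("D", "B", "A", "D")  # assumed answer key for the toy items

# Compare each column with its key entry: 1 = correct, 0 = incorrect
scored <- mapply(function(col, k) as.integer(col == k), answers, key)
scored

# Total score per respondent
total <- rowSums(scored)
total
```

Each column of the resulting matrix is a binary item, and the row sums give the total scores.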

The nominal data can be scored using the key2binary() function of the mirt package on item data, i.e., the first 20 variables of the HCItest dataset, and using the key of correct answers. HCIscored