The Sage Handbook of Survey Development and Application
ISBN 1529758491 (ISBN-10), 9781529758498 (ISBN-13)

Table of contents:
Half Title Page
International Advisory Board
Title Page
Copyright Page
About the Editors and Contributors
1: Introduction: Sage Handbook of Survey Development and Application
PART 1: Conceptual Issues and Operational Definition
2: A Framework for Evaluating and Creating Formal Conceptual Definitions: A Concept Explication Approach for Scale Developers
3: Group Concept Mapping for Measure Development and Validation
PART 2: Research Design Considerations
4: A Checklist of Design Considerations for Survey Projects
5: Principlism in Practice: Ethics in Survey Research
6: Sampling Considerations for Survey Research
7: Inductive Survey Research
8: Reduction of Long to Short Form Likert Measures: Problems and Recommendations
9: Response Option Design in Surveys
PART 3: Item Development
10: A Typology of Threats to Construct Validity in Item Generation
11: Measurement Models: Reflective and Formative Measures, and Evidence for Construct Validity
12: Understanding the Complexities of Translating Measures: A Guide to Improve Scale Translation Quality
13: Measurement Equivalence/Invariance Across Groups, Time, and Test Formats
PART 4: Scale Improvement Methods
14: Reliability
15: Validity
16: Item-level Meta-analysis for Re-examining (and Initial) Scale Validation: What Do the Items Tell Us?
17: Continuum Specification and Validity in Scale Development
18: Exploratory/Confirmatory Factor Analysis and Scale Development
19: The Use of Item Response Theory to Detect Response Styles and Rater Biases in Likert Scales
PART 5: Data Collection Methods
20: Utilizing Online Labor Pools for Survey Development
21: Digital Technology for Data Collection
22: Designing the Survey: Motivational and Cognitive Approaches
23: Preventing and Mitigating the Influence of Bots in Survey Research
PART 6: Data Management and Analysis
24: Multi-source Data Management
25: Data Wrangling for Survey Responses
26: Examining Survey Data for Potentially Problematic Data Patterns
27: Computerized Textual Analysis of Open-Ended Survey Responses: A Review and Future Directions
28: Qualitative Comparative Analysis in Survey Research
29: Open-ended Questions in Survey Research: The Why, What, Who, and How
PART 7: Research Production and Dissemination
30: Girding Your (Paper’s) Loins for the Review Process: Essential, Best, and Emerging Practices for Describing Your Survey
31: Communicating Survey Research to Practitioners
32: Dissemination via Data Visualization
PART 8: Applications
33: Scale Development Tutorial
34: Defining and Measuring Developmental Partnerships: A Multidimensional Conceptualization of Mutually Developmental Relationships for the Twenty-First Century
35: Development of the Generic Situational Strength (GSS) Scale: Measuring Situational Strength across Contexts
36: A Decision Process for Theoretically and Empirically Driven Scale Shortening using OASIS: Introducing the Gendered Communication Instrument – Short Form
37: The High-Maintenance Employee: Example of a Measure Development and Validation


The Sage Handbook of

Survey Development and Application

International Advisory Board

Kyle J. Bradley, Kansas State University
Gavin T. L. Brown, University of Auckland
Jeremy F. Dawson, University of Sheffield
Janaki Gooty, University of North Carolina at Charlotte
Ben Hardy, London Business School
Tine Köhler, University of Melbourne
Lisa Schurer Lambert, Oklahoma State University
Karen Locke, College of William and Mary
Serena Miller, Michigan State University
Carter Morgan, University of South Florida
Rhonda Reger, University of Missouri
Hettie A. Richardson, Texas Christian University
Chet Schriesheim, University of Miami
Debra Shapiro, University of Maryland

The Sage Handbook of

Survey Development and Application

Edited by

Lucy R. Ford and Terri A. Scandura

1 Oliver’s Yard
55 City Road
London EC1Y 1SP

2455 Teller Road
Thousand Oaks, California 91320

Unit No 323-333, Third Floor, F-Block
International Trade Tower
Nehru Place
New Delhi 110 019

8 Marina View
Suite 43-053 Asia Square Tower 1
Singapore 018960

Editor: Umeeka Raichura
Assistant Editor: Colette Wilson
Editorial Assistant: Benedict Hegarty
Production Editor: Neelu Sahu
Copyeditor: Martin Noble
Proofreader: Thea Watson
Indexer: KnowledgeWorks Global Ltd
Marketing Manager: Ben Sherwood
Cover Design: Ginkhan Siam
Typeset by KnowledgeWorks Global Ltd
Printed in the UK

At Sage we take sustainability seriously. Most of our products are printed in the UK using responsibly sourced papers and boards. When we print overseas we ensure sustainable papers are used as measured by the Paper Chain Project grading system. We undertake an annual audit to monitor our sustainability.

Editorial Arrangement & Introduction © Lucy R. Ford, Terri A. Scandura, 2023
Chapter 2 © Serena Miller, 2023
Chapter 3 © Scott R. Rosas, 2023
Chapter 4 © Jeffrey Stanton, 2023
Chapter 5 © Minna Paunova, 2023
Chapter 6 © Anna M. Zabinski, Lisa Schurer Lambert, Truit W. Gray, 2023
Chapter 7 © Kate Albrecht, Estelle Archibold, 2023
Chapter 8 © Jeremy D. Meuser, Peter D. Harms, 2023
Chapter 9 © Gavin T. L. Brown, Boaz Shulruf, 2023
Chapter 10 © North American Business Press, 2018
Chapter 11 © Lisa Schurer Lambert, Truit W. Gray, Anna M. Zabinski, 2023
Chapter 12 © Sheila K. Keener, Kathleen R. Keeler, Zitong Sheng, Tine Köhler, 2023
Chapter 13 © Changya Hu, Ekin K. Pellegrini, Gordon W. Cheung, 2023
Chapter 14 © Justin A. DeSimone, Jeremy L. Schoen, Tine Köhler, 2023
Chapter 15 © Chester A. Schriesheim, Linda L. Neider, 2023
Chapter 16 © Nichelle C. Carpenter, Bulin Zhang, 2023
Chapter 17 © Louis Tay, Andrew Jebb, 2023
Chapter 18 © Larry J. Williams, Andrew A. Hanna, 2023
Chapter 19 © Hui-Fang Chen, 2023
Chapter 20 © Peter D. Harms, Alexander R. Marbut, 2023
Chapter 21 © Bella Struminskaya, 2023
Chapter 22 © Truit W. Gray, Lisa Schurer Lambert, Anna M. Zabinski, 2023
Chapter 23 © Kristin A. Horan, Mindy K. Shoss, Melissa Simone, Michael J. DiStaso, 2023
Chapter 24 © John W. Fleenor, 2023
Chapter 25 © Michael T. Braun, Goran Kuljanin, Richard P. DeShon, Christopher R. Dishop, 2023
Chapter 26 © Michael T. Braun, Goran Kuljanin, Richard P. DeShon, Christopher R. Dishop, 2023
Chapter 27 © Sheela Pandey, Lars Arnesen, Sanjay K. Pandey, 2023
Chapter 28 © Thomas Greckhamer, 2023
Chapter 29 © Eric Patton, 2023
Chapter 30 © Michael C. Sturman, José M. Cortina, 2023
Chapter 31 © Marcia J. Simmering-Dickerson, Michael C. Sturman, Richard J. Corcoran, 2023
Chapter 32 © Zhao Peng, 2023
Chapter 33 © Terri A. Scandura, Lucy R. Ford, 2023
Chapter 34 © Ethlyn A. Williams, Stephanie L. Castro, Bryan J. Deptula, 2023
Chapter 35 © Ranran Li, Isabel Thielmann, Daniel Balliet, Reinout E. de Vries, 2023
Chapter 36 © Mary M. Hausfeld, Frankie J. Weinberg, 2023
Chapter 37 © David Joseph Keating, Jeremy D. Meuser, 2023

Apart from any fair dealing for the purposes of research, private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, this publication may not be reproduced, stored or transmitted in any form, or by any means, without the prior permission in writing of the publisher, or in the case of reprographic reproduction, in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publisher.

Library of Congress Control Number: 2023938601

British Library Cataloguing in Publication data
A catalogue record for this book is available from the British Library

ISBN 978-1-5297-5849-8

Contents

About the Editors and Contributors
Acknowledgements

1 Introduction: Sage Handbook of Survey Development and Application
Lucy R. Ford and Terri A. Scandura

PART 1: CONCEPTUAL ISSUES AND OPERATIONAL DEFINITION

2 A Framework for Evaluating and Creating Formal Conceptual Definitions: A Concept Explication Approach for Scale Developers
Serena Miller

3 Group Concept Mapping for Measure Development and Validation
Scott R. Rosas

PART 2: RESEARCH DESIGN CONSIDERATIONS

4 A Checklist of Design Considerations for Survey Projects
Jeffrey Stanton

5 Principlism in Practice: Ethics in Survey Research
Minna Paunova

6 Sampling Considerations for Survey Research
Anna M. Zabinski, Lisa Schurer Lambert and Truit W. Gray

7 Inductive Survey Research
Kate Albrecht and Estelle Archibold

8 Reduction of Long to Short Form Likert Measures: Problems and Recommendations
Jeremy D. Meuser and Peter D. Harms

9 Response Option Design in Surveys
Gavin T. L. Brown and Boaz Shulruf

PART 3: ITEM DEVELOPMENT

10 A Typology of Threats to Construct Validity in Item Generation
Lucy R. Ford and Terri A. Scandura

11 Measurement Models: Reflective and Formative Measures, and Evidence for Construct Validity
Lisa Schurer Lambert, Truit W. Gray and Anna M. Zabinski

12 Understanding the Complexities of Translating Measures: A Guide to Improve Scale Translation Quality
Sheila K. Keener, Kathleen R. Keeler, Zitong Sheng and Tine Köhler

13 Measurement Equivalence/Invariance Across Groups, Time, and Test Formats
Changya Hu, Ekin K. Pellegrini and Gordon W. Cheung

PART 4: SCALE IMPROVEMENT METHODS

14 Reliability
Justin A. DeSimone, Jeremy L. Schoen and Tine Köhler

15 Validity
Chester A. Schriesheim and Linda L. Neider

16 Item-level Meta-analysis for Re-examining (and Initial) Scale Validation: What Do the Items Tell Us?
Nichelle C. Carpenter and Bulin Zhang

17 Continuum Specification and Validity in Scale Development
Louis Tay and Andrew Jebb

18 Exploratory/Confirmatory Factor Analysis and Scale Development
Larry J. Williams and Andrew A. Hanna

19 The Use of Item Response Theory to Detect Response Styles and Rater Biases in Likert Scales
Hui-Fang Chen

PART 5: DATA COLLECTION METHODS

20 Utilizing Online Labor Pools for Survey Development
Peter D. Harms and Alexander R. Marbut

21 Digital Technology for Data Collection
Bella Struminskaya

22 Designing the Survey: Motivational and Cognitive Approaches
Truit W. Gray, Lisa Schurer Lambert and Anna M. Zabinski

23 Preventing and Mitigating the Influence of Bots in Survey Research
Kristin A. Horan, Mindy K. Shoss, Melissa Simone and Michael J. DiStaso

PART 6: DATA MANAGEMENT AND ANALYSIS

24 Multi-source Data Management
John W. Fleenor

25 Data Wrangling for Survey Responses
Michael T. Braun, Goran Kuljanin, Richard P. DeShon and Christopher R. Dishop

26 Examining Survey Data for Potentially Problematic Data Patterns
Michael T. Braun, Goran Kuljanin, Richard P. DeShon and Christopher R. Dishop

27 Computerized Textual Analysis of Open-Ended Survey Responses: A Review and Future Directions
Sheela Pandey, Lars Arnesen and Sanjay K. Pandey

28 Qualitative Comparative Analysis in Survey Research
Thomas Greckhamer

29 Open-ended Questions in Survey Research: The Why, What, Who, and How
Eric Patton

PART 7: RESEARCH PRODUCTION AND DISSEMINATION

30 Girding Your (Paper’s) Loins for the Review Process: Essential, Best, and Emerging Practices for Describing Your Survey
Michael C. Sturman and José M. Cortina

31 Communicating Survey Research to Practitioners
Marcia J. Simmering-Dickerson, Michael C. Sturman and Richard J. Corcoran

32 Dissemination via Data Visualization
Zhao Peng

PART 8: APPLICATIONS

33 Scale Development Tutorial
Terri A. Scandura and Lucy R. Ford

34 Defining and Measuring Developmental Partnerships: A Multidimensional Conceptualization of Mutually Developmental Relationships for the Twenty-First Century
Ethlyn A. Williams, Stephanie L. Castro and Bryan J. Deptula

35 Development of the Generic Situational Strength (GSS) Scale: Measuring Situational Strength across Contexts
Ranran Li, Isabel Thielmann, Daniel Balliet and Reinout E. de Vries

36 A Decision Process for Theoretically and Empirically Driven Scale Shortening using OASIS: Introducing the Gendered Communication Instrument – Short Form
Mary M. Hausfeld and Frankie J. Weinberg

37 The High-Maintenance Employee: Example of a Measure Development and Validation
David Joseph Keating and Jeremy D. Meuser

About the Editors and Contributors

THE EDITORS

Lucy R. Ford is Professor Emerita of Management at Saint Joseph’s University, having retired as an Associate Professor in 2023. From 2016 to 2021 she served as the Director of the Human Resources and People Management program. She also previously served on the faculty at Rutgers University. Her research interests include leadership, teams, and applied research methods. She has authored or co-authored over 60 presentations, articles, and book chapters. Her research has been published in Organizational Research Methods, Journal of Organizational Behavior, Human Performance, Journal of Occupational and Organizational Psychology, and Journal of Business and Psychology, amongst others. She has served on the board of the Research Methods Division of the Academy of Management as both a student representative and an elected member-at-large. She has also served as a member of the Board of Governors of the Southern Management Association. In 2012 she and her co-author received the Sage Publishing/Research Methods Division Best Paper Award for their work on item miscomprehension in surveys, and, also in 2012, she and her co-authors received the Emerald Citation Award for their 2008 scale development paper in the Journal of Occupational and Organizational Psychology. She has also served on the editorial board of Group and Organization Management. She has presented Executive Education programs on Leadership, Leading Change, Teams, Conflict, Systems Thinking, Legal Issues in HR, Performance Management, and other topics to numerous organizations.

Terri A. Scandura is currently the Warren C. Johnson Chaired Professor of Management in the Miami Herbert Business School at the University of Miami. From 2007 to 2012, she served as Dean of the Graduate School of the University. Her fields of interest include leadership, mentorship, and applied research methods. She has been a visiting scholar in Japan, Australia, Hong Kong, China and the United Arab Emirates. Scandura has authored or co-authored over two hundred presentations, articles and book chapters. Her research has been published in the Academy of Management Journal, the Journal of Applied Psychology, the Journal of International Business Studies, the Journal of Vocational Behavior, the Journal of Organizational Behavior, Educational and Psychological Measurement, Industrial Relations, Research in Organizational Behavior and Research in Personnel and Human Resource Management, among others. Her book, Essentials of Organizational Behavior: An Evidence-based Approach, 2nd edition, is published by Sage. She has presented Executive Education programs on leadership, mentoring, leading change, and high-performance teams to numerous organizations such as VISA International, Royal Caribbean Cruise Lines, Bacardi, Hewlett-Packard, and Baptist Health Systems. Scandura is a Fellow of the American Psychological Association, the Society for Industrial & Organizational Psychology, and the Southern Management Association. She is a member of the invitation-only Society of Organizational Behavior (SOB) and the Academy of Management. She received the Distinguished Career Award from the Research Methods Division of the Academy of Management, and the Jerry (J.G.) Hunt Sustained Service Award from the Southern Management Association. She is a past Associate Editor of Group & Organization Management, the Journal of International Business Studies, the Journal of Management, and Organizational Research Methods. She currently serves on editorial boards for major journals.



THE CONTRIBUTORS Kate Albrecht is an Assistant Professor at University of Illinois-Chicago in the Department of Public Policy, Management, and Analytics, College of Urban Planning and Public Affairs. Her primary focus as a public management scholar is on networks and collaborative governance. Kate utilizes mixed-methods approaches to examine aspects of boundary management and partnerships between nonprofit and public organizations. Kate is a network methodologist, utilizing both inductive and deductive approaches to advance a broader understanding of how collaborative governance arrangement may evolve over time. Estelle Archibold is a Postdoctoral Scholar at The Pennsylvania State University in the Department of Management and Organizations, Smeal College of Business. Estelle’s research agenda includes research in conflict and cooperation. Estelle applies embodiment, practice, and process theories to her study of these phenomena and uses qualitative as well as mixed methods in her research. Having had a thriving professional career in conflict management and reconciliation practices in the U.S. and abroad, Estelle draws insight from her experiences to study the role of race and intergenerational dynamics in conflict and cooperation. Lars Arnesen is a Policy Analyst at the Federal Reserve Board of Governors of the United States. He is also a Ph.D. student in Public Administration and Public Policy at The George Washington University, where he received his M.A. in Public Administration. Lars’s research interests include the philosophy of social science research methodology and bureaucratic red tape. The views in the chapter do not represent the views of the Federal Reserve Board of Governors. Daniel Balliet is Professor at the VU University Amsterdam, Netherlands. His research focuses on understanding Human Cooperation. He applies experiments, field studies, and meta-analysis to test evolutionary and psychological theories of cooperation. 
His work addresses issues related to (a) how people think about their interdependence in social interactions, (b) how people condition their cooperation to acquire direct and indirect benefits, and (c) understanding cross-societal variation in cooperation. He is the Principal Investigator of the Amsterdam Cooperation Lab, and the recipient of an ERC Starting Grant (2015–2020) and ERC Consolidator Grant (2020–2025). Michael T. Braun is an Assistant Professor of Organizational Behavior and Human Resource Management in the Driehaus College of Business at DePaul University. He researches team knowledge emergence and decision making, emergent leadership, team cohesion, and modeling multilevel dynamics. He is currently a co-investigator on a grant funded through the Army Research Institute (ARI), as well as serving as a senior consortium fellow for ARI. He is a winner of the 2016 Emerald Group Publishing Citations of Excellence and is the recipient of the 2015 Owens Scholarly Achievement Award as well as the 2013 Organizational Research Method Best Paper Award for work integrating multilevel theory and computational modeling. Braun received his B.A. in Psychology from Purdue University (2006, full honors) and his M.A. (2009) and Ph.D. with a concentration in Quantitative Methodology and Evaluation Science (2012) from Michigan State University. Gavin T. L. Brown is Professor in the Faculty of Education and Social Work at the University of Auckland, New Zealand where he is the Director of the Quantitative Data and Research Unit. He holds joint appointments at Umeå University, Sweden and the Education University of Hong Kong. His research focuses on the design of assessment and the psychology of assessment among teachers and students across cultures. He is the Chief Section Editor for Assessment, Testing, and Applied Measurement for Frontiers in Education. Nichelle C. 
Carpenter is an Associate Professor in the Human Resources Management department in the School of Management and Labor Relations at Rutgers University–New Brunswick. Nichelle has published articles in outlets such as Journal of Applied Psychology, Journal of Management, Personnel Psychology, Organizational Research Methods, and Leadership Quarterly. Her primary research areas include counterproductive work behavior, meta-analysis, and measurement issues. Stephanie L. Castro is Associate Professor at Florida Atlantic University. Her research interests include leadership, mentoring, and research methods. Former Academy of Management Research Methods



Division Chair. Publications include Human Resource Management Journal, Human Resource Management, Journal of Leadership and Organizational Studies, Leadership Quarterly, and Journal of Applied Psychology (among others). Hui-Fang Chen is an Associate Professor in the Department of Social and Behavioural Sciences in the City University of Hong Kong. She completed her Ph.D. degree in the program of Quantitative Research Methods from the University of Denver in the United States and obtained a Master’s and bachelor’s degree from the Department of Psychology in the National Chung Cheng University in Taiwan. Her research interests include response behaviors in survey questionnaires, latent model analysis, and applied measurement in social sciences and rehabilitation. She has been serving as a reviewer for several prestigious journals and an active member of the American Educational Research Association and the Psychometric Society. Gordon W. Cheung is currently a Professor of Organizational Behavior at the University of Auckland Business School. He is internationally recognized as an expert in structural equation modeling, especially in measurement equivalence/invariance, analysis of dyadic data, and estimation of moderating and mediating effects in complex latent variable models. He is ranked among the world’s top 100,000 scientists in a study by scholars at Stanford University (Bass, Boyack, & Ioannidis, 2022), according to their careerlong impact based on a composite indicator using Scopus data. Gordon has twice received the Sage Best Paper Award from the Research Methods Division of the Academy of Management (2000 and 2009) and, in 2008, the Best Published Paper Award in Organisational Research Methods. He served as the Division Chair of the Research Methods Division of the Academy of Management in 2006/2007. Richard J. 
Corcoran, CEO of RJ Corcoran & Associates LLC, is a human resource professional with international and domestic experience in consumer products, retail, and not-for-profit associations in Washington, DC. José M. Cortina is a Professor of Management and Entrepreneurship in the School of Business at Virginia Commonwealth University. He is a past Editor of Organizational Research Methods and Associate Editor of the Journal of Applied Psychology. Dr. Cortina was honored by SIOP with the 2001 Distinguished Early Career Contributions Award and the 2011 Distinguished Teaching Award, GMU with a 2010 Teaching Excellence Award, and by AOM’s Research Methods Division with the Distinguished Career Award in 2020. Dr. Cortina recently served as President of SIOP. Among his current research interests are the improvement of research methods in the organizational sciences and the use of restricted variance to hypothesize and test interactions. Bryan J. Deptula is a former Associate Professor of Leadership. Bryan is CEO of Canalside Inn, an award-winning hospitality and professional development retreat. Bryan’s TEDx talk, “Leaders Are Born To Be Made,” gives wisdom about how to make leaders of all people, and create value through collaboration with others. Richard P. DeShon is an Organizational Psychologist specializing in individual, team, and organizational effectiveness. He is a fellow of both the Society for Industrial and Organizational Psychology and the Association for Psychological Science and a member of the Academy of Management. His effectiveness and assessment research have been funded by the National Science Foundation, NASA, and the Department of Defense. He recently (March 2022) retired from Michigan State University and is now focused on working directly with organizations in his leadership role at Coetic. Justin A. DeSimone is currently an Associate Professor of Management at the University of Alabama. He received his Ph.D. from the Georgia Institute of Technology. 
Justin’s research interests include psychometrics, statistics, and research methodology. Justin has served as an Associate Editor for Organizational Research Methods and Journal of Organizational Behavior, is on the editorial review boards of Journal of Management and Journal of Business and Psychology, and is on the executive committees for the Research Methods Division and CARMA. Christopher R. Dishop received his Ph.D. from Michigan State University and is currently a PostDoctoral Fellow of Organizational Behavior and Theory within Tepper School of Business at Carnegie Mellon University. His work has been published in the Journal of Business Research, Psychological



Methods, and Organizational Psychology Review. He studies how employees collaborate in distributed settings, the situational and psychological antecedents of cooperation, and the underappreciated hazards of various people-analytic techniques. Michael J. DiStaso is a doctoral candidate at the Industrial Organizational Psychology Program at the University of Central Florida. He is interested in studying events and experiences that workers perceive to be stressful, with a special focus on workers’ expectations and worries about future stressful situations. Specifically, his research focuses on employees’ experiences related to interpersonal mistreatment, stressful aspects of helping, and employees’ thoughts about future work events. John W. Fleenor is a Senior Fellow at the Center for Creative Leadership (CCL), where he conducts research on new and innovative products, including multi-rater feedback surveys. He has published extensively in peer-reviewed journals, including Journal of Applied Psychology, Personnel Psychology, and Leadership Quarterly. He was co-editor of the Handbook of Strategic 360 Feedback (Oxford, 2019), and the former book review editor of Personnel Psychology. He serves on the editorial boards of Leadership Quarterly, Human Resource Management, and Journal of Business and Psychology. Dr. Fleenor is a Fellow of the Society for Industrial and Organizational Psychology (SIOP). He is a recipient of numerous awards including the Hogan Award for Personality and Work Performance (2022), Personnel Psychology Best Paper Award (2017), and SIOP Best International Paper Award (2017). His Ph.D. is in Industrial/ Organizational Psychology from North Carolina State University, where he served as an Adjunct Associate Professor of Psychology for 10 years. Truit W. Gray is an Assistant Professor of Management in the Allen W. and Carol M. Schmidthorst College of Business at Bowling Green State University. He completed is Ph.D. 
in business administration with a concentration in management from Oklahoma State University. His primary research interests include social status, emotions, teams and leadership, and research methods. Regarding research methods, he is particularly interested in study design and measurement. Thomas Greckhamer is the William W. Rucks IV Endowed Chair and Professor of Management at Louisiana State University. He earned his Ph.D. from the University of Florida. His research focuses on configurational and discourse-oriented approaches to strategic management, qualitative research methodology, and qualitative comparative analysis. His theoretical and empirical research has been published in leading journals such as Strategic Management Journal, Academy of Management Review, Journal of Management, Organization Science, Organization Studies, and Long Range Planning. His methods-­ oriented work has been published in leading research methods journals such as Organizational Research Methods, Qualitative Inquiry, Qualitative Research, and Field Methods. Andrew A. Hanna is an Assistant Professor of Management and Entrepreneurship at the University of Nebraska–Lincoln. He is father to two young women and enjoys his time with them above all else. His interest in research methods began as a doctoral student assistant with CARMA, helping with the monthly broadcasts and annual short courses. Professionally, his research interests include the formation and impact of leadership perceptions, multi-team membership, and latent variable methodology. To date, his work has been published in outlets such as Journal of Management and Personnel Psychology. Peter D. Harms is the Frank Schultz Endowed Professor of Management at the University of Alabama. His research focuses on the assessment and development of personality, leadership, and psychological well-being. He has published over 130 peer-reviewed articles. Dr. Harms was selected as one of the St. 
Gallen Symposium’s “100 Knowledge Leaders of Tomorrow” and is a fellow of both the Society for Industrial and Organizational Psychology (SIOP) and the Association for Psychological Science. He received the 2021 Midcareer Standout Scholar Award from the Network of Leadership Scholars, the Academy of Management’s Sage/Robert McDonald Advancement of Organizational Research Methodology Award, and is a two-time winner of SIOP’s Joyce and Robert Hogan Award for Personality and Work Performance Paper of the Year. He currently serves as an editor at Research in Occupational Stress and Well-Being and The Journal of Managerial Psychology. Mary M. Hausfeld is a post-doctoral research associate in organizational behavior at the University of Zurich. She received her Ph.D. from the University of North Carolina at Charlotte. Her scholarly interests include women in leadership, measurement, research methodology, and the future of work.



Kristin A. Horan was an Assistant Professor in the Industrial Organizational Psychology Program at the University of Central Florida at the time of writing and revising this chapter. She is currently an Assistant Professor in the Department of Psychological Science at Kennesaw State University. She is broadly interested in occupational health psychology topics, as her research interests include establishing best practices in the design, implementation, and evaluation of interventions to improve employee health, occupational safety and health in vulnerable occupations and populations, and the relationship between work and health behaviors. She uses surveys frequently in her research to assess work and health-related attitudes and experiences among employees. Changya Hu is a Distinguished Professor and the Director of the M.B.A. program at the National Chengchi University, Taiwan. She received her Ph.D. in Industrial and Organizational Psychology from the University of Georgia. Her research interests include mentoring, leadership, employee well-being, and measurement equivalence/invariance. Andrew Jebb is a psychologist whose research has focused on statistics (Bayesian methods, time-based analyses), psychological measurement (scale development and validation), and well-being (predictors of happiness). He has a Ph.D. in Industrial/Organizational psychology from Purdue University. David Joseph Keating is currently a Doctoral Candidate at the University of Mississippi. David received his bachelor’s degree in Entrepreneurship with a minor in Marketing at the University of Illinois–Chicago. He earned an M.B.A. with a double concentration in Management & Marketing at Liautaud Graduate School of Business at UIC. In a professional capacity David worked in wine and liquor management and consulting, at Walgreens Corporate in search engine marketing, and was the founding partner of the restaurant Bridges & Bourbon in Pittsburgh, PA. 
David researches the “dark” and “light” sides of organizational behavior. His research on negative work behavior, employee emotions, and prosocial behavior has been published in the Academy of Management Journal and The Journal of Management. Kathleen R. Keeler is an Assistant Professor of Management and Human Resources at the Max M. Fisher College of Business at The Ohio State University. She received both her B.S. degree in Psychology and her M.A. degree in Industrial/Organizational Psychology from George Mason University. She received her Ph.D. in Organizational Behavior/Human Resources Management from Virginia Commonwealth University. Her main research interests are quantitative research methods, music in the workplace, and employee wellbeing. Her work has been published in the Academy of Management Review, Journal of Applied Psychology, Journal of Management, Organizational Research Methods, and Psychological Methods. Sheila K. Keener is an Assistant Professor of Management in the Strome College of Business at Old Dominion University. She received her B.A. in Psychology at Temple University, her M.S. in Industrial/ Organizational Psychology at Radford University, and her Ph.D. in Organizational Behavior/Human Resource Management at Virginia Commonwealth University. She previously worked as test developer for several licensure and certification exams. Her main research interests include research methods, research integrity, and testing and employee selection. Her work has been published in outlets such as the Journal of Applied Psychology, Organizational Research Methods, and the Journal of Organizational Behavior. Tine Köhler is Professor for International Management at the University of Melbourne, Australia. She received her undergraduate degree from the Philipps-University Marburg (Germany) and her M.A. and Ph.D. degrees from George Mason University (USA). 
Her main research interests include cross-cultural management, cross-cultural communication and coordination, group processes, qualitative research methods, research design, meta-analysis, and regression. Her work has been published in Organizational Research Methods (ORM), Journal of International Business Studies, Journal of Management, Psychological Methods, and Human Resource Management (among others). She is Co-Editor-in-Chief at ORM and previously held Associate Editor roles at ORM and Academy of Management Learning and Education (AMLE). She serves on the editorial boards of Journal of Management Studies, Journal of Management Scientific Reports, AMLE, Research Methods in Strategy and Management, Small Group Research, and Journal of Management Education. Goran Kuljanin serves as an Assistant Professor in the Department of Management and Entrepreneurship at DePaul University. His research focuses on developing computational process theories and models to
investigate human resources management (HRM) and organizational behavior (OB) processes and the emergence of HRM and OB phenomena, describing the utility of data science to advance HRM and OB research and practice, and studying teams and networks of teams with respect to composition, processes, and emergent states and outcomes. He has published his research in The Journal of Applied Psychology, Psychological Methods, and Organizational Research Methods. His research awards include Best Article in Organizational Research Methods in 2013, the 2015 William A. Owens Scholarly Achievement Award from SIOP, and a Monograph Distinction from The Journal of Applied Psychology. As a co-investigator, he has won multiple grants from the U.S. Army Research Institute for the Behavioral and Social Sciences. Lisa Schurer Lambert is a Professor and a William S. Spears Chair of Business. Her scholarship has focused on the employment relationship, leadership, psychological contracts, person-environment fit theory, and research methods and has been published in The Academy of Management Journal, Journal of Applied Psychology, Personnel Psychology, Organizational Behavior and Human Decision Processes, Organizational Research Methods, and Psychological Methods. Dr. Lambert has served as Chair of the Research Methods Division of Academy of Management and is the current President of the Southern Management Association. She is Co-Editor for Organizational Research Methods and has served on editorial boards for the Academy of Management Journal, Journal of Applied Psychology, Journal of Management, Organizational Behavior and Human Decision Processes, and The Journal of Business and Psychology. She is a frequent instructor for the Consortium for the Advancement of Research Methods and Analysis (CARMA) and SMA Fellow. Ranran Li is a Ph.D. student in Personality Psychology at the Vrije Universiteit Amsterdam, Netherlands. 
She is broadly interested in personality structure and assessment, psychology of situations, and specifically in person-situation interactions (in prosocial and unethical behaviors in particular). By examining situational strength and trait activation in economic games, daily lives, and organizational settings, she aspires to contribute to research on the effects of personality in and on situations. She is also a fan of computational social science and looks forward to applying advanced quantitative methods to this important line of research. Alexander R. Marbut is a Ph.D. student at the University of Alabama. He received his M.A. in Experimental Psychology from the University of Alabama in Huntsville. His research involves occupational power dynamics and interpersonal perceptions, invoking theories of personality, leadership, and statistical conclusion validity to address practical issues in personnel selection and development. Jeremy D. Meuser was selected as a 2021 Western Academy of Management Ascendant Scholar, a prestigious award given to the very best early career scholars in Management, recognizing excellence in research, service, and teaching. He brings an eclectic background to the study of leadership. He has published in The Academy of Management Journal, Annual Review of Organizational Psychology and Organizational Behavior, The Journal of Management, The Leadership Quarterly, Journal of Management Studies, Journal of Managerial Psychology, Business Horizons, and The Oxford Handbook of Leadership in Organizations. He is an Associate Editor for leadership articles at Group and Organization Management and The Journal of Managerial Psychology. He also serves on the editorial review board for The Leadership Quarterly. Dr. Meuser’s teaching experience includes leadership development. He currently serves as Co-president of the Network of Leadership Scholars. 
Serena Miller is an Associate Professor at Michigan State University, a Methodology Editor for Review of Communication Research, and a former Associate Editor for Journalism Studies. Informed by mixed methods research, her general research interests focus on the sociology of knowledge, social science theory building, content analysis, metascience, public engagement, and alternative forms of journalism. She enjoys targeting concepts in need of conceptual and empirical specification. Miller seeks to make social science more accessible. In MSU’s College of Communication Arts & Sciences, she teaches scale development, survey research, content analysis, and social science theory building. Linda L. Neider is a tenured Professor of Management in the Miami Herbert School of Business at the University of Miami with specializations in organizational behavior and human resource management. She currently serves as Chair, Department of Management having served with distinction in a number of senior administrative roles including Vice Dean for Internal University Relations, Global/Cross-Disciplinary Initiatives and Undergraduate Programs; Vice Dean for Undergraduate Business Programs; Vice Dean for Faculty Affairs in the School of Business Administration; Co-Director of the Masters in Technology
Management; and Director of the Ph.D. Program in Industrial/Organizational Psychology. Professor Neider has won over two dozen outstanding teaching awards, including the University of Miami Excellence in Teaching Award and the Herbert Miami School of Business Teaching Award for Tenured Faculty. Her scholarly work is primarily in the leadership area where her contributions may be viewed in The Academy of Management Journal, Organizational Behavior and Human Decision Processes, and Leadership Quarterly, to name a few. She is the co-author of 11 research in management books and one textbook. Professor Neider earned her Ph.D., M.B.A, and M.A. degrees from the State University of New York at Buffalo. Sanjay K. Pandey is Shapiro Professor of Public Policy and Public Administration at the Trachtenberg School, The George Washington University. His scholarship focuses on public administration and public policy, dealing with questions central to leading and managing public organizations. Sheela Pandey is Associate Professor of Management in the School of Business Administration at Pennsylvania State University Harrisburg. Her scholarship focuses on social entrepreneurship, leadership, and strategic management of for-profit, nonprofit, and public organizations. Eric Patton is Professor of Management at Saint Joseph’s University. A native of Montreal, Canada, Dr. Patton’s research focuses on absence from work, gender issues in management, and workplace disabilities. His research has been published in several journals including The Journal of Occupational & Organizational Psychology, Human Relations, Personnel Review, Equality Diversity & Inclusion, The Journal of Management History, and The Journal of Workplace Behavioral Health. Dr. Patton is a member of the Academy of Management and the Society for Human Resource Management, and was the founding director of SJU’s Human Resources & People Management undergraduate program. 
Minna Paunova is Associate Professor of Cross-Cultural Communication and Management at Copenhagen Business School. Her research explores multiculturalism and global inequality at work. She studies national, cultural, and linguistic diversity, particularly related to interpersonal and small-group communication and leadership, employing mostly survey methods, but also archival, experimental, and observational tools. Minna teaches methods to business students at all levels. Ekin K. Pellegrini is an Associate Professor in Global Leadership and Management at the University of Missouri–St. Louis. She currently serves as Associate Dean for Graduate Business Programs, Founding Director of the DBA program, and Board Member and Treasurer for the Executive DBA Council (EDBAC). Her scholarly work is primarily in the leadership area where her contributions may be viewed in the Journal of International Business Studies, Journal of Management, and Journal of Vocational Behavior. Zhao Peng is an Assistant Professor in the Department of Journalism at Emerson College. She conducts research in the domains of data visualization, online privacy literacy, computational journalism, and topic modeling. She uses both quantitative and qualitative approaches to address how communication scientists can present their findings more engagingly and interactively, how communication practitioners can apply DEI in data visualization, how news audiences will interact with, perceive, and consume algorithmic media, and how to design campaign messages to improve parents’ and children’s privacy literacy. Dr. Peng received her Ph.D. from Michigan State University’s School of Journalism. She also obtained a Master’s degree in Statistics in addition to her doctoral degree. Scott R. Rosas is currently the Director of Research and Evaluation at Concept Systems, Inc (CSI) where he specializes in the design and use of the group concept mapping methodology. 
There he leads the applied investigation of complex problems in the behavioral and social sciences. Much of his work has focused on conceptualization and measurement in evaluation using group concept mapping, with attention to the validity and reliability of the approach. He obtained a Ph.D. in Human Development and Family Studies from the University of Delaware, with an emphasis on program evaluation. He currently is an Adjunct Assistant Professor at Upstate Medical University in Syracuse, NY, where he teaches both qualitative and quantitative research methods. Jeremy L. Schoen received his Ph.D. from the Georgia Institute of Technology. He is currently an Assistant Professor in the College of Business at Tennessee Tech University. His interests are research methods, implicit personality, and creativity. Jeremy's research has been published in top journals in the organizational sciences such as Organizational Research Methods and Journal of Management.
Additionally, he has served on the editorial review board of Organizational Research Methods, Journal of Management, and Journal of Vocational Behavior. Chester A. Schriesheim is the University of Miami Distinguished Professor of Management, Emeritus. Professor Schriesheim is the author or co-author of over 200 books, book chapters, and articles. He has been most active in the areas of leadership, power and influence, and applied psychometrics and statistics, and his articles have appeared in such outlets as The Journal of Applied Psychology, The Academy of Management Journal, The Academy of Management Review, The Journal of Management, Organizational Research Methods, The Leadership Quarterly, and Psychological Bulletin. He is a recipient of the Academy of Management Research Methods Division Distinguished Career Award, a Fellow of the American Psychological Association, and named in Who’s Who in the World. Zitong Sheng is a Senior Lecturer at the Department of Management, Deakin University, Australia. She received her B.S. in Psychology and B.A. in Economics at Peking University (China), M.A. in Industrial and Organizational Psychology at George Mason University, and Ph.D. in Management from Virginia Commonwealth University. Her main research interests include quantitative research methods, proactive work behavior, and counterproductive work behavior. Her work has been published in the Journal of Applied Psychology, Journal of Business and Psychology, and Journal of Vocational Behavior, among others. Mindy K. Shoss is a Professor of Psychology at the University of Central Florida and an Honorary Professor at the Peter Faber Business School at Australian Catholic University. Dr. Shoss conducts research in the areas of work stress, counterproductive work behavior, job insecurity, adaptability, and interpersonal interactions at work. 
She is particularly interested in the impact of economic conditions and the changing nature of work on employee well-being and behavior. Overall, her research takes a contextual perspective toward understanding employee well-being and behavior. Boaz Shulruf is Professor in Medical Education at the University of New South Wales, Sydney, Australia. Prof. Shulruf is the Head of Assessment at the Faculty of Medicine and Health and the Chair of the Faculty’s Selection and Re-admission Committee. His main research interest is in psycho-educational assessment and measurement, particularly within the context of medical and health professions education. Marcia J. Simmering-Dickerson is the Frances R. Mangham Professor of Management in the College of Business at Louisiana Tech University. She holds a B.B.A. from the University of Iowa and a Ph.D. from Michigan State University. Her research is primarily focused on research methods and human resources topics and has been published in Organizational Research Methods, The Journal of Applied Psychology, and The Academy of Management Journal. Marcia teaches management, HR, and research methods courses at undergraduate and graduate levels. Marcia is also an independent consultant who conducts employee engagement surveys, delivers leader development courses, and provides executive coaching. She holds the SPHR and SHRM-SCP certifications and is a certified Executive Business Coach. Melissa Simone is a Postdoctoral Research Fellow in the Department of Psychiatry and Behavioral Sciences and Division of Epidemiology and Community Health at the University of Minnesota. Their quantitative research focuses on data integrity in online research and longitudinal mediation methods. Their substantive research focuses on socioecological processes linked with health equity among queer and trans young people, with a particular focus on population-specific eating disorder risk and resilience factors in these groups. 
Jeffrey Stanton is a Professor and the Director of the Center for Computational and Data Science at Syracuse University’s School of Information Studies. Stanton was trained in personnel and organizational psychology and is a fellow of the Society for Industrial and Organizational Psychology and the Association for Psychological Science. He teaches in the areas of data science, statistics, and research methods. His research applies machine learning and natural language processing techniques to social science research problems. He has published five books – including the Sage textbook, Data Science for Business with R – and more than 50 peer-reviewed journal articles. Bella Struminskaya is an Assistant Professor of Methods and Statistics at Utrecht University and an affiliated researcher at Statistics Netherlands. Her research focuses on the design and implementation of online, mixed-mode and smartphone surveys, and passive data collection. She has published on
augmenting surveys with mobile apps and sensors, data quality, nonresponse and measurement error, including panel conditioning, and device effects. Michael C. Sturman is a Professor of Human Resource Management in Rutgers' School of Management and Labor Relations. His work focuses on the prediction of individual job performance over time, the influence of compensation systems, research methods, and the use of HR Analytics and Metrics to improve HR decision making. Louis Tay is the William C. Byham Professor of Industrial-Organizational Psychology at Purdue University. His substantive research interests include well-being, character strengths, and vocational interests. His methodological research interests include measurement, item response theory, latent class modeling, multilevel analysis, and data science. He is a co-editor of the books Big Data in Psychological Research, Handbook of Well-Being, and The Oxford Handbook of the Positive Humanities. He is also the founder of the tech-startup ExpiWell, which advances the science and capture of daily life experiences through experience sampling and ecological momentary assessment. Isabel Thielmann is a Research Group Leader at the Max Planck Institute for the Study of Crime, Security and Law in Freiburg, Germany. Her research mostly focuses on understanding individual differences in prosocial and (un)ethical behaviors, combining concepts and theories from personality psychology with methods from social psychology, criminology, and behavioral economics. Moreover, she is interested in personality assessment. She is the recipient of an ERC Starting Grant (2022-2027) investigating the influence of increasing self-knowledge on prosocial and ethical behavior. Reinout E. de Vries is Professor at the VU University Amsterdam, Netherlands. His main research interests are in the areas of personality, communication styles, and leadership.
Recent work has focused on the construction of a six-dimensional Communication Styles Inventory (CSI), a Brief HEXACO personality Inventory (BHI), the relation between Impression Management and Overclaiming and HEXACO personality, and on the relation between self- and other-rated HEXACO personality on the one hand and leadership, proactivity, impression management, and overclaiming on the other. Frankie J. Weinberg is an Associate Professor of Management and holds the Dean Henry J. Engler, Jr. Distinguished Professorship in the College of Business at Loyola University New Orleans. He received his Ph.D. from the University of Georgia. His scholarly interests include dyadic and team-level interactions at work including mentoring, leadership, and social networks, as well as team and organizational diversity and scale development. Ethlyn A. Williams is Associate Professor of Management at Florida Atlantic University. Her research interests include mentoring and leadership. She serves on the editorial boards of Group and Organization Management and The Leadership Quarterly. Her research appears in the Academy of Management Journal, Personality and Individual Differences, the Leadership Quarterly, and Journal of Vocational Behavior (among others). Larry J. Williams is the James C. and Marguerite J. Niver Chair in Business and Professor of Management in the Rawls College of Business at Texas Tech University. He is also director of the Consortium for the Advancement of Research Methods and Analysis (CARMA). He received his Ph.D. in organizational behavior from Indiana University. His research interests include the application of structural equation methods to various substantive and methodological concerns. Anna M. Zabinski is an Assistant Professor of Management in the College of Business at Illinois State University. She completed her Ph.D. in Business Administration with a concentration in organizational behavior from Oklahoma State University. 
Her primary research areas include person-environment fit, social exchange, boredom, and research methods. Within research methods, she is particularly interested in scale development, study design, and polynomial regression and response surface analysis. Bulin Zhang is an Assistant Professor of Management at the College of Business, University of Northern Iowa. She earned her Ph.D. degree in Industrial Relations and Human Resources from the School of Management and Labor Relations, Rutgers University. Her primary research interests include platform work, strategic human resource management, and unethical behaviors.

Acknowledgements We are eternally grateful to the authors who contributed to this Handbook, and to our stellar global advisory board, who freely gave significant time, often with little notice, to reviewing and advising the editors. Without all of you, this volume would not exist. We would also like to thank the editors and publishers at Sage for their ever-patient and supportive guidance as we put this Handbook together. We are grateful to Alysha Owen for commissioning the volume, and to Umeeka Raichura, who is now our editor. We ran into a lot of challenges along the way, and Colette Wilson was always there with help and advice. We are also very grateful to Saint Joseph's University for providing Lucy Ford with sabbatical leave, expressly for the purpose of working on this volume. We would also like to acknowledge the generous efforts of a group of special readers whose input was very useful for our authors: Anson Seers, Changmeng Xu, Tina Taksida, Lindsey Greco, and Johnna Capitano. Closer to home, the editors would like to thank their families for putting up with their incessant need to work! Lucy R. Ford and Terri A. Scandura Editors, The Sage Handbook of Survey Development and Application
1 Introduction: Sage Handbook of Survey Development and Application Lucy R. Ford and Terri A. Scandura

Welcome to The Sage Handbook of Survey Development and Application. The Handbook aims to provide a practical resource that researchers can go to for cutting-edge tools to ensure that they are employing the best survey research techniques. Our intention is to deliver a relatively comprehensive set of chapters that cover the entire survey development process from conceptualization to dissemination. Further, we aim to provide the reader with practical examples of survey development and redevelopment studies that use best practices. Finally, it is our intent to speak to a wide range of academic disciplines – to that end our authorship is drawn from numerous areas including organizational studies, psychology, marketing, communications, education, and political science, to name but a few. As we compiled this Handbook and scoured the globe for qualified authors, we sought to include voices from all continents, across many relevant disciplines. While the editors are from similar academic backgrounds, we recognize that survey development is not a settled subject in any field, and that these disciplines can learn from one another. We also expanded our focus to include the dissemination of results not just to the academic community, but to the practitioner community as well. Many of our chapters will be ideal reading assignments for doctoral students, while also providing a refresher into best practices for
more experienced researchers. In addition, we have included some relatively new methods with which readers may not yet be familiar. We suggest that you select material to read in the Handbook that will augment and supplement your existing knowledge, as you design and administer surveys. We hope that this will become a major reference for survey development in the years to come. This introductory chapter positions what follows in the 36 subsequent chapters by highlighting some of the challenges faced by scholars. We will also introduce each part of the Handbook to offer the reader a high-level overview of the contents. The chapters have been arranged into eight logical parts that address different aspects of the survey development process, plus a final section that provides a tutorial highlighting best practices, and four studies that employ those best practices. These four studies represent exemplars of the development and refinement of new survey measures.

THE BROADER CHALLENGE Sound survey development is the backbone for sound empirical research in the social sciences. Nearly 40 years ago, Schwab (1980) lamented the
lack of attention to the fundamentals of establishing construct validity in Organizational Behavior research due to the haphazard manner in which surveys were developed. We have learned that this is not only the case for Organizational Behavior but is a challenge for other fields of study as well. We believe that some progress has been made with respect to improvements in survey development, and yet, many students and researchers may not be aware of these advances. Moreover, tools for best practices in measurement and survey administration are scattered across various publications such as textbooks, handbooks, and research methods journals, and in most cases are incomplete. The editors reviewed numerous Handbooks and other texts on survey development and application, and found that there were notable elements missing. For example, DeVellis (2017) is heavily relied upon by organizational scholars, despite the fact that it is not an end-to-end guide to survey development. Other more comprehensive volumes, such as The Sage Handbook of Measurement (Walford, Tucker, & Viswanathan, 2010), focus on only one aspect of survey research (in this case, measurement). Despite the important content found in these volumes, the references are dated, given the advances in survey development knowledge in recent years. Further, some journals have moved away from publishing survey development practices in favor of stronger contributions to theory, making it less likely that scholars have access to examples of best practices. The result is that the reader is left with very little insight into how the survey was developed for any given published study (often, readers are asked to contact the authors of the study for details on the development of the measures).
This handbook aims to fill the void in the social sciences – and will be a valuable resource for a variety of fields such as Organizational Behavior, Industrial and Organizational Psychology, Management, Psychology, Educational Research, Marketing, Public Policy, and others. The Handbook covers most aspects of the survey research process from start to finish, including both the development of new scales, and the adaptation or alteration of existing scales. Survey development encompasses a wide range of essential topics beginning with conceptualization of constructs and the development of operational definitions, and concluding with dissemination of the results. The psychology of survey response is an important consideration in the development of survey instruments, if the researcher wants to improve construct validity (Tourangeau, Rips, & Rasinski, 2000). This Handbook will squarely address concerns around respondent interpretation of survey items.

In some areas of research, construct proliferation has become an issue (e.g., Banks et al., 2018) and there is a need for construct mapping tools. Once constructs are appropriately defined and operational definitions created, there is a need to understand the best methods for item generation and survey construction. Today, there are different channels of communication (such as paper-and-pencil, oral, online, etc.) through which data are collected, and so this Handbook discusses more recent developments, including the use of digital technology and the use of paid panels for research, such as MTurk. These are relatively new methods for data collection, and these chapters offer guidance on best practices for employing them, as well as their strengths and weaknesses. New statistical techniques for the assessment of construct validity have emerged, and the Handbook includes updates on these tools. For example, we include a chapter on reflective and formative measures that we hope will clear up some of the debate on the use of these measurement models. There has been an explosion in research being conducted in other cultures over the last 20 years, yet survey development practices have not kept pace with these developments. This Handbook covers challenges in data collection, and the use of survey methods in cross-cultural research. Translation and back-translation are the minimum requirements for the administration of a survey in another language. However, there is a need for more advanced statistical methods to assess the equivalence of surveys across cultures, and the Handbook addresses this. For example, we include a chapter on suggestions for improving the quality of scale translation. Further, the Handbook addresses the challenges associated with publishing or presenting survey development work in academic and practitioner outlets.
The final section of the handbook contains applications of the development of new measures as well as modifications of existing methods to demonstrate best practices. There is a great need for an outlet for this measurement work in the social sciences. This Handbook closes the gaps in survey development and will serve as a one-stop resource for all social scientists interested in conducting survey research. Of particular interest is that many of the chapters provide the reader with a checklist or other tutorial guide to implementation. This Handbook, therefore, responds to repeated calls in the literature to improve survey development practices by providing a comprehensive source of best practices and examples that can be used by scholars in many disciplines, and will serve as an important reference work and a foundational text for training future scholars.


INTRODUCING THE HANDBOOK As both editors are organizational scholars, it was important to us to receive feedback on the content of the book from those outside our own field, in order to ensure that the volume meets the needs of scholars from across disciplines. The original proposal was reviewed by six anonymous reviewers, at Sage's request. We used this feedback to further develop the table of contents, to ensure that the Handbook would be as useful to as many disciplines as possible, within the bounds of a single volume. Subsequently, an advisory board was assembled from across the globe and from several disciplines. We sent the final proposed table of contents to the board seeking additional input on the content of the volume. Finally, we tried to make sure that each draft chapter was reviewed by at least one scholar from outside the authors' field. This was intended to ensure that chapters would be readable across disciplines, as we suspected that "best practices" might differ from one discipline to another. Hence the reader will find themselves presented with chapters that have been written to appeal across disciplines, but will reflect some diversity of thought on some content areas. We learned that disagreements on best practices remain, but authors responded to these differences of opinion by revising their chapters to fairly represent state-of-the-art thinking on the topics. The volume opens by laying out the conceptual and theoretical issues at stake in the development of surveys, prior to making any decisions about the research design, or the generation of the survey. A framework for more effectively explicating constructs is presented (Miller), and Group Concept Mapping is discussed as a means for exploring the connection between the theoretical concepts and the observed data (Rosas). In the next part of the book, consideration is given to research design.
The first chapter in this section focuses on those issues that should be considered when making design decisions, and presents readers with a useful checklist (Stanton). The rest of this section covers a set of issues that should be considered when designing a research project using surveys. Ethics in survey research are addressed in a chapter on principlism in practice (Paunova), that describes the moral principles that researchers might use to guide ethical decision making in survey research. Another chapter is devoted to sampling issues (Zabinski et  al.), a topic that has typically received too little attention in the literature. The next chapter in this part focuses on inductive survey research design considerations (Albrecht & Archibold). The two
remaining chapters in this part are concerned with long and short form Likert measures (Meuser & Harms), and with response option considerations in the design process (Brown). In the third part, item development is addressed. The first chapter is a reprint of an article originally published by the editors (Ford & Scandura). In this paper, a typology of threats to construct validity is presented, with recommendations for minimizing those threats in the scale development process. The next chapter addresses the need to distinguish between reflective and formative measures, as the two types of measures require different approaches to construct validation (Lambert et al.). In today's world, transnational and cross-cultural surveys are increasingly common, and no discussion of survey construction would be complete without addressing the issues that emerge from culture and language differences when a survey is to be used in multiple countries. The next chapter in this part provides guidelines for improving translation quality, including a selection of statistical techniques (Keener et al.). It is important to establish measurement equivalence/invariance when using a scale under different measurement conditions such as time, context, or geography. The final chapter in this part addresses several approaches to establishing measurement equivalence, targeted both at the novice researcher and at those who need a refresher on the subject (Hu et al.). In the fourth part of the volume, we are concerned with the process of scale improvement. The chapters cover topics ranging from fundamental and well-established methods to newly emerging ones. We are first introduced to the concept of reliability. This chapter covers different reliability indices and provides a useful flowchart to allow researchers to make a well-informed choice of index (DeSimone et al.). Once the reliability of a measure is demonstrated, it is important to establish validity.
The next chapter discusses traditional types of validity evidence, and additionally discusses more recent developments, providing information on using multitrait-multimethod matrices and models. Best practice recommendations are given (Schriesheim & Neider). Addressing some of the existing challenges in establishing construct validity, Carpenter and Zhang focus on the use of item-level meta-analysis to establish validity, and provide examples of this technique in use. The next chapter might equally well have been placed in the earlier part of the volume, Item Development. The chapter describes construct continuum specification, and extends previous work by linking these concepts to validation and refinement of the survey measure (Tay & Jebb). There are a number
of additional statistical techniques that scholars may use to refine and improve their surveys. In the next chapter, exploratory and confirmatory factor analyses are explored, using a common data set to demonstrate these techniques. The authors provide recommendations for improved use of these scale improvement methods (Williams & Hanna). Another concern in interpreting survey data is the response style of the respondents completing it. Item Response Theory approaches can be used to detect and control for these tendencies (Chen). In the fifth part of the volume, various concerns around data collection are addressed. Traditionally, surveys were administered in pencil-and-paper, telephone, in-person, and internet-based forms. In recent years, more innovative methods have come into use both in academia and in the corporate sector, and those methods raise interesting challenges for the researcher. The first chapter in this part discusses the use of online labor pools in survey development (Harms & Marbut). The next chapter introduces the use of digital technologies, including smartphone sensors, apps, wearables, and digital trace data (Struminskaya). The next chapter challenges the reader to think through the cognitive and motivational factors that impact survey responses, and to design their survey accordingly (Gray et al.). Finally, the last chapter in this part provides guidance and recommendations for mitigating the possibility of bot responses to surveys (Horan et al.). The sixth part of the volume is designed to help the researcher get their data ready for statistical analysis and hypothesis testing. Historically, scholars have learned much of this through trial and error, as it is often given little to no coverage. These topics, though, are critical to the production of quality research, as errors in this stage of the research process have serious implications for the validity of the results.
In the first chapter in this part, the reader will learn about managing multi-source data. Topics in this chapter include the development of multi-source databases, maintaining nested relationships within the data, data privacy concerns, data aggregation, and the use of technology with multi-source data (Fleenor). The next two chapters cover a range of topics related to data wrangling (cleaning, organizing, nesting data, etc.) and examining the data for problems such as missing data and outliers. The authors recommend the use of scripts for these processes in order to produce consistent and well-documented final data sets (Braun et al.). The fourth chapter in this part of the book highlights approaches that can be used to automate the analysis of text-based open-ended responses in surveys, by reviewing studies using
such approaches across multiple disciplines (Pandey et al.). In the fifth chapter, Qualitative Comparative Analysis is presented as an option for survey researchers interested in a set-theoretic, rather than correlational, approach to data analysis (Greckhamer). The final chapter in this part of the volume addresses the use of open-ended questions for multiple purposes, including scale validation, discovery of contextual factors, and inspiration for future research directions. Strategies for analysis are presented (Patton). The penultimate part of the volume includes three chapters on dissemination of survey results in academic outlets, in practitioner outlets, and through data visualization. In the first chapter, current best practices are presented for writing up survey results. Further, the authors speculate on the future demands that journals and reviewers might present to authors engaged in survey research, with recommendations for improving survey reporting (Sturman & Cortina). Most doctoral programs fail to provide any guidance at all on disseminating research results to practitioners, either directly or through outlets such as blogs and social media. In the next chapter, the authors provide guidance on better bridging the scientist–practitioner divide (Simmering-Dickerson et al.). Data visualization is another common omission in doctoral education, beyond the provision of figures and charts suitable for publication in academic journals. The final chapter in this part presents details of many types of visualization that might prove useful for the reader (Peng). In the final section of the book, readers are first presented with a tutorial chapter that outlines, at a summary level, scale development practices. This is followed by four chapters that include detailed write-ups of the scale development processes followed in four separate studies, covering both the development of original scales and the adaptation of an existing one.
Authors were provided with the tutorial at the time their proposal was accepted, and each agreed to follow the guidelines provided. In the first application chapter, three studies were conducted to develop the Developmental Partnerships Inventory, a measure of the quality of developmental relationships in organizations (Williams et al.). In the second, the authors developed a General Situational Strength measure that will help advance research into how situation and personality affect behavior (Li et al.). The third applications chapter uses OASIS to create a short-form version of the Gendered Communication Instrument (Hausfeld & Weinberg). And finally, the last applications chapter develops an instrument for measuring the High Maintenance Employee (Keating & Meuser).


CONCLUSION

In putting together this Handbook, we have sought to provide the reader with a practical end-to-end guide to survey development, from defining concepts all the way through to dissemination of survey results. The eight parts of the volume progress logically through the process from beginning to end, including both fundamental and cutting-edge guidance. Many of the authors have provided a checklist, flowchart, or other aid to help the researcher apply their chapter, and the final section of the book provides several examples of well-executed studies. The chapters include extensive references, and readers may consult those for further reading on each subject. We are grateful for the hard work of the authors in this volume, who carefully navigated their way through differing recommendations for revision from reviewers in different disciplines. We sincerely hope that this selection of chapters will serve as a valuable resource for scholars across disciplines and across career stages. We encourage you to read on, and dip
into the chapters that will improve your survey development practice.




PART 1

Conceptual Issues and Operational Definition


2

A Framework for Evaluating and Creating Formal Conceptual Definitions: A Concept Explication Approach for Scale Developers

Serena Miller

One of the first steps of scale development should be to identify a parsimonious, useful, adequate, and precise conceptual definition, or to propose a conceptual definition that represents the key characteristics of a concept, to serve in the measurement process (Sartori, 1984; Mowen & Voss, 2008). A scale is not considered "good" if it does not align with a formal conceptual definition (Wacker, 2008, p. 8). Sartori (1984) stated that precision is critical to enable collectives to work together to build and test theory involving a concept: "Clear thinking requires clear language. In turn, a clear language requires that its terms be explicitly defined" (p. 22). A formal conceptual definition "is a clear, concise verbalization of an abstract concept used for empirical testing" (Wacker, 2004, p. 645). Definition creators use a collection of more familiar, less abstract terms to communicate the meaning of a scientific concept to other researchers (McLeod & Pan, 2004). A conceptual definition, the theoretical lens employed by scale developers, houses scholars' interpretation of a concept, which then should inform the items selected or eliminated for a proposed measure (Wacker, 2004; Carpenter, 2018). For the purposes of scale and index development, definitional words function to minimize
the distance between the concept and the variables that represent the concept (Hempel, 1952). In my observation, a conceptual definition is especially useful during the initial stages of item creation and the later stages of item reduction in the scale development process. The rarely reported use of conceptual definitions in scale development is a topic of great concern for some theoreticians. Despite their expressed importance in scale development, researchers have generally observed that scale developers typically forgo the step of linking scale items to a formal conceptual definition (Wacker, 2004; Suddaby, 2010; Swedberg, 2020). The reason behind this neglect may be that the functions of conceptual definitions are not taught during graduate training (Podsakoff, MacKenzie, & Podsakoff, 2016). In fact, in my review of these sources, most theory-building textbooks dedicate only a very small section to the topic of conceptual definitions. If present, most books emphasize how conceptual definitions are used to describe the essential qualities or characteristics of intellectually organized phenomena represented by a concept (e.g., Jaccard & Jacoby, 2010), but most texts do not communicate how to create one. Conceptual definition advocates argue that weak
(i.e., ambiguous, rarely used, or non-existent) definitions lead to problems such as misspecification of concepts; scattered measurement practices across fields; conceptual confusion and proliferation; and questionable cumulative understandings of phenomena. If a conceptual definition is not used, scholars' measurement model item decisions may instead be guided by their own individual interpretation of the concept, or they may use only the concept's label or other scales as a heuristic when crafting and modifying items for their scale. Concept explication is a process that involves logical, creative, and empirical procedures in which "a concept is created, defined, and used in scientific research" (McLeod & Pan, 2004, p. 16). Concept explication is distinguished from concept formation or analysis in that the goal is to theoretically link measures to a particular scientific concept (Chaffee, 1991; McLeod & Pan, 2004). Concept analysis typically references a broader umbrella approach toward creating or evaluating concepts, including building frameworks that may serve as conceptual lenses, classifications, or typologies in critical and qualitative research; case studies; or clinical settings (Walker & Avant, 2019). In particular, the intent of concept explication is to reduce ambiguities and inconsistencies in a concept's meanings, with the future intent of using the definition to observe the natural world (Hempel, 1952; Sartori, 1984). Thus, I argue concept explication is the most appropriate approach for showing readers how to create conceptual definitions, because the intent of the process is ultimately to develop measures that represent the concept. My hope with this chapter is to help scholars, practitioners, and students learn how to engage the literature and craft formal conceptual definitions. I also hope they will become aware of the role conceptual definitions play in scale development.
The choices and steps involved in formulating a conceptual definition require a substantial amount of disciplined thinking about a concept's essence, use, and boundaries. Methodologically, the practice of creating clear definitions may involve, for example, identifying essential and precise terms of scientific utility; locating patterns of agreement and disagreement; using dictionaries and thesauruses; mapping empirical referents; locating boundaries; detailing functions and uses; surveying experts; and analyzing data. The chapter is written in an accessible way, while also encouraging a more systematic approach to teaching students and researchers how to create parsimonious and useful conceptual definitions. In this
chapter, I will first provide guidance on how to evaluate the goodness of a conceptual definition to help determine whether a new one is needed. Following the evaluative guidance, I present the Concept Explication Framework for Evaluating and Creating Conceptual Definitions that outlines organizational and methodological strategies that researchers may employ to identify an appropriate definition or craft a formal conceptual definition (see Table 2.1).

EVALUATIVE GUIDANCE FOR REVIEWING CONCEPTUAL DEFINITIONS

Conceptual definitions represent researchers' shared understanding of an abstract entity. Researchers give phenomena names for the purposes of categorization and efficient communication about the phenomena. Definitions serve a referential role in scale development or refinement research because concepts represent recurrent patterns that take place in the real world (Gerring, 1999; Carpenter, 2018). Riggs (1975), for example, stated a concept is a "mental image of a thing formed by generalization from particulars" (p. 46), while Podsakoff, MacKenzie, and Podsakoff (2016) defined concepts as "cognitive symbols (or abstract terms) that specify the features, attributes, or characteristics of the phenomenon in the real or phenomenological worlds that they are meant to represent and that distinguish them from other related phenomena" (p. 161). Ultimately, the better the conceptual definition, the better the scale, because definitions are the connective tissue that ties theory to reality. It is rare, however, to locate a conceptual definition for a concept, or to find one that is useful for the purposes of measurement, in research studies. Wacker (2004) argued that "many (or most) concepts are not conceptually well-defined," which suggests measures may be questionable if one employed such definitions to develop variables for a concept (p. 634). Summers (2001) referred to definitional missteps as pseudodefinitions, in which a concept's meaning is clouded by researchers' definitional writing practices. Some ill-advised definition writing practices include using terms that do not serve in observation; creating circular or tautological definitions; defining concepts based on examples; or using too many conjunctions (Cohen & Nagel, 1934; Suddaby, 2010).



Table 2.1  Concept explication framework for evaluating and creating conceptual definitions

1. Collect and Record Relevant Conceptual Interpretations of Construct
   a. Formal Conceptual Definitions
   b. Concept Descriptions
   c. Measures
   d. Popular Literature
   e. Non-scholarly Definitions
2. Thematically Map Conceptual Definitions
   a. Order Definitions Chronologically
   b. Record Key Terms
   c. Record Field if Interdisciplinary
   d. Note Observations and Patterns
3. Map Neighboring Concepts to Determine Boundaries
   a. Record Similar and Polar Concepts
   b. Record Concepts in Nomological Framework
4. Evaluate State of Conceptual Definitions
5. Present Existing, Modified, or New Formal Conceptual Definition
6. Content Validate Conceptual Definition
7. Refine and Present Formal Conceptual Definition
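Steps 1–3 of the framework amount to careful bookkeeping: collecting definitions, ordering them chronologically, and recording key terms and fields so that patterns of agreement and disagreement surface. For readers who prefer to keep this record in script form, a minimal Python sketch follows; the sample entries, field names, and key terms are hypothetical illustrations, not data from any actual explication.

```python
from collections import Counter

# Hypothetical record of collected definitions (step 1), one entry per source.
definitions = [
    {"year": 1999, "field": "sociology", "source": "Author A",
     "definition": "...", "key_terms": ["trust", "expectation", "risk"]},
    {"year": 2005, "field": "management", "source": "Author B",
     "definition": "...", "key_terms": ["trust", "vulnerability", "risk"]},
    {"year": 2012, "field": "communication", "source": "Author C",
     "definition": "...", "key_terms": ["trust", "expectation"]},
]

# Step 2a: order definitions chronologically before mapping them.
definitions.sort(key=lambda d: d["year"])

# Steps 2b and 2d: tally key terms across definitions to locate
# patterns of agreement (shared terms) and disagreement (rare terms).
term_counts = Counter(t for d in definitions for t in d["key_terms"])
shared_terms = [t for t, n in term_counts.items() if n == len(definitions)]

print(shared_terms)  # prints ['trust'] — the one term every source shares
```

The same table can of course be kept in a spreadsheet; the point is only that the record be explicit and traceable, so later readers can follow how the definitional attributes were identified.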

VAGUE AND AMBIGUOUS DEFINITIONAL TERMS

Researchers should select definitional terms that are familiar to other researchers and that serve in future observations. Vagueness and ambiguity are two evaluative components one may consider when creating and reviewing the quality of a definition (Collier & Gerring, 2009). Conceptual vagueness arises when a researcher selects terms within a definition that do not serve in model specification, because the terms make it challenging to identify referents; ambiguity emerges when readers are confused about the entity being defined, because the definition or certain definitional terms possess multiple meanings (Voss et al., 2020). In the realm of conceptual definitions, researchers reduce vagueness (i.e., cloudiness of the terms within the definition used to identify referents, or "a lack of determinacy") by choosing more visual and concrete terms (Hempel, 1952, p. 10). Terms within a definition should offer clear direction on how to write items to represent a concept (Sartori, 1984). For example, a scale developer would be challenged to identify a coherent and unifying body of items that collectively represent the positive mental health construct if researchers defined the concept as "general emotional, psychological, and social well-being" (Velten, Brailovskaia, & Margraf, 2022, p. 332). A definition loaded with vague keywords leads to scattered measurement practices because it offers no information about the boundaries of the construct. Ambiguity may be diminished if the researcher defines a few key terms within the definition, following the formally presented conceptual definition, to clearly communicate what is meant by words such as community, society, etc. (Riggs, 1975; Gerring, 1999). The definitional writer may want to italicize words within the definition that the author goes on to articulate, or words critical to the definition that serve scale developers in observation. At the same time, the addition of clarifying phrases within a definition may cause confusion about the core meaning of the concept (Wacker, 2004). Thus, one should strive for a parsimonious definition and define only those few key terms within it that necessity demands.

CIRCULAR OR TAUTOLOGICAL DEFINITIONS

Researchers create circular definitions by using terms from the concept's label to define the concept; a tautological definition restates the concept label using different but similar words (Suddaby, 2010; Podsakoff, MacKenzie, & Podsakoff, 2016). Definition creators should strive not to include any concept label term in the definition, because doing so offers no additional clarity as to what the concept means. An example of a circular definition would be "a charismatic leader is a leader who is charismatic." For tautological definitions, Podsakoff, MacKenzie, and Podsakoff (2016) provided the example of "a frequent purchaser is someone who often buys a product."

CONCEPTUAL BOUNDARIES

Another mistake that researchers make is defining a concept based on what the concept is supposed to predict, or what predicts it, which creates confusion surrounding the boundaries of the concept (MacKenzie, 2003; Suddaby, 2010). The usefulness of a concept is often derived from its contribution to theoretical predictions. Definition writers should consider how the concept is embedded in a nomological framework (Mowen & Voss, 2008; Jaccard & Jacoby, 2010), because its placement provides information on how it will be used by researchers, and this awareness helps ensure that researchers do not write definitions that reflect what the concept is supposed to predict. The definition is a representation of the concept rather than of the function it serves in theoretical predictions. Thus, during the literature review, one may want to record related concepts and their definitions in a separate table to guard against creating such definitions.

OSTENSIVE DEFINITIONS

Conceptual definitions are also more than lists of exemplars or examples. Conceptual methodologists recommend researchers avoid defining a concept by listing examples. Ostensive definitions make abstractions clearer to readers by pointing to real-world examples of the concept, but such definitions do little to help one figure out the body of items representing that concept (Swedberg, 2020). Chaffee (1991) illustrated this practice by stating that media scholars conceptually define mass media by listing examples such as newspapers, television, books, etc. As one can see, such lists are time-bound, violating the tenet of abstractness (McLeod & Pan, 2004). This practice is problematic because a list of examples or exemplars is not exhaustive, making it challenging to identify the boundaries of a concept and the items that represent it, and leaving the definition ultimately less stable (Sartori, 1984; MacKenzie, 2003). The conceptual definitional practices associated with creative scholarship are another example. Most scholars and tenure documents do not include formal conceptual definitions of creative scholarship; instead, the concept is often defined by listing example discipline-relevant mediums or platforms (e.g., film, graphic design, music, photography, theatre, software, paintings) that represent creative scholarship. One, of course, can see such a list violates the principles of abstractness and exhaustiveness. A suggestion is to present a definition with complete and clear sentences when communicating a concept's structure. Examples or model cases, however, may be shared following the presentation of a formal definition (Chaffee, 1991).

CONJUNCTIONS IN DEFINITIONS

Voss, Zablah, Huang, and Chakraborty (2020) found that the use of "and" and "or" in conceptual definitions may cause concept explication problems for scale developers, because definitions loaded with conjunctions are confusing and tend to steer scholars toward creating a list of dimensions to represent the breadth of the construct, rather than encouraging scale developers to create a body of items that collectively represent the construct based on an overarching definition. Interactivity, for example, was formally defined as "the possibility for users to manipulate the content and form of communication and/or the possibility of information exchange processes between users or between users and a medium" (Weber, Behr, & DeMartino, 2014, p. 82). The presence of "and/or" may be confusing to the scale developer because this definition signals that it is optional to include items representing the latter portion of the definition, which means some items will represent only certain parts of the definition. Thus, one should be thoughtful when employing conjunctions in formal definitions.

EXPLICATING CONCEPTUAL DEFINITIONS

Cronbach (1971) stated content validation involves both the assessment of a definition and investigating whether items match that definition. Cronbach proposed that expert judgment should be the primary content validation approach, yet conceptual definitions, which Cronbach referred to as blueprints, are often not employed to build content-valid scales. I argue that concept explication is useful during the early stages of content validation because the process provides informational guidance on how to formulate conceptual definitions. Carnap (1945) proposed the term concept explication, which he originally referred to as rational reconstruction. Researchers cannot work together to advance a social science theory if they do not possess a mutual understanding of a concept contained within a formal theory. Concept explication is the process of making a concept clear so that the concept can then be used in social science theories. Concept explication is a systematic practice that involves crystallizing both a concept's meaning and its measurement, and one approach toward making concepts precise and useful (i.e., lessening vagueness and ambiguity) is through conceptual definitions (Nunnally, 1967). Rodgers (1993) stated "words are manifestations of concepts, not the concepts themselves" (p. 74). Podsakoff, MacKenzie, and Podsakoff (2016) provided several valuable suggestions to help identify the words to include in conceptual definitions, such as using dictionaries, reviewing literature, seeking expert feedback, conducting research (focus groups, case studies, observations), comparing the concept to dissimilar concepts, and reviewing measures. The overall intent of the concept explication process is to identify the shared meaning of a concept and discrepancies across definitions. Scale developers create or modify definitions in cases of conceptual redundancy, delineation, clarification, or identification (Chaffee, 1991; Morse, 1995). Once problems associated with the conceptual definitions are revealed, researchers should decide whether to create or refine the conceptual definition based on the state of existing definitions (Klein, Molloy, & Cooper, 2009). Modification involves rewriting an existing conceptual definition to make it more precise or useful by following the definitional writing best practices described above, while creation of a new definition is conducted by mapping the literature and writing a new definition informed by that mapping. Following the identification of definitional attributes, a formal definition is presented. In Table 2.1, I present a framework regarding how one can evaluate and arrive at a conceptual definition to use for measurement purposes.

COLLECT AND RECORD RELEVANT CONCEPTUAL INTERPRETATIONS OF CONSTRUCT

The first suggested step in dismantling a concept is a thematic analysis of the scholarly literature, because scale developers are building measures to be used by other researchers, and mapping the literature provides the most relevant and useful information regarding how a concept is interpreted by researchers. Researchers should be able to trace the steps the conceptual definer took to aggregate conceptual definitions or descriptions. Databases such as Google Scholar, PsycINFO, Sociological Abstracts, and Communication & Mass Media Complete are used to identify relevant journal articles. Volk and Zerfass (2018) explicated alignment by searching descriptions and definitions of it. The authors communicated that they searched for definitions in particular handbooks, journals, and a database, and reported using the search terms "align" AND "strategic communication," which is important information to share because the keywords influence the content being examined. In this example, the search terms communicate that the authors restricted their examinations to
how strategic communication scholars defined alignment. There are methodological and theoretical implications associated with using these services to generate a sampling frame for a content or thematic analysis, because some publications are absent from these databases. One should communicate and provide the logic behind one's database selection and other sources, including the number of relevant articles reviewed. In another example, Mou, Shi, Shen, and Xu (2020) searched four databases with the string "('personality') AND ('robot' OR 'machine' OR 'agent')" to identify English-language peer-reviewed articles from 2006–2018 on the personality of robots. A scale developer should also test different search string keywords and Boolean operators, and provide the logic behind the string ultimately employed to identify the relevant articles for the concept explication (Stryker, Wray, Hornik, & Yanovitzky, 2006). The search string communicates to readers how the researchers built their sampling frame for analysis.
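Because the search string determines the sampling frame, it is worth comparing candidate strings before committing to one. The Python sketch below is my own illustration of the AND/OR matching logic behind a string such as Mou et al.'s; the sample records are invented, and real databases implement their own query syntax, so this stands in only for the logic of Boolean filtering.

```python
def matches(text, all_of=(), any_of=()):
    """Crude Boolean filter: every term in `all_of` must appear (AND),
    and at least one term in `any_of` must appear (OR), case-insensitively."""
    text = text.lower()
    return (all(term.lower() in text for term in all_of)
            and (not any_of or any(term.lower() in text for term in any_of)))

# Hypothetical titles/abstracts used to try out a candidate search string.
records = [
    "The personality of social robots in eldercare",
    "Machine learning for survey data cleaning",
    "Agent personality and trust in human-agent teams",
]

# Modeled on the string ("personality") AND ("robot" OR "machine" OR "agent").
hits = [r for r in records if matches(r, all_of=["personality"],
                                      any_of=["robot", "machine", "agent"])]

print(len(hits))  # prints 2 — the middle record lacks "personality"
```

Running several candidate strings against the same trial set of records makes their trade-offs concrete, and the chosen string (with its rationale) can then be reported alongside the database list.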

Formal Conceptual Definitions

Following the building of the sampling frame, scale developers should scour those units of analysis to locate (a) explicitly defined concepts from primary scholarly sources; however, they may also need to review (b) concept descriptions, (c) concept measures, and (d) non-scholarly concept definitions. In the conceptual definitional literature, conceptual definitions are sometimes referred to as nominal definitions because they often do not represent a definition of an observable phenomenon (Hempel, 1952). The components of a nominal definition include the definiendum (i.e., the concept label), the term in need of definition, and the definiens, the group of terms used to define the definiendum (Cohen & Nagel, 1934; Teas & Palan, 1997). Definitions reflect collective agreement surrounding the meaning of an entity. Nominal definitions are defined "as an agreement or resolution concerning the use of verbal symbols" (Cohen & Nagel, 1934, p. 228). The reader may pinpoint definitions by looking for formal presentations, typically involving a verb such as "is defined as" or "means" connecting the concept with a phrase or clause that explains the concept.

Concept Descriptions

As previously stated, researchers often do not provide definitions for their concepts. One method
to handle such an obstacle when creating a definition is to review and record researchers' descriptions when they reference the concept. In a study of mentor functions, Carpenter, Makhadmeh, and Thornton (2015) initially struggled to find conceptual definitions of the concept. They collected descriptions of the mentor functions concept to pinpoint how researchers interpreted it: "Kram (1985, p. 22) described mentor functions as 'essential characteristics that differentiate developmental relationships from other relationships,' in which the mentor provides a variety of functions 'that support, guide, and counsel the young adult'" (Kram, 1983, p. 806, p. 3).

Concept’s Measures
An additional way to address a conceptual definitional void in the literature is to collect and review patterns across scale items. Since measures are often not anchored by a conceptual definition, hints at the meaning of the concept may emerge when one groups scale items together and reviews all items for patterns. This abstraction technique may help one formulate a definition. The process also may reveal operational representational problems demonstrating that scholars do not link their measures to conceptual definitions. Balzer and Sulsky (1992) found researchers rarely linked their measures to existing conceptual definitions of halo, leading them to argue that this created measurement proliferation and boundary problems for the concept. Measures are often the richest information one can access regarding how researchers interpret a concept. Nah and Armstrong (2011) reviewed conceptual definitions and measures of structural pluralism, a concept representing power distributions among diverse interest groups, and found consistency in meaning across the presented conceptual definitions. They, however, presented a clear table highlighting scattered and inconsistent measurement practices, suggesting that scholars were not presenting logic linking their measures to a conceptual definition.

Popular Literature
Rodgers (1993) stated researchers should also attend to popular literature. Buzzwords are common in discourse surrounding an emerging phenomenon, but words need to be concrete in meaning to model reality. Popular literature gives shape to a concept, but it is up to researchers to give the concept precision and structure.


Non-scholarly Definitions
Once definitions are organized into a table, one may want to review dictionary or legal definitions, especially if one faces mixed interpretations of the concept following a preliminary review of the definitions. Podsakoff, MacKenzie, and Podsakoff (2016) stated reviewing dictionary definitions should be the first step in developing conceptual definitions because they may provide general guidance on the meaning of the concept. It is suggested here to review dictionary definitions after mapping the literature because a dictionary may be helpful when determining whether the present descriptions and definitions align with the roots of the concept. Natural language and dictionary definitions, however, should not be presented as formal conceptual definitions because they lack the precision needed to serve scale developers in their theoretical and measurement endeavors (Wacker, 2004). Dictionaries typically highlight a range across which the definition applies rather than a single definition with complete sentences (Gerring, 1999).

THEMATICALLY MAP CONCEPTUAL DEFINITIONS
Little literature exists explaining how to systematically evaluate the literature for the purpose of creating a conceptual definition. A suggested organizational strategy is to create a table to help one dissect the definitions and isolate their essence. Sartori (1984) created a “rule” in which definition creators should collect all definitions, extract their characteristics, and create matrices organizing those characteristics to detect patterns across definitions and descriptions of a concept (Sartori, 2009). In the table, the first columns could include the full citation and page number in one column and the verbatim definition or description(s) in another (see Table 2.2). A thematic analysis of the table content should assist one in justifying the definition and identifying the boundaries of the construct. A thematic analysis is “a method for identifying, analyzing, and reporting patterns (themes) within data” to obtain an intellectual grasp of the meaning of the concept (Braun & Clarke, 2006, p. 79). Koekkoek, Van Ham, and Kleinhans (2021) critiqued the narrowness of scholarly interpretations of the university-community engagement construct. Their thematic analysis of definitions led them to identify the following definitional themes of university-community engagement: spatial connections, reciprocity, mutual benefit, and knowledge-sharing beyond academic audiences. Although this research did not propose a formal definition based on these themes, such information could assist one in developing a formal definition that represents the intended construct rather than neighboring constructs such as public scholarship, university outreach, or community engagement.

Order Definitions Chronologically to Identify Definitional History
One table organization suggestion is to arrange conceptual definitions chronologically by publication date to assess whether and how interpretations have evolved over time. A historical arrangement of definitions can lead to useful insights. Miller (2019), for example, discovered through this arrangement that conceptual definitions of the concept citizen journalist prior to 2011 were more idealist in nature: earlier definitions cast citizen journalists as engaged supporters of democracy who filled news media coverage gaps, whereas conceptual definitions in the literature post-2011 devolved into less idealized interpretations of citizen journalists. In another example, Klein, Molloy, and Cooper (2009) traced the problems associated with overlapping and confounding conceptualizations of commitment to some influential scholars’ work beginning in the 1960s.
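For those who prefer to keep the definitional matrix in code rather than a spreadsheet, the chronological and field-based organization described above can be sketched in a few lines. Everything below (citations, years, fields, and definition snippets) is an invented placeholder for illustration, not drawn from the literature:

```python
# Minimal definition-mapping matrix: one dict per source.
# All entries are hypothetical placeholders.
records = [
    {"citation": "Author C (2015)", "year": 2015, "field": "communication",
     "definition": "placeholder definition emphasizing behavior"},
    {"citation": "Author A (1998)", "year": 1998, "field": "management",
     "definition": "placeholder definition emphasizing attitude"},
    {"citation": "Author B (2006)", "year": 2006, "field": "psychology",
     "definition": "placeholder definition emphasizing belief"},
]

# Chronological ordering exposes the definitional history.
chronological = sorted(records, key=lambda r: r["year"])

# Grouping by field helps check whether disciplines diverge in meaning.
by_field = {}
for r in records:
    by_field.setdefault(r["field"], []).append(r["citation"])

for r in chronological:
    print(r["year"], r["citation"])
```

The same structure extends naturally with the keyword, notes, and discipline columns discussed in the following sections.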

Record Definitional Key Terms (Definiens)
In a column adjacent to the verbatim definitions, one should identify the key concrete terms within each definition to determine whether those keywords should be included in the formal definition, and later use those keywords to identify themes. A researcher should strive for agreement in meaning among scholars who may use the definition, while also addressing conflicting views concerning the meaning of the concept. Listing keywords thus helps when deciding which terms should be considered for a formal definition.

Record Field or Discipline if Interdisciplinary
Another table column to consider records the field from which the definition stems, to determine whether patterns exist within fields. With the discipline or field in its own column, the concept evaluator can review the discriminant validity and boundaries of a concept. Conceptual definitions are particularly necessary in less paradigmatically developed fields in which there is no consensus surrounding the meaning of a concept or a consistent utility of the concept in theory (Pfeffer, 1993). Disciplines often possess their own histories and uses associated with a concept, and one should be fully aware whether a shared understanding of a concept permeates fields or whether a concept is interpreted differently by outside disciplines, cultures, or regions. The scale developer interrogates definitions to identify areas of overlap to increase the permeability of a concept across disciplines, encouraging its use beyond a discipline’s borders. The other contexts the concept serves, such as country, theory, time, and level, are also important to consider when thinking about a definition (Wacker, 2004; Suddaby, 2010). Suddaby (2010) stated a concept’s borders are often constrained by the time, space, and values in which the concept is applicable. Units of analysis, for example, may influence the reach of a concept, constraining it to organizations, employees, technological platforms, parents, patients, etc. Thus, the scale developer needs to communicate the scope in which the concept should be applied to ensure that measures are contained within the borders of the concept. Scale item writers should be able to understand from the conceptual definition whether it emphasizes feelings, units, behaviors, behavioral intentions, beliefs, attitudes, traits, states, etc. (Jaccard & Jacoby, 2010; Podsakoff, MacKenzie, & Podsakoff, 2016).

Table 2.2 Example shortened table mapping conceptual definitions

Citation | Conceptual definition or description | Key terms | Notes
Bowman & Willis (2003, p. 9) | The act of a citizen, or group of citizens, playing an active role in the process of collecting, reporting, analyzing and dissemination of news and information. The intent of this participation is to provide independent, reliable, accurate, and wide-ranging and relevant information that a democracy requires. | Citizen; collecting, reporting, analyzing and dissemination; news and information | Circular; news gathering and dissemination; vague terms; outcome. What is democratic information? What is active?
Mortensen, Keshelashvili, & Weir (2016) | A participant that had ever produced a photo, video, or writing piece that has been submitted for inclusion by a mainstream or citizen journalism outlet, or that has intentionally or unintentionally been published by a mainstream or citizen journalism outlet. | Participant; a photo, video, or writing piece; mainstream or citizen journalism outlet | Piece = content? What about graphics? Published not a necessary requirement. Is anyone who posts to a social media platform a journalist based on this definition?
Paul (2017, p. 2) | Someone who creates, moderates, or comments on news content on public web sites. | Someone; creates; news content | Publicly accessible information. A person who posts any type of comment is a journalist?

Note: Review notes and highlighted keywords to identify patterns. Beneath the table, write down observations and propose a tentative conceptual definition. Formal conceptual definition following literature review: Citizen journalists are people with no news organizational affiliation who create news and informational content intended for public dissemination (Miller, 2019, p. 8).

Note Observations and Patterns
One of the next steps is to write observational notes and thematically analyze the definitions. The notes column may also include observations regarding patterns of agreement and disagreement. Mekawi and Todd (2021), for example, identified and critiqued whether definitional properties such as intention to harm, need for consensus, people of color, and context were appropriate representations of racial microaggressions based on their literature review.

MAP NEIGHBORING CONCEPTS TO DETERMINE BOUNDARIES
Once the matrix is complete, the scale developer may create another table or add more columns to the existing one to map other relevant concepts to ensure the discriminant validity of their construct. This step is also useful when the scale developer wants to identify concepts to establish construct, criterion, or nomological validity. But in the context of explicating conceptual definitions, documenting conceptual definitions of neighboring concepts helps ensure the definition is representative of the intended construct.

Record Similar and Polar Concepts and Their Conceptual Definitions
Another way to evaluate the distinctiveness of a concept is to review neighboring or polar concepts (Jaccard & Jacoby, 2010). In the early stages of meaning analysis, conceptual definitions serve as guides that help researchers identify not only the essence but also the scope of the concept (Cohen & Nagel, 1934; Hempel, 1952). Boundaries help the scholar identify what represents a concept and what does not (Cohen & Nagel, 1934). Reviewing concepts with similar or polar meanings but different labels ensures that concept creators avoid adding to the proliferation of concepts and that they present a definition that is truly representative of the concept. Bergkvist and Langer (2019) found in a study of 1,086 advertising journal articles that numerous different scales and labels existed for ad credibility. They hypothesized that these scattered measurement and labeling practices were due to nonexistent or vague conceptual definitions, which led to discriminant validity problems. Researchers should be disciplined when referring to a concept throughout the literature review narrative by using the label verbatim to discourage conceptual misunderstandings and redundancy. In another example, Lee (2004) argued technology-specific demarcations of the presence construct (e.g., telepresence, virtual presence, mediated presence) created confusion because they led scholars to interpret the construct based on the characteristics of technologies rather than a psychological state. He found the origin of the word concentrated on the experience of being physically transported, or the feeling of being there when not there. The literature review led Lee to propose this definition of presence: “a psychological state in which virtual objects are experienced as actual objects in either sensory or nonsensory ways” (p. 27).

Record Concepts in a Nomological Framework
As previously stated, one should identify the use of the concept by identifying its antecedents and consequences. Bagozzi and Fornell (1982) stated that meaning is articulated by defining a concept and then identifying its antecedents and causes. The usefulness of a concept is assessed by whether it serves in predicting and explaining other concepts. One could place the new concept within a nomological network to evaluate its antecedents and outcomes to understand how it may be used by future researchers (Cronbach & Meehl, 1955), which may influence how one defines the concept and whether there is a theoretical need for it. Klein, Molloy, and Cooper (2009) critically reviewed multiple conceptualizations of commitment, finding that definitions were confounded because researchers conceptualized it by using both antecedents and outcomes in their definitions. For example, they found scholars included identification with an organization in definitions, but they argued that identification instead predicted commitment. They also found people included motivation to define commitment, but they argued that motivation was an outcome rather than a representation of commitment based on their mapping of the literature. Another concern is that a lack of agreement may lead to a measurement surplus surrounding the theoretical entity (Podsakoff, MacKenzie, & Podsakoff, 2016). After identifying eight different conceptualizations, Klein, Molloy, and Cooper (2009) argued bond was the most appropriate property of commitment.

EVALUATE STATE OF CONCEPTUAL DEFINITIONS
One should present logic regarding why a new or reconceptualized definition is being proposed if another one is frequently referenced by a scholarly community. Podsakoff, MacKenzie, and Podsakoff (2016) suggested keeping track of how often a definition is cited to determine a scholarly community’s perceptions of its usefulness. Heavily cited definitions may prompt reconsideration of whether one should develop a new definition at all. Based on my experience, however, it is rare to find a formal conceptual definition, even in concept explication papers, and it is even rarer to find one that guides the researcher in measurement development. Thus, scale developers will likely have to modify an existing definition or create a formal one to serve their measurement endeavors. The tables are simply useful organizational tools to assist one in identifying the core meaning of the construct. Following the presentation of the definition, a review may be presented concerning the logic behind the selection of the keywords argued to make up the structure of the formal definition. Kiousis (2002), for example, broke down existing definitions of interactivity to identify the essential ingredients of the concept (e.g., at least two participants, mediated information exchange, and modification of mediated environments). He then formally defined interactivity as “the degree to which a communication technology can create a mediated environment in which participants can communicate, both synchronously and asynchronously, and participate in reciprocal message exchanges” (p. 372). Following the presentation of the proposed definition, he explained the logic behind his selection of the keywords in his definition. The creator should strip the definition to the point that only the necessary terms remain (Riggs, 1975). Necessary means selecting terms of scientific utility that provide a compass for scale developers on how to write items that represent the concept. Listing the key terms within the definitions helps researchers arrive at a defendable definition. A quantitative content or algorithmic text analysis may also be useful to record the prevalence of key terms found across definitions in the units of analysis (Riffe, Lacy, Watson, & Fico, 2019).
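As a minimal illustration of such a term-prevalence tally (the three definition strings below are invented stand-ins loosely echoing the presence example, not real published definitions), Python’s standard library is sufficient:

```python
import re
from collections import Counter

# Hypothetical verbatim definitions collected during the literature review.
definitions = [
    "a psychological state in which objects are experienced as actual",
    "a state of mind in which mediated objects feel actual",
    "the experience of mediated objects as if they were actual objects",
]

# Small stoplist so function words do not dominate the tally.
stopwords = {"a", "the", "in", "of", "as", "which", "they", "were",
             "is", "if", "are", "to", "and"}

# Count each remaining lowercase token across all definitions.
counts = Counter(
    token
    for d in definitions
    for token in re.findall(r"[a-z]+", d.lower())
    if token not in stopwords
)

# Terms appearing across most definitions are candidates for the definiens.
for term, n in counts.most_common(5):
    print(term, n)
```

In practice one would substitute the verbatim definitions from the mapping table and, as the chapter notes, treat high-frequency terms only as candidates to be argued for, not as an automatic definition.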

PRESENT EXISTING, MODIFIED, OR NEW FORMAL CONCEPTUAL DEFINITION
One key goal of concept explication is to present a parsimonious and useful formal definition. Otherwise, it will not encourage unity around the interpretation of the concept or its utility. The net result of this process is the presentation of a formal definition, often denoted by cues such as is, means, or defines. The reason for a formal presentation is to encourage use and adoption of that definition.

Conceptual Definitions of Dimensions
Unidimensionality is often encouraged (Edwards, 2001). If researchers theorize the concept as multidimensional instead of unidimensional, each dimension should be labeled and conceptually defined following the selection and presentation of a working, tentative overarching conceptual definition. Dimensions are lower-order representations of the overarching concept (McLeod & Pan, 2004). The conceptual definition of each dimension should align with the overarching definition of the concept, and both definitions should be used when determining item retention for each factor.

CONTENT VALIDATE CONCEPTUAL DEFINITIONS
Most content validity data collection efforts associated with scale development concentrate on gathering expert feedback evaluating whether scale items represent a construct, but scholars may also use expert feedback to evaluate the quality and usefulness of a formal conceptual definition. I present methodological approaches to assessing expert feedback that may be employed to examine a proposed conceptual definition. Interpretations of concepts and their definitions are subject to challenge by other scholars because intersubjective agreement is a critical element of scientific theory (Bagozzi & Fornell, 1982). As a content validity check, some consensus should exist concerning the meaning of a concept. Voss et al. (2020), for example, recruited academics to evaluate conceptual definitions for ambiguity and vagueness on an 11-point slider scale ranging from –5 to +5, in which positive numbers denoted a more ambiguous or vague definition and negative numbers a less ambiguous or vague one. They then calculated interrater reliability to assess the quality of the conceptual definitions. The Delphi survey method (Murry & Hammons, 1995; Hsu & Sandford, 2007) may be used to query and interview a panel of experts, seeking their feedback to help identify properties and rate a proposed conceptual definition. In a Delphi study, a concept definer could ask experts to list terms or properties representing a concept and then examine the data for trends of agreement. Experts could also rate whether terms should be part of a definition through both closed- and open-ended questions. Another questionnaire could later be redistributed for the experts to evaluate and refine the scholar’s proposed definition based on the previous rounds of feedback, with the intent of reaching some collective consensus.
For example, textile and apparel design scholars Adams and Meyer (2011) proposed a formal conceptual definition of creative scholarship based on the results of a Delphi study. They found that being peer-reviewed, being disseminated, and demonstrating technical skills were rated highly, leading the authors to define creative scholarship as “applied research that involves the design process in a way that demonstrates proficiency of combining creative and technical skills that provide a clear understanding of the inspiration or theoretical foundation, which was peer-reviewed and may or may not be retrievable” (p. 228).
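Voss et al.’s exact computations are not reproduced here, but one common way to index agreement among expert raters on a single definition is the within-group agreement statistic rWG (James, Demaree, & Wolf, 1984), which compares observed rating variance to the variance expected if experts responded at random across the scale. The sketch below uses invented ratings on the 11-point (–5 to +5) slider described above; it is an illustrative stand-in, not the statistic Voss et al. report:

```python
from statistics import mean, pvariance

def rwg(ratings, scale_points=11):
    """Within-group agreement for one rated definition.

    Compares observed rating variance to the variance of a uniform
    (random-response) distribution over an A-point scale,
    (A**2 - 1) / 12. Values near 1 indicate strong agreement.
    """
    expected = (scale_points ** 2 - 1) / 12  # 10.0 for an 11-point scale
    return 1 - pvariance(ratings) / expected

# Hypothetical ratings from five experts for one definition on the
# -5 (less vague) to +5 (more vague) slider.
ratings = [-4, -3, -4, -5, -3]
print(round(mean(ratings), 1), round(rwg(ratings), 3))
```

Here the low mean suggests the experts judged the definition to be clear, and the high rWG suggests they agreed with one another; a low rWG would instead signal the kind of interpretive dissensus that warrants another Delphi round.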

REFINE AND PRESENT FORMAL CONCEPTUAL DEFINITION
Following the literature review and systematic evaluation efforts to establish a valid and mutually understood concept, one should present a conceptual definition.

CONCLUSION
Conceptual definitions breathe life into concepts. A concept is a unit of thought, and bad definitions lead to questionable and scattered measures. Sartori (1984) stated “bad language generates bad thinking” (p. 15). Researchers may not be aware of the functions of conceptual definitions, especially their theoretical purpose in scale development. As a definitional writer, it takes practice to learn how to create useful and precise definitions. I offer organized, accessible concept explication guidance on how to identify and develop conceptual definitions. The framework recognizes the importance of conceptual definitions by outlining procedures intended to get researchers to think critically about their concept. Concepts are often developed for scientific purposes, and conceptual definitions are the flashlight needed to guide researchers when identifying referents (i.e., the objects) to represent them. Definitional terms within a formal definition should be clear to scholars and act as a guiding hand in helping scale developers accurately capture the abstraction in the real world. The content validity of measures depends upon whether researchers rely on the terms in the conceptual definition because these key terms should correspond with manifestations that collectively represent the concept. Scale developers should place their conceptual definition above the items as they write them to ensure that items reflect the definition, and it should be brought back to mind as they contemplate which items to retain and which indicators to eliminate following statistical analyses. Content validation is the process of defining the construct and seeking expert feedback to determine whether items adequately represent the construct based on that theoretical definition. Experts should be provided with the definition prior to the selection of items representing it (Brown, 1970). The conceptual definition not only addresses misalignment, but is also helpful in identifying gaps regarding what constitutes the concept. A conceptual definition does little to serve social scientists if it does not guide them in observing the physical realm. If constructs are not rigorously defined and referenced, then the items generated to represent the construct may not adequately tap into its intended meaning. Conceptual definitions generate social coordination toward tackling problems and improving knowledge. Researchers, for example, may need to revisit definitions due to organizational change; technological innovations; the passage of time; or racist, classist, or sexist definitions. If researchers are bound by science rather than the borders of their discipline, the defining of concepts ensures they are talking to one another across fields, and it prevents them from developing insular or incorrect views and assumptions about a concept. Conceptualization is the process of linking concepts to the empirical realm in which researchers move back and forth from the conceptual definition to indicators as illustrated in Table 2.1 (McLeod & Pan, 2004). Evidence that a conceptual definition is accepted by a scholarly community is when researchers cite the conceptual definition verbatim and when scale developers employ it to inform their scales. Ultimately, the presentation of conceptual definitions demonstrates the corrective character of science and recognizes that science is a shared process.

REFERENCES
Adams, M., & Meyer, S. (2011). Defining creative scholarship in textiles and apparel design in the United States. Design Principles and Practices: An International Journal, 5(2), 219–30.
Bagozzi, R. P., & Fornell, C. (1982). Theoretical concepts, measurements, and meaning. In C. Fornell (Ed.), A second generation of multivariate analysis (pp. 24–38). Praeger.
Balzer, W. K., & Sulsky, L. M. (1992). Halo and performance appraisal research: A critical examination. Journal of Applied Psychology, 77(6), 975–85.
Bergkvist, L., & Langer, T. (2019). Construct heterogeneity and proliferation in advertising research. International Journal of Advertising, 38(8), 1286–1302.
Bowman, S., & Willis, C. (2003). We media: How audiences are shaping the future of news and information. A seminal report. Reston: The Media Centre at the American Press Institute.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101.
Brown, F. G. (1970). Principles of educational and psychological testing. Dryden Press.
Carnap, R. (1945). The two concepts of probability. Philosophy and Phenomenological Research, 5(4), 513–32.
Carpenter, S. (2018). Ten decision steps in scale development: A guide for researchers. Communication Methods and Measures, 12(1), 25–44.
Carpenter, S., Makhadmeh, N., & Thornton, L. J. (2015). Mentorship on the doctoral level: An examination of communication mentors’ traits and functions. Communication Education, 64(3), 366–84.
Chaffee, S. H. (1991). Explication. Sage.
Cohen, M. R., & Nagel, E. (1934). An introduction to logic and the scientific method. Harcourt, Brace, and World.
Collier, D., & Gerring, J. (2009). Concepts and method in social science: The tradition of Giovanni Sartori. Routledge.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). American Council on Education.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Edwards, J. R. (2001). Multidimensional constructs in organizational behavior research: An integrative analytical framework. Organizational Research Methods, 4(2), 144–92.
Gerring, J. (1999). What makes a concept good? A critical framework for understanding concept formation in the social sciences. Polity, 31(3), 357–93.
Hempel, C. G. (1952). Fundamentals of concept formation in empirical science. International Encyclopedia of Unified Science, 2, 10–12.
Hsu, C. C., & Sandford, B. A. (2007). The Delphi technique: Making sense of consensus. Practical Assessment, Research and Evaluation, 12(10), 1–8.
Jaccard, J., & Jacoby, J. (2010). Theory construction and model-building skills: A practical guide for social scientists. The Guilford Press.
Kiousis, S. (2002). Interactivity: A concept explication. New Media & Society, 4(3), 355–83.


Klein, H. J., Molloy, J. C., & Cooper, J. T. (2009). Conceptual foundations: Construct definitions and theoretical representations of workplace commitments. In H. J. Klein, T. E. Becker, & J. P. Meyer (Eds.), Commitment in organizations: Accumulated wisdom and new directions (pp. 3–36). Routledge.
Koekkoek, A., Van Ham, M., & Kleinhans, R. (2021). Unraveling university-community engagement: A literature review. Journal of Higher Education Outreach and Engagement, 25(1), 3–24.
Kram, K. E. (1983). Phases of the mentor relationship. Academy of Management Journal, 26(4), 608–25.
Kram, K. E. (1985). Mentoring at work: Developmental relationships in organizational life. Scott, Foresman.
Lee, K. M. (2004). Presence, explicated. Communication Theory, 14(1), 27–50.
MacKenzie, S. B. (2003). The dangers of poor construct conceptualization. Journal of the Academy of Marketing Science, 31(3), 323–6.
McLeod, J. M., & Pan, Z. (2004). Concept explication and theory construction. In S. Dunwoody, L. B. Becker, D. M. McLeod, & G. M. Kosicki (Eds.), The evolution of key mass communication concepts: Honoring Jack M. McLeod. Hampton Press.
Mekawi, Y., & Todd, N. R. (2021). Focusing the lens to see more clearly: Overcoming definitional challenges and identifying new directions in racial microaggressions research. Perspectives on Psychological Science, 16(5), 972–90.
Miller, S. (2019). Citizen journalism. In H. Örnebring (Ed.), Oxford Research Encyclopedia of Journalism Studies (pp. 1–25). Oxford University Press.
Morse, J. M. (1995). Exploring the theoretical basis of nursing using advanced techniques of concept analysis. Advances in Nursing Science, 17(3), 31–46.
Mortensen, T., Keshelashvili, A., & Weir, T. (2016). Who we are. Digital Journalism, 4(3), 359–78.
Mou, Y., Shi, C., Shen, T., & Xu, K. (2020). A systematic review of the personality of robot: Mapping its conceptualization, operationalization, contextualization, and effects. International Journal of Human-Computer Interaction, 36(6), 591–605.
Mowen, J. C., & Voss, K. E. (2008). On building better construct measures: Implications of a general hierarchical model. Psychology & Marketing, 25(6), 485–505.
Murry, J. W., & Hammons, J. O. (1995). Delphi: A versatile methodology for conducting qualitative research. Review of Higher Education, 18(4), 423–36.
Nah, S., & Armstrong, C. L. (2011). Structural pluralism in journalism and media studies: A concept explication and theory construction. Mass Communication and Society, 14(6), 857–78.


Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill.
Paul, S. (2018). Between participation and autonomy: Understanding Indian citizen journalists. Journalism Practice, 12(5), 526–42.
Pfeffer, J. (1993). Barriers to the advance of organizational science: Paradigm development as a dependent variable. Academy of Management Review, 18(4), 599–620.
Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2016). Recommendations for creating better concept definitions in the organizational, behavioral, and social sciences. Organizational Research Methods, 19(2), 159–203.
Riffe, D., Lacy, S., Watson, B., & Fico, F. G. (2019). Analyzing media messages: Using quantitative analysis in research. Lawrence Erlbaum.
Riggs, F. W. (1975). The definition of concepts. In G. Sartori, F. W. Riggs, & H. Teune (Eds.), Tower of Babel (pp. 39–105). International Studies, Occasional Paper No. 6.
Rodgers, B. L. (1993). Concept analysis: An evolutionary view. In B. L. Rodgers & K. A. Knafl (Eds.), Concept development in nursing (pp. 73–92). W. B. Saunders Company.
Sartori, G. (1984). Guidelines for concept analysis. In G. Sartori (Ed.), Social science concepts: A systematic analysis (pp. 15–48). Sage.
Sartori, G. (2009). Sartori on concepts and method. In D. Collier & J. Gerring (Eds.), Concepts and method in social science: The tradition of Giovanni Sartori (pp. 11–178). Routledge.
Stryker, J. E., Wray, R. J., Hornik, R. C., & Yanovitzky, I. (2006). Validation of database search terms for content analysis: The case of cancer news coverage. Journalism & Mass Communication Quarterly, 83(2), 413–30.
Suddaby, R. (2010). Editor’s comments: Construct clarity in theories of management and organization. Academy of Management Review, 35(3), 346–57.
Summers, J. O. (2001). Guidelines for conducting research and publishing in marketing: From conceptualization through the review process. Journal of the Academy of Marketing Science, 29(4), 405–15.
Swedberg, R. (2020). On the use of definitions in sociology. European Journal of Social Theory, 23(3), 431–45.
Teas, R. K., & Palan, K. M. (1997). The realms of scientific meaning framework for constructing theoretically meaningful nominal definitions of marketing concepts. Journal of Marketing, 61, 52–67.
Velten, J., Brailovskaia, J., & Margraf, J. (2022). Positive mental health scale: Validation and measurement invariance across eight countries, genders, and age groups. Psychological Assessment, 34(4), 332–40.
Volk, S. C., & Zerfass, A. (2018). Alignment: Explicating a key concept in strategic communication. International Journal of Strategic Communication, 12(4), 433–51.
Voss, K. E., Zablah, A. R., Huang, Y., & Chakraborty, G. (2020). Conjuntionitis: A call for clarity in construct definitions. European Journal of Marketing, 15(5), 1147–59.
Wacker, J. G. (2004). A theory of formal conceptual definitions: Developing theory-building measurement instruments. Journal of Operations Management, 22, 629–50.
Wacker, J. G. (2008). A conceptual understanding of requirements for theory-building research: Guidelines for scientific theory building. Journal of Supply Chain Management, 44(3), 5–15.
Walker, L. O., & Avant, K. C. (2019). Strategies for theory construction in nursing (6th ed.). Pearson.
Weber, R., Behr, K., & DeMartino, C. (2014). Measuring interactivity in video games. Communication Methods and Measures, 8(2), 79–115.

3  Group Concept Mapping for Measure Development and Validation

Scott R. Rosas

OVERVIEW OF GROUP CONCEPT MAPPING (GCM)

Content domain conceptualization is key to the measurement development and validation process, yet it often receives little attention. Typically, elicitation of content and conceptualization of measurement models are limited to expert perspectives, often gathered through panels or workgroups. More rarely, the targets of the measures are themselves involved in the measurement development process, usually through traditional qualitative methods such as interviews or focus groups. Few tools enable researchers to comprehensively identify, organize, and visualize the content domain prior to measure construction, psychometric testing, and utilization. One approach that has seen increasingly widespread use in this context is group concept mapping (GCM). GCM is a participatory, mixed-methods approach that uses familiar qualitative group processes and multivariate statistical analyses to visually represent a synthetic model of a group's thinking for subsequent utilization (Kane & Trochim, 2007). GCM is recognized as a valuable methodological tool for applied social, behavioral, and health sciences researchers looking to explicate social and behavioral phenomena across a range
of disciplines (Trochim & Kane, 2005; Kane & Trochim, 2009). GCM takes a large amount of information generated by participants, and through a participatory process, consolidates the input into a concise set of graphic representations and output. The multi-step GCM process normally engages identified participants to first brainstorm a set of items relevant to a topic of interest. Next, each participant structures the content by sorting the items into groups based on perceived similarity and rates each item on one or more scales. Then, the group input is analyzed using multivariate analyses that include two-dimensional multidimensional scaling (MDS) of the unstructured sort data, a hierarchical cluster analysis of the MDS coordinates, and the computation of average ratings for each item and set of items. The “maps” that are produced show the individual statements in two-dimensional (x, y) space with more similar statements proximally located. Groups of statements are further distinguished as the map area is partitioned into non-overlapping clusters that reflect a set of related constructs. Finally, the output is interpreted using structured group interpretation processes to guide participant labeling and interpretive contributions to the substantive meaning of the visual representations. As a conceptualization technique, GCM is based on the premise that for any given topic a
collective mental model exists among a group of individuals that can be accessed through a structured methodological approach (Rosas, 2017). Fundamental to this assumption is accepting that individuals know only part of what the group knows, and knowledge on the topic is distributed unequally among the group. As such, a group can be viewed as a socially shared, distributed cognitive system (Yoo & Kanawattanachai, 2001; Akkerman et al., 2007). Accessing what a group knows about a subject is complicated by the presence of concepts that reside at a group level, with varying degrees of abstraction (Morgeson & Hofmann, 1999). Because conceptualization with groups is challenged by the need for common frames of reference, some strategy, heuristic, procedure, or integration technique is typically required (Hinsz et al., 1997), and processes that help resolve discrepancies and arrive at a joint understanding are critically important (Barron, 2000). It is within this context that GCM as a methodological tool thrives. As a mechanism for explicating the distributed knowledge of a group, Trochim and Linton (1986) first outlined a general model of group conceptualization that emphasized specific, defined processes to organize and represent the thinking of a collective. Their general model integrated several components that work in combination, including process steps for conducting the conceptualization, incorporating diverse perspectives from multiple participants, and producing various representational forms of the final conceptualization. Contemporary GCM is based upon this general approach to group conceptualization (Trochim, 1989a). Trochim (1989b) emphasized that such an approach is particularly useful for content domain specification because of its detailed, visual, pattern-based representation of concepts. GCM is useful for articulating and matching patterns because it helps researchers to scale theoretical expectations.
Trochim and McLinden (2017) note that from its early development in the mid-to-late 1980s, GCM reflected many of the research and evaluation issues of the period. The use of GCM has expanded over the past 30 years, and its versatility is evidenced by the range of applications, content areas, and disciplines that utilize the methodology. Trochim (2017) reviewed the nearly 30-year history of the methodology and the technology that has supported its use. His cursory analysis of over 475 articles citing the seminal article on the method (cf. Trochim, 1989a) showed the rate of citation continues to increase, signaling sustained uptake in the peer-reviewed literature. He argues the method has spread across a broad array of fields, and there are communities of practitioners in a
diverse range of universities and countries. Similar observations have been reported in other reviews of GCM and its application (Kane & Trochim, 2007; Rosas & Kane, 2012; Vaughn & McLinden, 2016; Kane & Rosas, 2017). Indeed, in their systematic review of community-engaged and participatory studies where GCM was used, Vaughn et al. (2017) identified 11 topic areas focused on health and well-being, including mental health, cancer, substance abuse/treatment, elder welfare, and sexuality/sexual health, among others. The expansion of GCM is not limited to the peer-reviewed literature. Donnelly's (2017) systematic review of over 100 concept mapping dissertations completed between 1989 and 2014 detailed the extensive use of the method by graduate students. He emphasized the value of the methodology for dissertation work, in part due to its flexibility in topic areas and utility for the articulation of a conceptual framework. Not surprisingly, most of these dissertations focused on a topic that was previously unstudied or understudied.

An Integrated Mixed-Method Approach to Conceptualization

Social phenomena are complex and require different methodological forms to understand and account for the intrinsic complexities (Greene & Caracelli, 1997). Those utilizing GCM recognize the advantages of mixing quantitative and qualitative techniques to strengthen understanding of the concepts being examined in diverse contexts. Kane and Trochim (2007) have long described group concept mapping as an integrated, mixed-method approach. In describing the value of GCM's integrated mixed-method foundation, several researchers emphasized the relevance and utility of information produced by the qualitative and quantitative components of GCM. They confirmed that integration enabled them to better understand the phenomenon of interest (see Kane & Rosas, 2017). As noted by methodologists, the mixing of methods for research can be viewed from both pragmatic and paradigmatic perspectives (Creswell et al., 2003; Teddlie & Tashakkori, 2009). This distinction is noteworthy and fundamental to understanding GCM as a mixed-method approach. Consistent with these views, Kane and Rosas (2017) argue that there is more to "mixing" in GCM than simply combining quantitative and qualitative methods. GCM integrates the two methods in complementary and additive ways. Data are integrated at multiple points in the GCM process. It has been further suggested that GCM components are connected in
a way that blurs the distinction between the qualitative and quantitative paradigms (Kane & Trochim, 2007). This degree of integration is distinguished from mixed-method applications in which several techniques are used to collect and analyze data but true integration is lacking. Integration occurs when quantitative and qualitative components are explicitly related to each other within a single study to answer the same question (Woolley, 2009). Through merging, combining, or embedding, one dataset builds upon another dataset. Integration presumes interdependence between the quantitative and qualitative components that is mutually illuminating (Moran-Ellis et al., 2006). Methodological integration has been key to unfolding the complex relationships in the topic of study (Bazeley, 2009). Moreover, a focus on integration encourages serendipity, stimulates theoretical creativity, and initiates new ideas (Brewer & Hunter, 2006; Greene, 2007). The sequential integration of qualitative and quantitative information is key to understanding GCM as a unique methodology for capturing the complexity of social phenomena (Rosas, 2017). Several recent reviews identify GCM as an innovative mixed-method research tool for addressing conceptual challenges relevant for measurement (Regnault et al., 2018; Palinkas et al., 2019). Many researchers claim the design, process, and results of their inquiry were stronger because of the mixed-methods nature of GCM. Arguably, the integrated mixed-method approach of GCM accentuates the strengths of each method while mitigating their respective weaknesses. Moreover, the sequenced steps of qualitative and quantitative data collection invoke a sense of methodological rigor that ultimately yields value, credibility, and legitimacy for the GCM process (Kane & Rosas, 2017). The ongoing advancement of computer technologies has added substantially to the analytical techniques that support integration of qualitative and quantitative methods.
GCM has contributed to and benefited from these technological developments, as evidenced by several web-based tools designed to capture, organize, and analyze GCM data. More recent developments include the marrying of other methodological approaches with GCM, such as social network data analysis (Goldman & Kane, 2014) and photovoice (Haque & Rosas, 2010). The number and range of applied GCM studies suggest a solid orientation toward mixed-methods thinking and practice. Yet only a limited number of GCM studies explicitly discuss GCM relative to existing mixed-method design typologies. Two examples illustrate different positions that are useful in contemplating GCM as a mixed-method approach. Windsor (2013) describes GCM as consistent with the fully mixed sequential
equal status method design proposed by Leech and Onwuegbuzie (2009). In this design, sequentially phased qualitative and quantitative data are mixed within phases, or across the stages of the research process, with both elements having equal weight. Another perspective (Hanson et al., 2005) describes two GCM studies conducted by Aarons et al. (2009) and Bachman et al. (2009) as sequential exploratory designs, in which data collection proceeds sequentially and quantitative data are prioritized in the analysis. These researchers detail how qualitative and quantitative data were transformed through multivariate analysis to quantitate the qualitative data elicited from focus groups, resulting in a sequential quantification of qualitative data.

VALUE TO CONSTRUCT VALIDATION

The practical, methodological, and epistemological consequences of poor measurement cannot be overlooked. Deficient or contaminated measures, measurement model misspecification, and weak theoretical rationale for hypotheses have serious ramifications for the appropriateness and accuracy of instrumentation (MacKenzie, 2003). Developing sound measures is a difficult, time-consuming, and costly enterprise, especially considering the complexity and challenge of establishing construct validity of new instrumentation. The value of methods to support the measurement development and testing process should be based on rigor, efficiency, and ability to yield information that helps avoid measurement misalignment. Contemporary validity testing theory emphasizes that validity requires evidence derived from both quantitative and qualitative research sources to justify a proposed interpretation and use of test scores (Hawkins et al., 2020). The construction of a conceptual framework, identification and design of questions, and implementation of a formal psychometric validation study is a complex process involving key decisions that impact the validity of the measure. A critical step in any attempt to design a new instrument is to clarify the conceptual domain under measurement. The conceptualization stage receives far less attention than psychometric testing and validation of new measures, yet it is arguably no less important. Explication of the component parts is vital to fully understanding the theoretical pattern of relationships within the content domain. GCM provides a systematic but flexible approach to achieve these ends.



An early foundational study on the use of GCM in the measurement development and testing process comes from Rosas and Camphausen (2007). They used GCM to engage staff and managers in the development of a framework of intended benefits of program participation and selected the scale's content from the result. GCM provided a systematic mechanism for generating and identifying the structure of the content through the articulation of relationships between multiple concepts within the domain. The presence of a clear conceptual framework offered a roadmap for decision-making. Items for the measure were selected by computing the item-total correlation and identifying those above a set criterion. Psychometric testing revealed excellent internal consistency characteristics, clearly discernible dimensions of individual strengths and capabilities, differences from measures of family strengths, and variation between program usage patterns in an expected manner. The authors argue that because development and testing decisions were made based on the conceptual framework related to perceived benefits of the program, claims of the scale's validity were more easily supported. Following this approach, Chen et al. (2015) further demonstrated the effectiveness of GCM in supporting high reliability and validity. They used GCM to operationalize the definition of individualism and collectivism, and the conceptual framework guided their selection of both new and validated items from published scales. Finally, confirmatory factor analysis (CFA) confirmed the proposed conceptual framework. Similarly, Borsky et al. (2016) linked the conceptualization and validation processes together, employing exploratory factor analysis (EFA) techniques following GCM to examine whether bystander behavior directed at friends to prevent dating violence is a uni- or multi-dimensional construct.
Strong psychometric results were found from applying the conceptual framework as a guide in subsequent analyses of the scale's measurement properties. Rosas and Ridings' (2017) systematic review examined several elements of GCM in the scale development and validation process. In their assessment of 23 scale development and evaluation studies where GCM was used, they noted the value of the method for establishing content validity, facilitating researcher decision-making (including a variety of item selection methods), generating insight into target population perspectives integrated a priori, and establishing a foundation for analytical and interpretative choices. Others have acknowledged some of these benefits in more recent scale development and validation processes using GCM. For example, in the development and psychometric validation of an uncertainty scale related to emergency department use, Rising, Doyle
et al. (2019) emphasized that the application of GCM ensured the range and scope of the content domain was more adequately represented than would be obtained from literature review and research team discussion alone. In their use of GCM, Adams et al. (2021) noted that the constructs identified in the measurement framework for Black men's depressive symptoms in community and clinical settings are not presently captured in existing studies. The emergent conceptual framework formed the basis for future scale development studies with this specific population because the new theoretical constructs were more sensitive and culturally appropriate. Similarly, other researchers report using GCM to identify constructs not previously found through other methods, such as expert opinion (Stanick et al., 2021), or largely missing from existing research and taxonomies (Thomas et al., 2016; Barnert et al., 2018). Methodologically, Wilberforce et al. (2018) found GCM to possess certain advantages over other methods traditionally used for content generation. They argue the methodology overcomes several limitations found in purely qualitative methods used for content elicitation. In the development of the Recovery Promotion Fidelity Scale, Armstrong and Steffen (2009) indicated the use of GCM helped establish acceptable levels of face and content validity, as well as provide a mechanism for engaging multiple stakeholders in establishing the content. Indeed, comparisons of GCM to other qualitative tools have produced several noteworthy results. Humphrey et al. (2017) found that, despite providing the greatest conceptualization of the patient experience, interviews were the most time-intensive and methodologically demanding for research staff and participants. Further, GCM provided greater value in minimizing researcher bias within the research topic.
They contend that when researchers need to establish the relative importance of concepts in a way that provides greater objectivity and confidence in measuring the most important concepts, GCM adds value above other qualitative concept elicitation methods. LaNoue et al. (2019) point out that as a comprehensive conceptualization tool, GCM optimizes content organization, producing a framework to understand ideas generated during the initial brainstorming phases. Rising, LaNoue et al. (2019) claim participant-generated maps of the brainstormed ideas suggest the ways content is related, and participant ratings help in selecting priority items for use in psychometric testing and research. Collectively, these researchers emphasize the higher-level conceptualization results produced in GCM provide greater value to the measurement development and testing process than simple elicitation methods like brainstorming and interviews.
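The item-total selection criterion mentioned earlier (Rosas & Camphausen, 2007) can be sketched with simulated data. This is an illustration only: the 0.30 cut-off, sample size, and data-generating model are hypothetical, chosen so that one unrelated item fails the criterion.

```python
import numpy as np

def corrected_item_total(X):
    """Correlation of each item with the total of the remaining items
    (the corrected item-total correlation)."""
    n_items = X.shape[1]
    total = X.sum(axis=1)
    r = np.empty(n_items)
    for j in range(n_items):
        rest = total - X[:, j]              # total score excluding item j
        r[j] = np.corrcoef(X[:, j], rest)[0, 1]
    return r

# Simulated responses: four items driven by one latent trait, plus a
# pure-noise item that should fail the selection criterion.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
items = [latent + rng.normal(scale=0.5, size=200) for _ in range(4)]
items.append(rng.normal(size=200))          # unrelated item
X = np.column_stack(items)

r = corrected_item_total(X)
keep = np.where(r >= 0.30)[0]               # retain items above the cut-off
```

With this setup the four trait-driven items correlate strongly with the remainder of the scale and are retained, while the noise item falls below the criterion.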


A small but growing number of researchers have utilized GCM as a conceptualization tool in the measurement development and validation process. Several scholars argue GCM occupies a key methodological role in evaluating the characteristics and quality of newly developed measures by exploring the linkage between the theoretical and observed measurement patterns (Trochim, 1985; Rosas, 2017; Rosas & Ridings, 2017). Indeed, the comprehensiveness and clarity of the conceptual origin are essential to assessing the correspondence between the conceptual and operational domains of measurement. As a group conceptualization technique, GCM enables conceptual elaboration and explication of patterns that define the conceptual domain of interest (Rosas, 2017). The value of GCM to conventional measurement development and psychometric testing processes can be judged in the context of three considerations for content and construct validation (Simms, 2008; Pedhazur & Schmelkin, 2013). First, GCM yields both the content from which a scale or measure might be constructed and an interrelated structure among elements illuminating what is and is not core to the measurement pattern. Content comprehensiveness and clarity are essential to defining components of any new measurement tool and are often underexplored (Hinkin, 1995; Simms, 2008). Second, the use of GCM in the explication of a theoretical measurement pattern enables researchers to employ a strategy to examine its structural fidelity. As models are specified and re-specified through psychometric analyses, GCM can yield insights about the structural relationships among items parallel to other objects in concept mapping outputs (i.e., items, clusters, dimensions). Thus, GCM enables the specification of an intricate architecture of hypothesized relationships within a domain of interest.
Researchers applying GCM for construct specification can account for configural properties produced through group conceptualization, helping avoid under- or over-specification of hypothesized constructs that would degrade construct validity. Finally, GCM can suggest a priori hypotheses about relationships between the target constructs and other constructs and provide a firm foundation for building the evidence of validity across a range of interpretations and applications. Precision in identifying the configural properties of the content domain is vital for decisions regarding confirmatory analytical processes. For example, Southern et al. (2002) first used GCM to conceptualize the content area, followed by Batterham et al.'s (2002) confirmatory analyses in the form of structural equation modeling (SEM) and confirmatory factor analysis (CFA), together operationalizing the linkage between the
theoretical and observed measurement pattern in the pursuit of construct validity.

DESIGN AND IMPLEMENTATION OF GROUP CONCEPT MAPPING

The design and conduct of group concept mapping (GCM) are described extensively in other resources (see Kane & Trochim, 2007; Kane & Rosas, 2017), and readers are encouraged to review those publications for a step-by-step approach. In general, the concept mapping process is divided into six major stages (Trochim, 1989a), summarized below. Table 3.1 highlights these stages and the major activities within each.

Preparation

During the Preparation stage, the researcher organizes the process to guide the overall GCM inquiry and makes key decisions related to the project's focus, participants, schedule, and utilization of results. The researcher, often in collaboration with a planning or advisory group, considers the desired outcomes and potential uses of the results as well as advantages and limitations of the context. Two essential activities are completed at this stage: focus prompt development and participant identification. The importance of aligning the focus of the inquiry with those who can provide information in response to that focus cannot be overstated. Establishing a focus prompt is often an iterative process. The focus prompt is drafted early in the planning process and pilot tested with a small group to ensure it yields appropriate responses. Used to facilitate the generation of ideas, the focus prompt is sometimes presented as a direct question, such as "What are the factors that influence the acceptance and use of evidence-based practices in publicly funded mental health programs for families and children?" (Green & Aarons, 2011). More often, however, a sentence completion focus prompt is employed, such as "A reason or way that older adults misuse or abuse prescription and nonprescription medications would be…" (Berman et al., 2019). The sentence completion prompt is designed to encourage generated statements that are concise, address the topic of inquiry, and are grammatically and syntactically similar. A second critical activity during the Preparation stage is the identification and engagement of the participants. Participants in GCM are typically non-randomly and purposefully selected. The goal in the Idea Generation stage is often to capture a broad set of ideas. Purposefully inviting a broad number of participants across a range of disciplines and experiences related to the topic will likely yield a more diverse set of ideas. The inclusion of diverse perspectives ensures the sorting and rating activities produce information that enables the emergence of details in the conceptualized model that are attributable to the group beyond any individual (Rosas, 2017).

Table 3.1  Group concept mapping stages and activities

GCM stage        Major activities
Preparation      Recruit participants; define schedule; determine focus
Idea Generation  Brainstorm; synthesize items
Structuring      Sort; rate (one or more ratings); collect demographics
Representation   Analyze sort aggregation; conduct and review multidimensional scaling; analyze by clusters
Interpretation   Assess maps; determine cluster configuration; match by patterns (ladder graphs); assess bivariate scatter plots (Go-Zones)
Utilization      Prioritize maps, patterns, plots; define content areas and items for use in measure development

Idea Generation

In GCM, the surfacing of ideas during the Idea Generation stage can take many forms. Brainstorming is most common for content elicitation in GCM; however, text extraction from interviews, open-ended surveys, focus groups, or content analysis of documents are options (see Jackson & Trochim, 2002). Typically, participants are invited to generate or "brainstorm" any number of ideas in response to the inquiry's focus prompt. In virtual, asynchronous GCM applications, participants access a dedicated website and respond anonymously, generating as many responses to the focus prompt as they wish while observing the accumulating set of ideas from others. For in-person approaches, a facilitator guides a group brainstorming session where participants interactively generate ideas in response to the prompt. During virtual brainstorming, as the number of ideas on the brainstorming website increases,
it is expected that some participants will make more specific contributions or "tweaks" to ideas already generated by other participants. Careful consideration of the diversity and size of the participant pool during the Preparation stage has a direct effect on the brainstorming process, with large, diverse participant groups yielding a more nuanced set of responses with specific contextual distinctions. The brainstorming activity ends at the point of idea saturation. Conceptually, saturation of the topic is based on two factors considered simultaneously: (a) the judgment that the content provided in response to the focus prompt adequately captures the breadth (range of ideas) and depth (range of specificity) of the topic, and (b) no new information is being provided. An intermediate step during the Idea Generation stage is Idea Synthesis. GCM typically works best with a final set of about 100 or fewer statements (Kane & Trochim, 2007). During virtual brainstorming, large statement sets can be produced, and it is not unusual to generate several hundred (sometimes thousands, depending on the number of participants) responses to the prompt. Although this intermediate step might seem less of a concern for in-person brainstorming, multiple in-person sessions can increase the volume of raw statements as well. Thus, a systematic process is often needed to account for the many responses to the prompt, whether generated virtually or through multiple in-person sessions. To reduce the set to a manageable number of statements (e.g., 80–100) for sorting and rating, a statement reduction procedure is commonly
employed. As outlined by Kane and Trochim (2007) and further detailed by Kane and Rosas (2017), the Idea Synthesis process involves discarding any unreadable or unintelligible statements, editing each statement so that it grammatically completes the focus prompt, and splitting any double-barreled statements. A variation of a Key Words In Context (KWIC) approach (Krippendorff, 2004) is then applied to the edited statement set to assign each statement up to three classificatory terms or phrases so that decisions can be made as to which statements to retain relative to the content structure. In contrast to a thematic analysis of qualitative content, the process is intended to yield a final set of clear and relevant statements representative of the entire body of raw content from the Idea Generation step. Moreover, a complete audit trail of the process is kept to ensure each final statement can be traced to its ancestor statements in the original brainstormed set (Brown, 2005).
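The audit trail described above amounts to simple bookkeeping that can be checked automatically. A minimal sketch (all statement IDs and the discard reason are invented): every raw brainstormed statement must end up either discarded with a reason or recorded as an ancestor of at least one synthesized statement.

```python
# Hypothetical audit trail for Idea Synthesis: every raw brainstormed
# statement must be either discarded (with a reason) or traceable as an
# ancestor of at least one synthesized statement.
raw_ids = {1, 2, 3, 4, 5, 6}

discarded = {4: "unintelligible"}           # raw id -> reason

ancestry = {                                # final statement id -> ancestors
    101: [1, 3],                            # near-duplicates merged
    102: [2],
    103: [5, 6],
}

def audit(raw_ids, discarded, ancestry):
    """Return the raw statement ids that are unaccounted for."""
    covered = set(discarded) | {rid for ids in ancestry.values() for rid in ids}
    return raw_ids - covered
```

An empty result from `audit` confirms the synthesized set accounts for the entire body of raw content; any IDs returned point to statements lost during editing.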

Structuring

Three data collection tasks are completed by participants during the Structuring stage: sorting, rating, and providing participant characteristics. For the sorting task, each participant sorts the final list of ideas into groups based on similarity. This task usually requires a well-informed understanding of the inquiry's focus and can take approximately 30 to 60 minutes to complete. To complete the task, each participant is instructed to arrange the set of statements into groups "in a way that makes sense to you". The only restrictions are that there cannot be: (a) N groups (every group containing a single item); (b) one group consisting of all items; or (c) a "miscellaneous" group (any item thought to be unique is to be put in its own separate pile). Sorting in GCM is an unstructured sort task; that is, there is no pre-determined number of groups that participants are expected to produce (Rosenberg & Kim, 1975). Sorting can be done via online sorting platforms (i.e., virtual tabletop card sorting) or in person, where participants sort a stack of "cards" with each statement printed on a separate card. Participants are also requested to provide a label for each group they create to summarize its contents. For the rating task, each participant rates the list of statements on one or more pre-determined scales. Typically, a five-point Likert-type scale is used, although any range meeting the goals of the inquiry is appropriate. Participants are directed to rate statements relative to the others and to utilize the full range of the scale(s). Participants also answer a set of background/demographic questions at the
time of sorting and rating. This information enables the researcher to examine sample variation and compare how specific ideas and clusters of ideas were rated by different groups of stakeholders. Prior to data analysis, the researcher reviews the submitted data and determines the acceptability for computing the multiple visualizations and representations.
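A minimal sketch of how the rating data might be tabulated (the ratings, scale, and stakeholder split are all invented): each row holds one participant's five-point ratings of the statements, and averages are computed per statement overall and within stakeholder groups.

```python
import numpy as np

# Hypothetical ratings: rows are participants, columns are statements,
# each rated on a five-point importance scale.
ratings = np.array([
    [5, 4, 2, 1, 3],
    [4, 4, 1, 2, 3],
    [5, 3, 2, 2, 4],
    [3, 5, 1, 1, 2],
])

item_means = ratings.mean(axis=0)           # average rating per statement

# A hypothetical stakeholder split (e.g., first two participants are
# staff, the rest managers) for comparing group-level ratings.
staff_means = ratings[:2].mean(axis=0)
manager_means = ratings[2:].mean(axis=0)
```

The same per-item averages are later aggregated at the cluster level once a cluster configuration has been chosen.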

Representation

In the Representation stage, GCM employs multivariate analytical techniques, including multidimensional scaling (MDS) of the unstructured sort data and hierarchical cluster analysis (HCA) of the MDS coordinates, to produce two-dimensional maps showing clustered statements (Trochim, 1989a). GCM data analysis begins with the compilation of the sorting responses and construction of an N × N binary, symmetric matrix of similarities for each participant. For any two items, a 1 is entered in the cell for the pair if the two items were placed in the same group by the participant; otherwise, a 0 is entered (Coxon, 1999). The total N × N similarity matrix is obtained by summing the individual binary sorting matrices of all participants. Thus, any cell in the total matrix will have a value between 0 and the number of people who sorted the statements, indicating the number of people who placed that pair of statements in the same group. Higher values indicate more participants paired the statements together; lower values indicate fewer participants sorted the items together. The total similarity matrix is then analyzed using nonmetric MDS with a two-dimensional solution. The analysis yields a two-dimensional (x, y) configuration of the set of statements based on the criterion that statements grouped together most often are closely positioned, while those grouped together less frequently are further apart. The MDS analysis yields a stress statistic that reflects the fit of the final representation with the original similarity matrix used as input (Kruskal & Wish, 1978). For any given configuration, the stress value indicates how well that configuration matches the data: the better the "fit" of a map to the original similarity matrix, the lower the stress value. Stress values in typical GCM projects average around .28 and range between .17 and .34 (Rosas & Kane, 2012).
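The Representation-stage computations can be sketched in Python. This is a simplified illustration with invented sort data: classical (metric) MDS and SciPy's Ward linkage stand in for the nonmetric MDS and the specific HCA implementation used in dedicated GCM software.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def similarity_matrix(sorts, n_items):
    """Sum the participants' binary co-occurrence matrices: cell (i, j)
    counts how many participants placed statements i and j in one pile."""
    total = np.zeros((n_items, n_items), dtype=int)
    for groups in sorts:
        for group in groups:
            for i in group:
                for j in group:
                    total[i, j] += 1
    return total

def classical_mds(dissim, dims=2):
    """Classical (Torgerson) metric MDS -- a simplified stand-in for the
    nonmetric MDS used in GCM practice. Returns an (N, dims) layout."""
    n = dissim.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dissim ** 2) @ centering   # double-centering
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:dims]                # largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Three hypothetical participants, each sorting five statements into piles
sorts = [
    [[0, 1], [2, 3, 4]],
    [[0, 1, 2], [3, 4]],
    [[0, 1], [2], [3, 4]],
]
S = similarity_matrix(sorts, n_items=5)

# Dissimilarity: statements sorted together by more people are "closer".
D = (len(sorts) - S).astype(float)
np.fill_diagonal(D, 0.0)

xy = classical_mds(D)                       # two-dimensional point map

# Ward's hierarchical clustering of the MDS coordinates, cut at a
# researcher-chosen number of clusters (here, two).
Z = linkage(xy, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Statements 0 and 1 (and likewise 3 and 4), which every participant sorted together, land at the same point on the map and fall in the same cluster, while the two pairs end up in different clusters.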
The MDS configuration of the statement points is graphed along two dimensions (see Figure 3.1). This “point map” displays the location of the final set of statements, each marked by an identifying number. Statements closer to each other are generally expected to be more similar in meaning, reflecting the judgements regarding statement similarity made by participants during the sorting task.

Figure 3.1  Example point map based on multidimensional scaling analysis. Points are marked by statement number; similarity is based on proximity to other points

The second major analysis conducted in GCM is hierarchical cluster analysis (HCA). Using the x,y configuration as input, HCA applies Ward’s algorithm (Everitt, 1980) as the basis for defining a cluster. Because the MDS configuration is used as input, the clustering algorithm partitions the point map into non-overlapping groups or clusters, grouping statements on the map into sets that presumably reflect similar concepts. The computational process for determining the optimal cluster configuration begins by considering each statement to reside in its own cluster. At each stage of HCA, the algorithm combines two clusters based on centroid computations until eventually all the statements are in a single cluster. The final cluster configuration is determined by the researcher, as no simple mathematical criterion exists by which a final cluster arrangement is selected. The procedure outlined by Kane and Trochim (2007) is to examine a range of cluster configurations to identify the most parsimonious description of the point map. The final number of

clusters is selected to preserve the most detail and still yield substantively interpretable clusters of statements. Based on the researcher’s selection of the optimal cluster configuration, a “cluster map” is produced that displays the original statement points enclosed by polygon-shaped boundaries for the clusters. An example cluster map is displayed in Figure 3.2.

Figure 3.2  Example eight-cluster point-cluster map based on hierarchical cluster analysis of the MDS plotted points

Once a final cluster configuration is selected, the rating data are averaged at the item and cluster levels. Graphically, the averages are displayed in a “point rating map” showing the original point map with the average rating per statement displayed as vertical columns, and in a “cluster rating map” that shows the cluster average rating as vertical layers. These two graphical representations are illustrated in Figures 3.3a and 3.3b. Two additional graphical and statistical analyses are computed based on the map results. A “pattern match” is the pairwise comparison of average cluster ratings across criteria such as different stakeholder groups or ratings, using a ladder graph representation. The ladder graph, or pair-link diagram, is displayed with two vertical axes representing two variables (i.e., groups or ratings) and horizontal lines connecting them to represent the averages for each cluster. An example pattern match graphic is represented in Figure 3.4. A correlation coefficient (r) is also computed to indicate the magnitude of the overall pattern match. In addition, standard descriptive statistics are produced (M, SD, N) that enable significance tests of differences between average cluster ratings among subgroups. Finally, within each cluster an analysis comparing average statement values for two rating variables is conducted. The result of this comparison is represented as a within-cluster bivariate plot of average statement ratings for two groups or two ratings. An example of a within-cluster bivariate graph is shown in Figure 3.5. Like a pattern match, the bivariate plot also displays a correlation coefficient (r) to indicate the magnitude of the relationship between the two variables. The plot is a restricted form of a standard bivariate plot that: (a) sets the minimum and maximum values for all plots to the same range (based on the minimum and maximum statement average for that variable); and (b) divides the bivariate space into quadrants based on the cluster average of the x and y variables. Thus, every plot has a quadrant that shows which statements in the cluster

were rated above average on both variables, one that shows which statements were below average on both, and two that show statements that were above average on one and below on the other. Diagnostically, the plots allow for precise examination of the variation in item-level ratings and aid in distinguishing higher versus lower rated items within each cluster.
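To make the quadrant logic concrete, the sketch below classifies each statement in a cluster relative to the cluster averages on two rating variables and computes the correlation displayed on the plot. The statement numbers and ratings here are hypothetical; numpy is assumed:

```python
import numpy as np

def go_zone(statement_ids, x_ratings, y_ratings):
    """Split a cluster's bivariate space into quadrants at the cluster
    mean of each rating variable; 'high-high' is the go-zone."""
    x = np.asarray(x_ratings, dtype=float)
    y = np.asarray(y_ratings, dtype=float)
    x_cut, y_cut = x.mean(), y.mean()
    quadrants = {
        sid: (("high" if xi >= x_cut else "low")
              + "-"
              + ("high" if yi >= y_cut else "low"))
        for sid, xi, yi in zip(statement_ids, x, y)
    }
    r = np.corrcoef(x, y)[0, 1]  # correlation shown on the plot
    return quadrants, r

# Hypothetical average item ratings for one cluster, two subgroups.
ids = [27, 56, 70, 28, 60]
group1 = [4.2, 3.9, 3.1, 2.8, 3.5]
group2 = [4.0, 3.6, 3.4, 2.5, 3.0]
quadrants, r = go_zone(ids, group1, group2)
```

Statements in the “high-high” quadrant are the ones rated above the cluster average by both subgroups, which is why that quadrant is diagnostically the most interesting.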

Figure 3.3  (a) Example point-rating map with layers (range by quintile) indicating the average level of rating per item and (b) example cluster rating map with layers (range by quintile) indicating the average level of rating per cluster

Figure 3.4  Example ladder graph depicting the pattern match between average cluster ratings for two ratings

Figure 3.5  Example bivariate scatter plot (go-zone) for a single cluster of items contrasting the average item ratings for two different subgroups. Points are marked by statement number

Interpretation

In the Interpretation stage, the researcher usually engages a subset of stakeholders in a facilitated session that follows a prescribed sequence of steps. An interpretive review generally follows the structured process described in detail in Trochim (1989a) and Kane and Trochim (2007). Generally, the researcher convenes a group for preliminary interpretation and provides an overview of the brainstorming, sorting, and rating tasks performed earlier. Along with a list of final statements, the point map is shown, together with an explanation of how statement placement on the map is related to the judgments made by participants during sorting. To reinforce the notion that the analysis placed the statements sensibly, the group identifies statements in various places on the map and examines the contents of those statements. After examining the numbered point map, the group reviews the results of the analysis that organized the points (i.e., statements) into clusters. The group discusses the cluster map, working cluster by cluster to reach agreement on an acceptable label for each cluster. In rare instances where the group has difficulty achieving a consensus, hybrid names are used that combine key terms or phrases from labels submitted by sorting participants. During the Interpretation session, the agreed-upon cluster labels are shown on the final map, and participants are directed to examine the labeled cluster map to determine whether it makes sense to them. Participants are reminded that, in general, clusters closer together on the map should be conceptually more similar than clusters farther apart, and they are asked to confirm the sensibility of the visual structure. Furthermore, participants are asked to discuss and label any interpretable groups of clusters or “regions.” As with the clusters, the group arrives at a consensus label for each of the identified regions.

Furthering the interpretation of the concept mapping output is the use of bridging values (Kane & Rosas, 2017). Bridging values and their computation are unique to Trochim’s approach to GCM. Bridging values range from 0 to 1 and are computed for each statement and cluster during the GCM analysis. Computed after MDS and HCA, the bridging values describe how each item on the map is related to the statements around it by considering the strength of the aggregated sorting data. The values indicate the degree of concurrence among those who sorted.
Lower values (anchors) better indicate the conceptual topics associated with a particular area of the map and higher values (bridges) suggest a broader relationship of that statement across the map (Kane & Trochim, 2007). Observing both types of underlying relationships is valuable in interpreting and finalizing the GCM results.
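The exact bridging computation used in Trochim's approach is not reproduced in this chapter. Purely as an illustration of the idea, the sketch below scores each statement by its similarity-weighted average map distance to the other statements and rescales the result to the 0 to 1 range, so that locally sorted statements (anchors) score low and broadly sorted statements (bridges) score high. The data and function are hypothetical; numpy is assumed:

```python
import numpy as np

def bridging_values(coords, similarity):
    """Illustrative bridging index (NOT Trochim's exact algorithm).
    A statement co-sorted mainly with nearby statements scores near 0
    (anchor); one co-sorted with statements across the map scores
    near 1 (bridge)."""
    coords = np.asarray(coords, dtype=float)
    S = np.asarray(similarity, dtype=float)
    raw = np.zeros(len(coords))
    for i in range(len(coords)):
        d = np.linalg.norm(coords - coords[i], axis=1)  # map distances
        w = S[i].copy()
        w[i] = 0.0  # ignore self-similarity on the diagonal
        raw[i] = (w * d).sum() / w.sum() if w.sum() > 0 else 0.0
    rng = raw.max() - raw.min()
    return (raw - raw.min()) / rng if rng > 0 else np.zeros_like(raw)

# Hypothetical MDS coordinates and aggregate co-sort counts.
coords = [[0, 0], [0.2, 0], [4, 0], [2, 2]]
S = [[3, 3, 0, 1],
     [3, 3, 0, 1],
     [0, 0, 3, 2],
     [1, 1, 2, 3]]
b = bridging_values(coords, S)
```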

Utilization

Finally, the Utilization stage involves the application of GCM results through a carefully planned strategy, usually developed in advance of implementing the GCM process. Because GCM produces a multidimensional, multi-cluster framework of constructs, the utility of the results is frequently manifold. Kane and Rosas (2017) argue the need for a clearly articulated linkage between the purpose of a GCM

inquiry and what might be done with the resulting information. GCM is more than simply the collection of visual outputs, and consideration of the process and outcomes is important. As Kane and Rosas (2017) emphasize, transaction between explicit and tacit forms of knowledge occurs when each is clearly specified, combined, and incorporated into practice.

RELIABILITY AND VALIDITY OF GROUP CONCEPT MAPPING

The reliability and validity of the GCM methodology have been established through multiple independent analyses. Rosas and Kane (2012) conducted the most comprehensive examination. In a systematic analysis of 69 GCM studies, individual study characteristics and estimates were pooled and quantitatively summarized, describing the distribution, variation, and parameters for each. Variation in GCM data collection in relation to those characteristics and estimates was also examined. Overall, the results suggest GCM yields strong internal representational validity and very strong sorting and rating reliability estimates. These estimates were consistently high despite variation in participation and task completion percentages across data collection modes. The results aligned with an earlier examination conducted by Trochim (1993), who initially established the reliability procedures used by Rosas and Kane. Trochim’s original examination established that the GCM process is reliable according to generally recognized standards for acceptable reliability levels. More recently, a third independent evaluation of quality indicators was conducted by Donnelly (2017) in his study of doctoral dissertations in which GCM was used. He noted substantial similarity between the stress values in the sample of dissertations and those found by Rosas and Kane (2012) and Trochim (1993). He further showed that stress values were not dependent on the data collection modality, suggesting the internal representational validity of GCM is not influenced by in-person, online, or mixed-mode data collection.

CONCLUSION

As Buchbinder et al. (2011) reinforce, dialogue with targets, robust psychometric testing, and assessment of the final content against the original


conceptual framework is a continuous process that generates evidence of the validity and utility of the instrument. Arguably, GCM is a valuable tool for researchers as they contemplate how group input informs the measure development process, the methods that are used and integrated to yield constructs, and how the conceptualized information is utilized. Accurate conceptualization and measurement development processes depend on the clarity of latent and emergent constructs. As a source of these constructs, a firm grasp of the role of participants in GCM and how they inform the measure development process is imperative. Klein and Kozlowski (2000) argue that expert informants are the most appropriate sources for higher-level constructs that they can directly observe or for which they have unique knowledge of configural properties. Conversely, informants are limited in their ability to identify shared and unobservable configural properties and require some systematic procedure to explicate useful constructs. Thus, GCM can be an invaluable tool for the measure development and validation process, helping researchers manage issues related to the aggregation, combination, and representation of latent and emergent constructs.

REFERENCES

Aarons, G. A., Wells, R. S., Zagursky, K., Fettes, D. L., & Palinkas, L. A. (2009). Implementing evidence-based practice in community mental health agencies: A multiple stakeholder analysis. American Journal of Public Health, 99(11), 2087–95. Adams, L. B., Baxter, S. L., Lightfoot, A. F., Gottfredson, N., Golin, C., Jackson, L. C., … & Powell, W. (2021). Refining Black men’s depression measurement using participatory approaches: a concept mapping study. BMC Public Health, 21(1), 1–10. Akkerman, S., Van den Bossche, P., Admiraal, W., Gijselaers, W., Segers, M., Simons, R. J., & Kirschner, P. (2007). Reconsidering group cognition: From conceptual confusion to a boundary area between cognitive and socio-cultural perspectives? Educational Research Review, 2(1), 39–63. Armstrong, N. P., & Steffen, J. J. (2009). The recovery promotion fidelity scale: assessing the organizational promotion of recovery. Community Mental Health Journal, 45(3), 163–70. Bachman, M. O., O’Brien, M., Husbands, C., Shreeve, A., Jones, N., Watson, J., … & Mugford, M. (2009). Integrating children’s services in England: National evaluation of children’s trusts. Child: Care, Health and Development, 35, 257–65. Barnert, E. S., Coller, R. J., Nelson, B. B., Thompson, L. R., Klitzner, T. S., Szilagyi, M., … & Chung, P. J.


(2018). A healthy life for a child with medical complexity: 10 domains for conceptualizing health. Pediatrics, 142(2). Barron, B. (2000). Achieving coordination in collaborative problem-solving groups. The Journal of the Learning Sciences, 9(4), 403–36. Batterham, R., Southern, D., Appleby, N., Elsworth, G., Fabris, S., Dunt, D., & Young, D. (2002). Construction of a GP integration model. Social Science and Medicine, 54, 1225–41. Bazeley, P. (2009). Editorial: Integrating data analyses in mixed methods research. Journal of Mixed Methods Research, 3(3), 203–7. Berman, R. L., Iris, M., Conrad, K. J., & Robinson, C. (2019). Validation of the MedUseQ: A self-administered screener for older adults to assess medication use problems. Journal of Pharmacy Practice, 32(5), 509–23. Borsky, A. E., McDonnell, K., Rimal, R. N., & Turner, M. (2016). Assessing bystander behavior intentions toward friends to prevent dating violence: Development of the bystander behavior intentions-friends scale through concept mapping and exploratory factor analysis. Violence and Victims, 31(2), 215–34. Brewer, J., & Hunter, A. (2006). Foundations of multimethod research: Synthesizing styles. Thousand Oaks, CA: Sage. Brown, J. S. (2005). So many ideas, so little time: Statement synthesis in a youth development context. Paper presented at the annual meeting of the American Evaluation Association, Toronto, Canada. Buchbinder, R., Batterham, R., Elsworth, G., Dionne, C. E., Irvin, E., & Osborne, R. H. (2011). A validity-driven approach to the understanding of the personal and societal burden of low back pain: Development of a conceptual and measurement model. Arthritis Research & Therapy, 13(5), R152. Chen, X., Gong, J., Yu, B., Li, S., Striley, C., Yang, N., & Li, F. (2015). Constructs, concept mapping, and psychometric assessment of the Concise Scale of Individualism–Collectivism. Social Behavior and Personality: An International Journal, 43(4), 667–83. Coxon, A. P. M. (1999).
Sorting data: Collection and analysis. Thousand Oaks, CA: Sage. Creswell, J. W., Plano Clark, V. L., Gutmann, M. L., & Hanson, W. E. (2003). Advanced mixed methods research designs. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 209–40). Thousand Oaks, CA: Sage. Donnelly, J. P. (2017). A systematic review of concept mapping dissertations. Evaluation and Program Planning, 60, 186–93. Everitt, B. (1980). Cluster analysis (2nd ed.). New York, NY: Halsted Press. Goldman, A. W., & Kane, M. (2014). Concept mapping and network analysis: An analytic approach to measure ties among constructs. Evaluation and Program Planning, 47, 9–17.



Green, A. E., & Aarons, G. A. (2011). A comparison of policy and direct practice stakeholder perceptions of factors affecting evidence-based practice implementation using concept mapping. Implementation Science, 6(1), 104. Greene, J. C. (2007). Mixed methods in social inquiry (vol. 9). New York, NY: John Wiley & Sons. Greene, J. C., & Caracelli, V. J. (Eds.). (1997). Advances in mixed-method evaluation: The challenges and benefits of integrating diverse paradigms. New Directions for Evaluation, vol. 74. San Francisco, CA: Jossey-Bass. Hanson, W. E., Creswell, J. W., Clark, V. L. P., Petska, K. S., & Creswell, J. D. (2005). Mixed methods research designs in counseling psychology. Journal of Counseling Psychology, 52(2), 224. Haque, N., & Rosas, S. (2010). Concept mapping of photovoices: Sequencing and integrating methods to understand immigrants’ perceptions of neighborhood influences on health. Family & Community Health, 33(3), 193–206. Hawkins, M., Elsworth, G. R., Hoban, E., & Osborne, R. H. (2020). Questionnaire validation practice within a theoretical framework: a systematic descriptive literature review of health literacy assessments. BMJ Open, 10(6), e035974. Hinkin, T. R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967–88. Hinsz, V. B., Tindale, R. S., & Vollrath, D. A. (1997). The emerging conceptualization of groups as information processors. Psychological Bulletin, 121(1), 43. Humphrey, L., Willgoss, T., Trigg, A., Meysner, S., Kane, M., Dickinson, S., & Kitchen, H. (2017). A comparison of three methods to generate a conceptual understanding of a disease based on the patients’ perspective. Journal of Patient-reported Outcomes, 1(1), 1–12. Jackson, K. M., & Trochim, W. M. (2002). Concept mapping as an alternative approach for the analysis of open-ended survey responses. Organizational Research Methods, 5(4), 307–36. Kane, M., & Rosas, S. (2017). 
Conversations about group concept mapping: Applications, examples, and enhancements. Sage. Kane, M., & Trochim, W. M. (2007). Concept mapping for planning and evaluation. Thousand Oaks, CA: Sage. Kane, M., & Trochim, W. M. (2009). Concept mapping for applied social research. The Sage Handbook of Applied Social Research Methods, (pp. 435–74). Thousand Oaks, CA: Sage. Klein, K. J., & Kozlowski, S. W. (2000). From micro to meso: Critical steps in conceptualizing and conducting multilevel research. Organizational Research Methods, 3(3), 211–36.

Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Newbury Park, CA: Sage. Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage. LaNoue, M., Gentsch, A., Cunningham, A., Mills, G., Doty, A. M., Hollander, J. E., … & Rising, K. L. (2019). Eliciting patient-important outcomes through group brainstorming: when is saturation reached? Journal of Patient-reported Outcomes, 3(1), 1–5. Leech, N. L., & Onwuegbuzie, A. J. (2009). A typology of mixed methods research designs. Quality & Quantity, 43(2), 265–75. MacKenzie, S. B. (2003). The dangers of poor construct conceptualization. Journal of the Academy of Marketing Science, 31(3), 323–6. Morgeson, F. P., & Hofmann, D. A. (1999). The structure and function of collective constructs: Implications for multilevel research and theory development. Academy of Management Review, 24(2), 249–65. Moran-Ellis, J., Alexander, V. D., Cronin, A., Dickinson, M., Fielding, J., Sleney, J., & Thomas, H. (2006). Triangulation and integration: Processes, claims and implications. Qualitative Research, 6(1), 45–59. Palinkas, L. A., Mendon, S. J., & Hamilton, A. B. (2019). Innovations in mixed methods evaluations. Annual Review of Public Health, 40, 423–42. Pedhazur, E. J., & Schmelkin, L. P. (2013). Measurement, design, and analysis: An integrated approach. New York, NY: Psychology Press. Regnault, A., Willgoss, T., & Barbic, S. (2018). Towards the use of mixed methods inquiry as best practice in health outcomes research. Journal of Patient-reported Outcomes, 2(1), 1–4. Rising, K. L., Doyle, S. K., Powell, R. E., Doty, A. M., LaNoue, M., & Gerolamo, A. M. (2019). Use of group concept mapping to identify patient domains of uncertainty that contribute to emergency department use. Journal of Emergency Nursing, 45(1), 46–53. Rising, K. L., LaNoue, M., Gentsch, A. T., Doty, A. M., Cunningham, A., Carr, B. G., … & Mills, G. (2019).
The power of the group: comparison of interviews and group concept mapping for identifying patient-important outcomes of care. BMC Medical Research Methodology, 19(1), 1–9. Rosas, S. R. (2017). Group concept mapping methodology: Toward an epistemology of group conceptualization, complexity, and emergence. Quality & Quantity, 51, 1403–16. Rosas, S. R., & Camphausen, L. C. (2007). The use of concept mapping for scale development and validation in evaluation. Evaluation and Program Planning, 30(2), 125–35. Rosas, S. R., & Kane, M. (2012). Quality and rigor of the concept mapping methodology: A pooled study analysis. Evaluation and Program Planning, 35(2), 236–45.


Rosas, S. R., & Ridings, J. (2017). The use of concept mapping in measurement development and evaluation: Application and future directions. Evaluation and Program Planning, 60, 265–76. Rosenberg, S., & Kim, M. P. (1975). The method of sorting as a data gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489–502. Simms, L. J. (2008). Classical and modern methods of psychological scale construction. Social and Personality Psychology Compass, 2(1), 414–33. Southern, D. M., Young, D., Dunt, D., Appleby, N. J., & Batterham, R. W. (2002). Integration of primary health care services: Perceptions of Australian general practitioners, non-general practitioner health service providers and consumers at the general practice–primary care interface. Evaluation and Program Planning, 25(1), 47–59. Stanick, C. F., Halko, H. M., Nolen, E. A., Powell, B. J., Dorsey, C. N., Mettert, K. D., … & Lewis, C. C. (2021). Pragmatic measures for implementation research: Development of the Psychometric and Pragmatic Evidence Rating Scale. Translational Behavioral Medicine, 11(1), 11. Teddlie, C., & Tashakkori, A. (2009). Foundations of mixed methods research. Thousand Oaks, CA: Sage. Thomas, S. J., Wallace, C., Jarvis, P., & Davis, R. E. (2016). Mixed-methods study to develop a patient complexity assessment instrument for district nurses. Nurse Researcher, 23(4). Trochim, W. M. (1985). Pattern matching, validity, and conceptualization in program evaluation. Evaluation Review, 9(5), 575–604. Trochim, W. M. (1989a). An introduction to concept mapping for planning and evaluation. Evaluation and Program Planning, 12(1), 1–16. Trochim, W. M. (1989b). Concept mapping: Soft science or hard art?. Evaluation and program planning, 12(1), 87–110. Trochim, W. M. (2017). Hindsight is 20/20: Reflections on the evolution of concept mapping. Evaluation and program planning, 60, 176–85.


Trochim, W. M., & Linton, R. (1986). Conceptualization for planning and evaluation. Evaluation and Program Planning, 9(4), 289–308. Trochim, W. M. K. (1993). The reliability of concept mapping. Paper presented at the Annual Conference of the American Evaluation Association. Trochim, W. M., & Kane, M. (2005). Concept mapping: an introduction to structured conceptualization in health care. International Journal for Quality in Health Care, 17(3), 187–91. Trochim, W. M., & McLinden, D. (2017). Introduction to a special issue on concept mapping. Evaluation and Program Planning, 60, 166–175. Vaughn, L. M., Jones, J. R., Booth, E., & Burke, J. G. (2017). Concept mapping methodology and community engaged research: A perfect pairing. Evaluation and Program Planning, 60, 229–37. Vaughn, L. M., & McLinden, D. (2016). Concept mapping. Handbook of methodological approaches to community-based research: Qualitative, quantitative, and mixed methods, pp. 305–14. New York: Oxford University Press. Wilberforce, M., Batten, E., Challis, D., Davies, L., Kelly, M. P., & Roberts, C. (2018). The patient experience in community mental health services for older people: A concept mapping approach to support the development of a new quality measure. BMC Health Services Research, 18(1), 1–11. Windsor, L. C. (2013). Using concept mapping in community-based participatory research: A mixed methods approach. Journal of Mixed Methods Research, 7(3), 274–93. Woolley, C. M. (2009). Meeting the mixed methods challenge of integration in a sociological study of structure and agency. Journal of Mixed Methods Research, 3(1), 7–25. Yoo, Y., & Kanawattanachai, P. (2001). Developments of transactive memory systems and collective mind in virtual teams. The International Journal of Organizational Analysis, 9(2), 187–208.



Research Design Considerations


4: A Checklist of Design Considerations for Survey Projects

Jeffrey Stanton

In the social sciences, the term “survey” encompasses a wide variety of self-report approaches to data collection, often as instruments completed by respondents without supervision. Although surveys of various kinds have probably existed since the earliest days of written communication, the development of self-report methods for application to social research questions dates back to the period between the 1890s and the 1930s (Kruskal & Mosteller, 1980; Barton, 2001). Not coincidentally, this was also a period of rapid progress in statistics, in mass communications, and in the emergence of new societal research applications such as opinion polls and labor surveys. In the present day, surveys often include queries with simple structured response formats, such as the ubiquitous Likert scale, although the latest survey methods using internet technology also offer an array of interactive features that expand the capabilities of surveys beyond the limits of paper, pencils, and postal mail (Stanton, 1998; Stanton & Rogelberg, 2001, 2002; Rogelberg et al., 2002). Despite pitfalls, such as the problem of declining response rates due to over-surveying, the great flexibility and relatively low costs of survey research have ensured the continuing popularity of the method. Thus, surveys are often, though certainly not always, a good choice for a research project. This

chapter develops a simple map of design motivations for surveys in research and practice settings. This map supports a discussion of two overarching considerations that researchers should address prior to launching a survey project. The discussion concludes with a checklist of questions to support decision making about the features included in a survey research project. The chapter is organized into four sections. To provide perspective on the how and why of surveys, the first two sections focus on understanding how researchers have used surveys in the past. In the first section, the chapter reviews a few of the survey handbooks published over recent years for general insights into the evolution and current applications of surveys. In the second section, 28 scientific and applied research projects are closely examined for clues about the major design considerations for survey studies. In the third section, insights from the handbooks and projects are synthesized into a simple descriptive diagram that considers two of the major drivers of survey design: (1) the researchers’ intention to draw causal conclusions from their survey data and (2) researchers’ efforts to control sources of error in survey data. This diagram organizes the reviewed projects into four types, depending upon their respective approaches to causality and error. Finally, the four quadrants of the diagram guide



the development of a checklist comprising roughly 20 questions a researcher should ask when developing a survey study. Readers already familiar with the underlying design issues may wish to skip directly to this checklist, which appears as the final table in the chapter.

WHAT WE KNOW ABOUT SURVEYS, PART 1: SURVEY HANDBOOKS

The popularity of surveys has been driven by the need to obtain information from large groups of people to address social issues. In this sense, use of surveys mirrors the development of modern society (Aronovici, 1916, p. 5) insofar as accessing large populations with surveys requires high literacy and mass distribution mechanisms like mail and telephones. Opinion polls have served as a ubiquitous public interest tool for more than a century (Tankard, 1972). With the emergence of large organizations, the impracticality of having leaders speak directly with all members has necessitated the use of climate surveys, health/safety surveys, and job satisfaction surveys (Munson, 1921, p. 374; Stouffer et al., 1950). In education, medicine, public health, social work, transportation, hospitality, retailing, and dozens of other areas, the widespread use of surveys reflects a continuing need to uncover trends and relationships across myriad situations where fine-grained methods, such as direct behavioral observation, focus groups, or in-depth personal interviews, would be impractical at scale. One way to uncover the breadth of application of surveys across areas of research and practice is to examine topics covered in previous survey handbooks. Creating and administering a survey seems appealingly simple on the surface, but the variety of considerations involved in producing high-quality survey data has spawned numerous volumes replete with helpful guidance. Appendix A contains a selected title list of several survey research handbooks published since 2003 along with a designation indicating the area or field of study for which the book was written. Chapter lists for these volumes reveal a typical progression: research question formulation, questionnaire construction, sampling, survey administration, and survey data analysis.
Within this general topical flow, some handbooks contain treatments of more advanced topics such as weighting, response propensity, cross-cultural issues (including survey translation techniques), mixed mode surveys, and panel designs. The unique needs of some fields, such as public health and epidemiology, are

sufficiently specialized to necessitate customized treatments. Prefaces of these books promote their value for novices, but challenging topics such as measurement equivalence in mixed mode surveys also warrant regular use of handbooks by experts (Dillman & Edwards, 2016). Several of the listed handbooks have recently been revised to account for innovations arising from the use of internet-based surveys. Aside from the telephone survey and the later development of the computer aided telephone interview, the internet has arguably had the most profound effect on survey methods. At a high level, the internet has had three primary impacts: instrument interactivity, respondent availability, and cost reduction. First, modern tools have simplified the construction of web-based surveys and expanded the range of interactions that respondents may have. Specialized response interfaces (e.g., graphical controls, click and drag), flow control, piping, and randomization tools have expanded the capabilities of surveys into application areas previously only available via custom programming or in a lab setting. At the same time, internet surveys also support the provision of images, video, and sound recordings as elements of stimulus material. Second, as worldwide connectivity has increased, about half of the world’s population can now be contacted through internet-enabled mechanisms (Graydon & Parks, 2020). Firms have created on-demand panels where, for a fee, researchers can reach hundreds or thousands of respondents. Given a large enough budget, one can craft a sample with preferred demographic characteristics, varying countries-of-origin, different native languages, a range of professions, and so forth. Researchers using these panels have remarked that a study launched at noon can reach its sampling targets before there is time to finish a sandwich. Finally, costs for survey research have declined with the growth of the internet. 
While a carefully pilot-tested instrument may take a professional team weeks or months to develop, and a true probability sample is still expensive to obtain, mailing and data entry costs have declined steeply. Likewise, for telephone-administered surveys with computer-assisted interviewing, the marginal cost of a phone call to almost any part of the world has dropped to zero. None of this is to say that a professional, large-scale survey project is cheap, but rather that the locus of costs has shifted over recent decades and the barriers to entry for small-scale projects are lower than before. One unfortunate side effect of the low barrier to entry has emerged in the form of “oversurveying” and the related idea of survey fatigue (Rogelberg et al., 2002; Porter et al., 2004; Sinickas, 2007; Pecoraro, 2012). Survey fatigue, particularly in organizational contexts, has driven down response rates over recent decades (Baruch & Holtom, 2008). A low response rate can contribute to survey error, as the discussion below describes, but also risks a negative reaction to the organization that sponsors the survey (Adams & Umbach, 2012).

To conclude this section, note that handbooks also indicate that the telephone and postal mail still serve as viable distribution and response methods. Certain respondent groups cannot reliably be reached over the internet, and extraordinary efforts may be needed to sample them (Marpsat & Razafindratsima, 2010). Even when a population of interest has internet access, a browser-based survey is not necessarily the appropriate tool for every project, despite its seeming simplicity and low cost (Stapleton, 2013; Torous et al., 2017). The next section reviews survey projects with the goal of developing an organizing structure to show how applied and scientific researchers use surveys to fulfill project design goals.

WHAT WE KNOW ABOUT SURVEYS, PART 2: SURVEY PROJECTS

Handbooks tell us how to construct surveys but provide only limited examples of survey projects in action. This section fills that gap by reviewing a small selection of recently published survey studies and analyzing the various research designs and the contexts in which they were conducted. Given that the peer-reviewed literature trends towards scientific studies, this review also includes surveys from areas of professional practice, such as those conducted in government, non-profits, and political advocacy organizations. Method descriptions from 28 research projects – half peer reviewed and half not – depict a wide range of survey purposes, designs, and administration characteristics. For peer-reviewed publications with surveys, this review contains articles from journals in psychology, sociology, political science, administrative science, public health, and epidemiology that appeared with top ranks in the Scimago Journal Rankings, a widely used resource for journal impact factor, journal H-index, and other scientometric information. Most of the journals consulted had recent impact factors in excess of five. Articles from those journals using the survey method and published in 2020 or 2021 appear in this review. Readers seeking exemplary uses of the survey method for scientific research can examine any of these articles and the respective journals for design ideas that have satisfied some of the most demanding peer reviewers.

For practice-oriented/applied survey projects, a general web search using the keywords “2021,” “national,” and “survey” yielded a list of candidate survey projects for inclusion. These varied widely in scope, focus, and sponsoring organization. To be included in the list of reviewed projects, a publicly available report needed to provide sufficient research design information to enable an analysis of the characteristics of the study, such as the modes of survey administration and sampling methods. Government-produced reports tended to be the most detailed and formal, while other organizations often published slide decks containing visualizations that summarized their results. The most detailed among these had sections explaining sampling and administration methods. For both peer-reviewed and non-peer-reviewed projects, the relevant citations appear in Appendix B with an index number that also appears in Table 4.1 and Figure 4.1.

Table 4.1 summarizes key characteristics of these studies. For each characteristic, a brief statement describes the breakdown of that characteristic across the 28 reviewed projects. Table 4.1 shows a divide between peer-reviewed scientific research and applied projects on several interrelated issues. On the one hand, a majority of applied survey projects used probability sampling and weighting methods. Researchers configured these methods to ensure that the analysis revealed results as closely representative of the populations of interest as possible. On the other hand, scientific researchers invested effort both in measurement and in augmenting the capability of making causal inferences from their data. Two recent trends in survey methodology research provide a lens for understanding these inclinations.

The first trend harks back to W. Edwards Deming’s (1900–1993) ideas about total quality management (TQM; Anderson et al., 1994).
Deming pioneered the application of sampling and statistical methods to the improvement of industrial processes (Hackman & Wageman, 1995). One of the offshoots of the TQM movement – focusing primarily on the design and administration of survey projects – evolved into books and papers under the rubric of “total survey error” (e.g., Weisberg, 2009; Groves & Lyberg, 2010). Although efforts have been made to develop a computation that represents the total survey error of a project as a single value, the idea is perhaps better understood not as a number but as an orientation toward minimizing five interlinked sources of error in survey data collection: sampling, measurement, nonresponse, coverage, and specification (McNabb, 2013, p. 47).



Table 4.1  Published survey project characteristics

Project Purpose. All peer-reviewed studies had a common purpose: to reveal insights by means of inferential analysis of relationships among variables (3, 4, 5, 10, 13, 14, 15, 17, 19, 20, 21, 24, 25, 26). All applied projects had description as a core purpose, with secondary purposes such as advocacy or policy-making appropriate to the mission of the sponsoring organization (1, 2, 6, 7, 8, 9, 11, 12, 16, 18, 22, 23, 27, 28).

Mode of Administration. While web surveys were used as an essential data collection method in half the projects (1, 2, 3, 7, 8, 9, 11, 12, 13, 16, 17, 19, 21, 22, 23, 25, 26, 27, 28), the other half used telephone (6, 18, 22, 23), paper and pencil (4, 5, 23, 27, 28), or in-person interviews (10, 13, 14, 20) for collecting survey responses. Projects with multiple modes of administration tended to be government-sponsored (22, 23, 28).

Measurement. Peer-reviewed studies almost universally used previously validated multi-item response scales to assess standing on constructs of interest (3, 4, 5, 10, 13, 17, 19, 20, 21, 24, 25, 26). Applied projects universally used single-item measures or questions (1, 2, 6, 7, 8, 9, 11, 12, 16, 18, 22, 23, 27, 28). Self-reported demographic variables in both types of studies were also captured with single-item measures.

Sampling. Thirteen projects used only non-probability convenience samples (1, 4, 5, 8, 9, 13, 17, 19, 21, 24, 25, 26, 27); among these, eleven were peer-reviewed studies, and only two were applied projects (8, 9). Four peer-reviewed studies took advantage of probability sampling by harnessing their data collection to society-scale, sponsored research projects (3, 10, 14, 15). The majority of projects that used probability sampling also used weighting to match census distributions on age, race, ethnicity, religious affiliation, and/or educational attainment (2, 3, 7, 10, 12, 14, 15, 22, 23).

Exogenous Data. Many of the peer-reviewed projects indicated awareness of endogeneity and addressed the concern either by linking to secondary data sources or by using random assignment (3, 4, 5, 10, 13, 17, 21, 24, 26). Eleven of the applied studies used entirely endogenous data – i.e., all reported data came from a single, cross-sectional survey administration (1, 2, 6, 7, 9, 11, 18, 22, 23, 27, 28). Six peer-reviewed projects included at least one data collection with an experimental manipulation that used random assignment to conditions (3, 13, 17, 21, 24, 26). Two other peer-reviewed studies included multiple time periods of data collection (19, 25). None of the applied projects used experimental methods (1, 2, 6, 7, 8, 9, 11, 12, 16, 18, 22, 23, 27, 28).


Most researchers are familiar with sampling error: it is the source of error that we try to control with inferential tests and confidence intervals. Likewise, researchers try to control and understand measurement error by using previously validated scales and by subjecting items to internal consistency, test-retest, or other forms of reliability assessment. Nonresponse is a perennial problem that survey researchers address with approaches such as prenotification and incentives (Rogelberg & Stanton, 2007). Non-response error occurs when a systematic link exists between status on one or more focal variables and the likelihood of nonresponse or attrition. Coverage error becomes a problem in probability samples when the sampling frame is misspecified – this leads to the inclusion of respondents who do not represent the larger population of interest and/or the exclusion of those who do. Finally, specification error reflects mistakes in the translation of the research question and its pertinent concepts into the measurement operations implemented in the survey – i.e., measuring the wrong constructs. Note that some sources also mention a sixth source of error – processing error – that can occur during data entry or data analysis (Biemer et al., 2017). Optical scanners and internet-based surveys have reduced processing error, but even the most highly automated survey processing pipeline can still produce data glitches, for example in the coding of open-ended responses (Conrad et al., 2016). Using the total survey error orientation, survey researchers invest resources into minimizing each source of error. Such efforts require substantial planning prior to the launch of the data collection process and also influence the design of the



Figure 4.1  28 studies organized into four quadrants Figure note: The X-axis, Control of Total Survey Error, refers to the extent to which a study tries to minimize sampling, measurement, nonresponse, coverage, and specification errors. The Y-Axis, Strength of Causal Inference, refers to the extent to which a study includes features such as experimental manipulation, instrumental variables, and data collection at multiple time points.

survey instrument itself, for example, through the inclusion of certain demographic variables for later comparison or by linking to census data. Groves and Lyberg (2010) noted that the design process in service of total survey quality naturally represents a set of trade-offs, as no project has all the time or resources needed to minimize every source of error. Additionally, the minimization of error simply represents the foundation of a quality research project: researchers must also consider other quality criteria, such as the timeliness and relevance of a project. A second important trend in survey methodology research pertains to researchers’ ability to draw valid causal inferences from their data. As the entries in Table 4.1 attest, experimental manipulations with random assignment to conditions are not uncommon: this feature would be considered the gold standard in strengthening causal inference. But starting in the 1970s, as structural equation modeling became increasingly accessible through software tools and expert guidance, researchers

accelerated their use of non-experimental data for causal inference (Byrne, 2001). Structural equation modeling and survey data have a natural affinity because surveys often contain multiple variables to be modeled with complex mediating relationships, surveys often have multiple indicators for each construct, measurement error is omnipresent in survey responses, and the survey and its corresponding hypothesis tests are often embedded within theories that include causal explanations. Criticisms of causal inference from covariance models (e.g., Freedman, 1987) quickly emerged to counter the optimism around structural equation modeling, but researchers then responded with innovations such as instrumental variables (e.g., White, 1982) designed to address these and other concerns. An instrumental variable is an extra variable introduced into a covariance analysis to compensate for measurement error and/or omitted variables (Angrist & Krueger, 2001). This debate about drawing causal inferences from non-experimental data continues to the



present day, with a recent emphasis on the problem of endogeneity (Antonakis et al., 2014). From a statistical standpoint, endogeneity occurs when an explanatory variable in a statistical model is correlated with the error term of the model. This technical definition hides a more intuitive set of concerns, however. Most notably these concerns include the problems of omitted variables and third causes, both of which are endemic in survey studies. No matter how long and comprehensive a survey is, it is unlikely to take into account all of the possible causes of an outcome variable. Thus, a cross-sectional (one-shot) survey that measures a few predictors and one or more outcomes has omitted one or more important explanatory variables with near certainty. In the same vein, each predictor-outcome variable pair in a survey may in reality reflect the influence of a third variable that causes them both. As Antonakis et al. (2010) highlighted, making causal claims without knowing the status of this third variable is an exercise in futility. Literature reviews suggest that published work from many fields routinely contains mistakes related to omitted variables and third causes (e.g., Stone & Rose, 2011; Reeb et al., 2012; Abdallah et al., 2015; Papies et al., 2017).
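The omitted-variable problem and the instrumental-variable remedy can be made concrete with a brief simulation. The sketch below is illustrative Python with arbitrary coefficients, not figures drawn from any of the reviewed studies: it builds an unobserved third cause into synthetic data, shows the resulting bias in a naive regression, and then recovers the causal effect with an instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# An unobserved "third cause" z drives both the predictor x and the outcome y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true causal effect of x on y is 1.0

# A naive cross-sectional regression of y on x omits z, so x is correlated
# with the error term (endogeneity) and the slope estimate is biased upward.
naive_slope = np.cov(x, y)[0, 1] / np.var(x)   # ≈ 2.0 rather than 1.0

# Controlling for the third cause recovers the true effect.
X = np.column_stack([np.ones(n), x, z])
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # beta[1] ≈ 1.0

# An instrumental variable w shifts x but is unrelated to z and to y's error,
# so the Wald/IV estimator recovers the causal effect without observing z.
w = rng.normal(size=n)
x2 = 0.8 * z + 0.7 * w + rng.normal(size=n)
y2 = 1.0 * x2 + 2.0 * z + rng.normal(size=n)
iv_slope = np.cov(w, y2)[0, 1] / np.cov(w, x2)[0, 1]   # ≈ 1.0
```

The naive slope is inflated because the omitted z contributes to the covariance between x and y; the instrument is useful precisely because it covaries with the predictor but not with the model’s error term.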

A DESCRIPTION OF SURVEY PROJECT GOALS IN FOUR QUADRANTS

Considered in context, these two perspectives – the total survey quality perspective and researcher interest in drawing causal inferences – can frame two distinctive goals that drive the designs of survey research projects. On the one hand, a goal of many research projects, particularly in the scientific research community, lies in gathering evidence about research questions. Many research questions arise via the formulation and testing of theoretical frameworks, and many theories share a goal of explaining dynamics that ultimately distill to predictions of causal relationships among variables. Pursuit of this goal explains the inclusion of experimental manipulations, the linking of survey-based results to external data sources, and/or the use of time series or panel designs among the studies that appear in Table 4.1. Even in cross-sectional survey projects that lack experimental features or exogenous variables, researchers may nonetheless try to engage in weaker forms of causal inference by creating multivariate predictive models from the variables on the survey. Across the board, with either strong or weak assertions about causality, a core goal of these studies is explanation.

Notably, many of the peer reviewed studies in Appendix B used non-probability sampling techniques to obtain their data. In this approach, the generalizability (external validity) of the results must necessarily arise from subsequent replication or appeals to similar results in the literature. In contrast, the applied projects listed in Appendix B typically crafted sampling frames and used probability sampling and weighting to ensure that the estimates obtained from the measured variables matched the underlying populations of interest. In these projects, the total survey error perspective governed the major design considerations of the projects because their descriptive goals depended on claims of precision and absence of bias. For these projects, minimizing estimation error reflected a core goal of the projects to document a particular phenomenon in the population with the greatest possible fidelity. Figure 4.1 summarizes these ideas: The X-axis represents the extent to which a project invested in error controls, with considerations focused on the use of sample frame development, probability sampling, response promotion, and pre-analysis weighting. The Y-axis represents the extent of focus on the capability of making causal inferences. Considerations here pertain to the use of experimental methods, inclusion of variables from external sources, and/or making survey measurements of a phenomenon at two or more points in time. Each of the 28 studies from Appendix B appears in one of the quadrants, reflecting a high or low focus on reduction of total survey error and a high or low focus on enhancing causal explanation. Quadrant A includes most of the peer reviewed scientific studies. This signifies an emerging scientific consensus that cross-sectional studies without instrumental variables do not produce useful evidence (Antonakis et al., 2010, 2014). 
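The weighting step used by these applied projects is straightforward to illustrate. The sketch below uses hypothetical shares for a single demographic variable (real projects typically weight on several variables jointly, often by raking) and computes each post-stratification weight as the ratio of population share to sample share:

```python
# Hypothetical census (population) shares and realized sample shares for one
# demographic variable; all figures are illustrative, not from the reviewed projects.
census = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
sample = {"18-34": 0.45, "35-54": 0.35, "55+": 0.20}  # young respondents over-represented

# Post-stratification weight for each group: population share / sample share.
weights = {g: census[g] / sample[g] for g in census}

# Illustrative group means (e.g., proportion agreeing with a survey item).
responses = {"18-34": 0.62, "35-54": 0.50, "55+": 0.41}

unweighted = sum(sample[g] * responses[g] for g in sample)             # ≈ 0.54
weighted = sum(sample[g] * weights[g] * responses[g] for g in sample)  # ≈ 0.50

print(unweighted, weighted)
```

Weighting pulls the estimate toward the population mix; the price is added variance when some groups carry large weights, which is why practitioners often trim extreme weights.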
Editorial pressure at the highly ranked journals represented in Quadrant A likely promotes a high level of methodological rigor with respect to causal explanations. Thus, as a suggested label, we might consider that Quadrant A studies focus on the goal of causal explanation. Down the main diagonal, Quadrant D includes most of the applied survey studies. These lacked experimental controls and made no linkages to external variables, but by minimizing coverage error, sampling error, non-response bias, and measurement error, these surveys aspired to provide precise depictions of the phenomena they measured, with attentiveness to metrics of precision such as the margin of error. Thus, we might categorize these studies as seeking precise description of a population or populations of interest as their primary functional goal.
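The margin of error invoked by such precise-description projects follows a textbook formula for a proportion under simple random sampling. The function below is a minimal sketch; the 95% z-value and the inputs are illustrative, and design effects from weighting or clustering would widen the interval in practice.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the confidence interval for a proportion: z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1.0 - p) / n)

# The conservative p = 0.5 case yields the familiar "plus or minus 3 points"
# for a sample of roughly 1,000 respondents.
print(round(margin_of_error(0.5, 1000), 3))   # 0.031
```

Because the half-width shrinks with the square root of n, halving the margin of error requires roughly quadrupling the sample size, which is one reason precision is costly.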


This leaves Quadrant B and Quadrant C. Quadrant B lists those studies that incorporated both features designed to enhance causal explanations and features designed to control error. These studies combined proven strategies for confirming causal relations with established methods for ensuring generality to a defined population, and so might be termed framework generalization surveys. Finally, Quadrant C includes those studies that used convenience samples without experimental or other causal-enhancing features. As these surveys may have functional goals such as outreach or advocacy, we might term them communicative studies. The pattern in Figure 4.1 raises a question about the small number of studies appearing in Quadrants B and C. In the case of Quadrant B, where studies included probability sampling, weighting, quality measurement techniques, as well as one or more efforts to mitigate the weaknesses of a purely cross-sectional design, the explanation for the paucity of examples might lie in the expense of harnessing a survey in service of both causal explanation and precise description. For example, Bush and Zetterberg’s (2021; Appendix B, #3) study included experimental features, samples from two countries, quality measurement techniques, and other features to support both their theoretical arguments and the generalizability of their results to the two populations. Achieving all these design goals in a single project may exceed the resources of many researchers. Nonetheless, Quadrant B is arguably a valuable ideal to which all survey researchers can aspire: a project that combines the precise descriptive capabilities of a probability sample with the explanatory power of a design that supports causal inference. In the case of Quadrant C, these projects employed convenience samples and lacked experimental controls. For example, the European Monitoring Centre for Drugs and Drug Addiction (2020; Appendix B, #8) conducted a multi-country study of illegal drug use.
Their population of interest was particularly difficult to reach: options for developing representative sampling frames in various countries with widely differing governmental and public health infrastructures were extremely limited. In situations like this, the value of obtaining any data, even with unknown relations to defined populations, outweighed the value of precision in generalizing from sample to population. Quadrant C studies also lacked experimental manipulations, linking to exogenous variables, data collection at multiple time points, and other design features that would enhance causal explanation. Yet the data from these studies may still have offered value. For instance, in the case of Tiller (2020), the ethics review of the study of


young people in mental health crisis led to a strategy whereby school principals received notifications about students at severe risk of self-harm. It is likely that the data collection paired with this notification process saved lives. Thus, while Quadrant C studies seem suboptimal from the standpoint of methodological rigor, the sponsors of these studies considered them sufficiently important to invest in, despite any concerns about data quality. The use of surveys in situations like these could be framed as the best choice when no good choices exist. Collecting data that can be used neither for causal modeling nor for generalization to a defined population may still be useful and/or worthwhile if the practical benefits of conducting the study outweigh the cost of collecting data that lacks explanatory power and/or precision. At least two important caveats accompany Figure 4.1. First, all the Quadrant A studies used previously validated scales from the literature. Thus, these studies made an effort to control measurement error – one of the goals of the total survey error approach. Similarly, several of these peer-reviewed studies mentioned incentives and other efforts to promote responding, thereby trying to minimize non-response bias. The position of these studies on the diagram is not meant to suggest that these researchers ignored survey error. Instead, their position reflects these studies’ primary goal of explaining, confirming, or disconfirming a theoretical perspective and a concomitant emphasis on the strength of causal inference. Second, the two goals inscribed by the X- and Y-axes obviously do not provide a comprehensive view of how and why various projects came into being. Particularly in the applied realm, researchers may devise surveys that address political, legislative, or propaganda purposes that Figure 4.1 makes no attempt to represent.
Nonetheless, in the next section of the chapter, the quadrants depicted in Figure 4.1 support the development of a checklist of questions to address the main goal of the chapter.

THE SURVEY PROJECT QUESTION CHECKLIST

Using the organizing structure described above as a jumping-off point, the chapter now concludes with a checklist – a tool to guide design choices about survey projects. Several published checklists related to the survey method already exist. For example, Gehlbach and Artino (2018) offered a 17-point list of best practices for item writing,



response options, and survey formatting. Sharma et al. (2021) reported on a “consensus-based checklist for reporting of survey studies,” a list of 40 information components for inclusion in reports of biomedical surveys. Molléri et al. (2020) derived 38 decision points involved in the development and deployment of surveys related to software engineering. Protogerou and Hagger (2020) worked with a team of experts to develop a 20-point checklist for evaluating the quality of a completed survey study. Each of these checklists provides added value for guiding the development and reporting of a survey study, but none of them asks questions aimed at guiding the design of a survey project from its earliest stages. Before considering the detailed question list, researchers should begin by asking the general question of whether a survey is appropriate as the main or only mode of data collection for a project. Advocates of the mixed methods approach (e.g., Creswell & Creswell, 2017) recommend that researchers conduct sequences of studies that include both qualitative methods, such as semi-structured interviews, and quantitative methods, such as surveys. Avoid undertaking a survey project until the goals and questions of the research clearly indicate that a survey will efficiently provide the most useful and usable data.

After passing that basic litmus test, review the list of questions presented in Table 4.2. The first column in Table 4.2 divides the questions according to the main purpose or goal of the research, corresponding to the four quadrants in Figure 4.1. For each quadrant of design goals from Figure 4.1, the checklist asks a series of questions about important project features. Each question in Table 4.2 is phrased in search of an affirmative answer. These questions are not litmus tests, however, in the sense that a negative answer must thereby scuttle a project. Rather, given a particular study goal, if the answer to any of the questions in the block is “no,” that should trigger a reflective process that considers the possibility that the feature in question may be worth adding to the project or that another research method may be more appropriate. Regardless of whether that particular answer changes to “yes,” the reflection may still lead to the conclusion that surveying is the best choice. Assuming that a research team commits to conducting a survey, the questions in Table 4.2 can also serve as guideposts for sharpening the justifications for the project. The twin goals of causal explanation and generalization to defined populations represent the ideals for what a well-designed survey project can achieve. When project

Table 4.2  The survey method checklist

Causal Explanation (Quadrant A): Multiple measured constructs to be examined using inferential statistics
a. Will the survey have different versions/conditions, and will random assignment be used to assign respondents to those versions/conditions?
b. Will responses on the survey be joined at the case level with external data sources to provide exogenous variables?
c. Will respondents provide responses at more than one point in time?
d. Do you have previously validated, multi-item measures for all key constructs?
e. Do you have methods of testing the data for the presence of nonresponse bias?
f. Will your survey distribution method provide access to a more suitable respondent pool than other available research methods?

Framework Generalization (Quadrant B): A causal model or other framework to be confirmed in a well-defined population or compared across populations
g. In addition to the previous questions, do you have a systematic sampling frame appropriate to the research problem?
h. Do you have a model framework that specifies proposed connections among variables?
i. Can all explanatory variables for the model be collected on the survey or become available from external data sources?
j. Do you have sufficient resources to test all proposed model relations with sufficient statistical power?
k. Given the proposed sample size and measurement strategy, will the resulting confidence intervals be small enough to support confirmation or disconfirmation of model propositions?

Communicative (Quadrant C): Outreach, advocacy, political, or other goals that do not require quality data
l. Will the design and subject matter of the survey reflect well on the sponsoring organization?
m. Are the proposed respondents capable of providing the range of answers needed to fulfill the communicative function of the survey?
n. Is there a plan to communicate the survey results to the audience(s) of interest?
o. Will subsequent actions based on survey results benefit the audience(s) of interest?

Precise Description (Quadrant D): Probability sampling in service of providing unbiased, accurate estimates of responses
p. Are members of the population of interest reachable using an available distribution mechanism?
q. Does evidence from previous projects suggest that the chosen survey distribution method and incentive strategy will produce response rates consistent with your study goals?
r. Are there resources, such as census data, that can support appropriate weighting of the data during analysis?
s. Will a second mode of administration (e.g., internet plus mail) enhance your reach to the population(s) of interest?
t. Are there sufficient resources to get the margin of error to within an acceptable range?
u. Can you adequately justify why the cost of administering the survey (including the time cost for respondents) outweighs the data quality concerns (due to lack of causal and error controls)?


resources permit, the inclusion of even one or two additional features – such as an instrumental variable or a diagnostic for detecting non-response bias – has the potential to improve the quality of data produced by the survey and the credibility of the results. Careful consideration of the questions will make it possible, once the data collection, analysis, and reporting are completed, for researchers to answer the questions of why the survey method was used, why the particular design choices were made, and how those design choices fit the overall goals of the project.

CONCLUSION

Surveys offer many advantages because of their potential to reach a large number and a wide variety of respondents at relatively low cost. Surveys can also serve a communicative function – the sponsors, topics, and content of surveys may influence respondents’ attitudes and behaviors separately from how the data analysis results are used. At the same time, surveys have key limitations: mistakes during the development phase are rarely correctable once administration begins; brevity requirements limit the number of constructs that can be measured; many respondents complete surveys in uncontrolled environments; survey fatigue, diminishing response rates, and the potential for non-response bias create major pitfalls; multiple time point designs often suffer from attrition; and one-shot, cross-sectional surveys make causal interpretations problematic. Importantly, to the extent that a survey serves an outreach or political goal, the impact of the survey project may be adverse if respondents fail to see the outcomes that they expected to follow from the survey process. Each of these strengths and weaknesses of the survey method may matter more or less to researchers, depending upon the goals and purposes of the various survey projects. The checklist of questions presented in this chapter can help researchers ensure that they get the most out of their survey project in those cases where a survey is the best choice for data collection.

REFERENCES

Abdallah, W., Goergen, M., & O’Sullivan, N. (2015). Endogeneity: How failure to correct for it can cause wrong inferences and some remedies. British Journal of Management, 26(4), 791–804.



Adams, M. J., & Umbach, P. D. (2012). Nonresponse and online student evaluations of teaching: Understanding the influence of salience, fatigue, and academic environments. Research in Higher Education, 53(5), 576–91.
Anderson, J. C., Rungtusanatham, M., & Schroeder, R. G. (1994). A theory of quality management underlying the Deming management method. Academy of Management Review, 19(3), 472–509.
Angrist, J. D., & Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4), 69–85.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086–1120.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2014). Causality and endogeneity: Problems and solutions. In D. V. Day (Ed.), The Oxford handbook of leadership and organizations (pp. 93–117). New York: Oxford University Press.
Aronovici, C. (1916). The social survey. Philadelphia: Harper Press.
Barton, A. H. (2001). Paul Lazarsfeld as institutional inventor. International Journal of Public Opinion Research, 13(3), 245–69.
Baruch, Y., & Holtom, B. C. (2008). Survey response rate levels and trends in organizational research. Human Relations, 61(8), 1139–60.
Bethlehem, J., & Biffignandi, S. (2011). Handbook of web surveys (Vol. 567). Hoboken: John Wiley & Sons.
Biemer, P. P., de Leeuw, E. D., Eckman, S., Edwards, B., Kreuter, F., Lyberg, L. E., Tucker, C., & West, B. T. (Eds.). (2017). Total survey error in practice. Hoboken: John Wiley & Sons.
Brace, I. (2004). Questionnaire design: How to plan, structure and write survey material for effective market research. London: Kogan Page Publishers.
Byrne, B. M. (2001). Structural equation modeling: Perspectives on the present and the future. International Journal of Testing, 1(3–4), 327–34.
Conrad, F. G., Couper, M. P., & Sakshaug, J. W. (2016). Classifying open-ended reports: Factors affecting the reliability of occupation codes. Journal of Official Statistics, 32(1), 75.
Creswell, J. D., & Creswell, J. W. (2017). Research design: Qualitative, quantitative, and mixed methods approaches. Thousand Oaks: Sage.
de Leeuw, E. D., Hox, J. J., & Dillman, D. A. (2008). International handbook of survey methodology. New York: Taylor & Francis.
De Vaus, D. (2013). Surveys in social research. London: Routledge.
Dillman, D. A., & Edwards, M. L. (2016). Designing a mixed mode survey. In C. Wolf, D. Joye, T. E. Smith, T. W. Smith, & Y. C. Fu (Eds.), The Sage handbook of survey methodology (pp. 255–68). Thousand Oaks: Sage.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. Hoboken: John Wiley & Sons.
Fink, A. (2003). The survey handbook. Thousand Oaks: Sage.
Freedman, D. A. (1987). As others see us: A case study in path analysis. Journal of Educational Statistics, 12(2), 101–28.
Gehlbach, H., & Artino, A. R. (2018). The survey checklist (manifesto). Academic Medicine, 93(3), 360–6.
Gideon, L. (Ed.). (2012). Handbook of survey methodology for the social sciences. New York: Springer.
Graydon, M., & Parks, L. (2020). ‘Connecting the unconnected’: A critical assessment of US satellite internet services. Media, Culture & Society, 42(2), 260–76.
Groves, R. M., & Lyberg, L. (2010). Total survey error: Past, present, and future. Public Opinion Quarterly, 74(5), 849–79.
Hackman, J. R., & Wageman, R. (1995). Total quality management: Empirical, conceptual, and practical issues. Administrative Science Quarterly, 40(2), 309–42.
Hill, N., & Alexander, J. (2016). Handbook of customer satisfaction and loyalty measurement. New York: Routledge.
Irwing, P., Booth, T., & Hughes, D. J. (Eds.). (2018). The Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale and test development. Hoboken, NJ: John Wiley & Sons.
Johnson, T. P. (Ed.). (2015). Handbook of health survey methods (Vol. 565). Hoboken, NJ: John Wiley & Sons.
Kruskal, W., & Mosteller, F. (1980). Representative sampling, IV: The history of the concept in statistics, 1895–1939. International Statistical Review/Revue Internationale de Statistique, 48(2), 169–95.
Marpsat, M., & Razafindratsima, N. (2010). Survey methods for hard-to-reach populations: Introduction to the special issue. Methodological Innovations Online, 5(2), 3–16.
McNabb, D. E. (2013). Nonsampling error in social surveys. Thousand Oaks: Sage.
Molléri, J.
S., Petersen, K., & Mendes, E. (2020). An empirically evaluated checklist for surveys in software engineering. Information and Software Technology, 119, 106240. j.infsof.2019.106240 Morris, M. (Ed.). (2004). Network epidemiology: A handbook for survey design and data collection. New York: Oxford University Press.


Munson, E. L. (1921). The management of men: A handbook on the systematic development of morale and the control of human behavior. New York: Henry Holt. Papies, D., Ebbes, P., & Van Heerde, H. J. (2017). Addressing endogeneity in marketing models. In C. Homburg et  al. (eds), Advanced methods for modeling markets (pp. 581–627). Springer, Cham. Pecoraro, J. (2012). Survey fatigue. Quality Progress, 45(10), 87. Porter, S. R., Whitcomb, M. E., & Weitzer, W. H. (2004). Multiple surveys of students and survey fatigue. New directions for institutional research, 121, 63–73. Protogerou, C., & Hagger, M. S. (2020). A checklist to assess the quality of survey studies in psychology. Methods in Psychology, 3, 100031. https:// Punch, K. F. (2003). Survey research: The basics. Thousand Oaks: Sage. Reeb, D., Sakakibara, M., & Mahmood, I. P. (2012). From the editors: Endogeneity in international business research. Journal of International Business Studies, 43, 211–18. Rogelberg, S. G., Church, A. H., Waclawski, J., & Stanton, J. M. (2002). Organizational Survey Research: Overview, the Internet/intranet and present practices of concern. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology. Oxford: Blackwell. Rogelberg, S. G, & Stanton, J. M. (2007). Understanding and dealing with organizational survey nonresponse. Organizational Research Methods, 10, 195–209. Rossi, P. H., Wright, J. D., & Anderson, A. B. (Eds.). (2013). Handbook of survey research. New York: Academic Press. Sharma, A., Duc, N. T. M., Thang, T. L. L., Nam, N. H., Ng, S. J., Abbas, K. S., … & Karamouzian, M. (2021). A consensus-based Checklist for Reporting of Survey Studies (CROSS). Journal of general internal medicine, 1–9. DOI: 10.1007/s11606021-06737-1


Sinickas, A. (2007). Finding a cure for survey fatigue. Strategic Communication Management, 11(2), 11. Stanton, J. M. (1998). An empirical assessment of data collection using the Internet. Personnel psychology, 51(3), 709–25. Stanton, J. M., & Rogelberg, S. G. (2001). Using internet/intranet web pages to collect organizational research data. Organizational Research Methods, 4(3), 200–17. Stanton, J. M., & Rogelberg, S. G. (2002). Beyond online surveys: Internet research opportunities for industrial-organizational psychology. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology. Oxford: Blackwell. Stapleton, C. E. (2013). The smartphone way to collect survey data. Survey Practice, 6(2), 1–7. Stone, S. I., & Rose, R. A. (2011). Social work research and endogeneity bias. Journal of the Society for Social Work and Research, 2(2), 54–75. Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., & Clausen, J. A. (1950). Measurement and prediction [Studies in social psychology in World War II. Vol. 4]. Princeton, NJ: Princeton University Press. Tankard, J. W. (1972). Public opinion polling by newspapers in the presidential election campaign of 1824. Journalism Quarterly, 49(2), 361–5. Torous, J., Firth, J., Mueller, N., Onnela, J. P., & Baker, J. T. (2017). Methodology and reporting of mobile heath and smartphone application studies for schizophrenia. Harvard review of psychiatry, 25(3), 146. Vannette, D. L., & Krosnick, J. A. (Eds.). (2017). The Palgrave handbook of survey research. Cham: Springer. Weisberg, H. F. (2009). The total survey error approach. Chicago: University of Chicago Press. White, H. (1982). Instrumental variables regression with independent observations. Econometrica: Journal of the Econometric Society, 50(2), 483–99. Wolf, C., Joye, D., Smith, T. E., Smith, T. W., & Fu, Y. C. (Eds.) (2016). The sage handbook of survey methodology. Thousand Oaks: Sage.



APPENDIX A: WIDELY USED SURVEY HANDBOOKS

Table 4.A1 Widely used survey handbooks

Year | Author | Field | Title
2003 | Fink | General | Survey Handbook (2nd edition)
2003 | Punch | Social Science | Survey Research: The Basics
2004 | Morris | Epidemiology | Network Epidemiology: A Handbook for Survey Design and Data Collection
2004 | Brace | – | Questionnaire design: How to plan, structure and write survey material for effective market research
2008 | de Leeuw | General | International Handbook of Survey Methodology
2011 | Bethlehem | Social Science | Handbook of Web Surveys
2012 | Gideon | General | Handbook of Survey Methodology for the Social Sciences
2013 | De Vaus | Sociology | Surveys in Social Research
2013 | Rossi | General | Handbook of Survey Research
2014 | Dillman | General | Internet, phone, mail, and mixed-mode surveys: the tailored design method
2015 | Johnson | Public Health | Handbook of Health Survey Methods
2016 | Hill | Commerce | Handbook of Customer Satisfaction and Loyalty Measurement
2016 | Wolf | Sociology | Sage Handbook of Survey Methodology
2017 | Vannette | Institutional & Governmental | Palgrave Handbook of Survey Research
2018 | Irwing | Psychology | The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale, and Test Development

APPENDIX B: PUBLISHED SURVEY PROJECTS FOR TABLE 4.1 AND FIGURE 4.1

1 Australian Chamber of Commerce and Industry (2021). National Trade Survey 2021. Downloaded on 7/9/2021 from publications/national-trade-survey-2021/
2 Beacon Research (2021, April). A National Survey of Parents for the Walton Family Foundation. Downloaded on 7/8/21 from
3 Bush, S. S., & Zetterberg, P. (2021). Gender quotas and international reputation. American Journal of Political Science, 65(2), 326–41.
4 Chen, J. (2020). A juggling act: CEO polychronicity and firm innovation. The Leadership Quarterly, 101380.

5 Connelly, B. S., McAbee, S. T., Oh, I.-S., Jung, Y., & Jung, C.-W. (2021, June 24). A multirater perspective on personality and performance: An empirical examination of the Trait–Reputation–Identity Model. Journal of Applied Psychology. Advance online publication. https://doi.org/10.1037/apl0000732
6 Daniels, K., & Abma, J. C. (2020). Current contraceptive status among women aged 15–49: United States, 2017–2019. NCHS Data Brief, no. 388. Hyattsville, MD: National Center for Health Statistics.
7 Della Volpe, J. (2021). Harvard Youth Survey. Downloaded on 7/8/21 from https://iop.harvard.edu/youth-poll/spring-2021-harvard-youth-poll
8 European Monitoring Centre for Drugs and Drug Addiction (2020, June). Impact of COVID-19 on patterns of drug use and drug-related harms in Europe. Downloaded on 7/9/2021 from www.


9 Fabrizio, Lee & Associates (2021, March). Political "Tribes" within Today's GOP. Downloaded on 7/8/2021 from uploads/2021/03/Political-Tribes-within-TodaysGOP.pdf
10 Foss, N., Lee, P. M., Murtinu, S., & Scalera, V. G. (2021). The XX factor: Female managers and innovation in a cross-country setting. The Leadership Quarterly, 101537. https://doi.org/10.1016/j.leaqua.2021.101537
11 Foster, L., Park, S., McCague, H., Fletcher, M.-A., & Sikdar, J. (2021). Black Canadian National Survey Interim Report 2021. York University. Downloaded on 7/8/2021 from https://blacknessincanada.ca/wp-content/uploads/2021/05/0_BlackCanadian-National-Survey-Interim-Report2021.2.pdf
12 Gregorian, N. (2021). Fluent Family Well-Being Study. Downloaded on 7/8/21 from Family-Wellbeing_JED-report_12-28-20.pdf
13 Hickman, L., Bosch, N., Ng, V., Saef, R., Tay, L., & Woo, S. E. (2021, June 10). Automated video interview personality assessments: Reliability, validity, and generalizability investigations. Journal of Applied Psychology. Advance online publication.
14 Hierro, M. J., & Queralt, D. (2021). The divide over independence: Explaining preferences for secession in an advanced open economy. American Journal of Political Science, 65(2), 422–42.
15 Homan, P., & Burdette, A. (2021). When religion hurts: Structural sexism and health in religious congregations. American Sociological Review, 86(2), 234–55.
16 Igielnik, R., Keeter, S., Weisel, R., & Jordan, C. (2021). Behind Biden's 2020 Victory. Pew Research Center Report. Downloaded on 7/8/2021 from uploads/sites/4/2021/06/PP_2021.06.30_validated-voters_REPORT.pdf
17 Kappes, H. B., Gladstone, J. J., & Hershfield, H. E. (2021). Beliefs about whether spending implies wealth. Journal of Consumer Research, 48(1), 1–21.
18 Kukutschka, R. M. B., Rougier, J., & Granjo, A. F. (2021). Citizen's views and experiences of corruption. Downloaded on 7/9/2021 from https:// EU_2021_web_2021-06-14-151758.pdf


19 Ma, J., Liu, C., Peng, Y., & Xu, X. (2021). How do employees appraise challenge and hindrance stressors? Uncovering the double-edged effect of conscientiousness. Journal of Occupational Health Psychology, 26(3), 243–57.
20 Mooldijk, S. S., Licher, S., Vinke, E. J., Vernooij, M. W., Ikram, M. K., & Ikram, M. A. (2021). Season of birth and the risk of dementia in the population-based Rotterdam Study. European Journal of Epidemiology, 36(5), 497–506.
21 Muir (Zapata), C. P., Sherf, E. N., & Liu, J. T. (2021, July 1). It's not only what you do, but why you do it: How managerial motives influence employees' fairness judgments. Journal of Applied Psychology. Advance online publication.
22 National Foundation for Infectious Diseases (2021). Black adult perspectives on COVID-19 and flu vaccines. Downloaded on 7/8/21 from
23 National Science Board, National Science Foundation (2020). Science and Engineering Indicators 2020: The State of U.S. Science and Engineering. NSB-2020-1. Alexandria, VA. Available at nsb20201/
24 Rodas, M. A., John, D. R., & Torelli, C. J. (2021). Building brands for the emerging bicultural market: The appeal of paradox brands. Journal of Consumer Research. DOI: jcr/ucab037
25 Scharp, Y. S., Breevaart, K., & Bakker, A. B. (2021). Using playful work design to deal with hindrance job demands: A quantitative diary study. Journal of Occupational Health Psychology, 26(3), 175–88.
26 Tilcsik, A. (2021). Statistical discrimination and the rationalization of stereotypes. American Sociological Review, 86(1), 93–122.
27 Tiller, E., Fildes, J., Hall, S., Hicking, V., Greenland, N., Liyanarachchi, D., & Di Nicola, K. (2020). Youth Survey Report 2020. Sydney, NSW: Mission Australia.
28 U.S. Census Bureau (2020, September). 2019 National Survey of Children's Health: Methodology Report. Downloaded on 7/8/2021 from technical-documentation/methodology/2019NSCH-Methodology-Report.pdf

5 Principlism in Practice: Ethics in Survey Research Minna Paunova

"Ethical codes, or any ethical claim for that matter, should not in our view be taken without critical thought. They should not be taken implicitly to say, 'You should not think….'" (O'Donohue & Ferguson, 2003, p. 1)

"RULE III: The researcher, like the voter, often must choose the lesser among evils." (McGrath, 1981, p. 186)

INTRODUCTION

Ethics in survey research draws heavily from biomedical ethics, which developed rapidly during the mid-20th century following two pivotal historical cases: the Nazi and Tuskegee experiments. The Nuremberg Code and the Belmont Report, developed in the aftermath of these events, inspired the Declaration of Helsinki and the Georgetown Principles, respectively (i.e., respect for autonomy, beneficence/non-maleficence, and justice; Beauchamp & Childress, 1979), which lie at the core of contemporary human and social research ethics. Most features of the ethics codes of survey researchers are common to the ethics codes of other research disciplines and other professions. For example, most codes contain prescriptions concerning relationships between professionals and clients, between professionals and the public, and among professionals. Even sections dealing with relationships between researchers and subjects (e.g., respondents during survey research) are common to the ethics codes of other professionals engaged in human-subjects research, including physicians, sociologists, psychologists, and anthropologists (Singer, 2012). Survey research commonly entails modest ethical risks in comparison to many other types of human and social research (e.g., experimental research), but surveys are rarely exempt from review by institutional review boards (IRBs).1 The literature increasingly addresses the (ethical) implications of IRB practices for survey research (Howell et al., 2003; Grayson & Myles, 2004; Hickey, 2021). Online surveys are, by far, the most common online or Web-based method of social and behavioral research, and their research protocols are commonly reviewed (Buchanan & Hvizdak, 2009). The increasingly digital nature of (survey) data collection retains traditional challenges, such as consent, risk, privacy, anonymity, confidentiality, and autonomy, but adds new methodological complexities regarding data storage, security, sampling, and survey design (Buchanan & Hvizdak, 2009). Regardless of institutional requirements, survey researchers espouse principles that generalize across most social research methods. Ethical guidelines for survey and social science research are inspired heavily by principlism, which has developed into a moral theory in its own right. However, the ethical recommendations with which most survey researchers are socialized lack strong theoretical and empirical grounding; they amount to a mere list of prescriptions. Researchers are expected to internalize lists of rules with little to no understanding of why such rules are meaningful or important. Without more substantive treatment of such rules, researchers cannot be expected to solve moral dilemmas when principles conflict. Lavin (2003) argues that training in the social sciences is a superb foundation for moral and ethical thinking because the two share many facets. Instead of being given yet another checklist, survey researchers and other social scientists might benefit from training and practice in moral reasoning as a form of quasi-scientific reasoning, and from a deeper understanding of normative, non-normative, and meta-ethics. This chapter is an overview of the moral foundations that guide recommendations in survey research ethics. The next section begins with a historical summary of foundational documents in research ethics involving human subjects and, by extension, ethics in survey research.

A BRIEF HISTORY OF ETHICS IN SOCIAL RESEARCH

Most sources cite the Nuremberg Code2 as one of the first documents to guide research using human subjects (Bloom, 2008; Oldendick, 2012; Singer, 2012; Ali & Kelly, 2016). Developed in the aftermath of the post-World War II Nuremberg trials of war criminals, the Code contains principles for conducting experiments using human subjects, the foremost of which concerns voluntary consent (Shuster, 1997; Ali & Kelly, 2016). Although judges at Nuremberg recognized the importance of Hippocratic ethics and the maxim first, do no harm, they found that more is necessary to protect human research subjects. Accordingly, the judges articulated a sophisticated set of ten research principles centered not on the physician but on the research subject (Shuster, 1997). Informed consent, the core of the Nuremberg Code, has represented a fundamental human right ever since. The contribution of Nuremberg was to merge medical ethics and the protection of human rights into a single code (Shuster, 1997). Subsequently, the basic requirement of informed consent has been accepted universally and articulated in international law in Article 7 of the United Nations International Covenant on Civil and Political Rights, adopted in 1966 (Shuster, 1997). Today, informed consent represents a process by which potential participants acquire understanding of the procedures, risks, and benefits associated with participation in a study (Singer, 2003; Losch, 2008). Following the Nuremberg Code in 1949, the Declaration of Helsinki3 on Ethical Principles for Medical Research Involving Human Subjects was developed. Published in 1964 and amended multiple times since, it includes extensive treatment of consent, emphasizing physicians' duty to protect the health, wellbeing, and rights of human subjects (WMA, 2018). The document addresses physicians and medical researchers, but it has become one of the best-known foundational sources of research ethics among contemporary social scientists. Not fully aligned with the Nuremberg Code, since it suggests that peer review supplement and even supplant informed consent4 as a core principle, the Declaration nonetheless acknowledges Nuremberg's authority (Shuster, 1997). The Nuremberg Code and Declaration of Helsinki combined served as models for U.S. federal research regulations, which require not only informed consent (with proxy consent sometimes acceptable, as for young children) but also prior peer review of research protocols by an ethics committee (i.e., an IRB or HREC) (Shuster, 1997; Bloom, 2008). The Nuremberg Code and Declaration of Helsinki are largely European in spirit, developed to safeguard against Nazi-like atrocities, but similar documents were developed simultaneously in the United States. The Tuskegee Study of Untreated Syphilis in the Negro Male, which was conducted between 1932 and 1972, spurred widespread criticism against the U.S.
Public Health Service, which, for nearly 30 years, denied penicillin treatment to nearly 400 African American men infected with syphilis. When the media revealed news of this experiment, there was significant impetus for the formation of the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research in 1974. Formation of the Commission was urged further by the questionable ethics of ongoing research in the social sciences, such as Milgram's obedience to authority study and Zimbardo's prison study (Bloom, 2008; Oldendick, 2012). The Commission's Belmont Report,5 published in 1979, established guidelines for conducting research using human subjects.



The report explicitly credited the Nuremberg Code as inspiration but proceeded to (1) set boundaries between practice and research, and (2) formulate three basic ethical principles (i.e., respect for persons, beneficence, and justice), which it (3) illustrated with three applications to the context of research using human subjects: informed consent, assessment of risks and benefits, and selection of subjects. The first principle, respect for persons, incorporates at least two ethical convictions – that individuals be treated as autonomous agents, and that persons with diminished autonomy are entitled to protection. The term beneficence, which commonly covers acts of kindness or charity beyond obligation, takes on a stricter meaning – one of obligation. It comprises two general rules: do not harm, and maximize possible benefits while minimizing possible harms. The question of justice concerns who should receive the benefits of research and who bears its burdens. Application of these general ethical principles to the conduct of research leads to requirements of informed consent, risk/benefit assessment, and fair selection of subjects. For example, concerning informed consent, an essential component of ethical survey design, the report explains:6 Respect for persons requires that subjects, to the degree that they are capable, be given the opportunity to choose what shall or shall not happen to them. This opportunity is provided when adequate standards for informed consent are satisfied. While the importance of informed consent is unquestioned, controversy prevails over the nature and possibility of an informed consent. Nonetheless, there is widespread agreement that the consent process can be analyzed as containing three elements: information, comprehension and voluntariness. (Section C1)

By the mid-1970s, bioethics was maturing as a field in Europe and North America, and one of the standard theoretical frameworks from which to analyze ethical situations in medicine was synthesized by Beauchamp and Childress (1979). Published in the same year as the Belmont Report, the text identified and substantiated a shortlist of fundamental moral principles that guide the medical profession. The goal of the National Commission in the Belmont Report (NC, 1979) was to articulate principles for research ethics, whereas Beauchamp and Childress (1979) presented principles suitable for application to ethical problems in medical practice, research, and public health (Beauchamp & Rauprich, 2016). The four principles Beauchamp and Childress identify (i.e., autonomy, non-maleficence, beneficence, and justice) were similar to those in the Belmont Report. In combination, these Georgetown principles (later, "principlism") have been extremely influential in both research and medical ethics, and they are fundamental to understanding the current approach to ethics in survey and other social research (Singer & Levine, 2003; Singer, 2012). The next section provides an overview of principlism's core ideas, including short definitions of the four principles and the methodological and theoretical considerations behind them. It highlights controversies surrounding principlism, most notably whether the four principles are complete or additional principles are needed.

PRINCIPLISM: AN OVERVIEW

Principlism designates an approach to biomedical ethics that uses a framework of four universal and basic ethical principles – respect for autonomy, non-maleficence, beneficence, and justice. These principles and the framework that unifies them were first presented and defended in Beauchamp and Childress's (1979) textbook Principles of Biomedical Ethics, which has been revised multiple times since and is currently in its 8th edition. The book is organized in three parts. In part one, the authors discuss foundational concepts, such as moral norms and moral character, in everyday terms. Putting the notion of moral character and virtue ethics aside,7 part two presents the groundwork of the book, an elaboration of moral norms for biomedical practice in the form of four principles and several rules. There is a chapter for each principle. Autonomy concerns an individual's right to make their own choices, beneficence is about acting with the best interests of another in mind, non-maleficence is the principle "above all, do no harm," and justice emphasizes fairness and equality among individuals. The fifth chapter of part two clarifies rules of professional–patient relationships, such as veracity, privacy, confidentiality, and fidelity. Explicit treatment is also given to the dual role of physician/investigator. In part three, the authors dig deeper into the philosophical foundations of normative ethics, reviewing the moral theories of utilitarianism, Kantianism, rights-based ethics, communitarianism, and ethics of care, and their convergence. They also provide a more informed discussion of the theory and method behind the four principles. An important theme in the book is that the traditional preoccupation with beneficence within medical ethics can be augmented by the principle of respect for the autonomy of patients and by broader concerns of social justice (Beauchamp & Rauprich, 2016). Prior to the book's publication, beneficence and non-maleficence represented the core of medical and human research ethics. However, the expert authority of doctors meant that their choices were largely left unchecked. The paternalistic logic of doing "what is best" for the patient meant that patients, especially less-powerful, less-educated patients, were treated like children, incapable of making informed choices. An example sometimes cited to illustrate the moral issues arising from relying solely on beneficence and non-maleficence, without considering autonomy and justice, concerns the involuntary sterilizations performed on minority women throughout much of the 20th century, practices that may continue today (Stern, 2005). For example, in the case of Madrigal v. Quilligan, the Mexican American plaintiffs had signed consent forms that empowered doctors to do what they deemed medically necessary. However, the forms were in a language foreign to the plaintiffs (English rather than Spanish) and were signed immediately before or after emergency caesareans. Formal consent was meaningless because individual autonomy was not respected. The four principles tend to take a central position in discussions of this framework, but principlism is not merely a framework of four principles; it is a method for using the principles in practice (Beauchamp, 2015). "Principlism is a complex system that includes rules, decision procedures, role morality, moral ideals, and coherence as a basic method" (DeMarco, 2005, p. 103). The principles state prima facie (i.e., first-impression, non-absolute) moral obligations that are rendered practical by being specified for particular contexts. Moral problems arise when principles or their specifications conflict, and are resolved by further specification or by balancing judgments.
Principlism justifies moral reasoning by appealing to the method of reflective equilibrium and to common morality (Beauchamp & Childress, 2001). The common morality is the set of norms that "all morally serious persons share" (p. 3). Morality can also be community-specific, but the morality that principlism invokes is universal. The principles can be justified by a broad range of moral theories because they constitute common morality. Despite their core differences, deontological and teleological theories8 often "lead to similar action-guides and to similar virtues…. This idea underlies the notion of common morality. Convergence as well as consensus about principles among a group of persons is common in assessing cases and framing policies, even when deep theoretical differences divide the group" (p. 376). Morality encompasses both non-normative (e.g., empirical) claims about what is universal, or particular, and normative claims about what should be universal, or particular, in moral belief, but principlism is concerned with the normative; it approaches morality as universal and normative, but not absolute. Here, it is important to define prima facie obligations and contrast them with actual ones. A prima facie obligation "is always binding unless a competing moral obligation overrides or outweighs it in a particular circumstance. Some acts are at once prima facie wrong and prima facie right…. Agents must then … locate what Ross called 'the greatest balance' of right over wrong" (Beauchamp & Childress, 2001, p. 14). The principles and rules (i.e., norms) in principlism are not absolute, and the distinction between rules and principles is loose, since both are general norms that guide actions. Principles are general norms that leave considerable room for judgment in many cases; they thus do not function as the precise action guides, informing us how to act in each circumstance, that more detailed rules and judgments do. Rules are more specific in content and more restricted in scope than principles are. Several types of rules specify principles, and thereby provide specific guidance, including substantive, authority, and procedural rules. There is often room for judgment, and it is also possible that moral conflicts arise. When moral conflicts arise, principlism suggests turning to Rawls' (1971) core ideas. In what they call coherence theory, Beauchamp and Childress (2001, building on Rawls, 1971) suggest that reflective equilibrium guides ethical decision-making. Comparing deductive, top-down models of applying moral theory to situations (i.e., what most people think of when they think of ethical decision-making) with inductive, bottom-up models of generalizing moral theory from cases (e.g., as in pragmatism and particularism, and some forms of feminism and virtue theory), Beauchamp and Childress (2001, p.
397) explain that "neither general principles nor paradigm cases have sufficient power to generate conclusions with the needed reliability. Principles need to be made specific for cases, and case analysis needs illumination from general principles." Thus, instead of top-down or bottom-up models, principlism suggests that methods in ethics begin properly with considered judgments, the moral convictions in which we have the highest confidence and believe to have the lowest bias, or "judgments in which our moral capacities are most likely to be displayed without distortion" (p. 398). Whenever some feature in a moral theory that we hold conflicts with one or more of our considered judgments, we must modify one or the other to achieve equilibrium. Moral justification here is a reflective testing of moral principles, theoretical postulates, and other relevant moral beliefs to make them as coherent as possible.

APPLYING PRINCIPLISM TO SURVEY RESEARCH

To relate principlism to extant guidelines in survey and social research, a large number of codes of conduct of professional and scientific bodies were reviewed and summarized (see Appendix A), including those from the European Federation of Academies of Sciences and Humanities, the International Chamber of Commerce, the European Society for Opinion and Marketing Research, UK Research & Innovation, the Economic and Social Research Council, the Academy of International Business, the Academy of Management, the American Anthropological Association, the American Association for Public Opinion Research, the American Marketing Association, the American Political Science Association, the American Psychological Association, the American Sociological Association, the American Statistical Association, the Committee on Publication Ethics, the Insights Association, the National Communication Association, the Society for Human Resource Management, and the Strategic Management Society. Although not exhaustive, these codes were selected because they represent a range of organizations and professional bodies with which survey researchers identify, particularly in the United States and Europe.9 Many other English-language codes from professional bodies in, for example, Great Britain, Australia, and Canada contain similar guidelines (see Appendix A). Social and survey research guidelines published by leading professional bodies are organized around a small number of principles (i.e., aspirational ideals) and a larger set of enforceable rules. The principles of autonomy, beneficence/non-maleficence, and justice appear in the codes of conduct of most professional bodies in the social sciences, including those dealing with survey and other applied social research.
As in principlism, general principles often find expression in more specific rules, such as those that relate to informed consent (from notions of autonomy and respect for persons), confidentiality (from do no harm), and representativeness (from ideals related to fairness and justice). A large number of principles are listed across documents (e.g., integrity and responsibility), and it is unclear how they relate to principlism’s autonomy, beneficence, non-maleficence, and justice, despite principlism’s influence on human and social research ethics (Singer & Levine, 2003; Singer, 2012).

This makes it difficult to synthesize sources of morality in survey research into a loosely unified framework that is not a mere checklist of rules. Previous attempts at synthesis revealed that most codes stress the importance of a small number of ethical responsibilities toward participants, especially voluntary participation, informed consent, no harm, anonymity and confidentiality, and privacy (Valerio & Mainieri, 2008; Schirmer, 2009; de Vaus, 2014). These differ slightly from what principlism suggests; in addition to mixing rules and principles, the principle of justice, for example, finds limited space here. However, justice would be very important when weighing the effects of one’s actions across stakeholders, as discussed shortly.

To see how this small number of universally endorsed norms operates in practice, consider an issue that is critical to survey methodology – non-response. Addressing non-response through refusal conversion has been contentious and presents ethical challenges. To reduce non-response (i.e., to convert participants’ initial refusals to participate), researchers use incentives, multiple reminders, and other tactics. The question, then, is whether refusal conversion ultimately violates subjects’ autonomy by reducing the voluntariness of participation. Without taking measures to reduce non-response, does the survey researcher violate their commitment to justice and, more specifically, representativeness – that is, ensuring the survey is inclusive of all relevant groups and does not discriminate against minorities or disadvantaged groups (Schirmer, 2009)? What about other broad ethical concerns such as community, where researchers’ duty to science and the profession takes precedence? The codes that were reviewed do not help with this issue, but a more thorough commitment to principlism might.

The most crucial point for survey researchers concerns the organization of codes in relation to stakeholders.
In many extant codes, ethical responsibilities during research extend to various stakeholders, including participants, the profession and colleagues, the public, sponsors, and funders. Unlike principlism, in which the relationship with subjects (e.g., patients) takes precedence, many professional codes emphasize the relationship with the profession (e.g., colleagues) and the public to a varying, and sometimes greater, extent. For example, the American Statistical Association explicitly identifies responsibilities to the following stakeholders: science/public/funders/clients; research subjects; research team colleagues; other statisticians or statistics practitioners; allegations of misconduct; and employers, including organizations, individuals, attorneys, or other clients who employ statistical practitioners.


Many codes begin with research integrity, implying that there exists an abstract responsibility to science or a more concrete responsibility to produce good science for the sake of the profession and the public. This makes it particularly likely for general principles to conflict in some situations. Table 5.1 identifies the dilemmas most likely to arise during survey research, such as minimizing response burden and refusal conversion to address non-response. Since principlism focuses heavily on research participants/subjects, whereas the codes reviewed attend to a large number of stakeholders, Table 5.1 is organized around two types of stakeholders – participants and other stakeholders – the latter including science, the public, colleagues, and funding bodies. It is assumed that the interests of the latter partially align with, but can differ greatly from, the interests of individual research subjects.

Table 5.1 identifies several dilemmas that survey researchers commonly encounter, organized by stages of survey research, the four principles, and the stakeholders involved or affected. I do not review each cell due to space constraints, but I do provide several examples. Specifically, I consider one example that (largely) spans a single principle but affects many stakeholders, and one that spans several principles but affects a single stakeholder (mostly participants). As these examples illustrate, both are complex and involve multiple adjacent cells.

The maxim do no harm (i.e., the principle of non-maleficence) is interesting. Regarding surveys, the maxim finds expression in many issues, and those related to privacy and confidentiality are most pertinent. However, during early stages of survey design and data collection, most relevant are avoiding deception, which also relates to informed consent, and minimizing response burden.
Avoiding deception is laudable, but full transparency might produce acquiescence bias, a response bias that limits the utility of survey results (Krosnick, 1999; Holbrook, 2008). Minimizing response burden involves limiting a questionnaire to the essential issues required to answer a research question, and limiting the time and energy respondents must devote to completing the survey. Taken to the extreme, this rule implies not burdening respondents with a survey at all. It is difficult to reason about the limits of such specific rules if a researcher simply takes do no harm at face value. It might be necessary to have a more profound understanding of the principles of beneficence/non-maleficence. These principles derive from the Hippocratic Oath, particularly “instructing physicians to ‘never do harm’ while acting ‘according to [one’s] ability and [one’s] judgment’” (Brill, 2008, p. 56; emphasis added). From these ideas, several more fully articulated notions derive from principlism (Brill, 2008). For example, beneficence suggests that exposing participants to some risk is justifiable; considered judgments guide researchers when determining a reasonable degree of risk. Beneficence is more sophisticated than do no harm, which is impossible to adhere to in an absolute sense, except in the most extreme situations. Rather, beneficence represents balancing the trade-off between the potential benefits and the justifiable risks of potential harms associated with participation. It is manifest in investigators’ efforts to minimize risks while maximizing potential benefits to the individual participant and/or society (Beauchamp & Childress, 2001). The approach of considered judgments is similar to what McGrath (1981, p. 186) calls the Third Rule of Dilemmatics, or choosing the lesser evil. Completing a survey might be an annoyance to participants but is mostly harmless. Minimizing the length of a survey might be desirable not only to minimize response burden, but also to minimize the amount of (personal) data that require storage and to maximize response rates, thereby ensuring high-quality science. However, if a researcher’s considered judgment suggests that high-quality science requires a lengthy survey, this might be acceptable, insofar as the harm done to participants is not much more than annoyance.

A slightly different, and perhaps more contentious, issue is that of non-response and refusal conversion (Schirmer, 2009). Refusal conversion comprises tactics researchers use to gain cooperation from respondents who have refused an initial survey request or, more typically with internet surveys, have ignored it. Refusal conversion includes sending different versions of a survey’s introduction (e.g., cover letters), or offering respondents incentives to complete the survey (Shuttles et al., 2008), to improve response rates.
The first dilemma relates to balancing harms and benefits across stakeholders, and this approach might apply here as well. However, refusal conversion can be viewed as prioritizing different principles for any one stakeholder – here, most typically, the participants. Consider incentives during refusal conversion, though similar logic applies to other tactics. Considered judgment suggests that using incentives might be acceptable in several situations (Schirmer, 2009). It affects subjects’ autonomy to a degree, but it also balances the harms (i.e., time investment) with the benefits (i.e., monetary reward) of participating in the survey. Incentives facilitate justice, since some participants might truly require compensation for their time. McKeown and Weed (2004) argue that survey researchers have a particular ethical obligation to ensure that their research includes all relevant groups and does not exclude groups that are commonly underrepresented in research. This means the sample represents all relevant groups, regardless of incentives, and that responses from all sampled participants are encouraged, particularly from those who might otherwise be underrepresented. Refusal conversion is relevant to achieving the latter in pursuit of justice. Considered judgments might be needed here to account for other principles and stakeholders. Is the subject’s autonomy always to be prioritized, given scientists’ obligation to science (and potentially justice), as reflected in recruiting representative samples? Yet refusal conversion is expensive and only moderately effective. The issue is complex and must be considered in context.

Most professional organizations’ codes do not suggest how to address broad issues, such as do no harm, and issues such as refusal conversion, because they say little about relationships among the various stakeholders and principles when they conflict. Codes do not always align, or are difficult to combine, with practices currently propagated by IRBs. IRBs purport to prioritize the autonomy and beneficence of the research subject, but they in fact protect universities, consequently obstructing social welfare (Schrag, 2010). This state of affairs does little to help survey researchers address ethical conflicts. What might help instead is knowledge of the principles and methods of principlism, accompanied by careful consideration of how they apply to difficult situations and of which stakeholders are involved or affected.

Table 5.1  Principlism and some common ethical issues in survey research

Survey design and data collection

Autonomy, the right to self-govern; respect for persons; the basis for informed consent
Primary stakeholder†:
1. Informed consent, “the knowing consent of an individual or his legally authorized representative ... without undue inducement or any element of force, fraud, deceit, duress, or any other form of constraint or coercion.” Information provided prior to obtaining consent should include a realistic description of potential benefits and costs, including time, energy, emotional harm, etc. Participants should be aware that participation is voluntary; they can withdraw at any time without negative consequences.
2. Forced response (e.g., binary gender) might violate subjects’ sense of autonomy.
Other stakeholders††:
1a. Deception might be necessary to maintain the integrity of the research but require that respondents are not informed fully.
1b. Refusal conversion, tactics to gain cooperation from respondents (e.g., with incentives), might improve response rates and the quality/usefulness of survey data, but might violate autonomy. Relates to the principle of beneficence.
2. Sometimes it is necessary to provide standard response categories to match and compare with other samples.

Non-maleficence, no needless harm; do no harm
Primary stakeholder†:
3. Response burden (i.e., length of questionnaires) should be minimized. Relates to the objective of data minimization.
4. Painful and distressful questions causing harmful reflections or emotional harm should be avoided. Has harm been reduced by informed consent? Relates to the principle of autonomy.
Other stakeholders††:
3. Length has implications for data quality because of increasing drop-out rates (no conflict).
4. Painful and distressful questions might be necessary to advance knowledge or promote the common good (e.g., surveying victims of abuse and harassment to prevent it).

Beneficence, minimize possible harm and maximize benefits for subjects; decide when research is permissible despite risk of harm
Primary stakeholder†:
5. Surveys should limit risks and maximize benefits to participants. Are there (intrinsic) benefits they might derive by participating?
Other stakeholders††:
5. Many surveys provide benefits for society-at-large and provide minimal, if any, direct benefits to participants. (Not) conducting the survey might maximize individual respondents’ welfare but not the common good. Relates to the principle of justice.

Justice, achieve some fair balance between those who bear the burdens of research and those who benefit from it
Primary stakeholder†:
6. When sampling, researchers must ensure that all relevant groups, especially those typically underrepresented, are represented.
7. Internet surveys have a sampling advantage among populations that are difficult to access, ensuring justice.
8. Payment and compensation: Should people be compensated equally, or based on need/equity? Should managers be compensated more because their time is potentially more valuable, or less because they have less use for the incentive? Can harm be compensated or healed with incentives? Are incentives coercive or do they rectify injustice? Relates to the principles of autonomy and non-maleficence.
Other stakeholders††:
6. Should everyone be represented, and should samples be representative? Is everyone to give an opinion?
7. Internet surveys present additional challenges to researchers who want to ensure ethical data collection and storage. See Nosek et al. (2002).
8. Some participants must spend more time than others (e.g., due to digital literacy on internet surveys). Should they get extra pay for their time and effort? What are the implications for their own welfare, fairness to other participants, and data quality?

Data processing, storage, and communication

Primary stakeholder†:
9. Respondents have the right to privacy, or to determine when and under what conditions to reveal information about themselves.
10. Unethical practices such as p-hacking (i.e., selective reporting), fabrication (i.e., making up data or results and recording or reporting them), falsification (i.e., manipulating research materials, equipment, or processes, or changing or omitting results such that the research is represented inaccurately in the research record), and plagiarism (i.e., theft or misappropriation of intellectual property, or substantial unattributed copying of another’s work) are maleficent (harmful) to individuals and society. They do not honor researchers’ obligations to participants. Relates to the principle of justice.
11. Respondents must be assured of confidentiality, the safeguarding, by a recipient (i.e., survey researcher), of information about another.
12. Respondents’ answers are reported as representing a sociodemographic group and not unique individuals. Was consent for this type of analysis obtained?
Other stakeholders††:
9. Researchers have an obligation of transparent reporting.
10a. These practices also contribute to the replication crisis in psychology, and to lowering public trust in the field.
10b. Any sample that depends on volunteers, or on individuals who self-select into a study, cannot ethically be reported as representing a larger population.
11. There are limits to confidentiality, and some issues are subject to mandatory reporting (e.g., communicable disease or child abuse).
12. To be useful and adhere to ethical principles, data are processed and reported in a way that minimizes individuality.

Note: See Oldendick (2012) and Singer (2012) for some of these issues in-depth; Nosek et al. (2002) and Singer (2018) for internet-based surveys in particular. †Research subjects and participants. ††General public and society, clients, the funding body, science/knowledge/truth, the scientific community, colleagues, users of research, students, and practitioners.

DISCUSSION

Criticisms and Extensions of Principlism

Although principlism began during the late 1970s, the term itself emerged later, in the 1990s, as a vaguely derogatory label (Clouser & Gert, 1990; Beauchamp & Childress, 2001). The primary criticism at the time was that the principles lacked systematic relationships with each other and often conflicted; since there was no unified moral theory from which they derived, the resulting conflicts were unresolvable (Clouser & Gert, 1990). More broadly, the framework was criticized for being relativist: it endorsed a range of theories instead of a single, unified ethical theory, a kind of ethical relativism that has been endorsed in most areas of applied and professional ethics (Clouser & Gert, 1990). Beauchamp and Childress (2001) defended their theory by pointing to the methods and theoretical foundations of principlism (Beauchamp, 2015).

Within the medical ethics literature, the principles have been debated extensively, and even assessed empirically. For example, healthcare practitioners in the field endorse non-maleficence above the other three principles, but that preference does not carry over when practitioners make real-world ethical judgments (Page, 2012). Even when people declare that they value these medical ethical principles, they do not appear to use them directly while making decisions.

Other criticisms concern the extent to which the four principles are exhaustive of relevant moral norms, or generalize across national, cultural, and institutional contexts (DeMarco, 2005; Walker, 2009). Although “principlist theory is committed to a global bioethics because the principles are universally applicable, not merely local, customary, or cultural rules” (Beauchamp & Rauprich, 2016, p. 2283), recent advances in cultural psychology and descriptive ethics challenge principlism’s universal generalizability. Harm, fairness, and justice appear in all cultures, including non-Western ones, but many moral systems do not protect the welfare and autonomy of individuals above all else (Haidt, 2008; Walker, 2009). Ingroup loyalty, authority, respect, and spiritual purity are important parts of the moral domain outside of Western nations, and even within Western nations, political conservatives and conservative religious communities endorse this broader domain. Principlism has clearly been influenced strongly by liberal political theory from Kant, John Stuart Mill, and John Rawls (Haidt, 2008). In liberal political theory, and in principlism, people are conceived as reasoning beings who have equal worth and must be treated as ends in themselves, never solely as means to other goals. Liberal individualism will resonate with most readers of this chapter, but it can obscure the importance of alternative norms within communitarianism and traditionalism.
Walker (2009) suggests that principlism can be enriched by additional principles drawn from all three ethics – autonomy, community, and divinity (Shweder et al., 1997). Principlism thus might be developed by drawing not only on harm, justice, and autonomy, but also on the ethics of community, in which duty, respect, and interdependence are relevant, and of divinity, in which tradition and purity are prominent.

A list of four principles might yet be too narrow from a different perspective (DeMarco, 2005). Principlism does not deal satisfactorily with situations in which two or more obligations hold but only one can be satisfied. Such common moral dilemmas have been handled in two ways: (1) only one prima facie obligation entails a genuine obligation, and (2) all obligations are genuine and so a moral residue remains, requiring moral regret or perhaps a derived obligation, such as compensation. DeMarco (2005) proposes adding a mutuality principle as a third alternative, which he considers the best solution in such difficult situations. When principles conflict and one or more are violated, value is lost, and the mutuality principle (i.e., establishing mutual enhancement of all basic moral values) deals with such losses. In the several examples DeMarco (2005) discusses, there are forward-looking ways of eliminating moral conflicts, suggesting that where feasible and appropriate, concrete actions, beyond simple regret, are required by the principle. Developing alternatives that defuse a dilemma is morally desirable. For survey researchers dealing with non-response, for example, this might mean developing tactics that encourage voluntary responses without resorting to refusal conversion.

International and Indigenous Issues in Survey Research Ethics

Most of the guidelines discussed in this chapter pertain to the United States and some European contexts, but national and cultural differences exist. For example, the Swedish Research Council (2017) publishes 86 pages of excellent ethical guidelines, while a 15-page Dutch code offers concrete ethical dilemmas as illustrations (Association of Universities in the Netherlands, 2012). This chapter does not review extant ethical guidelines in Asia, Africa, and Latin America, partially because of the linguistic difficulties of identifying and reviewing them, and partially because they tend to align closely with European and North American guidelines (Lamas et al., 2010; Kruger et al., 2014; Pratt et al., 2014). Given the criticisms of principlism, it is relevant to consider how principles that derive from ideals of community and divinity are relevant to research, particularly across cultural contexts.

Related to international differences are disparities in data protection regulations. Although legislation regarding data processing exists in Europe and the United Kingdom, no comprehensive framework exists, for example, in the United States. Legislation falls outside the scope of this chapter, but data protection legislation is worth noting insofar as it relates to survey-specific research ethics. Survey data might be archived and used in ways that were not obvious when the data were collected, a practice that has recently raised many concerns. In the European Union, legislation (the General Data Protection Regulation, or GDPR, EU 2016/679) greatly restricts data collection and processing. GDPR guidelines contain principles for data processing and the accompanying rights of data subjects. Related to transparency and avoiding deception during survey research, the GDPR (Article 13) requires subjects to be made aware if data are to be used for any other purposes, shared with research partners, or transferred. The principle of data minimization suggests that data collection should involve only data that are necessary and proportionate to achieve the purpose for which they were collected (Article 5(1)). Even if it is lawful, as it might be in some parts of the world, collecting data that a researcher does not need immediately for a project might be unethical.

Technological capacities for the collection, storage, analysis, and sharing of data are evolving rapidly, and so too are the possibilities of improving people’s daily lives. However, data use can also result in exploitation and harm, as is evident in the case of indigenous peoples (West et al., 2020). Indigenous data sovereignty emerged in response, combining indigenous research ethics, cultural and intellectual property rights, and indigenous governance discourse to offer solutions to challenges present in open-data environments.

Concluding Thoughts

“Ethical principles in survey research are in place to protect individual participant(s) beginning at the start of study recruitment, through participation and data collection, to dissemination of research findings in a manner that is confidential, private, and respectful” (Valerio & Mainieri, 2008, p. 2). However, professional codes of ethics in the social sciences, including those that guide ethical practices in survey research, commonly contain a list of principles, rules or best practices, and aspirational ideals, but it is unclear what the theoretical, philosophical, or empirical bases of such proscriptions and prescriptions are. Ethical principles in survey research are not given sufficient consideration in most anthologies; they are simply lists of best practices that guide researchers and professionals. Researchers are thus expected to internalize rules with little to no understanding of why such rules are meaningful. Without more substantive treatment of such rules, researchers cannot be expected to resolve moral dilemmas when principles conflict. A more nuanced understanding of principlism might enrich survey researchers’, and many other social scientists’, approaches to ethics, facilitating ethical decisions. Although influenced heavily by principlism, research and education on the principles of social and behavioral research, including survey research, lag behind those in medical ethics and bioethics. Principlism might also improve dialogues between researchers and their IRBs.

Notes

1  Survey research typically undergoes expedited or limited board review, and full review is required only in special circumstances, such as survey research involving children under 18.
2  The full text is available at information/exhibitions/online-exhibitions/special-focus/doctors-trial/nuremberg-code
3  Current and archived versions are available at
4  This might partially explain the heavy involvement of IRBs in contemporary social research.
5  The full text is available at regulations-and-policy/belmont-report/index.html
6  Although not very detailed, this explanation of the requirement for consent is considered much more than the checklist commonly provided to survey researchers in training today.
7  Beauchamp and Childress (1979) argue, “In recent years, the favored category to represent this universal core of morality in public discourse has been human rights, but moral obligation and moral virtue are no less vital parts of the common morality” (p. 3). Beyond the most basic principles of the common morality, the framework encompasses several types of moral norms, including rules, rights, virtues, moral ideals, emotions, and other moral considerations. All are important in the framework, though principles provide the most general and comprehensive norms.
8  Teleological ethics determines the morality of an action by examining its consequences, whereas deontological ethics determines the morality of an action by examining the action itself – whether the action is right or wrong based on rules rather than its consequences.
9  Some codes were reviewed but excluded from the summary because they would be repetitive. For example, the Society for Industrial and Organizational Psychology endorses the APA’s Code of Ethics, though highlighting some excerpts that are particularly relevant for I-O psychologists. Some codes were too scant to merit review (e.g., the Institute for Operations Research and the Management Sciences offers only a one-page flyer).

REFERENCES

Ali, S., & Kelly, M. (2016). Ethics and social research. In C. Seale (Ed.), Researching society and culture (pp. 44–60). Sage.
Association of Universities in the Netherlands (2012). The Netherlands code of conduct for scientific practice. VSNU.
Beauchamp, T. L. (2015). The theory, method, and practice of principlism. In J. Z. Sadler, K. W. M. Fulford, & W. (C.W.) van Staden (Eds.), The Oxford handbook of psychiatric ethics (pp. 405–22). Oxford University Press.
Beauchamp, T. L., & Childress, J. F. (1979). Principles of biomedical ethics. Oxford University Press.
Beauchamp, T. L., & Childress, J. F. (2001). Principles of biomedical ethics (5th ed.). Oxford University Press.
Beauchamp, T. L., & Rauprich, O. (2016). Principlism. In H. ten Have (Ed.), Encyclopedia of global bioethics. Springer.
Bloom, J. D. (2008). Common rule. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (Vol. 1, pp. 110–11). Sage.
Brill, J. E. (2008). Beneficence. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (Vol. 1, p. 56). Sage.
Buchanan, E. A., & Hvizdak, E. E. (2009). Online survey tools: Ethical and methodological concerns of human research ethics committees. Journal of Empirical Research on Human Research Ethics, 4(2), 37–48.
Clouser, K. D., & Gert, B. (1990). A critique of principlism. The Journal of Medicine and Philosophy, 15(2), 219–36.
de Vaus, D. (2014). Surveys in social research (6th ed.). Routledge.
DeMarco, J. P. (2005). Principlism and moral dilemmas: A new principle. Journal of Medical Ethics, 31(2), 101–5.
Grayson, J. P., & Myles, R. (2004). How research ethics boards are undermining survey research on Canadian university students. Journal of Academic Ethics, 2, 293–314.
Haidt, J. (2008). Morality. Perspectives on Psychological Science, 3(1), 65–72.
Hickey, A., Davis, S., Farmer, W., et al. (2021). Beyond criticism of ethics review boards: Strategies for engaging research communities and enhancing ethical review processes. Journal of Academic Ethics.
Holbrook, A. (2008). Response order effects. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (pp. 755–6). Sage.
Howell, S. C., Quine, S., & Talley, N. J. (2003). Ethics review and use of reminder letters in postal surveys: Are current practices compromising an evidence-based approach? The Medical Journal of Australia, 178, 43.
Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–67.
Kruger, M., Ndebele, P., & Horn, L. (Eds.). (2014). Research ethics in Africa: A resource for research ethics committees. African Sun Media.
Lamas, E., Ferrer, M., Molina, A., Salinas, R., Hevia, A., Bota, A., Feinholz, D., Fuchs, M., Schramm, R., Tealdi, J.-C., & Zorrilla, S. (2010). A comparative analysis of biomedical research ethics regulation systems in Europe and Latin America with regard to the protection of human subjects. Journal of Medical Ethics, 36(12), 750–3.
Lavin, M. (2003). Thinking well about ethics: Beyond the code. In W. T. O’Donohue & K. E. Ferguson (Eds.), Handbook of professional ethics for psychologists: Issues, questions, and controversies. Sage.
Losch, M. E. (2008). Informed consent. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (Vol. 1, pp. 336–7). Sage.
McGrath, J. E. (1981). Dilemmatics: The study of research choices and dilemmas. American Behavioral Scientist, 25(2), 179–210.
McKeown, R. E., & Weed, D. L. (2004). Ethical choices in survey research. Social and Preventive Medicine, 49, 67–8.
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research [NC] (1979). The Belmont report. United States Government Printing Office.
Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). E-research: Ethics, security, design, and control in psychological research on the Internet. Journal of Social Issues, 58(1), 161–76.
O’Donohue, W., & Ferguson, K. E. (Eds.). (2003). Handbook of professional ethics for psychologists: Issues, questions, and controversies. Sage.
Oldendick, R. W. (2012). Survey research ethics. In L. Gideon (Ed.), Handbook of survey methodology for the social sciences (pp. 23–35). Springer.
Page, K. (2012). The four principles: Can they be measured and do they predict ethical decision making? BMC Medical Ethics, 13(10). https://doi.org/10.1186/1472-6939-13-10
Pratt, B., Van, C., Cong, Y., Rashid, H., Kumar, N., Ahmad, A., Upshur, R., & Loff, B. (2014). Perspectives from South and East Asia on clinical and research ethics: A literature review. Journal of Empirical Research on Human Research Ethics, 9(2), 52–67.
Rawls, J. (1971). A theory of justice. Harvard University Press.
Schirmer, J. (2009). Ethical issues in the use of multiple survey reminders. Journal of Academic Ethics, 7(1/2), 125–39.
Schrag, Z. M. (2010). Ethical imperialism: Institutional review boards and the social sciences, 1965–2009. JHU Press.
Shuster, E. (1997). Fifty years later: The significance of the Nuremberg Code. New England Journal of Medicine, 337(20), 1436–40.
Shuttles, C., Lavrakas, P., & Lai, J. (2008). Refusal conversion. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (pp. 705–7). Sage.
Shweder, R., Much, N., Mahapatra, M., & Park, L. (1997). The “big three” of morality (autonomy, community, divinity) and the “big three” explanations of suffering. In A. Brandt & P. Rozin (Eds.), Morality and health. Routledge.
Singer, E. (2003). Exploring the meaning of consent: Participation in research and beliefs about risks and benefits. Journal of Official Statistics, 19, 273–85.
Singer, E. (2012). Ethical issues in surveys. In E. D. de Leeuw, J. J. Hox, & D. A. Dillman (Eds.), International handbook of survey methodology (pp. 78–96). Routledge.
Singer, E. (2018). Ethical considerations in internet surveys. In M. Das, P. Ester, & L. Kaczmirek (Eds.), Social and behavioral research and the internet (pp. 133–62). Taylor & Francis.
Singer, E., & Levine, F. J. (2003). Protection of human subjects of research: Recent developments and future prospects for the social sciences. The Public Opinion Quarterly, 67(1), 148–64.
Stern, A. M. (2005). Sterilized in the name of public health: Race, immigration, and reproductive control in modern California. American Journal of Public Health, 95(7), 1128–38.
Swedish Research Council (2017). Good research practice. Vetenskapsrådet.
Valerio, M. A., & Mainieri, T. (2008). Ethical principles. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (Vol. 1, pp. 244–6). Sage.
Walker, T. (2009). What principlism misses. Journal of Medical Ethics, 35(4), 229–31.
West, K., Hudson, M., & Kukutai, T. (2020). Data ethics and data governance from a Māori world view. In L. George, J. Tauri, & L. T. A.o.T. MacDonald (Eds.), Indigenous research ethics: Claiming research sovereignty beyond deficit and the colonial legacy (Vol. 6, pp. 67–81). Emerald.
WMA (2018). Declaration of Helsinki. World Medical Association.



APPENDIX A: SUMMARY OF SELECT DOCUMENTS GUIDING RESEARCH IN SURVEY AND SOCIAL RESEARCH IN THE UNITED STATES AND EUROPE

Many of the documents available to guide survey researchers are organized around a few general, aspirational ideals (i.e., principles) and a large number of specific, enforceable standards (i.e., rules). The boundary between principles and rules is fuzzy, however, and what are principles for some are rules for others. For example, informed consent is one rule among many for the APA, but a fundamental principle for the Insights Association. Table 5.A1 illustrates that broad values and specific norms often find themselves on equal footing. For example, "data minimization" is as significant as the "duty of care" in the document put forward by ICC/ESOMAR, while "do no harm" and "foster trust in the marketing system" are equally important for members of the AMA. Whereas confidentiality is arguably but one manifestation of "respect for civil and human rights", it sits next to this more general principle within the code of the AIB.

Another issue to look out for concerns the use of descriptive versus prescriptive language. The APA code uses somewhat unusual and puzzling language to express ethical standards: it uses the indicative mood to define both its General Principles (aspirational in nature, guiding psychologists toward the very highest ethical ideals of the profession) and its Ethical Standards (obligations that form the basis for imposing sanctions). It is unclear why this style was chosen, but the APA is not alone: the AOM and the ASA, among others, adopt it as well. Potentially even more confusing, some codes (e.g., SMS) mix normative and descriptive language.

Finally, an overarching concern is the lack of systematic treatment of the theory (whether normative or descriptive) behind the documents. Some professional bodies offer resources beyond the specific codes, which may be helpful to practicing survey researchers. For example, the AIB and the APSA attempt to provide context by reviewing the history of their documents and the process of their creation; the AAA offers a section on cases and ethical dilemmas, which may be used to train scholars in applying the broad guidelines to particular ethical situations. As a rule, however, the guidelines stand in a vacuum. To take the example of the ICC/ESOMAR guidelines, which are perhaps the most comprehensive within survey and public opinion research, there is little or no systematic treatment of the sources of inspiration, the stakeholders involved in crafting the documents, or the history of the documents.

Table 5.A1  Summary of select documents guiding research in survey and social research in the United States and Europe

Country / year: Europe/Germany, 2017
Body/organization: ALLEA – All European Academies (The European Federation of Academies of Sciences and Humanities)
Document name: The European Code of Conduct for Research Integrity (i)
Scope: Research in all scientific and scholarly fields
Type of guidelines: Fundamental principles and good research practices (descriptive). Example: "Researchers across the entire career path, from junior to the most senior level, undertake training in ethics and research integrity."
Summary of major guidelines: Good research practices are based on fundamental principles of research integrity. These principles are:
•  Reliability in ensuring the quality of research, reflected in the design, the methodology, the analysis and the use of resources.
•  Honesty in developing, undertaking, reviewing, reporting and communicating research in a transparent, fair, full and unbiased way.
•  Respect for colleagues, research participants, society, ecosystems, cultural heritage and the environment.
•  Accountability for the research from idea to publication, for its management and organisation, for training, supervision and mentoring, and for its wider impacts.
In addition to the main principles, the document describes good research practices in the following contexts: Research Environment; Training, Supervision and Mentoring; Research Procedures; Safeguards; Data Practices and Management; Collaborative Working; Publication and Dissemination; Reviewing, Evaluating and Editing.
Notes and additional resources: Annex 1 provides key resources for research ethics (mostly, across all the sciences and the humanities), whereas Annex 2 overviews the revision process and list of stakeholders.


Country / year: International/The Netherlands, 2016
Body/organization: The International Chamber of Commerce (ICC) and the European Society for Opinion and Marketing Research (ESOMAR)
Document name: ICC/ESOMAR International Code on Market, Opinion and Social Research and Data Analytics (ii)
Scope: Professional practice: market and social research
Type of guidelines: Fundamental principles and substantive articles (normative). Example: "Throughout this document the word 'must' is used to identify mandatory requirements, that is, a principle or practice that researchers are obliged to follow. The word 'should' is used when describing implementation and denotes a recommended practice."
Summary of major guidelines: Fundamental principles:
1. Responsibilities to data subjects: Article 1 – Duty of care; Article 2 – Children, young people and other vulnerable individuals; Article 3 – Data minimization; Article 4 – Primary data collection; Article 5 – Use of secondary data; Article 6 – Data protection and privacy.
2. Responsibilities to clients: Article 7 – Transparency.
3. Responsibilities to the general public: Article 8 – Publishing findings.
4. Responsibilities to the research profession: Article 9 – Professional responsibility; Article 10 – Legal responsibility; Article 11 – Compliance; Article 12 – Implementation.
Notes and additional resources: The original version of this document is among the oldest documents guiding survey research, with the first ESOMAR code published in 1948. ICC (the International Chamber of Commerce) is the world's largest business organization with a network of over 6.5 million members in more than 130 countries. ESOMAR is the global voice of the data, research and insights community, speaking on behalf of over 4900 individual professionals and 500 companies who provide or commission data analytics and research in more than 130 countries, all of whom agree to uphold the ICC/ESOMAR International Code. All ICC codes and guidelines, and all ESOMAR codes and guidelines, are available on the respective organizations' websites.


Country / year: UK, 2021
Body/organization: UK Research & Innovation, Economic and Social Research Council (ESRC)
Document name: Framework for Research Ethics (iii)
Scope: Research in the social sciences
Type of guidelines: Fundamental principles (normative). Example: "This ESRC framework for research ethics sets out good practice for social science research, detailing our principles and expectations from researchers, research organizations (ROs) and research ethics committees (RECs). Researchers, ROs and RECs should consider ethics issues throughout the lifecycle of a research project and promote a culture of ethical reflection, debate and mutual learning."
Summary of major guidelines: The six principles of ethical research are:
1. research should aim to maximize benefit for individuals and society and minimize risk and harm;
2. the rights and dignity of individuals and groups should be respected;
3. wherever possible, participation should be voluntary and appropriately informed;
4. research should be conducted with integrity and transparency;
5. lines of responsibility and accountability should be clearly defined;
6. independence of research should be maintained and where conflicts of interest cannot be avoided they should be made explicit.
Notes and additional resources: Ethics in practice: ethics issues are best understood within the context of specific research projects, and ESRC encourages the research community to share guidance, experience and solutions to ethics dilemmas to facilitate innovative research. Ethics case studies highlight some ethics issues encountered by ESRC-funded researchers. See also "The Research Ethics Guidebook".


Country / year: USA, 2018 (2020)
Body/organization: Academy of International Business (AIB)
Document name: Code of Ethics (iv)
Scope: Research, education, and professional service in the social sciences (primarily in terms of the roles of university professors)
Type of guidelines: Components and principles (normative). Example: "In their research, education and professional activities, AIB members should: a. Encourage the free expression and exchange of scientific ideas; b. Pursue the highest possible professional and scientific standards, including an unwavering commitment to transparency and verifiability in research; and c. Remain current in their scientific and professional knowledge."
Summary of major guidelines: Principles guiding Professional Activities:
1. Universal Values – AIB members are expected to respect and protect civil and human rights as outlined in the universal values of UN Human Rights Conventions
2. Equal Treatment
3. Non-Harassment
4. Non-Abuse of Power
5. Fair Employment Practices
6. Confidentiality
7. Fraud and Misrepresentation
8. Anti-Trust/Competition
9. Anti-Corruption/Anti-Bribery.
Principles are organized loosely under "Core Components", namely: a. Competence and Expertise; b. Professional Activities; c. Conflicts of Interest; d. Public Communication; e. Stewardship; f. Research and Publication; g. Teaching and Education. For example, principles guiding Research and Publication include elaborate sections on "Representation" and "Plagiarism/Redundancy" but also relatively loose normative statements such as "conduct and report research with the highest standards of objectivity, accuracy and quality" under the heading "Research Planning, Implementation and Dissemination".
Notes and additional resources: The code provides examples of ethical issues involving each of these components and includes a brief history of itself. It also includes some considerations regarding the influence of national culture on ethics and ethical behavior in light of the international diversity of AIB members. The AIB Code of Ethics is supplemented by the Leadership Code of Ethics, Journals Code of Ethics, and AIB Insights Code of Ethics.


Country / year: USA, n.d.
Body/organization: Academy of Management (AOM)
Document name: Code of Ethics (v)
Scope: Professional practice (academicians, researchers, and managers)
Type of guidelines: Principles and standards (descriptive). Example: "AOM members establish relationships of trust with those with whom they work (students, colleagues, administrators, clients). … Relationships with students require respect, fairness, and caring, along with commitment to our subject matter and to teaching excellence."
Summary of major guidelines: The Preamble, General Principles (Responsibility, Integrity, Respect for People's Rights and Dignity), and Professional Principles (to various stakeholders including students, knowledge, all people in the world) set forth aspirational goals to guide AOM members toward the highest ideals of research, teaching, practice, and service. The Ethical Standards (e.g., avoiding harm and informed consent, plagiarism) set forth enforceable rules for conduct by AOM members.
Notes and additional resources: The Academy of Management Code of Ethics is adapted from the codes of other professional associations whose scientific principles are similar to the Academy's. These include the APA, AIA, and ACA and others referenced in Codes of Professional Responsibility (Gorlin, R. (Ed.) (1999). Codes of professional responsibility (4th ed.). Washington, DC: Bureau of National Affairs).

Country / year: USA, 2012
Body/organization: American Anthropological Association (AAA)
Document name: AAA Statement on Ethics (vi)
Scope: Social science research
Type of guidelines: Principles (normative). Example: "Anthropologists must weigh competing ethical obligations to research participants, students, professional colleagues, employers and funders, among others, while recognizing that obligations to research participants are usually primary."
Summary of major guidelines: Principles of Professional Responsibility:
1.  Do No Harm
2.  Be Open and Honest Regarding Your Work
3.  Obtain Informed Consent and Necessary Permissions
4.  Weigh Competing Ethical Obligations Due Collaborators and Affected Parties
5.  Make Your Results Accessible
6.  Protect and Preserve Your Records
7.  Maintain Respectful and Ethical Professional Relationships.
Notes and additional resources: The document contains supporting resources, including further readings and case studies. The website offers guidelines on how to teach ethics and methods, including a reference to Joan Cassell and Sue-Ellen Jacobs (Eds.) (1987), Handbook on ethical issues in anthropology.


Country / year: USA, 2021
Body/organization: American Association for Public Opinion Research (AAPOR)
Document name: The Code of Professional Ethics and Practices (vii)
Scope: Survey research (all public opinion and survey researchers)
Type of guidelines: Principles and actions (normative). Example: "We will avoid practices or methods that may harm, endanger, humiliate, or unnecessarily mislead participants and potential participants."
Summary of major guidelines: "As AAPOR members, we pledge to maintain the highest standards of scientific competence, integrity, accountability, and transparency in designing, conducting, analyzing, and reporting our work, and in our interactions with participants (sometimes referred to as respondents or subjects), clients, and the users of our research. We pledge to act in accordance with principles of basic human rights in research."
Notes and additional resources: AAPOR also provides information on working with Institutional Review Boards, details best practices, and offers various working examples, disclosure FAQs and more. Much of AAPOR's work is in developing and promoting resources that help researchers meet demanding standards.

Country / year: USA, 2019
Body/organization: American Marketing Association (AMA)
Document name: Codes of Conduct, AMA Statement of Ethics (viii)
Scope: Professional practice (practitioners, academics and students)
Type of guidelines: Ethical norms and ethical values (normative). Example: "As Marketers, we must: Do no harm."
Summary of major guidelines:
1.  Do no harm.
2.  Foster trust in the marketing system.
3.  Embrace ethical values. This means building relationships and enhancing consumer confidence in the integrity of marketing by affirming these core values: honesty, responsibility, fairness, respect, transparency and citizenship. (The document describes the core values one by one.)


Country / year: USA, 2020
Body/organization: American Political Science Association (APSA)
Document name: Principles and Guidance for Human Subjects Research (ix)
Scope: Research with human subjects
Type of guidelines: Principles and guidance (normative). Example: "Consent 5. Political science researchers should generally seek informed consent from individuals who are directly engaged by the research process, especially if research involves more than minimal risk of harm or if it is plausible to expect that engaged individuals would withhold consent if consent were sought." "Guidance (Consent). Elements usually included in consent processes: In general, when seeking consent, researchers should usually communicate [a checklist of items]."
Summary of major guidelines: General Principles:
1.  Political science researchers should respect autonomy, consider the wellbeing of participants and other people affected by their research, and be open about the ethical issues they face and the decisions they make when conducting their research.
2.  Political science researchers have an individual responsibility to consider the ethics of their research-related activities and cannot outsource ethical reflection to review boards, other institutional bodies, or regulatory agencies.
3.  These principles describe the standards of conduct and reflexive openness that are expected of political science researchers. In some cases, researchers may have good reasons to deviate from these principles (for example, when the principles conflict with each other). In such cases, researchers should acknowledge and justify deviations in scholarly publications and presentations of their work.
Additional principles (and accompanying guidelines) relate to power, consent, deception, harm and trauma, confidentiality, impact, laws, regulations, and prospective review, and shared responsibilities.
Notes and additional resources: A Guide to Professional Ethics in Political Science (2012), an alternative document that goes beyond research, is also published by the APSA. The website contains other resources, including the history of ethics in APSA.


Country / year: USA, 2003 (2010, 2016)
Body/organization: American Psychological Association (APA)
Document name: Ethical Principles of Psychologists and Code of Conduct (x) ("the Ethics Code")
Scope: Professional practice (activities that are part of psychologists' scientific, educational, or professional roles)
Type of guidelines: General principles and specific ethical standards (descriptive). Example: "Psychologists do not fabricate data."
Summary of major guidelines: General Principles (A–E): Principle A: Beneficence and Non-maleficence; Principle B: Fidelity and Responsibility; Principle C: Integrity; Principle D: Justice; Principle E: Respect for People's Rights and Dignity.
Specific standards related to Research and Publication (Section 8):
8.01 Institutional Approval
8.02 Informed Consent to Research
8.03 Informed Consent for Recording Voices and Images in Research
8.04 Client/Patient, Student, and Subordinate Research Participants
8.05 Dispensing with Informed Consent for Research
8.06 Offering Inducements for Research Participation
8.07 Deception in Research
8.08 Debriefing
8.09 Humane Care and Use of Animals in Research
8.10 Reporting Research Results
8.11 Plagiarism
8.12 Publication Credit
8.13 Duplicate Publication of Data
8.14 Sharing Research Data for Verification
8.15 Reviewers.
Notes and additional resources: The Ethics Code applies to these activities across a variety of contexts, such as in person, postal, telephone, internet, and other electronic transmissions.


Country / year: USA, 2018
Body/organization: American Sociological Association (ASA)
Document name: Code of Ethics (xi)
Scope: Professional practice, but mostly social science research and teaching
Type of guidelines: Principles and standards (descriptive). Example: "Sociologists take all reasonable precautions to protect the confidentiality rights of research participants."
Summary of major guidelines: Principle A: Professional Competence; Principle B: Integrity; Principle C: Professional and Scientific Responsibility; Principle D: Respect for People's Rights, Dignity, and Diversity; Principle E: Social Responsibility; Principle F: Human Rights.
Standards:
1. Competence
2. Representation and Misuse of Expertise
3. Delegation and Supervision
4. Discrimination
5. Exploitation
6. Harassment
7. Employment Decisions
8. Conflicts of Interest and Commitment
9. Public Communications
10. Confidentiality (10.1 Confidentiality in Research; 10.2 Confidentiality in Teaching)
11. Informed Consent (11.4 Use of Deception in Research)
12. Research Planning, Implementation, and Dissemination
13. Plagiarism
14. Authorship
15. Publication Process
16. Responsibilities of Reviewers
17. Education, Teaching, and Training
18. Contractual and Consulting Services
19. Adherence to the Code of Ethics.
Notes and additional resources: As is also the case with the APA and many other codes following the model of general principles and specific standards, "the Preamble and General Principles of the Code are aspirational goals to guide sociologists toward the highest ideals of Sociology. Although the Preamble and General Principles are not enforceable rules, they should be considered by sociologists in arriving at an ethical course of action and may be considered by ethics bodies in interpreting the Ethical Standards. The Ethical Standards set forth enforceable rules of scientific and professional conduct for sociologists." Professional competence is a principle here but a standard in the APA's version (Section 2: Competence).


Country / year: USA, 2018
Body/organization: American Statistical Association
Document name: Ethical Guidelines for Statistical Practice (xii)
Scope: Social (and quantitative) science research: all practitioners of statistics and quantitative sciences
Type of guidelines: Guideline principles (descriptive). Example: "The ethical statistician: … Employs selection or sampling methods and analytic approaches appropriate and valid for the specific question to be addressed, so that results extend beyond the sample to a population relevant to the objectives with minimal error under reasonable assumptions."
Summary of major guidelines:
A. Professional Integrity and Accountability
B. Integrity of Data and Methods
C. Responsibilities to Science/Public/Funder/Client
D. Responsibilities to Research Subjects
E. Responsibilities to Research Team Colleagues
F. Responsibilities to Other Statisticians or Statistics Practitioners
G. Responsibilities Regarding Allegations of Misconduct
H. Responsibilities of Employers, Including Organizations, Individuals, Attorneys, or Other Clients Employing Statistical Practitioners.
Notes and additional resources: "Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations. In some situations, guideline principles may conflict, requiring individuals to prioritize principles according to context. However, in all cases, stakeholders have an obligation to act in good faith, to act in a manner that is consistent with these guidelines, and to encourage others to do the same. Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical."


Country / year: USA, 2017
Body/organization: COPE (Committee on Publication Ethics)
Document name: Core Practices (xiii)
Scope: Professional: applicable to all involved in publishing scholarly literature: editors and their journals, publishers, and institutions
Type of guidelines: Core practices (normative). Example: "Journals should include policies on data availability and encourage the use of reporting guidelines and registration of clinical trials and other study designs according to standard practice in their discipline."
Summary of major guidelines:
1. Allegations of misconduct
2. Authorship and contributorship
3. Complaints and appeals
4. Conflicts of interest / competing interests
5. Data and reproducibility
6. Ethical oversight
7. Intellectual property
8. Journal management
9. Peer review processes
10. Post-publication discussions and corrections.
Notes and additional resources: The Core Practices were developed in 2017, replacing the Code of Conduct.


Country / year: USA, 2019
Body/organization: Insights Association (IA)
Document name: IA Code of Standards and Ethics for Marketing Research and Data Analytics (xiv)
Scope: Professional practice: market research and data analytics
Type of guidelines: "Fundamental, overarching principles of ethics and professionalism for the industry, supplemented by guidelines that assist practitioners and companies with its application" (normative). Example: "Researchers must obtain the data subject's consent for research participation and the collection of PII or ensure that consent was properly obtained by the owner of the data or sample source."
Summary of major guidelines: The Code is based on the following fundamental principles:
1.  Respect the data subjects and their rights as specified by law and/or by this Code.
2.  Be transparent about the collection of PII; only collect PII with consent and ensure the confidentiality and security of PII.
3.  Act with high standards of integrity, professionalism and transparency in all relationships and practices.
4.  Comply with all applicable laws and regulations as well as applicable privacy policies and terms and conditions that cover the use of data subjects' data.
Specific sections range from Section 1 (Duty of Care), under responsibilities to data subjects, to Section 7 (Honesty and Transparency), under responsibilities to clients.
Notes and additional resources: The Insights Association was founded in 2017 with the merger of CASRO, a trade association formed in 1975, and MRA, a professional society founded in 1957. The IA Code is similar to the ICC/ESOMAR Code in many ways. The website provides supplemental guidelines (mostly ESOMAR resources), including guides for online research and online sample quality, for mobile research, for social media research, for research with children, young people and vulnerable populations, and guidelines on the duty of care. It links to the EphMRA Code and Guidelines, the Intellus Worldwide Code and Guidelines, ISO 20252 (market, opinion and social research, including insights and data analytics), and ISO 27001 (information technology and security requirements), among others.


Country / year: USA, 1999
Body/organization: National Communication Association (NCA)
Document name: A Code of Professional Ethics for the Communication Scholar/Teacher (xv)
Scope: Social science research and teaching (communication scholar/teacher)
Type of guidelines: Values and principles (descriptive). Example: "Responsibility to others entails honesty and openness. Thus, the ethical communication researcher obtains informed consent to conduct the research, where appropriate to do so."
Summary of major guidelines: "We believe that ethical behavior is guided by values such as: integrity, fairness, professional and social responsibility, equality of opportunity, confidentiality, honesty and openness, respect for self and others, freedom, and safety. The guidelines that follow offer means by which these values can be made manifest in our teaching, research, publications, and professional relationships with colleagues, students, and in the society as a whole."
Notes and additional resources: Communication researchers working in the social science tradition are urged to consult the APA guidelines for specific advice concerning the ethical conduct of social scientific research. Some principles specific to communication researchers are articulated in this document. Also available is the "NCA Credo for Ethical Communication" (November 1999).

Country / year: USA, 2007 (2014)
Body/organization: Society for Human Resource Management (SHRM)
Document name: Code of Ethics (xvi)
Scope: Professional practice
Type of guidelines: Core principles, intent and guidelines (a mix of normative and descriptive). Examples: "We accept professional responsibility for our individual decisions and actions." "As professionals we must strive to meet the highest standards of competence and commit to strengthen our competencies on a continuous basis."
Summary of major guidelines: The document contains core principles, intent, and guidelines in the areas of:
1.  Professional responsibility
2.  Professional development
3.  Ethical leadership
4.  Fairness and justice
5.  Conflicts of interest
6.  Use of information.


Country / year: USA, 2020
Body/organization: Strategic Management Society (SMS)
Document name: Guidelines for Professional Conduct (xvii)
Scope: Professional practice, but mostly social science research and teaching
Type of guidelines: Guidelines that describe values (a mix of normative and descriptive). Examples: "SMS members conduct their professional affairs in ways that inspire trust and confidence." "Researchers should seek to ensure that their findings are properly understood and applied."
Summary of major guidelines: These guidelines focus on two areas: basic values espoused and professional values which foster excellence in practice, research, and teaching. Basic Values: (1) honesty and integrity; (2) openness; (3) respect for people's rights, dignity, and diversity; (4) professional and scientific accountability. Values Associated with Professional Conduct and Activities.
Notes and additional resources: The document is only 4 pages in length. In contrast, the AIB code is 24 pages, and the ASA code is 20 pages.

Notes to Table 5.A1:
(ii) The website offers translations in many European languages.
(xiv) Previously the Council of American Survey Research Organisations (CASRO) Code of Standards and Ethics for Market, Opinion, and Social Research.



6 Sampling Considerations for Survey Research
Anna M. Zabinski, Lisa Schurer Lambert and Truit W. Gray

The purpose of research is to generate knowledge about how our world works (Shadish et al., 2002; Van de Ven, 2007). In the interest of generating knowledge, it is often easier to examine parts of the world by studying research questions through experiments, surveys, archival analysis, and other techniques. In organizational science, for instance, results from numerous investigations accumulate into knowledge of people in the workplace. The extent to which results from single investigations add to generalizable knowledge is a central question that requires an answer. Research can serve one of two purposes: first, to explore and identify whether relationships between variables or phenomena exist; second, to apply findings from smaller investigations more broadly. In the first purpose, research functions to show that something can occur, perhaps under specified conditions, and it is less important that those conditions are widely present in the world (Mook, 1983). In the second purpose, findings from a single, focused investigation are intended to be generalized, meaning that findings from a study inform our understanding of the world more broadly (Mook, 1983). We acknowledge that no single investigation can fully replicate the complexity of the world, so the critical question is: when can the results of a study be reasonably generalized beyond the study-specific circumstances?

This chapter focuses on one aspect of generalizability: whether results from survey research involving people, groups, or organizations can be applied broadly to other people, groups, or organizations. To answer questions related to generalizability, we must simultaneously address three issues: the research question, the target population, and the study sample. All research questions, whether theory- or practice-driven (Van de Ven, 2007), are intended to address a problem or examine a relationship. The research question concerns a relationship or phenomenon that occurs within or between persons, groups, or organizations, and the population is the entire set of people (or groups or organizations) that the research question addresses. The relevant population is determined by the research question and may differ by project; we refer to a project-specific population as the target population. It is rarely realistic to survey all members of a target population, so researchers identify a set of potential respondents from the target population, a process known as sampling. Whom researchers ask to participate matters because it dictates how adequately the research question can be answered as well as how broadly the study findings can be generalized to the target population. To reasonably generalize conclusions, researchers must justify the similarity of the sample to the



target population and the appropriateness of the sample for the research question. Yet, researchers simultaneously face political, ethical, monetary, and logistical challenges as they seek an appropriate sample. The purpose of this chapter is to acknowledge the sampling challenges researchers face and to specify a priori sampling considerations to facilitate stronger inferences from research findings. First, we review statistical assumptions related to sampling and highlight threats to validity. Then, we describe an intentional and strategic process derived from the principles of purposive sampling (Shadish et  al., 2002) to select a sampling strategy that enhances generalizability. Our process involves using three purposive sampling principles to simultaneously consider a research question and a target population to examine the potential advantages and disadvantages of sampling strategies. We then provide an overview of both probability and nonprobability sampling techniques and explain how to apply the purposive sampling process. Lastly, we offer closing suggestions, exemplar articles, and a checklist for researchers to consider when sampling for survey research.

STATISTICAL ASSUMPTIONS OF SAMPLING AND THREATS TO VALIDITY

Researchers seek to accurately infer findings from the sample and apply them to the target population. The credibility of the results rests, in part, on the underlying statistical assumptions regarding sampling as well as on threats to validity.

Violating Statistical Assumptions

Regression analysis (e.g., Generalized Linear Modeling, Ordinary Least Squares), Structural Equation Modeling (SEM), and Analysis of Variance (ANOVA) are based on two assumptions related to sampling: first, that the residuals of the data are normally distributed, and second, that the residuals are independent of one another (Cohen et al., 2003; Kline, 2016; Weisberg, 2005). These assumptions may be violated when the sample is not randomly selected from the population. Random samples are not haphazardly selected but are chosen in such a way that each population member has an equal chance of being selected. For example, a random selection of college students at a specific university is not a random sample of young adults (i.e., the population).

Nonrandom samples, commonly referred to as convenience samples, violate the second assumption and can result in biased sample statistics, that is, statistics that differ from the true population values (Cohen et al., 2003). Statistical analyses conducted on nonrandom samples can be biased in two ways. First, the effect sizes of relationships (e.g., coefficient estimates in regression) can be biased upwards or downwards. Consequently, the amount of variance explained by the model (e.g., R2), significance tests, or confidence intervals may be biased and differ from true population estimates (Cohen et al., 2003). Second, the coefficients may be unbiased, but the standard errors may be biased; therefore, significance tests and confidence interval estimates may be incorrect (Cohen et al., 2003). No test can reveal whether statistical results are biased, meaning researchers must determine if the sampling assumptions are violated and judge the extent to which these violations and the chosen sampling strategy pose threats to the validity of the findings.
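The risk of biased sample statistics can be made concrete with a small simulation. The following sketch, in standard-library Python with entirely invented numbers, builds a population of two subgroups with different means and compares a random sample against a convenience sample drawn only from the accessible subgroup:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: two subgroups with different mean scores
# (e.g., employees at small vs. large firms; all values are invented).
population = [random.gauss(60, 10) for _ in range(9000)] + \
             [random.gauss(75, 10) for _ in range(1000)]

true_mean = statistics.mean(population)

# Probability sample: every population member has an equal chance.
random_sample = random.sample(population, 500)

# Convenience sample: drawn only from the accessible subgroup (the last
# 1000 members), so most of the population has zero chance of selection.
convenience_sample = random.sample(population[9000:], 500)

print(round(true_mean, 1))
print(round(statistics.mean(random_sample), 1))       # close to true mean
print(round(statistics.mean(convenience_sample), 1))  # biased upward
```

Because the convenience sample gives most population members a zero chance of selection, a larger sample does not close the gap; only changing the selection mechanism does.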

Threats to Validity

In addition to violating statistical assumptions, nonrandom samples can exhibit sampling bias. Sampling bias occurs when characteristics of the sample, such as who is included and the context in which respondents are found, do not accurately represent the population of interest (Cohen et al., 2003). Sampling bias may limit the generalizability of research findings as well as the accuracy of examined relationships, raising validity concerns (see Chapters 15 and 14 on validity and reliability, respectively). External validity refers to how well research findings can be applied to the target population as well as to similar “people, organizations, contexts, or times” (Bhattacherjee, 2012, p. 36). Researchers routinely experience “a tension between the localized nature of research findings provided by a particular study and the more generalized inferences researchers wish to draw from their studies” (Van de Ven, 2007, p. 178). This tension becomes particularly salient when researchers take advantage of convenient opportunities to collect data without adequately considering threats to validity. Researchers develop study designs and select samples they believe are suitable to test proposed relationships between variables. Internal validity refers to causality in the model (Bhattacherjee, 2012): the confidence that the observed relationships between variables are true and not influenced by outside factors. However, if the sample is not adequate for the research question and/or if the survey design is of poor quality, then the findings may be questioned. Experiments are typically viewed as the gold standard for testing causality, and survey designs will always be accompanied by some concern about internal validity. Researchers must be able to address validity concerns to generalize their findings.

IMPROVING SAMPLES USING A PURPOSIVE SAMPLING PROCESS

We borrow our terminology of the purposive sampling process from a nonprobability sampling technique in which researchers intentionally select samples sharing key attributes with the target population or setting, known as purposive sampling or judgment sampling (Bhattacherjee, 2012; Shadish et al., 2002). However, rather than limiting this technique to a single sampling strategy, we expand on its logic and articulate how the principles of purposive sampling can be used to select an appropriate sample regardless of the sampling technique. Researchers seek sampling strategies that are appropriate for their research question: “Inherent in the concept of representativeness is the notion that care must be taken a priori to identify the population that the sample is intended to represent” (Short et al., 2002, p. 366). We suggest that applying the principles of purposive sampling before collecting data will facilitate selecting a sample that poses fewer threats to validity and increases generalizability. Below we briefly articulate three principles of a purposive sampling process and then illustrate the three principles as we discuss various types of sampling. We offer a set of guidelines for implementing purposive sampling while adhering to these three principles in Table 6.1.

Principles of the Purposive Sampling Process

Principle 1: Aligning the research question, target population, and sample

Findings from tests of a research question should apply to a target population, but the target population may vary depending on the research question. For example, findings might be expected to generalize to all full-time adult workers, or only to high-level strategic decision makers of very large for-profit organizations (e.g., top management teams), or to pregnant employees without government-funded childcare options, or to entrepreneurs in economies with punitive bankruptcy laws. The target population may be defined by characteristics of the individuals or organizations, by the nature of the economic, social, or political context, by time, or by any other relevant descriptor. The critical point is to have strong reason to believe that the relationship or phenomenon under study occurs within the bounds of the target population. Once a target population is identified, researchers consider how to select a sample that adequately represents the target population by creating a representative sample with “approximately the characteristics of the population relevant to the research in question” (Kerlinger, 1986, p. 111). By strategically selecting a sample that is representative of the target population (Kruskal & Mosteller, 1979), researchers can increase their ability to make accurate inferences and generalize study findings.

Table 6.1  Guidelines for implementing the purposive sampling process

1. Align the research question, target population, and sample
   a. Consider how questions might be specific to a particular time, context, or type of person
   b. Determine what would be the ideal target population to answer the research question
2. Determine where variance is needed
   a. Justify where variance is needed theoretically, empirically, and substantively
   b. There should be similarity between key characteristics of the target population and study sample
   c. There should be variance in variables included in the research model
   d. Include variables not explicitly in the model that might affect the focal relationship (i.e., the omitted or missing variable problem)
   e. Explain why a control variable might be needed (evidence or a logical reason)
3. Determine what differences between the samples and target population are irrelevant
   a. Justify irrelevant differences theoretically, empirically, and substantively
   b. Differences that do not change the generalizability of the findings are irrelevant
   c. Differences that do not change the direction or strength of the hypothesized relationships are irrelevant



Published papers rarely explicitly identify the target population. However, the match between the target population and the sample is implicitly addressed when authors make claims about the generalizability of their findings or when these claims are debated in the review process with editors and reviewers. Defining who or what is in the target population clarifies the task of aligning the target population with the research question and possible samples.

Principle 2: Account for needed variance

It is not necessary that the sample fully match the target population on all measurable characteristics. Instead, researchers must determine the key characteristics on which the sample must correspond to the target population to be able to test the research question. To test relationships among variables, it is necessary for them to vary. Variance is needed in the variables that are specifically named in the research model and in the variables that may influence the strength of those relationships (e.g., moderators, control variables, spurious variables). The sample should contain observations that represent a large, preferably full, range of the continuum on which these variables are measured so that relationships can be accurately assessed. For example, a researcher interested in the effect of technology use (e.g., smartboards, laptops, etc.) on student performance outcomes might sample from multiple school districts to ensure variance in the focal construct, technology. If a single school district were sampled, district-specific resources and education practices may limit the variance found in reported technology use. Additionally, researchers should consider what variables not explicitly included in the model might change the direction or strength of the relationships between model variables (sometimes referred to as the omitted or missing variable problem, or endogeneity; James, 1980; Shaver, 2020). One solution to preemptively account for such variables is to include them in the model, perhaps as moderators, mediators, or control variables (Becker, 2005; Carlson & Wu, 2012). The researcher can then rule out the possibility that the omitted variable accounts for the focal relationship(s) in the model. For example, imagine a researcher is interested in the relationship between student performance, parental involvement, and class size.
It is likely that student general intelligence plays a role in the proposed relationships, so it may be included as a moderator or control variable in the model. In short, the principle of purposive sampling to identify where variance is needed accounts for both variables directly in the model and variables that may influence the model. Conducting multiple studies with diverse samples can be a useful strategy for offsetting the limitations of a specific sample where not all needed variance can be accounted for.1 For example, researchers interested in the effect of intrinsic motivation on student performance may first decide to conduct a study in a specific context or subset of the target population to see if the phenomenon exists, such as a lab or a single classroom, before replicating the study in a larger, broader sample. Alternatively, the researchers could first conduct the study using cross-sectional data before later replicating the study using longitudinal data (Ployhart & Vandenberg, 2010; Spector, 2019).
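The cost of insufficient variance can be shown numerically. The sketch below, in standard-library Python with wholly invented numbers, mimics the school-district example: when technology use is restricted to a narrow band, as might happen within a single district, the observed correlation with performance is attenuated relative to a full-range sample:

```python
import random
import statistics

random.seed(5)

def correlation(xs, ys):
    """Pearson correlation using population standard deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

# Hypothetical data: technology use partly drives performance.
tech = [random.uniform(0, 10) for _ in range(2000)]
perf = [0.5 * t + random.gauss(0, 2) for t in tech]

# Full-range sample (multiple districts) vs. a range-restricted sample
# (a single district where technology use clusters in a narrow band).
full = list(range(2000))
narrow = [i for i in full if 4 <= tech[i] <= 6]

r_full = correlation([tech[i] for i in full], [perf[i] for i in full])
r_narrow = correlation([tech[i] for i in narrow], [perf[i] for i in narrow])

print(round(r_full, 2), round(r_narrow, 2))  # restricted range attenuates r
```

The underlying relationship is identical in both samples; only the available variance differs, which is why the principle asks researchers to secure a wide range on focal variables before data collection.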

Principle 3: Identify irrelevancies

In addition to determining where variance is needed, researchers should consider where variance is unnecessary. Characteristics of the sample or setting that differ from those of the target population – but are unlikely to influence or change the direction and strength of the relationships between variables of interest – should be identified with strong theoretical or logical reasoning, or with prior empirical evidence (Shadish et al., 2002). These variables are irrelevant to the research question because they do not threaten the generalizability of the findings, and it is not necessary for the sample to match the target population on these characteristics. Just as researchers need to offer theoretical, substantive, or logical reasoning for where variance is needed (Principle 2), researchers need to similarly justify any irrelevant differences. For example, consider a research question on reactions to abusive supervision at work. Suppose theoretical and logical arguments, in combination with prior empirical findings, rule out the possibility that employees' age influences how people respond to abusive supervision. If there is no theoretical reason or empirical evidence to suggest such a difference, then a lack of variance in respondent age in the sample should not influence results.

TYPES OF SAMPLING TECHNIQUES

We now describe various types of probability and nonprobability sampling and how the purposive sampling process can be used to strengthen such samples. While probability sampling is often considered the ideal for generalizing findings, within social science research such samples are often infeasible. Therefore, we discuss probability samples more briefly and focus instead on nonprobability samples, commonly referred to as convenience samples. We recommend researchers apply the three principles of purposive sampling to increase the generalizability of whichever sampling techniques are feasible.

Probability Sampling

Probability samples have two main attributes: first, each unit of observation (e.g., a person or organization) in the population has a known, non-zero probability of being selected, and second, each unit in the population could be selected into the sample via a random process (Bhattacherjee, 2012). Because the sample is selected by a random draw, a sufficiently large sample is likely to closely represent the population. Probability sampling is the strongest strategy because inferences and generalizations from the sample should apply to the target population. Unique chance variation in the sample might threaten validity, but this is unlikely when sample sizes are adequately large. For these reasons, probability sampling does not violate the assumptions of random sampling inherent to regression, SEM, and ANOVA. Despite these strengths, probability sampling is not without limitations. Probability sampling rests on the assumption that the census of the population is complete and that the population contact information is accurate. Depending on the research question and target population, this assumption may not hold. For instance, researchers seeking to survey nascent entrepreneurs cannot use probability sampling because there is no comprehensive list of people who have begun taking initial steps towards starting a business. Likewise, lists of businesses engaged in landscaping services in a geographic area will be riddled with organizations that have gone out of business and will omit many that have recently entered the business, because this industry has low barriers to entry and exit. No census exists for many, if not most, possible target populations of people; for instance, there is no census of people who have worked outside of their home country for a multinational organization, or of people who have received vocational training in machine trades.
We now depict common types of probability sampling strategies using the example of scholars interested in predictors of voter turnout. The research question specifically pertains to voters; however, the target population could be specified as registered voters in the U.S. The selected sample should align with both the research question and the characteristics of the target population to generalize the study findings (Principle 1), regardless of the adopted sampling strategy. Variance is needed along key characteristics of the model (e.g., demographics, socioeconomic status; Principle 2), and the researchers should justify any irrelevant differences between the sample and target population (i.e., differences that do not change the strength or direction of relationships or change the generalizability of findings; Principle 3). The researchers could adopt a simple random sampling strategy where each registered voter in the entire target population has an equal chance of being selected (Bhattacherjee, 2012). Alternatively, the researchers could take an ordered list of the target population (i.e., registered voters in the U.S.) and select every voter in the nth position, known as systematic random sampling (Bhattacherjee, 2012). In systematic random sampling, it is important that the researchers do not simply start at the top of the list but use a table of random numbers or a random number generator to determine the starting point. In another strategy, researchers may only have access to certain segments of the target population, referred to as clusters, and so they randomly draw from those segments, known as cluster sampling (Bhattacherjee, 2012). For example, the researchers may only have access to certain state or county registration lists and thus draw a random sample of voters from those clusters. If the researchers wanted to compare two groups, say male versus female voters, they could randomly select voters from each group, known as stratified random sampling (Bhattacherjee, 2012). This strategy is particularly useful when comparing groups that are naturally occurring and difficult to manipulate, such as sex (Van de Ven, 2007).
Alternatively, if the researchers wanted to control for a particular factor or dimension, perhaps urban versus rural voters, they could utilize a matched pairs sampling strategy in which, within the randomly drawn sample, otherwise similar voters (e.g., similar age, sex, income, education, etc.) are compared (Bhattacherjee, 2012). This type of strategy controls for the factors that are not of interest (e.g., age, sex, income, education) and focuses on the dimension of interest (urban versus rural voters). The previously described probability sampling techniques are known as single-stage sampling techniques, meaning one sampling technique was applied to draw the final sample. However, it may make sense to randomly select respondents by using multiple sampling techniques, known as multi-stage sampling (Bhattacherjee, 2012). For example, researchers may first use a cluster sampling strategy to randomly select states, then a systematic sampling strategy to select districts, and finally, a simple random strategy to select voters. This is an example of a three-tier multi-stage sampling process using cluster, systematic, and simple random sampling strategies.
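The selection mechanics of these strategies can be sketched in a few lines of standard-library Python. The voter roster, group boundaries, cluster sizes, and sample sizes below are all invented for illustration; a real study would draw from actual registration lists:

```python
import random

random.seed(7)

# Hypothetical ordered roster of registered voters (IDs are invented).
voters = [f"voter_{i:05d}" for i in range(10_000)]

# Simple random sampling: every voter has an equal chance of selection.
simple = random.sample(voters, 100)

# Systematic random sampling: every nth voter, starting from a randomly
# chosen position (never simply the top of the list).
n = len(voters) // 100
start = random.randrange(n)
systematic = voters[start::n][:100]

# Stratified random sampling: separate random draws within each group.
strata = {"group_a": voters[:6000], "group_b": voters[6000:]}
stratified = [v for members in strata.values()
              for v in random.sample(members, 50)]

# Multi-stage sampling: randomly select accessible segments (clusters),
# then draw a simple random sample within each selected cluster.
clusters = [voters[i:i + 1000] for i in range(0, 10_000, 1000)]
accessible = random.sample(clusters, 3)          # stage 1: pick clusters
multistage = [v for c in accessible
              for v in random.sample(c, 30)]     # stage 2: simple random

print(len(simple), len(systematic), len(stratified), len(multistage))
```

Each strategy differs only in how selection probabilities are structured; all of them keep every draw random, which is what preserves the statistical assumptions discussed earlier.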

Nonprobability Sampling

In nonprobability sampling, some members of the population have a zero chance of being selected for participation, and respondents are selected based on non-random criteria, such as researcher accessibility (Bhattacherjee, 2012). Nonprobability sampling, or convenience sampling, is likely the most common form of sampling in the social sciences. Since all convenience samples are nonprobability samples, we use the terms interchangeably.2 A strength of convenience sampling is its accessibility, because it implies taking advantage of local resources, such as surveying undergraduates or local businesses with university ties. Yet the term convenience sampling does not necessarily mean the sampling strategy is easy. For example, researchers might go to a fire station to persuade firefighters to complete surveys, approach people waiting in line (e.g., to renew drivers' licenses or to enter a concert), or survey pedestrians traversing a city business district. Such methods require significant commitments of time and effort by the researchers. The greatest weakness of convenience samples is the question of whether findings from the sample can be generalized to the target population. Using the purposive sampling process can mitigate this concern by ruling out ungeneralizable samples during the planning stages. Because each variation of convenience sampling is obtained by means that violate general principles of random sampling, each convenience sample differs in some ways from the relevant target population. Note that threats to external and internal validity may become magnified when studies based on convenience samples are combined in meta-analyses (Cohen et al., 2003; Short et al., 2002).
Although meta-analyses can wash away abnormalities or chance variation within a data set, if there are consistent errors in parameter estimates due to nonprobability sampling, then those issues will still be present in the meta-analytic correlations. For example, a pattern of sampling biases in terms of persons (e.g., samples of white, middle-class people), treatment, setting, or outcome will be overstated in meta-analytic correlations (Shadish et al., 2002). By incorporating the three principles of purposive sampling when selecting a nonprobability sample, researchers can strengthen the generalizability of their findings. We first explain the use of convenience sampling in three frequently used convenience samples (organizations, students, and panel data). Then, we review common types of convenience sampling strategies (snowball sampling, quota sampling, and expert sampling); see Table 6.2 for exemplar papers using purposive sampling for these techniques.

Organizations, students, and panel samples for survey data

Researchers often are able to gain access to organizations by leveraging personal contacts. In these situations, organizational decision makers grant researchers the opportunity to survey employees, customers, students, etc. This convenience sampling technique has resulted in some well-known, foundational findings in the field of management, including the Hawthorne studies, Greenberg's (1990) quasi-experimental study on the relationship between justice and employee theft, and Smith's (1977) study on job attitudes and absenteeism during a Chicago snowstorm, among others (Mayer & Gavin, 2005; Ross & Staw, 1993). When researchers utilize organizational samples, they must ensure that the target population of the research question aligns with that organizational sample (e.g., the type of managers, customers, etc., within the organization). Moreover, they must ensure there is adequate variance in key variables and justify irrelevant differences between the sample and target population to offer evidence of external validity. Although organizations remain a common source of data, due to increasing political and logistical hurdles in gaining access to organizations, more researchers are turning to two other forms of convenience samples: students and panel data. There has been a great deal of criticism directed towards the use of university undergraduate samples. Student samples have been characterized as WEIRD (Western, Educated, from Industrialized nations, Rich, and Democratic; Henrich et al., 2010), perhaps seriously limiting the generalizability of findings. However, students are also people, and they may share many characteristics with the broader target population. For example, using U.S. university students to test job satisfaction in adults can be appropriate when there is evidence that the selected students are employed. However, students tend to work in low-wage jobs, which may not align with the focus of the research question.
In another example, using university students to test working memory in adults can be appropriate when there is reason to believe that cognitive processes are fundamentally the same for students as for older adults. However, using a student sample to test how adults make financial planning decisions related to retirement may not be appropriate because young adults' experience with financial decision making is likely limited. We suggest that when the research question is aligned with the target population and applicable to student samples (Principle 1), and when needed variance and irrelevant differences between student samples and the target population are considered (Principles 2 and 3), student samples may be appropriate and results generalizable.

One increasingly popular outlet for collecting survey data is online panels (e.g., Qualtrics, CloudResearch, MTurk, Prolific). For relatively low fees, researchers can quickly collect data from people who have registered with a platform to participate in activities in exchange for monetary compensation. On some platforms (such as CloudResearch), researchers specify who can participate (e.g., specifying an age range or type of industry). Decisions regarding what restrictions to specify should be made with the principles of purposive sampling in mind: what is the needed variance, and what differences are irrelevant between the target population and sample? For example, when the research question is grounded in the context of communication gestures among North American English speakers, it is reasonable to open the survey only to English speakers in Mexico, Canada, and the U.S. In support of convenience samples derived from panel data, recent evidence in management has shown no difference in the meta-analytic correlations between samples of working adults from field and panel data samples (Walter et al., 2022). That is, logistically convenient data (i.e., panel data) can offer valid insights about the target population and answer some research questions when used correctly. Concurrently, we agree with prior work (Aguinis et al., 2021; Landers & Behrend, 2015) that researchers should carefully consider the generalizability of panel data samples (e.g., appropriateness for answering the research question and alignment with the target population).

Table 6.2  Exemplars of purposive sampling in nonprobability samples

Convenience: Organizational. O'Neill et al. (2017). Research question: Does variation in emotional culture play a role in exacerbating or attenuating organizational problems typically associated with masculinity? Target population: Employees in masculine cultures. Sample: U.S. firefighters. Other notes: Study 1 is qualitative; Study 2 is quantitative.

Convenience: Panel data. Fu et al. (2020). Research question: How do anxiety responses change due to prolonged exposure to stressors?* Target population: Employees. Sample: U.S. full-time employees. Other notes: Experience sampling methodology appropriately aligned with the research question.

Convenience: Student. Shin & Grant (2019). Research question: How do high intrinsic motives on one task affect performance on other tasks?* Target population: U.S. adults. Sample: U.S. college students. Other notes: Study 2 is a lab experiment to supplement findings from Study 1 (a field experiment in South Korea).

Expert. Klein et al. (2014). Research question: How well do the scale items reflect the construct definition? Target population: Subject matter experts (SMEs). Sample: 3 samples of SMEs. Other notes: SMEs used for item generation, item selection and refinement, and content adequacy and distinctiveness.

Proportional quota. Gelbrich (2010). Research question: What is the role of helplessness in explaining coping responses to anger and frustration after service failure (i.e., when customer perceptions fail to meet expectations of services delivered)?* Target population: German consumers. Sample: German hotel guests. Other notes: Proportional quotas used to reflect the age and gender groups of German consumers published in a report.

Snowball. Gooty et al. (2019). Research question: Is positive emotional tone for both parties higher when both parties are in agreement at a high level of LMX than when both parties are in agreement at a low level of LMX?† Target population: Supervisor-subordinate dyads. Sample: U.S. dyads. Other notes: Used snowball sampling where students referred subordinate-supervisor dyads.

Here we provide a short list of select exemplar articles that appropriately utilized the purposive sampling process in convenience samples, incorporating the research question into intentional sample selection that is generalizable to the target population.

Note: *Not explicitly stated in manuscript, but inferred by the author team. †One of two research questions.

Snowball Sampling

Although convenience sampling refers broadly to collecting data from accessible sources, there are some commonly used variations. When researchers identify a few respondents who match the target population and then ask those respondents to refer others who meet the qualifying criteria, it is known as snowball sampling. This technique is particularly useful for identifying difficult-to-reach populations, such as disabled persons, financial officers in a specific industry, or members of small, unique, or marginalized communities. Another variation, which combines access to students in a university setting with snowball sampling, involves asking students to contact people in their personal network (e.g., family members or friends) who meet the study qualifying criteria with a request from researchers. People who respond to the request can then be invited to participate in the survey. Snowball sampling effectively takes advantage of people's social networks but can simultaneously suffer from limitations imposed by those networks. People omitted from such networks may be substantially different from the final sample in unknown or unmeasured ways. Additionally, snowball sampling can violate the regression assumption of independence of residuals, that is, the assumption that a single observation or person does not influence another (Cohen et al., 2003; Field, 2015; Weisberg, 2005). Treating non-independent data as independent can underestimate standard errors (Bliese & Hanges, 2004; Klein & Kozlowski, 2000). This issue is particularly relevant when respondents are asked to rate each other, known as multi-source research (Marcus et al., 2017). For example, using a snowball sample to examine exercise habits within a community is likely to influence results because respondents may recruit people with whom they share similar exercise habits (e.g., going to the same gym or belonging to the same club sport), restricting variance in the focal variable, exercise habits. However, the same reason why nonindependence can be an issue with snowball sampling (i.e., people rating each other) might be the reason why snowball sampling is appropriate for certain research questions, such as dyadic research questions. For example, newlyweds can be recruited via a panel to take a survey and then recruit their spouses. This sample may be more generalizable than if the dyads were collected from a single religious institution or courthouse because the respondents' backgrounds and values will vary.
Multilevel modeling can account for nonindependence and nesting in this type of snowball sampling design (Bliese & Hanges, 2004; Gooty & Yammarino, 2011; Klein & Kozlowski, 2000; Krasikova & LeBreton, 2012).
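The referral mechanics of snowball sampling can be sketched in standard-library Python. The network, seed respondents, and target size below are all invented; a real study would also record who referred whom so that nonindependence can be modeled later:

```python
import random
from collections import deque

random.seed(3)

# Hypothetical referral network: each qualified respondent can name a
# few acquaintances who also meet the qualifying criteria.
referrals = {i: random.sample(range(200), k=3) for i in range(200)}

def snowball(seeds, target_n):
    """Recruit breadth-first via referrals until target_n is reached
    or the referral chains run out."""
    recruited, queue = set(seeds), deque(seeds)
    while queue and len(recruited) < target_n:
        person = queue.popleft()
        for contact in referrals[person]:
            if contact not in recruited and len(recruited) < target_n:
                recruited.add(contact)
                queue.append(contact)
    return recruited

sample = snowball(seeds=[0, 1], target_n=50)
print(len(sample))
```

The sketch makes the chapter's caveat visible: anyone unreachable from the seed respondents' networks has a zero chance of selection, no matter how large the target sample is.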

Quota or Proportional Quota Sampling

When researchers continue to select respondents representative of a target population until a target sample size is reached, it is known as quota sampling (Bhattacherjee, 2012). Samples from online panels are often quota samples because the data collection concludes after a preselected number of responses has been collected. In such samples, there is no response rate because data are collected until a target sample size is reached. In a variation of quota sampling, similar to stratified probability sampling, the researcher can split the target population into groups and specifically replicate the proportion of subgroups in the population with their sample, referred to as proportional quota sampling. This is particularly useful when the research question requires specific demographics. For example, researchers interested in drug users in the U.S. might be concerned that their sample will disproportionately favor a single ethnicity, and thus consider strategies to ensure that the sample is composed of all racial and ethnic subgroups in numbers proportional to the target population. In accordance with the principles of purposive sampling, the appropriate proportional breakdown of the sample should align with the research question and target population while providing necessary variance and justifying irrelevant differences.
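To make the arithmetic of proportional quotas concrete, per-group targets can be computed directly from known population shares. The sketch below is a minimal illustration with hypothetical subgroup shares (not drawn from any census); it uses largest-remainder rounding so that the quotas sum exactly to the planned sample size.

```python
def quota_targets(population_shares, total_n):
    """Proportional quota targets per subgroup.

    population_shares: dict mapping subgroup -> share of the target
    population (shares should sum to 1). Largest-remainder rounding
    guarantees the integer targets sum exactly to total_n.
    """
    raw = {group: share * total_n for group, share in population_shares.items()}
    targets = {group: int(value) for group, value in raw.items()}
    leftover = total_n - sum(targets.values())
    # Hand any leftover slots to the groups with the largest fractional parts.
    by_fraction = sorted(raw, key=lambda g: raw[g] - targets[g], reverse=True)
    for group in by_fraction[:leftover]:
        targets[group] += 1
    return targets

# Hypothetical shares for three subgroups and a planned sample of 101:
print(quota_targets({"A": 0.5, "B": 0.3, "C": 0.2}, 101))
# -> {'A': 51, 'B': 30, 'C': 20}
```

Data collection would then close each subgroup independently once its target count is reached, which is what distinguishes proportional quota sampling from simply stopping at an overall total.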

Expert Sampling

When researchers select respondents based on their expertise, it is known as expert sampling. This is useful when the opinion of content-area experts is necessary to answer a research question. For example, a researcher interested in providing evidence for the content validity of a new measure may ask subject matter experts to rate how similar the proposed items are to the construct's definition.3 Or a researcher interested in compiling a list of recommended public health practices during a pandemic may survey epidemiologists directly.


PURPOSIVE SAMPLING IN QUALITATIVE RESEARCH

This chapter has focused on sampling for surveys in quantitative research; however, survey designs may also be used for qualitative research (see Chapters 27–29). There are two notable differences between sampling for quantitative versus qualitative research. First, qualitative research tends to have smaller sample sizes than quantitative research (Charmaz, 2006; Creswell & Creswell, 2018; Miles et al., 2020). Qualitative case studies examine a single, in-depth case such as a person, specific group, or particular event. Second, qualitative research tends to be more intentional with sample selection compared to quantitative research, particularly regarding the context of the sample. This is because, when considering the focal phenomenon of interest in a qualitative study, the context drives the selection of participants as a source for understanding a particular phenomenon (Tracy, 2010; Tracy & Hinrichs, 2017). With qualitative research, the composition of the sample can evolve during the research process so that "the initial choice of participants lead you to similar or different ones, observing one class of events invites comparison with another," known as sequential sampling (Miles et al., 2020, p. 27). For example, if you are interested in the well-being of medical workers, you may first bound your sample to a single occupation (e.g., physicians) or hospital and then include medical workers from another occupation (e.g., nurses, pharmacists, etc.) or hospital to compare your findings. Sequential sampling can be used to intentionally uncover both unit-specific and general knowledge of well-being. For more direction on how to build qualitative samples, see Erickson (1986), Patton (2014), and Miles et al. (2020).

HURDLES IN SAMPLING

We recognize that there are various hurdles that can arise during the sample selection process (see Table 6.3), each of which may influence the generalizability of results. First, there are ethical challenges associated with accessing data from certain populations, such as minors or undocumented workers. In addition to issues addressed by

Table 6.3  Hurdles in sampling

Ethical: Challenges related to ethics of accessing certain populations (e.g., minors, undocumented workers) or issues raised by review boards (e.g., deception).
Socio-political: Challenges with working with organizations or institutions (e.g., legal or HR barriers, resistance to obtaining data pertaining to sensitive topics, data privacy concerns, ownership of data).
Monetary: Challenges related to securing funding for collecting data (e.g., competitive compensation incentives, fees for panel data, costs of multiple studies).
Logistical: Challenges related to time constraints and difficulties accessing specific populations (e.g., those without internet, the homeless) or accessing enough people (e.g., sample size concerns).
university Institutional Review Boards, other ethical issues may arise. For instance, surveying students at a community college that does not require evidence of legal residency for enrollment may unintentionally draw the researchers into an ongoing conversation about undocumented residents. Second, there are social and political challenges related to gaining access to certain organizations. Researchers may leverage personal connections to access data, such as relationships with organizational leadership to survey employees. However, as organizations and people are increasingly presented with survey opportunities (Tourangeau et al., 2013) and employers are wary of setting aside time for surveying, it may take years to arrange for data collection. Moreover, such contacts may resist cooperating with researchers due to privacy concerns or when investigations involve potentially sensitive topics, such as deviant employee behaviors. Third, there are monetary expenses related to collecting data. Panel data services have surcharges, and some target populations otherwise suitable for data collection may not be incentivized by small monetary compensation (e.g., high-level managers and executives tend not to respond to offers of compensation for taking surveys; see Chapter 22 for a discussion on incentivizing participation). Additionally, as the standard for publishing in top-tier journals has risen (e.g., multiple studies or complex designs such as experience sampling), the funding necessary to publish in top-tier journals has increased as well. In short, adequate sampling has costs that should be acknowledged. Finally, there are logistical challenges, including time constraints and difficulties accessing specific people or organizations (e.g., surveying those without internet, the homeless, etc.) or accessing enough people for an adequate sample size.4 We realize these obstacles may be deciding factors when selecting samples.
Therefore, we encourage fellow survey researchers to improve the sampling methodologies that are feasible for them via the principles of purposive sampling so that samples are more representative of the target population, increasing the generalizability of findings.

CLOSING SUGGESTIONS

Most social science survey research is based on convenience samples, yet "the social science literature [has] described many examples where the conclusions of studies were dramatically altered when proper sampling methods were used" (Visser et al., 2014, p. 410). This raises the question: when can the results of a study be reasonably generalized beyond the study-specific circumstances? We do not aim to undermine all convenience samples or shame the researchers who use them. Rather, we suggest researchers use the purposive sampling process to align their research question with the target population and sample (Principle 1), as well as consider where there is necessary variance (Principle 2) and any irrelevant differences between the sample and target population (Principle 3), before selecting a sample to obtain more generalizable data. In doing so, we underscore the importance of aligning the research question, target population, and sample selection while simultaneously identifying necessary variance and irrelevant differences between the target population and sample, resulting in stronger, more defensible research findings.

Notes

1. If multiple samples are used in the same study, conduct measurement equivalence and invariance tests (see Chapter 13; Cheung & Lau, 2012; Tay et al., 2015; Vandenberg, 2002; Vandenberg & Lance, 2000).
2. Although all convenience samples are nonprobability samples, not all nonprobability samples are convenience samples, as there is theoretical sampling (Breckenridge & Jones, 2009; Glaser & Strauss, 2017). Because theoretical sampling is based on grounded theory research, which aims to show that a phenomenon can occur and for which generalizability is less important, we omit this method from our discussion.
3. Notably, recent guidelines suggest using both subject matter experts and members of the target population to establish content validity (Colquitt et al., 2019).
4. We acknowledge that we have not provided any guidance regarding desirable sample sizes. Any rules of thumb for sample sizes are coarse approximations resting on assumptions that are often unstated. Instead, we urge researchers to conduct a priori power analyses before collecting their data that incorporate information about hypothesized effect sizes, the number of estimated parameters, significance level (α), and desired power (Cashen & Geiger, 2004; Cohen, 1992; MacCallum et al., 1996; Murphy & Myors, 2004). The study design, hypothesized model, and a priori power analysis should ultimately determine the necessary sample size, rather than the type of sampling technique.
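For the simplest case, a two-group mean comparison, the a priori logic described in note 4 can be sketched with the normal approximation to the two-sample t-test. This is a rough planning tool, not a substitute for design-specific power software, and the effect sizes below are Cohen's d benchmarks; the normal approximation slightly understates the exact t-test requirement.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample mean comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    rounded up. Exact t-test requirements are marginally larger.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Cohen's (1992) benchmarks: small d = 0.2, medium d = 0.5
print(n_per_group(0.2))  # -> 393 per group
print(n_per_group(0.5))  # -> 63 per group
```

The sharp growth in required n as the hypothesized effect shrinks is exactly why the chapter urges a priori power analysis rather than rules of thumb tied to the sampling technique.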

Sampling Considerations for Survey Research

REFERENCES

Aguinis, H., Villamor, I., & Ramani, R. S. (2021). MTurk research: Review and recommendations. Journal of Management, 47(4), 823–7.
Becker, T. E. (2005). Potential problems in the statistical control of variables in organizational research: A qualitative analysis with recommendations. Organizational Research Methods, 8(3), 274–89.
Bhattacherjee, A. (2012). Social science research: Principles, methods, and practices (2nd ed.). Creative Commons Attribution.
Bliese, P. D., & Hanges, P. J. (2004). Being both too liberal and too conservative: The perils of treating grouped data as though they were independent. Organizational Research Methods, 7(4), 400–17.
Breckenridge, J., & Jones, D. (2009). Demystifying theoretical sampling in grounded theory research. Grounded Theory Review, 8(2), 113–26.
Carlson, K. D., & Wu, J. (2012). The illusion of statistical control: Control variable practice in management research. Organizational Research Methods, 15(3), 413–35.
Cashen, L. H., & Geiger, S. W. (2004). Statistical power and the testing of null hypotheses: A review of contemporary management research and recommendations for future studies. Organizational Research Methods, 7(2), 151–67.
Charmaz, K. (2006). Constructing grounded theory: A practical guide through qualitative analysis. Sage.
Cheung, G. W., & Lau, R. S. (2012). A direct comparison approach for testing measurement invariance. Organizational Research Methods, 15(2), 167–98.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–9.
Cohen, J., Cohen, P., West, S., & Aiken, L. (2003). Applied multiple regression/correlation analysis for the social sciences (3rd ed.). Routledge.
Colquitt, J. A., Sabey, T. B., Rodell, J. B., & Hill, E. T. (2019). Content validation guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness. Journal of Applied Psychology, 104(10), 1243.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and mixed methods approaches (5th ed.). Sage.
Erickson, F. (1986). Qualitative methods in research on teaching. In M. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 119–61). Macmillan.
Field, A. (2015). Discovering statistics using IBM SPSS statistics (4th ed.). Sage.
Fu, S. Q., Greco, L. M., Lennard, A. C., & Dimotakis, N. (2020). Anxiety responses to the unfolding COVID-19 crisis: Patterns of change in the experience of prolonged exposure to stressors. Journal of Applied Psychology, 106, 48–61.
Gelbrich, K. (2010). Anger, frustration, and helplessness after service failure: Coping strategies and effective informational support. Journal of the Academy of Marketing Science, 38(5), 567–85.
Glaser, B. G., & Strauss, A. L. (2017). Theoretical sampling. In Sociological methods (pp. 105–14). Routledge.
Gooty, J., & Yammarino, F. J. (2011). Dyads in organizational research: Conceptual issues and multilevel analyses. Organizational Research Methods, 14(3), 456–83.
Gooty, J., Thomas, J. S., Yammarino, F. J., Kim, J., & Medaugh, M. (2019). Positive and negative emotional tone convergence: An empirical examination of associations with leader and follower LMX. The Leadership Quarterly, 30(4), 427–39.
Greenberg, J. (1990). Employee theft as a reaction to underpayment inequity: The hidden cost of pay cuts. Journal of Applied Psychology, 75(5), 561–8.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29.
James, L. R. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65, 415–21.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). Harcourt Brace College Publishers.
Klein, K. J., & Kozlowski, S. W. (2000). From micro to meso: Critical steps in conceptualizing and conducting multilevel research. Organizational Research Methods, 3(3), 211–36.
Kline, R. B. (2016). Principles and practice of structural equation modeling. Guilford Publications.
Krasikova, D. V., & LeBreton, J. M. (2012). Just the two of us: Misalignment of theory and methods in examining dyadic phenomena. Journal of Applied Psychology, 97(4), 739–57.
Kruskal, W., & Mosteller, F. (1979). Representative sampling, I: Non-scientific literature. International Statistical Review/Revue Internationale de Statistique, 13–24.
Landers, R. N., & Behrend, T. S. (2015). An inconvenient truth: Arbitrary distinctions between organizational, Mechanical Turk, and other convenience samples. Industrial and Organizational Psychology, 8(2), 142–64.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1(2), 130–49.
Marcus, B., Weigelt, O., Hergert, J., Gurt, J., & Gelléri, P. (2017). The use of snowball sampling for multi source organizational research: Some cause for concern. Personnel Psychology, 70(3), 635–73.
Mayer, R. C., & Gavin, M. B. (2005). Trust in management and performance: Who minds the shop while the employees watch the boss? Academy of Management Journal, 48(5), 874–88.
Miles, M., Huberman, A. M., & Saldana, J. (2020). Qualitative data analysis: A methods sourcebook (4th ed.). Sage.
Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38(4), 379–87.
Murphy, K. R., & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). Lawrence Erlbaum Associates.
O'Neill, O. A., & Rothbard, N. P. (2017). Is love all you need? The effects of emotional culture, suppression, and work–family conflict on firefighter risk-taking and health. Academy of Management Journal, 60(1), 78–108.
Patton, M. Q. (2014). Qualitative research & evaluation methods: Integrating theory and practice. Sage.
Ployhart, R. E., & Vandenberg, R. J. (2010). Longitudinal research: The theory, design, and analysis of change. Journal of Management, 36(1), 94–120.
Ross, J., & Staw, B. M. (1993). Organizational escalation and exit: Lessons from the Shoreham nuclear power plant. Academy of Management Journal, 36(4), 701–32.
Shadish, W., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Wadsworth Cengage Learning.
Shaver, J. M. (2020). Causal identification through a cumulative body of research in the study of strategy and organizations. Journal of Management, 46(7), 1244–56. doi:10.1177/0149206319846272
Short, J. C., Ketchen, D. J., Jr., & Palmer, T. B. (2002). The role of sampling in strategic management research on performance: A two-study analysis. Journal of Management, 28(3), 363–85.
Smith, F. J. (1977). Work attitudes as predictors of attendance on a specific day. Journal of Applied Psychology, 62(1), 16–19.
Spector, P. E. (2019). Do not cross me: Optimizing the use of cross-sectional designs. Journal of Business and Psychology, 34(2), 125–37.
Tay, L., Meade, A. W., & Cao, M. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1), 3–46.
Tourangeau, R., Conrad, F. G., & Couper, M. P. (2013). The science of web surveys. Oxford University Press.
Tracy, S. J. (2010). Qualitative quality: Eight "big-tent" criteria for excellent qualitative research. Qualitative Inquiry, 16(10), 837–51.
Tracy, S. J., & Hinrichs, M. M. (2017). Big tent criteria for qualitative quality. The International Encyclopedia of Communication Research Methods, 1–10.
Van de Ven, A. H. (2007). Engaged scholarship: A guide for organizational and social research. Oxford University Press.
Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5(2), 139–58.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.
Visser, P., Krosnick, J., Lavrakas, P., & Kim, N. (2014). Survey research. In H. T. Reis & C. M. Judd (Eds.), The handbook of research methods in social and personality psychology (2nd ed., pp. 404–42). Cambridge University Press.
Walter, S. L., Seibert, S. E., Goering, D., & O'Boyle, E. H., Jr. (2022). A tale of two sample sources: Do results from online panel data and conventional data converge? In Key topics in consumer behavior (pp. 75–102). Springer Nature Switzerland.
Weisberg, S. (2005). Applied linear regression (3rd ed.). Wiley.

7 Inductive Survey Research
Kate Albrecht and Estelle Archibold

INTRODUCTION

Inductive research deploys a variety of data generation approaches, such as semi-structured interviews, focus groups, observations, and surveys. This chapter focuses on the use of surveys for inductive research. Inductive surveys, sometimes referred to as qualitative surveys, are sets of varied, open-ended questions that elicit rich description and deep engagement from respondents (Terry & Braun, 2017). Inductive surveys generate data used for purely inductive analysis that aims to understand frameworks of meaning, the diversity of respondent experiences, and critical perspectives on phenomena. Surveys, in general, can combine elements that collect deductive, quantitative data with elements that generate inductive, qualitative data. This chapter first introduces the phases involved in an inductive survey approach as guideposts for the content of the chapter. Next, we position inductive surveys as a tool in relation to other types of qualitative methods. We also discuss examples of inductive surveys used in a variety of social science fields. In the main section of this chapter, we give an overview of the phases of inductive survey planning, design, analysis, and reporting. We offer a discussion of the strengths and weaknesses of inductive surveys as a research approach. Finally, we provide a research design checklist for implementing inductive surveys.

As we discuss in detail below, when researchers apply inductive survey methods, their efforts align with four phases of development: planning, designing, analyzing, and reporting. Choices made within these phases involve specific items (see Figure 7.1). These phases are referenced in the background section below to show how unique issues of inductive research approaches apply to the practical work of inductive survey research.

Figure 7.1  Phases of inductive survey creation and usage

INDUCTIVE VERSUS DEDUCTIVE APPROACHES TO SURVEY RESEARCH

Across many fields, a survey is considered an instrument for gathering data to understand numeric variations in pursuit of quantitative explanations like frequencies, means, modes, and other parameters for statistical testing (Jansen, 2010). Surveys can also approach understanding variation from an inductive, qualitative perspective. Qualitative surveys focus on identifying the variations of a phenomenon of interest in a given sample. Simply put, quantitative approaches establish distribution, while qualitative approaches establish diversity.

When discussing and demonstrating design parameters for inductive surveys, we first must move away from the text-and-numbers false dichotomy that often separates qualitative and quantitative approaches (Nowell & Albrecht, 2019). Deductive approaches begin with propositions that are translated into hypotheses and tested in a random sample that is assumed to be representative of the population. Inductive approaches begin with an exploration of cases or events that have theoretical importance so that deeper insights can be established (Eisenhardt & Graebner, 2007). The perennial misunderstanding across fields is that inductive and deductive methods ask and answer the same research questions, just with different data. This assumption, unfortunately, causes confusion about research designs and appropriate methodological tools. Deductive approaches are characterized by starting with a broad theoretical view in pursuit of a particular set of hypotheses derived from theory (Jaana & Urs, 2018). Deductive approaches focus on hypothesis testing, examining the generalizability of findings, and then making suggestions for updating or contextualizing theory. Inductive approaches begin with a particular phenomenon and make empirical observations to establish concepts, propositions, and theories that can be translated to more general settings (Jaana & Urs, 2018). As such, inductive and deductive approaches can often complement each other. We will discuss this in the next section when we showcase how inductive surveys may be especially useful in mixed methods designs. While deductive survey designs are often the norm, inductive methods are best applied when a researcher wants to advance new theories and extend an understanding of the nuances of existing theories (Nowell & Albrecht, 2019). Inductive approaches also have the ability to establish new frameworks, typologies, and constructs that advance a field's ability to create new hypotheses to deductively test. Finally, inductive approaches enable researchers to clarify the underlying processes and mechanisms behind statistically significant quantitative results.

Contributions of Inductive Methods to Social Science Research

An inductive approach to research has many benefits, including advancing new theory and further characterizing nuances of existing theory (Nowell & Albrecht, 2019). This section provides a broad overview of the benefits of inductive inquiry, as well as situating the approach among other types of research designs. Inductive approaches that utilize qualitative data can provide: (1) detailed descriptions of contexts, (2) rich definitions of mechanisms and associations between concepts, and (3) propositions about the connections between context and concept (Tracy & Hinrichs, 2017; Nowell & Albrecht, 2019). Inductive approaches using qualitative data also further develop theoretical constructs and aid in developing typologies of concepts. The underpinnings of all social science theories derive from inductive, observational work focused on making sense of the phenomena around us. Inductive research approaches aim to examine questions of "how" and "why," rather than trying to quantify impacts or effects. Table 7.1 below provides an overview of the considerations involved in choosing between an inductive and a deductive survey approach.



Table 7.1  Considerations for using an inductive versus deductive survey approach (adapted from Fink, 2003; Jansen, 2010; Terry & Braun, 2017)

Planning: Research question
  Inductive survey: Uses verbs like: Generate, Identify, Describe; uses nouns like: Meaning, Outline, Experience, Diversity
  Deductive survey: Uses verbs like: Relate, Impact, Effect, Cause, Influence, Compare, Contrast; nouns like: Frequency and distribution
Planning: Sampling
  Inductive survey: Purposeful
  Deductive survey: Random
Designing: Question phrasing
  Inductive survey: Distinct to the topic, context, and respondents; include a variety of prompts, as well as demographic, topic, and additional response options
  Deductive survey: Use verified scales whenever possible; response options are often the same for many questions (Likert, etc.)
Analyzing: Sorting and category creation
  Inductive survey: Diversity analysis; case-oriented and concept-oriented coding and category creation
  Deductive survey: Distribution analysis; unit-oriented or variable-oriented descriptive statistics and/or estimating parameters
Analyzing: Memoing
  Inductive survey: Applied throughout research phases; used as evidence and audit of process
  Deductive survey: [Not often done; not a requirement for reliability and validity of findings]
Reporting: Trustworthiness
  Inductive survey: Fulfill and report important criteria throughout all phases
  Deductive survey: [Not applicable]

Inductive survey research is one type of inductive design, and it can be coupled with deductive approaches. Inductive inquiry is most often associated with qualitative data generation and a variety of analysis traditions. As noted in Table 7.2 (adapted from Nowell & Albrecht, 2019), inductive survey research could be included as a method in multiple traditions. Since inductive surveys can generate data similar to those from semi-structured interviews or long, in-depth interviews, researchers can utilize them in a variety of inductive, qualitative traditions.

Examples of Inductive Survey Research

Our search of several databases shows that inductive surveys are predominantly used in the management sciences, social sciences, health sciences, and psychological sciences. In Table 7.3, we organize the examples by research discipline and outline the approach the researchers take to address each of the phases of inductive survey planning. Several methodological approaches are represented, although each research team does not necessarily employ or discuss every inductive survey research phase. While inductive data analysis and reporting are common foundations when employing inductive survey methods, in this chapter we focus on disciplines with human subject samples. In Table 7.3 below, we demonstrate that inductive surveys are not a "one size fits all" approach but instead vary in their design and, therefore, in the analysis of the data. Further, the design follows from the specific research context and research question posed for the study.

INDUCTIVE SURVEY DESIGN

The main focus of this chapter is to describe the important considerations and steps in designing an inductive survey. Before detailing the phases, there is a brief discussion of the concept of data generation, why research "teams" are mentioned, and the trustworthiness criteria that establish rigor in qualitative research across all phases. The main sections below will focus on four phases: (1) planning, (2) designing, (3) analyzing, and (4) reporting. Each of these phases has several steps and key decision points. The planning phase includes establishing overarching research questions, defining approaches to data generation, and addressing issues of sampling. In the design phase, attention turns to constructing questions for data generation and choosing technologies for a successful survey. Researchers will often engage

Table 7.2  Inductive methodological traditions and associated methods

Grounded theory
  Phase 1 (Planning) focus: Developing a theory grounded in data from the field
  Disciplinary origin: Sociology
  Phase 2 (Designing) data generation: Interviews with individuals to "saturate" categories and detail a theory
  Phase 3 (Analyzing): Open coding, axial coding, selective coding, conditional matrix
  Phase 4 (Reporting): Theory or theoretical model (propositions)

Phenomenology
  Phase 1 (Planning) focus: Understanding the essence of experiences about a phenomenon
  Disciplinary origin: Philosophy, Sociology, Psychology
  Phase 2 (Designing) data generation: Long interviews (*inductive surveys possible with ample space for story-like responses)
  Phase 3 (Analyzing): Statements, meanings, general description of experience
  Phase 4 (Reporting): Description of the "essence" of the experience

Ethnography
  Phase 1 (Planning) focus: Describing and interpreting a cultural and social group
  Disciplinary origin: Cultural Anthropology, Sociology
  Phase 2 (Designing) data generation: Primarily observations and interviews, artifacts, extended time in field (6 months+)
  Phase 3 (Analyzing): Description, analysis, researcher interpretations
  Phase 4 (Reporting): Description of the cultural behavior of a group or an individual

Case study
  Phase 1 (Planning) focus: Developing an in-depth analysis of a single case or multiple cases
  Disciplinary origin: Political Science, Sociology, Evaluation, Urban Studies
  Phase 2 (Designing) data generation: Multiple sources are triangulated: documents, archival records, interviews, observations (*inductive surveys possible)
  Phase 3 (Analyzing): Description, themes, assertions
  Phase 4 (Reporting): In-depth study of a case or cases

Narrative analysis
  Phase 1 (Planning) focus: Describe how storytelling elements express sequence and consequence
  Disciplinary origin: Cultural Anthropology, Sociology, Public Policy
  Phase 2 (Designing) data generation: Field notes, interviews, documents (*inductive surveys possible)
  Phase 3 (Analyzing): Conceptual groupings of narrative elements, narrative elements coupled with phenomena
  Phase 4 (Reporting): Description of "what" and "how" a story is being told

Inductive content analysis
  Phase 1 (Planning) focus: Interpretation of the content of text data through a systematic classification process
  Disciplinary origin: Many disciplines within social sciences
  Phase 2 (Designing) data generation: Interviews and field notes (*inductive surveys possible)
  Phase 3 (Analyzing): Open coding, develop codebook, higher order themes
  Phase 4 (Reporting): Systematic description of meaning/definitions directly from data

Adapted from: Creswell, 2013; Cho & Lee, 2014; Nowell & Albrecht, 2019.

Social Sciences Parent, et al., 2020 An inductive analysis of young adults’ conceptions of femininity and masculinity and comparison to established gender inventories

Management Sciences Heaton et al., 2016 Learning lessons from software implementation projects: An exploratory study

Research study

Table 7.3  Examples of inductive survey research in social sciences

Parent et al. (2020)
Research question/topic, data generation & sampling: Common conceptions of femininity and masculinity in the U.S. Sampling: Mechanical Turk (MTurk) and the introductory psychology participant pool at a college in the northeastern U.S.
Question development & survey administration: A short demographic questionnaire was given to potential MTurk participants as a pre-screen. Each prompt inquired, “Among people your age, what are the characteristics/behaviors that are considered [masculine/feminine]? (Please list at least 5 characteristics/behaviors).” The primary questionnaire was offered to participants who met specific criteria.
Coding process & sorting: Identified categories of characteristics/behaviors that emerged in response to the open-ended prompts about femininity and masculinity. Sorting: inductive analysis resembling grounded theory; open and axial manual coding used to identify themes and sub-themes; prevalence of themes determined via content analysis; a codebook was developed and used by the research team to re-code the data.
Synthesis, transparency & trustworthiness: Reliability (not inductive criteria): interrater agreement between authors using Cohen’s kappa; codes retained that were mentioned by 5% or more of the sample; similarities and differences compared in the sample in two phases of analysis; equivalencies assessed/tested between proportions of responses between groups. Transparency: frequency tables and code and theme tables in the Results.

Heaton et al. (2016)
Research question/topic, data generation & sampling: Examine and understand how software companies perceive project learning. Sampling: a mix of homogeneous and theory-based sampling techniques.
Question development & survey administration: A mixed approach to data generation was employed, beginning with data collected from a cross-sectional series of nine digitally recorded semi-structured interviews, leading to an inductive survey.
Coding process & sorting: Concept-driven coding (Ritchie and Lewis, 2003, pp. 224, 228); identification of emergent themes. Sorting: data reduction was performed iteratively on the individual transcripts prior to, during, and after coding.
Synthesis, transparency & trustworthiness: Synthesis: a theoretical model is produced (p. 301, Figure 3).

Inductive Survey Research 97

Table 7.3  Examples of inductive survey research in social sciences (Continued)

Soinio et al. (2020), Health Sciences: Lesbian and bisexual women’s experiences of health care: “Do not say, ‘husband’, say, ‘spouse’”
Research question/topic & sampling: Describe the experiences and wishes of lesbian and bisexual women concerning health care. Sampling: convenience sample.
Question development & survey administration: Authors provide a set of example questions in the article. Data were generated using an electronic survey.
Coding process & sorting: Categories and themes developed from participant responses to the survey.
Synthesis & transparency: Categories and themes tables developed; quotes from the sample population reported in narrative form (pp. 97–102).

Rafaeli (2006), Psychological Sciences: Sense-making of employment: on whether and why people read employment advertising
Research question/topic, data generation & sampling: Are employment ads vehicles for employee recruitment? Data generation: a survey of people’s reading of employment ads. Sampling: students and the general public (people riding a commuter train).
Question development & survey administration: Open-ended and multiple-choice questions about participants’ reading of employment ads; authors provide the full set of questions in the article for reference. Two-phase data collection utilizing a mixed-methods approach.
Coding process & sorting: Responses were coded 1 if only employment ads were cited, 2 if they were cited along with other sources, and 3 if they were not cited at all. Sorting: coding was used to support proposition development.
Transparency: Authors detailed how they developed their analysis.



with data analysis while the survey is still actively collecting responses, as well as engage in category creation and sorting, and memoing about the data generation process and outcomes. The final stage is reporting the findings that highlight relevant aggregate and individual evidence in a rigorous and trustworthy manner.

What Is “Data Generation”?

Throughout this chapter, we use the term “data generation” rather than “data collection.” We use this term deliberately to highlight that the language of research design processes differs in an inductive approach. “Data generation” is the process of a researcher designing a situation or set of questions to prompt respondents to describe their reality (Goldkuhl, 2019). Seixas et al. (2018) offer an interesting metaphor that describes the nature of an inductive researcher as a “composite sketch artist” (p. 779). In this role, the researcher uses a wide variety of generated data to “draw” the reality that has been described by others. As will be discussed below, inductive surveys have the opportunity to include many types of prompts and questions for data generation to engage respondents in many ways to share their views.

Why Is a “Research Team” Mentioned?

Throughout the description of the phases, the term “team” is used to describe a variety of group options for additional research peers and/or members of the sample of interest to check a researcher’s coding and categorization work (Nowell & Albrecht, 2019). Engaging others in all of the phases of an inductive survey can help support the dependability of findings (Cascio et al., 2019). Dependability is an element of trustworthiness of research design, discussed below, that involves the researcher engaging other coders in recoding data, discussing categories, and reaching consensus (Cascio et al., 2019).

What Are Trustworthiness Criteria in Qualitative Research?

Trustworthiness criteria are discussed more deeply in the Phase 4: Reporting section below. But understanding the nature of trustworthiness criteria is essential to all phases of inductive research design (Nowell & Albrecht, 2019). The trustworthiness criteria are used to assess the rigor of inductive research, not whether or not the survey respondents trust a researcher. Establishing trust with a potential set of respondents is an important element of rigorous inductive research that is actually embedded in the broader trustworthiness criteria as part of establishing credibility (see Table 7.5 in the section on Phase 4: Reporting). Trustworthiness criteria are present in all phases of inductive survey research, and these criteria are related to the discussion above about researcher reflexivity. In Phase 1: Planning, researchers should consider whether the planning makes it possible for multiple researchers to engage in data generation, memoing, and describing the research process as it unfolds. Also in Phase 1, careful consideration of the sample will further establish credibility (Nowell & Albrecht, 2019). In Phase 2: Designing, the question wording and the survey format will support respondents in providing rich description. In Phase 3: Analysis, credibility, transferability, dependability, and confirmability are supported by documenting the coding process, iterations of sorting data, and consensus-based coding with multiple research team members. Finally, in Phase 4: Reporting, researchers should consider how to document and report their processes alongside relevant evidence, such as direct quotes from respondents.

Phase 1: Planning Your Inductive Survey Research Project

The planning phase for inductive survey research sets the stage for success. Using an inductive survey means that the questions for data generation are mostly fixed and all respondents will have a similar data generation experience (Braun et al., 2017). While this is also important when designing a protocol for deductive surveys and interview guides, in an inductive survey all questions should map back to the overall research questions. In each of our past research examples in Table 7.3, a clearly articulated research question or topic is followed by a sampling and/or data generation plan. This tight coupling is very important. We discuss this further at the end of the section, utilizing Rafaeli et al. (2006) as an example.

Establishing the research question or topic: Inductive reasoning calls for understanding processes and mechanisms, dynamics, and the emergence of concepts. Thus, inductive research questions are often phrased as “How ...?” or “What are the aspects of ...?” From your research question then stems the topic of study and the



Table 7.4  Examples of inductive survey questions

Demographic: “How do you describe your gender?” “What do you consider to be your ethnic background?”

Topic-based: “Tell us about your experience with ...” “Think about X event. Tell us all about what happened.” “How was X situation addressed?” “In your view, what could have been addressed differently during X situation?”

Additional opportunity for response: “Thank you for your responses! Is there anything else you would like to share?” “We appreciate your time. Is there anything you feel we should know more about?”

important facets of that topic that data generation will address.

Defining approaches to data generation: Inductive surveys are best suited to generating data about the categories of diversity of a topic within a given sample. The key element of this part of Phase 1 is determining what approaches, types of questions, and survey structure will support answering the research question. Use of inductive, qualitative surveys aligns with different research aims such as: (1) sensemaking of respondent experiences and practices, (2) categorizing respondent views and perspectives, and (3) documenting respondent definitions of representation, meaning construction, and aspects of reality (Terry & Braun, 2017).

Addressing issues of sampling: To achieve a wide range of viewpoints, sampling should be purposive, with the aim of engaging respondents with relevant backgrounds (Jansen, 2010). Achieving saturation, or covering all of the relevant descriptions of a phenomenon, in inductive surveys requires: (1) starting with a small sample, (2) immediately analyzing responses to develop relevant categories of respondents, (3) creating and implementing a strategy for sharing the survey with new respondents, and (4) defining decision criteria for when to stop this ongoing process (Jansen, 2010). Important considerations for sample selection and refinement include (adapted from Jansen, 2010; Gobo, 2014; Sharma, 2017):

• Understanding and including a diversity of perspectives
• Creating purposive or quota subsets from larger sample sets
• Identifying and including “emblematic” cases and respondents that are both average and extreme
• Utilizing a chain referral (snowball) process for identifying important respondents based on past respondent suggestions
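Jansen’s (2010) four-step saturation logic above lends itself to a simple procedural sketch. The fragment below is illustrative only and is not from the chapter; the batch data, the category labels, and the `patience` stopping threshold are hypothetical assumptions.

```python
# Illustrative sketch of an iterative saturation check for an inductive
# survey: analyze responses in small batches and stop inviting new
# respondents once no new categories have emerged for `patience` batches.
def reached_saturation(batches, extract_codes, patience=2):
    """Return the batch count at which saturation was reached, else None."""
    seen = set()   # all categories identified so far
    quiet = 0      # consecutive batches that added no new category
    for i, batch in enumerate(batches, start=1):
        new = extract_codes(batch) - seen
        if new:
            seen |= new
            quiet = 0
        else:
            quiet += 1
        if quiet >= patience:
            return i
    return None

# Toy data: each "batch" is already reduced to a set of category labels.
batches = [{"pay", "location"}, {"pay", "growth"}, {"pay"}, {"growth"}]
print(reached_saturation(batches, extract_codes=set))  # prints 4
```

In practice, the batches would come from ongoing coding of incoming responses, and the stopping decision (step 4) would be agreed by the research team rather than reduced to a single numeric threshold.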

Research example

In Rafaeli et al. (2006), the authors’ data generation strategy and sampling are tightly coupled with the research question. Specifically, the research question is about recruitment of employees via employment ads. The researchers chose a data generation strategy that elicits likely candidates’ propensities to respond to employment ads. By choosing a sample of persons likely to be exposed to an employment ad in a natural environment, i.e. on a commuter train, the researchers were more likely to obtain realistic data from their survey tool.

Additional resources • Agee (2009) article “Developing qualitative research questions: A reflective process” • Berger (2015) article “Now I see it, now I don’t: Researcher’s position and reflexivity in qualitative research” • Special issue of Human Resource Management Review (2017) Volume 27, Issue 2

Phase 2: Designing Your Inductive Survey Instrument

While each example in Table 7.3 discusses and gives evidence of how the researchers designed their survey instrument, some of the examples are more explicit than others. Below we review the main components of instrument design.

Constructing survey questions: Inductive surveys will most likely include open-ended questions. Open-ended questions are designed to encourage respondents to provide feedback in their own words. Inductive surveys also include items that are simply written as a prompt or a statement for respondents (Ruel et al., 2016). Open-ended


questions can seem easier to write because they do not involve the challenges of creating item response options or using a pre-validated scale, but these types of questions and prompts are the sole inspiration for respondents to generate data. See Table 7.4 for examples of the types of questions and prompts discussed in this section.

In an inductive survey, there are usually demographic and topic-based questions (Braun et al., 2017). Demographic questions look fundamentally different in an inductive approach, given that the aim of the research is to understand the diverse perspectives of respondents, rather than their willingness to choose a predetermined category (Braun et al., 2017). Careful consideration is needed about the placement of demographic questions. Depending on the topic of the research, demographic questions may be an easy “warm-up” for respondents before going to deeper, more personal issues (Terry & Braun, 2017).

Topic-based questions are often the majority of items in an inductive survey. Because respondents cannot ask for clarification, questions should be as clear as possible, with little jargon, and without making any assumptions or using biased language (Braun & Clarke, 2013; Braun et al., 2017). A good practice is also to end the survey with a question that encourages participants to share additional thoughts or reflections from the survey process.
This is often phrased as “Is there anything else you would like to share here about ...?”

Best practices for writing inductive survey questions and prompts (adapted from Andres, 2012; Ruel et al., 2016; Braun et al., 2020):

• Use question wording to build rapport and a feeling of engagement
• Keep questions and prompts short
• Pose questions that respondents can actually answer
• Use simple language and remove as much jargon as possible
• Do not pose questions that result in a yes/no or other dichotomous response
• Avoid prompting response options in the question
• Each question should cover one topic only; no “double-barreled” questions

Choosing a format and appropriate technological platform: One challenge with inductive surveys that do not involve direct engagement is that there is not an organic opportunity to ask follow-up or probing questions prompted by the respondent’s initial answer (Terry & Braun, 2017). There are several ways to address this limitation. First, researchers should consider how the size


of a free-text response box cues a respondent to engage (Terry & Braun, 2017). Short boxes that look as if only a few words will fit can constrain respondent answers. Longer, paragraph- or square-shaped response boxes provide more visual space for respondents and allow for reviewing and editing or expansion (Terry & Braun, 2017). Additionally, inductive survey items can be posed as questions or prompts like “Tell us about …” Large response boxes allow respondents to answer with more in-depth details and rich description. Newer online survey platform technologies also offer the opportunity to have respondents record audio and video answers to text, audio, or video prompts and questions. In Figure 7.2 we offer screenshots of a current platform that allows respondents to answer in several mediums.
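Some of the wording guidance above can be checked mechanically before piloting. The function below is a hypothetical helper, not part of the chapter or of any survey platform; its word lists and heuristics are illustrative assumptions.

```python
# Illustrative draft-question linter for a few of the best practices:
# flag likely yes/no (dichotomous) phrasings, overly long questions,
# and possible double-barreled ("and") questions.
import re

YES_NO_OPENERS = ("do ", "does ", "did ", "is ", "are ", "was ", "were ",
                  "have ", "has ", "can ", "will ")

def check_question(q):
    """Return a list of warnings for a draft open-ended survey question."""
    issues = []
    if len(q.split()) > 25:
        issues.append("long question; consider shortening")
    if q.strip().lower().startswith(YES_NO_OPENERS):
        issues.append("likely yes/no question; rephrase as open-ended")
    if re.search(r"\band\b", q) and "?" in q:
        issues.append("contains 'and'; check for a double-barreled question")
    return issues

print(check_question("Do you like your manager and your team?"))
```

A checker like this cannot judge jargon, bias, or rapport, so it complements rather than replaces a human review of each item.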

Research examples

Both Rafaeli et al. (2006) and Parent et al. (2020) demonstrate how researchers use open-ended inductive survey formats to support the aims of their research. For example, Parent et al. (2020) used an open-ended prompt to elicit characteristics participants associated with masculine or feminine behaviors, i.e. “Among people your age, what are the characteristics/behaviors that are considered [masculine/feminine]?” By requesting a minimum number of responses, they ensured that participants were able to provide the breadth of information they would need to address their research question. In Rafaeli et al. (2006), the researchers take a mixed-method approach to developing their survey using both open-ended and multiple-choice questions. Because their participants were embedded within a scenario or context the researchers created, the use of material from artifacts, such as the employment ads, provided a natural opportunity for participants to consider authentic responses.

Additional resources • Flick (2008) Designing qualitative research • Stake (2010) Qualitative research: Studying how things work

Phase 3: Analyzing Data Generated from Inductive Surveys

Having analysis as a seemingly separate phase in the inductive survey process is slightly misleading. Data generation and analysis are often



Figure 7.2  Example of Phonic survey platform

overlapping, a distinguishing characteristic of inductive research (Merriam & Tisdell, 2015). Often, a researcher using an inductive approach does not know the full extent of the sample nor the important categories that will surface from data generation. As such, researchers should plan for emergent themes to direct analysis that will be dynamic, with many feedback loops (Jansen, 2010; Merriam & Tisdell, 2015; Tracy & Hinrichs, 2017). For successful inductive data analysis, researchers must focus on early engagement with data, category creation and sorting, as well as memoing about the process and outcomes of the analysis phase. Below, we focus on actionable steps for how to conduct inductive survey data analysis in general, rather than specific analysis techniques that align with approaches like

phenomenology, grounded theory, ethnography, or narrative analysis.

Early engagement with data: At the end of a successful inductive survey deployment, researchers will likely have hundreds of pages of text to read or hours of audio or video to absorb. Engaging with data during the data generation process makes analysis less confusing and overwhelming. Even though a research question is established early in the process, researchers must engage with the data to see which ideas emerge as more prominent, as well as to see which types of respondents are necessary to have a diversity of viewpoints (Merriam & Tisdell, 2015; Elliott, 2018). Overall, early engagement with the data should include discussions of impressions with the research team and early conceptualization of categories and themes. Memoing,


as discussed below, is a key component of documenting this process as analysis continues.

Category creation and code sorting: Since we are offering a general overview of inductive data analysis techniques, it is important to keep central that the primary goal is to develop a set of categories or themes. The process of this categorization is called coding. Inductive researchers engage in coding “to get to grips with our data; to understand it, to spend time with it, and ultimately to render it into something we can report” (Elliott, 2018, p. 2851). Coding distinguishes and describes important categories in the data, and the process of coding is iterative in that codes are created, renamed, moved to categories, and sometimes become higher-level categories themselves (Creswell, 2013). A general overview of coding is depicted in Figure 7.3. An example of a coding structure and the branching/nesting of categories in Dedoose qualitative analysis software is shown in Figure 7.4.

Coding involves the following steps (adapted from Jansen, 2010; Merriam & Tisdell, 2015; Elliott, 2018), often repeated and in overlapping processes:

1 Close reading of a few respondents’ answers to engage with material as written.
2 Choose one respondent’s data to re-read and code.
3 Open code first. Mark/select words or phrases that offer insight into aspects of your research questions. Take a “trees,” or detailed, approach in open coding.
4 Repeat open coding with a few more respondents’ data.
5 Transition to code reduction and category creation. Sort existing codes into categories that you now name. Take a “forest,” or big-picture, approach to sorting and clarifying codes.

Figure 7.3  Coding process in inductive analysis


6 Discuss codes and categories with the research team.
7 Closely read new data.
8 Repeat open coding, utilizing existing codes as needed.
9 Repeat code reduction and category creation. New codes may be sorted into existing categories or they might warrant the creation of a new category.
10 Repeat discussion of codes and categories with the research team.
11 Continue until all data have been coded.
12 Discuss and clarify definitions and example data for all codes to create a codebook. A codebook includes definitions of codes, example excerpts, and decision rules for choosing between codes and what makes them unique.

“Good” or useful codes in the analysis process should meet a few criteria (Merriam & Tisdell, 2015). They should:

• Logically connect with the research question (empirical)
• Cover as many aspects of the data as possible (diversity-driven)
• Be conceptually distinct (mutually exclusive)
• Associate with only one level of analysis (level of abstraction)

Memoing about process and outcomes of analysis: The essence of inductive research, which focuses on uncovering patterns of meaning, requires researchers to engage in reflexivity (Birks, Chapman, & Francis, 2008). Reflexivity is the process of reflecting on our roles as researchers in the knowledge that we produce (Braun & Clarke, 2013). Reflexivity includes engaging with personal bias and worldview as the researcher, and functional issues of how the research itself



Parent et al. (2020) take a more elaborate grounded theory approach (Glaser and Strauss, 1967) to their data analysis. The researchers use open and axial coding to create categories and utilize the iterative coding process to reach thematic saturation. These authors also develop and utilize a codebook to facilitate the organization of their data for reporting.

Additional resources • Hands-on coding resources in Flick’s (2015) book chapter, Sage’s Analyzing and interpreting qualitative data: After the interview (2021), and Merriam and Tisdell’s (2015) text. • Cornish et al.’s (2014) chapter “Collaborative Analysis of Qualitative Data” in Flick’s The Sage Handbook of Qualitative Data Analysis • Guidance and tutorials on types of technology like Atlas.ti, NVivo, and Dedoose for analyzing inductive survey data

Phase 4: Reporting Inductive Survey Findings

Figure 7.4  Code categories with branching/nesting in Dedoose software

is designed. Keeping track of these reflections and conversations with other researchers is accomplished through note-taking known as memoing. Although memoing is discussed in the analysis phase of the research process, it is helpful to begin memoing at the very beginning of any project (Birks, Chapman, & Francis, 2008). Memoing is also an important record-keeping tool for the process of reporting findings, as discussed in the section below.
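The codebook described in step 12 above can be represented as a small data structure, so that each code’s definition, example excerpts, and decision rules travel together. This sketch is illustrative; the code names, fields, and example excerpt are invented and are not drawn from any of the cited studies.

```python
# Illustrative codebook entry: each code carries a definition, example
# excerpts, decision rules, and the higher-level category it was sorted into.
from dataclasses import dataclass, field

@dataclass
class Code:
    name: str
    definition: str
    examples: list = field(default_factory=list)
    decision_rules: str = ""
    category: str = ""  # assigned during code reduction / sorting

codebook = {
    "flexibility": Code(
        name="flexibility",
        definition="Respondent describes control over when or where work happens",
        examples=["I can log on whenever suits my family"],
        decision_rules="Use 'autonomy' instead when the focus is on task choice",
        category="working conditions",
    ),
}

# The "forest" view: group code names under their higher-level categories.
categories = {}
for code in codebook.values():
    categories.setdefault(code.category, []).append(code.name)
print(categories)  # {'working conditions': ['flexibility']}
```

Qualitative analysis packages such as Dedoose, NVivo, and Atlas.ti maintain equivalent structures internally; the point of the sketch is only to show what a codebook record minimally contains.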

Research examples

Heaton et al. (2016) and Parent et al. (2020) are exemplars of strategic approaches to data analysis of inductive surveys. Heaton et al. (2016) take a thematic approach to coding and identifying emergent themes, while iteratively sorting throughout the coding process. Both articles document their processes and discuss how multiple coders were involved in the category creation and sorting.

Many journals offer researchers a set of specific guidelines for the format of inductive research findings for peer-review publication. The section below discusses the general techniques for sharing findings from inductive survey research through explanation of research processes, best practices for sharing inductive research findings and evidence, and meeting the broadly accepted trustworthiness criteria (Nowell & Albrecht, 2019).

Explanation of research processes: While some explanation of the research and analysis approach may appear in the “methods” section of a research write-up, inductive research is best described as an iterative process in which the many steps or loops of analysis must be described (Blignault & Ritchie, 2009). Thus, discussing the research process and presenting the results of an inductive survey will involve more than just a table of codes or some excerpts from responses; it will be an integral extension of the analysis phase. The explanation of the steps of coding, sensemaking, and categorization of concepts is often embedded throughout the “findings” section, since important decision points in the analysis became paths for further inquiry and insights to emerge from the data (Thody, 2006). See the following section for suggested articles that report inductive survey findings in a variety of formats.

Sharing inductive findings and evidence: Often, inductive research is reported in an article section called “Findings/Results and Discussion.” This


convention differs from deductive articles because a researcher’s personal interpretation of the data is embedded in the way the data have been organized and presented (Blignault & Ritchie, 2009). Inductive research tends to be iterative, with many feedback loops among the researchers and the process of data generation. Given that inductive survey research seeks to understand the diversity and depth of responses, the researchers must (1) organize findings in a logical flow, (2) discuss context and concepts, (3) choose exemplary data, and (4) creatively format evidence beyond simple tables (Ritchie et al., 2013). Suggestions for each of these actions are adapted from Blignault and Ritchie (2009) and Silverman (2013):

• Organize findings in a logical flow
  ◦ Immediately follow the “Methods” section
  ◦ Include a narrative of the phases and cycles of the analysis process; always link back to the research question or topic
  ◦ Use subheadings as “signposts” to guide your reader and highlight the main ideas
  ◦ Juxtapose the narrative of findings with evidence
• Discuss context and concepts
  ◦ Describe any important aspects of the respondents or the survey design that were not included in the “Methods”
  ◦ Discuss whether there were any modifications to the respondent pool or survey design during data generation, and why
  ◦ Define any jargon or new concepts


• Choose exemplary data
  ◦ Choose three or four excerpts that are evidence of each code or larger category
  ◦ Use excerpts in-text that are the most direct example of a code or larger category
• Creatively format evidence
  ◦ Tables of supporting quotes, if used, should be as succinct as possible
  ◦ Provide readers with a code-concept map to demonstrate the logic of the analysis
  ◦ As needed, create diagrams or flowcharts of concepts, especially if there are meaningful differences between subgroups of your survey respondents (for examples, see Ligita et al., 2020)

Meeting trustworthiness criteria: As discussed throughout this chapter, inductive research is fundamentally different from deductive research. As such, the criteria in Table 7.5 for what constitutes “good” inductive survey research are aligned with trustworthiness criteria rather than concepts of validity and reliability (Nowell & Albrecht, 2019). Inductive research does not aim to be generalizable, but rather transferable (Andres, 2012), and inductive research is not driven by prior theory that generates hypotheses to be tested (Nowell & Albrecht, 2019; Pratt, Kaplan, & Whittington, 2020). Table 7.5, adapted from Nowell and Albrecht (2019) and Pratt, Kaplan, and Whittington (2020), provides an overview of important trustworthiness criteria for conducting and reporting inductive survey findings.

Table 7.5  Overview of trustworthiness criteria for inductive survey research

Credibility
Characteristic: Has the researcher acknowledged the different constructions of reality found in the data? Credibility is assessed by respondents, not the researcher.
Techniques: Prolonged engagement with the setting; member check initial interpretations (more challenging in a survey design than in other inductive research).

Transferability
Characteristic: Is there meaningful similarity to other contexts? Can another researcher draw connections to what has been found in another context? Transferability is judged by subsequent researchers, not the current researcher.
Techniques: Provide “thick description” of the research context and respondents; describe major findings using evidence and/or illustrative quotations/text excerpts.

Dependability
Characteristic: What steps were taken to ensure that multiple researchers engaged with the data and interpretations? Dependability is a characteristic of the process of data generation, analysis, and reporting.
Techniques: Multiple researchers participating in data generation; peer debriefing and documentation during data generation; triangulating findings across different data sources; analysis mapping and audit of research processes.

Confirmability
Characteristic: What was the process for verifying the data? Confirmability is a characteristic of the data.
Techniques: Generating data in phases, iterating between analysis and data generation; consensus-based coding with multiple coders; reporting an audit trail and other design features aimed at improving the rigor of design and analysis.



Parent et al. (2020) and Soinio et al. (2020) offer exemplars in reporting of inductive survey data. Specifically, Parent et al. (2020) raise the critical importance of establishing that inductive survey results are communicated transparently and reliably. Their use of interrater coding, as well as reporting how the researchers chose to keep or discard codes increases the trustworthiness of their findings. Further, they provide an in-depth description of their phased approach to coding and analysis in their reporting. Soinio et al. (2020) provide category and theme tables to demonstrate how they came to their conclusions, which helps increase transparency for the readers and reviewers. Also, their use of quotes in their results section authenticates their analysis and creates more transparency and trustworthiness in their study.
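Two of the practices credited to Parent et al. (2020) above, interrater agreement and a minimum-mention threshold for retaining codes, can be made concrete in a few lines. This sketch is illustrative; the ratings, code counts, and sample size are invented, and Cohen’s kappa is computed from first principles rather than with a statistics package.

```python
# Illustrative reliability checks for inductive survey coding:
# (1) Cohen's kappa for two raters assigning nominal codes to the same items;
# (2) retaining only codes mentioned by at least 5% of the sample.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa, pb = Counter(rater_a), Counter(rater_b)
    expected = sum(pa[c] * pb[c] for c in pa) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["strong", "strong", "caring", "caring", "strong", "caring"]
b = ["strong", "caring", "caring", "caring", "strong", "caring"]
print(round(cohens_kappa(a, b), 2))  # 0.67

# Retention rule: keep codes mentioned by >= 5% of respondents.
mentions = {"strong": 40, "caring": 35, "stoic": 2}
sample_size = 400
retained = {c for c, k in mentions.items() if k / sample_size >= 0.05}
print(sorted(retained))  # ['caring', 'strong']
```

Note that kappa corrects raw percentage agreement for agreement expected by chance, which is why it is preferred over simple agreement rates when reporting interrater reliability.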

Additional resources • Bekker and Clark’s (2018) article “Improving Qualitative Research Findings Presentations: Insights from Genre Theory” • Thorne (2016) text chapters on disseminating findings and advancing evidence

Best Practices for Inductive Survey Application

Throughout this chapter, detail has been provided for the four phases of inductive survey design. Figure 7.5 summarizes the best practices for each phase. The figure serves as a checklist or workflow diagram for planning and designing an inductive survey, as well as analyzing and reporting the findings. Along with the examples from past research and the additional resources, these best practices can enable new and experienced researchers to engage with inductive surveys. In the planning phase, alignment is key across the development of research questions, data generation processes, and sampling strategy. In the design phase, researchers should focus on user experience in the form of the questions and formats of the survey itself. During the analysis phase, early engagement, category creation and sorting, and memoing are coupled through iteration and reflection. Finally, in the reporting phase, researchers should focus on the logic of their evidence and clear support for their adherence to trustworthiness criteria.

Figure 7.5  Best practices in the phases of inductive survey creation and usage

Planning: Research question: strive for understanding processes, mechanisms, dynamics, and/or the emergence of concepts. Data generation: aim for diversity of topics to help “make sense” of experiences, categorize concepts and perspectives, and document definitions, meanings, and realities. Sampling: structure for the most diverse information, consider purposely including viewpoints, and/or utilize chain referral.

Designing (user experience): Question phrasing: consider placement of demographic questions, and create clear, unambiguous topic-based questions (Table 7.4). Format: design free-text entry spaces large enough for anticipated responses. Explore new technologies that allow for audio/video questions and for respondents to record audio/video answers.

Analyzing (iteration and reflection): Early engagement: begin reviewing data during the generation process, adjust the sample as needed, and discuss early impressions and categories with the research team. Category creation and code sorting: aim for an iterative process that creates clear definitions and supporting data; codes should be empirical, diversity-driven, mutually exclusive, and associated with only one level of analysis. Memoing: take notes on the research process, concepts and connections in the data, and researcher bias or reflexivity.

Reporting (logic of evidence): Process: embedded in the “methods” and “findings” sections; describe iterative processes and document research-team conceptual decisions. Evidence: strive for a logical flow of ideas, clearly define/discuss context and concepts, carefully choose exemplary data, and format evidence creatively. Trustworthiness: discuss ways in which the research team met criteria for credibility, transferability, dependability, and confirmability (Table 7.5).

Future Directions

While the application of inductive surveys may currently be limited, the benefits to researchers and respondents can be substantial when the approach is well designed. For the researcher, inductive surveys offer the opportunity to generate in-depth, information-rich data, as well as support proposition, framework, and theory development (Braun et al., 2017). For respondents, inductive surveys, especially when administered online, offer the opportunity to feel more comfortable disclosing sensitive information without having to do so in person or on a phone (Braun et al., 2017). Inductive surveys also have the potential to engage more diverse respondents with broader views because the process is less time consuming for both the researchers and the participants.

SUMMARY

• Inductive surveys are best for generating data that provide rich descriptions directly from respondents’ own words
• Inductive surveys can offer benefits to researchers and respondents through broader reach and engagement, especially around sensitive topics
• Inductive surveys aim for diversity of viewpoints through purposive sampling and carefully crafted prompts and open-ended questions
• Inductive survey creation has four phases that often overlap: planning, designing, analyzing, and reporting
• Analyzing inductive survey responses requires iterative steps of coding and categorizing
• Inductive survey design, analysis, and reporting should align with the principles of trustworthiness


8 Reduction of Long to Short Form Likert Measures: Problems and Recommendations

Jeremy D. Meuser and Peter D. Harms

“Some vices miss what is right because they are deficient, others because they are excessive, in feelings or in actions, while virtue finds and chooses the mean.” (Aristotle, Nicomachean Ethics)

“Counting is the religion of this generation. It is its hope and its salvation.” (Gertrude Stein, 1937, p. 120)

“It is impossible to escape the impression that people commonly use false standards of measurement.” (Sigmund Freud, 1930)

“The point is not that adequate measure is ‘nice.’ It is necessary, crucial, etc. Without it, we have nothing.” (A. K. Korman, 1974, p. 194)

Surveys, for better or worse, are one of the most common research methods in (organizational) psychology (Rogelberg et al., 2004), in spite of decades-old advice to utilize other sources of research data (e.g., Campbell & Fiske, 1959; Thomas & Kilmann, 1975). Considering the centrality of survey measures (that is, sets of questions or items designed to measure a construct) in (organizational) psychology research and practice, it is not surprising that there are many recommendations for best practices for new measure development (e.g., Burisch, 1984; Jackson, 1970; Clark & Watson, 1995; Hinkin, 1995, 1998; Malhotra, 2006; John & Soto, 2007; Simms, 2008; DeVellis, 2016; Krosnick, 2018; this Handbook). However, scholars often do not adhere to best practices (see Heggestad et al., 2019, and Cortina et al., 2020, for reviews). One often-overlooked issue in such guides is consideration of the possible consequences for the validity of assessments when developing short scales or shortening existing measures (e.g., Smith et al., 2000; Credé et al., 2012; Franke et al., 2013). Such warnings are certainly warranted, as many reviewers do not (adequately) scrutinize the short form measures they encounter (Heggestad et al., 2019; only 23% of adapted scales in their review were accompanied by evidence of validity). This can be problematic because researchers often shorten measures in an ad hoc fashion, using criteria ranging from factor loadings to applicability in an (often single) sample to acceptability of items to the subject organization or others, without full attention to the consequences of their shortening. The purpose of this chapter is to discuss the pros and cons of long versus short form Likert measures (we do not expressly address IRT, formative, forced-choice, or any other form of measurement, though some of the present discussion applies to those domains) and to provide some recommendations for shortening a scale or selecting an existing short form measure. To accomplish this goal, we conducted a broad multi-disciplinary review of the measurement literature over the last 100 years, encompassing 60 journals (list available from the first author) from the organizational, marketing, psychology, counseling/psychiatry, health care, sociology, political science, education, social services, and MIS/computer science literatures.

WHY SHORTEN A SCALE? IS IT OK?

The conventional wisdom taught in many test theory or assessment classes is that longer measures tend to be superior to shorter measures because they allow greater content coverage of the construct being measured and display greater internal reliability as a result of the increased number of items (assuming that item intercorrelations are equivalent). Standard advice for those believing there was not sufficient time for a full assessment has been to “find the time” (Wechsler, 1967, p. 37). Longer measures typically have higher Cronbach’s (1951) alpha estimates of internal reliability because the formula is a function of the number of items and the intercorrelations between them (Bruner & Hensel, 1993; Cortina, 1993). Moreover, comparisons of short measures to longer versions have frequently shown reductions in validity (Credé et al., 2012).

Why shorten a scale, then? There is reason to question this conventional wisdom regarding scale length, and debates about best practices in this regard have persisted for decades. Although some scholars hold that survey length is a critical feature of the design (Jenkins & Taber, 1977; Schmitt & Stults, 1985; Yammarino, Skinner, & Childers, 1991), impacting access to participants and organizations (Smith, Colligan & Tasto, 1979), participant compliance, and data quantity and quality (Rogelberg et al., 2001; Thompson et al., 2003; Sakshaug & Eckman, 2017; DeSimone et al., 2020), others suggest that survey length is not a critical feature of the research design or the results it produces (e.g., Sheth & Roscoe, 1975; Craig & McCann, 1978; Harrison & McLaughlin, 1996; Bowling et al., 2022). At a practical level, organizational samples are difficult to negotiate and time consuming to execute, and researchers want to maximize the number of constructs in a data collection to maximize their investment and publication potential. Organizational researchers are frequently caught between the “rock” of data quality and the “hard place” of data quantity1 (Bruner & Hensel, 1993). While in principle resolving this dilemma is a matter of theory and empirics (Yousefi-Nooraie et al., 2019), in practice executing this balance is non-trivial and, we argue, an underappreciated aspect of the research process.
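The mechanical dependence of alpha on scale length can be sketched with the Spearman–Brown prophecy formula. This is a minimal illustration with made-up values (the .30 inter-item correlation is an assumption, not a figure from any study cited here):

```python
def spearman_brown(r_single, k):
    """Predicted reliability of a k-item scale whose items have an
    average inter-item correlation of r_single (parallel-items assumption)."""
    return (k * r_single) / (1 + (k - 1) * r_single)

# With a modest average inter-item correlation of .30, reliability climbs
# with length even though per-item quality is held constant:
for k in (4, 10, 20, 30):
    print(k, round(spearman_brown(0.30, k), 2))  # 4 -> 0.63, 10 -> 0.81, 20 -> 0.9, 30 -> 0.93
```

The diminishing increments from 20 to 30 items also foreshadow the later point that adding items has diminishing returns.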

POTENTIAL ISSUES WITH LONG FORM MEASURES

Longer measures are often criticized for having too many very similarly worded items (e.g., “Achieving power is one of my top priorities in life”; “Attaining power is very important to me”), which can increase respondent boredom or irritation (Schmitt & Stults, 1985), decrease motivation to complete the survey (Podsakoff et al., 2012), and impact response rates and responses to the items (Rogelberg et al., 2001; Bundi et al., 2018; DeSimone et al., 2020).2 When collecting data, we often hear comments like “you already asked me this” and “don’t waste my time asking me the same thing over and over.” Many measures still in use were created decades ago, when attention spans may have been longer (Subramanian, 2017, 2018) and taking surveys may have been more novel and interesting to respondents. Modern survey takers are inundated with requests to complete surveys by pollsters or, in exchange for small rewards, by customer service departments tracking the performance of their employees and products. Organizations utilizing internal surveys can also face increased levels of hesitancy or reactance from respondents, who may believe that the organization will fail to act on the survey information or who receive no feedback, personalized or otherwise (Thompson et al., 2003; Podsakoff et al., 2012), reducing the impact of organizational endorsement as well as participant motivation and willingness to comply with survey demands (Rogelberg et al., 2001). Many organizational scholars collect data online (whether from companies or online panels, e.g., MTurk; see Chapter 20), and these participants may display reduced engagement and attention when completing surveys because their answers will have no (perceived) value to or impact on their lives (Harms & DeSimone, 2015; Rogelberg et al., 2001). Respondent fatigue resulting from excessively long surveys or measures is widely believed to reduce respondent compliance (e.g., the likelihood of skipping items increases as a function of survey length; Hattie, 1983; Church, 2001), decrease conscientious responding (Huang et al., 2012; Kam & Meyer, 2015), and reduce completion rates (Berdie, 1973), with decreased data quality as a result (Karren & Woodard-Barringer, 2002; Kam & Meyer, 2015). Survey fatigue may drive participants to fall back on easily accessible yet irrelevant or biased cognitive schemas, which results in biased or erroneous responding (Harrison & McLaughlin, 1996; Hansbrough et al., 2015). In other words, when people get tired and/or bored, they tend to answer reflexively instead of really considering the question. Similarly, respondents may resort to random responding or “straightlining” (answering with the same Likert anchor for each item without considering differences in item meaning), a form of insufficient responding that reduces the variance and validity of the data (DeSimone et al., 2020). This is a problem in general, but especially when studying low base-rate phenomena (such as criminal activity or experiencing abusive supervision), because inattentive responding at the midpoint of the scale can create false positive relationships between variables (Credé, 2010). Longer surveys can also be more vulnerable to distractions, especially in unsupervised internet research. While distractions may not alter the quality of the data, they can reduce response rates when participants fail to return to the survey (Ansolabehere & Schaffner, 2015).
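The midpoint concern raised by Credé (2010) can be demonstrated with a small simulation. Everything here is a hypothetical sketch, not a reproduction of his analysis: the sample size, the 15% careless rate, and the response distributions are invented. Two truly unrelated low base-rate variables become correlated once a subset of respondents straightlines at the midpoint of both scales.

```python
import random

def pearson(x, y):
    # Pearson correlation between two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5

random.seed(1)
n = 5000
careless_rate = 0.15  # assumed share of respondents who straightline at the midpoint

x, y = [], []
for _ in range(n):
    if random.random() < careless_rate:
        # Inattentive respondent: midpoint response on both (unrelated) measures.
        x.append(3.0)
        y.append(3.0)
    else:
        # Attentive respondent: low base-rate phenomenon, so independent
        # scores clustered near the floor of the 1-5 scale.
        x.append(1 + abs(random.gauss(0, 0.5)))
        y.append(1 + abs(random.gauss(0, 0.5)))

print(round(pearson(x, y), 2))  # clearly positive, although the true constructs are unrelated
```

Because most attentive responses sit near the scale floor, the careless (3, 3) responses form a common elevated cluster that manufactures a substantial false positive correlation.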
Research suggests that providing incentives does not overcome these survey fatigue or compliance issues (Hawkins et al., 1988; Adua & Sharp, 2010), leaving scholars without an obvious solution for overcoming the deleterious effects of onerously long surveys.3 Therefore, regardless of the potential validity of a long form measure, in practice, and especially as part of a long survey, the demands of longer measures and “finding the time” can negatively impact the quantity and quality (validity) of the data. Another potential issue with longer measures is that construct validity can be threatened by construct creep, whereby the measure captures variance outside of the core of the construct (see Klein et al., 2012, for a discussion of this with respect to commitment). In a similar vein, adding new items to longer scales generally has diminishing returns in terms of increased validity (Keller & Dansereau, 2001; John & Soto, 2007). Long measures offer other challenges empirically, especially with smaller samples. The fewer items of short form measures can ease sample size issues when executing (multilevel) confirmatory factor analyses (CFAs) for measurement fit (Ziegler, Poropat, & Mell, 2014). Similarly, executing full latent structural equation models (SEM) becomes easier. Long forms may have so many items that full SEM is not possible given the typical sample size in organizational research. In this case, resorting to parceling or path analysis is often the only option, reducing the value of SEM because error is less precisely estimated or not estimated at all. Further, conducting common method variance analyses becomes difficult when measures are long and samples are small (Podsakoff et al., 2003). Researchers have long commented on the utility of short measures for experience sampling and true longitudinal studies, where participants answer the same items on many occasions (Finsterbusch, 1976; Fleeson, 2001; Moore et al., 2002). Empirical (sample size) challenges are crucial with ESM and longitudinal designs, for which dropout rates are problematic (Bolger & Laurenceau, 2013), and for multilevel teams and network research, where compliance from the vast majority of participants in the sample is a necessary precursor to analyses. So it is possible that longer measures may inadvertently chase off the very participants required for estimation, a double-edged sword on which many a project has been impaled.

SHORT FORM MEASURES: AN ETHICAL OBLIGATION

Even though there are several things survey researchers can do to increase response rates and accuracy with long surveys (e.g., put the items most important to participants earlier in the survey and put demographics last, Roberson & Sundstrom, 1990; provide incentives, Rose et al., 2007; study something of value to the participants and provide feedback, Rogelberg et al., 2001), there is a limit to participant availability and attention. The research ethics principle of respect for persons implies that we cannot force participants to complete excessively long surveys (and even if we could, we would argue that such data would be suspect for a variety of reasons). The research ethics principle of beneficence indicates that researchers are obligated to maximize the benefit of their research, and this means survey efficiency (asking no more items than is necessary to achieve the goal of the research) is an ethical obligation. The research ethics principle of justice, which asks who bears the burden of research and who benefits from it, calls us to recognize that only rarely does our research actually benefit our participants directly. Long surveys may benefit a researcher’s career, but not the participant, which raises another clear ethical issue. Practically, as Stanton and Rogelberg (2001) point out, survey participants are a finite resource, and survey fatigue extends to surveys in general. Consequently, utilizing excessively or unnecessarily long surveys can damage the reputation of social scientists in general, which can harm our collective ability to collect data and, in turn, impair our ethical obligation to science (Rosenthal, 1994).

POTENTIAL ISSUES WITH SHORT FORM MEASURES

It is critically important to realize that the importance of each item increases as scale length decreases. With long measures (e.g., 30 items), the impact of any one bad item is diluted (1/30, or ∼3%) in the average; in short measures (e.g., four items), the overall assessment of the construct is more strongly impacted (here, 25%). With fewer items, scholars must be more careful, conscientious, and precise in their item selection in order to ensure adequate content domain coverage. Tay and Jebb (2018) point out that, in general, (organizational) scholars have not attended to the full continua of constructs when constructing their measures, and attending to this issue is even more complex with fewer items. Many scholars new to measurement development engage in ad hoc scale shortening, removing or changing items without fully appreciating the impact adjusting a measure can have. Ad hoc shortening of measures can (though does not necessarily) result in (1) a reduction in the coverage of the relevant content domain (perhaps eliminating elements of the content domain completely), (2) unacceptably low correlations with the original measure and a failure to reproduce a similar nomological network, (3) unacceptable measurement fit statistics, and/or (4) a disturbed factor structure, and (5) all of this without saving sufficient time to warrant a short form (Smith et al., 2000). On the other hand, while some scholars argue that the shortening of measures to save time is “unjustifiable and not to be encouraged” (Wechsler, 1967, p. 37), we would argue that in some cases shortened forms of original measures can produce near equivalent predictive validity (e.g., Gogol et al., 2014; Heene et al., 2014; Liden et al., 2015; see also Nash, 1965; Moore et al., 2002; Pathki et al., 2022) and overcome (many of) the obstacles discussed in the prior section if the short form is developed correctly.
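The dilution arithmetic above can be made concrete. A minimal sketch with hypothetical response values: a single aberrant response shifts a four-item mean far more than a 30-item mean.

```python
def scale_mean(responses):
    # Scale score as the mean of the item responses.
    return sum(responses) / len(responses)

honest = 2.0  # respondent's true standing on a 1-5 Likert scale
wild = 5.0    # one bad item or one careless response

long_form = [honest] * 29 + [wild]   # 30-item measure, one aberrant response
short_form = [honest] * 3 + [wild]   # 4-item measure, same aberrant response

print(round(scale_mean(long_form) - honest, 2))   # 0.1: the bad response is diluted
print(round(scale_mean(short_form) - honest, 2))  # 0.75: the same response moves the score 7.5x as far
```

The same arithmetic underlies the later point that a single inattentive response can radically change a short form score while being largely swamped in a long one.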

Because many scholars new to measurement development also believe that high coefficient alphas are a desirable property of measures (Bruner & Hensel, 1993; Cortina, 1993), short measures are often compared unfavorably to longer measures because this reliability index is typically smaller for short versus long measures. The typical solution, however, is to ask extremely redundant or similar items with the goal of maximizing alpha even with a small number of items (e.g., “I like my job,” “I love my job,” “I do not like my job”). In this way, some of the potential problems of long measures (e.g., fatigue and irritation) can be exacerbated with short measures. Further, if the content domain assessed is broad, efforts to capture this breadth via a short measure can negatively impact CFA fit statistics, which are sensitive to unidimensionality (John & Soto, 2007; see Pathki et al., 2022, for an example). The selection of high-quality items is therefore crucial for short form measures (see the next section for recommendations). One major challenge for short form measures is that they frequently fall victim to their history. If the long form from which the items are drawn is fraught with problems, the short form measure based on a subset of those items may likewise be fraught with problems. Even well researched constructs suffer from poor extant measurement tools. For example, the most widely used measure of transformational leadership, the Multifactor Leadership Questionnaire (Bass & Avolio, 1995), has been severely criticized for measurement problems (Van Knippenberg & Sitkin, 2013), as have popular measures of power and influence (Schriesheim & Hinkin, 1990) and organizational commitment (Meyer, Allen, & Smith, 1993). Scholars often employ previously published measures as if they were sacrosanct, naïve to their potential issues, and this simply is unwise.
The “grandfathering” of measures that happened to be published once upon a time in a quality journal is a problematic research practice. To combat this, a number of scholars (e.g., Ryan et al., 1999; Wood & Harms, 2016; Carpenter et al., 2016, 2021; Condon et al., 2020) have recommended re-evaluating existing measures at the item level. Similarly, Little, Lindenberger, and Nesselroade (1999) recommend departing from the concept of the static, unchanging measure, and instead revisiting and gradually improving measures as theory evolves. Klein et al. (2012) and Klein et al. (2014) provide an example of, respectively, the reconceptualization and the remeasurement of commitment, a commonly researched construct. Another issue is the proliferation of short form measures. For example, organizational identification is usually measured by some variant subset of the original six items (Meuser & Cao, 2013), and there are many variants of the Crowne and Marlowe (1960) social desirability measure (e.g., Strahan & Gerbasi, 1972; Reynolds, 1982; Fischer & Fick, 1993). On one hand, various short forms may perform better in particular contexts, such as with elderly and lower educated participants (Craig & McCann, 1978; Alwin & Krosnick, 1991), non-native speakers (Kleiner, Lipps, & Ferrez, 2015), or a particular sample industry. On the other hand, the choice of measure is often employed as a moderator in meta-analyses. Sometimes the moderation of the resulting effect sizes is substantial (e.g., Grijalva et al., 2015, showed that gender differences in narcissism ranged from moderately negative to moderately positive depending on the measure used) and sometimes it is not (e.g., Dulebohn et al., 2012, showed that the measure of leader-member exchange did not alter the population correlations). Significant moderation suggests the measures are performing differently, perhaps tapping into different aspects of the content domain (or, worse, into aspects outside of it). Such differences have been identified across short measures of the Big Five personality traits (Credé et al., 2012; see also Pace & Brannick, 2010). Comparison across measures in this situation is complicated and even suspect. Consequently, it is wise not to assume that short measures are equivalent to longer measures, even if derived from those longer forms. Acceptance of a short form measure by reviewers may also be an issue. While many reviewers do not adequately scrutinize short form measures, others will. Adaptations can take the form of dropping items, adding others, or modifying existing items (Heggestad et al., 2019). Dropping items without adequate care can yield inadequate content coverage (see the issues mentioned at the start of this chapter). Adding items can also be problematic (Keller & Dansereau, 2001).
Minor shifts in tone, grammar, punctuation, or wording, such as adjusting modifiers (e.g., adding “sometimes,” “usually,” or “always” to items), can impact response patterns (Schuman & Presser, 1977; Deaton et al., 1980; Barnes & Dotson, 1989) or even the meaning of items (Widiger & Trull, 1992). Consequently, we recommend that, when developing a short form measure, researchers follow best practices in measurement development (see Chapter 33 in this volume) rather than simply selecting items based on atheoretical criteria such as factor loadings or some other process that can result in construct drift or reduced construct coverage. In this way, authors can better defend against reviewer concerns. When selecting a short form measure, look for published measurement development work and compare the short form to the original measure. Many short form measures are created through simple psychometric reduction techniques. The overreliance on factor loadings (EFA and CFA) to create a short measure is problematic (Cortina et al., 2020; Zickar, 2020) and can yield the wrong factors, or too many or too few factors, especially early in the measure development process when the factor structure may not yet be established. This can result in the removal of relevant peripheral segments of the content domain (Cortina et al., 2020). The common techniques for selecting the number of factors, the scree plot and the eigenvalue-greater-than-one rule, frequently misrepresent the number of factors actually in the data (Conway & Huffcutt, 2003). This also means that many extant measures may have an incorrect number of factors, making them poor ground upon which to develop a short form. Over 20 years ago, Little and colleagues (1999, p. 208) cautioned against the overreliance on psychometrics: “theory provides the paramount source of guidance for picking a limited number of indicators to represent a construct.” It seems their advice has been largely overlooked, yielding measures that prioritize reliability and sacrifice validity (Cortina et al., 2020). Consequently, simple reliance on psychometrics, rather than conceptual/theoretical development and expert evaluation, can yield measures with redundant items, an incorrect number of factors, and/or a measure disjointed from the construct definition and theory (Zickar, 2020). Common survey generation advice begins with the brainstorming of items that could potentially assess a content domain (represented by a larger circle). Expert reviewers remove the items that are confusing and/or that, in the eyes of other researchers and content domain experts, do not assess the content domain. Suppose items A–I represent the items that survived this measure development process.
Factor-analytic reduction techniques, or the common practice of selecting the highest-loading items, will select items A–C (in the center of the larger circle), yielding items that assess nearly the same variance, are often very closely worded, and will produce the highest reliability of any collection of three items. Conversely, items D, H, and F (located at the periphery of the larger circle) may actually assess the content domain better, but because of their dissimilarity and small number, their psychometrics will be poorer than those of the A–C alternative. The result is clear: employing the A–C short form reduces the content domain assessed from the larger circle to the smaller one. Another issue with short measures is that they can structurally restrict the variance of scores simply by offering fewer opportunities to provide nuance or endorse items. Specifically, the range of possible scores for a 20-item measure with response options on a 1–5 scale is 20–100, while a five-item shortened measure has a possible range of only 5–25. For rare events, a reduced set of items means that there will be less possibility for endorsement of any items. A similar issue emerges if respondents are not conscientious in their ratings: a single inattentive or random response can radically change the score on a measure with only a few items, whereas it would have been largely swamped in a measure with far more items. Clearly, there are pros and cons to long and short form measures, and balancing measurement comprehensiveness against practical realities is challenging. The research question, theory, and hypotheses should guide the choice of measure, taking into account the practical limitations (Yousefi-Nooraie et al., 2019).
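The tradeoff between redundant central items and diverse peripheral items can be sketched in a simulation. All parameters here are invented for illustration: three facets load on one general construct; a “redundant” short form takes three near-clones of a single facet, while a “diverse” short form takes one item per facet. The redundant form wins on alpha but loses on correlation with the broad construct.

```python
import random

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    # items: list of item-response columns (one list per item)
    k, n = len(items), len(items[0])
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5

random.seed(7)
n = 4000
g = [random.gauss(0, 1) for _ in range(n)]  # general construct
facets = [[gi + random.gauss(0, 0.7) for gi in g] for _ in range(3)]
construct = [(facets[0][i] + facets[1][i] + facets[2][i]) / 3 for i in range(n)]

# "Redundant" short form: three near-clones of facet 1 (analogous to items A-C).
redundant = [[facets[0][i] + random.gauss(0, 0.3) for i in range(n)] for _ in range(3)]
# "Diverse" short form: one item per facet (analogous to items D, H, F).
diverse = [[facets[j][i] + random.gauss(0, 0.3) for i in range(n)] for j in range(3)]

for name, items in (("redundant", redundant), ("diverse", diverse)):
    total = [sum(col[i] for col in items) for i in range(n)]
    print(name, "alpha=%.2f" % cronbach_alpha(items),
          "r(construct)=%.2f" % pearson(total, construct))
```

The pattern mirrors the argument above: maximizing alpha by selecting redundant, central items narrows the content domain assessed and, in this sketch, costs validity against the broader construct.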

ADVICE FOR CREATING OR SELECTING A SHORT FORM MEASURE

1 Begin with a valid measure. Do not assume that publication in a top journal indicates the validity of a measure found therein. Seek out published validation work on the original measure.
2 Be wary of shortened measures in the literature that are poorly described (e.g., where it is not clear which of the original items were used) and those lacking published empirical support. This is especially important when considering further shortening the measure (known as cascading adaptations; Heggestad et al., 2019). Conduct your own validation prior to employing a scale lacking published validity information.
3 The short form should preserve the content domain coverage of the original measure. If the construct is broad (has multiple subfacets), you need at least one item to reflect each subfacet. The content should be as balanced as possible across subfacets.
4 Whenever possible, short form measures should reflect narrow rather than broad constructs.
5 Items should be as face-valid as possible and follow standard item generation/selection best practices as outlined in other chapters of this volume (see also Cortina et al., 2020).
6 Select items using a combination of theory and empirical analyses (e.g., CFA), not simply one or the other. Relying on empirics alone can produce a measure that lacks appropriate content domain coverage (e.g., eliminating a facet/dimension).
7 You do not need to be concerned about alpha reliability. Redundancy, which increases reliability, reduces validity (Heggestad et al., 2019; see measures-of-broad-constructs-issues-of-reliability-and-validity/ for an empirical example). Cronbach (1951) suggested that internal reliability is just an inexact proxy for retest reliability (which is a preferable metric for short measures).
8 The short form should have sufficient overlapping variance with the long form.
9 If your test is unidimensional (overall or within each subfacet), retain, where possible, items that reflect different difficulty levels of the construct (see Drasgow et al., 2010; e.g., Harms et al., 2018) so as to capture the entire range of the construct. Range is more important than internal reliability because it allows for more robust correlations.
10 If the short form is not unidimensional, it must reproduce the long form’s factor structure.
11 The time savings of a short form must be worth any loss of validity when compared to the long form.
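Two of the statistics invoked in this advice, coefficient alpha (to be interpreted with caution, per advice 7) and the short form's overlap with the long form (advice 8), are simple to compute. The sketch below uses hypothetical data: the six items, five respondents, and helper names are invented for illustration, not taken from any published measure.

```python
# Illustrative sketch (hypothetical data): coefficient alpha for a
# long form, and the overlap of a three-item short form with it.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one list of responses per item (same respondents, same order)."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (variance(x) ** 0.5 * variance(y) ** 0.5)

# Six hypothetical items answered by five respondents on a 1-5 scale.
items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 4, 2, 1, 4],
    [3, 5, 3, 2, 4],
    [4, 4, 2, 2, 5],
    [5, 5, 3, 1, 4],
]
long_total = [sum(r) for r in zip(*items)]
short_total = [sum(r) for r in zip(*items[:3])]  # first three items as a short form

print(round(cronbach_alpha(items), 2))               # 0.96
print(round(pearson_r(short_total, long_total), 2))  # 1.0
```

Note that a part-whole correlation like this is inflated because the short form's items are contained in the long-form total; a fairer overlap check correlates the short form with the sum of the remaining items only.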

CONCLUSION

Nearly 100 years ago, scientists and thinkers already recognized the primacy of counting and measuring, a legacy of the Enlightenment (Heilbron, 1990). In the words of Guion (2011, p. 157), “measurement starts with the idea that anything that exists, exists to some degree, and therefore can be measured.” Korman eloquently stated the primacy and importance of accurate measurement: “without it, we have nothing.” Stein referred to this primacy as our “hope and salvation.” Freud lamented the danger before us: “It is impossible to escape the impression that people commonly use false standards of measurement.” Researchers are caught between the rock of data quality and the hard place of data quantity. Certainly, short measures (e.g., four items) are not a universal panacea (Nash, 1965). When introducing one of the most widely used short measures of personality, Gosling et al. (2003) likened their measure to a methadone clinic. Specifically, they noted the problems typically associated with short measures, both the potential for reduced reliability and validity and the risk that introducing a new short measure could backfire and damage the field it was intended to help. But they also noted that there were (limited) circumstances where such an inventory was necessary and hoped that people would use the measure responsibly.4 We would echo that sentiment. We understand the dangers inherent in creating and using short measures. Nonetheless, we believe that they have their


uses and, when developed and used appropriately, can be a useful tool for researchers. How many items are necessary? Aristotle, in effect, answered this question: seek the “goldilocks zone.” Ultimately, assuming a pool of good items, the right number to use is no more than is necessary to validly capture the construct of interest with sufficient reliability to have faith in the measure (Alwin & Beattie, 2016). Discerning this is no small matter, and we hold that it is our responsibility as scientists to strive for this lofty goal.

Notes

1. We acknowledge that a host of research practices can impact response rates, even with long surveys (e.g., Gupta, Shaw, & Delery, 2000). A discussion of this is beyond the scope of the chapter. Further, we argue that while research practices may increase the number of items researchers can ask of a sample, they do not negate the pros and cons discussed here with respect to the cost-benefit arguments on how researchers use those items.
2. We also acknowledge that the scale on which the measure is assessed (e.g., a 5- versus a 7-point Likert) is a complex issue (one that can impact response rates, compliance, reliability, and validity) addressed in many articles (e.g., Lozano, García-Cueto, & Muñiz, 2008). Addressing this literature would be a chapter by itself, and therefore these important issues are out of the scope of the present chapter.
3. Respondent fatigue and mood are such concerns that some have advocated for novel approaches to improving survey experiences, such as entertainment breaks during survey execution (e.g., tic-tac-toe; Kostyk et al., 2019).
4. We are aware of the irony that Gosling’s measure is widely used even though he recommended against that very thing. We suspect that this reflects both the desire of researchers to use established measures from reputable scholars and the tendency not to take the time to carefully read and reflect on scale validity papers. In this instance, the problem was likely exacerbated by Gosling promoting the measure on his website, where he also provided translations of the scale into 25 different languages.

ACKNOWLEDGEMENTS We thank David Keating for research support and the SEC Faculty Travel Grant for travel support during the writing of this chapter.


REFERENCES

Adua, L., & Sharp, J. S. (2010). Examining survey participation and response quality: The significance of topic salience and incentives. Survey Methodology, 36(1), 95–109. Alwin, D. F., & Beattie, B. A. (2016). The KISS principle in survey design: Question length and data quality. Sociological Methodology, 46(1), 121–52. Alwin, D. F., & Krosnick, J. A. (1991). The reliability of survey attitude measurement: The influence of question and respondent attributes. Sociological Methods & Research, 20(1), 139–81. Ansolabehere, S., & Schaffner, B. F. (2015). Distractions: The incidence and consequences of interruptions for survey respondents. Journal of Survey Statistics and Methodology, 3(2), 216–39. Barnes Jr, J. H., & Dotson, M. J. (1989). The effect of mixed grammar chains on response to survey questions. Journal of Marketing Research, 26(4), 468–72. Bass, B. M., & Avolio, B. J. (1995). Manual for the Multifactor Leadership Questionnaire: Rater form (5X short). Palo Alto, CA: Mind Garden. Berdie, D. R. (1973). Questionnaire length and response rate. Journal of Applied Psychology, 58(2), 278–80. Bolger, N., & Laurenceau, J. P. (2013). Intensive longitudinal methods: An introduction to diary and experience sampling research. New York: Guilford Press. Bowling, N. A., Gibson, A. M., & DeSimone, J. A. (2022). Stop with the questions already! Does data quality suffer for scales positioned near the end of a lengthy questionnaire? Journal of Business and Psychology. https://doi.org/10.1007/s10869-021-09787-8 Bruner, G. C., & Hensel, P. J. (1993). Multi-item scale usage in marketing journals: 1980 to 1989. Journal of the Academy of Marketing Science, 21(4), 339–44. Bundi, P., Varone, F., Gava, R., & Widmer, T. (2018). Self-selection and misreporting in legislative surveys. Political Science Research and Methods, 6(4), 771–89. Burisch, M. (1984). Approaches to personality inventory construction: A comparison of merits. American Psychologist, 39, 214–27. Campbell, D. T., & Fiske, D. W. (1959).
Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. Carpenter, N. C., Son, J., Harris, T. B., Alexander, A. L., & Horner, M. T. (2016). Don’t forget the items: Item-level meta-analytic and substantive validity techniques for reexamining scale validation. Organizational Research Methods, 19(4), 616–50.



Carpenter, N. C., Newman, D. A., & Arthur Jr, W. (2021). What are we measuring? Evaluations of items measuring task performance, organizational citizenship, counterproductive, and withdrawal behaviors. Human Performance, 34(4), 316–49. Church, A. H. (2001). Is there a method to our madness? The impact of data collection methodology on organizational survey results. Personnel Psychology, 54, 937–69. Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–19. Condon, D. M., Wood, D., Möttus, R., Booth, T., Costantini, G., Greiff, S., Johnson, W., Lukaszweski, A., Murray, A., & Revelle, W. (2020). Bottom-up construction of a personality taxonomy. European Journal of Psychological Assessment, 36, 923–34. Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor analysis practices in organizational research. Organizational Research Methods, 6, 147–68. Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104. Cortina, J. M., Sheng, Z., Keener, S. K., Keeler, K. R., Grubb, L. K., Schmitt, N., Tonidandel, S., Summerville, K. M., Heggestad, E. D., & Banks, G. C. (2020). From alpha to omega and beyond! A look at the past, present, and (possible) future of psychometric soundness in the Journal of Applied Psychology. Journal of Applied Psychology, 105(12), 1351–81. Craig, C. S., & McCann, J. M. (1978). Item nonresponse in mail surveys: Extent and correlates. Journal of Marketing Research, 15(2), 285–9. Credé, M. (2010). Random responding as a threat to the validity of effect size estimates in correlational research. Educational and Psychological Measurement, 70(4), 596–612. Credé, M., Harms, P., Niehorster, S., & Gaye-Valentine, A. (2012). An evaluation of the consequences of using short measures of the Big Five personality traits. Journal of Personality and Social Psychology, 102(4), 874–88. 
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349–54. Deaton, W. L., Glasnapp, D. R., & Poggio, J. P. (1980). Effects of item characteristics on psychometric properties of forced choice scales. Educational and Psychological Measurement, 40(3), 599–610. DeSimone, J. A., Davison, H. K., Schoen, J. L., & Bing, M. N. (2020). Insufficient effort responding as a partial function of implicit aggression.

Organizational Research Methods, 23(1), 154–80. DeVellis, R. F. (2016). Scale Development: Theory and Applications (4th ed.). Washington DC: Sage. Drasgow, F., Chernyshenko, O. S., & Stark, S. (2010). 75 years after Likert: Thurstone was right! Industrial and Organizational Psychology: Perspectives on Science and Practice, 3(4), 465–76. Dulebohn, J. H., Bommer, W. H., Liden, R. C., Brouer, R. L., & Ferris, G. R. (2012). A meta-analysis of antecedents and consequences of leader-member exchange: Integrating the past with an eye towards the future. Journal of Management, 38, 1715–59. Finsterbusch, K. (1976). Demonstrating the value of mini surveys in social research. Sociological Methods & Research, 5(1), 117–36. Fischer, D. G., & Fick, C. (1993). Measuring social desirability: Short forms of the Marlowe-Crowne social desirability scale. Educational and Psychological Measurement, 53, 417–24. Fleeson, W. (2001). Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80, 1011–27. Franke, G. R., Rapp, A., & Andzulis, J. “Mick” (2013). Using shortened scales in sales research: Risks, benefits, and strategies. Journal of Personal Selling & Sales Management, 33(3), 319–28. Freud, S. (1930). Civilization and its discontents. Internationaler Psychoanalytischer Verlag Wien. Gogol, K., Brunner, M., Goetz, T., Martin, R., Ugen, S., Keller, U., … & Preckel, F. (2014). “My questionnaire is too long!” The assessments of motivational-affective constructs with three-item and single-item measures. Contemporary Educational Psychology, 39(3), 188–205. Gosling, S., Rentfrow, P., & Swann, W. (2003). A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37, 504–28. Grijalva, E., Newman, D., Tay, L., Donnellan, M. B., Harms, P. D., Robins, R., & Yan, T. (2015). Gender differences in narcissism: A meta-analytic review. Psychological Bulletin, 141, 261–301.
Guion, R. M. (2011). Assessment, measurement, and prediction for personnel decisions. Routledge. Gupta, N., Shaw, J. D., & Delery, J. E. (2000). Correlates of response outcomes among organizational key informants. Organizational Research Methods, 3, 323–47. Hansborough, T. K., Lord, R. G., & Schyns, B. (2015). Reconsidering the accuracy of follower leadership ratings. The Leadership Quarterly, 26, 220–37. Harms, P.D. & DeSimone, J.A. (2015). Caution! Mturk workers ahead – Fines doubled. Industrial and Organizational Psychology: Perspectives on Science and Practice, 8, 183–90.


Harms, P. D., Krasikova, D., & Luthans, F. (2018). Not me, but reflects me: Validating a simple implicit measure of psychological capital. Journal of Personality Assessment, 100, 551–62. Harrison, D. A., & McLaughlin, M. E. (1996). Structural properties and psychometric qualities of organizational self-reports: Field tests of connections predicted by cognitive theory. Journal of Management, 22(2), 313–38. Hattie, J. (1983). The tendency to omit items: Another deviant response characteristic. Educational and Psychological Measurement, 43(4), 1041–5. Hawkins, D. I., Coney, K. A., & Jackson Jr, D. W. (1988). The impact of monetary inducement on uninformed response error. Journal of the Academy of Marketing Science, 16(2), 30–5. Heggestad, E. D., Scheaf, D. J., Banks, G. C., Monroe Hausfeld, M., Tonidandel, S., & Williams, E. B. (2019). Scale adaptation in organizational science research: A review and best-practice recommendations. Journal of Management, 45(6), 2596–2627. Heilbron, J. L. (1990). The measure of enlightenment. In T. Frangsmyr, J. L. Heilbron, & R. E. Rider (Eds.), The Quantifying Spirit in the Eighteenth Century. University of California Press. Heene, M., Bollmann, S., & Bühner, M. (2014). Much ado about nothing, or much to do about something? Effects of scale shortening on criterion validity and mean differences. Journal of Individual Differences, 35(4), 245–9. Hinkin, T. R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967–88. Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–21. Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99–114. Jackson, D. N. (1970). A sequential system for personality scale development.
In Current topics in clinical and community psychology (Vol. 2, pp. 61–96). Elsevier. Jenkins, G. D., & Taber, T. D. (1977). A Monte Carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62(4), 392–8. John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 461–91). Guilford Press. Kam, C. C. S., & Meyer, J. P. (2015). How careless responding and acquiescence response bias can influence construct dimensionality: The case of job


satisfaction. Organizational Research Methods, 18(3), 512–41. Karren, R. J., & Woodard-Barringer, M. (2002). A review and analysis of the policy-capturing methodology in organizational research: Guidelines for research and practice. Organizational Research Methods, 5(4), 337–61. Keller, T., & Dansereau, F. (2001). The effect of adding items to scales: An illustrative case of LMX. Organizational Research Methods, 4(2), 131–43. Klein, H. J., Molloy, J. C., & Brinsfield, C. T. (2012). Reconceptualizing workplace commitment to redress a stretched construct: Revisiting assumptions and removing confounds. Academy of Management Review, 37(1), 130–51. Klein, H. J., Cooper, J. T., Molloy, J. C., & Swanson, J. A. (2014). The assessment of commitment: Advantages of a unidimensional, target-free approach. Journal of Applied Psychology, 99(2), 222–38. Kleiner, B., Lipps, O., & Ferrez, E. (2015). Language ability and motivation among foreigners in survey responding. Journal of Survey Statistics and Methodology, 3(3), 339–60. Korman, A. K. (1974). Contingency approaches to leadership. In J. G. Hunt & L. L. Larson (Eds.), Contingency approaches to leadership. Southern Illinois University Press. Kostyk, A., Zhou, W., & Hyman, M. R. (2019). Using surveytainment to counter declining survey data quality. Journal of Business Research, 95, 211–219. Krosnick, J. A. (2018). Questionnaire design. In The Palgrave handbook of survey research (pp. 439–455). Palgrave Macmillan. Liden, R. C., Wayne, S. J., Meuser, J. D., Hu, J., Wu, J., & Liao, C. (2015). Servant leadership: Validation of a short form of the SL-28. The Leadership Quarterly, 26, 254–69. Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables: When “good” indicators are bad and “bad” indicators are good. Psychological Methods, 4(2), 192–211. Lozano, L. M., García-Cueto, E., & Muñiz, J. (2008).
Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4(2), 73–9. Malhotra, N. K. (2006). Questionnaire design and scale development. The handbook of marketing research: Uses, misuses, and future advances, 83–94. Meuser, J.D., & Cao, X. (2013, August). I’m good because of what I get: A meta-analytic mediation model of organizational identification. Paper presented at the annual meeting of the Academy of Management, Lake Buena Vista, FL.



Meyer, J. P., Allen, N. J., & Smith, C. A. (1993). Commitment to organizations and occupations: Extension and test of a three-component conceptualization. Journal of Applied Psychology, 78, 538–51. Moore, K. A., Halle, T. G., Vandivere, S., & Mariner, C. L. (2002). Scaling back survey scales: How short is too short?. Sociological Methods & Research, 30(4), 530–67. Nash, A. N. (1965). A study of item weights and scale lengths for the SVIB. Journal of Applied Psychology, 49(4), 264–9. Pace, V. L., & Brannick, M. T. (2010). How similar are personality scales of the “same” construct? A meta-analytic investigation. Personality and Individual Differences, 49(7), 669–76. Pathki, C. S., Kluemper, D. H., Meuser, J. D., & McLarty, B. D. (2022). The org-B5: development of a short work frame-of-reference measure of the big five. Journal of Management, 48, 1299–1337. Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879–903. Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social science research and recommendations on how to control it. Annual Review of Psychology, 63, 539–69. Reynolds, W. M. (1982). Development of reliable and valid short forms of the Marlowe-Crowne social desirability scale. Journal of Clinical Psychology, 38, 119–25. Roberson, M. T., & Sundstrom, E. (1990). Questionnaire design, return rates, and response favorableness in an employee attitude questionnaire. Journal of Applied Psychology, 75(3), 354–57. Rogelberg, S. G., Fisher, G. G., Maynard, D. C., Hakel, M. D., & Horvath, M. (2001). Attitudes toward surveys: Development of a measure and its relationship to respondent behavior. Organizational Research Methods, 4(1), 3–25. Rogelberg, S. G., Church, A. H., Waclawski, J., & Stanton, J. M. (2004). Organizational survey research. 
Handbook of research methods in industrial and organizational psychology, 140–60. Rose, D. S., Sidle, S. D., & Griffith, K. H. (2007). A penny for your thoughts: Monetary incentives improve response rates for company-sponsored employee surveys. Organizational Research Methods, 10(2), 225–40. Rosenthal, R. (1994). Science and ethics in conducting, analyzing, and reporting psychological research. Psychological Science, 5, 127–34. Ryan, A. M., Chan, D., Ployhart, R. E., & Slade, L. A. (1999). Employee attitude surveys in a multinational organization: Considering language and

culture in assessing measurement equivalence. Personnel Psychology, 52(1), 37–58. Sakshaug, J. W., & Eckman, S. (2017). Following up with nonrespondents via mode switch and shortened questionnaire in an economic survey: Evaluating nonresponse bias, measurement error bias, and total bias. Journal of Survey Statistics and Methodology, 5(4), 454–79. Schmitt, N., & Stults, D. M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9(4), 367–73. Schriesheim, C. A., & Hinkin, T. R. (1990). Influence tactics used by subordinates: A theoretical and empirical analysis and refinement of the Kipnis, Schmidt, and Wilkinson subscales. Journal of Applied Psychology, 75, 246–57. Schuman, H., & Presser, S. (1977). Question wording as an independent variable in survey analysis. Sociological Methods & Research, 6(2), 151–70. Sheth, J. N., & Roscoe, A. M. (1975). Impact of questionnaire length, follow-up methods, and geographical location on response rate to a mail survey. Journal of Applied Psychology, 60(2), 252–4. Simms, L. J. (2008). Classical and modern methods of psychological scale construction. Social and Personality Psychology Compass, 2(1), 414–33. Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12, 102–11. Smith, M. J., Colligan, M. J., & Tasto, D. L. (1979). A questionnaire survey approach to the study of the psychosocial consequences of shiftwork. Behavior Research Methods & Instrumentation, 11(1), 9–13. Stanton, J. M., & Rogelberg, S. G. (2001). Using internet/intranet web pages to collect organizational research data. Organizational Research Methods, 4(3), 200–17. Stein, G. (1937). Everybody’s autobiography. Random House. Strahan, R., & Gerbasi, K. C. (1972). Short, homogeneous versions of the Marlowe-Crowne Social Desirability Scale. Journal of Clinical Psychology, 28, 191–3. Subramanian, D. R. (2017).
Product promotion in an era of shrinking attention span. International Journal of Engineering and Management Research, 7, 85–91. Subramanian, D. R. (2018). Myth and mystery of shrinking attention span. International Journal of Trend in Research and Development, 5, 1–6. Tay, L., & Jebb, A. T. (2018). Establishing construct continua in construct validation: The process of continuum specification. Advances in Methods and Practices in Psychological Science, 1(3), 375–88.


Thomas, K. W., & Kilmann, R. H. (1975). The social desirability variable in organizational research: An alternative explanation for reported findings. Academy of Management Journal, 18(4), 741–52. Thompson, L. F., Surface, E. A., Martin, D. L., & Sanders, M. G. (2003). From paper to pixels: Moving personnel surveys to the Web. Personnel Psychology, 56(1), 197–227. Van Knippenberg, D., & Sitkin, S. B. (2013). Critical assessment of charismatic–transformational leadership research: Back to the drawing board? Academy of Management Annals, 7, 1–60. Wechsler, D. (1967). Manual for the Wechsler preschool and primary scales of intelligence. The Psychological Corporation. Widiger, T. A., & Trull, T. J. (1992). Personality and psychopathology: An application of the five-factor model. Journal of Personality, 60, 363–93. Wood, D., & Harms, P. D. (2016). On the TRAPs that make it dangerous to study personality with personality questionnaires. European Journal of Personality, 30, 327–8. Yammarino, F. J., Skinner, S. J., & Childers, T. L. (1991). Understanding mail survey response behavior: A meta-analysis. Public Opinion Quarterly, 55(4), 613–39. Yousefi-Nooraie, R., Marin, A., Hanneman, R., Pullenayegum, E., Lohfeld, L., & Dobbins, M. (2019). The relationship between the position of name generator questions and responsiveness in multiple name generator surveys. Sociological Methods & Research, 48(2), 243–62. Zickar, M. J. (2020). Measurement development and evaluation. Annual Review of Organizational Psychology and Organizational Behavior, 7, 213–32. Ziegler, M., Poropat, A., & Mell, J. (2014). Does the length of the questionnaire matter? Expected and unexpected answers from generalizability theory. Journal of Individual Differences, 35, 205–61.

9
Response Option Design in Surveys
Gavin T. L. Brown and Boaz Shulruf

INTRODUCTION

Structured response formats capture and quantify the degree or intensity (i.e., mild to strong) and valence (i.e., positive vs. negative) of opinion, attitude, or beliefs (Allport, 1935). These prespecified response formats make rapid and consistent responding much easier for participants, enabling them to indicate their truthful position. Schaeffer and Dykema (2020) provide an updated and elegant taxonomy of how response options can be structured according to the nature of the phenomenon under investigation. Alvarez and Van Beselaere (2005) extend this list to what can be done within digital survey interfaces.

• Dichotomous options (e.g., yes–no, like–unlike me) allow identification of the occurrence or presence of an event or quality
• Requests for discrete values allow participants to identify a specific value (e.g., number of times a participant has experienced harassment in the last month)
• Open-ended options are useful only when the prompt does not have pre-specified responses

• Ordered selection strategies allow identification of (a) relative or absolute frequency, (b) intensity, (c) agreement, (d) importance, and so on
• Drop-down boxes and radio buttons work well when participants must choose only one option
• Checkboxes permit multiple selection of all options that apply

The most commonly used response system is Likert’s (1932) balanced, 5-point, agreement rating scale. The Likert scale of 5 balanced options has been widely transferred to other constructs such as frequency, confidence, truthfulness, or certainty (Gable & Wolf, 1993). The choice of response format should be determined by the question being posed. In other words, it is better to respond to a question about satisfaction with intensity of satisfaction, rather than intensity of agreement that one is satisfied (e.g., “How satisfied are you?: not at all … extremely” is better than “I am extremely satisfied: strongly agree … strongly disagree”; Schaeffer & Dykema, 2020, p. 49). Because the design of response scales influences participant psychology, it is important to understand design features that enhance the validity of scale information. Further, because differences in response option design matter to how the data can be analyzed, we also consider statistical issues related to the different designs.

SCALE LENGTH

In applied settings, it has taken considerable time to establish some consensus around the design of ordered response options for degree or intensity (i.e., mild to strong) and valence (i.e., positive vs. negative) of opinion, attitude, or beliefs (Schaeffer & Presser, 2003). Ordered rating scales can be unipolar (e.g., none to very large value) or bipolar (i.e., very negative to very positive). In this section we examine decisions that have to be made around the number of options given and the related issue of whether a midpoint or neutral option should be used.

Number of Options

The number of response options required for Likert-type questionnaires has been fiercely debated ever since Rensis Likert introduced his 5-point approval scale (Likert, 1932). The classic Likert scale used a neutral midpoint and two mirrored pairs of options: approve/disapprove and strongly approve/strongly disapprove. Since then, the word agree has often been substituted for approve. Research into scale length involves two aspects: how humans respond when faced with different numbers of options, and how different lengths affect statistical properties.

Human Responses. The first implication of the balanced 5-point Likert scale is that, with only two positive or two negative options, variance in responding is highly constrained. All positive or all negative respondents are separated by only two choices, meaning that discrimination among respondents is restricted. Simms et al. (2019) concluded that too few options do not allow respondents space to express the nuances of their self-report, while too many options may confuse participants by forcing them to make fine-grained distinctions in response to items for which their response is more coarse-grained. Hence, the appropriate scale length depends in part on the nature of the phenomenon being evaluated. If participants are knowledgeable, then longer scales may be more effective because participants have a basis for making more refined distinctions. Other factors constrain human capability to deal with long scales. The first is the constraint imposed by


working memory (Miller, 1956), which is limited to 7±2 units, leading to the idea that scales with more than 7 points would challenge human ability to hold in mind the meaning of the options while answering items. Indeed, respondents’ ability to distinguish between categories declines as the number of response options increases (Contractor & Fox, 2011), suggesting the optimal number of responses within a Likert-type scale should be no more than 7. Second, humans are prone to biases in decision making (Kahneman, Slovic, & Tversky, 1982) that create systematic error when they respond to rating scales. Biased and, thus, invalid response styles include tendencies to:

• use either central or extreme options,
• portray oneself in a socially desirable fashion, or
• acquiesce to the expectations of the surveyor (Messick, 1991; OECD, 2013).

Respondents are also sensitive to cultural norms, meaning that different social groups may rarely exhibit negative or positive attitudes (Shulruf, Hattie, & Dixon, 2008). The importance of previous experience within a context means that scale-length selection should conform to contextual norms (Batista-Foguet et al., 2009). Shulruf, Hattie, and Dixon (2008) demonstrated that longer scales (i.e., 11 points) were more sensitive to response styles, concluding that fewer options would reduce the impact of biases. Hence, from a human perspective, scales longer than 7 points are unlikely to be helpful, except in societies where experience with longer scales or expertise in the domain being evaluated is common.

Statistical Issues. Most survey research exploits classical test theory (CTT) methods to determine item quality (i.e., difficulty, discrimination) and test quality (i.e., mean, standard deviation, standard error of measurement) (Hambleton & Jones, 1993).
CTT methods presume that:

1 item responses are normally distributed,
2 error residuals are independent of each other, and
3 correlation among items reflects a common latent trait or factor that explains such responding.

When CTT methods are used, 5-option scales have better reliability than 2- to 4-option scales (Maydeu-Olivares et al., 2009). Under CTT, there is a significant literature indicating that estimates of scale reliability are greater with more response options (Alwin, 1997; Lozano, García-Cueto, & Muñiz,



2008; Leung, 2011; Wu & Leung, 2017). Indeed, Simms et al. (2019) found that internal estimates of consistency (i.e., Cronbach’s alpha) increased steeply between two and four responses, but after five options alpha increased only marginally. A range of empirical studies support the 7-point rating scale threshold, demonstrating that there is little to no improvement in scale psychometric properties when the number of options is greater than 7 (Bandalos & Enders, 1996; Lozano, GarcíaCueto, & Muñiz, 2008; Contractor & Fox, 2011; Duvivier et  al., 2011). Further, there is evidence that longer response scales can compensate when few items exist in a scale (Maydeu-Olivares et al., 2009). However, this should be approached cautiously as error in one place may not be compensated by decreasing errors elsewhere. In addition to evaluating internal estimates of scale reliability, test-retest consistency or reliability research suggests that no less than 5 options be used (Weng, 2004). However, Dolnicar (2021) presents evidence that test-retest reliability is much higher when given just two options (forcedchoice binary) than in 5- or 7-point balanced rating scales. Because researchers want to create scale scores, the impact of length on mean scores also matters. Mean scores decrease as the number of responses increases, particularly between two and four responses (Simms et al., 2019). Despite inconclusive literature regarding the optimal number of responses (Chyung, Roberts, Swanson, & Hankinson, 2017), there is sufficient evidence to suggest that in most cases between 5- and 7-point scales are preferable. To help resolve the length issue, it is important to consider the impact of including a neutral midpoint in bipolar scales that use odd numbers.
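Because Cronbach’s alpha is the statistic most of these studies report, a minimal sketch of its computation may be useful; the response matrix below is invented purely for illustration:

```python
import numpy as np

# Illustrative data: 6 respondents x 3 items on a 5-point scale (hypothetical values).
X = np.array([[4, 5, 4],
              [2, 2, 3],
              [5, 4, 5],
              [1, 2, 1],
              [3, 3, 4],
              [4, 4, 4]])

# Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
k = X.shape[1]
item_vars = X.var(axis=0, ddof=1)        # per-item sample variances
total_var = X.sum(axis=1).var(ddof=1)    # variance of each respondent's summed score
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
```

Any statistics package will report the same value from the same formula; the point is only that alpha rises as inter-item covariance (and hence total-score variance) rises.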

Neutral Midpoint

Bipolar scales traverse the range of strongly negative to strongly positive attitude. Logically, then, there is a neutral midpoint between both ends. However, the neutral midpoint (i.e., neither approve nor disapprove) can create ambiguity in interpreting what is meant when this option is selected (Guy & Norvell, 1977; Garland, 1991; Weems & Onwuegbuzie, 2001; Krosnick, 2002; Shulruf, Hattie, & Dixon, 2008; Krosnick & Presser, 2009; Weijters, Cabooter, & Schillewaert, 2010; Kulas & Stachowski, 2013; Nadler, Weston, & Voyles, 2015; Schaeffer & Dykema, 2020). Despite the face value of a neutral midpoint, this option may not help identify respondents’ true perspectives. Participants may choose the midpoint because they are truly neutral, but it may also be that:

a they do not wish to divulge their true opinion,
b they don’t understand the prompt,
c they really don’t know what they think,
d they don’t care enough to give an honest answer,
e they are unsure,
f agreement or disagreement depends on some condition that has not been stated in the item, or
g they do not know enough to form an opinion.

Not knowing why the midpoint was selected compromises the validity of any interpretation taken from such data. This means there is sufficient doubt as to the meaning of a midpoint in a bipolar scale to question the validity of its use. Chyung et al. (2017) provide conditions under which a midpoint can be omitted, including:

• when participants are unfamiliar or uncomfortable with the survey topic,
• when the researcher can expect participants not to have formed their opinion about the topic,
• when there is strong social desirability pressure, or
• when participants are likely to show satisficing behavior.

By omitting a neutral midpoint (i.e., using an even number of options), respondents are forced to take a stand as to where on the scale their views best lie. This should provide a more valid response but may lead to more missing responses (Chyung et al., 2017). Unless the research goal is to investigate people who select the midpoint or neutral option, it seems conceptually ambiguous to offer “neither” or “neutral” as legitimate options. Forced choice in one direction or the other seems preferable, even if this results in somewhat lower response rates. Consequently, our advice, based on our experience as survey researchers in education and medicine, is to avoid neutral midpoints in bipolar scales. Based on that analysis, we would recommend 6-point response scales, rather than 5- or 7-point scales that contain a midpoint. We acknowledge that our recommendation may contradict the opinions of other survey researchers who prefer scales with midpoints.
A somewhat novel approach for situations in which strong social desirability is likely is to pack the response scale in the direction in which participants are likely to endorse a phenomenon (Lam & Klockars, 1982). For example, subordinates are highly likely to experience strong pressure to agree with the policies under which they are employed, even if they are not enthusiastic about them. In this situation, it is difficult to choose negative options, but choosing very weakly positive answers (e.g., slightly agree) is a way to indicate a low level of endorsement while still fulfilling the expected positive reply. Packing the response options in the expected positive direction gives more options in the positive space, producing more variation in responses and superior statistical characteristics relative to equally spaced or balanced response scales (Klockars & Yamagishi, 1988). This approach has been used extensively in surveys with teachers (Brown, 2006) and students (Brown, 2013) in multiple contexts. Thus, whenever researchers can reliably anticipate the direction of participant attitude, packing the response options in that direction can be recommended. A second alternative to social desirability and acquiescence is to use a relative frequency scale (e.g., never, rarely, seldom, occasionally, often, very often, always), which has the advantage of a within-person fair comparison of items, reducing bias because the relative frequency is controlled within each person (Shulruf, Hattie, & Dixon, 2008; Eggleton et al., 2017).

OPTION WORDING

Once the structure of the response scale is set, researchers have to resolve what labels, if any, to place on each option. We consider especially the challenges when investigating how often participants think, feel, or experience a phenomenon.

Anchor Labels

There are many approaches to giving verbal labels to rating scale response options. Osgood’s (1964) semantic differential scale anchored only the two bipolar ends of a 7-point continuum, under the assumption that each point between the two ends was an equal interval. Psychophysical measurements (e.g., hand grip, line length) are anchor-free approaches that depend on direct measurement of the physical phenomenon to infer attitudinal direction and strength (Lodge, 1981). Survey tools, such as Qualtrics,1 provide sliders with only endpoints marked (e.g., 0–100%) so that options are selected as a point on the line between the two endpoints to indicate direction and intensity of attitude. Precision in responses can be derived from the actual points used, much in the spirit of psychophysics. Nonetheless, a slider is more vulnerable to biases since it offers unlimited response options, which, despite its mathematical advantages, may compromise the validity of responding. A related option, clearly designed to enable data entry, is to put numbers beside each option button or box. This gives the appearance of an equal interval scale but may introduce confusion if any verbal labels have a different semantic value to the participant than the assigned numeric value. For example, the numeric value 2 for the second option “disagree” may seem incoherent with a negative response. Images can be designed to reflect ordinal increments, such as incrementally ordered smiley-faces (i.e., lowest = severe downturn of the mouth to highest = very much upturned smiley mouth, possibly with open mouth) for use in early schooling attitude measurement. ‘Otunuku and Brown (2007) reported self-efficacy and subject interest results from school children as young as nine years of age with this approach. Actual photographs have been inserted into response options to help participants visualize metaphorically the direction and intensity of their response (Leutner et al., 2017). Of course, how participants understand metaphorical images compared to verbal anchors is not yet well established, so such images should be used cautiously. There are questions, however, as to whether each ordinal option in a rating scale should be labeled. There is strong consensus that, at a minimum, the end points of rating scales should be clearly labeled (Dixon, Bobo, & Stevick, 1984). The labeling of all response options seems desirable, if for no other reason than that it guides the participant as to the meaning of their choice and gives the researcher a constant, stable meaning for an option (Schaeffer & Dykema, 2020). The choice of verbal anchor is itself challenging. Likert’s original options of approve, strongly approve have been extended to include adverbs of varying intensity (Simms et al., 2019; Casper et al., 2020). Research has used panels of judges to calibrate adverbs to establish the intensity of each adverb term, leading to selection of options that have quasi-equivalent intervals.
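The calibration logic just described can be sketched computationally: given judge-calibrated intensity values for candidate anchors, choose the subset whose values best match an equally spaced grid. The values and function below are hypothetical stand-ins, not data or methods from Casper et al. (2020):

```python
from itertools import combinations
import numpy as np

# Hypothetical calibrated intensity values (0-100) for candidate anchors.
values = {"strongly disagree": 1, "disagree": 15, "slightly disagree": 41,
          "slightly agree": 64, "moderately agree": 69, "agree": 83,
          "strongly agree": 99}

def pick_anchors(values, k):
    """Pick k anchors whose calibrated values best match an equally spaced grid."""
    lo, hi = min(values.values()), max(values.values())
    grid = np.linspace(lo, hi, k)          # ideal equally spaced target values
    ordered = sorted(values, key=values.get)
    best = min(combinations(ordered, k),   # combinations preserve ascending order
               key=lambda combo: sum((values[c] - g) ** 2
                                     for c, g in zip(combo, grid)))
    return list(best)
```

For small anchor pools an exhaustive search like this is cheap; for k = 2 it simply returns the two endpoint anchors.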
An unpublished study by Hattie and Klockars (Hattie, personal communication, February 1999), using a similar method, found adverbs that provide nearly equal intervals for a positively packed scale. Table 9.1 provides verbal anchors for 5- to 7-point balanced agreement rating scales and a 6-point positively packed scale, using score values reported by Casper et al. (2020). Schaeffer and Dykema (2020) recommended 5-option terms with corresponding numeric values for three other scale types:

• Unipolar intensities: 0 = “not at all,” 1 = “slightly” or “a little,” 2 = “somewhat” or “moderately,” 3 = “very,” 4 = “extremely”;



Table 9.1  Agreement/disagreement intensity options by scale length and design selected for quasi-equivalent distances*

Valence   5-point balanced             6-point balanced     6-point positively packed   7-point balanced
98.59     Strongly agree               Strongly agree       Strongly agree              Strongly agree
82.62     —                            Agree                —                           Agree
—         —                            —                    Mostly agree                —
69.01     Moderately agree             —                    Moderately agree            Moderately agree
64.29     —                            Slightly agree       Slightly agree              Slightly agree
50.08     Neither agree nor disagree   —                    —                           Neither agree nor disagree
41.19     —                            Slightly disagree    —                           Slightly disagree
—         —                            —                    Mostly disagree             —
15.32     Disagree                     Disagree             —                           —
1.21      Strongly disagree            Strongly disagree    Strongly disagree           Strongly disagree

*Scores taken from Table A1 of Casper et al. (2020)

• Relative frequencies: 0 = “never,” 1 = “rarely,” 2 = “sometimes,” 3 = “very often,” 4 = “extremely often,” or 5 = “always”; and
• Quantities: 0 = “none,” 1 = “a little” or “a little bit,” 2 = “some” or “somewhat,” 3 = “quite a bit,” and 4 = “a great deal.”

Readers are encouraged to select terms that maximize equivalence in perceptual distance between adverbs and to test these when surveying samples from non-American populations. A related question in designing response options is how to minimize the impact of response styles, such as always picking the same option to speed non-effortful completion. A simple solution is to introduce “not” into the item stem to catch a person who may be faking or not paying attention. However, a negatively worded item is cognitively demanding: the respondent must discern which end of the scale indicates their true response. Strong evidence exists that the insertion of “not” into an item changes the psychological response mechanism (Schriesheim & Hill, 1981; Brown, 2004; Schaeffer & Dykema, 2020; Ford & Scandura, this volume). While some might wish to keep “not” items, perhaps by reverse scoring, it is doubtful that such a process creates factors and item statistics equivalent to those of positively worded items (Chang, 1995; Roszkowski & Soven, 2010). Hence, we strongly advise avoiding negatively worded items.
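For readers who nonetheless retain reversed items, standard reverse scoring simply mirrors the code about the scale midpoint; a minimal sketch:

```python
def reverse_score(raw, low=1, high=5):
    """Map a response on a low..high rating scale to its mirror image.

    recoded = (high + low) - raw, so on a 5-point scale 1 <-> 5, 2 <-> 4, etc.
    """
    if not low <= raw <= high:
        raise ValueError("response outside scale range")
    return high + low - raw

print([reverse_score(r) for r in [1, 2, 3, 4, 5]])  # [5, 4, 3, 2, 1]
```

Note that this arithmetic guarantees only a mirrored code, not psychometric equivalence, which is precisely the concern raised above.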

Frequency

Frequency response items are problematic, compared to agreement responses, because human recall of the frequency with which events happen or are engaged in is prone to multiple memory errors (Schacter, 1999; Schwarz, 1999). This contrasts with a person’s ability to state their current status. Respondents are influenced by the structure of the response format provided by the survey researcher (Coughlin, 1990). The salience and importance of life events matters to the consistency of responding between self-administered surveys and interviews (Harris & Brown, 2010). Responses are more consistent around events such as having been treated for cancer as opposed to the amount of alcohol drunk in a week. Consequently, minimizing participant reliance on their memory for how often or when events took place is strongly recommended (Dillman et al., 2014). Nonetheless, sometimes participant perception of the factual frequency of a phenomenon is needed. Because respondents recall recent and more prominent events better than distant and mundane events (i.e., recall bias), they rely heavily on the design of response options to guide their thinking. Participants rely on the researcher to present frequency scales based on what normal frequencies are. For example, if a frequency scale is centered on daily, with options ranging from every quarter-hour to monthly, respondents will tend to assume that daily is relatively normal and, if they consider themselves “normal,” will use that option to orient themselves. That means participants who consider themselves “normal” or “average” tend to use middle-of-the-scale response options. Alternatively, respondents may use the first anchor of the response scale as a reference by which they base their answers (Lavrakas et al., 2019; Yan & Keusch, 2015). Despite these vulnerabilities, relative frequency scales have the advantage of within-person control, meaning that recall bias may be less of a problem, as it consistently becomes part of the individual’s perception at the time at which they responded to the survey. Thus, their individual perception of the “normative” or expected frequency of events is what anchors their responses (Strack, 1992). The challenge for response options lies in the vocabulary used to describe the frequency of events. In many cases, the terms employed for response scales are vague quantities or frequencies rather than specific and natural intervals. Casper et al. (2020) provided relative values for 46 frequency terms, showing an average step increment of just 2.20 points (SD = 4.22). This means that many frequency terms are close to each other (e.g., “often,” “pretty often,” “frequently,” “commonly,” and “quite often” fall within 4 points of each other), so many respondents will not be able to reliably distinguish them. Moreover, although respondents may agree that a set of terms describes a relatively rare event, it is difficult for lay people to know which describes rarer occurrences of events than others (Foddy, 2003). Toepoel, Das, and van Soest (2009) summarize the literature and their own findings on this topic as:

response categories have a significant effect on response formulation in mundane and regular questions (questions for which estimation is likely to be used in recall and that refer to an event occurring regularly), whereas in salient and irregular questions (questions in which direct recall is used in response formatting and the event occurs episodically) the response categories do not have a significant effect. (Toepoel et al., 2009, p. 372)

One approach to increasing the validity of a frequency scale is to anchor frequencies in a precise description of time, such as:

a at least once a day
b less than daily but more than once a week
c less than weekly but at least once a month
d less than once monthly but at least once in six months
e happens less than once in six months
f never happens.
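When such time-anchored categories must later be treated numerically, one option is to assign each category an approximate events-per-year rate. The numeric values in this sketch are illustrative guesses, not values endorsed by this chapter:

```python
# Hypothetical mapping of the time-anchored options above to rough
# events-per-year rates (midpoint-style guesses) for use in analysis.
rate_per_year = {
    "at least once a day": 365,
    "less than daily but more than once a week": 150,
    "less than weekly but at least once a month": 25,
    "less than once monthly but at least once in six months": 4,
    "happens less than once in six months": 1,
    "never happens": 0,
}

responses = ["never happens", "at least once a day",
             "happens less than once in six months"]
rates = [rate_per_year[r] for r in responses]
```

The mapping preserves the options’ order, but the chosen rates embed assumptions about typical behavior that should be justified for the event being studied.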


An alternative approach is to use 5 or 6 frequency responses across the extremes of a continuum, with anchor points for each option that clearly indicate visually the intended difference between each option (Figure 9.1). Such a scale will be understood by all respondents in the same way and will avoid ambiguity of meaning. Even people who have relatively low levels of literacy or poor vocabulary (e.g., due to English being their second language) may find this scale meaningful and not confusing. Nonetheless, a concrete frequency scale may engender recall bias that compromises the validity of the data reported (Coughlin, 1990). With large samples (i.e., n > 500) this dilemma can be resolved by simply asking in binary fashion whether an event has occurred in the previous 24 hours. In such cases, the frequency within the population would be at least as accurate as using any other frequency scale, while also eliminating recall bias.

[Figure 9.1 shows a horizontal response line anchored “always” at the left and “never” at the right, with equally spaced checkboxes beneath labels running from “very frequent” through “frequent” to “very infrequent.”]

Figure 9.1  Alternative approach for vague quantity frequency options

Formatting

On-screen survey administration allows for more complex response formats, such as completing matrix tables and putting objects in rank order. Although matrix tables are very efficient, they require substantial effort from respondents. For example, options near the bottom of a list from which participants have been asked to select “all that apply” tend to be selected less often than if each item requires a response (Dillman et al., 2014). While smartphones provide connection to an emailed or web-based survey, their small screens and input devices make reading and responding to survey requests quite difficult (Dillman et al., 2014). On-page and on-screen formatting of response options seem to require very similar considerations (Dillman et al., 2014). Ensuring that response options are laid out accessibly (horizontally in one row or vertically in one column) and in an inherent order (lowest to highest or vice versa) reduces fatigue and inattentiveness. A recent experiment (Leon et al., 2021) with the order in which response options are displayed (negative to positive vs. positive to negative) found that disagree options later in the sequence were selected less often than when presented earlier in the sequence. Note that this may be a cultural norm at work, in that most languages read from left to right. However, Leon et al. (2021) concluded that order of presentation had limited effect on data quality. When designing for small screens, vertical layouts with equal spacing are more likely to avoid having to scroll across to reach a desired option. Unequal spacing in either direction is likely to mislead respondents as to the merit of various options. If required, “don’t know” or “no opinion” options should be physically separated from substantive choices to minimize the possibility that participants will see those as being on the same scale. Regardless of platform, designers should consistently exploit size, font, brightness, color, and physical location parameters. Hence, careful design of screens and response spaces to cater for varying devices is essential; this optimisation capacity is something leading-edge web-survey companies build into their systems, even in China (Mei & Brown, 2018). Nevertheless, ensuring that the survey questions and response options display consistently, and with adequate space and selectability, across all screen sizes and types is essential. When targeting small screens, the use of one question per page makes sense. This also has the advantage of sending data back to the survey host server frequently, although it will take the participant longer to complete the survey. While researchers may want no missing data for any item, forcing participants to answer every item on a screen before being allowed to proceed to the next could hurt participant motivation and content validity. If participants have to answer a question that they do not want to, or for which they have no answer, there is a strong possibility of their quitting the survey altogether or lying; both have bad consequences for completeness and validity (Dillman et al., 2014).
Because input/output errors occur online, survey-specific error messages that help participants return to the survey are strongly recommended.

PSYCHOMETRIC PROPERTIES

To create a score using a rating scale, response options must be converted to numbers. Since Likert (1932), ordinal response options have been assumed to lie on a single continuum and to be equally spaced from the most favorable to the least favorable (Allport, 1935). Further, selection of a higher-ordered category (e.g., agree) implies that the person’s response dominates lower-ordered options (i.e., strongly disagree, disagree, and neither). On that basis, Likert’s (1932) approach was to sum the scores for each item based on the option selected (1 = strongly disapprove to 5 = strongly approve) and use that as the score for each individual. He argued that this was valid because the distribution of the rating options was “fairly normal.” This approach construes the response options as stopping points on a continuous variable, just as centimetre or inch marks are stopping points on the continuous variable length. As Lord (1953) pointed out, the numbers do not know that they have come from a scale that has unequal size intervals. Consequently, when individual items are aggregated into scales, it is common to assume the numbers function in a parametric fashion, just as dichotomous test items do when summed into a test total score (Norman, 2010). Despite being assigned equally spaced numbers (i.e., the distance between 2 and 3 is treated as the same as the distance between 4 and 5), ordered options in a rating scale do not automatically have equal size. This means conventional arithmetic (item 1 = 2, item 2 = 3, item 3 = 2; therefore, scale score = 7, average = 2.33) is not actually defensible (Stevens, 1946) because the options being given numbers do not necessarily have equal sizes (Chyung, Roberts, Swanson, & Hankinson, 2017). Indeed, Liddell and Kruschke (2018) found that most papers using Likert-type rating scales evaluated the data with a continuous metric approach. They showed that an ordered-probit model (i.e., a regression model for three or more ordered values) was more accurate and less biased, particularly when the responses were not normally distributed. If, as suggested by some authors (Capuano et al., 2016; Toapanta-Pinta et al., 2019; Kleiss et al., 2020), the Likert scale is ordinal, then an appropriate method of analyzing the data needs to be implemented.
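The Stevens (1946) point can be made concrete: under a different but equally defensible monotone coding of the same ordered options, mean scores and comparisons between respondents change. A small sketch with invented responses:

```python
import numpy as np

# Two hypothetical respondents answering four items on a 5-point scale (codes 1-5).
a = np.array([1, 1, 5, 5])   # polarized responder
b = np.array([3, 3, 3, 3])   # consistent midpoint responder

mean_a, mean_b = a.mean(), b.mean()   # both 3.0 under equal-interval coding

# Recode with an alternative monotone spacing (same ordering, unequal intervals),
# as if the extreme option sits perceptually farther from its neighbors.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 7}
mean_a2 = np.mean([recode[x] for x in a])   # now 4.0
mean_b2 = np.mean([recode[x] for x in b])   # still 3.0
```

The two respondents are indistinguishable under one coding and clearly different under another, which is why ordinal-aware models such as the ordered probit can reach different conclusions than raw means.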
This is a significant challenge for conventional methods of evaluating Likert-type summated rating scales. To convert unequal-sized response options into a valid numeric scale, it is possible to use item response theory (IRT) methods (Embretson & Reise, 2000) that adjust item parameters according to a logistic formula that accounts for an underlying overall ability or attitude trait. This approach can produce different results for item quality than conventional CTT methods (Shaftel, Nash, & Gillmor, 2012). For example, Maydeu-Olivares et al. (2009) found, under IRT, that 5-option scales had better precision relative to shorter response options. In contrast, under IRT analysis, a 6-point positively packed agreement scale showed relatively uniform peaks in a well-ordered series of option selections (Deneen et al., 2013). Hence, researchers need to be clear what analytic approach they plan to take when designing the length of a response scale. Most survey research aggregates answers to sets of items into theoretically meaningful latent causes or factors. Evidence for the validity of scale coherence most often starts with scale reliability estimation. Although scale reliability is widely reported using Cronbach’s (1951) alpha estimator, alpha is a deficient method because it assumes that each item’s true score has constant and equal variance (Dunn, Baguley, & Brunsden, 2014). More appropriate estimators of the reliability of a set of items include McDonald’s omega and Coefficient H (McNeish, 2018). Stronger evidence of a scale’s integrity is found with factor analytic methods (Bryant & Yarnold, 1995). These methods require multivariate normal distributions, which can be problematic regardless of the number of options, but especially for items scored dichotomously (e.g., like me vs. unlike me) or on short scales (i.e., 2 to 4 options). More sophisticated estimators allow researchers assurance that a set of items is not just conceptually but also statistically coherent. The asymptotically distribution-free estimator allows scale analysis of dichotomous variables (Browne, 1984). For 3- and 4-point scales, several estimators (e.g., diagonal weighted least squares; weighted least squares–mean; or weighted least squares–mean and variance) can compensate for the non-normal, discrete nature of options and provide more accurate fit indices than the continuous-variable estimator maximum likelihood (DiStefano & Morgan, 2014). Fortunately, evidence exists that ordinal rating scales with at least 5 options function as if they were continuous variables (Finney & DiStefano, 2006). At that length, maximum likelihood estimation and parametric methods of estimating scale integrity can be conducted with little fear that inappropriate assumptions are being made.
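For a unidimensional scale with standardized factor loadings and uncorrelated errors, McDonald’s omega can be computed directly from the loadings; the loadings below are hypothetical:

```python
import numpy as np

# Hypothetical standardized loadings for a 5-item one-factor model.
loadings = np.array([0.7, 0.6, 0.8, 0.5, 0.65])
uniqueness = 1 - loadings**2     # each item's error variance under the model

# McDonald's omega (total): shared variance over total variance,
# omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)
omega = loadings.sum()**2 / (loadings.sum()**2 + uniqueness.sum())
```

Unlike alpha, this formula does not force the items to carry equal true-score variance; it uses each item’s own loading.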
Much less prevalent in the survey research field is the use of Thurstone’s approach to scaling, in which items are created to exemplify an “ideal point” (Roberts, Laughlin, & Wedell, 1999). In this approach, items are written for each level of response such that only those at that level would select them (Drasgow et al., 2010). This means that a person with a very positive attitude would not agree with a less positive statement, in contrast to the summated rating approach, which assumes a very positive person would also agree with a less positive item. Roberts, Laughlin, and Wedell (1999) demonstrated response curves for attitude-toward-abortion items clearly showing that strongly pro-abortion people did not endorse items with weaker yet positive attitudes toward abortion. This approach requires unfolding models of IRT for analysis.
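The contrast between dominance and ideal-point (unfolding) response processes can be illustrated with a toy single-peaked response function; this Gaussian-shaped curve is a simplified stand-in for formal unfolding models, not the models used by Roberts, Laughlin, and Wedell (1999):

```python
import numpy as np

def ideal_point_prob(theta, delta, scale=1.0):
    """Toy single-peaked (ideal-point) response function.

    Endorsement probability falls off with the distance between the
    person's attitude (theta) and the item's location (delta).
    """
    return np.exp(-((theta - delta) ** 2) / (2 * scale**2))

# A strongly positive person (theta = 2) endorses an equally extreme item
# (delta = 2) more than a moderate item (delta = 0), unlike a dominance
# model, where endorsement would be at least as high for the milder item.
print(ideal_point_prob(2, 2) > ideal_point_prob(2, 0))  # True
```

The single peak at theta = delta is what makes endorsement of a mild item *drop* for extreme respondents, the signature pattern of unfolding data.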


CONCLUSION

The design of response option systems has an impact on survey takers, who tend to assume that the response options have been intelligently selected to frame normal attitudes and use that information to guide their response (Schwarz, 1999). Response options can elicit unintended response styles (e.g., central tendency, social desirability; Messick, 1991; OECD, 2013) or be subject to cultural norms (Shulruf, Hattie, & Dixon, 2008). To minimize response styles, the survey researcher should attempt to focus participant attention on their current state of mind (e.g., importance, agreement, or preference) and honest self-report. Option design also has implications for the statistical analysis of data, ranging from scale reliability to factor determination and subsequent score analysis. This review makes clear that choice of response options always depends on the content of the survey, the sample of respondents being surveyed, and the type of statistical analysis preferred. Our recommendations are based on our reading of the literature and our experience designing and analyzing survey inventories.

RECOMMENDATIONS

• Response options must align with survey instructions.
• All response options should have a verbal anchor label (not a number).
• Anchor labels need to be equidistant and structured to ensure clarity.
• Scale length should be either 5 or 6 options.
• Scales should avoid midpoints, unless the middle, neutral response has a particularly important meaning and is appropriate to the participant knowledge base.
• Positively or negatively packed scales should be considered when the most likely direction of response is predictable.
• Length and direction of scales should be consistent with cultural and social norms.
• Anchor frequencies with specific time ranges that are meaningful to the type of event.
• Electronic surveys need to minimize eye and hand fatigue and the number of clicks required to respond.
• Statistical analysis must be consistent with the statistical properties of the response scale (i.e., parametric only if option length is 5 or more; otherwise non-parametric methods).



Note 1 survey-module/editing-questions/question-typesguide/question-types-overview/

REFERENCES

Allport, G. W. (1935). Attitudes. In C. Murchison (Ed.), A handbook of social psychology (pp. 798–844). Worcester, MA: Clark University Press.
Alvarez, R. M., & Van Beselaere, C. (2005). Web-based survey. In K. Kempf-Leonard (Ed.), Encyclopedia of social measurement (Vol. 3, P–Y, pp. 955–62). Amsterdam, NL: Elsevier Academic Press.
Alwin, D. F. (1997). Feeling thermometers versus 7-point scales: Which are better? Sociological Methods & Research, 25(3), 318–40. https://doi.org/10.1177/0049124197025003003
Bandalos, D. L., & Enders, C. K. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9(2), 151–60.
Batista-Foguet, J. M., Saris, W., Boyatzis, R., Guillén, L., & Serlavós, R. (2009). Effect of response scale on assessment of emotional intelligence competencies. Personality and Individual Differences, 46(5), 575–80.
Brown, G. T. L. (2004). Measuring attitude with positively packed self-report ratings: Comparison of agreement and frequency scales. Psychological Reports, 94(3), 1015–24.
Brown, G. T. L. (2006). Teachers’ conceptions of assessment: Validation of an abridged version. Psychological Reports, 99(1), 166–70. https://doi.org/10.2466/pr0.99.1.166-170
Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37(1), 62–83. https://doi.org/10.1111/j.2044-8317.1984.tb00789.x
Brown, G. T. L. (2013). Student conceptions of assessment across cultural and contextual differences: University student perspectives of assessment from Brazil, China, Hong Kong, and New Zealand. In G. A. D. Liem & A. B. I. Bernardo (Eds.), Advancing cross-cultural perspectives on educational psychology: A festschrift for Dennis McInerney (pp. 143–67). Information Age Publishing.

Bryant, F. B., & Yarnold, P. R. (1995). Principal-components analysis and exploratory and confirmatory factor analysis. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding multivariate statistics (pp. 99–136). APA.
Capuano, A., Dawson, J., Ramirez, M., Wilson, R., Barnes, L., & Field, R. (2016). Modeling Likert scale outcomes with trend-proportional odds with and without cluster data. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 12(2), 33–43. https://doi.org/10.1027/1614-2241/a000106
Casper, W. C., Edwards, B. D., Wallace, J. C., Landis, R. S., & Fife, D. A. (2020). Selecting response anchors with equal intervals for summated rating scales. The Journal of Applied Psychology, 105(4), 390–409.
Chang, L. (1995). Connotatively consistent and reversed connotatively inconsistent items are not fully equivalent: Generalizability study. Educational & Psychological Measurement, 55(6), 991–7.
Chyung, S., Roberts, K., Swanson, I., & Hankinson, A. (2017). Evidence-based survey design: The use of a midpoint on the Likert scale. Performance Improvement, 56(10), 15–23. https://doi.org/10.1002/pfi.21727
Contractor, S. H., & Fox, R. J. (2011). An investigation of the relationship between the number of response categories and scale sensitivity. Journal of Targeting, Measurement and Analysis for Marketing, 19(1), 23–35.
Coughlin, S. (1990). Recall bias in epidemiologic studies. Journal of Clinical Epidemiology, 43(1), 87–91.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Deneen, C., Brown, G. T. L., Bond, T. G., & Shroff, R. (2013). Understanding outcome-based education changes in teacher education: Evaluation of a new instrument with preliminary findings. Asia-Pacific Journal of Teacher Education, 41(4), 441–56.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method (4th ed.). Hoboken, NJ: Wiley.
DiStefano, C., & Morgan, G. B. (2014). A comparison of diagonal weighted least squares robust estimation techniques for ordinal data. Structural Equation Modeling: A Multidisciplinary Journal, 21(3), 425–38. 915373 Dixon, P. N., Bobo, M., & Stevick, R. A. (1984). Response differences and preferences for all-

Response Option Design in Surveys

category-defined and end-category-defined Likert formats. Educational and Psychological Measurement, 44, 61–6. 0013164484441006 Dolnicar, S. (2021). 5/7-point “Likert scales” aren’t always the best option: Their validity is undermined by lack of reliability, response style bias, long completion times and limitations to permissible statistical procedures. Annals of Tourism Research, 91, 103297. annals.2021.103297 Drasgow, F., Chernyshenko, O. S., & Stark, S. (2010). 75 years after Likert: Thurstone was right! Industrial and Organizational Psychology, 3(4), 465–76. 01273.x Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105(3), 399–412. Duvivier, R. J., van Dalen, J., Muijtjens, A. M., Moulaert, V. R. M. P., van der Vleuten, C. P. M., & Scherpbier, A. J. J. A. (2011). The role of deliberate practice in the acquisition of clinical skills. BMC Medical Education, 11(1), 101. 10.1186/1472-6920-11-101 Eggleton, K., Goodyear-Smith, F., Henning, M., Jones, R., & Shulruf, B. (2017). A psychometric evaluation of the University of Auckland General Practice Report of Educational Environment: UAGREE. Education for Primary Care, 28(2), 86–93. 1268934 Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. LEA. Finney, S. J., & DiStefano, C. (2006). Non-normal and categorical data in structural equation modeling. In G. R. Hancock & R. D. Mueller (Eds.), Structural equation modeling: A second course (pp. 269– 314). Greenwich, CT: Information Age Publishing. Foddy, W. (2003). Constructing questions for interviews and questionnaires: theory and practice in social research. Cambridge University Press. Gable, R. K., & Wolf, M. B. (1993). Instrument development in the affective domain: Measuring attitudes and values in corporate and school settings. (2nd ed.). Boston, MA: Kluwer Academic Publishers. Garland, R. (1991). 
The mid-point on a rating scale: Is it desirable? Marketing Bulletin, 2(3), 66–70. V2_N3_Garland.pdf Guy, R. F., & Norvell, M. (1977). The Neutral Point on a Likert Scale. The Journal of Psychology, 95(2), 199–204. 7.9915880


Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues & Practice, 12(3), 38–47. Harris, L. R., & Brown, G. T. L. (2010). Mixing interview and questionnaire methods: Practical problems in aligning data. Practical Assessment, Research and Evaluation, 15(1). 10.7275/959j-ky83 Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press. Kleiss, I., Kortlever, J., Karyampudi, P., Ring, D., Brown, L., Reichel, L., Driscoll, M., & Vagner, G. (2020). A comparison of 4 single-question measures of patient satisfaction. Journal of Clinical Outcomes Management, 27(1), 41–8. jcomjournal/article/216218/business-medicine/ comparison-4-single-question-measures-patient Klockars, A. J., & Yamagishi, M. (1988). The influence of labels and positions in rating scales. Journal of Educational Measurement, 25(2), 85–96. https:// Krosnick, J. A. (2002). The cause of no-opinion response to attitude measures in surveys: They are rarely what they appear to be. In R. N. Groves, D. A. Dillman, J. L. Eltinge, & R. J. Little (Eds.), Survey nonresponse (pp. 87–100). John Wiley & Sons. Krosnick, J. A., & Presser, S. (2009). Question and questionnaire design. Elsevier. Kulas, J. T., & Stachowski, A. A. (2013). Respondent rationale for neither agreeing nor disagreeing: Person and item contributors to middle category endorsement intent on Likert personality indicators. Journal of Research in Personality, 47(4), 254–62. Lam, T. C. M., & Klockars, A. J. (1982). Anchor point effects on the equivalence of questionnaire items. Journal of Educational Measurement, 19(4), 317–22. Lavrakas, P., Traugott, M., Kennedy, C., Holbrook, A., Leeuw, E., & West, B. (2019). Experimental Methods in Survey Research. John Wiley & Sons. Leon, C. M., Aizpurua, E., & van der Valk, S. (2021). 
Agree or disagree: Does it matter which comes first? An examination of scale direction effects in a multi-device online survey. Field Methods. https:// Leung, S.-O. (2011). A comparison of psychometric properties and normality in 4-, 5-, 6-, and 11-point Likert scales. Journal of Social Service Research, 37(4), 412–21. .2011.580697 Leutner, F., Yearsley, A., Codreanu, S.-C., Borenstein, Y., & Ahmetoglu, G. (2017). From Likert scales to



images: Validating a novel creativity measure with image based response scales. Personality and Individual Differences, 106, 36–40. https://doi. org/10.1016/j.paid.2016.10.007 Liddell, T., & Kruschke, J. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–48. 08.009 Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22, 5–55. Lodge, M. (1981). Magnitude scaling. Thousand Oaks, California: Sage. Lord, F. M. (1953). On the statistical treatment of football numbers. The American Psychologist, 8(12), 750–1. Lozano, L. M., García-Cueto, E., & Muñiz, J. (2008). Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4(2), 73–9. Maydeu-Olivares, A., Kramp, U., García-Forero, C., Gallardo-Pujol, D., & Coffman, D. (2009). The effect of varying the number of response alternatives in rating scales: Experimental evidence from intra-individual effects. Behavior Research Methods, 41(2), 295–308. BRM.41.2.295 McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–33. Mei, B., & Brown, G. T. L. (2018). Conducting online surveys in China. Social Science Computer Review, 36(6), 721–34. 0894439317729340 Messick, S. (1991). Psychology and methodology of response styles. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 161–200). Hillsdale, NJ: LEA. Miller, G. A. (1956). The magical number seven plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2), 81–97. 101.2.343 Nadler, J. T., Weston, R., & Voyles, E. C. (2015). Stuck in the middle: The use and interpretation of midpoints in items on questionnaires. The Journal of General Psychology, 142(2), 71–89. /10.1080/00221309.2014.994590 Norman, G. (2010). 
Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education, 15(5), 625–32. https:// OECD. (2013). OECD guidelines on measuring subjective well-being. Paris, France: OECD Publishing. Osgood, C. E. (1964). Semantic differential technique in the comparative study of cultures. American

Anthropologist, 66(3), 171–200. https://doi. org/10.1525/aa.1964.66.3.02a00880 ‘Otunuku, M., & Brown, G. T. L. (2007). Tongan students’ attitudes towards their subjects in New Zealand relative to their academic achievement. Asia Pacific Education Review, 8(1), 117–28. Roberts, J. S., Laughlin, J. E., & Wedell, D. H. (1999). Validity Issues in the Likert and Thurstone Approaches to Attitude Measurement. Educational and Psychological Measurement, 59(2), 211–33. Roszkowski, M. J., & Soven, M. (2010). Shifting gears: Consequences of including two negatively worded items in the middle of a positively worded questionnaire. Assessment & Evaluation in Higher Education, 35(1), 113–30. 10.1080/02602930802618344 Schacter, D. L. (1999). The seven sins of memory: Insights from psychology and cognitive neuroscience. American Psychologist, 54(3), 182–203. Schaeffer, N. C., & Dykema, J. (2020). Advances in the science of asking questions. Annual Review of Sociology, 46, 37–60. annurev-soc-121919-054544 Schaeffer, N. C., & Presser, S. (2003). The science of asking questions. Annual Review of Sociology, 29(1), 65–88. soc.29.110702.110112 Schriesheim, C. A., & Hill, K. D. (1981). Controlling Acquiescence Response Bias by Item Reversals: The Effect on Questionnaire Validity. Educational and Psychological Measurement, 41(4), 1101–14. Schwarz, N. (1999). Self-reports: How the questions shape the answers. American Psychologist, 54(2), 93–105. 54.2.93 Shaftel, J., Nash, B. L., & Gillmor, S. C. (2012, April 15). Effects of the number of response categories on rating scales. Paper presented at the annual conference of the American Educational Research Association, Vancouver, BC. Shulruf, B., Hattie, J., & Dixon, R. (2008). Factors affecting responses to Likert type questionnaires: Introduction of the ImpExp, a new comprehensive model. Social Psychology of Education, 11, 59–78. Simms, L., Zelazny, K., Williams, T., & Bernstein, L. (2019). Does the number of response options matter? 
Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–66. Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–80. Strack, F. (1992). “Order Effects” in survey research: Activation and information functions of preceding

Response Option Design in Surveys

questions. In N. Schwarz & S. Sudman (Eds.), Context effects in social and psychological research (pp. 23–34). Springer. Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–86. https:// Toapanta-Pinta, P., Rosero-Quintana, M., SalinasSalinas, M., Cruz-Cevallos, M., & Vasco-Morales, S. (2021). Percepción de los estudiantes sobre el proyecto integrador de saberes: análisis métricos versus ordinales. Educación Médica, 22, 358–63. Toepoel, V., Das, M., & van Soest, A. (2009). Design of web questionnaires: The effect of layout in rating scales. Journal of Official Statistics, 25(4), 509–28. 0049124108327123 Weems, G. H., & Onwuegbuzie, A. J. (2001). The impact of midpoint responses and reverse coding on survey data. Measurement and Evaluation in Counseling and Development, 34(3),


166–76. 12069033 Weijters, B., Cabooter, E., & Schillewaert, N. (2010). The effect of rating scale format on response styles: The number of response categories and response category labels. International Journal of Research in Marketing, 27(3), 236–47. https://doi. org/10.1016/j.ijresmar.2010.02.004 Weng, L.-J. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement, 64(6), 956–72. https://doi. org/10.1177/0013164404268674 Wu, H., & Leung, S.-O. (2017). Can Likert scales be treated as interval scales? A simulation study. Journal of Social Service Research, 43(4), 527–32. 1329775 Yan, T., & Keusch, F. (2015). The effects of the direction of rating scales on survey responses in a telephone survey. Public Opinion Quarterly, 79(1), 145–65.



PART 3: Item Development


10 A Typology of Threats to Construct Validity in Item Generation

Lucy R. Ford and Terri A. Scandura

There has been much attention paid in the research methods literature to the issues of construct validity of measures; however, treatment of the issue of item generation is scant and confined to basic coverage in survey research methods textbooks (Dillman, 1978; Fowler, 1993; Dillman, Smyth, & Christian, 2014). Moreover, there is little research on the content validity of survey items, or the extent to which individual items measure the content domain of the construct (Schriesheim, Powers, Scandura, Gardiner, & Lankau, 1993). It is generally accepted that content validity is a necessary precondition for construct validity (Schriesheim et al., 1993), and it seems obvious that poorly written items will result in poor psychometric properties of measures. Thus, we feel that it is time to look upstream in our construct validation process to the genesis of the survey items themselves.

The costs of poor measurement in organizational research are substantial. Survey items that make no sense to respondents may result in high levels of non-response and/or response bias. Participants may even be alienated by emotion-laden questions that are difficult to answer (Narins, 1999). DeVellis (2003) notes that "researchers often 'throw together' or 'dredge up' items and assume they constitute a suitable scale" (p. 11). Such practices result in poor measures and results that may not be replicable. Poor measures produce a lack of construct validity and undermine the ability of statistical procedures to detect statistical significance; even when significance is detected, poor measurement may lead to inappropriate conclusions. For example, while authors often reference the exchange of benefits between a leader and a follower, a commonly used measure (LMX-7) does not directly measure exchange per se (Bernerth, Armenakis, Feild, Giles, & Walker, 2007). We also need to be concerned about whether the items in each scale mean the same thing to each respondent (Fowler, 1993). Another concern is the proliferation of scales in the literature that purportedly measure different things but appear to differ little in content (Shaffer, DeGeest, & Li, 2016). An example is the surprising finding that two measures which should be quite different, engagement and job burnout, overlap considerably. Meta-analytic findings demonstrated that dimension-level correlations between burnout and engagement are high, and that burnout and engagement dimensions exhibit a similar pattern of association with correlates (Cole, Walter, Bedeian, & O'Boyle, 2012). Moreover, that study found that controlling for burnout in meta-regression equations substantively reduced the effect sizes associated with engagement.

Clearly the content of items is fundamental to the advancement of organizational research, yet little attention has been paid to the practice of item generation, either in the literature or in practice. As MacKenzie, Podsakoff, and Podsakoff (2011) note, "it is important to remember that the effectiveness of any content adequacy assessment technique is only as good as the definitions of the construct (and the items) that are developed by the researcher in the first place" (p. 306). The purpose of this chapter is to review the literature on item generation and develop a typology of common threats to construct validity due to poor item generation practices. This typology should prove useful to the researcher who wishes to deliberately avoid these threats to construct validity in their scale development practice. While other authors (e.g. Hinkin, 1995; DeVellis, 2003; Dillman et al., 2014; Hardy & Ford, 2014) have addressed item generation in their work, none of the sources consulted for the development of this typology provides a reasonably comprehensive list of threats to construct validity, with clear prescriptive descriptions allowing avoidance of unnecessary error.

ITEM GENERATION: A BRIEF OVERVIEW

Content validity is a necessary but not sufficient condition for a measure to have construct validity, and item generation is an essential step in the construct validation process (Hinkin, 1995). Content validity has been defined as the degree to which a particular measure reflects a specific intended content domain, while construct validity is concerned with "the extent to which a particular measure relates to other measures consistent with theoretically derived hypotheses concerning the concepts … that are being measured" (Carmines & Zeller, 1991). In many cases the issue of construct validity is handled quite succinctly: "Use published measures." Thus, we have many dissertations and research studies that employ only measures that have been used in prior research and have purportedly acceptable psychometric properties. While this advice appears sound, several problems become apparent upon closer scrutiny. First, many published and commonly used measures contain threats to construct validity due to poorly written items. Many measures in use may therefore not measure the construct they purport to measure, or may be empirically indistinguishable from scales intended to measure similar but distinct constructs, casting doubt on the construct validity of the scale. Second, measures may be well written but time-bound, such that "as years go by, the contents of some test items may become dated. Therefore, periodic revisions may be required to keep test item wording current" (Warner, 2008, p. 871). Third, if only published measures are employed, then the field does not move forward in new and relevant directions. This is analogous to "reshuffling" the same old deck of cards with the same tired constructs (do we really need another study of job satisfaction and organizational commitment, no matter how good the measures are purported to be?). Our field needs new ideas, new constructs and (of course) new measures that represent them well, in order to extend the nomological net beyond currently existing knowledge. In addition, revision of existing scales that have less than desirable construct validity is necessary. For example, the concept of empowerment emerged as a new organizational phenomenon in the 1980s. Spreitzer (1995), while borrowing from and adapting existing measures, carefully developed and validated a measure of empowerment that could be used in organizational research, using a deductive approach. She specified four dimensions of empowerment and presented a preliminary nomological network of relationships between empowerment and organizational concepts. This work moved the field forward so that new studies might be conducted on an emerging, organizationally relevant topic. New measures are thus often necessary to advance organizational knowledge. For example, Brown, Treviño, and Harrison (2005) developed a new measure of ethical leadership; none existed, as it was a relatively new construct. In their paper, they describe a rigorous process of seven studies, each with a unique sample, conducted to develop, evaluate and refine the scale, and then, in the last three studies, to establish the discriminant and predictive validity of the measure. Such thoughtful scale development practices exist in the literature, but they tend to be the exception rather than the rule.

Where Do Survey Items Come From?

As noted by Schwab (1980), content validity is essential to the item generation process. There are generally two ways to develop survey items: deductive and inductive. Deductive scale development begins with a classification scheme or typology prior to the writing of the items and subsequent data collection. This approach requires a thorough understanding of the construct of interest and a complete review of the literature, so that consonant theoretical definitions are generated prior to the development of items. A clear operational definition grounded in theory is required for the deductive approach. In some cases, subject matter experts are asked to review the items to ensure that they accurately reflect the domain of the construct.

The second method is the inductive method, which requires little theory prior to the writing of the items. The researcher generates measures from individual items. With this approach, researchers usually develop items by asking a pretest sample of respondents to describe their work experiences, reactions to something that is happening in the workplace, or aspects of their own behavior at work. For example, the researcher might ask, "Tell me about your reactions to your organization undergoing downsizing." Responses might include "The downsizing is causing me stress," or "I have seen others lose their jobs, and I am worried that I will lose mine." Content analysis is then used to classify the responses and develop items on the basis of them.

To examine practices employed to establish content validity, Hinkin (1995) reviewed a sample of 75 studies published between 1989 and 1993 that developed new measures. With respect to item generation procedures, he found that most studies (83%) employed deductive methods, 11% were inductive, and 6% employed a combination of methods. Deductive approaches using a classification scheme or typology are by far the most common, which may be due to the emphasis on strong theory in the major journals. Inductive approaches were far less common; they are used when there is little or no theory to guide item writing. We are not advocating one method over the other. Deductive procedures may be more appropriate when there is well-established theory to guide the writing of items. Inductive processes might be necessary for a new line of inquiry for which the input of respondents is needed. Ideally, both processes should be used in a reflexive manner; however, only a small percentage of studies in Hinkin's review employed both. While it is not the purpose of our study to review the approaches employed to generate items, we agree with DeVellis (2003) that the process is often less systematic than desired.
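The content-analysis step of the inductive approach can be sketched in code. The fragment below is a minimal Python illustration, not part of the chapter's method: the themes, keywords, and coding rule are invented for illustration, and real content analysis relies on trained human coders rather than keyword matching.

```python
from collections import defaultdict

# Hypothetical coding scheme: candidate themes and the keywords that signal them.
THEMES = {
    "job_insecurity": ["lose", "laid off", "job loss"],
    "stress": ["stress", "anxious", "worried"],
}

def code_responses(responses):
    """Group open-ended responses under the first theme whose keyword appears in them."""
    coded = defaultdict(list)
    for text in responses:
        lowered = text.lower()
        for theme, keywords in THEMES.items():
            if any(keyword in lowered for keyword in keywords):
                coded[theme].append(text)
                break
        else:  # no theme matched: set aside for a human coder
            coded["uncoded"].append(text)
    return dict(coded)

responses = [
    "The downsizing is causing me stress.",
    "I have seen others lose their jobs, and I am worried that I will lose mine.",
]
print(code_responses(responses))
```

Each coded group then suggests candidate items (here, a stress item and a job-insecurity item) that the researcher would draft and refine.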

ITEM GENERATION THREATS TO CONSTRUCT VALIDITY

When generating items it is important to keep construct validity in mind. A number of potential threats to construct validity relate to the wording of the items themselves and can be avoided during item generation. We present a typology of these threats, defining each on the basis of the research methods literature on item generation, which suggests that these are practices to avoid in the construction of survey items.

Ambiguity

Despite admonitions regarding ambiguity that date back to a classic paper on the construction of attitude scales, the problem persists: "Above all, regardless of the simplicity or complexity of vocabulary or the naiveté or sophistication of the group, each statement must avoid every kind of ambiguity" (Likert, 1973, p. 91). We defined item ambiguity as any statement that is confusing, vague, or otherwise subject to multiple interpretations. Respondents cannot answer such items validly because they may not understand what the item is asking them to respond to (Oppenheim, 1992). For example, an item that asks respondents about their intent to quit their job "in the near future" might be interpreted by some as next week, and by others as within the next six months (Hardy & Ford, 2014). It is more effective, therefore, to provide temporal anchors for time-bound words. Another example is the word "organization." An item that asks respondents whether their "organization treats them well" may invoke different referents in the minds of individual respondents if the term is not clearly defined. For one employee, the organization might be represented by their individual manager, while for another it might be represented by the nameless, faceless executives at a distant headquarters. The interpretations are clearly very different for these two respondents, and the researcher has no way of knowing what the referent was when each responded. Likewise, the term "organization" will mean something different to an executive at the corporate office than to a manual laborer at a remote company site. Alreck and Settle (1985) suggest that the only sure way to avoid such ambiguity is to check every questionable word or phrase in every item thoroughly to ensure that it invokes the same meaning in every respondent.

Leading Questions

We considered leading questions to be items which lead the respondent to believe there is a correct answer (Narins, 1999). For example: "Smoking causes cancer. How much do you smoke?" In this case, the additional information given in the question might prompt the respondent to underreport the amount they smoke. In addition, we considered items to be leading if they contain implicit assumptions, where the researcher assumes that some component must be true (Saris & Gallhofer, 2007). For example, an item such as "When did you last talk with your mentor about your career plan?" assumes that the respondent has a mentor and/or that they have talked with that person about their career plan. The respondent may answer it even if they don't have a mentor or haven't had that conversation, and thus the incidence of the behavior would be overreported. Experiments conducted by Swann and his colleagues found that when a questioner asks a leading question of a respondent, observers use their knowledge of conversational rules to infer that the questioner had an evidentiary basis for the question (Swann, Giuliano, & Wegner, 1982). Respondents treated leading questions as conjectural evidence, as implied by the question. This research also found that when respondents answer leading questions they are in fact misled by them: because they want to cooperate with the person posing the question, they may misrepresent their own personality traits. Leading questions typically arise from the use of non-neutral language, as the examples above demonstrate (Penwarden, 2015).

Double-Barreled

This is the threat to construct validity most frequently mentioned in texts and articles on item generation as a problem to avoid (e.g. Converse & Presser, 1986; Oppenheim, 1992; Hinkin, 1998; Saris & Gallhofer, 2007). We define double-barreled questions as items that ask multiple questions within a single statement (e.g. "Are you happy and well-paid?"). In this case the respondent may be happy or well-paid, but they would be responding to only part of the question (Hinkin, 1998). Saris and Gallhofer define this issue as a situation where "two simultaneously opposing opinions are possible" (2007, p. 87). Stated differently, when a respondent endorses a double-barreled item, it cannot be known which part is the source of the positive (or negative) response. Such items may also confuse survey respondents, since they are not sure exactly what the question is.
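A crude automated screen can surface candidate double-barreled items before expert review. The Python sketch below is only an illustration (the item pool is invented, and a coordinating conjunction is a signal of double-barreling, not proof of it):

```python
import re

# Coordinating conjunctions that often signal two questions fused into one item.
CONJUNCTION = re.compile(r"\b(and|or)\b", re.IGNORECASE)

def flag_double_barreled(items):
    """Return items that contain a coordinating conjunction and deserve human review."""
    return [item for item in items if CONJUNCTION.search(item)]

item_pool = [
    "Are you happy and well-paid?",          # double-barreled
    "I am satisfied with my pay.",           # fine
    "My supervisor is fair or consistent.",  # double-barreled
]
print(flag_double_barreled(item_pool))
```

Flagged items still require human judgment; an idiom such as "rest and relaxation" would be a false positive under this rule.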

Reverse Coding

In some instances, statements that reflect the polar opposite of the construct (e.g. for job satisfaction: "I am unhappy in my job") are employed in survey measures. While using both positive and negative items in a scale is intended to avoid an acquiescence bias (DeVellis, 2003), research on negatively worded items has indicated that these may produce response bias (McGee, Ferguson, & Seers, 1989; DeVellis, 2003; Merritt, 2012). This phenomenon has been studied empirically (Cordery & Sevastos, 1993) and theoretically (Marsh, 1996). In many cases the reverse coded items end up loading on a different factor than the one intended, and researchers are advised to pay careful attention to dimensionality when using reverse scored items. For example, Magazine, Williams, and Williams (1996) demonstrated the method effects resulting from reverse coded items in their examination of a measure of affective and continuance commitment. If reverse coded items are used explicitly to test for careless responding, other methods are available that do not introduce the error associated with a reverse coded item. For example, the researcher might use items that all attentive respondents would be expected to answer in the same manner, such as "I was born on February 30th" (Huang, Curran, Kenney, Poposki, & DeShon, 2011).

Negative Wording

Some authors (e.g. DeVellis, 2003) use this term to refer to reverse coded items. Others, however, use it to mean statements that include negative words (such as "not"), irrespective of whether they measure the construct or its polar opposite. It is this latter meaning to which we refer, as the former is covered by the previous category, reverse coding. The negative meaning is conveyed by the choice of words, such as "restrict," "not," and "control." Negatively worded statements may bias responses, since they may result in lack of endorsement (Converse & Presser, 1986) due to such concerns as social desirability bias or the potential that certain words might trigger a negative emotional reaction in participants (Design, 2017). Belson (1981) discusses items with a negative element as a factor which might impair a respondent's understanding of the item. Another concern, raised by Harrison and McLaughlin (1993), is that a block of negatively worded items can have a cognitive carryover effect, resulting in biased responses to subsequent neutral questions.

Double Negatives

Statements in which a double negative occurs may be particularly confusing for respondents (e.g., "Pay for performance is not an unjust policy"). Double negatives exist when two negative words are linked in the same item (Converse & Presser, 1986; Oppenheim, 1992). Here the intent of the researchers might be to elicit a positive evaluation of pay for performance, but the respondent may indicate a negative response due to the wording (e.g. Hinkin, Tracey, & Enz, 1997). A double negative can also be more subtle and still end up in a survey item. Consider this agree/disagree item: "Please tell me whether you agree or disagree with the following statement about supervisors in organizations: Supervisors should not be required to supervise direct reports when they are leaving work in the parking lot." A person may agree that supervisors should not be required to supervise direct reports in the parking lot, but the disagree side becomes confusing, because disagreement means "I do not think that supervisors should not be required to supervise direct reports when they are in the parking lot." This can happen when researchers don't read aloud and listen to all the questions in a series (Converse & Presser, 1986). Anyone who has seen a double negative in an exam question will recognize the problem; consider asking students to select the correct answer to the question: "Which of the following is the least unlikely result of poor leadership?"
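The reverse coded and attention-check items discussed above imply a mechanical data-screening step at analysis time. Below is a minimal Python sketch; the column names, the 1–5 agree scale, and the agreement threshold are our assumptions for illustration, not prescriptions from the chapter:

```python
def reverse_score(value, scale_max=5, scale_min=1):
    """Reverse-score a Likert response, e.g. 1 -> 5 and 5 -> 1 on a 1-5 scale."""
    return scale_max + scale_min - value

def screen_careless(respondents, check_item="born_feb_30", agree_threshold=4):
    """Drop respondents who agree with an impossible statement
    such as "I was born on February 30th" (Huang et al., 2011)."""
    return [r for r in respondents if r[check_item] < agree_threshold]

# Hypothetical respondent records on a 1-5 agree scale.
data = [
    {"id": 1, "satisfaction_neg": 2, "born_feb_30": 1},  # attentive
    {"id": 2, "satisfaction_neg": 5, "born_feb_30": 5},  # agrees with impossible item
]
kept = screen_careless(data)
for r in kept:
    # Recode the negatively keyed satisfaction item before scale scoring.
    r["satisfaction"] = reverse_score(r["satisfaction_neg"])
print(kept)
```

Screening on the attention check happens before reverse scoring, so careless responders never contaminate the recoded scale scores.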

Jargon

This refers to statements containing technical jargon that might not be understood by all respondents (Oppenheim, 1992). Jargon includes organizational textbook terms and industry buzzwords. For example, an item that asks respondents about emotional intelligence or social learning might not be understood by all respondents. In the information technology field, a question about cloud computing may confuse respondents, as there is no universal definition of the term, and different organizations may view its meaning differently. And while the terms stereotype and prejudice have very distinct meanings for scholars, the average survey respondent is unlikely to differentiate between the two. This particular threat to construct validity is relatively easy to avoid, though, by using the plainest and most commonly used words that accurately convey the intended meaning.

Colloquialisms

This refers to statements in which slang appears. Phrases that may be misunderstood by non-native English speakers, or by respondents who have not kept up with the latest slang, are a threat to content validity. In addition, items may be interpreted differently by respondents from different areas of the U.S., from different countries, or from different industries (e.g., a statement that asks if your supervisor would "go to bat" for you). Oppenheim (1992) warns of the possibility of alternative meanings for colloquialisms. For example, "passing the buck" may mean handing off responsibility in the U.S., but might be interpreted as handing over money in other parts of the world (Dunham & Smith, 1979)! Hardy and Ford (2014) highlighted the different interpretations of words and phrases by those from different geographic regions, such as the word "momentarily," which means "for a moment" to a British English speaker and "in a moment" to an American English speaker. This latter example is not strictly a colloquialism, except to the degree that it creates the same threat to construct validity as the use of colloquialisms does.

Acronyms Abbreviations may represent a concern, as not all respondents may be familiar with them (Oppenheim, 1992). For example, acronyms like LMX (leader–member exchange) or TQM (total quality management) may have no meaning, or different meanings, for respondents in a survey. Further, while organization-specific acronyms may be known to most organizational members, the use of such acronyms without definition might pose a challenge to new organizational members who have not yet learned all the internal language. Furthermore, plenty of acronyms have multiple meanings; CD, for example, may refer to a compact disc or a certificate of deposit.

Prestige Bias Prestige bias refers to statements that prompt the respondent to agree with high-status experts. Statements that begin with “All experts agree that…” might produce agreement rather than disagreement even if the respondent disagrees with the “experts,” because respondents believe they “should” align their response with those who “know best.” Another example would be “Four out of five CEOs surveyed endorse strategic planning. Do you?” Prestige bias has pervasive effects on respondents; for example, it has been shown to affect the peer review process in academic journals (Lee, Sugimoto, Zhang, & Cronin, 2013). Since this effect is present among scholars, who should be aware of such bias, we can expect it to influence the average survey respondent. Most authors who write about prestige bias conflate it with social desirability bias. We believe, however, that they are distinct forms of bias, as it is possible for a question to include prestige bias and skew responses in the direction opposite to that which would be socially desirable.

Social Desirability Bias This possible threat is similar to prestige bias, but without the invocation of a prestigious other. Such statements may prompt bias towards a socially acceptable response, for example, “Would you save a drowning person?” In this case, people are reluctant to answer a question honestly because they are trying to appear socially or politically correct. The motivation to “save face” or impress others might alter the manner in which a respondent answers a question. Due to the potential for this bias, many researchers include a scale in their surveys to detect socially desirable responding (DeVellis, 2003). Moorman & Podsakoff (1992) reported results of a meta-analytic review of 33 studies that examined the relationships between social desirability response sets and organizational behavior constructs. Their findings showed that social desirability, as traditionally measured in the literature through social desirability measures (e.g., Crowne & Marlowe, 1960; Paulhus, 1989), is significantly (although moderately) correlated with several widely used constructs in organizational behavior research. Social desirability bias can create serious construct validity problems, in addition to being cited as one of the causes of self-report bias in organizational research when all of the data are collected from the same source (Donaldson & Grant-Vallone, 2002).
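As a rough illustration of the screening practice described above, one might correlate respondents' scores on a social desirability scale with their scores on the substantive measure and flag a nontrivial association. The data, threshold, and function names below are hypothetical, chosen only for the sketch; they do not come from any published scale.

```python
# Sketch: screening for socially desirable responding (hypothetical data).
# Assumes each respondent has a score on a social desirability (SD) scale
# and a score on the substantive measure of interest.
import math

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for six respondents.
sd_scores        = [2, 3, 4, 5, 6, 7]   # social desirability scale
construct_scores = [3, 3, 4, 5, 5, 6]   # substantive measure

r = pearson_r(sd_scores, construct_scores)
if abs(r) > 0.30:   # arbitrary screening threshold, for illustration only
    print(f"warning: construct correlates r = {r:.2f} with social desirability")
```

A large correlation does not by itself prove contamination, but it signals that responses to the substantive measure may partly reflect impression management rather than the construct.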

Acquiescence Bias The literature offers two distinct definitions of acquiescence bias. The first is when the scale, taken as a whole, encourages the respondent to respond positively to all of the items (e.g., DeVellis, 2003). The second refers to a statement that is likely to prompt a “yes” response in respondents. For example, asking respondents if they read the newspaper every day may produce a “yes” response when, in reality, they may only have time to read the paper a few times a week or on Sundays. Alreck & Settle (1985) suggest that, to avoid this bias, the question should not give any indication of which response is “preferred.”

SUMMARY The preceding sections reviewed a number of potential threats to construct validity that emanate from the writing of survey items themselves. The organizational research literature has recently turned attention to the examination of the items themselves. For example, Carpenter et al. (2016) conducted a meta-analytic study at the item level of analysis on a commonly used scale of task performance and organizational citizenship behavior, and found that several of the items did not perform as reported in the initial validation of the measures. This study underscores the need for researchers to be more diligent in item generation and the construction of measures at the outset, to ensure that measures perform as claimed in the original publication. This is consistent with the recent measure-centric approach to understanding method variance advocated by Spector et al. (2017), in which more attention is paid to developing a theory of the measure. To add to this discussion, we have organized the threats to construct validity arising from item generation into the following typology.

A TYPOLOGY OF THREATS TO CONSTRUCT VALIDITY After developing the list of threats to construct validity, the authors discussed the list, and it became clear that the threats can be categorized into four broad categories, which can be represented in a 2x2 matrix as shown in Figure 10.1. Scale-centered threats (as opposed to context-centered threats) are those items which are inherently flawed in and of themselves. For example, most texts that address construct validity in item development recommend that items not be double-barreled. This reflects the content of the item itself. On the other hand, context-centered threats are those where the item may be inherently well-written, and not contain any scale-centered threats to construct validity, but because of the context, the item is misinterpreted by the participant. For example, there is no problem with using acronyms in an item if every participant is clear exactly what that acronym means in the context in which it is being used. On the other hand, if an acronym has alternative meanings known to the participants, then it is possible they will respond with the wrong one in mind.

Figure 10.1  Matrix of threats to construct validity
Scale-centered – item construction: reverse coding, negative wording, double negatives
Scale-centered – item meaning: double-barreled items, acquiescence bias
Context-centered: colloquialisms, acronyms, jargon, ambiguity, leading questions, prestige bias, social desirability bias

On the other axis of the matrix are item construction and item meaning. Item construction refers specifically to the structure of the item itself. For example, it has been shown numerous times that negatively worded items create a bias (McGee et al., 1989; DeVellis, 2003) that affects the validity of results. This is a very specific problem within an item and is easily avoided. Conversely, item meaning refers to the potential interpretation on the part of the participant. For example, a double-barreled item such as “I am happy with my pay and benefits” may elicit responses from participants in which they are focusing specifically on their pay, or on their benefits, or on both. This, of course, renders comparison among participants problematic.

GENERAL DISCUSSION In the future, we need to conduct more rigorous construct validity studies whenever we develop new scales or adapt existing ones. It is quite common for a researcher to “throw together” some items for a construct he or she needs to measure (DeVellis, 2003, p. 11), include them in a survey, and establish their construct validity through exploratory/confirmatory factor analysis prior to testing the hypotheses in the study. DeVellis (2003, p. 86) suggests that all items should be reviewed by experts prior to use. He notes that the experts serve three purposes: to rate the extent to which they believe each item measures the construct of interest, to identify what you may have failed to include, and, most relevant to this chapter, to assess the clarity and conciseness of the items. Hardy and Ford (2014) go a step further in recommending that a sample of participants be asked what items mean to them, in order to understand how respondents are receiving the items. Relying solely on academic experts may well result in more flawed items than asking respondents what the items mean to them. Ren and colleagues found that under some circumstances those with greater expertise may have stronger cognitive biases than those with less expertise (Ren, Simmons, & Zardkoohi, 2017). This concern is shared by Hardy and Ford (2014), who found that research methodologists made more mistakes in interpreting survey instructions than did regular study participants. We believe that reviewing items with this typology as a guide will help researchers uncover the subtler problems in generated items. Many existing measures could potentially be improved by rewording items to avoid the pitfalls outlined in this chapter and by evaluating the context in which the survey is being administered, in order to assess the likelihood that context-centered threats to construct validity are a concern. Experts with methodological training are likely to “catch” many item problems, particularly those that are scale-centered, as these relate to the construction of the item itself, without concern for context. It has been our experience, based on our work in this area, that context-centered threats to construct validity are more difficult for researchers to identify. In particular, they often do not see an item as ambiguous, because their reaction to it is based on their own particular interpretation of the word(s), grounded in their own life experience. It is not always easy for researchers to step out of their own “mental box” and imagine what other potential interpretations could be made of an item. A potential remedy would be to have multiple experts write out a description of what the item means to them and, in the event of an item referring to the organization, a group, etc., what their referent is.
These descriptions could then be compared for consistency. Hardy and Ford (2014) demonstrated that respondents to surveys often interpret items differently, and that asking a sample of participants to describe the meaning of each item is a useful mechanism for uncovering miscomprehension. If it is clear that respondents are not all interpreting the referent in the same manner, then it would behoove researchers to clearly delineate the appropriate referent in the instructions, in order to ensure that at least most of the respondents are thinking about it in the intended manner. This is not a perfect solution, as not all survey respondents read the instructions (Hardy & Ford, 2014). As organizational researchers rarely, if ever, provide information in their published papers about the instructions that were provided to survey participants, it is difficult to assess whether any given study has been affected by this particular threat to construct validity. Perhaps it is time that we started reporting instructions in our methods sections, in an effort to ensure that measures are comparable across studies. It can be useful to include participants from different geographical regions, different organizations, and with different native languages, in order to uncover other context-centered threats to construct validity in the items. For example, asking participants from different geographic regions to explain what they understand items to mean should uncover colloquialisms, as some participants will be less likely to understand them correctly. This chapter reviewed the threats to construct validity based upon item generation. Many of the problems encountered with common method variance, and other problems emanating from poor measurement, might be alleviated if more time were taken to construct the items in measurement scales from the outset. While we do not advocate specifically for the use of deductive or inductive methods to gain insight into item content, we note that deductive methods are far more common in the literature, and perhaps a combination of the two approaches would be more helpful. By alerting researchers to the threats to construct validity in our typology, we hope that researchers will take more care to follow our recommendations and produce improved measures for business research. These guidelines for developing items, used in conjunction with existing work on concerns about scale length and statistical methods for empirical evaluation of the generated scales, should result in improved scales in our literature.

Note From Ford, L. R., & Scandura, T. A. (2018). A typology of threats to construct validity in item generation. American Journal of Management, 18(2). Reprinted with permission.

REFERENCES
Alreck, P. L., & Settle, R. B. (1985). The survey research handbook. Homewood, IL: Richard D. Irwin, Inc.

Belson, W. A. (1981). The design and understanding of survey questions. Aldershot, Hants, England: Gower Publishing Co. Ltd.
Bernerth, J., Armenakis, A., Feild, H., Giles, W., & Walker, H. (2007). Leader-Member Social Exchange (LMSX): Development and validation of a scale. Journal of Organizational Behavior, 979–1003. doi:10.1002/job.443
Brown, M. E., Treviño, L. K., & Harrison, D. A. (2005). Ethical leadership: A social learning perspective for construct development and testing. Organizational Behavior and Human Decision Processes, 97(2), 117–34. doi:10.1016/j.obhdp.2005.03.002
Carmines, E. G., & Zeller, R. A. (1991). Reliability and validity assessment. Newbury Park: Sage.
Carpenter, N. C., Son, J., Harris, T. B., Alexander, A. L., & Horner, M. T. (2016). Don’t forget the items: Item-level meta-analytic and substantive validity techniques for reexamining scale validation. Organizational Research Methods, 19(4), 616–50. doi:10.1177/1094428116639132
Cole, M., Walter, F., Bedeian, A., & O’Boyle, E. (2012). Job burnout and employee engagement: A meta-analytic examination of construct proliferation. Journal of Management, 38(5), 1550–81. doi:10.1177/0149206311415252
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized questionnaire. Newbury Park, CA: Sage. doi:10.4135/9781412986045
Cordery, J. L., & Sevastos, P. P. (1993). Responses to the original and revised job diagnostic survey: Is education a factor in responses to negatively worded items? Journal of Applied Psychology, 78(1), 141–3. doi:10.1037/0021-9010.78.1.141
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349–54. doi:10.1037/h0047358
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: Sage.
Dillman, D. A. (1978). Mail and telephone surveys: The total design method. New York: Wiley.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. Hoboken, NJ: John Wiley.
Donaldson, S. I., & Grant-Vallone, E. J. (2002). Understanding self-report bias in organizational behavior research. Journal of Business & Psychology, 17(2), 245–60. doi:10.1023/a:1019637632584
Dunham, R. B., & Smith, F. (1979). Organizational surveys: An internal assessment of organizational health. Scott Foresman & Co.


Fowler, F. J. (1993). Survey research methods (2nd ed.). Newbury Park, CA: Sage.
Hardy, B., & Ford, L. R. (2014). It’s not me, it’s you: Miscomprehension in surveys. Organizational Research Methods, 17(2), 138–62. doi:10.1177/1094428113520185
Harrison, D. A., & McLaughlin, M. E. (1993). Cognitive processes in self-report responses: Tests of item context effects in work attitude measures. Journal of Applied Psychology, 78(1), 129–40. doi:10.1037/0021-9010.78.1.129
Hinkin, T. R. (1995). A review of scale development in the study of behavior in organizations. Journal of Management, 21, 967–88. doi:10.1177/014920639502100509
Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–21. doi:10.1177/109442819800100106
Hinkin, T. R., Tracey, J. B., & Enz, C. A. (1997). Scale construction: Developing reliable and valid measurement instruments.
Huang, J. L., Curran, P. G., Kenney, J., Poposki, E. M., & DeShon, R. P. (2011). Detecting and deterring insufficient effort responding to surveys. Journal of Business & Psychology, 27, 99–114.
Lee, C., Sugimoto, C., Zhang, G., & Cronin, B. (2013). Bias in peer review. Journal of the Association for Information Science and Technology, 64(1), 2–17. doi:10.1002/asi.22784
Likert, R. (1973). The method of constructing an attitude scale. In S. Houston, J. Schmid, R. Lynch, & W. Duff (Eds.), Methods and techniques in business research (pp. 90–5). New York: MSS Information Corporation, Ardent Media.
MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35(2), 293–334.
Magazine, S., Williams, L., & Williams, M. (1996). A confirmatory factor analysis examination of reverse-coding effects in Meyer and Allen’s affective and continuance commitment scales. Educational and Psychological Measurement, 56(2), 241–50.
Marsh, H. (1996). Positive and negative global self-esteem: A substantively meaningful distinction or artifacts? Journal of Personality and Social Psychology, 70(4), 810–19. doi:10.1037/0022-3514.70.4.810
McGee, G. W., Ferguson, C. E., & Seers, A. (1989). Role conflict and role ambiguity: Do the scales measure these two constructs? Journal of Applied Psychology, 74, 815–18. doi:10.1037/0021-9010.74.5.815
Merritt, S. M. (2012). The two-factor solution to Allen and Meyer’s (1990) affective commitment scale: Effects of negatively worded items. Journal of Business and Psychology, 27(4), 421–36. doi:10.1007/s10869-011-9252-3
Moorman, R. H., & Podsakoff, P. M. (1992). A meta-analytic review and empirical test of the potential confounding effects of social desirability response sets in organizational behaviour research. Journal of Occupational and Organizational Psychology, 65(2), 131–49. doi:10.1111/j.2044-8325.1992.tb00490.x
Narins, P. (1999). Write more effective survey questions. Keywords: SPSS, Inc.
Oppenheim, A. N. (1992). Questionnaire design, interviewing and attitude measurement (New ed.). New York: St. Martin’s Press.
Paulhus, D. L. (1989). Assessing self-deception and impression management in self-reports: The Balanced Inventory of Desirable Responding – Version 6. Unpublished manuscript.
Penwarden, R. (2015). 5 common survey question mistakes that’ll ruin your data.
Ren, R., Simmons, A., & Zardkoohi, A. (2017). Testing the effects of experience on risky decision making. American Journal of Management, 17(6).
Saris, W. E., & Gallhofer, I. N. (2007). Design, evaluation, and analysis of questionnaires for survey research. Hoboken, NJ: John Wiley & Sons, Inc. doi:10.1002/9780470165195
Schriesheim, C. A., Powers, K. J., Scandura, T. A., Gardiner, C. C., & Lankau, M. J. (1993). Improving construct measurement in management research: Comments and a quantitative approach for assessing the theoretical content adequacy of paper-and-pencil survey-type instruments. Journal of Management, 19(2), 385. doi:10.1177/014920639301900208
Schwab, D. P. (1980). Construct validity in organizational behavior. In B. M. Staw & L. L. Cummings (Eds.), Research in organizational behavior (Vol. 2). Greenwich, CT: JAI Press, Inc.
Shaffer, J. A., DeGeest, D., & Li, A. (2016). Tackling the problem of construct proliferation: A guide to assessing the discriminant validity of conceptually related constructs. Organizational Research Methods, 19(1), 80–110. doi:10.1177/1094428115598239
Spector, P. E., Rosen, C. C., Richardson, H. A., Williams, L. J., & Johnson, R. E. (2017). A new perspective on method variance: A measure-centric approach. Journal of Management. doi:10.1177/0149206316687295
Spreitzer, G. M. (1995). Psychological empowerment in the workplace: Dimensions, measurement and validation. Academy of Management Journal, 38(5), 1442. doi:10.2307/256865
Swann, W., Giuliano, T., & Wegner, D. (1982). Where leading questions can lead: The power of conjecture in social interaction. Journal of Personality and Social Psychology, 42(6), 1025–32. doi:10.1037/0022-3514.42.6.1025
Warner, R. M. (2008). Applied statistics: From bivariate to multivariate techniques. Thousand Oaks, CA: Sage.

11 Measurement Models: Reflective and Formative Measures, and Evidence for Construct Validity Lisa Schurer Lambert, Truit W. Gray and Anna M. Zabinski

The focus of research is on testing the relationships between constructs, adding to our body of knowledge about how the world works. Many of our constructs are latent, meaning we cannot directly observe them, so we need to find and measure observable traces or indicators of the constructs. Indicators might be obtained from a wide array of sources (e.g., questions on a survey, financial or human resource records, or expert ratings of behavior), but these indicators are not the construct itself. The relationship between the indicators and the construct creates the measurement foundation to test the relationships between constructs (DeVellis, 2017; Edwards, 2003; Schwab, 1980). If the measurement foundation is unsound, then our tests of relationships between constructs may not only be biased, but they may be wrong. The purpose of construct validity techniques and analyses is to marshal evidence regarding the relationship between a construct and its indicators. Just as with relationships between constructs, relationships between indicators and constructs should be specified via a model; specifically, a measurement model that requires theoretical and empirical support (Anderson & Gerbing, 1988; Cohen et al., 2003; Edwards, 2003). The relationship between the indicators of a construct and the construct itself can be represented by drawing

an arrow in a figure of the measurement model (see Figure 11.1a and 11.1b). The arrow symbolizes a causal relationship that can be tested (Anderson & Gerbing, 1988; Cohen et al., 2003; Edwards, 2003). A central question that must be answered by researchers developing a new measure is whether the arrow linking the construct and indicator will point to the construct or to the indicator (Bollen & Lennox, 1991). When the arrow starts from the construct and points to the indicator, we infer that the construct causes the score on the indicator and the relationship is referred to as reflective, i.e. the value of the construct is reflected in the score of the indicator (Figure 11.1a). Conversely, an arrow that starts from the indicator and points to the construct is stipulating a causal relationship whereby the meaning of the construct is determined by the meaning of the indicator (Figure 11.1b). The relationship is labeled as formative because the indicators of the construct combine to create, or form, the meaning of the construct. The choice between specifying the relationship between a construct and its indicators as reflective or formative should not be arbitrary. The choice has implications for theorizing with the construct and implies a different set of standards for evaluating the construct validity of the measure.
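The two measurement models can also be written in equation form. The rendering below is ours, following the standard representation in Bollen and Lennox (1991) and using the symbols that appear in Figures 11.1a and 11.1b:

```latex
% Reflective: the construct \xi drives each indicator x_i, with item-level error
x_i = \lambda_i \xi + \varepsilon_i
% Formative: the indicators jointly form the construct \eta, with a construct-level residual
\eta = \gamma_1 x_1 + \gamma_2 x_2 + \dots + \gamma_n x_n + \zeta
```

Note that error enters at the item level (ε) in the reflective model but only at the construct level (ζ) in the formative model, a difference taken up in the Measurement Error section below.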



Figure 11.1a and 11.1b  Comparing reflective to formative constructs

To illustrate the distinction between reflective and formative measures, we contrast two approaches to measuring the construct of job satisfaction, that is, a person’s cognitive and emotional evaluation of his or her job (Locke, 1976; Judge & Kammeyer-Mueller, 2012). A reflective approach might involve asking multiple survey questions that are general and global in nature, e.g., “in general, I am satisfied with my job,” “all in all, the job I have is great,” and “my job is very enjoyable.” In contrast, a formative approach would require asking questions about each dimension of the job, e.g., “I am satisfied with my compensation,” “I am satisfied with my coworkers,” and “I am satisfied with the type of work I do.” As with reflective measures with multiple items, the scores on the dimensions are typically added (and divided by the number of items) to create a composite score representing job satisfaction. We will continue this example of job satisfaction, among other examples, throughout the chapter to highlight key distinctions between reflective and formative approaches. Methodologists have been debating, and occasionally raging about, the advantages and disadvantages of reflective and formative measures for decades (Bollen & Lennox, 1991; Bollen & Diamantopoulos, 2017; Edwards, 2011; Howell et al., 2007; MacKenzie et al., 2005). However, much of the debate has been waged with equations and dense conceptual arguments that may be difficult to follow. In this chapter, we strive to simplify the positions of both sides, without losing the substance of the arguments, and to give guidance to researchers seeking to improve their measurement practice. We begin by explaining the attributes of reflective and formative constructs. We then compare formative measures to reflective measures, highlighting important differences in construct validity procedures and in the interpretation of their construct (composite) scores.
Next, we more briefly address higher-order constructs and multidimensional constructs, i.e., constructs composed of two or more dimensions, which can be entirely reflective, entirely formative, or a combination of both reflective and formative elements. Finally, we offer guidelines for choosing the measurement model for a new construct. Despite our efforts at even-handed treatment, we now give away the punchline of the chapter by stating that we see few benefits, and many complications, from using formative measures. Yet we acknowledge that some of the ideas inherent in arguments for formative measurement are attractive, so we conclude with strategies for turning formative measures into reflective measures, preserving the rationale for formative measures while obtaining the precision and testability of reflective measures.
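The job satisfaction example above can be sketched concretely. Both approaches typically average item scores into a composite; the item wordings are from the chapter's example, while the response values are illustrative only:

```python
# Sketch: averaging item scores into a composite (illustrative data only).
# Reflective items restate the same global idea; formative items each tap a
# distinct dimension (compensation, coworkers, type of work).

def composite(scores):
    """Mean of item scores: sum divided by the number of items."""
    return sum(scores) / len(scores)

# One respondent's hypothetical 1-7 Likert responses.
reflective_items = {"in general, I am satisfied with my job": 6,
                    "all in all, the job I have is great": 5,
                    "my job is very enjoyable": 6}
formative_items  = {"I am satisfied with my compensation": 2,
                    "I am satisfied with my coworkers": 7,
                    "I am satisfied with the type of work I do": 6}

print(composite(list(reflective_items.values())))  # ~5.67
print(composite(list(formative_items.values())))   # 5.0
```

Note that the formative composite of 5.0 conceals very different satisfaction levels across the dimensions, a point the chapter returns to when discussing interpretation of composite scores.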

EXPLANATION OF REFLECTIVE MEASURES Reflective measures are sometimes called effect measures because the indicators are the effects or outcomes of the construct (Edwards, 2003; DeVellis, 2017). The idea underpinning reflective measures is that the constructs exist in reality but may be unobservable in their true, unbiased state. Although the construct (ξ in Figure 11.1a) cannot be perfectly measured, we can measure imperfect indicators (x1) of the construct, which are measured with error (ε1). The direction of the arrow between the construct and the indicator implies a theoretically based, causal relationship (λ1): as the construct increases, the score on the indicator increases. The construct and the indicators are unidimensional, meaning the construct contains one concept, and the indicators capture the full breadth of the concept. For example, the idea of overall job satisfaction can be measured with indicators such as “in general, I am satisfied with my job,” “all in all, the job I have is great,” and “my job is very enjoyable.” As a result, each indicator is constructed (e.g., the survey item is written) to reflect exactly the same idea as the construct (e.g., the conceptual definition of the construct), and indicators should be entirely interchangeable with each other. The assumption that the indicators equally reflect the construct with minimum error is tested with an array of procedures, including content validity exercises (Colquitt et al., 2019; Hinkin & Tracey, 1999; Willis, 2005), calculations of reliability, e.g., alpha, omega, and test-retest (Cortina, 1993; McNeish, 2018), exploratory and confirmatory factor analyses (Asparouhov & Muthén, 2009; Brown, 2015; Conway & Huffcutt, 2003; Fabrigar et al., 1999; Jackson et al., 2009; Widaman, 2012), and examinations of convergent and discriminant validity (Campbell & Fiske, 1959; Cronbach & Meehl, 1955; Shaffer et al., 2016), among others. We will not elaborate on the methods for assessing construct validity for reflective measures because they are more familiar and well documented elsewhere (we cite many useful sources throughout this chapter). We also refer readers to other chapters in this book (specifically, Chapters 14, 15, 16, and 18).
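As one example of the reliability calculations mentioned above, Cronbach's alpha can be computed directly from a respondents-by-items score matrix using the standard formula α = k/(k−1) · (1 − Σσ²(item)/σ²(total)). The data below are a toy illustration, not from any real survey:

```python
# Sketch: Cronbach's alpha from a respondents x items score matrix (toy data).
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(rows[0])
    items = [[row[j] for row in rows] for j in range(k)]   # one list per item
    totals = [sum(row) for row in rows]                    # each respondent's total
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

# Four respondents, three perfectly consistent reflective items.
data = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
print(cronbach_alpha(data))  # perfectly correlated items give alpha = 1.0
```

In practice, alpha below conventional thresholds signals that the indicators may not be interchangeable reflections of one construct.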

EXPLANATION OF FORMATIVE MEASURES In contrast to reflective measures, formative measures, also referred to as cause indicators, embody a fundamentally different relationship between the construct and its indicators. In formative constructs, the indicators (x1 in Figure 11.1b) are theorized to “cause” the construct (η) (Bollen & Lennox, 1991). The theoretical reasoning behind formative models is that some constructs are properly caused by their indicators – the indicators collectively constitute the construct, or the indicators combine in such a way as to cause the construct. Formative measures are viewed as causal such that increases in the indicator cause the value of the construct to increase. The items themselves form the construct and the meaning of the construct is derived from the content of the items. This basic distinction between formative measures and reflective measures is seen in the direction of the arrow between the indicator and the construct (λ1 vs. γ1 in Figures 11.1a and 11.1b). In reflective measures, the arrow goes from the construct to the indicator, as the indicator is meant to reflect the meaning of the construct, while in formative measures the arrow is reversed, consistent with the idea that the construct is caused by, and the meaning of the construct defined by, the indicators. The indicators in a formative model are assumed to be measured without error, but the residual of the construct (ζ) captures variance, and error, in the construct unexplained by the indicators. Stipulating that the indicators cause the construct also means that the items of a formative construct are not viewed as interchangeable, equally valid indicators of the construct. In other words, the requirement for a unidimensional meaning of the construct is dropped in favor of specifying indicators that represent different facets, aspects, or dimensions of the formative construct. 
Commonly cited examples of such formative constructs include SES (socioeconomic status, composed of education, occupation, father’s occupation, etc.), job satisfaction (satisfaction with pay, supervisor, type of work, co-workers, etc.), and stress (a checklist or frequency of insomnia, headaches, irritability, anxiety, depression, rapid heart rate, etc.).

COMPARING FORMATIVE MEASURES TO REFLECTIVE MEASURES We further explain formative constructs by contrasting them to the more widely understood reflective constructs, beginning with what we think are the most prominent differences between formative and reflective constructs: philosophical foundations, internal consistency requirements, and how measurement error is modeled. We then address dimensionality and interpretation, and the estimation of measurement models, and conclude by summarizing differences in construct validity evaluations.

Philosophical Underpinnings The idea that indicators are reflected manifestations of a construct is consistent with a philosophy referred to as critical realism (Borsboom et al., 2004; Loevinger, 1957). In other words, the ideas or entities defined in constructs exist in the world, but our observations, i.e., our indicators of the reflective construct, are flawed. Some advocates of formative measures indicate their support for a constructivist view of constructs, which embraces the idea that constructs may not objectively exist in the world but are ideas or entities that can be created, named, and used by researchers (Borsboom et al., 2003; Podsakoff et al., 2016). Advocates of formative measures argue that, for some constructs, this approach – that the indicators cause or combine to form the construct – correctly represents the relationship between indicators and their constructs. One argument for formative measures claims that ignoring those causal processes is wrong and would lead to dropping many useful and commonly used measures. Another argument places less emphasis on causal relationships and focuses on how the indicators, or dimensions, combine in some fashion to create the meaning of the construct (MacKenzie et al., 2005). A third argument suggests that psychological constructs, e.g., attitudes or opinions, might be appropriately measured with reflective measures, but that objective phenomena can and should be measured with multiple discrete, and largely error-free, indicators, e.g., socioeconomic status, working conditions, etc. In sum, these arguments lead to the implication that some relationships between construct and indicator(s) are accurately represented as formative measures.

Internal Consistency

Chief among the differences between reflective and formative constructs is that the internal consistency among indicators is irrelevant when evaluating the adequacy of a formative measure (Bollen & Lennox, 1991). In contrast, reflective indicators are required to correlate highly with each other. Accordingly, correlations among formative indicators may be high or low, or even negative, rendering reliability coefficients, e.g., Cronbach's alpha or omega, useless as criteria for evaluating construct validity. For instance, imagine a formative job satisfaction measure where there is no theoretical reason to expect that satisfaction with working conditions is correlated with satisfaction with compensation. Any analysis focusing on covariance among the items is unnecessary, and the hurdle of establishing adequate internal consistency is not present. Indeed, high correlations among formative indicators may signal that the indicators do not capture distinct aspects of the construct and are therefore redundant (Bollen & Lennox, 1991).
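To make this contrast concrete, the following illustrative sketch (ours, not from the chapter) simulates two item sets: items driven by a common latent factor yield a high Cronbach's alpha, while unrelated, formative-style facet items do not. The data and facet labels are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
n = 500
# Reflective-style items: four noisy manifestations of one latent factor
latent = rng.normal(size=n)
reflective = np.column_stack(
    [latent + rng.normal(scale=0.5, size=n) for _ in range(4)]
)
# Formative-style items: four unrelated facets
# (e.g., satisfaction with pay vs. with co-workers)
formative = rng.normal(size=(n, 4))

print(cronbach_alpha(reflective))  # high (around .9)
print(cronbach_alpha(formative))   # near zero
```

A near-zero alpha for the second set says nothing about the adequacy of a formative measure; as the chapter notes, internal consistency is simply the wrong criterion.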

Measurement Error

Assessments of reflective constructs assume that the items are measured with error (δ or ε), and an estimate of the error for each item is provided by confirmatory factor analysis (CFA). Alternatively, items for formative constructs are assumed to be measured without error. Tests of formative constructs typically, but not always, model error at the construct level by estimating the residual, referred to as zeta (ζ in Figure 11.1b) (Bollen & Lennox, 1991). The residual in a formative measure contains unexplained variance in the construct not associated with the indicators, which includes error (as does the residual in a regression equation). Unfortunately, it is not possible to distinguish between measurement error and unexplained variance. The assumption that formative items contain no measurement error is unlikely to hold. While the amount of measurement error in indicators may vary, we assert that every item has some amount of error. Indicators that are frequently viewed as objective, for example accounting numbers used to measure firm performance or supervisors' ratings of employee performance, still contain error. Stray moods or emotions may inadvertently influence scores on job attitudes, survey items may be written poorly, and raters may have incomplete, biased, or inaccurate observations of behavior. It is hard for us to conceive of an item that has no error. For example, if the first author's husband were asked his age on seven consecutive days, he would report the wrong age on at least one of those days. The presence of measurement error introduces bias into analytical results (e.g., estimates of coefficients are biased), and therefore hypothesis tests may be inaccurate (Bollen, 1989; Cohen et al., 2003). At times, the bias associated with ignoring measurement error may be negligible, but we believe that the measurement error in the indicators should be estimated rather than assumed to be zero, as is the case with formative models.
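The biasing effect of ignoring indicator error can be shown with a classic errors-in-variables simulation (our own sketch, not from the chapter): regressing an outcome on an error-laden indicator attenuates the slope toward zero in proportion to the indicator's reliability.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x_true = rng.normal(size=n)
y = 0.5 * x_true + rng.normal(scale=0.5, size=n)    # true slope is 0.5
x_observed = x_true + rng.normal(scale=1.0, size=n)  # error-laden indicator

def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Simple-regression slope: cov(x, y) / var(x)."""
    return np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)

print(ols_slope(x_true, y))      # close to 0.5
print(ols_slope(x_observed, y))  # close to 0.25: attenuated by reliability 1/(1+1)
```

Here the observed indicator's reliability is 0.5 (true-score variance 1 over observed variance 2), so the estimated slope shrinks to roughly half the true value, which is exactly the kind of bias the paragraph above warns about.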

Dimensionality and Interpretation of the Construct

As with reflective measures, delineating the domain or definition of the formative construct establishes the foundation for evaluating the quality of the relationship between the construct's content and its indicators (Cronbach & Meehl, 1955; DeVellis, 2017; Podsakoff et al., 2016). However, one of the differences between reflective and formative constructs lies in what is referred to as dimensionality. Reflective indicators should be unidimensional, capturing the same content, albeit expressed in different ways, and should represent the entire domain of the construct. The key distinguishing feature of formative constructs is that the requirement for unidimensionality is not applicable, meaning that formative constructs may contain multiple dimensions or facets (Diamantopoulos & Winklhofer, 2001). It is imperative to adequately represent the formative construct's domain by including indicators for all relevant facets or dimensions of the construct (Bollen & Lennox, 1991). Indeed, the definition of the construct is not independent of the facets, or indicators, of the construct. Continuing with the example of a formatively measured attitude of job satisfaction, the meaning of the construct depends on the specific dimensions included, e.g., satisfaction with compensation, the nature of the work, co-workers, and the supervisor (among other possible dimensions). Adding or omitting a facet changes the conceptualization of the construct, e.g., including or excluding the quality of supervision changes the theoretical meaning of job satisfaction. The critical question is how many dimensions of job satisfaction are needed to properly measure the construct; for instance, the quality of the work environment, benefits, and opportunities for promotion and professional growth might also reasonably belong to the formative construct. Should satisfaction with the room temperature or with the office plants also be included to assess job satisfaction? Adding facets likely explains more variance, but there is no theoretical threshold for how much variance each indicator should explain in a formative construct. Because the domain of the construct depends on the specific indicators measured, the meaning of the construct changes as the measurement changes (Diamantopoulos & Winklhofer, 2001).

Perhaps the most critical concern is that the scores for formative constructs have no clear conceptual interpretation. A numerical score should represent the value of a respondent's standing on a construct, but scores on formative measures cannot be clearly interpreted. Let's revisit the example of a formative model of job satisfaction, measured with three items assessing satisfaction with pay, the meaningfulness of the work, and satisfaction with co-workers. The scores on the three items, each measured on a 1 to 7 point scale, will be added and the sum divided by three; the result will be used in regression equations to represent the construct. Imagine that three of the respondents in the data set have a score of 4 for the construct. One respondent is equally satisfied with all three facets of the job and gives scores of 4 for each of the three items ((4 + 4 + 4)/3 = 4). The second respondent is very satisfied with the meaning of the work but much less satisfied with the pay and co-workers ((2 + 7 + 3)/3 = 4), and the third respondent enjoys co-workers' company but believes they are underpaid and the work is only tolerable ((2 + 3 + 7)/3 = 4). All three respondents in this example have the same construct-level score in the data set, but this numerical score does not differentiate between their satisfaction with pay, the meaningfulness of the work, and satisfaction with co-workers. Analyses based on these scores assume that the relationships with other variables are the same across each facet, e.g., that pay satisfaction matters as much as satisfaction with co-workers.
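The arithmetic behind the three hypothetical respondents can be reproduced in a few lines (an illustrative sketch; the respondent labels are ours):

```python
# Three respondents' facet scores, ordered (pay, meaningful work, co-workers)
profiles = {
    "uniform":          (4, 4, 4),
    "loves the work":   (2, 7, 3),
    "loves co-workers": (2, 3, 7),
}

# Composite = mean of the three facet scores, as in the chapter's example
composites = {name: sum(scores) / 3 for name, scores in profiles.items()}
print(composites)  # every respondent receives the same composite of 4.0
```

Three very different satisfaction profiles collapse to one indistinguishable score, which is precisely the interpretation problem the example illustrates.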

Estimating the Measurement Model

The measurement model, represented in a figure linking indicators to a construct, represents a hypothesis that is tested with CFA. Showing that the measurement model adequately fits the data and is superior to alternative measurement models is a de rigueur component of most publications. Recall that in CFA it is necessary to estimate an identified model, meaning there is enough information in the covariance matrix to estimate the model. In a one-factor CFA there must be at least three indicators (and more indicators mean the model is overidentified, so the fit of the model can be tested). Moreover, the scale of the model must be set, by fixing either one path (λ1) from the construct to an item or the variance of the construct equal to one. Many structural equation modeling programs automatically set the scale, so it is possible to be largely unaware of this feature, but understanding this requirement is necessary to see how differently formative models are estimated. Formative models cannot be identified no matter how many indicators are included for the construct (Edwards, 2011; MacKenzie et al., 2005). Instead, formative models require incorporating reflective items, or reflective constructs, to identify the model and set the scale. Below we explain two of the most common approaches for modifying formative models for testing.

Figure 11.2a and 11.2b  Identifying formative constructs with reflective indicators

Figure 11.2a presents a MIMIC (multiple indicators, multiple causes) model. It includes two reflective indicators of the formative construct so that the model can be identified. It is important to recognize that the loadings from the formative indicators to the formative construct (γ1) are not stable, but vary depending on the content of the reflective indicators – meaning that if you swap out one reflective indicator for another, the loadings on the formative indicators will change (Edwards & Bagozzi, 2000; Joreskog & Goldberger, 1975). In other words, "changing dependent constructs changes the formative construct," where in this instance the dependent constructs refer to the reflective items (Howell et al., 2007, p. 211). When the content of the reflective items stems from the same construct, the variation in the loadings of the formative indicators on the construct may be small. However, as the content of the reflective items departs from the content of the formative construct, the formative loadings can change considerably. In effect, the meaning of the formative construct now depends on both the formative indicators/dimensions and the reflective items in the model, increasing the definitional ambiguity of formative constructs. Without a stable definition of a construct, it is folly to theorize relationships with other constructs. There is little guidance on what kind of reflective indicators are appropriate for establishing identification of a formative measure when assessing construct validity; however, MacKenzie, Podsakoff, & Podsakoff (2011) recommend that the reflective indicators should represent the content of the total measure rather than of a single dimension. For instance, if job satisfaction is being measured as a formative construct (e.g., satisfaction with pay, supervisor, type of work, co-workers, etc.), the reflective indicators selected to identify the model should represent the total construct of job satisfaction, e.g., "overall job satisfaction."

Another way to identify a formative measure, see Figure 11.2b, is to include at least two reflective, endogenous constructs, each with at least two reflective indicators. MacKenzie and colleagues (2011) specify that the two reflective variables should not be causally related to the formative construct. However, it is necessary to link the reflective constructs to the formative construct with arrows that imply that the reflective constructs are consequences of the formative construct. It is not clear to us how these links do not constitute causal relationships. Moreover, identifying a formative model with reflective, endogenous constructs confounds the interpretation of the formative construct with the meaning of the endogenous constructs (Edwards & Bagozzi, 2000; Howell et al., 2007). Specifically, swapping in different endogenous constructs will lead to different loadings for the formative construct, suggesting that there ought to be some criteria for selecting appropriate endogenous constructs. Essentially, this approach suffers from the same ambiguities as the MIMIC model in that the interpretation of the formative construct is influenced by the arbitrary selection of reflective constructs.
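The counting rule behind the CFA identification requirement described above can be checked with simple arithmetic (a sketch under the usual one-factor assumptions: p loadings, p error variances, and one factor variance, with one parameter fixed to set the scale):

```python
def one_factor_cfa_df(p: int) -> int:
    """Degrees of freedom for a one-factor CFA with p indicators.

    Knowns: p * (p + 1) / 2 unique variances and covariances.
    Free parameters: p loadings + p error variances + 1 factor variance,
    minus one constraint to set the scale, i.e., 2 * p in total.
    """
    knowns = p * (p + 1) // 2
    free_parameters = 2 * p
    return knowns - free_parameters

print(one_factor_cfa_df(3))  # 0  -> just identified
print(one_factor_cfa_df(4))  # 2  -> overidentified; model fit can be tested
print(one_factor_cfa_df(2))  # -1 -> underidentified; cannot be estimated alone
```

This is why three reflective indicators are the minimum, and why adding indicators yields the positive degrees of freedom needed to test model fit.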

Standards for Evaluating Construct Validity

To establish construct validity with reflective constructs it is necessary to show that items within a construct correlate more highly with each other (convergent validity) than with items from other constructs (discriminant validity). The idea that constructs should exhibit discriminant and convergent validity is central in tests of reflective constructs, where CFA is used to estimate models that include multiple constructs related to the focal construct, i.e., the nomological network, as a test of discriminant validity (Cronbach & Meehl, 1955). Items that "cross load," meaning that they correlate highly with an unassigned, or unintended, construct in a CFA, are candidates for deletion during the scale development process. Formative constructs (η) should be distinct from other constructs, i.e., evidence of discriminant validity, and should be related as expected to other constructs in their nomological network (Diamantopoulos & Winklhofer, 2001; MacKenzie et al., 2011). However, it is not necessary that indicators within a formative measure (x1, x2, …) correlate more highly with each other than they do with indicators of other constructs under study. For instance, an item measuring satisfaction with compensation need not be correlated with an item for satisfaction with relationships with co-workers.

The procedures for evaluating the construct validity of formative measures differ markedly from those applied to reflective measures. The specification of a reflective measurement model is comparatively precise even with the relatively lax version of CFA models (i.e., a congeneric model, where the loadings are theoretically predicted to be high but may not be equal, and the errors are predicted to be low but may be unequal) that is commonly used in empirical tests. Although the estimates of a measurement model may vary considerably across samples, the standards for evaluating the model remain unchanged. Practices for assessing the quality of reflective measures (e.g., content validity, CFA, nomological validity) are continually being updated but are relatively well articulated (DeVellis, 2017; Jackson et al., 2009).


The typical criteria applied to evaluating the validity of reflective constructs are inapplicable for formative measures, but the criteria that should be applied to formative measures are rarely stated by methodologists. There is no methodological guidance for the magnitude of the loadings, the degree of expected correlation between indicators, or the expected relationship of the indicators with other constructs. Presumably, authors developing formative constructs might specify what requirements must be met, or what standards should apply, in their scale development, but even if they were specified (which we have never seen in the management literature), it is not clear what the theoretical basis for such criteria should be. The lack of specificity in the formative measurement model introduces ambiguity when evaluating tests of the model across multiple samples because there are no criteria for evaluating whether the hypothesized measurement model is supported or rejected (Howell et al., 2007). In practice, we have seen researchers determine how to treat formative constructs, i.e., keeping facets distinct or combining them for analyses, based on their results in a specific sample. In effect, this practice relies on the results to determine which validity criteria should be applied to the construct – reversing the prescribed process for defining a measurement model and assessing validity. In sum, we draw three conclusions about formative measures: (1) the construction of formative measures is not grounded in a theoretical approach to measurement, (2) the criteria for evaluating the construct validity of formative constructs are vague, and (3) the definition and interpretation of formative constructs are inherently ambiguous (Edwards, 2011; Howell et al., 2007). Another way of visually summarizing the logic of formative constructs is captured by a sign posted outside an apple orchard in Blue Ridge, Georgia (Figure 11.3), which presents the construct of Mercier Orchards as the sum of the number of trees, acreage, elevation, and date of establishment.

Figure 11.3  Sign outside of an apple orchard in Georgia illustrating a formative construct

SECOND ORDER FACTOR MODELS AND MULTIDIMENSIONAL MODELS

Second order factor models and multidimensional models link measurement models of the types already discussed in this chapter with constructs that are at a higher, typically more abstract or general, construct level. There are quite a few variations that have been developed, each making a different theoretical statement about causal relationships between the constructs and involving different challenges for statistical estimation or theoretical interpretation. Other authors have covered this ground more thoroughly than we will (Edwards, 2001; Law et al., 1998; Marsh & Hocevar, 1985). Our focus is not on enumerating or naming the types of models but on conveying simpler explanations of how the causal and theoretical specifications implied by models may vary. We describe a few common model types to illustrate the variety but focus our attention on the theoretical statements implied by each model. The purpose of this section is to familiarize readers with a range of causal linkages specified in models so that you can select, test, and defend your choice of measurement model when developing a measure.

Figure 11.4a, 11.4b, and 11.4c  Example models

The simplest model involving multiple constructs is illustrated in Figure 11.4 (model a), which shows three reflective constructs, each with multiple indicators. The curved, double-headed arrows linking the constructs indicate that these distinct constructs are allowed to freely correlate and that there is no statement about the causal relationship between them. This is the typical default model when testing reflective measurement models with CFA. Reflective models of related constructs can sometimes be more parsimoniously explained by a higher order reflective construct, as in Figure 11.4 (model b), depicting three reflective constructs as facets or dimensions of the second order construct. The general idea is that the construct exists on a higher, more abstract level, and the dimensions are manifestations of facets of the construct (Brown, 2015; Marsh & Hocevar, 1985). The causal flow is from the higher order construct to the lower order, first level, reflective constructs and implies that the magnitudes of the relationships between the higher order construct and lower order constructs are equal and of the same sign. When the higher order construct is reflective, it implies that as the higher order construct increases, all the facet constructs increase equally.

Another variation of a multidimensional construct (Figure 11.4, model c) shifts the causal direction between the lower order reflective dimensions and the second order construct by reversing the direction of the causal arrows to indicate that the second order construct is formative. The theoretical task here is to justify why and how the higher order construct is formed by the three reflective constructs. Remember, with a formative construct there is no requirement that the lower order constructs contribute equally to the higher order construct, meaning the three reflective, lower order constructs do not have to relate equally to the higher order construct. Another variation of this model, not pictured, is a fully formative multidimensional construct where formative lower order constructs predict a higher order formative construct. Multidimensional models can be constructed entirely with reflective constructs, entirely with formative constructs, or with some combination of both. It is also possible to construct multidimensional models embodying additional levels of constructs. Multidimensional constructs, however they are designed, make statements about the causal relationships between indicators and constructs that should be precisely and theoretically specified and supported by empirical evidence. As with any measure, it is essential to establish construct validity, including testing the measurement model, before turning to tests of hypotheses embodied in the structural model (Anderson & Gerbing, 1988). We have reviewed a few multidimensional measurement models and urge researchers to fully and theoretically define the relationships between their indicators, dimensions, and constructs.
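A small simulation (ours, not the chapter's) illustrates the second order reflective structure of model b: a general factor drives three facet factors, each measured by three indicators, so items correlate more strongly within a facet than across facets.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Second order reflective structure (in the style of Figure 11.4, model b):
# one general factor drives three facet factors, each measured by 3 items.
general = rng.normal(size=n)
facets = [0.7 * general + rng.normal(scale=0.7, size=n) for _ in range(3)]
items = np.column_stack(
    [f + rng.normal(scale=0.5, size=n) for f in facets for _ in range(3)]
)

r = np.corrcoef(items, rowvar=False)
within = r[0, 1]   # two items belonging to the same facet
between = r[0, 3]  # items from different facets
print(within, between)  # within-facet correlations exceed between-facet ones
```

The between-facet correlations remain positive because the general factor flows through every facet, which is the empirical signature of a higher order reflective construct.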

DECIDING HOW TO MEASURE YOUR CONSTRUCT: REFLECTIVE, FORMATIVE, MULTIDIMENSIONAL

A useful approach for determining whether a construct should be specified as reflective or formative is to conduct a thought experiment (Jaccard & Jacoby, 2010). Imagine that the construct of interest increases – if the values of the indicators increase equally in response, then the construct is reflective. Consider an imaginary subordinate's perception of interpersonal justice, meaning whether the supervisor has treated the subordinate with respect and sensitivity (Colquitt, 2001). As the employee's perceptions of just interpersonal treatment increase, the respondent ought to select higher scores for a survey item that asks "Have you been treated with respect?" Conversely, if the indicators in the thought experiment do not increase commensurately with the increase in the construct, then advocates of formative measurement say the construct is formative. For example, imagine a person's socioeconomic status is increasing. It is not plausible that an increase in SES would causally drive increases in education or income. Formative advocates might declare SES to be a formative construct. Another technique for distinguishing between reflective and formative constructs, too complex for us to introduce here, involves conducting vanishing tetrad analyses (Bollen & Ting, 2000). Note that both thought experiments and vanishing tetrad analyses presume that constructs are inherently reflective or formative. Other methodologists recognize that constructs may not be inherently reflective or formative but can be measured in either form (Edwards, 2011; Howell et al., 2007). Whether a construct should be measured with formative or reflective measures is a theoretical issue that is not resolved by a statistical test. We assert that it is up to the researcher to theoretically determine whether the focal construct should be measured reflectively or formatively, and then to defend that choice with logical and empirical evidence. The causal processes discussed by advocates of formative measures do, in our opinion, exist. Increasing the values of job satisfaction facets (e.g., pay, relationships with supervisors and co-workers, benefits, interesting work, and so on) will cause overall job satisfaction to increase.
However, it is unreasonable to theorize that all facets of job satisfaction are equally effective at increasing overall job satisfaction – some are likely to be more strongly related, and other facets weakly related, to the overall attitude. Such effectiveness may also vary from person to person; employees may not uniformly weight the facets thought to contribute to job satisfaction.

OUR RECOMMENDATION: FORMATIVE LOGIC EMBEDDED AND TESTED IN REFLECTIVE MODELS

We recommend embedding the logic of formative constructs in properly specified, and tested, reflective models. Before addressing this recommendation, it is necessary to engage in a brief discussion of how broadly or narrowly constructs are defined.



Some constructs are defined at a level of abstraction that may not correspond to the abstraction in the indicators. For example, the original measure of abusive supervision, defined as "the sustained display of hostile verbal and nonverbal behaviors, excluding physical contact" (p. 178), was measured with fifteen items (Cronbach's alpha = .90) including "ridicules me," "lies to me," and "invades my privacy" (Tepper, 2000).1 These behaviors all constitute examples of abusive supervision, and perhaps might be called facets of the construct because they were written as a collection of specific behaviors. It is possible, perhaps likely, that an abusive supervisor does not engage in all the identified behaviors; for example, a supervisor might ridicule a subordinate but not lie (or vice versa). The specificity of the indicators does not correspond to the comparatively more abstract definition of the abusive supervision construct. One of the purposes of content validity analyses is to ensure that the indicators accurately reflect the meaning of the construct (Schwab, 2005). We are not the first to argue that when the specificity of the construct does not correspond to the specificity of the indicators, the validity of the measure is compromised. This issue of correspondence between constructs and indicators has been noted by other authors, and some have used the issue to justify multidimensional and formative measures. In the context of developing a taxonomy of multidimensional constructs, Wong et al. (2008) decried the practice of theorizing at the higher-level construct (e.g., organizational citizenship behavior) but testing hypotheses at the dimension level (e.g., dimensions of organizational citizenship like civic virtue, sportsmanship, altruism, etc.) and called for clear specification of the relationship between constructs and indicators.
MacKenzie and colleagues (2005) noted that generally defined constructs and specific items do not correspond, naming transformational leadership and procedural justice as examples, and stated that these measurement models have been incorrectly specified as reflective. They recommend that misspecified models (e.g., transformational leadership and procedural justice) be respecified formatively as latent composite measurement models. The difficulty with the MacKenzie et al. (2005) recommended solution is that the revised measurement model inherits the difficulties and ambiguities of formative models and overlooks the lack of correspondence between the construct and the indicators. We suggest that respecifying the relationship between broadly defined constructs and specific, facet-like indicators as formative is problematic, for reasons identified in this chapter, but we propose a simple solution. We recommend following long-standing advice to operationalize the construct with indicators at the same level of specificity, or abstraction, as the construct definition. That is, when working with a broadly defined, abstract concept, the indicators ought to be operationalized in a similarly broad and abstract way. When working with a narrow, precise construct definition, the indicators ought to capture the same narrow precision. This very simple solution of ensuring correspondence is entirely consistent with the well-understood principles of reflective measurement, classical test theory, and content validity (Anderson & Gerbing, 1991; Cronbach & Meehl, 1955; Hinkin & Tracey, 1999; Messick, 1995).

When research interest is focused on the relationship between specific dimensions or facets and a more general, abstract construct, it is easy to embed the logic of a formative model into a fully reflective model by applying our recommendation to add a structural relationship between the specific dimensions and the higher order construct (Figure 11.5). In this model three reflective constructs (perhaps each representing a facet or dimension of job satisfaction), each with multiple indicators, are predicted to cause a higher order construct (perhaps overall job satisfaction). The key difference is that the higher order construct is itself reflective and is represented by three or more reflective indicators that have been specified at the same level of abstraction as the higher order construct. Remember, the advice for testing formative constructs with MIMIC models involved selecting two reflective indicators (MacKenzie et al., 2005) – it should be simple enough to select three reflective indicators. The paths from the lower order reflective constructs to the higher order construct capture the idea inherent in the formative approach but can be tested as structural components of the model (i.e., the lower order reflective constructs are considered independent variables in the structural model). This model combines reflective measurement with three structural paths linking the three dimensions (e.g., facets of job satisfaction) to the higher order construct (e.g., overall satisfaction). The reflective measurement model might be tested first, then the full model with structural paths tested, consistent with the two-step procedures of SEM (Anderson & Gerbing, 1988). In this fashion, it is possible to model the structural prediction that increases in facets of satisfaction will yield an increase in global satisfaction – while retaining the advantages of reflective measures.



Figure 11.5  Embedding the logic of a formative model into a reflective model for testing

SUMMARY AND CONCLUSION

What is remarkable is that methodologists agree on the importance of properly specifying constructs and measurement models, and largely agree on the attributes of reflective and formative measures and the procedures for testing them. The disagreement between critics and advocates of formative measures may stem from a philosophical difference about the "realness" of the constructs we use and test in our theories (i.e., a critical realist perspective versus a constructivist perspective). Regardless of whether a measure is designated as reflective or formative, it is essential to clearly specify the theoretical measurement model and to acquire the empirical evidence necessary to support construct validity before testing hypotheses (Anderson & Gerbing, 1988). This chapter is a brief introduction to reflective and formative measurement but does not address all possible forms of measurement models. We do not address bifactor models, which accommodate multidimensionality by modeling variance shared among indicators through relationships of indicators to both specific assigned factors and global factors (Morin, Arens, & Marsh, 2016; Reise, 2012). Nor do we cover how multidimensional item response theory (IRT) models can address indicators that simultaneously load onto specific and general factors (Lang & Tay, 2021). Furthermore, we ask readers to avoid blindly accepting our arguments in this chapter. Instead, we urge readers to plunge into the research we have cited to better understand these issues for themselves. We have attempted to summarize the essential positions and arguments associated with a discussion that has played out over decades among many methodologists, and any flaws in this summary are ours.

Note

1 I count Dr. Tepper as a mentor and friend. He and I have discussed his scale and more recent versions of the measure frequently, and I include only items that focus on active aggression rather than passive abuse (Mitchell & Ambrose, 2007). I offer only abusive supervision as an example, but there are many other measures in the management literature where the specificity of the items does not well correspond to the general nature of the construct.

REFERENCES

Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin, 103(3), 411–23.
Anderson, J. C., & Gerbing, D. W. (1991). Predicting the performance of measures in a confirmatory factor analysis with a pretest assessment of their substantive validities. Journal of Applied Psychology, 76(5), 732–40.
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16(3), 397–438.
Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons.
Bollen, K. A., & Diamantopoulos, A. (2017). In defense of causal–formative indicators: A minority report. Psychological Methods, 22(3), 581–96.
Bollen, K. A., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation perspective. Psychological Bulletin, 110(2), 305–14.
Bollen, K. A., & Ting, K.-f. (2000). A tetrad test for causal indicators. Psychological Methods, 5(1), 3–22.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–19.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–71.
Brown, T. A. (2015). Confirmatory factor analysis for applied research. Guilford Press.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Lawrence Erlbaum Associates.
Colquitt, J. A. (2001). On the dimensionality of organizational justice: A construct validation of a measure. Journal of Applied Psychology, 86(3), 386–400.
Colquitt, J. A., Sabey, T. B., Rodell, J. B., & Hill, E. T. (2019). Content validation guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness. Journal of Applied Psychology, 104(10), 1243–65.
Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor analysis practices in organizational research. Organizational Research Methods, 6(2), 147–68.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98–104.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
DeVellis, R. F. (2017). Scale development: Theory and applications. Sage.
Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative indicators: An alternative to scale development. Journal of Marketing Research, 38(2), 269–77.

Edwards, J. R. (2001). Multidimensional constructs in organizational behavior research. Organizational Research Methods, 4(2), 144–92. Edwards, J. R. (2003). Construct validation in organizational behavior research. In Organizational behavior: The state of the science, 2nd ed. (pp. 327–71). Lawrence Erlbaum Associates. Edwards, J. R. (2011). The fallacy of formative measurement. Organizational Research Methods, 14, 370–88. Edwards, J. R., & Bagozzi, R. (2000). Relationships between constructs and measures. Psychological Methods, 5(2), 155–74. Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272–99. Hinkin, T. R., & Tracey, J. B. (1999). An analysis of variance approach to content validation Organizational Research Methods, 2(2), 175–86. Howell, R. D., Breivik, E., & Wilcox, J. B. (2007). Reconsidering formative measurement. Psychological Methods, 12(2), 205–18. Jaccard, J., & Jacoby, J. (2010). Theory construction and model-building skills. Guilford Press. Jackson, D. L., Gillaspy, J. A., Jr., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14(1), 6–23. Joreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70(351), 631–9. Judge, T. A., & Kammeyer-Mueller. (2012). Job attitudes. Annual Review of Psychology, 63, 341–67. Lang, J. W. B., & Tay, L. (2021). The science and practice of item response theory in organizations. Annual Review of Organizational Psychology and Organizational Behavior, 8(1), 311–38. Law, K. S., Wong, C., & Mobley, W. H. (1998). Toward a taxonomy of multidimensional constructs. Academy of Management Review, 23(4), 741–55. Locke, E. A. (1976). The nature and causes of job satisfaction. 
In M. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 1297–1350). Rand McNally. Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–94. MacKenzie, S. B., Podsakoff, P. M., & Jarvis, C. B. (2005). The problem of measurement model misspecification in behavioral and organizational research and some recommended solutions. Journal of Applied Psychology, 90(4), 710–30. MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation


procedures in MIS and behavioral research: Integrating new and existing technologies MIS Quarterly, 35(2), 293–334. Marsh, H. W., & Hocevar, D. (1985). Application of confirmatory factor analysis to the study of selfconcept: First- and higher order factor models and their invariance across groups. Psychological Bulletin, 97(3), 562–82. McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–33. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–9. Mitchell, M. S., & Ambrose, M. L. (2007). Abusive supervision and workplace deviance and the moderating effects of negative reciprocity beliefs. Journal of Applied Psychology, 92(4), 1159–1168. Morin, A. J. S., Arens, A. K., & Marsh, H. W. (2016). A bifactor exploratory structural equation modeling framework for the identification of distinct sources of construct-relevant psychometric multidimensionality. Structural Equation Modeling: A Multidisciplinary Journal, 23(1), 116–39. Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2016). Recommendations for creating better concept definitions in the organizational, behavioral, and social sciences. Organizational Research Methods, 19(2), 159–203.


Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–96. Schwab, D. P. (1980). Construct validity in organizational behavior. Research in organizational behavior, 2, 3–43. Schwab, D. P. (2005). Research methods for organizational studies. Lawrence Erlbaum Associates. Shaffer, J. A., DeGeest, D., & Li, A. (2016). Tackling the problem of construct proliferation: A guide to assessing the discriminant validity of conceptually related constructs. Organizational Research Methods, 19(1), 80–110. Tepper, B. J. (2000). Consequences of abusive supervision. Academy of Management Journal, 43(2), 178–90. Widaman, K. F. (2012). Exploratory factor analysis and confirmatory factor analysis. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology, Vol 3: Data analysis and research publication (pp. 361–89). Willis, G. B. (2005). Cognitive interviewing: A tool for improving questionnaire design. Sage. Wong, C.-S., Law, K. S., & Huang, G.-h. (2008). On the importance of conducting construct-level analysis for multidimensional constructs in theory development and testing. Journal of Management, 34(4), 744–64.

12 Understanding the Complexities of Translating Measures: A Guide to Improve Scale Translation Quality Sheila K. Keener, Kathleen R. Keeler*, Zitong Sheng* and Tine Köhler

The oversampling of Western, educated, industrialized, rich and democratic (WEIRD) participants has attracted attention in the social sciences (Henrich et al., 2010). The need to study diverse populations makes translation a critical concern, as researchers often need to adapt scales developed in one language to another. Similarly, translation is unavoidable when well-validated scales to measure a construct do not exist in the language in which surveys are administered. Despite the prevalence of scale translation, researchers may not always be aware of its importance and the consequences of poor translation quality. Researchers rarely investigate translation quality before data collection (Bradley, 2013). Methodological pitfalls originating from survey translation often go undetected until data are collected or analyzed and are often mistakenly attributed to survey content problems (Sperber et al., 1994). Poor translation quality can be detrimental in various ways – at best, it results in a waste of time and resources. More importantly, poor translation quality has critical consequences for measurement reliability and validity, which serve as the foundation of social science research (Cortina et al., 2020). Conclusions drawn from studies using poorly translated scales can be "meaningless, inconclusive, or misguiding" (Schaffer & Riordan, 2003, p. 169). Thus, although a good translation is only one step towards ensuring successful adaptation of a scale from one culture to another, it is a vitally important step and merits careful attention. This chapter demonstrates the importance of scale translation and provides guidance for researchers to navigate the translation process, select appropriate translation methods, and evaluate the quality of a translation. After an exploration of how cultural variations in construct conceptualization can affect scale adequacy and translation processes, we review and discuss three general approaches to translation. Based on our review, we provide specific guidance for how researchers should undertake translation. Finally, we discuss the various ways in which researchers can evaluate the quality of their translation using both qualitative and quantitative methods. We note that researchers may choose to develop culture-specific measures instead of translating existing scales. As this chapter's focus is on scale translation, we only briefly discuss cases where developing culture-specific measures is more appropriate than translating existing scales.

* Denotes equal contribution


CULTURAL VARIATION IN CONSTRUCT MEANING AND ITS EFFECT ON TRANSLATION

Before starting scale translation, a researcher should first consider whether the construct in question is the same in the cultures in which data are collected. Most translation processes assume that the underlying construct is the same across cultures. In most cases, the main concern with translation processes is lexical differences. However, as language and cross-cultural scholars have long argued, language differences do not just include lexical differences in the words and grammar used, but can also include differences in the cultural meaning of constructs, the contexts in which they reside, behavioral prescriptions related to the construct, interpretation of behaviors related to the construct, or, at a very fundamental level, in the existence of certain constructs (e.g., Kim, 2005; Harzing et al., 2013). Language scholars have even found that survey respondents express different cultural values when they take the survey in their native language versus another language in which they are fluent (e.g., Harzing & Maznevski, 2002). Survey researchers should take these effects into account, especially when collecting data across different cultures for the purpose of cultural comparisons. Underlying differences in construct meaning have implications for appropriate translation of surveys, as scale translation extends beyond translating the words used in surveys to the translation of the respective construct into a different cultural context. Consider the example of the construct "honesty," which is of relevance in integrity surveys and a sub-facet of HEXACO (Lee & Ashton, 2004). Prior research suggests that honesty is not perceived and defined consistently across cultures (Liu et al., 2003; Köhler, 2010). Moreover, the behaviors and attitudes that constitute honesty are not the same across cultures (Hugh-Jones, 2016).
In many cultural contexts, being truthful in thought and action would sometimes be inappropriate, and not being truthful would not be considered dishonest. For example, owing to the Confucian value of harmony, avoiding potential conflicts in interpersonal interactions is a strong norm for East Asians (Morris et al., 1998). In this case, deception can be tolerated if it helps maintain interpersonal harmony. This is not the case in Western societies, where truthfulness is valued above interpersonal harmony. Capturing these construct differences, recent work has struggled to establish the measurement invariance of honesty scales across different languages (e.g., Thielmann et al., 2019). Similarly, differences in the meaning


and nomological network of constructs have been found for constructs such as trust (Köhler, 2010), integrity (Köhler et al., 2010), and loyalty (Köhler, 2010). In this regard, cross-cultural researchers have distinguished "emic" concepts from "etic" concepts. Like the aforementioned examples, emic constructs are those that have culture-specific validity. In comparison, etic concepts are those that have cross-cultural validity (Triandis, 1976). If one's construct of interest is emic, its measurement would then require the development of a culture- and language-specific scale, for which translation from one language to another would be inappropriate. To determine if a concept is etic or emic, Brislin (1970) recommended using decentering, which refers to multiple rounds of translation and back-translation in which the original source language version is open to modification. At the end of the process, the construct assessed is assumed to be etic if the two versions are substantially similar; otherwise, it is emic. The focus of this chapter, scale translation, is more applicable to etic concepts. Whether translation is appropriate in the first place, however, is often not evaluated, which can have significant implications for conclusions drawn from such research. Even when translation is justified, there are several ways in which translation quality may be less than adequate and can have serious consequences regarding the trustworthiness of the results. We elaborate on this point in the following section.

IMPLICATIONS OF INAPPROPRIATE TRANSLATIONS

Inappropriate translations can greatly affect scale reliability and validity (see Chapters 14 and 15 in this Handbook for an in-depth treatment of these topics). Among multiple forms of reliability, what concerns us most here is internal consistency reliability, which is estimated from the variances and covariances of test items. High-quality scales have high internal consistency reliability because scores on all items, reflecting the same underlying construct, are highly correlated with each other. Construct validity refers to the extent to which measurement items adequately represent the underlying theoretical construct that they are supposed to measure. Reliability and construct validity then influence the predictive validity of a measure, which refers to whether the measure relates to a criterion that is theoretically associated with it. Predictive validity will be



underestimated when a measure has low reliability and/or low construct validity. If a measure includes concepts that are culturally distinct or items that were developed observing specific cultural norms, translating it into another language and culture is likely to lead to construct contamination and deficiency issues – limiting internal consistency and, ultimately, predictive validity. Even when the concept has demonstrated cross-cultural validity, some items may require both knowledge of the underlying construct and of the target culture to be translated appropriately. For example, "courtesy" means being thoughtful of one's behavior in English, but its direct translation in Chinese (i.e., "礼貌") more closely represents the meaning of "good manners." The English word "identity" usually represents one's self-concept or qualities that differentiate a person or group from others, while its direct translation in Chinese (i.e., "身份") refers to one's personal information such as name, nationality, or social status. Linguistic inaccuracies like these in the translation process would lead to a translated scale not representing the underlying construct that one intends to measure. Reliability and construct validity would therefore suffer. Given the critical consequences associated with translation quality and appropriateness, we provide essential guidance in navigating the translation process.
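Because poor translation erodes internal consistency, it may help to see concretely how that statistic follows from the item variances and covariances described above. Below is a minimal sketch of coefficient alpha, the most common internal consistency estimate; the function name and toy data are ours, for illustration only:

```python
import numpy as np

def coefficient_alpha(scores):
    """Internal consistency (coefficient alpha) from a
    respondents x items matrix of observed item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Three respondents answering two items that covary perfectly:
print(coefficient_alpha([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

A mistranslated item that no longer reflects the same construct covaries weakly with the other items, shrinking the total-score variance relative to the summed item variances and pulling alpha down.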

INTRODUCTION OF APPROACHES TO TRANSLATION

Assuming that translation is appropriate for an existing scale, there are several different translation approaches that can be followed, including one-to-one translation (e.g., Harkness, 2003), back-translation (e.g., Brislin, 1970), and the committee approach (e.g., Van de Vijver & Hambleton, 1996; Douglas & Craig, 2007). Below, we briefly describe each approach. We then discuss a recommended translation process.

One-to-one translation. In one-to-one translation (i.e., one-way, direct, or forward translation; Harkness, 2003), one bilingual person translates an existing scale from the source language to the target language (Harkness, 2003).

Back-translation. In its basic form, back-translation occurs when a bilingual individual translates a scale from the source language into the target language and another bilingual individual independently translates the target language version back into the source language (Brislin, 1970). The original and back-translated versions are compared. If errors are detected between the two

versions, the target language version is revised and back-translated again. This process can go through several rounds until no additional errors are identified. Brislin (1970) also suggested that researchers employ monolingual reviewers to compare the original and back-translated versions and bilingual reviewers to compare the original and translated versions for meaning errors. Lastly, Brislin et al. (2004) recommended that once no additional errors are identified, the translated scale should be pretested with monolingual individuals. If results of pretesting reveal that some items are not performing as expected, the survey should undergo additional revision and then be pretested again. It is unclear, however, whether these additional recommended steps are followed in practice as not many researchers provide sufficient detail about their translation process or how they attempted to ensure equivalence (e.g., Schaffer & Riordan, 2003; Heggestad et al., 2019). Committee translation. We use the term committee translation to refer specifically to the TRAPD approach to team translation, which is a five-step procedure involving parallel Translation, Review, Adjudication, Pretesting, and Documentation (Harkness, 2003, see also Harkness et  al., 2004; Douglas & Craig, 2007; Harkness et  al., 2010). The first step in this approach is parallel translation, during which two or more bilinguals independently translate the materials from the source language to the target language. Then, the original translators and one independent reviewer (potentially the researcher if they are knowledgeable of the source and target languages) meet to review the translations and arrive at a final version for pretesting. Adjudication may be necessary if the translators cannot agree on a final version. After a final version is chosen, the next step is to pretest the scale. 
If issues are identified during pretesting, the translation should be revised by the translators and then re-pretested in an iterative fashion until no additional issues are identified (Douglas & Craig, 2007). Lastly, the scale and the translation process are thoroughly documented to explain the reasoning behind decisions and any changes made after pretesting, as well as for version control if additional changes are made in the future (Harkness, 2003).

RECOMMENDED GUIDELINES FOR TRANSLATION

Each translation approach has advantages and disadvantages. Therefore, we recommend that researchers combine aspects from each approach (see Figure 12.1). First, researchers should choose translators who not only have in-depth knowledge of both the source and target languages but also an understanding of both cultures and the content being translated. This knowledge allows the translator(s) to better understand the intended meaning of each item and choose a translation that best conveys this intended meaning in a manner that is most relevant to the culture where data are being collected (Harkness et al., 2004; International Test Commission, 2017).

Figure 12.1  Translation Decision Guide



Second, to help avoid difficulties in the translation process, the source-language scale should be reviewed to identify aspects that may be difficult to translate or culturally irrelevant in the target language (e.g., idioms, culture-specific references; Hambleton, 2004; Weeks et al., 2007). If necessary, the original scale should be decentered to make it easier to translate (Werner & Campbell, 1970; Brislin, 1973; Hambleton, 2004).

Third, multiple translators should be used throughout the process and multiple translations should be generated (e.g., Douglas & Craig, 2007; Harkness et al., 2010). The use of multiple translators to produce multiple independent translations overcomes drawbacks associated with only using one translator, such as translators' idiosyncrasies. For example, one translator may be more proficient in either the source or target language, may be more or less familiar with various regional dialects, and may have more or less familiarity with the content that they are translating/comparing (Harkness, 2003; Hambleton, 2004). Evidence suggests that translations created by groups tend to be superior to those created by individuals (e.g., McKay et al., 1996) as they maximize the collective expertise of the translators and thus yield a final translated product with fewer meaning errors (Douglas & Craig, 2007; Harkness et al., 2010).

Fourth, although back-translation does not guarantee equivalence, it does allow a source-language monolingual researcher to participate in the translation process and provide feedback. Furthermore, if comparisons between the original and target-language versions and between the original and back-translated versions both suggest equivalency, researchers can have more confidence in the quality of the translation. Therefore, back-translation should also be built into the translation process (Weeks et al., 2007).
Lastly, translations should be evaluated by members of the target population using the pretesting methods discussed in the next section (e.g., Douglas & Craig, 2007; Weeks et al., 2007). This is particularly important as translators likely differ substantially from the target-language monolinguals who will ultimately complete the translated measure. For instance, they may be, on average, more highly educated and use the language differently than monolinguals (Hambleton, 2004). Furthermore, translators may share common rules for using certain translations for some words that are not actually equivalent in meaning. This would make a translation appear acceptable to other translators even though it may not make sense to the target population (Brislin, 1970; Hulin, 1987). Strictly following these guidelines may be resource-intensive; therefore, it is not surprising that most researchers use a less rigorous process.

Yet, there are examples of rigorous translations that include several of the recommendations. For instance, Kim et  al. (2017) followed a modified back-translation approach to translate their newly developed measure of voluntary workplace green behaviors. Specifically, the first author translated the new survey into the target language. This translation was reviewed by a focus group of bilingual individuals who also had content expertise to ensure that the survey was easy to understand and culturally relevant. Two additional bilingual authors then engaged in an iterative process to identify and address discrepancies between the source language version and the translation. The final translated survey was back-translated into the source language as an additional check of equivalence. Lastly, an additional group of individuals in the target population reviewed the final survey to ensure items were relevant for the specific research context. Song et  al. (2019) translated existing measures of their focal constructs using a combination of back-translation and the committee approach. First, three bilingual translators independently translated the source language survey. These individuals then discussed their independent translations until they reached consensus. Then, three additional bilingual translators independently back-translated the survey. These three individuals discussed the back-translations until consensus was reached. Lastly, a separate bilingual committee with content expertise compared the original and back-translated versions and decided on the final translation. As can be seen, both Kim et al. (2017) and Song et al. (2019) engaged in a translation process that followed many, but not all, of the aforementioned guidelines. Unfortunately, there is little research examining which of these guidelines are most crucial. Therefore, we encourage future research to provide further guidance.

HOW TO EVALUATE TRANSLATION QUALITY

Regardless of approach, it is important for researchers to evaluate the quality of their final translated materials. In other words, researchers need to evaluate the degree to which the translated survey reflects the construct of interest and, assuming the construct itself is culturally invariant, yields similar patterns of responses across languages (Brislin, 1973; Mohler & Johnson, 2010). Here, we review several techniques to assess both qualitative and quantitative equivalence of the final translation.


We also provide a checklist for how to evaluate translation quality (see Table 12.1). Pretesting. Pretesting is the recommended first step to assess translation quality (Brislin et al., 2004; Caspar et al., 2016). Pretesting allows researchers to identify potential sources of measurement error prior to conducting their full study. Pretesting can include techniques or methods such as pilot studies and cognitive interviewing. Pilot studies involve pretesting the actual data collection with a smaller sample that is similar to the target sample in terms of national origin, language, and demographic distributions (Brislin et  al., 2004). Open-ended responses at the end of the survey can be used to ask participants for feedback regarding the clarity of the items. Some strengths of pilot testing are that it is realistic, allows for testing of all procedures, and can enable researchers to acquire feedback from participants about the questionnaires. There are, however, some drawbacks of conducting a pilot test. It can be costly and requires additional time to plan. It may also be difficult to find a large enough sample that is similar to the target population. Finally, although a pilot test would inform researchers how well the translated measure performs in the target population, it does not, on its own, provide any indication whether survey items function in an equivalent manner. Cognitive interviewing can determine whether the true meaning of questions is conveyed to participants and whether survey items function as intended. Cognitive interviewing stems from a survey methodology movement which argues that participants’ thought processes need to be understood to identify potential sources of measurement error and assess validity (Hibben & de Jong, 2016). Generally, cognitive interviewing


is a think-aloud technique in which respondents either provide their in-the-moment thoughts as they answer the question (i.e., concurrent think-aloud) or engage in an interview about how they came up with answers to specific questions after completing the survey (i.e., retrospective think-aloud). Cognitive interviewing can also involve verbal probing in which the interviewer asks additional follow-up questions to further clarify the participant's thinking (Willis, 2004). This technique provides rich insight into whether the translated version maintains the original meaning of the questions. Cognitive interviewing allows researchers to pick up item miscomprehension issues that cannot be detected by conventional statistical analyses (Hardy & Ford, 2014). This allows items to be revised prior to the main data collection. However, this process lacks realism and is highly burdensome to participants. It also assumes that participants can identify what information they relied upon to come up with their answer. Furthermore, in the case of the concurrent think-aloud technique, it may interfere with response formation, in that participants overinterpret questions (Willis, 2004). Similarly, in the case of verbal probing, interviewer bias or leading may alter responses the respondent may provide, creating additional sources of error (Ericsson & Simon, 1998). The effectiveness of cognitive interviewing is also unknown, as empirical evaluations of these methods have produced mixed results (Willis, 2004). Finally, similar to pilot testing, cognitive interviewing does not provide direct evidence of functional equivalence. Thus, although pretesting can help researchers identify errors in translated instruments and provide preliminary indicators of translation quality, pretesting alone is not sufficient to establish

equivalency (e.g., Van de Vijver & Hambleton, 1996; Mohler & Johnson, 2010). Ideally, pretesting should be combined with one or more statistical approaches to assess translation quality.

Table 12.1  Recommended steps to evaluate translation quality

1. Pretest translated materials with a smaller sample representative of the target sample.
   Can be accomplished either by following a cognitive interview protocol or conducting a pilot study.
   Should be completed prior to main data collection.
2. Test for measurement invariance.
   Identify equivalent referent items for each latent factor.
   Test for configural invariance using the free-baseline approach.
   If configural invariance is met, test for metric and scalar invariance simultaneously.
   If scalar invariance is not met, compute the effect size of non-equivalence (dMACS).
   Recommended that this be done prior to main data collection using data from a pilot study in which two samples are collected (i.e., one sample takes the survey in the original language, and one sample takes the translated version).
3. Conduct IRT DIF analyses following recommended procedures to determine if any items are non-equivalent.
   Have translators review and revise or remove flagged items.

STATISTICAL TECHNIQUES TO ASSESS EQUIVALENCE

Measurement invariance (MI) indicates the extent to which the same construct or set of items is being measured across groups (Vandenberg & Lance, 2000). MI can be used to directly assess whether the translated items function similarly to the original items. MI can be assessed using several methods, such as Multiple Indicators Multiple Causes (MIMIC) models or alignment optimization (for a review see Somaraju et al., 2021), but here we focus on two popular methods: the multi-group confirmatory factor analysis (MGCFA) framework and item response theory (IRT) with differential item functioning (DIF). We discuss these techniques as they relate to translation below. For a more in-depth review of MI, please see Chapter 13 in this Handbook.

Multi-group Confirmatory Factor Analysis. MGCFA can be used to assess MI in a series of hierarchical models in which additional equality constraints on model parameters are added to reach stronger forms of invariance (Vandenberg & Lance, 2000). Vandenberg and Lance (2000) outlined a four-step procedure for establishing MI using MGCFA. The lowest level of MI is configural invariance. Configural invariance tests whether similar factors emerge in each group. It requires that the same items load onto the same factors in each group with free factor loadings. Configural invariance serves as the baseline model, which must be met before stricter forms of invariance can be tested. The next level of invariance, metric invariance, tests whether the factor loadings of the items onto the same constructs are equivalent across groups. Factor loadings reflect the degree to which differences among individuals' answers to an item are due to differences among their standing on the latent construct that is measured by that item.
If the factor loadings are not invariant between the original and translated versions of the survey, this suggests a potential change in the meaning of items during the translation process (Chen, 2008). Metric invariance is tested by constraining the factor loadings to be equal across the groups and then comparing the model fit to that of the configural invariance model using a chi-square difference test.
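Given the chi-square and degrees of freedom that any SEM program reports for the two nested models, the chi-square difference test can be computed directly. A minimal sketch (the fit statistics in the example are invented for illustration):

```python
from scipy.stats import chi2

def chi_square_difference(chisq_constrained, df_constrained,
                          chisq_baseline, df_baseline):
    """Likelihood-ratio test comparing a constrained model (e.g., metric
    invariance) against its less restrictive baseline (e.g., configural)."""
    delta_chisq = chisq_constrained - chisq_baseline
    delta_df = df_constrained - df_baseline
    p_value = chi2.sf(delta_chisq, delta_df)
    # A non-significant p-value means the added equality constraints did
    # not worsen fit, supporting the stricter level of invariance.
    return delta_chisq, delta_df, p_value

# Hypothetical fit statistics: constraining the loadings added 2 df
# and increased the chi-square by 10.0.
delta, ddf, p = chi_square_difference(110.0, 50, 100.0, 48)
```

Here p falls below .05, so metric invariance would be rejected and the non-invariant loadings investigated.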

Scalar invariance tests whether the item intercepts (i.e., the origin or starting value of the scale) are equivalent between groups. When both the factor loadings and intercepts are invariant across groups, it can be inferred that scores from different groups have the same unit of measurement and the same starting point (Chen, 2008). Scalar invariance is achieved if there is no significant difference in model fit when compared to the fit of the metric model. Scalar invariance allows for comparisons of factor means and provides a stronger test that the original and translated scales are functionally equivalent. If scalar invariance is not achieved, it suggests possible measurement bias. Finally, the fourth level of invariance is strict factorial invariance. This test includes determining the invariance of the scale (i.e., factor variances) and the individual items (i.e., residual variances). This is the strongest form of invariance but also the most difficult to obtain in practice.

More recent empirical work on MI recommends that researchers estimate the scalar invariance model immediately after establishing configural invariance (Stark et al., 2006; Somaraju et al., 2021) rather than following the aforementioned four steps. The simultaneous approach simplifies the process by reducing the number of comparisons and the potential for errors early in the estimation process. The inherent relationship between the intercepts and factor loadings means that bias in the factor loadings can also lead to bias in the intercepts (and vice versa), and thus it does not make sense to separate the two steps (Somaraju et al., 2021).

When estimating MI as an indicator of translation quality using MGCFA, we first recommend that researchers take time to identify an equivalent referent item between the original and translated versions for each latent factor. Latent factors are unobserved and therefore do not have an inherent mean or variance.
The latent mean and variance must be given a metric to estimate the model. This is usually done by selecting a referent item to set both the mean and variance (Stark et al., 2006; Somaraju et al., 2021), and in most software programs (e.g., Mplus) the first item is chosen by default. Thus, it is important that the chosen referent item is equivalent across groups, as using a non-equivalent item can inflate Type I errors or mask non-equivalence (see Stark et al., 2006, and Cheung & Lau, 2012, for methods for identifying equivalent referent items). After identifying the appropriate referent item(s), we recommend that researchers apply parameter estimate constraints simultaneously and estimate scalar invariance directly after configural invariance is established between the original and translated versions. If scalar invariance is not established,


researchers should first determine the magnitude of non-equivalence, as the impact on scale scores and mean differences can be trivial when there is a small amount of non-equivalence in a small number of items (Nye et al., 2019; Somaraju et al., 2021). There are several effect sizes that researchers can use, but we recommend the dMACS, which is a standardized metric comparable to Cohen's d (Nye & Drasgow, 2011). Empirical work has recommended .4, .6, and .8 as cutoffs, indicating that non-equivalence between the original and translated versions will have small, medium, and large effects on study results (Nye et al., 2019). Finally, if a non-trivial degree of measurement non-equivalence has been found between the original and translated versions, we recommend that researchers identify the specific translated items that are contributing to the non-equivalence using approaches such as differential item functioning (DIF) as described in the following section. It is often recommended that researchers test for partial equivalence if non-equivalence is identified, but we caution against doing so with translated measures unless the magnitude of non-equivalence is small (i.e., dMACS < .4), as this approach has a number of limitations (for a review see Somaraju et al., 2021).

Item Response Theory and Differential Item Functioning

As a complement to MGCFA and other methods of examining MI based on classical test theory (CTT), researchers can also examine MI in an IRT framework through DIF analyses. DIF is present when individuals with the same theta (θ, trait/ability level), but from different subgroups, have a different probability of responding the same way to an item (Lord, 1980). DIF analyses can be used to determine if individuals responding to translated items respond similarly to individuals responding to items in the source language. Assessing MI in an IRT framework has several advantages.
First, IRT affords more flexibility than CTT models like MGCFA, as IRT can accommodate a variety of response formats and styles (e.g., Tay et al., 2015). For instance, IRT models can handle dichotomous and polytomous data and be used for both dominance and ideal point responses models, which propose different relations between θ and responses (Lang & Tay, 2021). Second, IRT provides more item-level information than can be obtained through MGCFA, as it models multiple item parameters (e.g., Tay et al., 2015; Wells, 2021b). Specifically, for dichotomous data, depending on the IRT model chosen, item discrimination (a), item difficulty (b), and a guessing parameter (c) can be modeled. For polytomous data, item discrimination and multiple bs can be modeled. DIF can be examined for each


parameter, which may help researchers and translators identify issues. Third, researchers can also examine whether observed DIF ultimately results in differences between subgroups at the test level (differential test functioning; Raju et al., 1995). Despite these advantages, there are some disadvantages to using an IRT approach – the greatest one being that IRT methods require a larger sample size than MGCFA (i.e., at least 500 per group; e.g., Reise & Yu, 1990). For more information on IRT and how it compares to CTT approaches like MGCFA, see the IRT chapter (19) in this Handbook, Foster et al. (2017), Nye et al. (2020), Lang and Tay (2021), and Wells (2021a). To assess DIF in an IRT framework, one must first choose the appropriate IRT model for the specific characteristics of a scale. For dichotomous data, the most common models are 2-parameter models, which model item discrimination and difficulty, and 3-parameter models, which model item discrimination, difficulty, and a guessing parameter (Tay et  al., 2015). Graded response models (Samejima, 1969) are the most common IRT models for polytomous data and are used for Likert-type items (Tay et al., 2015). The generalized graded unfolding model (Roberts et al., 2000) can be used for ideal point models. It is important that assumptions and model fit are assessed prior to conducting DIF analyses; otherwise, results may be misleading (e.g., Bolt, 2002). Most IRT models assume unidimensionality and local independence (i.e., that, controlling for θ, items are uncorrelated), both of which can be tested using CFA (Nye et al., 2020). To assess absolute IRT model fit, adjusted χ2 (Drasgow et al., 1995) is often used; Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) can be used to compare the relative fit of two IRT models to ensure the best-fitting model is chosen (Tay et al., 2015). Once these assumptions are met, Tay et  al. 
(2015) recommended that multiple approaches be used to assess DIF, including the likelihood ratio test (LRT; e.g., Thissen et al., 1986; Thissen et al., 1993) and Wald chi-square (χ2; e.g., Cai et  al., 2011; Woods et al., 2013). For the LRT, one essentially compares a model where DIF is assumed (i.e., where item parameters vary across groups, termed the augmented model) to a model where it is assumed there is no DIF (i.e., where all item parameters are equal across groups, termed the compact model). If the compact model’s fit is significantly worse than the augmented model’s fit, DIF is assumed to be present. Wald χ2 takes a different approach to detecting DIF than the LRT, as it compares item parameter estimates (e.g., item difficulty, item discrimination) across groups. It is also important to note that anchor items can



influence DIF results. Generally, anchors known to be DIF-free should be used; however, newer approaches have been proposed to address this issue (see e.g., Wang et al., in press). For a review of other approaches to assess DIF, including ones that rely on observed scores, see Wells (2021a). After completing DIF analyses, items showing DIF can then be examined by translators to ascertain potential reasons for the differences in responding (e.g., there is a problem with the translation, the item is not appropriate for one culture; Petersen et  al., 2003). Depending on the issue identified by the translators, the flagged item(s) can then be re-translated or, if necessary, dropped from future data collection and analysis. In sum, researchers should consider both MGCFA and DIF to assess the quality of translated measures. Both approaches provide unique information that can help researchers and translators determine if translated items and scales function similarly to the source-language items and scales.
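Lord's (1980) definition of DIF, the same θ but a different response probability, can be made concrete with a small Python sketch (the chapter itself supplies no code; all item parameters here are hypothetical). It uses a 3-parameter logistic item response function, which reduces to the 2PL model when the guessing parameter is zero.

```python
import numpy as np

def irf_3pl(theta, a, b, c=0.0):
    """Three-parameter logistic item response function.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    Returns P(endorse | theta); with c = 0 this is the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical source-language and translated versions of one item.
# Uniform DIF: the translated item is "harder" (larger b) at the same
# discrimination, so respondents with equal theta respond differently.
theta = 0.5
p_source = irf_3pl(theta, a=1.5, b=0.0)
p_translated = irf_3pl(theta, a=1.5, b=0.8)
print(round(float(p_source), 2), round(float(p_translated), 2))  # ~0.68 vs ~0.39
```

A gap of this size at the same trait level is exactly what DIF analyses are designed to flag, and what translators would then inspect for wording or cultural problems.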

CONCLUSION

Adequate translation is an important component of the research process. Researchers need to evaluate a variety of decisions, such as whether translation is justified, which translation approach to use, and how to evaluate the quality of the final product. The goal of this chapter is to provide researchers with clear guidance regarding these choices so that they can make more informed decisions when translating existing scales. Figure 12.1 provides a guideline to help researchers determine the most appropriate approach to translation. First, researchers need to evaluate whether translation is even appropriate for the construct of interest. If the construct is emic in nature, then culture-specific scales are needed. If the construct is etic, then translating existing validated scales is appropriate. Although there is very little recent empirical work in this space – making it difficult to offer definitive recommendations – we suggest that researchers combine aspects of the back-translation and committee approaches if they have the resources to do so. Table 12.1 summarizes our recommendations and guidelines to evaluate translation quality. By enhancing researchers' understanding of the translation process and the potential pitfalls they may encounter, we hope this chapter will enhance future scale translation quality and, by extension, the trustworthiness of future research findings. We also hope that this chapter encourages more rigorous empirical work on translation approaches to further refine these recommendations.

REFERENCES

Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113–41. S15324818AME1502_01
Bradley, C. (2013). Translation of questionnaires for use in different languages and cultures. In C. Bradley (Ed.), Handbook of psychology and diabetes: A guide to psychological measurement in diabetes research and practice (pp. 43–56). Routledge.
Brislin, R. W. (1970). Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1, 185–216.
Brislin, R. W. (1973). Questionnaire wording and translation. In R. Brislin, W. J. Lonner, & R. M. Thorndike (Eds.), Cross-cultural research methods (pp. 32–58). John Wiley & Sons, Inc.
Brislin, R. W., MacNab, B., & Bechtold, D. (2004). Translation. In C. D. Spielberger (Ed.), Encyclopedia of applied psychology (pp. 587–96). Elsevier.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Scientific Software International.
Caspar, R., Peytcheva, E., Yan, T., Lee, S., Liu, M., & Hu, M. (2016). Pretesting. Guidelines for Best Practice in Cross-Cultural Surveys. Survey Research Center, Institute for Social Research, University of Michigan. pretesting/
Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–18.
Cheung, G. W., & Lau, R. S. (2012). A direct comparison approach for testing measurement invariance. Organizational Research Methods, 15, 167–98.
Cortina, J. M., Sheng, Z., Keener, S. K., Keeler, K. R., Grubb, L. K., Schmitt, N., Tonidandel, S., Summerville, K. M., Heggestad, E. D., & Banks, G. C. (2020). From alpha to omega and beyond! A look at the past, present, and (possible) future of psychometric soundness in the Journal of Applied Psychology. Journal of Applied Psychology, 105, 1351–81.
Douglas, S. P., & Craig, C. S. (2007). Collaborative and iterative translation: An alternative approach to back translation. Journal of International Marketing, 15, 30–43. jimk.15.1.030
Drasgow, F., Levine, M. V., Tsien, S., Williams, B. A., & Mead, A. D. (1995). Fitting polytomous item response models to multiple-choice tests. Applied Psychological Measurement, 19, 145–65.
Ericsson, K. A., & Simon, H. A. (1998). How to study thinking in everyday life: Contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, and Activity, 5, 178–86.
Foster, G. C., Min, H., & Zickar, M. J. (2017). Review of item response theory practices in organizational research: Lessons learned and paths forward. Organizational Research Methods, 20, 465–86.
Hambleton, R. K. (2004). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. K. Hambleton, P. F. Merenda, & C. D. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 3–38). Psychology Press.
Hardy, B., & Ford, L. R. (2014). It's not me, it's you: Miscomprehension in surveys. Organizational Research Methods, 17, 138–62. 10.1177/1094428113520185
Harkness, J. A. (2003). Questionnaire translation. In J. A. Harkness, F. J. R. van de Vijver, & P. Mohler (Eds.), Cross-cultural survey methods (Vol. 1, pp. 35–56). John Wiley & Sons.
Harkness, J., Pennell, B. E., & Schoua-Glusberg, A. (2004). Survey questionnaire translation and assessment. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 453–73). John Wiley & Sons, Inc.
Harkness, J. A., Villar, A., & Edwards, B. (2010). Translation, adaptation, and design. In J. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Mohler, B.-E. Pennel, & T. W. Smith (Eds.), Survey methods in multinational, multiregional, and multicultural contexts (pp. 115–40). John Wiley & Sons, Inc.
Harzing, A. W., & Maznevski, M. (2002). The interaction between language and culture: A test of the cultural accommodation hypothesis in seven countries. Language and Intercultural Communication, 2(2), 120–39. 14708470208668081
Harzing, A. W., Reiche, B. S., & Pudelko, M. (2013). Challenges in international survey research: A review with illustrations and suggested solutions for best practice. European Journal of International Management, 7, 112–34. https://doi.org/10.1504/EJIM.2013.052090
Heggestad, E. D., Scheaf, D. J., Banks, G. C., Monroe Hausfeld, M., Tonidandel, S., & Williams, E. B. (2019). Scale adaptation in organizational science research: A review and best-practice recommendations. Journal of Management, 45, 2596–2627.


Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61–83. S0140525X0999152X
Hibben, K. C., & de Jong, J. (2016). Cognitive interviewing. Guidelines for Best Practice in Cross-Cultural Surveys. Survey Research Center, Institute for Social Research, University of Michigan.
Hugh-Jones, D. (2016). Honesty, beliefs about honesty, and economic growth in 15 countries. Journal of Economic Behavior & Organization, 127, 99–114.
Hulin, C. L. (1987). A psychometric theory of evaluations of item and scale translations: Fidelity across languages. Journal of Cross-Cultural Psychology, 18, 115–42. 0022002187018002001
International Test Commission. (2017). The ITC guidelines for translating and adapting tests (2nd ed.).
Kim, Y. Y. (2005). Association and dissociation: A contextual theory of interethnic communication. In W. Gudykunst (Ed.), Theorizing about intercultural communication (pp. 323–40). Sage.
Kim, A., Kim, Y., Han, K., Jackson, S. E., & Ployhart, R. E. (2017). Multilevel influences on voluntary workplace green behavior: Individual differences, leader behavior, and coworker advocacy. Journal of Management, 43, 1335–58. 10.1177/0149206314547386
Köhler, T. (2010). Honesty, loyalty, and trust: Differences in the meaning of core psychological constructs across cultures and their effect on multicultural teamwork. Presented at the Conference of the International Association of Cross-Cultural Psychology in Melbourne.
Köhler, T., Gonzalez Morales, M. G., & Fine, S. (2010). Is integrity universal across cultures? Conceptual and measurement challenges. Presented at the International Congress of Applied Psychology in Melbourne.
Lang, J. W., & Tay, L. (2021). The science and practice of item response theory in organizations. Annual Review of Organizational Psychology and Organizational Behavior, 8, 311–38. 10.1146/annurev-orgpsych-012420-061705
Lee, K., & Ashton, M. C. (2004). Psychometric properties of the HEXACO personality inventory. Multivariate Behavioral Research, 39(2), 329–58.
Liu, C., Xiao, J., & Yang, Z. (2003). A compromise between self-enhancement and honesty: Chinese self-evaluations on social desirability scales. Psychological Reports, 92(1), 291–8. 0.2466%2Fpr0.2003.92.1.291



Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
McKay, R. B., Breslow, M. J., Sangster, R. L., Gabbard, S. M., Reynolds, R. W., Nakamoto, J. M., & Tarnai, J. (1996). Translating survey questionnaires: Lessons learned. New Directions for Evaluation, 70, 93–104.
Mohler, P. P., & Johnson, T. P. (2010). Equivalence, comparability, and methodological progress. In J. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Mohler, B.-E. Pennel, & T. W. Smith (Eds.), Survey methods in multinational, multiregional, and multicultural contexts (pp. 17–29). John Wiley & Sons, Inc.
Morris, M. W., Williams, K. Y., Leung, K., Larrick, R., Mendoza, M. T., Bhatnagar, D., … Hu, J.-C. (1998). Conflict management style: Accounting for cross-national differences. Journal of International Business Studies, 29(4), 729–47. https://doi.org/10.1057/palgrave.jibs.8490050
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96, 966–80. 10.1037/a0022955
Nye, C. D., Bradburn, J., Olenick, J., Bialko, C., & Drasgow, F. (2019). How big are my effects? Examining the magnitude of effect sizes in studies of measurement equivalence. Organizational Research Methods, 22, 678–709. https://doi.org/10.1177/1094428118761122
Nye, C. D., Joo, S. H., Zhang, B., & Stark, S. (2020). Advancing and evaluating IRT model data fit indices in organizational research. Organizational Research Methods, 23, 457–86. 10.1177/1094428119833158
Petersen, M. A., Groenvold, M., Bjorner, J. B., Aaronson, N., Conroy, T., Cull, A., … & Sullivan, M. (2003). Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Quality of Life Research, 12, 373–85.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353–68. 10.1177/014662169501900405
Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133–44.
Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 100–14.
Schaffer, B. S., & Riordan, C. M. (2003). A review of cross-cultural methodologies for organizational research: A best-practices approach. Organizational Research Methods, 6, 169–215. https://doi.org/10.1177/1094428103251542
Sperber, A. D., Devellis, R. F., & Boehlecke, B. (1994). Cross-cultural translation: Methodology and validation. Journal of Cross-Cultural Psychology, 25(4), 501–24. 0022022194254006
Somaraju, A. V., Nye, C. D., & Olenick, J. (2021). A review of measurement equivalence in organizational research: What's old, what's new, what's next? Organizational Research Methods.
Song, X., Anderson, T., Himawan, L., McClintock, A., Jiang, Y., & McCarrick, S. (2019). An investigation of a cultural help-seeking model for professional psychological services with US and Chinese samples. Journal of Cross-Cultural Psychology, 50, 1027–1049.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306. 10.1037/0021-9010.91.6.1292
Tay, L., Meade, A. W., & Cao, M. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18, 3–46.
Thielmann, I., Akrami, N., Babarović, T., Belloch, A., Bergh, R., Chirumbolo, A., … & Lee, K. (2019). The HEXACO–100 across 16 languages: A large-scale test of measurement invariance. Journal of Personality Assessment, 102, 714–26. https://doi.org/10.1080/00223891.2019.1614011
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group-mean differences: The concept of item bias. Psychological Bulletin, 99, 118–28.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Lawrence Erlbaum Associates.
Triandis, H. C. (1976). Approaches toward minimizing translation. In R. W. Brislin (Ed.), Translation: Applications and research (pp. 229–41). Gardner Press.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. 10.1177/109442810031002
Van de Vijver, F., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines. European Psychologist, 1, 89–99. 1016-9040.1.2.89
Wang, W., Liu, Y., & Liu, H. (in press). Testing differential item functioning without predefined anchor items using robust regression. Journal of Educational and Behavioral Statistics. 10.3102/10769986221109208
Weeks, A., Swerissen, H., & Belfrage, J. (2007). Issues, challenges, and solutions in translating study instruments. Evaluation Review, 31, 153–65.
Wells, C. (2021a). Assessing measurement invariance for applied research (Educational and psychological testing in a global context). Cambridge University Press.


Wells, C. (2021b). Methods based on item response theory. In Assessing measurement invariance for applied research (Educational and psychological testing in a global context) (pp. 161–244). Cambridge University Press.
Werner, O., & Campbell, D. T. (1970). Translating, working through interpreters, and the problem of decentering. In R. Naroll & R. Cohen (Eds.), A handbook of method in cultural anthropology (pp. 398–420). Natural History Press.
Willis, G. B. (2004). Cognitive interviewing: A tool for improving questionnaire design. Sage.
Woods, C. M., Cai, L., & Wang, M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73, 532–47. 10.1177/0013164412464875

13 Measurement Equivalence/Invariance Across Groups, Time, and Test Formats
Changya Hu, Ekin K. Pellegrini and Gordon W. Cheung

“When a test item is found to function differently depending on the examinee’s group membership, questions are raised about what the item is actually measuring.” (Roger E. Millsap, 2011, p. 7)

A consistent research interest in the social sciences is comparison across groups (e.g., gender, ethnicity, country, and treatment groups in experiments), sources of ratings (e.g., self-ratings versus supervisor ratings), and time (multiple time points) (Vandenberg & Lance, 2000; Davidov et al., 2018). However, the same scale may not function equally well in different groups. Hence, the prerequisite for these comparisons is establishing measurement equivalence/invariance (ME/I), which refers to the equivalence of the psychometric properties of the same scale across groups (Vandenberg & Lance, 2000). Researchers cannot draw unequivocal conclusions about group differences without establishing ME/I, since observed differences may be attributable to differences in the target parameters or to differences in the psychometric properties of the scales across groups (Steenkamp & Baumgartner, 1998). For example, differences in latent means of job satisfaction across cultures may be due to differences in true means of job satisfaction or to differences in response sets (Cheung & Rensvold, 2000). Even with sound

precautions in choosing or developing measurement scales, it is possible that the focal construct has different conceptual meanings across groups (De Beuckelaer et al., 2007) or that some of the items have different importance across groups (Cheung & Rensvold, 2002). Globalization has been accompanied by a proliferation of research examining multicultural experiences in the last two decades (Davidov et al., 2018; Aguinis et al., 2020; Maddux et al., 2021). Yet there are still disciplines that have not paid enough attention to ME/I issues in cross-cultural comparisons. For example, Stevanovic et al. (2017) reviewed 26 scales used in cross-cultural pediatric psychopathology and concluded that only a few have demonstrated adequate ME/I. Since full ME/I across cultures is rare, researchers often face challenges in identifying non-invariant items in ME/I tests (Cheung & Lau, 2012). There is also increasing scholarly interest in research with temporal designs, given that one of the goals of organizational research is to study change in individual perceptions and behaviors over time. In longitudinal studies, where the same instruments are administered to the same participants over two or more time points, the equivalence of measurement properties is a prerequisite for any meaningful inferences about patterns of change (Chan, 1998; Little et al., 2007).


An important consideration in research employing temporal designs (e.g., repeated measures) is whether the scores obtained from the same measurement instruments are psychometrically equivalent across time (e.g., testing at two time points as the person moves from being a newcomer to a fully functioning employee) (Vandenberg & Morelli, 2016). If the psychometric functioning of items is changing over time, it may not be possible to differentiate measurement changes from intervention effects (Vandenberg & Lance, 2000). Examining ME/I across measurements over time is critical for eliminating changes in measurement properties as a potential explanation for observed mean-level changes across time (Little, 2013). The non-independent nature of longitudinal data poses further challenges for analyzing ME/I tests. Establishing ME/I is also important in generalizing measurement scales developed in one context for use in a different context. For example, Meade et al. (2007) found a lack of ME/I in some personality scales between online and paper-and-pencil tests. Arthur et al. (2014) found ME/I in personality measures between respondents using mobile devices and non-mobile devices. Similarly, Brown and Grossenbacher (2017) found general mental ability test scores to be equivalent across non-mobile and mobile device users. On the other hand, King et al. (2015) found that scores on the same cognitive ability test administered via mobile internet testing and personal computers were non-equivalent. These mixed findings suggest that research on the psychometric properties of scores from unproctored internet-based testing devices is an area where practice has outpaced research. This chapter aims to advance research on tackling a common methodological challenge in social sciences research by focusing on the measurement properties of the construct examined across measurement conditions, such as group membership, time, and test format.
We provide a clear, easy-to-follow tutorial on conducting ME/I tests before comparing data across groups, survey formats, and time. There are two common approaches to examining ME/I: item response theory (IRT), which is grounded in modern test theory, and structural equation modeling (SEM), which is grounded in classical test theory (Widaman & Grimm, 2014; Tay et al., 2015). The IRT approach models item scores using boundary response functions characterized by a discrimination parameter and several location parameters. The IRT approach refers to the lack of factorial invariance as differential item functioning (DIF). There are both similarities and differences between the IRT and SEM approaches. For example, both assume local independence. While the IRT approach assumes a non-linear relationship


between the latent characteristic and the item, the SEM approach assumes a linear relationship (Raju et al., 2002; Meade & Lautenschlager, 2004). This chapter specifically focuses on the SEM approach because many constructs studied in social sciences research are developed based on the classical test theory and assume linear relationships between latent constructs and items. Readers interested in DIF are referred to Tay, Meade, and Cao (2015), and Lang and Tay (2021). In the following section, we provide a brief introduction to the ME/I concept. We then use an example to illustrate how to use R software to examine ME/I.
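The boundary response functions mentioned above can be sketched concretely. The following Python illustration (hypothetical parameters; the chapter's own demonstration uses R) computes category probabilities for a 5-point Likert-type item under Samejima's graded response model, where each of the four location parameters defines a 2PL-style boundary curve.

```python
import numpy as np

def grm_probs(theta, a, thresholds):
    """Category probabilities under Samejima's graded response model.

    thresholds: increasing location parameters b_1 < ... < b_{K-1} for
    K ordered categories. Each P(X >= k) is a 2PL boundary curve;
    category probabilities are differences of adjacent boundary curves.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - thresholds)))
    cum = np.concatenate(([1.0], p_star, [0.0]))  # pad with P>=1 and P>=K+1
    return cum[:-1] - cum[1:]

# Hypothetical 5-point item: discrimination 1.2, symmetric thresholds.
probs = grm_probs(theta=0.0, a=1.2, thresholds=[-1.5, -0.5, 0.5, 1.5])
print(np.round(probs, 3))  # five probabilities summing to 1
```

With θ centered between symmetric thresholds, the middle category is most likely; shifting θ shifts mass toward the extreme categories, which is the non-linear item–trait relationship the text contrasts with SEM's linear one.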

A BRIEF INTRODUCTION TO ME/I

ME/I examines the extent to which a given measurement demonstrates a similar factor structure (both the number of factors and the pattern of factor loadings) of the latent constructs and similar calibration of the observed items with respect to the latent constructs across groups (Vandenberg & Lance, 2000). For a measure to be equivalent across groups, individuals with identical levels of endorsement of the measured latent constructs should have the same observed scores (Drasgow & Kanfer, 1985). Several articles have provided detailed discussions of ME/I tests using multigroup confirmatory factor analysis (MGCFA) (Cheung & Rensvold, 1999; Vandenberg & Lance, 2000; Cheung & Rensvold, 2002; Schmitt & Kuljanin, 2008). CFA is rooted in the classical test theory (Bollen, 1989), which proposes that the observed test score (X) is a linear combination of the intercept (c), the latent construct true score (T), and error (E) (Lord & Novick, 1968). The relationship among these elements can be expressed in a regression equation, X = c + λT + E, where c represents the baseline intercept of item X, λ represents the factor loading of the item on the latent construct, and E is the item uniqueness that cannot be explained by the latent construct T. The classical test theory also assumes that T is not correlated with E, such that the variance of the observed score is the sum of the (loading-weighted) true score variance and the error variance (Lord & Novick, 1968), which can be presented mathematically as Var(X) = λ²Var(T) + Var(E); when λ = 1, as in the classical model X = T + E, this reduces to Var(X) = Var(T) + Var(E). The most frequently adopted approach to examining ME/I within the CFA framework generates and compares a series of MGCFA models. To perform a specific ME/I test (e.g., metric invariance), a constrained model is created by adding equality constraints to a set of parameters (e.g., factor loadings) across



groups. The comparison of model fit between the unconstrained model (the model without equality constraints) and the constrained model is used to test whether ME/I is established. Traditionally, a non-significant chi-square difference between the two nested models suggests that the null hypothesis of invariance cannot be rejected and indicates the establishment of ME/I (Cheung & Lau, 2012). In contrast, a significant chi-square difference suggests at least one constrained parameter is unequal across groups. An issue related to the chi-square difference test of ME/I is that it is sensitive to sample size (Brannick, 1995). Accordingly, researchers have suggested using changes in other model fit indices to examine ME/I. For example, Cheung and Rensvold (2002) suggest that a reduction in the comparative fit index (CFI) of no more than 0.01 (i.e., ΔCFI ≥ −0.01) indicates support for the ME/I test. However, since neither the chi-square difference test nor the change-in-CFI approach provides an estimate of the magnitude of the difference in parameters, Cheung and Lau (2012) further recommend a direct test of ME/I by bootstrapping confidence intervals of the differences in parameters (e.g., the difference between an item's factor loadings across groups).
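The direct-comparison idea of Cheung and Lau (2012) can be illustrated in miniature with a percentile bootstrap. This Python sketch estimates a single loading-like slope per group on simulated data and bootstraps the cross-group difference; everything here is hypothetical and simplified (the factor is treated as observed purely for illustration, whereas a real application resamples raw data and re-estimates the full MGCFA model in each bootstrap sample).

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_group(n, loading):
    # Hypothetical data: one item generated from a factor score that is
    # treated as observed only for this illustration.
    factor = rng.normal(size=n)
    item = loading * factor + rng.normal(scale=0.6, size=n)
    return factor, item

def slope(x, y):
    # Loading-like regression slope of the item on the factor.
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

f1, x1 = simulate_group(500, loading=0.8)  # e.g., group 1
f2, x2 = simulate_group(500, loading=0.6)  # e.g., group 2

diffs = []
for _ in range(2000):
    i = rng.integers(0, 500, size=500)  # resample within each group
    j = rng.integers(0, 500, size=500)
    diffs.append(slope(f1[i], x1[i]) - slope(f2[j], x2[j]))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo > 0 or hi < 0)  # a CI excluding zero flags non-invariance
```

Because the two groups were simulated with genuinely different loadings, the bootstrap interval for the difference excludes zero; unlike the omnibus chi-square test, the interval also conveys the magnitude of the difference.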

EIGHT TESTS OF ME/I

Figure 13.1 shows the eight tests of ME/I. Configural (Test 1), metric (Test 2), scalar (Test 3), and item residual variance (Test 6) invariance are tests at the measurement level, and the focus is on the psychometric properties of the measurement instrument (Little, 1997). Latent variance (Test 4), latent covariance (Test 5), path coefficient (Test 7), and latent mean (Test 8) invariance are at the latent construct level and may reflect substantive hypotheses of the research questions (Little, 1997). For example, Test 7 (latent path coefficients) tests for moderating effects when the moderator is a categorical variable, and Test 8 examines cross-group differences in latent means. It is necessary to establish the required ME/I at the measurement level first for invariance tests at the construct level to be interpretable (Little, 1997; Vandenberg & Lance, 2000). Configural invariance and metric invariance are required for all subsequent invariance tests. When configural and metric invariance are achieved, one can proceed to the invariance tests for latent variance, latent covariance, item uniqueness, and path coefficients (Tests 4 to 7) without establishing scalar invariance because these tests are based on the variance–covariance matrices only,

without involving the mean structure. In other words, item intercepts are not involved in the estimation and no scalar invariance is required. However, if one wants to compare the latent means across groups, one needs to establish scalar invariance as evidence that the groups use the same intercepts for the latent constructs (Cheung & Rensvold, 2000). Some scholars (e.g., Vandenberg & Lance, 2000) suggested a Test 0 of covariance matrix invariance across groups to serve as an omnibus test of ME/I. Failure to reject the null hypothesis indicates no further ME/I tests are necessary. However, other scholars have argued the test of covariance matrix invariance is unnecessary (e.g., Cheung & Rensvold, 2002) because the establishment of such ME/I suggests all elements in the covariance matrices are invariant across groups, which rarely happens in empirical studies. Further, this test is inadequate because it ignores the comparison of the item intercepts and latent means. We follow Cheung and Rensvold (1999, 2002) and suggest the sequence of ME/I tests in Figure 13.1. While configural invariance and metric invariance are required in all cross-group comparisons, different research questions require different ME/I tests. In the following sections, we provide a step-bystep demonstration of the ME/I tests between two groups. We use the same dataset used by Cheung and Lau (2012) for this demonstration but with the lavaan (Rosseel, 2012) and semTools (Jorgensen et al., 2021) packages of the R program. We demonstrate the ME/I tests with R language because it is the most assessable to researchers. The flowchart in Figure 13.1 is used to guide the sequence of the ME/I tests. We do not demonstrate Test 7 (latent path coefficient invariance) because there is no hypothesized causal relationship among the constructs in this example. 
The R code used in this example and all output files are available for download. The data used for the ME/I test were from the Work Orientations data set published by the International Social Survey Program in 1989. The measurement model examined is a 3-factor model that measures the quality of job context (JC), quality of job content (JQ), and quality of work environment (WE). Using a 5-point Likert-type scale, each construct was measured with four items (JC: v59, v60, v61, v67; JQ: v63, v64, v65, v71; WE: v69, v72, v73, v74). We set the three constructs to be correlated, as these constructs reflect different aspects of perceived job quality. Full information maximum likelihood estimation is used to handle missing values, and the Maximum Likelihood Estimator with Robust Standard Errors (MLR) is used because it is more robust to non-normality.
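As a sketch, the three-factor model just described can be specified in lavaan syntax as follows; the data frame name `issp89` and the exact call are our assumptions, not the chapter's code:

```r
library(lavaan)

# Three correlated job-quality factors, four ISSP items each
jq.model <- '
  JC =~ v59 + v60 + v61 + v67   # quality of job context
  JQ =~ v63 + v64 + v65 + v71   # quality of job content
  WE =~ v69 + v72 + v73 + v74   # quality of work environment
'
# FIML handles missing values; MLR gives non-normality-robust SEs
fit <- cfa(jq.model, data = issp89, missing = "fiml", estimator = "MLR")
summary(fit, fit.measures = TRUE)
```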



Figure 13.1  Sequence of measurement invariance tests

[Flowchart: Level 1 — Test 1 (configural invariance). Level 2 — Test 2 (metric invariance); if full metric invariance fails, Test 2a (metric invariance test for each construct in the model) and Test 2b (factor-ratio test over all referent and argument combinations in each construct) identify the sets of invariant items, after which non-invariant items are either interpreted as data or deleted to form a partial metric invariance model. Level 3 — Test 3 (scalar invariance), with Tests 3a and 3b paralleling Tests 2a and 2b to establish partial scalar invariance; with metric invariance established, Test 4 (latent variance invariance), Test 5 (latent covariance invariance), Test 6 (item residual variance invariance), and Test 7 (path coefficient invariance) can also be conducted. Level 4 — Test 8 (latent mean comparisons).]

We conducted the ME/I test between the respondents from Great Britain (GB, N = 717) and the United States (US, N = 884).

Test 1: Configural invariance examines whether the factor pattern is the same across groups and reflects whether respondents from different groups have the same conceptualization of the relationships between items and the latent constructs. In our example, configural invariance examines whether the GB and the US respondents conceptualize the 12-item scale to reflect the same three dimensions of job quality (work environment, job content, and job context), each measured by the same four items. Configural invariance is the fundamental ME/I test for all other ME/I tests. Only with the establishment of configural invariance can researchers proceed to the rest of the ME/I tests. Statistically speaking, it examines whether survey items have the same configuration of significant and non-significant factor loadings, which applies to single-construct models as well. However, configural invariance does not require that the factor loadings of like items be identical across groups. There are three frequently used approaches to provide the scale for latent factors. The first approach is to fix the factor loading of the first item to one, which we recommend and use in the numerical example in this chapter. The second approach is to standardize the latent factor by fixing the variance of the latent factor to one. This approach is not recommended for cross-group comparisons because it assumes the variances of the latent variables are equal across groups (Cheung & Rensvold, 1999). The third approach is effect coding (Little et al., 2006), which fixes the average of the factor loadings of all items for a latent factor to one. We do not recommend this approach because it requires all items to be measured on the same scale (Cheung et al., 2021). By default, lavaan fixes the factor loading of each construct's first item (the referent or marker item) to 1 to provide a scale for the construct. There is no equality constraint across the two groups, except that the pattern of the factor structure is the same.
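A minimal sketch of fitting the configural model (Test 1), assuming the model syntax is stored in `jq.model`, the data frame is `issp89`, and the grouping variable is `country` (all placeholder names):

```r
# Test 1: same factor pattern in both groups, no cross-group equality
# constraints; lavaan fixes each factor's first loading to 1 by default.
Model.config <- sem(jq.model, data = issp89, group = "country",
                    missing = "fiml", estimator = "MLR")
fitMeasures(Model.config, c("chisq", "df", "cfi", "rmsea", "srmr"))
```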
The configural invariance test (Test 1) is conducted by examining the overall model fit of the CFA model (Model.config), which was acceptable in our example (χ2 = 389.385, df = 102, p < .001, CFI = .926, RMSEA = .059, SRMR = .051). Further, all factor loadings were statistically significant, and none of the confidence intervals of the correlations among the three constructs included an absolute value of 1. This baseline model provides evidence for the discriminant and convergent validity of the three-factor model in both the GB and the US groups and can be used to examine the ME/I test of metric invariance.

Test 2: Metric invariance constrains the factor loadings of like items in the configural invariance model to be equivalent across groups (Horn & McArdle, 1992). This test provides a stronger test of factorial invariance by further constraining the unit of change in latent scores with regard to the scale intervals at the item level to be identical across groups (Steenkamp & Baumgartner, 1998).

With metric invariance, researchers can proceed to test the equality of like items' intercepts (Test 3), latent path coefficients (Test 7), latent variances (Test 4), latent covariances (Test 5), or latent means (Cheung & Lau, 2012). When full metric invariance is rejected, researchers need to identify the item(s) with non-invariant factor loadings. The metric invariance test (Test 2) is conducted by creating a constrained model in which the factor loadings are constrained to be invariant across groups. This is done by adding the group.equal = c("loadings") argument to the configural invariance model (Model.config) in the sem() function of lavaan. Metric invariance is concluded if the CFI of the constrained model (Model.metric) is lower than the CFI of the unconstrained model (Model.config) by no more than 0.01 (Cheung & Rensvold, 2002), which implies that imposing the equality constraints on the factor loadings does not result in a substantially worse model fit. Another approach is to compare the loglikelihood values (with a chi-square distribution) of the two nested models (Model.config vs. Model.metric). If the difference is not statistically significant, metric invariance is supported. This is a statistical test for differences in fit between two nested models, but the result is affected by the sample size. In other words, when the sample size is large, a small difference in factor loadings (or other parameters under comparison) will be statistically significant, leading to a rejection of invariance (Cheung & Rensvold, 2002). Both the differences in CFI and in loglikelihood values between nested models are provided by the compareFit() function of the semTools package. The small ΔCFI and non-significant chi-square difference in our example (ΔCFI = −.001, Δχ2 = 11.420, Δdf = 9, p = .248) provide support for full metric invariance.
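The metric invariance test just described can be sketched as follows, assuming the model syntax is stored in `jq.model`, the data frame is `issp89`, and the grouping variable is `country` (placeholder names):

```r
library(semTools)

# Test 2: constrain like items' factor loadings to be equal across groups
Model.metric <- sem(jq.model, data = issp89, group = "country",
                    group.equal = c("loadings"),
                    missing = "fiml", estimator = "MLR")

# Chi-square (loglikelihood) difference test and change in CFI
summary(compareFit(Model.config, Model.metric))
```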
As a result, we can proceed to any of the following ME/I tests: Test 3 (scalar invariance), Test 4 (latent variance invariance), Test 5 (latent covariance invariance), Test 6 (item residual variance invariance), and Test 7 (path coefficient invariance).

Test 3: Scalar invariance can be examined with the establishment of full or partial metric invariance (Meredith, 1993). Scalar invariance examines whether like items' intercepts of the regression equations on the latent constructs are invariant across groups. Although metric invariance suggests the equality of the metric or scale intervals used across groups, it does not consider whether the item intercepts (i.e., the observed item levels when the latent construct is zero) are the same across groups (Cheung & Rensvold, 2000). By constraining like items' intercepts to be identical across groups, scalar invariance removes potential systematic upward or downward bias owing to the different initial status of like items (Steenkamp & Baumgartner, 1998).


When testing for scalar invariance, in addition to fixing the factor loading of the first item to one to provide a scale for the latent factor, the intercept of the first item is fixed to zero to define the latent mean (Bollen, 1989). Test 3 examines scalar invariance, which further constrains the intercepts of like items in Model.metric to be equal across the two groups by using the group.equal = c("loadings", "intercepts") argument. Scalar invariance is examined by comparing the overall model fit of the two nested models (Model.metric vs. Model.scalar). Full scalar invariance was not supported, as the loglikelihood difference was significant and the reduction in CFI of the constrained model was larger than 0.01 (Δχ2 = 78.680, Δdf = 9, p < .001, ΔCFI = −.016). Following Figure 13.1, we use the factor-ratio test and the BC bootstrap confidence intervals method to identify the non-invariant items. Another challenge in ME/I testing is identifying the non-invariant item when ME/I fails (Vandenberg & Lance, 2000; Cheung & Lau, 2012). Under this condition, the item with non-invariant parameters is identified, and a partial invariance model is created by releasing the equality constraints on the non-invariant parameters. Further ME/I tests (e.g., comparison of latent means) can then be examined with the establishment of partial invariance (e.g., partial scalar invariance). Although many researchers rely on modification indices to identify non-invariant items by sequential examination of the largest modification index (Yoon & Millsap, 2007), Cheung and Rensvold (1999) demonstrate that this approach may fail if there is more than one non-invariant item. When one fixes the factor loading of a referent item to one and its intercept to zero to provide identification, this specification implies that the referent item is invariant across groups.
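The scalar invariance test can be sketched as follows, again assuming `jq.model`, `issp89`, and `country` hold the model syntax, data frame, and grouping variable (placeholder names):

```r
# Test 3: additionally constrain like items' intercepts across groups
Model.scalar <- sem(jq.model, data = issp89, group = "country",
                    group.equal = c("loadings", "intercepts"),
                    missing = "fiml", estimator = "MLR")

# Compare against the metric invariance model
summary(compareFit(Model.metric, Model.scalar))
```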
Although the choice of referent item may not impact the results of the configural invariance test or the omnibus tests of metric and scalar invariance, it may affect the identification of items that have unequal factor loadings and/or intercepts across groups. When a non-invariant item is used as the referent item, other invariant items may be identified as non-invariant as a result of scaling. Cheung and colleagues (Cheung & Rensvold, 1999; Rensvold & Cheung, 2001) suggest using the factor-ratio test with alternative items as the reference indicator to identify the non-invariant items. However, this approach can be labor-intensive, as a factor measured by p items requires p(p−1)/2 tests. Hence, Cheung and Lau (2012) propose the direct comparison approach, which tests the significance of differences in parameters using bias-corrected (BC) bootstrap confidence intervals of the differences to identify the non-invariant items.


We conducted Test 3b following Cheung and Lau's (2012) approach to identify items with non-invariant intercepts. For each factor, a total of six confidence intervals were constructed based on the 2000 bootstrapping results. Cheung and Lau (2012, p. 179) discuss the rationale and detailed procedure for identifying non-invariant item intercepts. Based on the results, the intercepts of v73 (WE), v65 (JQ), v59 (JC), and v67 (JC) were identified as unequal between the two groups. These non-invariant item intercepts were set to be freely estimated in the partial invariance model. The model fit of the partial scalar invariance model (Model.pscalar) was acceptable (χ2 = 424.738, df = 117, p < .001, CFI = .927, RMSEA = .061, SRMR = .054). As WE and JQ both had more than half of their items invariant (Cheung & Lau, 2012), the partial invariance model can be used to compare the latent means for WE and JQ, but not JC, in Test 8.

Test 4: Latent variance invariance examines whether the variances of like constructs are the same across groups. This is a prerequisite for cross-group comparisons of latent construct correlations. The latent variance invariance test is conducted by adding equality constraints on the latent factor variances between the two groups to the metric invariance model (Model.metric in Test 2). Factor variance invariance is usually tested for each construct instead of with an omnibus test for the equality of all factor variances. Hence, we conducted three tests to examine whether the variance of each latent factor was equal between the two groups. In the constrained model, all latent variances are fixed to be equal across groups using the group.equal = c("loadings", "lv.variances") argument in the sem() function of lavaan, and the latent factors that are allowed to have different variances are specified by the group.partial argument. For example, the constrained model Model.LVAR1 was used to compare the variance of WE across groups.
Hence, the variances of JC and JQ were allowed to be freely estimated by including these two variances in the group.partial argument. The invariance of the latent variance was examined by comparing the model fit of the two nested models (Model.metric vs. Model.LVAR1). The small change in CFI and the non-significant chi-square difference test (Δχ2 = .216, Δdf = 1, p = .642; ΔCFI = .001) indicate that the variance of WE was invariant between the two groups. The same procedure was used to test the factor variance invariance of JC (Model.LVAR2) and JQ (Model.LVAR3) between the two groups. The results indicate that the factor variances of JC (Δχ2 = 1.562, Δdf = 1, p = .211, ΔCFI = .001) and JQ (Δχ2 = 2.569, Δdf = 1, p = .103, ΔCFI = −.001) were not statistically significantly different between the two groups.
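The first of these tests can be sketched as follows, assuming `jq.model`, `issp89`, and `country` are the model syntax, data frame, and grouping variable (placeholder names):

```r
# Test 4 (for WE): constrain all latent variances across groups, then
# free JC's and JQ's variances so that only WE's variance is under test
Model.LVAR1 <- sem(jq.model, data = issp89, group = "country",
                   group.equal = c("loadings", "lv.variances"),
                   group.partial = c("JC ~~ JC", "JQ ~~ JQ"),
                   missing = "fiml", estimator = "MLR")
summary(compareFit(Model.metric, Model.LVAR1))
```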



Test 5: Latent covariance invariance examines whether the covariances between like constructs are the same across groups. This test and the test of construct variance invariance (Test 4) can be combined to serve as an omnibus test of the invariance of the variance/covariance matrices of the latent constructs across groups (Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). Test 5 examines latent covariance invariance by adding equality constraints that fix the covariances between like latent factors in the metric invariance model to be equal between the two groups. Similar to Test 4, we conducted three tests to directly examine the equality of each covariance between the two groups. The first constrained model (Model.LCOV1) fixed the covariance between WE and JC to be equal across groups. The covariance between WE and JC was invariant, as the difference in CFI was small and the difference in loglikelihood values between the two nested models (Model.metric and Model.LCOV1) was non-significant (Δχ2 = .208, Δdf = 1, p = .648, ΔCFI = .001). The same procedure was used to test the factor covariance invariance between WE and JQ (Model.LCOV2) and between JC and JQ (Model.LCOV3). The factor covariance between WE and JQ was not significantly different between the two groups (Δχ2 = 1.936, Δdf = 1, p = .164, ΔCFI = .001); for the covariance between JC and JQ, the chi-square difference was significant (Δχ2 = 5.441, Δdf = 1, p = .020), but the change in CFI (ΔCFI = −.001) was well within the .01 criterion. Based on these results, full latent factor covariance invariance was established.

Test 6: Item residual variance invariance (also known as uniqueness invariance) can be examined with the establishment of metric invariance (Steenkamp & Baumgartner, 1998). Uniqueness refers to the measurement error in the observed item score. Therefore, Test 6 examines whether like items are associated with the same degree of measurement error while measuring the same latent constructs across groups.
Unless the equality of reliability across groups is of particular research interest (e.g., Cheung, 1999), this test is unnecessary, as SEM automatically partials out measurement errors in its parameter estimation. We directly examined the equality of each item's uniqueness by adding the equality constraint on item residuals in the group.equal argument and relaxing the equality constraints on the items that were not under comparison in the group.partial argument. For example, the constrained model (Model.Resv69) fixed the uniqueness of item v69 to be equal between the two groups by adding the group.equal = c("loadings", "residuals") argument and including all the item residuals other than that of v69 in the group.partial argument. The uniqueness of item v69 was invariant, as the CFI difference was small and the loglikelihood values

difference between the two nested models (Model.metric vs. Model.Resv69) was non-significant (Δχ2 = 2.337, Δdf = 1, p = .126; ΔCFI = .000). We continued the same procedures for the remaining 11 items and found that the uniqueness of item v73 was unequal between the two groups. All outputs are included in the supplementary files.

Test 7: Latent path coefficient invariance examines whether path coefficients between latent constructs are invariant across groups. As this test examines whether the structural relationships between the latent constructs are invariant across groups, it is a test of moderation when the moderator is a categorical variable.

Test 8: Latent (construct) mean invariance tests whether the latent constructs' means are the same across groups. Although some researchers suggest the establishment of full metric and scalar invariance as a prerequisite for Test 8 (e.g., Bollen, 1989), others suggest partial metric and scalar invariance as the prerequisite (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). However, there is no clear guideline on the minimum number of invariant items required (Cheung & Lau, 2012). For example, Steenkamp and Baumgartner (1998) suggest a minimum of two invariant items (including the referent item). In contrast, Cheung and Lau (2012) suggest that more than half of the items be invariant as the prerequisite for Test 8. Based on the partial scalar invariance model established in Test 3 (Model.pscalar), Test 8 further constrains the latent means to be equal between the two groups. We conducted three tests to directly examine the equality of each latent mean between the two groups. Test 8 was examined by the chi-square difference test between the two nested models (Model.pscalar and Model.LM1). The non-significant chi-square difference test (Δχ2 = 1.846, Δdf = 1, p = .160, ΔCFI = .000) suggests the latent mean was not significantly different between the two groups.
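A sketch of the partial scalar model and a single latent mean comparison (Test 8), assuming `jq.model`, `issp89`, and `country` are the model syntax, data frame, and grouping variable (placeholder names); constraining WE's mean first is our reading of the parallel with Test 4:

```r
# Partial scalar model: free the four non-invariant intercepts (Test 3b)
Model.pscalar <- sem(jq.model, data = issp89, group = "country",
                     group.equal = c("loadings", "intercepts"),
                     group.partial = c("v73 ~ 1", "v65 ~ 1",
                                       "v59 ~ 1", "v67 ~ 1"),
                     missing = "fiml", estimator = "MLR")

# Test 8 (for WE): additionally constrain the latent means ("means"),
# freeing JC's and JQ's means so only WE's mean is under test
Model.LM1 <- sem(jq.model, data = issp89, group = "country",
                 group.equal = c("loadings", "intercepts", "means"),
                 group.partial = c("v73 ~ 1", "v65 ~ 1",
                                   "v59 ~ 1", "v67 ~ 1",
                                   "JC ~ 1", "JQ ~ 1"),
                 missing = "fiml", estimator = "MLR")
summary(compareFit(Model.pscalar, Model.LM1))
```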
We continued the same procedures to test the equality of the other two latent means. The latent mean of JC (Model.LM2) was unequal between the two groups (Δχ2 = 43.19, Δdf = 1, p < .001). Because the chi-square difference test is sensitive to large samples (e.g., N > 200; Meade et al., 2008), changes in alternative fit indices are useful in evaluating ME/I.

Another issue is unbalanced sample sizes across groups, since extreme imbalance in sample sizes across groups (e.g., Arthur et al., 2014) may mask non-invariance (Yoon & Lai, 2018). With unequal sample sizes, the tested models are closer to the factor model of the larger group, which reduces the chi-square value and may mask non-invariance in the smaller group. Yoon and Lai (2018) proposed a subsampling approach in which multiple random subsamples (e.g., 100 random samples with the same sample size as the smaller group) are drawn from the larger group to test for ME/I. Rather than reporting results based on tests with unbalanced sample sizes, they suggest reporting the results based on the subsampling approach. Nevertheless, sample size is a complicated issue in ME/I, as it concerns the power and precision of the ME/I tests. Future studies are needed to provide useful guidelines concerning sample sizes and unbalanced samples.

The number of groups in ME/I testing is an issue that has received increasing research attention (Rutkowski & Svetina, 2014; Marsh et al., 2018), particularly when the research objective is to compare latent means across multiple groups. For example, many large-scale international surveys, such as the World Health Survey and the European Social Survey, compare survey results across many countries. Similarly, many longitudinal studies, such as the General Social Survey in many countries and the New Zealand Attitudes and Values Study, compare results across multiple years. These cross-national and longitudinal studies need to demonstrate metric and scalar invariance



to interpret the cross-national or across-time differences unequivocally. However, due to the larger number of groups, the chance of establishing full measurement invariance is smaller. Hence, several approaches, such as the alignment method and modification indices method, have been proposed to examine ME/I with a large number of groups. However, there are three limitations of these approaches. First, the alignment method proposed by Asparouhov and Muthén (2014) is exploratory and not appropriate for hypothesis testing. Marsh et al. (2018) tried to improve the alignment method with the alignment within CFA approach. Nevertheless, both approaches have failed to recognize the relationship between scalar invariance and latent mean comparisons across groups. Since latent means are not aggregates of the item means in SEM, a scalar invariance test is conducted to examine if all items of a latent factor give the same mean difference across items after adjusting the scale by factor loadings (Cheung & Lau, 2012). If the mean difference is the same across items, then any item (usually the referent item) can be used to define the latent mean difference across groups and be subject to test for statistical significance. Any approach that allows for an item’s intercepts to differ across groups in effect has not included that item for the latent mean comparison. Second, these approaches still consider ME/I as a requirement for cross-group comparisons and try to identify the largest number of comparable groups. However, scalar invariance can seldom be achieved across a large number of groups, and therefore we suggest that it may be more meaningful to develop theoretical justification to explain or predict the non-invariance to enhance our understanding of cross-cultural differences (Cheung & Rensvold, 1999). 
Third, approaches to improve the omnibus test of ME/I across multiple groups (e.g., Rutkowski & Svetina, 2014) may be counterproductive, because invariance among a majority of groups may disguise the non-invariance of a few groups. Future research should explore more efficient ways to identify non-invariant groups and present the results of cross-group comparisons. One possibility is to conduct pairwise comparisons using the direct comparison approach (Cheung & Lau, 2012) and present the results similarly to pairwise comparisons in ANOVA.

CONCLUSION

The purpose of this chapter is to provide a brief overview and illustration of how to conduct ME/I analysis with the MGCFA approach using the R program. The material is intended to serve both as an introductory discussion for those who are new to ME/I analysis and as a refresher for those who are familiar with the concepts discussed in the chapter. It should be noted that several important topics are not discussed in this chapter. These omitted areas include the choice of referent indicators (Rensvold & Cheung, 2001), the use of item parcels in ME/I tests (Meade & Kroustalis, 2006), the power of ME/I tests (Meade & Bauer, 2007), ME/I across ordinal and continuous variables (e.g., performance rating, age) (Bauer, 2017; Molenaar, 2021), and ME/I tests with ordinal scales (Merkle et al., 2014; Liu et al., 2017). We hope our tutorial inspires researchers to tackle fruitful novel research questions that can be evaluated with ME/I testing.

REFERENCES

Aguinis, H., Ramani, R. S., & Cascio, W. F. (2020). Methodological practices in international business research: An after-action review of challenges and solutions. Journal of International Business Studies, 51, 1593–1608.

Arthur, W., Doverspike, D., Muñoz, G. J., Taylor, J. E., & Carr, A. E. (2014). The use of mobile devices in high-stakes remotely delivered assessments and testing. International Journal of Selection and Assessment, 22, 113–23.

Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling, 21, 495–508.

Bagozzi, R. P. (1983). Issues in the application of covariance structure analysis: A further comment. Journal of Consumer Research, 9, 449–50.

Bauer, D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22, 507–26.

Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons.

Brannick, M. T. (1995). Critical comments on applying covariance structure modeling. Journal of Organizational Behavior, 16, 201–13. https://doi.org/10.1002/job.4030160303

Brown, M. I., & Grossenbacher, M. A. (2017). Can you test me now? Equivalence of GMA tests on mobile and non-mobile devices. International Journal of Selection and Assessment, 25, 61–71.

Chan, D. (1998). The conceptualization and analysis of change over time: An integrative approach incorporating longitudinal mean and covariance structures analysis (LMACS) and multiple indicator latent growth modeling (MLGM). Organizational Research Methods, 1, 421–83. https://doi.org/10.1177/109442819814004

Chen, F., Curran, P. J., Bollen, K. A., Kirby, J., & Paxton, P. (2008). An empirical evaluation of the use of fixed cutoff points in RMSEA test statistic in structural equation models. Sociological Methods & Research, 36, 462–94. https://doi.org/10.1177/0049124108314720

Cheung, G. W. (1999). Multifaceted conceptions of self-other ratings disagreement. Personnel Psychology, 52, 1–36.

Cheung, G. W. (2008). Testing equivalence in the structure, means, and variances of higher-order constructs with structural equation modeling. Organizational Research Methods, 11, 593–613.

Cheung, G. W., Cooper-Thomas, H. D., Lau, R. S., & Wang, L. C. (2021). Testing moderation in business and psychological studies with latent moderated structural equations. Journal of Business and Psychology, 36, 1009–1033.

Cheung, G. W., & Lau, R. S. (2012). A direct comparison approach for testing measurement invariance. Organizational Research Methods, 15, 167–98.

Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187–212.

Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9, 233–55. https://doi.org/10.1207/S15328007SEM0902_5

Davidov, E., Schmidt, P., Billiet, J., & Meuleman, B. (Eds.). (2018). Cross-cultural analysis: Methods and applications (2nd ed.). Routledge.

De Beuckelaer, A., Lievens, F., & Swinnen, G. (2007). Measurement equivalence in the conduct of a global organizational survey across countries in six cultural regions. Journal of Occupational and Organizational Psychology, 80, 575–600.

Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662–80.

Gerbing, D. W., & Anderson, J. C. (1984). On the meaning of within-factor correlated measurement errors. Journal of Consumer Research, 11, 572–80.

Golembiewski, R. T., Billingsley, K., & Yeager, S. (1976). Measuring change and persistence in human affairs: Types of change generated by OD designs. Journal of Applied Behavioral Science, 12, 133–57.

Gunn, H. J., Grimm, K. J., & Edwards, M. C. (2020). Evaluation of six effect size measures of measurement non-invariance for continuous outcomes. Structural Equation Modeling: A Multidisciplinary Journal, 27, 503–14.

Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18, 117–44.

Jöreskog, K. G. (1974). Analyzing psychological data by structural analysis of covariance matrices. In R. C. Atkinson, D. H. Krantz, R. D. Luce, & P. Suppes (Eds.), Contemporary developments in mathematical psychology (Vol. II, pp. 1–56). Freeman.

Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2021). semTools: Useful tools for structural equation modeling. R package version 0.5-5.

King, D. D., Ryan, A. M., Kantrowitz, T., Grelle, D., & Dainis, A. (2015). Mobile internet testing: An analysis of equivalence, individual differences, and reactions. International Journal of Selection and Assessment, 23, 382–94.

Lang, J. W. B., & Tay, L. (2021). The science and practice of item response theory in organizations. Annual Review of Organizational Psychology and Organizational Behavior, 8, 311–38.

Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76.

Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.

Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13(1), 59–72.

Little, T. D., Preacher, K. J., Selig, J. P., & Card, N. A. (2007). New developments in latent variable panel analyses of longitudinal data. International Journal of Behavioral Development, 31, 357–65.

Liu, Y., Millsap, R. E., West, S. G., Tein, J.-Y., Tanaka, R., & Grimm, K. J. (2017). Testing measurement invariance in longitudinal data with ordered-categorical measures. Psychological Methods, 22, 486–506.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.

Maddux, W. W., Lu, J. G., Affinito, S. J., & Galinsky, A. D. (2021). Multicultural experiences: A systematic review and new theoretical framework. Academy of Management Annals, 15, 345–76.

Marsh, H. W., Guo, J., Parker, P. D., Nagengast, B., Asparouhov, T., Muthén, B., & Dicke, T. (2018). What to do when scalar invariance fails: The extended alignment method for multi-group factor analysis comparison of latent means across many groups. Psychological Methods, 23, 524–45.

Meade, A. W., & Bauer, D. J. (2007). Power and precision in confirmatory factor analytic tests of measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 14, 611–35.

Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–92. https://doi.org/10.1037/0021-9010.93.3.568

Meade, A. W., & Kroustalis, C. M. (2006). Problems with item parceling for confirmatory factor analytic tests of measurement invariance. Organizational Research Methods, 9, 369–403. https://doi.org/10.1177/1094428105283384

Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7, 361–88. https://doi.org/10.1177/1094428104268027

Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are Internet and paper-and-pencil personality tests truly comparable? An experimental design measurement invariance study. Organizational Research Methods, 10, 322–45. https://doi.org/10.1177/1094428106289393

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–43.

Merkle, E. C., Fan, J., & Zeileis, A. (2014). Testing for measurement invariance with respect to an ordinal variable. Psychometrika, 79, 569–84. https://doi.org/10.1007/s11336-013-9376-7

Molenaar, D. (2021). A flexible moderated factor analysis approach to test for measurement invariance across a continuous variable. Psychological Methods, 26, 660–79.

Nye, C. D., Bradburn, J., Olenick, J., Bialko, C., & Drasgow, F. (2019). How big are my effects? Examining the magnitude of effect sizes in studies of measurement equivalence. Organizational Research Methods, 22, 678–709. https://doi.org/10.1177/1094428118761122

Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96(5), 966–80.

Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–29.

Rensvold, R. B., & Cheung, G. W. (2001). Testing for metric invariance using structural equation models: Solving the standardization problem. In C. A. Schriesheim & L. L. Neider (Eds.), Equivalence in measurement (Vol. 1, pp. 21–50). Information Age.

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36.

Rutkowski, L., & Svetina, D. (2014). Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educational and Psychological Measurement, 74, 31–57.

Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18, 210–22.

Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–107.

Stevanovic, D., Jafari, P., Knez, R., Franic, T., Atilola, O., Davidovic, N., Bagheri, Z., & Lakic, A. (2017). Can we really use available scales for child and adolescent psychopathology across cultures? A systematic review of cross-cultural measurement invariance data. Transcultural Psychiatry, 54, 125–52.

Tay, L., Meade, A. W., & Cao, M. (2015).
An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18, 3–46. 1094428114553062 Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. 10.1177/109442810031002


Vandenberg, R. J., & Morelli, N. A. (2016). A contemporary update on testing for measurement equivalence and invariance. In J. P. Meyer (Ed.), Handbook of employee commitment (pp. 449–61). Edward Elgar Publishing. https://doi. org/10.4337/9781784711740.00047 Widaman, K. F., & Grimm, K. J. (2014). Advanced psychometrics: Confirmatory factor analysis, item response theory, and the study of measurement invariance. In Handbook of research methods in social and personality psychology (2nd ed.) (pp. 534–70). Cambridge University Press.


Yoon, M., & Lai, M. H. C. (2018). Testing factorial invariance with unbalanced samples. Structural Equation Modeling: A Multidisciplinary Journal, 25, 201–13. 2017.1387859 Yoon, M., & Millsap, R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling: A Multidisciplinary Journal, 14, 435–63. 10705510701301677



PART 4: Scale Improvement Methods


14 Reliability
Justin A. DeSimone, Jeremy L. Schoen and Tine Köhler

Among the many definitions of survey reliability, most discuss reliability as a form of measurement precision. Social science researchers are typically interested in assessing a person’s standing on a construct (e.g., intelligence, personality, attitude). However, latent constructs are not directly observable. Consequently, researchers develop scales to assess a person’s standing on these constructs and attempt to measure the characteristic of interest as precisely as possible. In classical test theory (CTT: Lord & Novick, 1968), an observed score on a survey reflects both the respondent’s true score (i.e., their actual standing on the construct) and measurement error (i.e., any deviation of the observed score from that actual standing). No survey is perfect (i.e., free from measurement error). Many potential sources of measurement error exist, including unstable or dynamic constructs, surveys that measure more than one thing, poor item or survey design (e.g., items that may be interpreted differently by different respondents), unfavorable testing conditions, intra-individual factors unrelated to the construct (e.g., fatigue, distraction, dishonest responding), rater bias (e.g., halo effects, stereotypes), and many more. The more measurement error an observed score contains, the less representative the observed score is of the true score, and the less confident we are that the observed score indicates a respondent’s
true standing on the construct. From this perspective, reliability is defined in terms of the extent to which an observed score reflects the true score, or as a function of the correlation between true scores and observed scores (i.e., the ratio of true score variance to observed score variance). If we assume that true scores are independent of measurement error (a typical assumption in CTT), then the variance in observed scores is a function of only the variance in true scores and the error variance, which leads to the intuitive conclusion that surveys containing less error variance are more reliable. Survey reliability has an important connection to survey validity (which will be discussed further in the following chapter). Reliability serves as a conceptual “upper bound” for validity (Gulliksen, 1950; Nunnally & Bernstein, 1994). While reliability is concerned with the amount of error in observed scores (i.e., measurement precision), validity is concerned with how well a measure represents the construct of interest. A survey cannot be valid if it is not reliable, but it is possible for a survey to be reliable without being valid (e.g., if it is precisely measuring something other than the intended construct). Consider again the mathematical definition of reliability as a correlation between observed scores and true scores. The ability for a survey to correlate with conceptually
related measures (convergent validity) or outcomes (criterion-related validity) requires that the observed scores also correlate with their true underlying construct. Thus, the presence of measurement error will attenuate the ability of a survey to correlate with other measures or criteria (Spearman, 1904). As reliability has implications for validity and perceptions of survey quality, it is a primary consideration in scale development and scale modification. Ceteris paribus, researchers and practitioners prefer using surveys with higher reliability in an effort to produce more consistent, trustworthy, and meaningful results. Observed scores are often conceptualized as a function of both the survey and the respondents (Birnbaum, 1968; Cronbach et  al., 1972). Consequently, the reliability of a survey should be calculated separately for every sample in which that survey is used, as surveys may perform differently in various samples (just as some constructs have different meanings or interpretations for different samples). Comparing the reliability of different measures of the same construct may allow researchers to select appropriate measures of a construct for a given sample. Selecting more reliable measures yields more precise assessment of our constructs of interest, less biased predictions of the relationships between our constructs of interest, and stronger predictive power to test our hypotheses and evaluate our theories (Cohen et al., 2003). Because measurement error can take various forms, different estimates of reliability operationalize this error in different ways. Some reliability indices attempt to isolate variance from a particular source (e.g., a test–retest correlation examining changes to scale scores over time), while generalizability theory allows for the simultaneous examination of multiple sources of measurement error. 
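The CTT logic above can be made concrete with a short simulation. The sketch below (in Python, with arbitrary variance values chosen purely for illustration) generates true scores and independent error, then checks that the ratio of true-score variance to observed-score variance matches the squared correlation between observed and true scores:

```python
import numpy as np

# Sketch of the CTT decomposition X = T + E. The variance values are
# arbitrary choices for illustration, not values from the chapter.
rng = np.random.default_rng(42)
n = 100_000
true_var, error_var = 4.0, 1.0

T = rng.normal(0.0, np.sqrt(true_var), n)   # true scores
E = rng.normal(0.0, np.sqrt(error_var), n)  # error, independent of T
X = T + E                                   # observed scores

# Reliability as the ratio of true-score variance to observed-score variance
reliability = T.var() / X.var()             # approaches 4 / (4 + 1) = .80

# Equivalently, the squared correlation between observed and true scores
r_xt_sq = np.corrcoef(X, T)[0, 1] ** 2
```

Because error variance inflates only the denominator, adding error necessarily lowers this ratio, which is the sense in which surveys containing less error variance are more reliable.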
In the following section, we introduce the concept of different types of reliability estimates, explain how each operationalizes measurement error, and discuss the various conceptual and practical considerations required for their use.

TYPES OF RELIABILITY

When reliability is taught in statistics or research methods courses, it is common to differentiate between different “types” of reliability coefficients. A brief, yet instructive, description of various types of reliability is provided by Lee Cronbach (1947), who specifies four different definitions of reliability corresponding to four different sets of assumptions and definitions of
measurement error. Cronbach focuses primarily on two factors: equivalence (consistency in scores across survey content) and stability (consistency in scores across time). Importantly, Cronbach (p. 2, emphasis in original) notes that “different assumptions lead to different types of coefficients, which are not estimates of each other.” The purest of Cronbach’s four types involves a hypothetical self-correlation in which the researcher would correlate scores across participants while holding constant both survey content and time. While this self-correlation would generate a very useful estimate of reliability, it is not possible to achieve without the use of a time machine, as it involves multiple administrations of the same survey at the same time to the same respondents. Cronbach’s other three types of reliability hold fewer factors constant, thereby increasing the potential sources of error or variation that may influence measurement. Specifically, the coefficient of equivalence involves administering different (but related) surveys at the same time. The coefficient of stability involves administering the same survey at different times. Cronbach also discusses a coefficient of stability and equivalence that involves administering different (but related) surveys at different (but proximal) times. While equivalence and stability are not necessarily the only factors that influence survey reliability, their distinction is instructive in understanding several truths about reliability coefficients. First, different estimates of reliability are not interchangeable. For example, surveys that are internally consistent are not necessarily temporally consistent (and vice versa). Even different coefficients within a given “type” of reliability (e.g., two different estimates of internal consistency) often differ in their operationalization of measurement error. 
Second, the selection of a specific reliability coefficient communicates a set of assumptions and an operationalization of measurement error. Third, not all reliability estimates are capable of being calculated in all situations, nor are all reliability estimates appropriate for use in all situations. Fourth, in practice, there is no “perfect” reliability estimate, as the closest we have come (Cronbach’s self-correlation) is purely hypothetical and cannot be achieved in practice. The following sections introduce and describe various types of reliability estimates prevalent in modern research, concluding with a brief description of how these types can be considered simultaneously through generalizability theory. In addition to internal consistency (equivalence) and temporal consistency (stability), we discuss forms of inter-rater reliability which can be computed when multiple judges provide ratings or scores. We then turn our attention to broader issues in
reliability and provide flowcharts to assist readers in determining appropriate reliability indices to report for their needs. Although uncommon in the literature, it is advisable to report more than one reliability coefficient if multiple coefficients are appropriate and authors clearly delineate which estimates are reported. Choosing appropriate reliability estimates to report is an important research decision, and we encourage readers to resist the temptation to simply report what they perceive to be prevalent or popular. Instead, we encourage critical thinking about the nature of measurement error, incorporating estimates of reliability into research design, and transparently reporting and justifying the reliability estimates that are most appropriate for a survey’s measurement and theory.

INTERNAL CONSISTENCY

The internal consistency of a survey is a coefficient of equivalence (Cronbach, 1947). As such, estimates of internal consistency are concerned with the extent to which different indicators of a construct are similar (or “equivalent”) measures of the same construct. These indicators can be different versions of the survey, two halves of a survey, or even individual items within a single survey. When computing an estimate of internal consistency, measurement error is operationalized as the “uniqueness” of the different indicators, or the variance in each indicator that is not shared with other indicators. However, internal consistency estimates differ in terms of “how equivalent” they require the different indicators to be. Table 14.1 describes each of these levels of equivalence (e.g., parallel, tau-equivalent), and readers interested in learning more about these levels can find more information in Gulliksen (1950) or Lord and Novick (1968). The four levels of equivalence are nested such that “higher” or more restrictive levels (e.g.,
parallel) can be considered special cases of lower levels (e.g., congeneric). In other words, parallel measurements can also be considered tau-equivalent, essentially tau-equivalent, and congeneric. These levels of equivalence reflect assumptions that underlie estimates of internal consistency. Additionally, since internal consistency refers to all indicators measuring the same construct, internal consistency estimates are useful for unidimensional surveys and not advisable for estimation of the reliability of multidimensional surveys (though estimates of internal consistency for each dimension of a multidimensional survey can and should be computed separately). Designing an internally consistent survey can improve clarity in the interpretation of survey scores and results. To design an internally consistent survey, the items included in the survey need to be conceptually and empirically related to one another. Survey developers typically accomplish this by selecting items that are intended to measure the same construct and demonstrate high inter-item correlations.

Parallel Forms

The earliest estimate of internal consistency is calculated by the correlation between two parallel forms of the same survey (preferably administered side-by-side to minimize time-related differences). Because the forms are parallel, this correlation is assumed to reflect the common true score. Any differences in scores between the forms are attributed to the “unique” aspects of each survey, with this uniqueness serving as the operationalization of measurement error. The development of parallel forms of a survey is practically difficult, as it involves developing two separate versions of a survey alike in both content and difficulty. Report a parallel forms correlation when estimating the internal consistency of two maximally equivalent forms of a survey for which the equivalence of true scores and error variances has been established.

Table 14.1  Levels of measurement equivalence

Parallel: T1 = T2, σ²e1 = σ²e2 (true scores equal, error variances equal)
Tau-equivalent: T1 = T2 (true scores equal)
Essentially tau-equivalent: T1 = T2 + a (true scores differ by an additive constant, a)
Congeneric: T1 = b·T2 + a (true scores differ by a multiplicative and an additive constant, b and a, respectively)

Note: T = true score, σ²e = error variance.

Split-half

If a survey is unidimensional, it can be split into equivalent halves, with each half serving as a parallel form. Like the parallel forms coefficient, the split-half coefficient allows for the computation of a correlation between parallel forms (halves), operationalizing measurement error as the uniqueness of each half. Ben Wood (1922) pioneered this technique by correlating the scores on even- and odd-numbered items, while subsequent critiques of this technique noted the importance of establishing the equivalence of each half (Crum, 1923; Kelley, 1924). Importantly, when computing the split-half coefficient, each “form” contains only half the number of items as the full survey. Since internal consistency is known to be a function of the number of items, it is common to adjust the computed split-half correlation using the Spearman-Brown prophecy formula to estimate the level of internal consistency of the full-length survey (Spearman, 1910). Report a split-half coefficient when estimating the internal consistency of a survey that can be divided into parallel halves.
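As an illustration, the following sketch computes an odd-even split-half correlation on simulated item data (none of the specific numbers come from a real survey) and steps it up to full length with the Spearman-Brown formula:

```python
import numpy as np

# Odd-even split-half estimate with the Spearman-Brown correction.
# The item responses are simulated; nothing here comes from a real survey.
rng = np.random.default_rng(1)
n_people, n_items = 500, 10
trait = rng.normal(size=(n_people, 1))                # latent standing
items = trait + rng.normal(size=(n_people, n_items))  # ten noisy items

odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

r_halves = np.corrcoef(odd_half, even_half)[0, 1]
# Spearman-Brown: project the half-length correlation to full length
split_half = 2 * r_halves / (1 + r_halves)
```

Because each half contains only five of the ten items, the raw half-to-half correlation understates the full survey's internal consistency; the correction compensates for the shortened length.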

Kuder-Richardson Formula 20 (KR20) and Coefficient Alpha

Kuder and Richardson (1937) developed a formula to estimate internal consistency that serves as the average of all possible split-half coefficients, though only under the condition where all inter-item correlations are equal. Even when this condition is not satisfied, KR20 can serve as a lower bound for reliability so long as the items are essentially tau-equivalent (Gulliksen, 1950). Unlike the aforementioned internal consistency indices, KR20 does not require the establishment of parallel forms or halves. KR20 is a function of the number of items, the sum of each item’s variance, and the variance of the total scores (computed by summing item scores). KR20 decreases as the ratio of summed item variance to total score variance increases; thus, KR20 operationalizes measurement error as the extent to which the variance of individual items is large relative to the variance of the total scores. KR20 was originally developed for use with dichotomous data (e.g., items that were only scored as “right” or “wrong”). However, this technique was later expanded for use with polytomous data (Guttman, 1945), which eventually came to be known as coefficient alpha (Cronbach, 1951). The only difference between the mathematical formulas for alpha and KR20 is how the variance of individual items is calculated. Alpha is the most widely reported reliability estimate in social science research (Cho, 2016).

Despite its popularity, authors frequently bemoan its misunderstanding and overuse (e.g., Cortina, 1993; Sijtsma, 2009; McNeish, 2017). According to researchers’ major grievances, alpha’s ease of calculation and availability in statistical programs has led to its misuse and misinterpretation by researchers willing to ignore or overlook assumptions (e.g., unidimensionality and essential tau-equivalence) and evaluations of suitability in favor of a quick and easy estimate of internal consistency. When researchers and scale developers report alpha simply because of its computational convenience, they risk violating the aforementioned assumptions, which may yield misestimates of reliability. Many researchers cite Jum Nunnally’s (1978) book when evaluating alpha with respect to a cutoff (usually .70). Specifically, Nunnally emphasizes that this cutoff is appropriate “in the early stages of research on predictor tests or hypothesized measures of a construct” (p. 245).1 Researchers may be skeptical of low alpha values, but it may also be appropriate to be skeptical of alpha levels that are too high, especially in shorter scales (John & Soto, 2007), as this may indicate that a survey contains redundant items and, in turn, a lack of construct coverage (i.e., low content validity). Report KR20 or alpha when estimating the internal consistency of an essentially tau-equivalent, unidimensional survey.
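The alpha computation described above — a function of the number of items, the summed item variances, and the total-score variance — can be sketched as follows (the two-item demonstration matrix is invented for illustration):

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Two perfectly correlated items: the summed item variance is half the
# total-score variance, so alpha reaches its maximum of 1.0.
demo = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
alpha_demo = coefficient_alpha(demo)
```

KR20 uses the same structure, differing only in how the individual item variances are computed for dichotomous items.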

COMPOSITE RELIABILITY AND OMEGA

Composite reliability estimates rely on factor loadings to define relationships between items and latent factors, operationalizing measurement error as item variance that cannot be explained by an underlying latent factor. The most widely reported estimate of composite reliability is the omega coefficient, which is typically attributed to Roderick McDonald (1999). Omega relies on an obliquely rotated factor analysis, the loadings from which are used to compute composite reliability (Revelle, 2009). There are two forms of omega: “hierarchical” (ωh), which only uses loadings on a general factor (on which all items load), and “total” (ωt), which uses loadings on a general factor as well as group factors (on which only a subset of items load). Although its calculation is substantially more complex than the previously discussed estimates of internal consistency, omega has two specific advantages over those estimates. First, omega assumes congeneric measurement as opposed to more stringent requirements such as parallel or tau-equivalent measurement. Second, as omega is
computed using factor analysis, loadings and error terms can be used as estimates of true score and error. Since a factor loading reflects the correlation between an item and its latent construct, the squared loading of an item on a factor is a direct estimate of true score variance while the item’s error term reflects uniqueness (i.e., measurement error). Report ωh when estimating the internal consistency of a congeneric, unidimensional survey on which all items load only on a single factor. Report ωt when estimating the internal consistency of a congeneric survey in which all items reflect a single general factor, but groups of items also reflect more specific factors.
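Given standardized loadings from a single-factor solution, a common form of the composite reliability computation can be sketched as below (the loadings are made-up illustrative values; in practice they would come from a fitted factor model):

```python
import numpy as np

# Composite reliability from a single-factor solution. The standardized
# loadings are made-up illustrative values, not estimates from real data.
loadings = np.array([0.7, 0.6, 0.8, 0.5])
uniquenesses = 1 - loadings ** 2      # error variance of standardized items

# (sum of loadings)^2 estimates true-score variance; the summed
# uniquenesses estimate error variance.
omega = loadings.sum() ** 2 / (loadings.sum() ** 2 + uniquenesses.sum())
```

This mirrors the logic in the text: squared loadings stand in for true-score variance, and each item's uniqueness stands in for measurement error.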

Empirical Reliability

An “empirical reliability” estimate of internal consistency relies on similar logic as composite reliability, but leverages item response theory (IRT), modeling observed scores as a function of both the item/survey and the respondent (Samejima, 1994). Empirical reliability assumes congeneric measurement and is computed as a function of the variance of person latent trait estimates from an IRT model (which serve a similar role to loadings and true score estimates) and the squared standard errors of those estimates (which serve a similar role to error variance). Thus, empirical reliability operationalizes measurement error as the extent to which the survey yields uncertainty in classifying respondents’ latent trait levels. Report empirical reliability when estimating the internal consistency of a congeneric survey for which IRT-based person parameter estimation is available.
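A minimal sketch, assuming latent-trait estimates and their standard errors have already been obtained from an IRT program (the values below are invented for illustration):

```python
import numpy as np

# Empirical reliability from IRT output. The theta estimates and standard
# errors below are invented; in practice they come from the fitted model.
theta_hat = np.array([-1.2, -0.5, 0.0, 0.4, 1.1, 1.8])  # person estimates
se = np.array([0.45, 0.38, 0.35, 0.36, 0.40, 0.52])     # standard errors

# Variance of trait estimates relative to that variance plus the
# mean squared standard error (the error-variance analogue).
empirical_rel = theta_hat.var(ddof=1) / (
    theta_hat.var(ddof=1) + (se ** 2).mean()
)
```

Larger standard errors (more uncertainty about where respondents fall on the latent trait) shrink the estimate, consistent with the operationalization of error described above.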

TEMPORAL CONSISTENCY

Temporal consistency is a coefficient of stability (Cronbach, 1947), or the extent to which a survey measures a construct consistently across different survey administrations (e.g., time 1 and time 2; or T1 and T2). Since all estimates of temporal consistency operationalize measurement error as change over time, an important conceptual consideration for their estimation involves the expected stability of the construct. Temporal consistency is most appropriate when the focal construct is expected to be stable over the period between survey administrations. For example, constructs such as ability or personality are expected to be more stable over time than constructs such as emotions or attitudes. Temporal consistency can be assessed for less stable constructs, but the time
interval between measurement periods should be smaller. Ideally, estimates of measurement error in coefficients of stability reflect changes in how the scale is interpreted by respondents over time. However, because coefficients of stability cannot differentiate between sources of change over time, estimates of measurement error may also include confounding factors such as (a) true score changes in the respondents (e.g., history or maturation effects) and/or (b) changes to the situation (e.g., an intervention or change in testing conditions between T1 and T2). Researchers can mitigate these confounding sources of variation by (a) computing temporal consistency only for constructs expected to remain stable and (b) ensuring testing conditions are similar across all survey administrations. Designing a temporally consistent survey can improve our confidence that survey scores are stable and reduce the risk that survey results are specific to a particular time period.

Test–retest

A straightforward way to evaluate the temporal consistency of a survey is simply to correlate survey scores at T1 and T2. The logic is similar to that of a parallel forms correlation, but instead of operationalizing measurement error as the uniqueness of the forms (which does not exist because the T1 and T2 surveys are identical and perfectly equivalent/parallel), the test–retest correlation operationalizes measurement error as changes in scores over time. Although test–retest correlations are typically computed on total scores, they can also be computed for each individual item to evaluate the temporal consistency of individual items. Test–retest correlations are relatively popular due to ease of calculation (i.e., a Pearson correlation), but have garnered some criticism. Cicchetti (1994) notes that the test–retest correlation evaluates the consistency of rank-orderings of survey scores across time, but is less sensitive to agreement (i.e., the extent to which survey scores retain their exact value). For example, if five respondents have scores of 1, 2, 3, 4, 5 at T1 and 6, 7, 8, 9, 10 at T2, the test–retest correlation will be 1.0 despite the scores being higher at T2 than T1. DeSimone (2015) also notes that there are multiple configurations of item scores that will yield the same summed or averaged survey scores. For example, if a person responds to five items with 1, 2, 3, 4, 5 at T1 and 5, 4, 3, 2, 1 at T2, the average (3) and sum (15) of these two sets of scores will be identical despite a very different response pattern. Both Cicchetti (1994) and DeSimone (2015) offer
alternatives to the test–retest correlation (discussed below) that partially address these limitations. Report a test–retest correlation when estimating the temporal consistency of the rank order of scores for a survey that has been administered to the same set of respondents multiple times.

Intraclass Correlation Coefficient (ICC)

ICCs are more commonly estimated for inter-rater reliability (see below) but can also be used to estimate temporal consistency (Cicchetti, 1994). The difference is that instead of the same target(s) being rated by different judges (inter-rater), ICCs for temporal consistency rely on the same indicators being measured at different times (inter-time). One benefit of using ICCs over Pearson correlations is the ability to evaluate agreement (as opposed to consistency) in survey scores over time. Data are entered similarly to a two-way ANOVA, and the ICC is computed as a function of the number of times and respondents as well as the mean square estimates for times, respondents, and error. Koo and Li (2016) recommend using a two-way, mixed-effects model to estimate agreement, which operationalizes measurement error as the extent to which scores are different at T1 and T2. Data are subjected to a two-way mixed-effects model in which the “time” factor is fixed and the “respondent” factor is random. Thus, an ICC(A,1) is most appropriate when survey scores are summed/averaged (yielding a single score per respondent per time) and an ICC(A,k) is most appropriate when multiple scores are evaluated (see McGraw & Wong, 1996). Various suggestions exist for acceptable magnitudes of temporal consistency ICCs (Cicchetti, 1994; Koo & Li, 2016), but these recommendations draw heavily from the literature on inter-rater reliability. Report ICCs when estimating the temporal agreement of scores for a survey that has been administered to the same set of respondents multiple times.
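A minimal sketch of ICC(A,1) computed from two-way ANOVA mean squares, following the McGraw and Wong (1996) formulation. The demonstration data reuse the shifted-scores example, where rank-order consistency is perfect but absolute agreement is poor:

```python
import numpy as np

def icc_a1(scores: np.ndarray) -> float:
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    scores: respondents (rows) by occasions or raters (columns).
    Mean squares follow the standard two-way ANOVA decomposition."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)                      # per respondent
    col_means = scores.mean(axis=0)                      # per occasion
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Shifted-scores demonstration: rank order is perfect, agreement is not.
ratings = np.column_stack([np.arange(1.0, 6.0), np.arange(6.0, 11.0)])
icc_demo = icc_a1(ratings)
```

For these data the Pearson correlation is 1.0, but the agreement ICC is only about .17, reflecting the uniform five-point shift between administrations.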

SRMRTC

While test–retest correlations and ICCs focus on the stability of item or survey scores across time, SRMRTC estimates the stability of item interrelationships (DeSimone, 2015). By leveraging the standardized root mean-square residual (SRMR; Bentler, 1995), SRMRTC provides an evaluation of the stability of the inter-item correlation matrix and operationalizes measurement error as differences in inter-item correlations over time.

Bentler’s SRMR compares two correlation matrices, yielding an estimate of the average absolute difference between correlations. SRMRTC uses an identical formula, but instead of comparing “observed” and “expected” correlation matrices (as is typical in structural equation modeling, or SEM), SRMRTC compares inter-item correlation matrices computed at T1 and T2. In addition to the aforementioned assumption of construct stability, SRMRTC assumes consistency in how the content of each item is related to the content of each other item. DeSimone (2015) suggests that SRMRTC values exceeding .08 may indicate instability in item interrelationships over time, though this recommendation is based on the empirical cutoff recommendation for SRMR in the context of SEM provided by Hu and Bentler (1999). Report SRMRTC when estimating the temporal consistency of item interrelationships over time.
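A sketch of the computation: the root mean square of the differences between corresponding unique inter-item correlations at T1 and T2. Both correlation matrices below are invented for illustration; in practice they would be computed from the observed item responses at each administration.

```python
import numpy as np

# SRMR-style comparison of T1 and T2 inter-item correlation matrices.
def srmr_tc(r_t1: np.ndarray, r_t2: np.ndarray) -> float:
    low = np.tril_indices_from(r_t1, k=-1)   # unique correlations only
    return float(np.sqrt(np.mean((r_t1[low] - r_t2[low]) ** 2)))

r1 = np.array([[1.00, 0.50, 0.40],
               [0.50, 1.00, 0.60],
               [0.40, 0.60, 1.00]])
r2 = np.array([[1.00, 0.45, 0.35],
               [0.45, 1.00, 0.65],
               [0.35, 0.65, 1.00]])

srmr_demo = srmr_tc(r1, r2)   # each correlation shifts by .05, so .05
```

In this example every inter-item correlation shifts by .05, yielding a value well below DeSimone's suggested .08 cutoff for flagging instability.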

CLTC

CLTC estimates the stability of a survey’s structure across time (DeSimone, 2015). Like SRMRTC, CLTC provides more information about the temporal consistency of a survey than can be evaluated by simply examining item or total scores. CLTC compares a survey’s principal component analysis (PCA) loadings across administrations, weighting each component by the average amount of variance it explains, to provide an estimate of the stability of these loadings across time. Importantly, PCA loadings are used (as opposed to loadings from exploratory factor analysis) to eliminate the potential effects of changes in (a) communality estimates (variance partitioned) and (b) rotation matrices across time. In addition to construct stability, CLTC assumes that the relationships between items and principal components remain stable over time, operationalizing measurement error as differences in loading patterns from T1 to T2. DeSimone (2015) suggests that CLTC levels below .70 may indicate inconsistency in component structure, though this cutoff has not been evaluated empirically. Report CLTC when estimating the temporal consistency of a survey’s component structure over time.

INTER-RATER RELIABILITY

Inter-rater reliability refers to a set of concepts and associated statistical estimates related to assessing how well judges align in their ratings of targets.


Inter-rater reliability is typically used when an indicator is scored or rated by multiple judges. In general, two classes of rater alignment are discussed with relation to inter-rater reliability. The first, consistency,2 operationalizes measurement error as inconsistency in the rank ordering of scores/ratings assigned by judges. The second, agreement, operationalizes measurement error as the extent to which judges provide different values for scores/ratings. Consistency is necessary, but insufficient for agreement, as consistency only requires rank-order similarity whereas agreement requires similarity of the actual scores/ratings provided. As an example, if Rater 1 assigns scores of 1,2,4 for three targets but Rater 2 assigns scores of 4,5,7, there would be a high level of consistency but low agreement. High agreement would require both raters to provide similar ratings (e.g., 1,2,3; 1,2,4). Designing a survey with high inter-rater reliability can improve our confidence that survey scores reflect a perception of the respondent that is shared across raters. In order to design a survey with high inter-rater reliability, it is important to select knowledgeable raters and remove sources of subjectivity or bias in the scoring instructions.

Inter-rater Correlation

Like the parallel forms and test–retest correlations discussed above, computing a Pearson correlation between two sets of ratings is a straightforward way to estimate inter-rater reliability. As correlations emphasize rank-order similarity as opposed to absolute similarity, inter-rater correlations should be considered estimates of consistency, operationalizing measurement error as inconsistency in judgments between raters. The logic behind this technique also follows the logic of the parallel forms and test–retest indices, treating the two raters as “parallel” (Murphy & DeShon, 2000). Inter-rater correlations are rarely reported due to some potential drawbacks in estimation and interpretation. For example, when more than two raters provide scores, separate correlations can be computed for each pairwise combination of raters, which can yield results that are difficult to interpret. Additionally, inter-rater correlations may overestimate error variance, yielding an underestimate of inter-rater reliability. Report an inter-rater correlation when estimating inter-rater consistency between two raters who provide parallel estimates.

Kappa

Kappa assesses inter-rater reliability for judgments of targets into “k unordered categories” (Cohen, 1960, p. 37). That is to say, the raters make a judgment based on a nominal categorization, with higher values indicating higher frequency of the raters agreeing on category assignments. Accordingly, kappa is an estimate of agreement and operationalizes measurement error as rater disagreement about categorization of targets. A common alternative to kappa involves simply estimating the percentage of cases in which the raters agree, though kappa is generally preferable because it also accounts for the expected value of agreement by chance. Assumptions include the independent rating of targets by both raters and both raters’ judgments being afforded equal importance. Cohen’s kappa is typically computed for agreement between two raters, though adaptations exist that can be applied with more than two raters or when raters do not rate all targets (Fleiss, 1971). Report kappa when estimating inter-rater agreement concerning categorization of targets into nominal categories.
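Cohen’s (1960) two-rater kappa is simple to compute from raw category labels: observed agreement is corrected by the agreement expected by chance from each rater’s marginal category proportions. A minimal sketch (the function name and example labels are ours):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's (1960) kappa: chance-corrected agreement for two raters
    assigning the same targets to unordered (nominal) categories."""
    n = len(rater1)
    # Proportion of targets on which the raters agree
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's marginal proportions
    m1, m2 = Counter(rater1), Counter(rater2)
    p_chance = sum(m1[c] * m2[c] for c in m1) / n**2
    return (p_obs - p_chance) / (1 - p_chance)

r1 = ["a", "a", "b", "b", "b", "c"]
r2 = ["a", "a", "b", "b", "c", "c"]
kappa = cohens_kappa(r1, r2)  # raters agree on 5 of 6 targets
```

Here the raw percentage agreement is 5/6 (about .83), but kappa is .75 because some of that agreement would be expected by chance alone.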

ICC

ICCs were discussed above as an option for computing temporal consistency, with advice to focus on indices of agreement for temporal consistency. However, many different ICCs may be used to estimate inter-rater reliability, and the appropriate choice of ICC depends on the ways in which targets are judged by raters and the intended inferences (Shrout & Fleiss, 1979). ICCs are based on an ANOVA framework, and all ICCs are computed as a function of the mean square values for raters, mean square values for targets, mean square error, number of raters, and/or number of targets. Raters are assumed to be interchangeable, and some ICCs can be computed when not all raters rate all targets. Forms of ICCs exist to assess both agreement and consistency, with measurement error operationalized in various ways, but always involving a mean square error term. Several naming conventions are used when describing ICCs. It is common to see the estimate of consistency for j raters nested within k groups (e.g., employees rating their supervisor, team members judging characteristics of their own team) referred to as ICC(1) and the stability of the means of those ratings referred to as ICC(2) (see Krasikova & LeBreton, 2019; Bliese, 2000). When raters are nested within targets, ICCs will always indicate agreement, though when raters and targets are crossed (i.e., all raters rate all targets), indices of both agreement and consistency are calculable. Many different cutoff values are offered for inter-rater agreement statistics (Cicchetti, 1994; LeBreton & Senter, 2008). Suggested cutoffs vary based on type of estimate used, application, and/or assumptions made (e.g., type of comparison distribution). LeBreton and Senter (2008) suggest ranges for quantifying inter-rater agreement estimates, with .00 to .30 indicating a lack of agreement, .31 to .50 weak agreement, .51 to .70 moderate agreement, .71 to .90 strong agreement, and .91 to 1.00 very strong agreement. ICCs are a broad category, but they should be reported when estimating inter-rater agreement or consistency for crossed or nested data (for further guidance, see McGraw & Wong, 1996).
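For the fully crossed case (all raters rate all targets), the contrast between agreement and consistency ICCs can be sketched from the two-way ANOVA mean squares, following the single-rater formulas in McGraw and Wong (1996). The function name and example data below are ours; the data reuse the 1,2,4 versus 4,5,7 example from above, where rank ordering is perfect but the raters’ means differ.

```python
import numpy as np

def icc_two_way(X):
    """ICC(C,1) and ICC(A,1) for a two-way crossed design
    (rows = targets, columns = raters), per McGraw & Wong (1996)."""
    n, k = X.shape
    grand = X.mean()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc_c = (msr - mse) / (msr + (k - 1) * mse)                       # consistency
    icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # agreement
    return icc_c, icc_a

# Three targets, two raters: Rater 2 = Rater 1 + 3
X = np.array([[1.0, 4.0], [2.0, 5.0], [4.0, 7.0]])
icc_c, icc_a = icc_two_way(X)  # consistency is perfect; agreement is low
```

Because the agreement form penalizes the between-rater mean difference while the consistency form ignores it, the two indices diverge sharply for these data.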

RWG

The rwg and associated rwg(j) (for multi-item ratings) indices provide estimates of within-group agreement (James et al., 1993). Rather than relying on proportions (like kappa) or an ANOVA framework (like ICCs), rwg instead compares observed variance in ratings to the variance expected under various random distributions (e.g., a “rectangular” or uniform distribution, a “triangular” or normal distribution). The decision about which random distribution to use directly affects the value of rwg, and although it is most common to use a uniform distribution, this decision should be justified based on conceptual considerations (Meyer et al., 2014). Measurement error is operationalized as the extent to which observed rating variance exceeds the rating variance expected under the selected random distribution. A separate rwg statistic can be estimated for every target, and these estimates can then be aggregated (e.g., mean, median) to report an overall estimate. rwg(j) is used when raters provide scores on multi-item scales, with the assumption that the items are parallel. In cases of low variation in group means, ICCs can provide underestimates of inter-rater reliability, while rwg and rwg(j) are not influenced by between-group variation. ICCs and rwg estimates are often used in conjunction, particularly when justifying aggregation in multilevel modeling. A cutoff value of .70 is often used, though this value is not based on empirical evidence or simulation studies (LeBreton & Senter, 2008). Report rwg when estimating inter-rater agreement on continuous single or multi-item rating scales.
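For the single-item case with a uniform null distribution, rwg reduces to one minus the ratio of the observed rating variance to the uniform-null variance, which for A response options is (A² − 1)/12 (James et al., 1993). A minimal sketch (function name and example ratings are ours):

```python
def rwg(ratings, n_options):
    """Single-item rwg (James et al., 1993) against a uniform
    ("rectangular") null distribution with n_options response options."""
    k = len(ratings)
    m = sum(ratings) / k
    s2 = sum((x - m) ** 2 for x in ratings) / (k - 1)  # observed variance
    sigma2_eu = (n_options**2 - 1) / 12                # uniform-null variance
    return 1 - s2 / sigma2_eu

# Four raters on a 5-point scale, clustered at the high end
agreement = rwg([4, 4, 5, 5], n_options=5)  # high agreement (about .83)
```

Swapping in a different null distribution only requires replacing the sigma2_eu term, which is why the choice of null distribution so directly affects the resulting value.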

GENERALIZABILITY THEORY

As noted above, reliability estimates are often categorized into different “types” based on how they operationalize measurement error (e.g., attributing it to differences in survey content, across time, or between raters). All reliability estimates attempt to explain variance in item or total scores as a function of some fundamental characteristic of the testing process. The techniques described above each attempt to do so by holding all factors constant except the single factor to which they attribute measurement error (e.g., inequivalence in item content, instability across time, inconsistency in rating/scoring). Generalizability theory, on the other hand, offers researchers a method of modeling response variance as a function of multiple factors simultaneously (Shavelson & Webb, 1981). For example, a “G study” partitions and assigns response variance to various factors (e.g., differences in item content, differences in time, differences in raters) and combinations of these factors, allowing for a simultaneous examination of different ways to operationalize measurement error. Generalizability theory draws heavily from the basic tenets of sampling theory. Specifically, sampling theory allows us to make inferences from our sample to the larger population by randomly selecting members of a theoretically infinite population to study. Cronbach et al. (1972) describe the sampling of items similarly to the sampling of respondents. If a theoretically infinite population exists from which respondents can be sampled, a conceptually similar statement can be made about a theoretically infinite population of items that can be sampled to measure a construct. Although generalizability theory focuses primarily on items and respondents, as both factors exist in all multi-item surveys, the same “sampling” concept can be applied to any other survey characteristics (e.g., sampling from an infinite population of forms, times, raters, etc.).
In generalizability theory, item responses serve as the dependent variable in an ANOVA-like framework, with the items and respondents (at a minimum) serving as factors. In addition to items and respondents, additional factors may also be modeled depending on the testing situation. For example, if different (e.g., parallel) forms of the survey exist, the survey is administered at multiple times, or the survey is scored by multiple raters, then survey forms, times, or raters could be modeled as additional factors (respectively). Generalizability theory attributes response variance to each of these factors and each combination of these factors, giving analysts a sense of the extent to which responses vary as a function of each factor/combination. While we always expect variation attributable to respondent individual differences, we typically hope to see far less variation attributed to these other factors (e.g., items, times, raters, forms). In scale development, generalizability theory allows for the exploration of the amounts of variance accounted for by each of these other factors (and interactions between these factors), which can help to determine whether the scale is performing as intended and where to focus attention for future item development. Generalizability theory further allows scale developers to simultaneously consider the various factors that each of the aforementioned “types” of reliability attempts to isolate.
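The simplest possible G study, for a crossed persons × items design, can be sketched from the ANOVA mean squares: each variance component is recovered from the expected mean squares, and a generalizability coefficient for relative decisions follows from the person and residual components. This is a minimal illustration only (function name and data are ours); real G studies typically model additional facets such as raters or occasions.

```python
import numpy as np

def g_study(X):
    """Minimal single-facet G study for a crossed persons x items design.

    Partitions variance into person, item, and residual components via
    ANOVA expected mean squares, and returns the generalizability
    coefficient for relative decisions over the n_i items administered.
    """
    n_p, n_i = X.shape
    grand = X.mean()
    ss_p = n_i * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_i = n_p * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((X - grand) ** 2).sum() - ss_p - ss_i
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    var_p = max((ms_p - ms_res) / n_i, 0.0)  # person (universe-score) variance
    var_i = max((ms_i - ms_res) / n_p, 0.0)  # item difficulty variance
    var_res = ms_res                          # person x item interaction + error
    g_coef = var_p / (var_p + var_res / n_i)  # E(rho^2), relative decisions
    return var_p, var_i, var_res, g_coef

# Four respondents x three items; responses differ across items (difficulty)
# and across persons, but with no person x item interaction.
X = np.array([[2.0, 3, 4], [3, 4, 5], [4, 5, 6], [1, 2, 3]])
var_p, var_i, var_res, g = g_study(X)
```

For these artificial data, all item-related variance is systematic difficulty variance (var_i) rather than error, so the generalizability coefficient is essentially 1.0 despite visible differences in item means.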

ADDITIONAL ISSUES RELATED TO RELIABILITY

Cutoff Values

Reliability indices tend to produce continuous values between 0 and 1, with higher values indicating more reliable measurement (i.e., less measurement error). However, because of the complexity of research design, sampling, and data analysis, there exists no single cutoff value for any index of reliability that accurately distinguishes between acceptable and unacceptable measures in all situations. There is a danger of oversimplifying and/or misinterpreting a continuous index through artificial dichotomization (cf. Folger, 1989). While higher reliability values are generally preferable to lower values, we advise researchers to interpret reliability indices as continuous indicators of measurement error and, when possible, to compare reliability values across scales and across samples when deciding which measures to use. We understand that researchers often rely on cutoff values to guide decisions, and we have discussed and referenced some of the more popular cutoff value suggestions in previous sections. We emphasize that most cutoff suggestions are arbitrary rules of thumb, are not informed by empirical or simulation data, may vary by situation, and are still subject to debate in the literature. Instead of dichotomously evaluating measures as good or bad (or acceptable/unacceptable), researchers should consider the research context when evaluating a survey’s reliability estimate. Relatedly, it is critically important for survey developers/evaluators to avoid relying on poor scale development practices in an effort to artificially inflate reliability estimates to reach a desired cutoff. Such practices include the use of redundant items, administering a survey twice within an inappropriately short time interval, or intentionally relying on raters/scorers with overly similar perspectives. Relying on these practices may increase reliability but does so at the expense of construct validity (John & Soto, 2007).

Scale Modification

Reliability (and validity) may be affected by scale modifications (Heggestad et al., 2019). Such modifications include deleting items to shorten the scale or to remove “problematic” items, lengthening scales, combining scales, rewording items, changing the target or reference point of items, changing item order, and many other types of changes. Any scale modification decision should be deliberately considered, justified, and disclosed when discussing study measurement and methodology. The considerations for scale modification are identical to the considerations for scale development described throughout this book. For example, it is a best practice to transparently report all research design decisions made over the course of a study to facilitate understanding of how a study was conducted and evaluation of the appropriateness of the decisions made (Aguinis et al., 2018). In addition, transparency is necessary for future replications of the research (Köhler & Cortina, 2021) or when comparing the results of multiple studies. Reporting the performance of surveys across different studies, samples, and research contexts may allow future researchers to more accurately evaluate the performance of a survey and/or may facilitate effective revisions or adaptations to an existing survey. Several guidelines exist for the reporting of scale modifications, and we strongly encourage researchers to follow best practice recommendations for reporting (Heggestad et al., 2019; Smith et al., 2000).

Using Reliability Estimates in Meta-analyses

Reliability estimates can be useful when conducting certain types of meta-analyses. For example, a popular meta-analytic methodology in certain disciplines is Schmidt and Hunter’s (2015) psychometric meta-analysis. In addition to accounting for sampling error, psychometric meta-analysis encourages analysts to account for other sources of potential bias such as measurement error (reliability). Adjusting effect sizes for measurement error involves estimating what the relationships would be under conditions of perfect measurement (Spearman, 1904). DeSimone (2014) and Köhler et al. (2015) discuss the assumptions



[Figure 14.1 is a decision flowchart for internal consistency (equivalence). Its decision questions include: Are there two or more parallel forms of the test? (Parallel Forms); Can the test be split into two parallel halves?; Does the test contain essentially tau-equivalent indicators?; Are test items scored dichotomously?; Do you have access to IRT-based person parameter estimates and standard errors? (Empirical Reliability); Does the test contain only one factor, or also groups of items that load onto minor factors? (One Factor; Groups of Items; endpoints include Omega Hierarchical and Omega Total).]

Figure 14.1  Flowchart for determining which internal consistency index to use



underlying this adjustment as well as corresponding considerations for interpreting a meta-analytic effect size. Survey researchers should consider that their study results may eventually be included in a meta-analysis. Thus, it is crucial to choose appropriate reliability estimates that reflect central concerns about measurement error in the study. Additionally, survey researchers should explicitly report the type of reliability estimate computed alongside important information about interpreting that reliability estimate, including scale modifications, reliability estimates for sub-scales if the construct is not unidimensional, information about the time interval between measurements, information about rater characteristics, or whatever else is deemed an important influence on the reliability estimate in question. By following these recommendations, survey researchers can be more confident that their results will be incorporated appropriately into secondary analyses such as meta-analysis.
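The adjustment in question is Spearman’s (1904) correction for attenuation: the observed correlation is divided by the square root of the product of the two measures’ reliabilities, estimating the correlation that would be obtained under perfect measurement. A minimal sketch (function name and example values are ours):

```python
from math import sqrt

def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's (1904) correction for attenuation: the estimated
    correlation between x and y under conditions of perfect measurement."""
    return r_xy / sqrt(rel_x * rel_y)

# Observed r = .30, with reliabilities of .80 (x) and .90 (y)
r_corrected = disattenuate(0.30, 0.80, 0.90)  # roughly .35
```

Note that the corrected estimate is only as trustworthy as the reliability estimates fed into it, which is precisely why DeSimone (2014) and Köhler et al. (2015) urge caution about the assumptions underlying the adjustment.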

CONCLUSION

Consolidating the information provided in this chapter, the flowcharts in Figures 14.1–14.3 constitute guides intended to help researchers determine the most appropriate reliability index to use based on both practical and conceptual considerations. Figure 14.1 (internal consistency) is useful when equivalence is a primary concern (“Is the entire survey measuring the same thing?”). Figure 14.2 (temporal consistency) is useful when stability is a primary concern (“Is the survey performing consistently across time?”). Figure 14.3 is useful when scoring/rating is a primary concern

[Figure 14.2 is a decision flowchart for temporal consistency (stability). Its decision questions include: Are you interested in assessing absolute agreement or rank-order consistency?; For which aspect of the responses is consistency important? (Relationships Between Items; Component Loadings; Test Scores; Item Scores); endpoints include the test-level test–retest correlation and item-level test–retest correlations.]

Figure 14.2  Flowchart for determining which temporal consistency index to use



[Figure 14.3 is a decision flowchart for inter-rater reliability. Its decision questions include: What is the level of measurement for your measure’s ratings? (Interval or Ratio); Are you interested in assessing absolute agreement or rank-order consistency?; Is it important to account for differences in rater means?; endpoints include ICC (Agreement), ICC (Consistency), and the inter-rater correlation.]

Figure 14.3  Flowchart for determining which inter-rater reliability index to use

(“Is the survey scored/rated similarly by different scorers/raters?”). If readers are concerned about more than one of these issues, we encourage the use of generalizability theory as a more exploratory investigation of survey reliability. We hope that these figures assist researchers in selecting appropriate reliability indices as well as understanding and justifying those choices in their research. We encourage interested readers to learn more about reliability estimation and psychometrics, and we urge all researchers to transparently and accurately report how reliability is calculated, interpret reliability estimates properly, and report any information pertaining to measurement that may influence how a reliability estimate is interpreted by readers. By achieving a more thorough understanding of reliability and measurement error, survey researchers can improve their measurement, which can lead to more confident tests of hypotheses and contributions to theory.

Notes

1. In the first edition of Nunnally’s book (1967, p. 226), the cutoff is lower (“reliabilities of .60 or .50 will suffice”). However, the third (and last) edition of the book (Nunnally & Bernstein, 1994) retained the recommendation of .70, and all three editions of the book specify cutoffs of .80 for basic research and .90 (preferably .95) in applied settings.
2. Consistency is also commonly (and confusingly) referred to as “reliability.”

REFERENCES

Aguinis, H., Ramani, R. S., & Alabduljader, N. (2018). What you see is what you get? Enhancing methodological transparency in management research. Academy of Management Annals, 12(1), 83–110.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.
Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 349–381). San Francisco, CA: Jossey-Bass.
Cho, E. (2016). Making reliability reliable: A systematic approach to reliability coefficients. Organizational Research Methods, 19(4), 651–682.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2013). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Routledge.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
Cronbach, L. J. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12(1), 1–16.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.
Crum, W. L. (1923). Note on the reliability of a test, with special reference to the examinations set by the College Entrance Board. The American Mathematical Monthly, 30, 296–301.
DeSimone, J. A. (2014). When it’s incorrect to correct: A brief history and cautionary note. Industrial and Organizational Psychology, 7, 527–531.
DeSimone, J. A. (2015). New techniques for evaluating temporal consistency. Organizational Research Methods, 18(1), 133–152.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychological Bulletin, 106, 155–160.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Guttman, L. (1945). A basis for analyzing test–retest reliability. Psychometrika, 10(4), 255–282.
Heggestad, E. D., Scheaf, D., Banks, G. C., Hausfeld, M. M., Tonidandel, S., & Williams, E. (2019). Scale adaptation in organizational science research: A review and best-practice recommendations. Journal of Management, 45, 2596–2627.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
James, L. R., Demaree, R. G., & Wolf, G. (1993). rwg: An assessment of within-group inter-rater agreement. Journal of Applied Psychology, 78, 306–309.
John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 461–494). New York: Guilford Press.
Kelley, T. L. (1924). Note on the reliability of a test: A reply to Dr. Crum’s criticism. The Journal of Educational Psychology, 15, 193–204.
Köhler, T., & Cortina, J. M. (2021). Play it again, Sam! An analysis of constructive replication in the organizational sciences. Journal of Management, 47(2), 488–518.
Köhler, T., Cortina, J. M., Kurtessis, J. N., & Gölz, M. (2015). Are we correcting correctly? Interdependence of reliabilities in meta-analysis. Organizational Research Methods, 18, 355–428.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155–163.
Krasikova, D., & LeBreton, J. M. (2019). Multilevel measurement: Agreement, reliability, and nonindependence. In S. E. Humphrey & J. M. LeBreton (Eds.), Handbook of multilevel theory, measurement, and analysis (pp. 279–304). Washington, DC: American Psychological Association.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160.
LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815–852.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
McNeish, D. (2017). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23, 412–433.
Meyer, R. D., Mumford, T. V., Burrus, C. J., Campion, M. A., & James, L. R. (2014). Selecting null distributions when calculating rwg: A tutorial and review. Organizational Research Methods, 17, 324–345.
Murphy, K. R., & DeShon, R. (2000). Inter-rater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Revelle, W. (2009). Classical test theory and the measurement of reliability. In An introduction to psychometric theory with applications in R. www.
Samejima, F. (1994). Estimation of reliability coefficients using the test information function and its modifications. Applied Psychological Measurement, 18, 229–244.
Schmidt, F., & Hunter, J. (2015). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Thousand Oaks, CA: Sage.
Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology, 34, 133–166.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120.
Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12(1), 102.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Wood, B. D. (1922). The reliability and difficulty of the College Entrance Examination Board Examinations in algebra and geometry. New York: College Entrance Examination Board.

15 Validity

Chester A. Schriesheim and Linda L. Neider

INTRODUCTION

In science, valid measurement lies at the very heart of our efforts to study, understand, and explain various phenomena. This is true in all scientific endeavors, but especially in the social sciences, where we cannot separate what is known from how it is known (measured) (cf. Nunnally & Bernstein, 1994; Kerlinger & Lee, 2000). Why? Because if we are measuring the wrong phenomenon, or if we are interpreting the evidence incorrectly, study findings are likely to be erroneous and possibly very seriously misleading. Repeating a favorite quotation from Abraham Korman (1974, p. 194) that summarizes the critical nature of valid measurement: “The point is not that accurate [valid] measurement is ‘nice.’ It is necessary, crucial, etc. Without it we have nothing.”

Example. As an example of the importance of valid measurement, imagine conducting research relating employee intelligence to employee work satisfaction. Imagine further that we have decided to use a bathroom weight scale to measure intelligence (and employees are individually weighed, following good experimental procedure – e.g., two independent persons recording each observation, etc.). Despite the good procedure, any inferences or conclusions that are drawn from the study’s findings are going to be erroneous and potentially misleading because we are using an invalid measure of intelligence. Bathroom scales are designed to measure body weight and, although body weight may or may not be related to intelligence, body weight is not the phenomenon we seek to measure. Thus, using a bathroom scale is not a valid way to measure intelligence, although it may yield a valid measure of body weight.

The New Conceptualization of Validity

About two decades ago, psychometricians came to a new consensus understanding of what validity entails. This new unified concept of validity, as agreed upon by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA, APA, and NCME, 2014):

interrelates validity issues as fundamental aspects of a more comprehensive theory of construct validity that addresses both score meaning and social values in test interpretation and test use. That is, the new unified validity concept integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships, including those of an applied and a scientific nature. (Messick, 1995, p. 741; see also Newton & Shaw, 2013)

Historically, validity has been most commonly defined as the degree to which a measure is viewed as assessing what it purports to measure (e.g., Nunnally & Bernstein, 1994). In other words, validity has, historically, been treated as a property of a particular measure. The new conceptualization views validity much more broadly, as the degree to which interpretations and conclusions drawn from the use of a measure are supported by scientific evidence.

Validity evidence. According to Guion and Schriesheim (1997, p. 380), “Validity is currently … judged from a variety of accumulated information.” There are many types or forms of validity evidence that have been identified, each with a specific meaning and domain of relevance. For example, with respect to a particular study (not a specific measure), one can distinguish between the study’s internal validity (the degree to which the study has controlled for contaminating or confounding factors) and its external validity (the degree to which a study’s findings can be generalized to other persons and conditions) (Shadish, Cook, & Campbell, 2002); there is also “face validity,” which refers to the degree to which a measurement process appears or looks reasonable to observers. However, this chapter focuses on the three major types of evidence that have been historically recognized as central to scientific inquiry and to the sound application and interpretation of a measure: content, criterion-related, and construct validity evidence (note, again, that these are types of evidence and not types of validity; Newton & Shaw, 2013).

Content Validity Evidence

Content validity evidence refers to evidence that bears on the appropriateness of a particular construct measure’s content (Guion, 1997). In other words, content validity refers to the specific domain of a construct and whether the evidence on a measure indicates that it representatively samples the full relevant content domain of the construct and does not assess extraneous or irrelevant content. For example, one would ask whether a survey questionnaire measure of felt anxiety assesses the full domain of felt anxiety and whether it also (unintentionally) assesses some extraneous factor(s), such as elements of personality (e.g., locus of control, extraversion, etc.). Judgments of content validity evidence have often been casual or informal, based upon the opinions of “subject matter experts” (Nunnally & Bernstein, 1994). However, recent developments in applied psychometrics now provide methods for the quantitative analysis and testing of content validity judgments (e.g., Schriesheim, Powers, Scandura, Gardiner, & Lankau, 1993; Colquitt, Sabey, Rodell, & Hill, 2019) and for enhanced practices using subject matter experts and focus groups (Haynes, Richard, & Kubany, 1995; Vogt, King, & King, 2004). Some of these methods are briefly summarized in this chapter, along with best practice recommendations on how to develop survey measures so as to maximize their content validity.
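The quantitative approaches cited above each have their own procedures; as a simple illustration of the general idea of quantifying subject matter experts’ judgments, the sketch below computes Lawshe’s (1975) classic content validity ratio (CVR), which is not one of the methods cited in this chapter. It assumes each expert has rated an item as “essential” or not.

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's (1975) CVR for a single item: +1 when every expert rates
    the item "essential," 0 when exactly half do, negative below that."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

# Nine of ten subject matter experts rate the item essential
cvr = content_validity_ratio(9, 10)  # 0.8
```

Items with low or negative CVR values are candidates for revision or removal before any further validation work, consistent with the chapter’s point that content problems should be resolved first.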

Criterion-related Validity Evidence

Criterion-related validity evidence refers to empirically obtained relationships between a measure and other constructs or variables to which it theoretically should be related. This is usually assessed through associative statistical tests such as the Pearson correlation, and two forms of criterion-related validity evidence are generally distinguished: concurrent and predictive. Concurrent validity evidence refers to obtained evidence of an empirical relationship between a measure and a theoretically relevant dependent variable (such as job satisfaction), in which both measurements are obtained at the same point in time. Predictive validity evidence refers to an empirical relationship where the dependent variable is measured at a later time (Schriesheim, 1997).

Construct Validity Evidence A construct is a hypothetical (latent) variable that is not directly measurable. It is literally constructed for the purpose of theoretical explanation (Kerlinger & Lee, 2000). Going back to our first example, “intelligence” is a construct that was created to explain why certain “smart” (or “intelligent”) behaviors are displayed by some persons in some situations and not by others. Intelligence cannot be directly observed or measured, although we can observe and measure behaviors that we believe are indicative of intelligence. Basically, construct validity refers to a summary judgment that is made about the degree to which interpretations and applications that are based upon a construct and its measure are supported by the extant scientific evidence (AERA, APA and NCME, 2014). The process of construct validation can be a long and complicated one and this chapter summarizes different types of evidence that go into arriving at this judgment. Best practices for construct validation are also considered, along with additional research that may enhance the process of judging construct validity.

CONSTRUCTING MEASURES AND ASSESSING CONTENT VALIDITY Much has been written about the processes that should be employed in the development of survey questionnaire measures. In particular, readers should familiarize themselves with the seminal works of Hinkin (1995, 1998) and, more recently, MacKenzie, Podsakoff, and Podsakoff’s (2011) integrative article for detailed recommendations on best practices. Given the potential complexity of the construct validation process, it would be ridiculous to argue that any type or form of validity evidence is more important than any other form of evidence. However, it is generally accepted that “if the content of a research instrument is not adequate, that measure cannot be a valid operational procedure for measuring the theoretical variable of interest (i.e., it cannot be construct-valid…).” (Schriesheim et  al., 1993, p. 386). This suggests that content validity may be the first psychometric property of any measure that should be assessed, since finding inadequate content negates the need for further examining the validity of a measure – at least until the identified content validity problems have been resolved.

Developing New Theory on the Construct The first step in constructing any new measure is to develop a sound and detailed theory of the construct that the new measure is supposed to be measuring. Such theory can be deductively or inductively developed, and the underlying construct can be reflective or formative in nature (see MacKenzie et al., 2011, for details). The theory of the construct must clearly identify what aspects or subdimensions are legitimately included and excluded from the construct. For example, an employee’s general job satisfaction can be theorized to be a global or overall (summary) affective reaction to all of the employee’s experiences while working for a particular employer (e.g., Hoppock, 1935). It can also be defined as an affective reaction that is based equally upon an employee’s satisfaction with twenty specific areas (such as supervision, pay, co-workers, company policies, e.g., Weiss, Dawis, England, & Lofquist, 1967). The theorist must specify whatever aspects or subdimensions that the theory says are relevant and must be included, and each aspect or subdimension should be specifically defined and its weight as a component of the construct theoretically specified (for example, being equally weighted relative to the other subdimensions). The second step in new measure construction is to identify similar and dissimilar constructs and to develop theory as to why the construct is or is not related to these other constructs. Lastly, a good theory of a new construct maps out the “nomological network” of the construct (Cronbach & Meehl, 1955) in terms of constructs and variables to which it should and should not be related. This allows the development of criterion-related validity evidence as well as studies to demonstrate convergence and divergence with other, theoretically relevant, constructs and variables (Campbell & Fiske, 1959). All such evidence is necessary in order to reach a sound conclusion concerning validity.

Item Generation Once detailed theory has been developed concerning the construct, item generation can begin. If the construct is unitary (i.e., has no subdimensions), the construct definition is used to write items that measure only the construct and avoid measuring other (related or unrelated) constructs. If the construct is multidimensional, appropriate subdimension definitions are used to write items measuring each subdimension.

Response format selection. Before item writing commences, a response format should be selected that is congruent with the theory of the construct and in which the psychological intervals between the response options are approximately equal. For example, a commonly encountered decision is whether to measure frequency, magnitude, or agreement. Measuring frequency requires response choices such as Always, Very Often, Sometimes, Seldom, and Never, while measuring magnitude necessitates choices such as Entirely, Mostly, About Half, Very Little, and Not at All. Agreement is easily measured on a Likert response scale, with categories such as Strongly Agree, Agree, Undecided or Neutral, Disagree, and Strongly Disagree. (Likert response scales are perhaps the most commonly used, but other response choices can be developed; see Schriesheim & Novelli, 1989, and Schriesheim & Castro, 1996, for illustrations of how this may be accomplished.)

Item specifics. Generally speaking, research has shown that having 4–6 or more items per subdimension is necessary for adequate subdimension internal reliability and adequate construct domain sampling (Hinkin, 1998; for more information on reliability, see Chapter 14 in this volume). However, because items will be eliminated due to unsatisfactory evaluations, it is best to write at least twice as many items as the final measure is planned to contain. Items should be kept short and simple. Long and/or complicated questions run a substantial risk of being misunderstood and eliciting erroneous responses or encouraging respondents to not answer these questions. All items should be checked for their reading level and each item must be written at the level most appropriate for the planned respondents.

Double-barreled items. Double-barreled items should generally be avoided. An example of such an item is, “My supervisor is skilled in human relations and the technical aspects of our job.” The reason for this admonition is that answers to double-barreled questions are difficult to interpret. For example, responding “[3] Undecided or Neutral” on a 1 to 5 Likert-type response scale (which contains [1] Strongly Disagree, [2] Disagree, [3] Undecided or Neutral, [4] Agree, and [5] Strongly Agree) to the above statement may signify that the respondent strongly disagrees that the supervisor has human relations skills (response = 1) and strongly agrees that the supervisor has job-related technical skills (response = 5). Other divergent opinions may also exist and cause a “3” response which does not apply to both the supervisor’s human relations and technical skills.

Item focus and leading questions. Based upon the construct and subdimension definitions, items should maintain a consistent perspective or focus. For example, they should assess only behavior or only affective responses. Perspective or focus should not be mixed unless the construct uses different perspectives or foci as subdimensions. Leading questions should be avoided as they may bias results. For example, asking respondents, “Isn’t the company’s pay not only fair but generous compared to other local employers?” may encourage more “Agree” and “Strongly Agree” responses than simply asking whether “The company’s pay is fair compared to other local employers.”

Avoid items with restricted variances and, possibly, reverse-scoring. Items that would be answered similarly by most respondents should not be included, as they typically produce restricted item variances and reduce scale internal reliability (Nunnally & Bernstein, 1994). Lastly, the scale developer should consider whether reverse-scored items should be employed. Simply put, item reversals can impair scale reliability and validity (Schriesheim, Eisenbach, & Hill, 1991) and may be unnecessary if attention check questions are included on the instrument that is used to collect study data. An example of an attention check is the item, “If you are reading and paying attention to each question, please respond to this question by marking ‘Strongly Agree’ as your answer or response.”

Summary. Figure 15.1 provides a brief summary of the item generation process as described above. Strictly speaking, the first listed items concern theory development prior to item generation, simply because sound items cannot be generated

1. Develop the construct’s theory, including subdimensions and subdimension weights.
2. Identify similar and dissimilar constructs and variables.
3. Construct a nomological network delineating empirically related and unrelated constructs and variables.
4. Develop clear definitions of the construct and its subdimensions, including its focus (e.g., attitude, perceived behavior).
5. Write 4–6 (or more) items for each subdimension:
   a. Each item should employ simple wording, and not be double-barreled.
   b. Each item should be at the appropriate reading level for intended respondents.
   c. Leading items should be avoided, along with items that have low variances.
   d. Reverse-worded and reverse-scored items can be replaced with attention check items to ensure that respondents are taking the time to read each item.
6. Proceed to the item evaluation process next.

Figure 15.1  Summary of item development process


if the construct’s or variable’s theory is incomplete or nonexistent.
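The restricted-variance advice summarized in Figure 15.1 (point 5c) is easy to operationalize on pilot data: compute each item’s response variance and flag near-constant items. Below is a minimal stdlib-only sketch with made-up Likert responses; the 0.5 cutoff is an illustrative assumption, not a published standard.

```python
import statistics

# Hypothetical pilot responses (rows = respondents, columns = items)
# on a 1-5 Likert scale; item 1 is answered "4" by nearly everyone.
responses = [
    [2, 4, 5, 1],
    [3, 4, 4, 2],
    [4, 4, 2, 5],
    [1, 4, 3, 4],
    [5, 4, 4, 1],
    [2, 5, 1, 3],
]

n_items = len(responses[0])
variances = [statistics.pvariance([row[i] for row in responses])
             for i in range(n_items)]

# Flag items whose variance suggests near-uniform responding;
# such items add little information and depress scale reliability.
flagged = [i for i, v in enumerate(variances) if v < 0.5]
print("item variances:", [round(v, 2) for v in variances])
print("restricted-variance items:", flagged)
```

In practice the cutoff should be set relative to the response scale’s range and the variances observed across the full item pool.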

Theoretical Content Validity Assessment In discussing content validity, Schwab (1980) describes several analytic options (including factor analysis of respondent-provided data) which are useful in the construct validation process. However, these data-analytic options do not directly address the issue of how well a measure theoretically represents an underlying content domain. As Jackson (1971) and Messick (1989) note, these traditional empirical analyses of actual respondent-provided field data indicate how the respondents perceive a set of items and therefore reflect the empirical correspondence between the construct and its operationalization. However, they do not directly address the issue of content validity, which involves determining the theoretical reasonableness of a measure’s item content.

Some content validation approaches. Although content validation has most often been seen as an inherently qualitative and judgmental process (cf. Nunnally & Bernstein, 1994; Guion, 1997; Kerlinger & Lee, 2000), several new approaches to quantitatively assessing the content validity of paper-and-pencil instruments have become available. Schriesheim et al. (1993), and Schriesheim, Cogliser, Scandura, Lankau, and Powers (1999) developed and investigated the usefulness of several variants of a factor-analytic approach to examining content validity. In an extension and refinement of this strategy, Hinkin and Tracey (1999) proposed and demonstrated the usefulness of analysis-of-variance (ANOVA) for analyzing the data collected through the Schriesheim et al. (1993, 1999) procedure. Interestingly, both the factor-analytic and ANOVA approaches have been found to be robust and informative for the drawing of conclusions about an instrument’s content validity, and they have been employed in a number of subsequent studies to date (e.g., Hinkin & Schriesheim, 2008; Colquitt, Baer, Long, & Halvorsen-Ganepola, 2014).

Content judging process.
Both analytic strategies rely on the same judgment process. Basically, subjects are presented with the items to be evaluated and the theoretical definitions that they are to use to evaluate each item’s content. The subjects then indicate the theoretical content that each item is measuring by rating each item on each dimension. Figure 15.2 provides an illustration of one format that can be used to elicit content evaluations. Additional examples of rating options are provided in Schriesheim et al. (1993, 1999), including further discussion of specific rating formats and alternatives. The employment status of the subjects supplying the judgments does not seem to make a substantial difference in obtained results, as long as the subjects are intelligent and literate enough to grasp how they are supposed to make and record judgments and as long as they are motivated to provide unbiased and thoughtful responses (Schriesheim et al., 1993; Hinkin & Tracey, 1999; Colquitt et al., 2014).

Summarizing and analyzing the data. The collected subject-provided data are summarized in a 2-dimensional “extended data matrix” (Gorsuch, 1983) with I columns representing the items, and J × K rows (representing the J subjects and the K theoretical content categories being evaluated). This matrix is next analyzed using either the exploratory factor-analytic procedures described in Schriesheim et al. (1993, 1999) or the ANOVA procedures outlined in Hinkin and Tracey (1999). The results are then examined and poorly performing items (i.e., those that do not measure their correct theoretical dimension) are eliminated from the item pool. If this results in the loss of too many items for any dimension, new items are created and the judging process repeated until sufficient content-valid items are identified for each dimension (remember that at this stage we want at least two times the expected final number of items per dimension).

Approach comparison. The two alternative methods (factor-analytic and ANOVA) described above may seem at first to provide redundant information but, actually, they do not. In addition to providing information about each item’s theoretical content (shown by the items’ loadings on the factor dimensions or subdimensions), the factor-analytic approach provides information about the theoretical distinctiveness of each dimension that is being measured (shown by the number of factors obtained).
Thus, if a dimension’s theoretical definition is poor (e.g., confusing or redundant with other definitions included in the analysis – see Schriesheim et al., 1993 for further discussion), or if its items are poor operationalizations, a separate factor for the dimension should not be obtained. On the other hand, the exploratory factor-analytic approach suggested by Schriesheim et  al. (1993) does not provide for statistical significance testing of an item’s content classification, something that can be accomplished by performing a confirmatory factor analysis (CFA) and something that the ANOVA approach provides as well. The ANOVA approach is also amenable for use in smaller-sized samples than is the exploratory or confirmatory factor-analytic approach, a feature that should allow its application in virtually any scale assessment study.
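The extended data matrix and the ANOVA logic can be illustrated with a toy example. Below, hypothetical judges rate three items against two definitional categories; the ratings, category labels, and the decision to use a per-item one-way F ratio are illustrative assumptions in the spirit of the Schriesheim et al. (1993) and Hinkin and Tracey (1999) procedures, not a reproduction of either published method.

```python
# Hypothetical content-validity ratings: J = 4 judges rate I = 3 items
# on K = 2 definitional categories, using a 1-5 "degree of fit" scale.
# ratings[j][k] is judge j's ratings of all items on category k.
ratings = [
    [[5, 2, 4], [1, 5, 2]],
    [[4, 1, 5], [2, 4, 1]],
    [[5, 2, 4], [1, 5, 2]],
    [[4, 1, 4], [2, 4, 2]],
]
J, K = len(ratings), len(ratings[0])
I = len(ratings[0][0])

# "Extended data matrix" (Gorsuch, 1983): J * K rows by I item columns.
extended = [ratings[j][k] for j in range(J) for k in range(K)]
assert len(extended) == J * K and len(extended[0]) == I

def one_way_F(groups):
    """One-way ANOVA F ratio across the rating groups for one item."""
    N = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Per item, compare its ratings across the K categories: a large F with
# the highest mean on the intended category supports the item's content.
for i in range(I):
    groups = [[ratings[j][k][i] for j in range(J)] for k in range(K)]
    means = [sum(g) / J for g in groups]
    print(f"item {i}: category means = {means}, F = {one_way_F(groups):.1f}")
```

Real applications would also test the F ratios for significance and, for the factor-analytic variant, factor the extended matrix rather than run per-item ANOVAs.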



The purpose of this questionnaire is to determine what type of supervisor behavior is described by various statements. Beginning on the next page is a list of statements which can be classified as expressing or measuring Directive Supervisor Behavior (supervisor behavior that provides guidance and direction to subordinates concerning how to perform their jobs), Supportive Supervisor Behavior (supervisor behavior that shows concern for the comfort and well-being of subordinates), or Neither (supervisor behavior that is neither Directive nor Supportive). These statements come from a widely used supervisory behavior questionnaire, and we believe that you can help advance knowledge on organizations by indicating the degree to which each statement is concerned with Directive, Supportive, and Neither directive nor supportive supervisor behavior. We appreciate and thank you in advance for your participation.

INSTRUCTIONS: For each of the statements which appear on the following three pages:
A. Read each statement carefully.
B. Decide on the degree to which the statement refers to the type of supervisor behavior that you are being asked to rate.
C. Circle the number for each statement that indicates the degree to which the statement reflects the type of supervisor behavior you are rating. Use the following response scale: 1 = None or hardly any, 2 = Some, 3 = Moderately or about half, 4 = Much, and 5 = Completely or almost completely.
Please read and rate all of the statements, being careful not to omit or skip any. If you have any questions, please be sure to ask for help.

EXAMPLES: [Examples were given of how to rate statements high, moderate, and low.]

Now, begin on the next page. Please remember to rate each statement carefully and not omit or skip any. Use the definition of supervisor behavior given at the top of each page in making your ratings for that page. Thanks again.

Definitions [One definition appears at the top of each page, followed by the item response categories and the supervisor behavior items to be rated.]
Directive Supervisor Behavior is supervisor behavior that provides guidance and direction to subordinates concerning how to perform their jobs.
Supportive Supervisor Behavior is supervisor behavior that shows concern for the comfort and well-being of subordinates.
Neither is supervisor behavior that is neither Directive nor Supportive.
5 = Completely or almost completely
4 = Much
3 = Moderately or about half
2 = Some
1 = None or hardly any

Figure 15.2  Sample content validity rating questionnaire Source: Based on Schriesheim et al., 1993, Appendix A.

Measure Administration, Item, and Factor Analyses As Schriesheim et al. (1999) indicate, content adequacy cannot be directly examined by traditional quantitative approaches – such as the factor analysis of respondent self-reports of their experiences – since content validity involves judgment concerning fidelity to abstract theoretical constructs or subdimensions. However, these types of analyses do provide very useful evidence on construct validity, and obtained results should agree with predictions that are made from the theory underlying the construct (so that if either “theoretical” or “empirical” content assessments show deficiencies, an instrument’s construct validity should be questioned; cf. Jackson, 1971; Messick, 1989). In fact, experience with the Schriesheim et al. content validity assessment process indicates that its application often eliminates troublesome scale items and yields measures that have factor structures (in subsequent empirical research) which are congruent with the theory of the constructs that they were intended to measure.


Item administration. Once the theoretical content validity assessment has been completed and yielded a suitable item pool, the new measure should be administered to appropriate sample(s) for empirical refinement. The sample(s) to which the measure is administered should be drawn from the population(s) in which the new measure is intended to be used. The researcher must also decide whether to use a concurrent or predictive study design. (In a concurrent design, all measures are collected at the same time. In a predictive design, at least some dependent variables are measured at a later time.) In both designs, required sample sizes should be determined based upon needed statistical power (Cohen, 1988), the different data analyses which are envisioned, and predicted respondent attrition (due to incomplete responses, not completing a second data collection, etc.). Accompanying the new measure in its administration should be other measures of related and unrelated variables (e.g., employee absence frequency) and constructs (e.g., satisfaction with supervision). These additional measures are meant to be used to obtain criterion-related validity evidence and evidence of the convergence and divergence of the construct with other constructs.

Initial item pool factor analysis. Once the collected data have been cleaned (to eliminate incomplete responses, careless respondents, etc.), the first analysis that should be performed with the new measure is a factor analysis – to determine whether the items’ empirical structure fits with the construct’s theory and whether the items are measuring their proper theoretical dimensions. For these analyses, exploratory (EFA) or confirmatory factor analysis (CFA) can be employed. Generally, the authors prefer starting with confirmatory factor analysis because it directly tests the obtained empirical item structure against the (hypothesized) theoretical item structure.
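For the sample-size decision mentioned above, Cohen’s (1988) power framework for correlations can be approximated in a few lines using the Fisher z transformation of r. The function below is a textbook normal-approximation sketch (values rounded up), not the exact power computation, and the example effect size is an illustrative assumption.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a population correlation r with a
    two-sided test, via the Fisher-z normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical z for alpha
    z_b = NormalDist().inv_cdf(power)           # z for desired power
    c = 0.5 * math.log((1 + r) / (1 - r))       # Fisher z transform of r
    return math.ceil(((z_a + z_b) / c) ** 2 + 3)

# Detecting a medium effect (r = .30) at alpha = .05 with power = .80:
print(n_for_correlation(0.30))
```

The result should then be inflated for expected attrition and checked against the requirements of the planned factor-analytic models, which typically need larger samples than a single correlation test.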
If a satisfactory fit of the hypothesized confirmatory factor structure with the data is not obtainable, exploratory factor analysis can often provide insight as to the actual structure of the data and may be employed. Generally, because measures contain error and constructs are hypothesized to be independent, orthogonal rotation of a principal axes or common factor solution may be especially appropriate for these exploratory analyses.

Simple structure. In measure construction, the factor structure that we would want to obtain is one in which the number of factors equals the number of theoretical constructs that the items are supposed to represent. Ideally, all items will show “simple structure” – they will have a statistically significant (for CFA) or meaningful (for EFA) loading (e.g., > .40; Ford, MacCallum, & Tait, 1986) on only one factor. If the obtained factor structure is not what was theorized, then we must conclude that the theory is flawed and try to understand in what way(s) and what might be the cause(s) of our discrepant results. Items with complex structures (i.e., that have significant or meaningful loadings on more than one factor) generally should be eliminated from further consideration for inclusion in the construct measure.

Item analysis. Finally, in addition to examining the factor loadings to identify best-performing items, the internal reliability of each subdimension or construct should be assessed, and the contribution of each item to measurement internal reliability determined. At this time, items that diminish the reliability of the measure are removed (unless there is a good reason for retention – such as domain sampling adequacy). The result should be construct or subdimension measures that have internal reliabilities (such as coefficient alpha) of .70 or higher (Nunnally, 1978).
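Coefficient alpha and the “alpha if item deleted” diagnostic can be computed directly from the item-response matrix. Below is a stdlib-only sketch with hypothetical data; as noted above, real retention decisions should also weigh domain sampling adequacy, not reliability alone.

```python
import statistics

def cronbach_alpha(items):
    """Coefficient alpha; items is a list of per-item response lists,
    all covering the same respondents in the same order."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]          # scale scores
    item_var = sum(statistics.pvariance(it) for it in items)
    return k / (k - 1) * (1 - item_var / statistics.pvariance(totals))

# Hypothetical 4-item scale administered to 6 respondents (1-5 scale).
items = [
    [4, 5, 2, 3, 4, 1],
    [5, 5, 3, 3, 4, 2],
    [4, 4, 2, 2, 5, 1],
    [3, 5, 2, 4, 4, 2],
]

print(f"alpha = {cronbach_alpha(items):.2f}")
# "Alpha if item deleted": a rise after deletion flags a weak item.
for i in range(len(items)):
    rest = items[:i] + items[i + 1:]
    print(f"alpha without item {i}: {cronbach_alpha(rest):.2f}")
```

An item whose deletion raises alpha is a candidate for removal; here all four items contribute, so the full scale clears the .70 benchmark comfortably.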

Securing Criterion-Related Validity Evidence Up to this point we have been focusing on concerns about the new measure’s internal content and structure. We now turn to examine the new measure’s relationships with other constructs and variables. When we administered the new measure to a sample that was drawn from a population that we wish to study, we obtained data on the new measure and also on theoretically related and unrelated constructs and variables. These are the data that will now be used in our analyses. Our well-developed theory of the construct should delineate other variables and constructs to which it should be related. For example, if we are creating a new construct and new measure of satisfaction with supervision, our theory might specify that satisfaction with one’s supervisor should be negatively and strongly related to the intent to quit or turn over from the current employer in the coming calendar year. If our data collection contained our new supervision satisfaction measure and a measure designed to assess turnover intent, we would expect to see at least a modest and statistically significant negative correlation between the two. If the data were collected concurrently, finding such a relationship would constitute supportive concurrent criterion-related evidence. Of course, predictive criterion-related evidence would be obtained if the turnover intent data were collected subsequent to collecting the data on satisfaction with supervision. A good theory about our satisfaction with supervision construct and measure should specify an appropriate time period for supportive predictive evidence to be found. Of course, the more predictions about the construct’s relationships with other variables and constructs are supported, the stronger the concurrent and/or predictive validity evidence.

Examining Convergent and Discriminant Validity Having supportive concurrent and predictive evidence increases confidence in the validity of our new construct (and its underlying theory and proposed nomological network), as well as that of its associated measure. Because, as part of our theorizing about the construct, we earlier identified other theoretically relevant variables and constructs with which it should show convergence and divergence, we can now directly assess these predicted relationships, assuming that we have measured the necessary variables when we were collecting item analysis and criterion-related data from the sample employed in those analyses. Multitrait-multimethod matrix and related analyses. The examination of convergent and divergent relationships is inherent in Cronbach and Meehl’s (1955) idea of examining a construct’s nomological network as a key step in the construct validation process. However, Campbell and Fiske (1959) were the first scholars to formalize and systematize the convergent-divergent analysis process and to propose specific criteria for evaluating the convergence and discrimination of a construct and its measure. Although the Campbell and Fiske (1959) approach is not currently considered the best approach for examining convergence and divergence, it can be used when other approaches cannot be successfully employed, for example, due to such things as data and parameter estimation problems (see discussion below). Campbell and Fiske (1959) proposed creating and testing a “Multitrait-Multimethod Matrix” (MTMM) for evidence of convergence and discrimination. Convergence refers to the relationship between measures that should be theoretically related, while discrimination refers to the relationship between supposedly unrelated measures. Convergence. In conducting an MTMM analysis the starting point is the construction of a correlation matrix between similar and dissimilar construct measures. 
The individual correlations in the matrix are then subjected to several comparisons that are designed to assess convergence and discrimination. The criterion for establishing convergence is that the “validity coefficients” (correlations between the same trait measured by different methods) should be statistically significant and large enough to be considered meaningful (Cohen, 1988, suggests that a large effect size translates to r = 0.5 and η² = 0.14). To obtain alternative measures of the constructs being assessed, either suitable existing measures need to be identified and employed or secondary measures need to be constructed using items specifically developed for this process.

Discrimination. The first MTMM criterion for supporting discrimination is that relationships between similar traits measured by different methods should exceed the relationship between different traits measured by different methods. Campbell and Fiske (1959) did not specify a procedure for testing the significance of this pattern, but Gillet and Schwab (1975) proposed that the obtained proportion of supportive relationships be tested against .50 (assuming that half the relationships would be greater by chance alone). The second discrimination criterion is that a trait’s validity coefficients should exceed the relationships between that trait and other traits measured by the same method. Using the same proportion testing approach (as for the first discrimination criterion) allows significance testing for this criterion. The third and final discrimination criterion is that the pattern of trait correlations be the same among all different trait-same method and different trait-different method relationships. Again, Gillet and Schwab (1975) suggest accomplishing this by rank-ordering the correlations and computing Kendall’s coefficient of concordance (W). Finding significant agreement indicates that the third criterion is met. (See Schriesheim and DeNisi, 1980, for a simple example of the Campbell and Fiske MTMM process outlined above.)

As an alternative to the three discrimination tests summarized above, the Fornell and Larcker (1981) criterion is one of the most popular techniques used to check the discriminant validity of measurement models. According to this criterion, the square root of the average variance extracted by a construct must be greater than the correlation between the construct and any other construct.
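The Fornell–Larcker check is straightforward to compute once standardized loadings and the latent correlation are in hand. In the sketch below, the construct names, loadings, and correlation are invented for illustration; in practice they would come from the estimated CFA measurement model.

```python
import math

# Hypothetical standardized loadings for two constructs, plus their
# estimated latent correlation (e.g., from a CFA).
loadings = {"sat_supervision": [0.78, 0.81, 0.74],
            "turnover_intent": [0.70, 0.76, 0.82]}
phi = 0.45   # estimated correlation between the two constructs

def ave(lams):
    """Average variance extracted: mean squared standardized loading."""
    return sum(l * l for l in lams) / len(lams)

# Fornell-Larcker: sqrt(AVE) must exceed the construct's correlation
# with every other construct in the model.
for name, lams in loadings.items():
    root_ave = math.sqrt(ave(lams))
    verdict = "passes" if root_ave > abs(phi) else "fails"
    print(f"{name}: sqrt(AVE) = {root_ave:.2f} ({verdict} vs r = {phi})")
```

With more than two constructs, the same comparison is repeated for every pair, typically by tabulating sqrt(AVE) on the diagonal of the latent correlation matrix.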
Analysis of variance of an MTMM matrix. Kavanagh, MacKinney, and Wollins (1971) developed an analysis-of-variance (ANOVA) approach to testing for measure convergence and discrimination and for the presence of “method and source bias” (as an alternative to the Campbell and Fiske, 1959, approach). Method and source bias refers to relationships that are obtained between measures solely due to the use of data collected from the same sample by the same data collection method. For example, if we were to correlate two different measures of satisfaction with supervision that were collected via the same questionnaire that was administered to the same sample at the same time, the obtained relationship is likely to be impacted (inflated) by common source and method bias. The Kavanagh et al. (1971) approach can use the MTMM correlation matrix as its data source, and applying this approach is relatively simple and easy. It also represents a second way of looking at a measure’s convergence and discrimination, and it may be used when data and estimation problems prevent the application of another method. In general, with the advent and popularization of confirmatory factor analysis (CFA) and structural equation modeling (SEM), both the Campbell and Fiske and Kavanagh et al. approaches have seen substantially reduced use; CFA and SEM analytic methods are now more common, and we now turn to discuss these.

MTMM CFA models. A recent advance in the use of CFA for examining data structures is the “correlated uniqueness model” for analyzing MTMM data. Traditionally, MTMM data have been examined using CFA models which specified separate trait, method, and error (uniqueness) factors (cf. Widaman, 1985; Schmitt & Stults, 1986). In these traditional models, the uniqueness factors are specified to be uncorrelated within themselves or with any other factors. However, the trait factors are correlated among themselves, as are the method factors (the traits are uncorrelated with the methods). While allowing the examination of MTMMs from a perspective akin to that of Campbell and Fiske (1959), these “method factor models” have consistently encountered a number of parameter estimation problems. Sometimes, they have had identification problems that have led to the nonconvergence of parameter estimates. Sometimes, they have had parameter estimates that are theoretically inadmissible (e.g., negative error variances, factor correlations in excess of 1.0, etc.). Finally, although there is no way of being certain, these method factor models are believed to overestimate methods effects by including some true trait variance as part of their method variance estimates (see Kenny & Kashy, 1992, and Marsh, 1989, for further details).
Correlated uniqueness models, in contrast to method factor models, have been shown to be more likely to be identified and to have parameter estimates which are theoretically admissible (Marsh, 1989; Kenny & Kashy, 1992). These models specify that the latent variables (“trait factors”) are intercorrelated among themselves, and that the uniqueness or error terms for each separate method are correlated among themselves. However, the uniquenesses are not correlated across methods, nor are they correlated with the traits. As Kenny and Kashy (1992, p. 169) note, “restricting the method-method correlations to zero seems to be a very strong assumption” but is generally needed for method factor model identification. However, not estimating method-method covariances causes


the amount of trait-trait covariance to be overestimated, yielding conservative estimates of discriminant validity (Kenny & Kashy, 1992). An example of the use of this method, as well as of the entire measure development process, is presented in Neider and Schriesheim (2011). In this study, the modeling process was simplified and the likelihood of model convergence increased by employing the correlated uniqueness model and a single-indicator approach (which involves setting a measure’s error variance equal to the measure’s variance times one minus its internal consistency reliability; Williams & Hazer, 1986). Maximum likelihood LISREL (Jöreskog & Sörbom, 2006) was employed for the modeling process, and the recommendations of Marsh (1989) and Kenny and Kashy (1992) were followed; the modeling process followed the basic sequence outlined in Widaman (1985) and Schmitt and Stults (1986). Comparisons of model fit were based upon the chi-square difference test, which allows the examination of statistical significance. Additionally, Marsh (1989) specifically recommends the use of the Tucker–Lewis Index (TLI; Tucker & Lewis, 1973), also known as the Non-Normed Fit Index (NNFI), to assess MTMM model fit with the correlated uniqueness model. Finally, as discussed by Browne and Cudeck (1993), Medsker et al. (1994), and Hu and Bentler (1999), the standardized root mean square residual (Std. RMR) and the root mean square error of approximation (RMSEA) (and its confidence interval) were used as additional indicators of model fit.
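The single-indicator computation and the fit indices just mentioned follow standard formulas; the sketch below illustrates them in Python with hypothetical values (not figures from the Neider and Schriesheim study):

```python
import math

def single_indicator_error_variance(variance, reliability):
    """Williams & Hazer (1986): fix the indicator's error variance at
    the measure's variance times (1 - internal consistency reliability)."""
    return variance * (1.0 - reliability)

def tli(chisq_model, df_model, chisq_null, df_null):
    """Tucker-Lewis Index / NNFI computed from the target model and
    the null (baseline) model chi-squares and degrees of freedom."""
    ratio_null = chisq_null / df_null
    ratio_model = chisq_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)

def rmsea(chisq_model, df_model, n):
    """RMSEA point estimate (zero when chi-square does not exceed df)."""
    return math.sqrt(max(chisq_model - df_model, 0.0) / (df_model * (n - 1)))

# Hypothetical measure: observed variance 4.0, coefficient alpha .80
print(round(single_indicator_error_variance(4.0, 0.80), 3))  # 0.8
# Hypothetical fit: model chi2 = 60 on 40 df, null chi2 = 1000 on 45 df, N = 200
print(round(tli(60, 40, 1000, 45), 3))   # 0.976
print(round(rmsea(60, 40, 200), 3))      # 0.05
```

With conventional cutoffs (e.g., TLI near .95 and RMSEA at or below .06; Hu & Bentler, 1999), these hypothetical values would suggest acceptable fit.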

Replication and Development of Norms Because factors such as sample demographics and organizational culture affect obtained results, and because the measure development process entails rewording, deleting, and adding new items, the analyses summarized above should be repeated with one or more (preferably more) additional samples. As MacKenzie et al. (2011, p. 317) note, “This is important in order to assess the extent to which the psychometric properties of the scale may have been based on idiosyncrasies in the developmental sample of data and to permit a valid statistical test of the fit of the measurement model.” These new analyses should be used to further refine the measure and to provide additional evidence on content, concurrent, predictive, and construct validity. As a final step in the new measure development process, norms need to be developed. MacKenzie et al. (2011) discuss characteristics that need to be



considered in the development of norms, but for our purposes suffice it to say that, to interpret an obtained score, context is needed. For example, does a score of 83% on a calculus exam indicate high, moderate, or low understanding of calculus? Whether the 83% is high, low, or in between depends upon the distribution of scores. Norms provide historical data that allow one to better interpret the meaning of any particular score. Consequently, norms are critical for proper score interpretation.
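As a concrete sketch of how norms contextualize a score, the percentile rank of an obtained score within a normative distribution can be computed directly (the norm scores below are hypothetical):

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative sample scoring strictly below the
    obtained score (one common definition; others count ties as half)."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of 10 prior exam scores (percent correct)
norms = [55, 62, 68, 71, 75, 79, 84, 88, 91, 95]
print(percentile_rank(83, norms))  # 60.0 -> the "83%" is only moderate here
```

Against this particular norm group, 83% falls at the 60th percentile; against an easier exam’s norms, the same raw score could be far lower, which is exactly why norms matter for interpretation.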

SUMMARIZING AND INCREASING THE EVIDENCE Based upon the evidence provided by the processes described above, a conclusion can be drawn concerning the new construct’s and measure’s construct validity. This conclusion would be based upon evidence of measure reliability (a measure must be reliable to be valid; Nunnally & Bernstein, 1994), content validity, criterion-related validity (in conformity with its nomological network), and convergence and divergence with similar and dissimilar constructs. This conclusion would need to be periodically reviewed and reassessed because of the accumulation of additional data on the construct and measure as it is used in future research. As Nunnally and Bernstein (1994) note, validation should be considered a never-ending process. To further strengthen the evidence on construct validity, additional future studies should be contemplated. The following types of studies may be especially informative (for more details, see Ghiselli, Campbell, & Zedeck, 1981): 1 “Process analysis” of how and why respondents answer a particular set of questions in a particular manner. This can help ensure that the process used in responding is close to what the instrument developer conceptualized when the measure was developed. For example, when respondents complete a Leader–Member Exchange (LMX) instrument, what are the exchanges upon which they base their responses? 2 Because correlation and factor-analytic results can depend on the sample and on the variables included, additional item and factor analyses need to demonstrate that the clustering of item intercorrelations is consistent with the nature of the factors specified in the construct’s theoretical definition. For example, if the construct is defined as having two distinct sub-dimensions, then two appropriate





groupings of items should be found in these analyses. Similarly, if a construct is theoretically unidimensional, then having multiple item groupings would be indicative of poor construct validity. 3 Based upon the theory underlying the construct, its measure should have appropriate levels of reliability. As an example, for a construct that is theorized as unidimensional and stable over time, both internal consistency and test-retest reliabilities should be high. In comparison, if the construct is a self-perceived individual difference attribute, then moderate or low inter-rater reliabilities may be more indicative of construct validity. 4 More studies of the construct’s nomological network should be undertaken to expand knowledge concerning the construct and to validate the theory upon which the construct is based. This type of study is currently over-emphasized by the field, so that it is sometimes the only construct validity evidence available on a measure. Alone, evidence of this type is likely to be insufficient for drawing sound conclusions about an instrument’s construct validity. 5 The construct measure should be examined for potential sources of bias and invalid variance, by correlating it with other variables (such as affect, social desirability, etc.) and by modeling these effects using analytic approaches such as those described by Podsakoff et al. (2003). Uncovered problems should lead either to revision of the construct measure (to ameliorate the problems), to greater use of experimental methods, or to the use of statistical controls for confounding factors in field-collected data. 6 Finally, experimental studies may be particularly valuable when the construct is used as the dependent variable or a cross-classification variable.
For example, if good and poor-quality leader–member exchange is experimentally created in two groups, a significant mean between-group difference on an LMX measure substantially helps increase our confidence in the construct validity of that measure.
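Such an experimental check reduces to a simple between-group comparison; a minimal sketch using Welch’s two-sample t statistic on hypothetical LMX scores:

```python
import math
from statistics import mean, variance

def welch_t(group1, group2):
    """Welch's t statistic for the mean difference between two
    independent groups (unequal variances allowed)."""
    m1, m2 = mean(group1), mean(group2)
    v1, v2 = variance(group1), variance(group2)  # sample variances (n - 1)
    return (m1 - m2) / math.sqrt(v1 / len(group1) + v2 / len(group2))

# Hypothetical LMX scores after "good" vs. "poor" exchange manipulations
good = [4.2, 4.5, 4.1, 4.4, 4.3]
poor = [2.1, 2.4, 2.0, 2.3, 2.2]
print(round(welch_t(good, poor), 1))  # a large t supports the manipulation
```

A large, significant t in the predicted direction is the kind of between-group evidence the passage describes; in practice one would also report the p-value and an effect size.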

Ghiselli et al. (1981, p. 287) specifically note, and we strongly and emphatically agree, that more construct validity can be attributed to a measure when a mixture of study types is used than when the same type is repeated multiple times. Thus, scholars’ apparent preoccupation with “substantive” or “criterion-related” validity to the virtual exclusion of other types of investigations can be seen as highly problematic.


CONCLUSION We believe that social science theory and research can advance only if we pay more attention to measurement issues in general and construct validity in particular. If this does not occur, we will continue to have cycles where initial interest in a phenomenon is followed by the enthusiastic adoption and use of a particular measure (or set of measures). Disappointment then sets in as shortcomings of the measure(s) slowly become apparent and findings are increasingly seen as artifacts of the research and measurement process. This ultimately may lead to the abandonment of entire areas of investigation. Clearly, if we do not take construct measurement more seriously, these “boom and bust” cycles will not only continue but may become more widespread and serious. As expressed at the beginning of this chapter, without valid measurement we are likely to know very little or nothing about the phenomena that we wish to study (Korman, 1974).

REFERENCES American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: AERA, APA, and NCME. Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–62). Newbury Park, CA: Sage. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. Colquitt, J. A., Baer, M. D., Long, D. M., & Halvorsen-Ganepola, M. D. K. (2014). Scale indicators of social exchange relationships: A comparison of relative content validity. Journal of Applied Psychology, 99(4), 664. Colquitt, J. A., Sabey, T. B., Rodell, J. B., & Hill, E. T. (2019). Content validity guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness. Journal of Applied Psychology, 104(10), 1243–65. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.


Ford, J. K., MacCallum, R. C., & Tait, M. (1986). The application of exploratory factor analysis in applied psychology: A critical review and analysis. Personnel Psychology, 39(2), 291–314. Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18(1), 39–50. Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco, CA: Freeman. Gillet, B., & Schwab, D. P. (1975). Convergent and discriminant validities of corresponding Job Descriptive Index and Minnesota Satisfaction Questionnaire scales. Journal of Applied Psychology, 60(3), 313. Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum. Guion, R. (1997). Content validity. In L. Peters, S. Youngblood, & C. Greer (Eds.), The Blackwell dictionary of human resource management (pp. 58–9). Oxford, UK: Blackwell. Guion, R., & Schriesheim, C. (1997). Validity. In L. Peters, S. Youngblood, & C. Greer (Eds.), The Blackwell dictionary of human resource management (pp. 380–1). Oxford, UK: Blackwell. Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7(3), 238–47. Hinkin, T. R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967–88. Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–21. Hinkin, T. R., & Schriesheim, C. A. (2008). An examination of “nonleadership”: From laissez-faire leadership to leader reward omission and punishment omission. Journal of Applied Psychology, 93(6), 1234–48. Hinkin, T. R., & Tracey, J. B. (1999). An analysis of variance approach to content validation. Organizational Research Methods, 2, 175–86. Hoelter, J. W. (1983). The analysis of covariance structures: Goodness-of-fit indices. Sociological Methods & Research, 11(3), 325–44. Hoppock, R. (1935). Job satisfaction. New York: Harper. Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55. Jackson, D. N. (1971). The dynamics of structured personality tests: 1971. Psychological Review, 78(3), 229.



Jöreskog, K. G., & Sörbom, D. (2006). LISREL 8.80. Lincolnwood, IL: Scientific Software International. Kavanagh, M. J., MacKinney, A. C., & Wolins, L. (1971). Issues in managerial performance: Multitrait-multimethod analyses of ratings. Psychological Bulletin, 75(1), 34. Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112(1), 165. Kerlinger, F. N., & Lee, H. B. (2000). Foundations of behavioral research (4th ed.). Fort Worth, TX: Harcourt. Korman, A. K. (1974). Contingency approaches to leadership: An overview. In J. G. Hunt & L. L. Larson (Eds.), Contingency approaches to leadership (pp. 189–95). Carbondale, IL: Southern Illinois University Press. MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35(2), 293–334. Marsh, H. W. (1989). Confirmatory factor analyses of multitrait-multimethod data: Many problems and a few solutions. Applied Psychological Measurement, 13(4), 335–61. Medsker, G. J., Williams, L. J., & Holahan, P. J. (1994). A review of current practices for evaluating causal models in organizational behavior and human resources management research. Journal of Management, 20(2), 439–64. Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed.). Washington, DC: American Council on Education. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–9. Neider, L. L., & Schriesheim, C. A. (2011). The authentic leadership inventory (ALI): Development and empirical tests. The Leadership Quarterly, 22(6), 1146–64. Newton, P. E., & Shaw, S. D. (2013). Standards for talking and thinking about validity. Psychological Methods, 18(3), 301–19. Nunnally, J. C. (1978).
Psychometric theory (2nd ed.). New York: McGraw-Hill. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879. Schmitt, N., & Stults, D. M. (1986). Methodology review: Analysis of multitrait-multimethod

matrices. Applied Psychological Measurement, 10(1), 1–22. Schriesheim, C. A. (1997). Criterion-related validity. In L. H. Peters, S. A. Youngblood, & C. R. Greer (Eds.), The Blackwell dictionary of human resource management (p. 67). Oxford, UK: Blackwell. Schriesheim, C. A., & Castro, S. L. (1996). Referent effects in the magnitude estimation scaling of frequency expressions for response anchor sets: An empirical investigation. Educational and Psychological Measurement, 55(4), 557–69. Schriesheim, C. A., Cogliser, C. C., Scandura, T. A., Lankau, M. J., & Powers, K. J. (1999). An empirical comparison of approaches for quantitatively assessing the content adequacy of paper-and-pencil measurement instruments. Organizational Research Methods, 2(2), 140–56. Schriesheim, C. A., & DeNisi, A. S. (1980). Item presentation as an influence on questionnaire validity: A field experiment. Educational and Psychological Measurement, 40(1), 175–82. Schriesheim, C. A., Eisenbach, R. J., & Hill, K. D. (1991). The effect of negation and polar opposite item reversals on questionnaire reliability and validity: An experimental investigation. Educational and Psychological Measurement, 51(1), 67–78. Schriesheim, C. A., & Novelli, L., Jr. (1989). A comparative test of the interval-scale properties of magnitude estimation and case III scaling and recommendations for equal-interval frequency response anchors. Educational and Psychological Measurement, 49(1), 59–74. Schriesheim, C. A., Powers, K. J., Scandura, T. A., Gardiner, C. C., & Lankau, M. J. (1993). Improving construct measurement in management research: Comments and a quantitative approach for assessing the theoretical content adequacy of paper-and-pencil survey-type instruments. Journal of Management, 19(2), 385–417. Schwab, D. P. (1980). Construct validity in organizational behavior. Research in Organizational Behavior, 2, 3–43. Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002).
Experimental and quasi-experimental designs for generalized causal inference (2nd ed.). Boston: Houghton Mifflin. Shaffer, J. A., DeGeest, D., & Li, A. (2016). Tackling the problem of construct proliferation: A guide to assessing the discriminant validity of conceptually related constructs. Organizational Research Methods, 19(1), 80–110. Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38(1), 1–10. Vogt, D. S., King, D. W., & King, L. A. (2004). Focus groups in psychological assessment: Enhancing content validity by consulting members of the


target population. Psychological Assessment, 16(3), 231–43. Weiss, D. J., Dawis, R. V., England, G. W., & Lofquist, L. H. (1967). Manual for the Minnesota Satisfaction Questionnaire: Minnesota studies in vocational rehabilitation. Minneapolis: Industrial Relations Center, University of Minnesota. Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod


data. Applied Psychological Measurement, 9(1), 1–26. Williams, L. J., & Hazer, J. T. (1986). Antecedents and consequences of satisfaction and commitment in turnover models: A reanalysis using latent variable structural equation methods. Journal of Applied Psychology, 71(2), 219.

16 Item-level Meta-analysis for Re-examining (and Initial) Scale Validation: What Do the Items Tell Us? Nichelle C. Carpenter and Bulin Zhang

The purpose of this chapter is to describe the merits and technical steps associated with item-level meta-analysis. This analysis is conducted at the level of the scale items, rather than the overall scale itself, and provides necessary empirical support for a scale’s construct validity that cannot be determined at the scale level. There are many ways to provide evidence of construct validity and, in this chapter, we focus on the specific inference that the items comprising a scale actually represent the scale’s intended construct. This results in cumulative information about each item on the scale and the internal structure of the scale which, in turn, can be used to justify an item’s inclusion or removal, and demonstrate the degree to which items represent the intended construct. We briefly review scale validation techniques and how item-level meta-analysis can address some of the limitations of existing practices. We also describe examples from the literature that show how item-level meta-analytic techniques have been applied to (a) initially validate a scale and (b) re-evaluate construct validity evidence for an existing scale. We conclude with a discussion of future directions and important areas where item-level meta-analysis is likely to provide useful insights for researchers and practitioners. Although we note that the techniques we describe can be applied to shortened scales, this is not the focus of our chapter

(interested readers should refer to Chapter 8 that covers this topic in considerable detail). Our chapter advances item-level meta-analysis as a procedure that complements and extends the scale-level steps that are commonly used to validate scales.

LIMITATIONS OF COMMONLY USED SCALE VALIDATION PROCEDURES When introducing new scale measures, scholars typically exert great efforts to demonstrate construct validity evidence – for example, that the items empirically represent the intended scale construct (e.g., via factor analysis) – and follow the best practices and standards. Unfortunately, the fly in the ointment is that either one or a few steps may be carried out in a limited way or, alternatively, the established steps are limited themselves. In order to continuously strengthen construct validity evidence for new and existing scales, it is important that we first review the drawbacks in how we initially and continuously validate scales, which helps to position the merits of an item-level meta-analytic approach. Limited use of content validation. First, scale validation processes are limited in how


items are initially evaluated for inclusion on the scale. Specifically, content validation is used less frequently and in a limited way when compared with other validation methods, such as factor analysis (Colquitt et  al., 2019). Content validation describes the procedure of determining the extent to which the content of an item (i.e., each item’s wording) on a scale represents the scale’s construct (e.g., Hinkin & Tracey, 1999; see also substantive validity, Anderson & Gerbing, 1991). Such examinations evaluate construct-related validity in a subjective manner via rater judgments of whether (or the extent to which) the item represents the construct (e.g., Anderson & Gerbing, 1991; Colquitt et  al., 2019). Scale development studies that evaluate content validity evidence typically focus on the correspondence between a scale’s items and the intended construct’s definition (i.e., definitional correspondence; Colquitt et  al., 2019). This process is valuable because it typically results in removing items that are conceptually incongruent with the scale construct’s definition (e.g., Hinkin, 1998). However, this technique is often applied in a restrictive manner when an item’s content is compared to only one construct definition, that representing the scale. This can result in a scale retaining items that are ambiguous because the item content actually references more than one construct, even if it does represent the scale construct. It is important that scholars evaluate item content in a comprehensive manner (i.e., considering multiple constructs) to identify not only whether items represent the intended construct but also to ensure that scale items do not (unintentionally) reflect constructs outside of the scale’s intended domain (e.g., Haynes et al., 1995; Ferris et al., 2008). 
In other words, whereas it is important to demonstrate that scale A’s item corresponds to Construct A as intended, it is also necessary to ensure that this item is not also consistent with Construct B and Construct C. One example in the literature of scale developers who took a comprehensive approach to content validation is from Klein et  al. (2014). The authors developed a new scale measuring commitment and, as part of their process of selecting and refining the scale items, conducted several thorough steps to evaluate the content of the items. Importantly, not only were the items compared to the construct definition of commitment (the intended construct), but also to definitions of alternative constructs such as satisfaction and identification. Specifically, the final four scale items were evaluated by subject matter experts (SMEs; N = 23), who rated the extent to which each of these items represented the commitment definition versus definitions from other constructs (e.g., satisfaction, turnover intentions). The SMEs


were presented with definitions from the different constructs and asked to sort each proposed scale item into the construct category they judged it to best fit. The items retained for the final scale were judged to represent the “commitment” definition 94% of the time. Limited use of multiple samples. Common scale validation processes are also limited by their focus on separate individual samples. It is certainly commonplace for scale developers to collect multiple samples for the purpose of conducting analyses for construct validation purposes (e.g., factor analysis, reliability, nomological network examinations). However, researchers typically conduct these analyses in piecemeal fashion, on a single sample at a time. For example, researchers may use one independent sample to examine the scale correlates, and another sample for factor analysis to evaluate the scale’s structure. This is problematic primarily because an individual sample is subject to sampling error due to its small size relative to the population, which often results in distorted item parameter estimates. As a result, it becomes difficult to gauge the likely generalizability of the scale’s factor structure, and of each item’s descriptive statistics and factor loadings. Thus, although a scale may demonstrate the theorized and expected dimensionality in one specific sample, such patterns may not necessarily emerge when more samples are simultaneously evaluated. It is important to not only evaluate how the new scale performs in a single sample, but also how the scale performs when multiple samples are combined, as this is an important way to understand the scale without the distorting influence of sampling error (i.e., via increased sample size). Klein et al. (2014) also serve as an example of how scale validation efforts may use multiple samples to address concerns with sampling error. The authors stated that their new commitment scale was intended to apply across a variety of workplace contexts. 
As such, the authors empirically evaluated the scale across five different workplace and occupational settings: (a) a representative sample of employees from potential jurors (N = 1,003); (b) hospital employees (N = 523); (c) manufacturing plant line workers (N = 374); (d) undergraduate management students (N = 348); and (e) university alumni from a human resources program (N = 239). This new measure was developed to be used with different commitment targets (e.g., “commitment to the” organization, team, or academic goal) – as such, the targets assessed differed across the samples. For example, four samples evaluated “commitment to the organization,” whereas the student sample did not. Therefore, the combination of the five samples could not assess all possible ways that the scale was intended to be



used. Nevertheless, the combination of the independent samples provided a robust way of evaluating the scale’s nomological network (although the scale’s proposed unidimensionality and item factor loadings were tested in the individual samples), particularly relative to each independent sample on its own. Although the authors did not use meta-analysis, the combination of the independent samples represents a way that primary studies can begin to address the concerns we have noted around sampling error and its effect on understanding how a scale may perform when it is administered. Limited focus on scale items. Finally, although scale developers may focus on the quality of scale items when a scale is first developed (Carpenter et  al., 2016), this focus often is incomplete and wanes when the scale is re-examined. Indeed, most of the efforts taken to examine and understand construct validity evidence that supports the use of a scale (whether initially or as part of reevaluation efforts) focus on the scale or construct level of analysis – items typically receive less relative attention. For example, examinations of descriptive statistics (e.g., mean, standard deviation) or nomological network usually focus on the overall scale, and not on the individual scale items. This makes it easy to miss any existing issues with scale items. We note that this is also the case when meta-analysis is used to re-evaluate construct validity evidence for a scale. Meta-analysis has certainly been useful in answering questions about construct validity, including the relationships that a construct scale has with nomological network correlates (e.g., Kinicki et  al., 2002). However, a more complete re-examination of a construct scale should entail an evaluation of whether the scale’s factor structure holds up and the extent to which scale items remain strongly related to their expected factor. 
Thus, we maintain that the combination of both scale- and item-focused evaluations is important to identify problems with construct validity and their solutions. For example, Hoffman et al.’s (2007) meta-analysis of organizational citizenship behavior’s (OCB) factor structure (i.e., dimensionality) and nomological network (e.g., task performance) was conducted at the level of the scale dimension (e.g., altruism, civic virtue). They found that OCB was unidimensional and empirically distinct from task performance. However, Carpenter et al.’s (2016) meta-analysis – focused on OCB and task performance items – revealed that some OCB items on the scale (judged to be more aligned with the definition of task performance and/or work withdrawal) worsened model fit when they were forced to load on OCB factors. This illustrates how emphasizing scale-level approaches without incorporating item-level

approaches makes it likely that issues with items that impact how scales should be used are missed. Furthermore, the scale items are critical to scale development processes because these items are what respondents actually encounter and evaluate when the scale is administered. Each item is assumed to reflect an essential part of the construct and, thus, contribute to our understanding of the construct. However, such important information is lost when the sole focus is on scale-level approaches. Although scale-focused studies may find evidence that supports construct validity, our understanding of construct validity remains incomplete without a precise investigation of items and how specific items enhance our understanding and evaluation.
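The SME sorting procedure described earlier for the Klein et al. (2014) commitment items can be quantified with Anderson and Gerbing’s (1991) substantive-validity indices; a minimal sketch (the sorting counts below are hypothetical, not Klein et al.’s data):

```python
def substantive_validity(assignments, intended):
    """Anderson & Gerbing (1991): proportion of substantive agreement
    (p_sa = n_c / N) and substantive-validity coefficient
    (c_sv = (n_c - n_o) / N) for one item, where n_c is the number of
    SMEs sorting the item to the intended construct and n_o is the
    largest number sorting it to any single rival construct."""
    n = sum(assignments.values())
    n_c = assignments.get(intended, 0)
    others = [v for k, v in assignments.items() if k != intended]
    n_o = max(others) if others else 0
    return n_c / n, (n_c - n_o) / n

# Hypothetical item sorted by 23 SMEs across three construct definitions
counts = {"commitment": 20, "satisfaction": 2, "identification": 1}
p_sa, c_sv = substantive_validity(counts, "commitment")
print(round(p_sa, 2), round(c_sv, 2))  # 0.87 0.78
```

Items with high p_sa but low c_sv are exactly the ambiguous items the passage warns about: they represent the intended construct but also draw substantial votes for a rival construct.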

CONTRIBUTION OF ITEM-LEVEL META-ANALYSIS The limitations of scale development processes we described above may be addressed by incorporating item-level meta-analysis into the empirical procedures that are commonly used to demonstrate and evaluate construct validity evidence. Meta-analysis is a critical method for understanding a construct’s population parameters without the confounding effects of sampling error and other study artifacts (Schmidt & Hunter, 2015). In essence, meta-analysis includes more data and statistical power than a single primary study or independent sample, which provides a more precise estimate of population parameters (Stone & Rosopa, 2017). Additionally, meta-analytic techniques allow researchers to statistically correct for common errors such as sampling error and unreliability so that relationships (e.g., correlations, mean differences) are more clearly understood without their obfuscating influence. Finally, meta-analysis assists researchers in resolving inconsistent findings from independent studies and can more conclusively determine the extent to which subpopulations (e.g., moderator variables) exist for the construct or relationship of interest (Schmidt & Hunter, 2015). Viswesvaran and Ones (1995) noted that meta-analysis allows researchers to obtain the true score correlations among operationalizations of the same and distinct constructs. By applying structural equation modeling techniques (e.g., factor analysis and path analysis) to a meta-analytically estimated correlation matrix at the scale level, researchers can estimate both the adequacy of the measurement model (e.g., dimensional structure) and relationships among conceptually


distinct constructs. This results in a more complete understanding of theoretical expectations, particularly those involving multiple constructs, which cannot be measured in a single research study. As such, Viswesvaran and Ones’ (1995) work contributed to scale validation by demonstrating how meta-analysis at the scale level can be adopted for realistic theory testing, and especially for demonstrating construct validity evidence. For instance, by integrating meta-analysis with path analysis to clarify work engagement’s nomological network, Christian et al. (2011) demonstrated that engagement was distinct from job satisfaction. However, combining meta-analysis with an item-level focus yields additional construct validity evidence regarding a scale that cannot be determined with a scale-level approach alone. This is because specific scale items are not typically considered in meta-analyses seeking to evaluate construct validity evidence. For example, Dalal (2005) meta-analytically evaluated construct validity evidence for OCB and CWB (i.e., whether the correlation between OCB and CWB supported their distinctiveness) and found that OCB-CWB correlations were significantly stronger when reverse-worded items were included on measures. However, to evaluate the influence of this item-level moderator, the author relied on scale-level approaches – that is, separating studies based on whether the OCB scale contained reverse-worded items – and could not provide further explanation about the items. In contrast, Carpenter et al.’s (2016) item-level meta-analysis on task performance and OCB pinpointed that reverse-worded items on the Williams and Anderson (1991) scale (e.g., “takes undeserved breaks”) had relatively weaker factor loadings on their expected factor and contributed to weaker model fit compared to when they were able to load onto a new factor.
Thus, item-level meta-analysis can show exactly which items are the problem and can provide clear solutions (e.g., remove reverse-worded items or reassign them to different scale factors). Altogether, incorporating the item level of analysis with meta-analysis can lead to a more interpretable understanding of a scale and its items. Therefore, the contribution of item-level meta-analysis to scale validation efforts is as follows. First, the meta-analysis of multiple independent samples using the scale items alleviates concerns of sampling error variance (Schmidt & Hunter, 2015). This results in more precise and stable estimates of the items' underlying factor structure and factor loadings. Next, item-level meta-analytic efforts also help researchers identify and understand the moderators that may impact the magnitude and/or direction of factor loadings and


the fit of measurement models. Finally, item-level meta-analysis can be used with other item-focused techniques (e.g., substantive/content validity) to generate and evaluate alternative factor structures that may better represent the items. These efforts can yield important advances that are not likely to be found via scale-level approaches.

HOW TO CONDUCT AN ITEM-LEVEL META-ANALYSIS FOR SCALE VALIDATION

When a new scale is developed to assess a construct, item-level meta-analysis stands as an important tool that can be used to understand and communicate the likely performance of the scale. Of course, the first step of scale validation efforts starts with generating items that represent the intended construct or dimension. Thus, it is necessary to follow accepted and recommended steps to develop the items. In line with Hinkin (1998), this includes developing a clear construct definition, taking steps to generate possible items and exemplars of the definition, and then taking steps to reduce these items, perhaps via substantive validity analysis (e.g., Anderson & Gerbing, 1991). It is at this stage that scale developers usually identify the "final" set of items that are eventually validated. Yet, we suggest that it may be useful to retain more items at this stage – the use of item-level meta-analysis could empirically confirm whether borderline items should indeed be retained on or omitted from the scale. Conducting an item-level meta-analytic process to initially validate a scale first requires identifying the relevant multiple independent samples. This means the scale developer should identify the ways that the multiple independent samples should differ, as these differences will inform potential moderators for the examination. For example, samples could vary by industry, such that the items are evaluated in samples representing health care, retail, and manufacturing environments. Independent samples could also vary by job type; here, items could be evaluated in samples where employees work in hourly wage or salaried positions. We also encourage scale developers to ensure that samples are diverse in terms of demographics such as gender, race and ethnicity, and age.
This step requires scale developers to consider the diverse contexts where the scale may be used and attempt to incorporate such different settings into the samples collected. The next step to initially validate a scale or revisit validity evidence for an existing scale is



to collect enough samples that warrant meta-analysis. For example, Schmidt and Hunter (2015) recommended that at least three samples are necessary to justify meta-analysis, particularly when comparing meta-analytic correlations across moderator categories (i.e., each category needs at least three samples). This is a commonly used practice in meta-analyses of scale-level data and we expect this also applies to item-level data. What this means is that as researchers identify relevant moderators (e.g., industry), they should also ensure that at least three independent samples are used for the particular moderator category (e.g., for industry, collect three samples from manufacturing settings and three from retail settings). As noted above, we have observed that it is commonplace for researchers to obtain numerous independent samples as part of their construct validation efforts (e.g., Ferris et al., 2008; Klein et al., 2014). To adapt this practice for item-level meta-analysis, we recommend first strategizing how multiple samples could represent different moderator variables or categories. Then, these multiple samples can be meta-analyzed in the aggregate (i.e., deriving meta-analytic estimates of item-level relationships for all samples) and also by moderator (i.e., deriving meta-analytic estimates of item-level relationships for samples in each moderator category [e.g., manufacturing]). These meta-analyzed relationships between the items on the scale can then populate meta-analytic correlation matrices that can be analyzed using SEM to evaluate the proposed scale's measurement model and factor loadings, and also test measurement invariance of such findings across moderators. We describe these steps in more detail in the following sections.

Literature search considerations specifically for conducting item-level meta-analysis to re-evaluate construct validity.
The analytical steps of item-level meta-analysis are the same whether revisiting validity evidence for an existing scale or conducting the initial validation. However, the process of collecting relevant data is quite different. Whereas for an initial scale validation, the scale developer likely has the multiple independent samples that can be meta-analytically combined to determine scale structure, dimensionality, and item factor loadings, this is not likely the case for re-evaluating construct validity. Specifically, someone desiring to re-evaluate an existing scale is not likely to have the item-level data from the studies where the scale has been used since it was developed. In this section, we describe ways researchers can collect data relevant for an item-level meta-analysis conducted for the purpose of re-evaluating the construct validity of an established scale. We also note examples from the literature that illustrate approaches to our

recommendations (e.g., Klein et al., 2001; Wood et al., 2008). The first step is to conduct a literature search to locate studies and samples in which the scale was used. We recommend starting with searching the studies that include the scale in their reference list. This step also requires verifying that the target scale was indeed administered in the referenced study and determining whether any item-level statistics are included in the published or unpublished article. It is not common for this level of data to be included in an article (e.g., item-level statistics take up precious publication space). As such, the next step typically involves contacting study authors and requesting item-level statistics. In order to construct an item-level meta-analytic correlation matrix (i.e., representing the correlations among all possible item pairs on the scale), scholars should request at a minimum: (a) correlations between each item pair on the scale; (b) means for each item; and (c) standard deviations for each item. To ensure items can be identified, it is important that researchers request the item wording or item labels alongside the statistics; as an alternative, researchers could create and then send study authors a standardized form or table with an item list for authors to complete. We provide an example of what this could look like in Figure 16.1. Once all possible independent samples are collected, the scholar can move to the steps for coding, combining, and analyzing the data, which are described in the following sections. As one example of how item-level meta-analytic data were collected, Klein et al. (2001) noted that their literature search only returned one article reporting some portion of the necessary item-level statistics. As a result, the authors contacted three authors who used the scale in published studies to request the necessary data. Similarly, Wood et al.
(2008) described that of the 53 studies they identified as measuring the relevant items, only one study provided item-level correlations. The authors then contacted the lead authors of each study to request item-level correlations, means, and standard deviations. Wood et al. had a 78% response rate from their efforts.

Coding. Once researchers collect their independent samples (whether for initial or continued scale validation), we suggest the following steps as part of the coding process for each sample: (a) code the correlations between all possible item pairs on the scale (e.g., correlation between item 1 and item 2; correlation between item 1 and item 3, and so on); (b) record each item's mean and standard deviation; and (c) code the study-level moderators that apply to each item (e.g., industry, job title, country, rating scale). We provide an example



Figure 16.1  Example item listing for literature search and coding

item-level coding sheet in Figure 16.2. In line with typical recommended meta-analytic steps, we encourage study authors to ensure adequate levels of agreement between authors who complete the coding (see Schmidt & Hunter, 2015). One way that this can be accomplished is authors

initially coding a subset of the studies in the meta-analytic database (e.g., 25% of studies) and then meeting to resolve any disagreements that emerge with the coding. Once study authors are calibrated in their coding, the remaining coding can be completed by a single author, if desired.
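As a minimal illustration of such a calibration check, the sketch below computes simple percent agreement between two coders on a shared subset of studies. The function name and the coded values are our own, and published meta-analyses often report a chance-corrected index (e.g., Cohen's kappa) alongside raw agreement.

```python
# Minimal sketch of an intercoder calibration check on an initial subset of
# studies (e.g., 25% of the database); names and data are illustrative only.

def percent_agreement(coder_a, coder_b):
    """Share of coded fields on which two coders recorded the same value."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Industry moderator codes recorded by two coders for the same eight studies
coder_a = ["retail", "manufacturing", "retail", "healthcare",
           "retail", "manufacturing", "healthcare", "retail"]
coder_b = ["retail", "manufacturing", "healthcare", "healthcare",
           "retail", "manufacturing", "healthcare", "retail"]

agreement = percent_agreement(coder_a, coder_b)
print(agreement)  # 0.875 -- the one disagreement is resolved in a calibration meeting
```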

Figure 16.2  Example coding sheet for item-level meta-analysis



As one example, Klein et al. (2001) coded each sample's size, inter-item correlations, and item means and standard deviations. The three moderators they coded were task complexity (low, medium, high), measurement timing (e.g., goal commitment measured before, during, or after task performance), and goal origin (whether the goal was self-set or assigned). As a second example, Wood et al. (2008) found in their item-level datasets that the wording of the items varied across samples, due to reverse wording or other adjustments. To address their reasonable concern that the variations in item wording could shift the meaning of items, two SMEs "familiar with the psychometric properties of trust-related measurement items" reviewed the 45 items that were under consideration. Although the specific criteria for evaluating the items were not described in the article, 40 of the 45 items were retained for the subsequent meta-analysis. It is a common practice for researchers to adjust the wording of items to fit a particular context or population, and this study shows that it is also important to consider whether data from these adjusted items can be aggregated.

Analysis. Once the item-level meta-analytic database has been constructed, the next step is to conduct the meta-analyses. As these data are at the item level of analysis, they will be subjected to bare-bones meta-analysis, which obtains sample-size-weighted correlations and removes the influence of sampling error variance on correlations but does not involve corrections for unreliability (Schmidt & Hunter, 2015). This is because reliability (e.g., coefficient alpha) is assessed for a scale as a whole, and not for each individual item comprising a given scale. Each correlation between item pairs is meta-analyzed; for example, if the scale contains 10 items, this indicates 45 correlations, while for 20 items, this indicates 190 correlations.
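To make the bare-bones computations concrete, the sketch below implements the sample-size-weighted mean correlation for a single item pair and the residual variance after removing expected sampling error variance, following the Schmidt and Hunter (2015) bare-bones formulas; the function names and toy values are our own illustration. Note that a k-item scale yields k(k-1)/2 unique item-pair correlations: 45 for 10 items and 190 for 20 items.

```python
# Illustrative sketch of bare-bones meta-analysis (Schmidt & Hunter, 2015)
# for one item-pair correlation; names and values are our own, not the chapter's.

def n_item_pairs(k):
    """Number of unique item-pair correlations for a k-item scale: k(k-1)/2."""
    return k * (k - 1) // 2

def bare_bones(rs, ns):
    """Sample-size-weighted mean correlation and the observed variance of
    correlations remaining after removing expected sampling error variance."""
    total_n = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / total_n
    # Sample-size-weighted observed variance of the correlations
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling error variance, using the average sample size
    avg_n = total_n / len(rs)
    var_err = ((1 - r_bar ** 2) ** 2) / (avg_n - 1)
    return r_bar, max(var_obs - var_err, 0.0)

# Three independent samples reporting the item 1-item 2 correlation
r_bar, var_res = bare_bones([0.42, 0.38, 0.45], [150, 200, 120])
print(n_item_pairs(10), n_item_pairs(20))  # 45 190
print(round(r_bar, 3))  # 0.411
```

A residual variance near zero (as here) suggests the observed variability across samples is attributable to sampling error alone, which is why moderator analyses require comparing subgroup estimates rather than inspecting single-sample correlations.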
The meta-analyses can be carried out on the overall set of independent samples (e.g., 10 independent samples) and they can also be carried out on the subsets of samples representing the moderator categories – for example, if there are three samples from a manufacturing company and three samples from a retail organization, then the inter-item correlations from the manufacturing samples could also be meta-analyzed separately from the retail samples (i.e., two additional sets of meta-analyses). For example, Klein et al. (2001) described that they conducted separate meta-analyses for each of the subgroups of a moderator category. This means that for the "goal origin" moderator, the authors separately meta-analyzed samples with self-set goals and then meta-analyzed samples with assigned goals. We also note here

that Wood et al. (2008) reported that the 44 independent samples they obtained allowed them to achieve a minimum of five samples per item-level correlation to be meta-analyzed (the number of samples per correlation ranged from 5 to 26). After conducting the desired meta-analyses of item pairs, the next step is to assemble the item-level correlation matrices that will be analyzed to evaluate validity evidence. As noted above, each correlation matrix represents the relationships among the items on the scale. Each cell in the matrix represents a correlation between two items – we provide an example illustration in Figure 16.3. Once the desired bare-bones meta-analytic item-level correlations have been completed, each correlation should be converted to a covariance using the sample-size-weighted standard deviations (Carpenter et al., 2016). Finally, computing the harmonic mean of the sample sizes underlying each covariance matrix is necessary for analysis, as it serves as the input sample size for the subsequent SEM (Viswesvaran & Ones, 1995; Shadish, 1996). In line with our prior discussion of moderators, these steps – (a) item-level meta-analysis, (b) converting correlations to covariances, and (c) computing the harmonic mean – need to be completed for the overall analysis and for any moderator category of interest.

Evaluating construct validity evidence. After completing the item-level covariance matrix, the next step is to use this matrix to evaluate construct validity evidence. This typically means that, at a minimum, the theorized measurement model is tested with CFA to evaluate the theorized factor structure of the scale. Subsequent steps may also include testing measurement invariance to evaluate whether the scale's structure holds across different moderator categories. To conduct the initial validation of a scale, this step requires identifying the expected (i.e., theorized) factor structure, and the factor on which items are supposed to load.
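The matrix-completion steps described above – converting each meta-analytic correlation to a covariance using sample-size-weighted item standard deviations, and computing the harmonic mean of the cell sample sizes – can be sketched as follows. The function names and toy values are our own illustration, not from the chapter.

```python
# Sketch of the matrix-completion steps: convert meta-analytic correlations
# to covariances and compute the harmonic mean of per-cell sample sizes,
# which serves as the input N for the subsequent SEM.

def weighted_sd(sds, ns):
    """Sample-size-weighted standard deviation for one item across samples."""
    return sum(sd * n for sd, n in zip(sds, ns)) / sum(ns)

def harmonic_mean(ns):
    """Harmonic mean of the sample sizes underlying the matrix cells."""
    return len(ns) / sum(1.0 / n for n in ns)

def to_covariance(r_matrix, item_sds):
    """cov(i, j) = r(i, j) * sd_i * sd_j; the diagonal becomes item variances."""
    k = len(item_sds)
    return [[r_matrix[i][j] * item_sds[i] * item_sds[j] for j in range(k)]
            for i in range(k)]

# Toy 3-item meta-analytic correlation matrix with weighted SDs of 1.0, 0.8, 1.2
r = [[1.00, 0.40, 0.30],
     [0.40, 1.00, 0.50],
     [0.30, 0.50, 1.00]]
cov = to_covariance(r, [1.0, 0.8, 1.2])
n_harmonic = harmonic_mean([300, 450, 600])  # N to report to the SEM program
```

The resulting covariance matrix and harmonic-mean N are what would be passed to the CFA software in the step that follows.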
This step also applies to those who wish to re-evaluate an established scale's factor structure, though they may need the extra step of locating the theorized factor structure, likely from the original paper that published the items. After fitting the original factor structures and examining model fit and each item's factor loadings (and cross-loadings, if applicable), the next decision is whether each item fits the scale as originally theorized or if revisions are necessary. This decision is informed by the magnitude and direction of the factor loadings. Those who are re-evaluating an existing scale may already have suggestions of alternative factor structures, perhaps from individual studies that used the scale, from conceptual or theoretical advances, or, as we discussed previously, a substantive validity assessment. Thus, testing a sequence of models and comparing model fits can help determine whether the scale items are best



Figure 16.3  Example of item-level correlation matrix (prior to covariance conversion)

represented by the original factor structure or if an alternative structure fits better. Klein et al.'s (2001) evaluation of the 9-item goal commitment scale started with a CFA where the original nine items loaded on a single factor. The model fit of this initial and theorized model was marginal on the basis of commonly used fit indices (e.g., chi-square = 533.31, df = 27; RMSEA = .11; CFI = .86). A subsequent exploratory factor analysis showed two factors, although one of these factors explained a small proportion of the total variance. The authors examined the original unidimensional model's correlation matrix, factor loadings, and residuals and found that two items had the weakest factor loadings (e.g., factor loadings .44 and .31 when remaining loadings started at .54), largest residuals (R2 = .19 and .98, when remaining items started at .27), and contributed to reduced model fit relative to the remaining seven items. The authors removed these two items and repeated the prior steps – this time, two additional items were identified as reducing model fit. The remaining five items were tested again, this time resulting in acceptable levels of model fit (e.g., chi-square = 68.52, df = 5; RMSEA = .067; CFI = .977). The authors retained these final five items as the refined and final measure of goal commitment. Thus, these steps for re-evaluating construct validity resulted in a revised goal commitment scale that contained items that were more empirically aligned with the underlying theoretical expectations.

Measurement invariance testing. The use of item-level meta-analysis also provides the

opportunity to evaluate whether the factor structure – whether it is the original or an alternative structure – represents the scale items across important conditions. As noted above, when coding the independent samples, additional variables can be coded to represent categorical sample characteristics, study contexts, or other variables of interest. We described above that separate meta-analyses and correlation/covariance matrices are needed for each moderator category. The measurement invariance tests are typically conducted with multigroup CFA, where each moderator category represents a separate group. Measurement invariance can be tested in sequential steps of increasing restrictiveness, including: (a) configural invariance, which focuses on the equivalence of the measurement model (i.e., factor structure) across groups; and (b) metric invariance, which adds the constraint of equivalent factor loadings for the groups. Each type of invariance is important for further evaluating and identifying specific items that contribute to heterogeneity in how the overall scale performs across different settings. Demonstrating measurement invariance requires that the more restrictive model does not fit significantly worse than the less restrictive model – that is, the additional constraints do not result in a significant change in model fit. As one example, Wood et al. (2008) expected that their scale measuring buyer's trust in a seller was not invariant across the different targets of trust (e.g., salesperson or the selling institution). However, their expectations were not supported as the chi-square for the multi-group CFA testing



configural invariance was not statistically significant. This indicated that the measurement of buyer trust was empirically equivalent regardless of whether the focus of trust was on the salesperson or the organization. Additionally, Klein et al. (2001) evaluated measurement invariance (e.g., configural and metric) for three different moderators (task complexity, measurement timing, and goal origin). As one example, for task complexity, which contained three groups (low [k = 7, n = 1195], medium [k = 7, n = 1396], and high [k = 3, n = 273]), the initial configural invariance model evaluated the equivalence of the revised 5-item unidimensional measurement model across the three groups. Due to the large sample sizes of groups, it was not surprising that the chi-square statistic was statistically significant, which prompted the authors to evaluate the fit indices to gauge the fit. These statistics (e.g., RMSEA = .04; CFI = .97) provided empirical support of configural invariance for the refined goal commitment scale. The more restrictive metric invariance model was also supported – the chi-square was again significant, but the remaining fit indices (e.g., RMSEA = .05, CFI = .95) supported the empirical equivalence of item factor loadings across different levels of task complexity. These examples are centered on the use of item-level meta-analysis and the subsequent use of measurement invariance to re-evaluate existing scales, yet these techniques are largely identical to what would be done as part of initial scale development efforts. A summary of the item-level meta-analytic steps we have described is located in Figure 16.4.
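The fit statistics cited in these examples follow standard formulas, sketched below with illustrative values rather than Klein et al.'s (2001) data. The 1-df chi-square difference test shown is the kind of comparison used when judging whether a single added invariance constraint significantly worsens fit; the function names are our own.

```python
import math

# Standard CFA fit-index formulas with illustrative (not Klein et al.'s) values.

def rmsea(chi2, df, n):
    """Root mean square error of approximation."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2, df, chi2_null, df_null):
    """Comparative fit index relative to the baseline (null) model."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 - d_model / d_null if d_null > 0 else 1.0

def chi2_diff_p_df1(delta_chi2):
    """p-value of a 1-df chi-square difference test (e.g., configural vs. metric
    when one loading constraint is added); stdlib-only via the erfc identity."""
    return math.erfc(math.sqrt(delta_chi2 / 2.0))

print(round(rmsea(100.0, 50, 401), 3))       # 0.05
print(round(cfi(100.0, 50, 1000.0, 60), 3))  # 0.947
print(chi2_diff_p_df1(3.0) > 0.05)           # True: constraint does not worsen fit
```

This also illustrates the pattern in the Klein et al. example: with very large harmonic-mean Ns, the chi-square itself is nearly always significant, so approximate indices such as RMSEA and CFI carry the interpretive weight.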

FUTURE DIRECTIONS AND IMPLICATIONS

Our purpose in this chapter has been to describe the merits of using item-level meta-analysis as part of scale validation processes and the technical steps involved in these efforts, along with select examples of what these practices have looked like in the existing literature. Klein et al.'s (2001) use of item-level meta-analysis for scale re-evaluation was, to our knowledge, the first published attempt, yet the use of these approaches remains rare. There are a few exceptions outside of management and I/O Psychology; for example, in leisure research, Manfredo et al. (1996) applied item-level meta-analysis to re-examine a scale on recreation experience preferences. Although their original scale contained 328 items, their meta-analysis focused on 108 items given data availability and how the scale was used in previous research studies. Nevertheless, their efforts resulted in empirical

support for the underlying scale structure that aligns with how it tends to be currently used (i.e., with fewer items). Thus, our first recommendation is for more widespread use of item-level meta-analysis as part of the scale validation process, whether for a newly developed or established scale. However, it is also important to consider new ways that item-level meta-analysis can be leveraged within these validation activities. First, construct and measurement proliferation are important research concerns that may be more easily disentangled via item-level meta-analysis. These issues are of particular concern in the organizational sciences, where there are constructs that appear similar but are assumed to be conceptually distinct, and multiple scales exist to measure the same construct (Colquitt et al., 2019). Item-level meta-analysis is a tool that can be used to provide a more fine-grained understanding of the similarities and differences that exist between constructs. This is important because even existing meta-analytic attempts to understand construct or measurement proliferation are limited by their focus on the scale level of analysis (e.g., Cole et al., 2012; Carpenter & Berry, 2017), which does not help to understand why constructs may be strongly related. For example, Carpenter and Berry (2017) meta-analyzed the relationship between counterproductive work behavior (CWB) scales and withdrawal scales to determine whether these constructs (and their measures) were empirically redundant or distinct. However, to incorporate item-level meta-analysis, it would be necessary to identify studies in which both constructs (or measures) were assessed, and then follow the previous steps we described to collect item-level datasets (i.e., containing items from suspected proliferating scales) or item-level correlation matrices for the items under investigation.
For the CWB-withdrawal question, then, the steps would be to identify studies in which both constructs were measured, collect the item-level raw data or correlation matrices, and conduct the item-level meta-analysis to complete the necessary matrices. The subsequent meta-analytic CFA would more conclusively determine whether the CWB items and withdrawal items were best represented by a single factor – supporting empirical redundancy – or by two (or more) factors – supporting empirical distinctiveness – and the specific contribution of items. We also note that, to this point, our discussion of the merits of using item-level meta-analysis has been focused on items measuring focal constructs. However, it is also important to consider control variables. For many reasons,



Scale Development (initial scale)
- Develop a clear construct definition
- Generate possible items
- Identify a "final" set of items that represent the intended construct (e.g., via content or substantive validity analysis)

Locate Data (re-examining the scale)
- Locate studies and samples in which the scale was used
- Determine whether item-level statistics are included
- Contact authors to request item-level data

Obtain Multiple Samples
- Collect multiple samples (e.g., at least 3 samples for each moderator category) that warrant meta-analysis
- Administer the items to multiple samples

Coding (for each sample)
- Code item inter-correlations
- Code each item's mean and standard deviation
- Code potential study-level moderators (e.g., industry, job title, country, rating scale) that apply to each item

Item-level Meta-analysis
- Conduct bare-bones meta-analysis of each inter-item correlation for the overall database or for each moderator category
- Create the inter-item meta-analytic correlation matrix

Evaluate Construct Validity
- Identify the expected or theorized factor structure
- Conduct CFA to evaluate the theorized factor structure
- Examine model fit and each item's factor loading
- Consider alternative factor structures, test a sequence of models, and compare model fits

Measurement Invariance Testing
- If moderator categories were identified: conduct multigroup CFA and evaluate heterogeneity in the items

Figure 16.4  Steps for scale validation

such as consideration of page limits or control variables' peripheral roles in the theoretical framework, it is unlikely that construct validity evidence for control variable scales (e.g., factor structures or item factor loadings) is reported, especially relative to that for focal variables. This is concerning because these scales may have similar issues as their focal variable counterparts, but be much less likely to be discussed.

For example, social desirability, a response bias in which participants answer in ways that make them look good in surveys (Crowne & Marlowe, 1960; Beretvas et al., 2002), is often used as a control variable (e.g., Liu et al., 2019; Owens et al., 2019). Nevertheless, extant studies have revealed mixed evidence for the factor structure and item loadings of a popular Crowne-Marlowe social desirability scale. Although



researchers have developed various short versions of the Crowne-Marlowe social desirability scale to reduce respondent fatigue (e.g., Strahan & Gerbasi, 1972; Reynolds, 1982; Vésteinsdóttir et al., 2017), there is no consensus on which items can be eliminated in these short versions (Barger, 2002). As a result, there is currently an inconsistent understanding and usage of the scale's items across studies. There is also confusion regarding the actual factor structure of the full social desirability scale. Although the Crowne-Marlowe social desirability scale and its short versions are assumed to be unidimensional, several studies failed to find support for this (e.g., Thompson & Phua, 2005). Despite these issues, social desirability is commonly used, though in an inconsistent manner, which leaves researchers wondering how to address unexpected findings in their own studies. Item-level meta-analysis is well suited to reconcile the mixed findings regarding this scale's dimensionality. Finally, we note that an under-utilized merit of item-level meta-analysis is to develop and evaluate shortened versions of scales. Shortened scales are commonly used for many reasons, including time and resource limits, response fatigue concerns, and the contextual needs of organizations (e.g., König et al., 2010; Shoss et al., 2013; Wayne et al., 2013). We refer interested readers to Chapter 8 in this Handbook for specific guidance on how to shorten scales. We expect that validation processes for shortened scales can benefit from the use of item-level meta-analysis. In particular, because shortened scales are often developed in a single sample, sampling error and generalizability remain concerns when evaluating validity evidence for the scale. Thus, item-level meta-analysis may address these concerns.
Specifically, scholars could collect multiple samples in which the shortened scale items were administered and then follow the steps to conduct an item-level meta-analysis to evaluate the (a) factor structure (e.g., unidimensionality) and (b) factor loading magnitudes. These pieces of information help determine whether the shortened scale conforms to the original scale's assumptions. As one example, Huhtala et al. (2018) developed a shortened version of the corporate ethical virtues scale and subsequently evaluated it using two independent samples. Although the scale exhibited dimensions consistent with the full scale, Huhtala et al. (2018) found that one item had an improper factor loading in one sample. This suggests the influence of sampling error and that item-level meta-analysis is likely to provide more guidance on how to improve the shortened scale. We also note that given the prevalence of shortened scales in the

literature, item-level meta-analysis can help to evaluate construct validity evidence for existing scales. Returning to an earlier example regarding social desirability, we expect that item-level meta-analysis can be used to obtain important insights for the various shortened social desirability scales, such as Reynolds' (1982) Form A, and to determine whether the shortened scale indeed displays unidimensionality as expected. Finally, we note that organizations commonly develop their own items and scales in order to evaluate worker attitudes (e.g., engagement) and performance behaviors that occur in the working environment. In order to have confidence in the interventions or decisions that come from these surveys, it is important that organization leaders also use the elements of item-level meta-analysis we have described here to evaluate whether the scale is measuring what is intended. For example, when developing an engagement scale, managers could collect multiple samples (e.g., by job type) for the items, and then conduct item-level meta-analysis to evaluate the expected structure of their developed scale and determine whether the structure holds (i.e., is invariant) across different types of jobs. This has important implications both for organizations that have just developed their own measure and for those that have used certain measures over time. These steps are also necessary for increasing the confidence organization leaders have in the interventions or initiatives that stem from their surveys.

REFERENCES

Anderson, J. C., & Gerbing, D. W. (1991). Predicting the performance of measures in a confirmatory factor analysis with a pretest assessment of their substantive validities. Journal of Applied Psychology, 76(5), 732–40. Barger, S. D. (2002). The Marlowe-Crowne affair: Short forms, psychometric structure, and social desirability. Journal of Personality Assessment, 79(2), 286–305. Beretvas, S. N., Meyers, J. L., & Leite, W. L. (2002). A reliability generalization study of the Marlowe-Crowne Social Desirability Scale. Educational and Psychological Measurement, 62(4), 570–89. Carpenter, N. C., & Berry, C. M. (2017). Are counterproductive work behavior and withdrawal empirically distinct? A meta-analytic investigation. Journal of Management, 43(3), 834–63. Carpenter, N. C., Son, J., Harris, T. B., Alexander, A. L., & Horner, M. T. (2016). Don’t forget the items: Item-level meta-analytic and substantive validity


techniques for reexamining scale validation. Organizational Research Methods, 19(4), 616–50. Cho, S., Carpenter, N. C., & Zhang, B. (2020). An item‐level investigation of conceptual and empirical distinctiveness of proactivity constructs. International Journal of Selection and Assessment, 28(3), 337–50. Christian, M. S., Garza, A. S., & Slaughter, J. E. (2011). Work engagement: A quantitative review and test of its relations with task and contextual performance. Personnel Psychology, 64(1), 89–136. Cole, M. S., Walter, F., Bedeian, A. G., & O’Boyle, E. H. (2012). Job burnout and employee engagement: A meta-analytic examination of construct proliferation. Journal of Management, 38(5), 1550–81. Colquitt, J. A., Sabey, T. B., Rodell, J. B., & Hill, E. T. (2019). Content validation guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness. Journal of Applied Psychology, 104(10), 1243–65. Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24(4), 349–54. Dalal, R. S. (2005). A meta-analysis of the relationship between organizational citizenship behavior and counterproductive work behavior. Journal of Applied Psychology, 90(6), 1241–55. Ferris, D. L., Brown, D. J., Berry, J. W., & Lian, H. (2008). The development and validation of the Workplace Ostracism Scale. Journal of Applied Psychology, 93(6), 1348–66. Haynes, S. N., Richard, D., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7(3), 238–47. Hinkin, T. R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967–88. Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–21. Hinkin, T. R., & Tracey, J. B. (1999). 
An analysis of variance approach to content validation. Organizational Research Methods, 2(2), 175–86. Hoffman, B. J., Blair, C. A., Meriac, J. P., & Woehr, D. J. (2007). Expanding the criterion domain? A quantitative review of the OCB literature. Journal of Applied Psychology, 92(2), 555–66. Huhtala, M., Kangas, M., Kaptein, M., & Feldt, T. (2018). The shortened Corporate Ethical Virtues scale: Measurement invariance and mean differences across two occupational groups. Business Ethics: A European Review, 27(3), 238–47. Kinicki, A. J., McKee-Ryan, F. M., Schriesheim, C. A., & Carson, K. P. (2002). Assessing the construct validity of the job descriptive index: a review and metaanalysis. Journal of Applied Psychology, 87(1), 14–32.


Klein, H. J., Cooper, J. T., Molloy, J. C., & Swanson, J. A. (2014). The assessment of commitment: advantages of a unidimensional, target-free approach. Journal of Applied Psychology, 99(2), 222–38. Klein, H. J., Wesson, M. J., Hollenbeck, J. R., Wright, P. M., & DeShon, R. P. (2001). The assessment of goal commitment: A measurement model metaanalysis. Organizational Behavior and Human Decision Processes, 85(1), 32–55. König, C. J., Debus, M. E., Häusler, S., Lendenmann, N., & Kleinmann, M. (2010). Examining occupational self-efficacy, work locus of control and communication as moderators of the job insecurity–job performance relationship. Economic and Industrial Democracy, 31(2), 231–47. Liu, Z., Riggio, R. E., Day, D. V., Zheng, C., Dai, S., & Bian, Y. (2019). Leader development begins at home: Overparenting harms adolescent leader emergence. Journal of Applied Psychology, 104(10), 1226–42. Manfredo, M. J., Driver, B. L., & Tarrant, M. A. (1996). Measuring leisure motivation: A meta-analysis of the recreation experience preference scales. Journal of Leisure Research, 28(3), 188–213. Owens, B. P., Yam, K. C., Bednar, J. S., Mao, J., & Hart, D. W. (2019). The impact of leader moral humility on follower moral self-efficacy and behavior. Journal of Applied Psychology, 104(1), 146–63. Paullay, I. M., Alliger, G. M., & Stone-Romero, E. F. (1994). Construct validation of two instruments designed to measure job involvement and work centrality. Journal of Applied Psychology, 79(2), 224–8. Reynolds, W. M. (1982). Development of reliable and valid short forms of the Marlowe‐Crowne Social Desirability Scale. Journal of Clinical Psychology, 38(1), 119–125. Schmidt, F. L., & Hunter, J. E. (2015). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage. Schriesheim, C. A., Powers, K. J., Scandura, T. A., Gardiner, C. C., & Lankau, M. J. (1993). 
Improving construct measurement in management research: Comments and a quantitative approach for assessing the theoretical content adequacy of paperand-pencil survey-type instruments. Journal of Management Special Issue: Yearly Review of Management, 19(2), 385–417. Shadish, W. R. (1996). Meta-analysis and the exploration of causal mediating processes: A primer of examples, methods, and issues. Psychological Methods, 1(1), 47–65. Shoss, M. K., Eisenberger, R., Restubog, S. L. D., & Zagenczyk, T. J. (2013). Blaming the organization for abusive supervision: the roles of perceived organizational support and supervisor’s organizational embodiment. Journal of Applied Psychology, 98(1), 158–68.



Stone, D. L., & Rosopa, P. J. (2017). The advantages and limitations of using meta-analysis in human resource management research. Human Resource Management Review, 27(1), 1–7. Strahan, R., & Gerbasi, K. C. (1972). Short, homogeneous versions of the Marlowe-Crowne social desirability scale. Journal of Clinical Psychology, 28(2), 191–3. Thompson, E. R., & Phua, F. T. (2005). Reliability among senior managers of the Marlowe–Crowne short-form social desirability scale. Journal of Business and Psychology, 19(4), 541–54. Vésteinsdóttir, V., Reips, U. D., Joinson, A., & Thorsdottir, F. (2017). An item level evaluation of the Marlowe-Crowne Social Desirability Scale using item response theory on Icelandic Internet panel data and cognitive interviews. Personality and Individual Differences, 107, 164–73.

Viswesvaran, C., & Ones, D. S. (1995). Theory testing: Combining psychometric meta‐analysis and structural equations modeling. Personnel Psychology, 48(4), 865–85. Wayne, J. H., Casper, W. J., Matthews, R. A., & Allen, T. D. (2013). Family-supportive organization perceptions and organizational commitment: The mediating role of work–family conflict and enrichment and partner attitudes. Journal of Applied Psychology, 98(4), 606–22. Williams, L. J., & Anderson, S. E. (1991). Job satisfaction and organizational commitment as predictors of organizational citizenship and in-role behaviors. Journal of Management, 17(3), 601–17. Wood, J. A., Boles, J. S., Johnston, W., & Bellenger, D. (2008). Buyers’ trust of the salesperson: An item-level meta-analysis. Journal of Personal Selling & Sales Management, 28, 263–83.

17 Continuum Specification and Validity in Scale Development

Louis Tay and Andrew Jebb

Many psychological constructs of interest are constituted by a theoretical continuum that represents the degree, or amount, of that construct. A continuum underlies psychological constructs such as attitudes, personality, emotions, and interests, and allows us to make statements about whether individuals are "high," "moderate," or "low" on such constructs. Nevertheless, researchers have not typically considered the construct continuum explicitly when defining constructs and creating scales, as reflected in the lack of discussion in classic scale development resources (e.g., DeVellis, 1991; Clark & Watson, 1995; Hinkin, 1998). It is possible to create a valid scale while ignoring the continuum, but doing so leaves one open to scientific problems and sources of potential invalidity downstream (Tay & Jebb, 2018). For instance, how can we fully distinguish between similar constructs when their continua are not explicitly defined? This is especially important for the distinction between normal and clinically-dysfunctional personality traits (Samuel & Tay, 2018), given the recognition that normal and abnormal personality lie on a common continuum (Melson-Silimon et al., 2019). It is also pertinent in organizations, which seek to select individuals based on various traits. Another problem concerns reverse-worded items, which are very common in self-report scales (DeVellis,

1991). The issue is that such items are not appropriate for all types of continua and can actually produce construct contamination under the wrong circumstances, undermining validity. Indeed, the question of whether reverse-worded items are part of a construct or represent some type of contamination has been previously discussed in organizational research, such as whether negative wordings of organizational citizenship behavior (OCB) are actually part of counterproductive work behavior (CWB), or vice versa (Spector, Bauer, & Fox, 2010). Understanding a construct’s continuum and how best to operationalize it in scale creation is the solution to these problems – and others. Consequently, this chapter presents the concept of continuum specification, which consists of defining and then operationalizing construct continua within scale development (Tay & Jebb, 2018). After a brief introduction, this chapter is split into two major sections. The first outlines the process of defining the continuum, and the second describes how to operationalize the continuum. This format is similar to Tay and Jebb (2018), but going beyond past work, we focus here more on how the different components of continuum specification relate to the different forms of validity evidence. We provide a checklist of issues to consider that relates to the two, and in so doing,



we hope that researchers can better integrate continuum specification into their scale development process to produce greater validity.

OVERVIEW OF CONTINUUM SPECIFICATION

According to Nunnally and Bernstein (1994), measurement is defined as "assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification)" (p. 3). In most psychological measurement, we perform the "scaling" of individuals with respect to a construct of interest. This presupposes that the construct has a theoretical continuum that is scalable. Individuals are scaled with respect to this continuum, which differentiates them by their degree, or amount, of the construct. While our goal here is not to engage in a philosophical discussion of whether psychological constructs are by nature ontologically scalable, the measurement procedures of scaling we use typically assume that they are. In other words, we do not know if psychological constructs themselves can inherently be scaled; however, we treat them as though they can. In this vein, the concept of continuum specification explicitly describes the process through which researchers define and operationalize the construct continuum as part of the scale creation process. We summarize this information in Table 17.1 and discuss each of these components in the next section. Further, because continuum specification is foundational to establishing the validity of a scale, we seek to link each of these continuum specification components to the different forms of validity evidence. This is shown in Table 17.2. In scale development, researchers are concerned about the issue of validity, which is defined by the Standards for Educational and Psychological Testing as "the degree to which evidence and theory support the interpretation of test scores for proposed uses of tests" (American Psychological Association et al., 2014, p. 11).
While the concept of validity is unitary, there are different forms of validity evidence that either support or counter our evaluation of validity. In this chapter, we discuss the four primary forms of validity evidence: evidence with regard to test content, internal structure, response process, and relations to other measures (American Psychological Association et  al., 2014). We now describe each component of continuum specification.

DEFINING THE CONSTRUCT AND THE CONSTRUCT CONTINUUM

The first step in scale development is conceptually defining the target construct (Podsakoff et al., 2016; Jebb et al., 2021). This sets the boundaries on what the construct is and, implicitly, where its continuum begins and ends. For example, researchers A and B might generally define conscientiousness as the trait of being "thorough, neat, well-organized, diligent, and achievement-oriented" (McCrae & John, 1992, p. 197). However, researcher A might further define it as any amount of conscientiousness-related thoughts, feelings, and behaviors. Alternatively, researcher B might further define it as only the amounts of thoughts, feelings, and behaviors exhibited by non-clinical populations, holding that extreme, unhealthy amounts should be considered different clinical traits. In either case, the definition of the construct itself must be provided outright, especially if it pertains to certain degrees of relevant thoughts, feelings, and behaviors. Doing so more clearly articulates the nature of the continuum. Defining the end poles. After the construct itself is defined, the scale developer should go on to define the continuum explicitly. The first step is for researchers to consider what sets the limits of the construct, the continuum's poles. This can be done by answering questions such as: "What is the meaning of each endpoint on the continuum?", "At the high end, what types of behaviors does it encompass?", "At the low end, does it include the absence of the targeted behavior or an opposite behavior?" In short, one is thinking through whether the construct continuum is bipolar or unipolar. A bipolar construct is one where the construct end poles comprise opposite content. In contrast, a unipolar continuum is one where the lower pole represents the absence of the content of interest but not an opposing concept. Relating to conscientiousness, the continuum is often (but not necessarily) conceived as bipolar.
This is because a low degree of being “thorough, neat, well-organized, diligent, and achievement-oriented” (McCrae & John, 1992, p. 197) is often interpreted as being undetailed, messy, disorganized, careless, and achievement-indifferent. By contrast, affect can be conceived as either bipolar or unipolar (Russell & Carroll, 1999). For example, it is possible to define the continuum of emotion from happiness to sadness (i.e., a bipolar continuum), but it is also possible to define the lower end, not as sadness, but as simply the absence of a strongly experienced emotion. In this case, positive affect would have a unipolar continuum, and there would also be another unipolar continuum for sadness (with sadness on



Table 17.1  Checklist: Aspects and steps of continuum specification

Aspect 1: Define the Construct Continuum
This specifies the theoretical nature of the continuum. This theoretical nature consists of two parts: (a) what the continuum's poles mean and (b) what its degrees mean.

1. Define the polarity of the continuum
This step defines what the endpoints of the continuum mean. Without defining these poles, important things can remain ambiguous, like what lower scale scores mean, whether the target construct is separate or overlapping with others, and what relations we should expect with other variables.
a. Determine whether the continuum poles are combinatorial. Think carefully about the construct under study, and what the upper and lower poles mean. Although rare, consider if the construct's continuum is combinatorial – where each pole is a combination of different constructs. This will usually not be the case, but it must first be ruled out before moving on.
b. Determine whether the continuum is unipolar or bipolar based on theory. If the continuum is not combinatorial, then it must be either unipolar or bipolar. Which one it is depends on whether one of these options is not logically valid. Does a unipolar continuum make sense for this construct? Does a bipolar continuum? Polarity really depends on the meaning of the lower pole. We virtually always know what the upper pole means – it is the highest degree of the quality named by the construct. The lower pole, therefore, will determine the continuum's polarity, and its meaning can be identified via theory.
c. If the continuum can be either unipolar or bipolar, then choose a conceptualization appropriate for the research context. Sometimes a construct can be rightly conceived as either unipolar or bipolar. In this case, the researcher must make a decision about how to conceive the construct. The choice should be whatever conception is clearer, more intuitive, and makes the most theoretical sense. This choice should be stated explicitly in scale development because it affects what item content will be appropriate.

2. Specify the nature of the continuum gradations
This step defines what the continuum gradations, or degrees, represent. This tells us what differentiates individuals along the same continuum. Without doing this, some of the same problems described in Step 1 can arise: the meanings of lower scores can be ambiguous, it can be difficult to tell whether the target construct is distinct or overlapping with others, and its expected relations with other variables can be ambiguous.
a. Theoretically consider what separates high vs. low scores. Like Step 1, one must carefully consider the construct under study but this time focus on what differentiates higher vs. lower scores. Is it the degree of experienced intensity, like how we usually conceive degrees of momentary emotion? Or is it frequency, such as the frequency of experiencing an emotion or committing a particular behavior? The quality of the degrees can be other types as well, such as belief strength (e.g., stronger beliefs in one's abilities) or other, newer qualities. Like Step 1, the quality that characterizes the continuum's degrees should be stated explicitly to aid scale development.

Aspect 2: Operationalize the Construct Continuum in Scale Development
In Aspect 1, the continuum was defined. In Aspect 2, one makes sure that the instrument and its validation procedures accurately reflect this continuum.

1. Generate items appropriate to the continuum
When creating a scale, the items must reflect the nature of the continuum. First, all item content should be located within the bounds of the two poles, and not outside of them. Second, the content should also reflect how the gradations have been defined. These steps are fairly easy to perform. However, it is important to make them explicit because validity can be impaired when they are ignored.
a. Create items that fall between the continuum's poles. Having defined the polarity of the continuum, when creating items, one should only include content that is located between these two poles. Failure to do so produces construct contamination. A main concern here is the presence of reverse-worded items. Reverse-worded items are only appropriate for a bipolar continuum because these types of items, by definition, include opposing content. They should be omitted in measures of unipolar constructs because they lie beyond the construct's poles.
b. Create items whose content matches the continuum's gradations. In addition to the poles, the item content should also reflect the kind of gradations the construct has been defined as having. It should not reflect different kinds of gradations that lie outside the definition. For example, when measuring positive affect whose continuum gradations have been defined as experienced intensity, only items asking about intensities are appropriate; items asking about frequencies or other gradation types will be inappropriate (e.g., "I feel happy ___ times a day").

2. Choose response options that match the continuum
Just like the items, the scale's response options should also reflect how the continuum has been defined. Because response options are always tied to items, it is important to consider how the items and response options work together. Together they must reflect the continuum's polarity and gradations.
a. Select response options that reflect the continuum's polarity. Response options will contain either numeric labels, verbal labels, or both. However, different labels can imply different polarities. Some response options may imply a unipolar continuum ("Not at all" to "Always") or a bipolar continuum (e.g., "Strongly disagree" to "Strongly agree"). Scale developers must ensure that these labels match how the polarity has been defined.
b. Select response options that reflect the continuum's gradations. In addition to polarity, different response options can also imply different types of gradations. For example, the responses "Never" to "Always" imply that the continuum's gradations are frequencies. For an item describing a behavior, the response options "Strongly disagree" to "Strongly agree" imply behavioral extremity.

3. Assess dimensionality with the appropriate statistical method
Examining the dimensionality of a scale is an important part of validation. However, there are multiple ways to do this, and the method that is most appropriate depends on the polarity of the construct continuum. The advice in this step is fairly straightforward: for unipolar or bipolar continua, factor analysis can be used; for combinatorial constructs, multidimensional scaling is best.
a. If the continuum is unipolar or bipolar, use factor analysis. The large majority of constructs will have either unipolar or bipolar continua. In these cases, factor analysis is the preferred method for examining dimensionality.
b. If the continuum is combinatorial, use multidimensional scaling. For the minority of constructs that are combinatorial, multidimensional scaling is the preferred method of assessing dimensionality because it is based on relative distances rather than the absolute value of the correlations. Failure to do this can result in an inaccurate number of dimensions.

4. Assess the item response process
Assess whether individuals use an ideal point response process by comparing the model-data fit of ideal point response models to dominance response models. If the fit of the ideal point response model is substantially better, use these models to assess dimensionality and score individuals for subsequent analyses (e.g., regression).

the high end and a lack of strong negative emotion at the lower pole). Polarity is, therefore, a description of the meaning of the end poles, and it is different from dimensionality, which describes how many dimensions are present within a construct. To determine whether a construct has a bipolar or unipolar continuum, the researcher should think carefully about the construct and define the poles in the way that makes the most theoretical sense. Does a unipolar continuum make sense for this construct? Does a bipolar continuum? Sometimes, one of these options is not logically valid. One piece of advice is to realize that polarity virtually always depends on the meaning of the lower pole rather than the upper pole. We almost always know what the upper pole means – it is the highest degree of the quality named by the construct. For example, for locus of control, the upper pole is the strongest that a person can believe that they control the events in their life. The lower pole, therefore, will determine this continuum's polarity. Its polarity can be identified via theory, and in this case, locus of control must be bipolar. This is because the less I believe that I control events in my life,

the more I must believe that outside forces control them (given that there is no “third option” other than myself and outside forces). Sometimes a construct can be validly conceived as either unipolar or bipolar. In this case, the researcher must make an informed decision about how best to define its poles. The choice should be whatever polarity is more intuitive and makes the most theoretical sense for that particular construct. This choice will be context-dependent. While the definitions of most psychological continua fall into bipolar or unipolar categories, it has been observed that some construct continua combine two different types of constructs and focus on their relative strength. These can be referred to as combinatorial constructs, as they combine different constructs at the end poles (Tay & Jebb, 2018). There are multiple examples of combinatorial constructs in the social sciences. Within vocational and occupational psychology, Holland’s RIASEC model (1959) of vocational interests is the most popular, and it describes six different vocational interest types: Realistic, Investigative, Artistic, Social, Enterprising, and



Table 17.2  Components of continuum specification and their links to validity evidence

Defining the construct continuum

1. Define the polarity of the continuum
Issue needing to be addressed: Define what the endpoints of the continuum mean. Is the continuum bipolar, unipolar, or combinatorial (i.e., two different types of end poles)?
Most relevant forms of validity evidence: Evidence with regard to test content and relations to other measures. The meaning of the poles is part of the construct definition and determines the type of item content that is necessary. It also is necessary for evaluating how constructs should relate to one another; how can we have expected relationships if a construct's bounds have not been fully defined?

2. Specify the nature of the continuum gradation
Issue needing to be addressed: Define what the continuum gradations, or degrees, represent. That is, what precisely distinguishes a high score from a low score?
Most relevant forms of validity evidence: Evidence with regard to test content and relations to other measures. The meaning of the continuum gradation determines what content is required (and also inappropriate). It also helps clarify what relations to expect; a certain type of gradation (e.g., frequency) may have different expected relationships than another (e.g., experienced intensity).

Operationalizing the construct continuum

1. Generate items along the continuum
Issue needing to be addressed: Create item content that falls within the end poles of the continuum and matches the nature of the continuum gradation.
Most relevant forms of validity evidence: Evidence with regard to test content. When the item content reflects both the polarity and gradation of the full continuum, this ensures that the content matches the construct definition and helps avoid construct deficiency and contamination.

2. Choose response options that match continuum definition
Issue needing to be addressed: Select response options that reflect the continuum's polarity and the nature of its gradation.
Most relevant forms of validity evidence: Evidence with regard to test content. When the response options match the construct continuum's polarity and gradation, its content matches how it has been defined.

3. Assess dimensionality with the appropriate statistical method
Issue needing to be addressed: Use factor analysis for unipolar or bipolar constructs. Use multidimensional scaling for combinatorial constructs.
Most relevant forms of validity evidence: Evidence with regard to the internal structure. Applying factor analysis to combinatorial constructs can provide a misleading picture of their dimensional structure.

4. Assess item response process
Issue needing to be addressed: Assess whether individuals use an ideal point response process (by comparing an ideal point response model to a dominance model). If so, consider applying ideal point models for assessing dimensionality.
Most relevant forms of validity evidence: Evidence with regard to the response process and relations to other measures. When applying dimensionality techniques that assume a dominance response process (e.g., factor analysis), there may be spurious factors. Moreover, the correct response process model can improve estimation with other variables.



Conventional. These vocational interest types can be plotted in a circle through multidimensional scaling (Rounds & Zevon, 1983). Prediger (1982) observed that the circle of vocational interests – or the entire vocational interest space – can be defined with two orthogonal dimensions: People-Things and Data-Ideas. The People-Things construct dimension has Realistic Interest on one end pole and Social Interest on the other end pole; the Data-Ideas construct dimension has Enterprising + Conventional on one end pole and Investigative + Artistic on the other end pole (this is because the Data-Ideas dimension falls in the middle of Investigative and Artistic types; it also falls in-between Enterprising and Conventional types). Clearly, Prediger's dimensions have different types of constructs (in the same domain) at the end poles. While it is often believed that these vocational interest dimensions are bipolar (e.g., individuals with Social interests do not have Realistic interests and vice versa), more recent research shows that these orthogonal dimensions should not be viewed as bipolar. For example, it has been shown that a substantial number of individuals endorse both Realistic and Social interests, and the meta-analytic correlation between them is actually positive (Tay, Su, et al., 2011). Therefore, this continuum does not behave like one that is bipolar. Instead, this continuum focuses on the relative strength of vocational interest between the two end poles. In other words, people are scaled with respect to the relative strength of the end poles. They are not scaled in terms of each pole on its own. By considering whether the construct end poles are bipolar, unipolar, or combinatorial, one can better specify what the construct includes and excludes, and what falls inside and outside its bounds. Defining these poles is, therefore, best viewed as another part of defining the construct itself.
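A brief simulation can illustrate why multidimensional scaling suits such combinatorial structures. The sketch below is illustrative only (it assumes NumPy and scikit-learn are available; the six circularly arranged "types" are hypothetical RIASEC-like stand-ins, not real interest data): MDS recovers a circumplex layout from relative distances among the type scores.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Six hypothetical interest types arranged on a circle: each type loads on
# two orthogonal latent dimensions (cf. People-Things, Data-Ideas) by angle.
angles = np.deg2rad(np.arange(0, 360, 60))
loadings = np.column_stack([np.cos(angles), np.sin(angles)])  # 6 types x 2 dims

latent = rng.normal(size=(500, 2))                      # person-level scores
scores = latent @ loadings.T + rng.normal(scale=0.5, size=(500, 6))

r = np.corrcoef(scores, rowvar=False)                   # 6 x 6 correlations

# MDS operates on relative distances rather than the absolute correlations,
# so it can recover the circular layout of the combinatorial structure.
dist = 1.0 - r
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dist)

# Types 180 degrees apart should form the most dissimilar pair.
assert int(np.argmax(dist[0])) == 3
```

In this simulated layout, opposite types are maximally distant yet each type still correlates positively with its neighbors, which is the pattern a bipolar interpretation would miss.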
It is, therefore, as foundational for validity as developing a good construct definition. When the construct is under-defined, validity is undermined. With regard to scale content, validity is evidenced when the content of the scale is shown to match the construct's definition (American Psychological Association et al., 2014). However, how can we evaluate whether the content is sufficient if the construct, including its poles, has not been fully defined first? Similar concerns affect validity evidence with regard to internal structure, which is present when the internal structure of the scale (i.e., the relationships among items and dimensions) matches what we expect based on theory. How can we be confident we have an accurate assessment of its structure if the construct's poles have not yet been defined? Finally, defining the continuum's poles affects validity evidence with regard to other

variables, defined as when the observed relationships match what is expected based on theory (American Psychological Association et al., 2014). Without defining each construct's poles, how can we determine if two constructs are overlapping or separate (i.e., discriminant validation)? How can we propose expected relationships (e.g., criterion-related evidence) if the construct bounds are undefined? Thus, defining the continuum's poles is fundamental for having a clear conception of the construct and gathering and interpreting validity evidence. Specify the nature of the continuum gradations. In addition to its poles, construct continua have one other characteristic: their gradations, or degrees. Thus, another aspect of defining the continuum (and thus the construct itself) is specifying just what these gradations, or degrees, really mean. This can be achieved by seeking to answer the questions: "What are the degrees of the continuum on which individuals are scaled?" or "What is the quality that separates high from low scores?" It has been observed that there are common categories for the meaning of continuum degrees, including experienced intensity, behavioral extremity, belief strength, and the frequency or timing of behaviors (Tay & Jebb, 2018). The continuum of emotion-type constructs often has a gradation that is focused on experienced intensity. For example, in the measurement of emotions, someone who experiences a target emotion with greater intensity is typically regarded as higher on the construct continuum than someone experiencing lesser intensity. Another type of gradation can be behavioral extremity. This typically applies to personality-type constructs where the construct continuum reflects the typical or average patterns of thoughts, feelings, cognitions, and behaviors.
For example, higher levels on the conscientiousness continuum will typically involve greater levels of behaviors such as self-discipline, organization, and dependability (Roberts et al., 2009). In contrast, lower levels on the conscientiousness continuum may have some of these behaviors but with less extremity. The gradations of the construct continuum can also take on the quality of belief strength. This is typically seen in attitudinal measurement, as attitudes are defined as “evaluative judgments that … vary in strength” (Crano & Prislin, 2006, p. 347). For example, political attitudes can vary by strength of belief on the political spectrum. Finally, the measurement of psychological constructs can also take the form of frequency or timing of behaviors. For example, the measurement of health-related behaviors often involves estimating the frequency and timing of behaviors, such as the frequency of exercise.
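A small simulation can make the distinction between gradation types concrete. In this sketch (the two hypothetical respondents, the 0-10 momentary-intensity scale, and the felt-emotion threshold are all invented for illustration), frequency-based and intensity-based scorings of the same experience-sampling data rank the same people in opposite orders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated experience-sampling data: momentary positive-affect intensity
# (0-10 scale) over 100 occasions for two hypothetical respondents.
frequent_mild = np.clip(rng.normal(4.0, 1.0, 100), 0, 10)      # often, mildly happy
rare_intense = np.clip(rng.choice([0.5, 9.0], 100, p=[0.8, 0.2])
                       + rng.normal(0.0, 0.3, 100), 0, 10)     # rarely, intensely happy

def frequency(x, threshold=3.0):
    """Gradation as frequency: proportion of occasions the emotion is felt."""
    return float(np.mean(x > threshold))

def intensity(x, threshold=3.0):
    """Gradation as intensity: mean intensity on occasions the emotion is felt."""
    felt = x[x > threshold]
    return float(felt.mean()) if felt.size else 0.0

# The two gradation types rank the same respondents in opposite orders,
# so a scale must commit to one of them.
assert frequency(frequent_mild) > frequency(rare_intense)
assert intensity(rare_intense) > intensity(frequent_mild)
```

Because the two scorings disagree about who is "higher" in positive affect, a scale whose gradation type is left unspecified cannot be unambiguously interpreted.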


Relating this to validity, since defining the continuum's degrees is another aspect of construct definition, it generally has the same importance for validity as defining the poles. In clearly specifying the nature of the continuum gradation, researchers can more readily evaluate test content validity evidence – namely, does the item content match the type of continuum specified? For example, if the continuum degrees are defined as behavioral frequencies, one can evaluate whether the items indeed capture frequencies. If positive affect is the target construct, and the gradations are how frequently it is experienced, then items relating to its prevalence, but not its emotional intensity, would be appropriate. Specifying the nature of these degrees also helps interpret the kind of validity evidence concerned with relations to other constructs. This is because, for the same construct, different types of gradations can yield differential relations with other constructs. For example, when emotions are conceptualized and measured in terms of intensity, positive and negative emotions show high negative correlations; when measured in terms of frequency, they do not, and emotional intensity and emotional frequency are themselves fairly independent (Diener et al., 1985). Importantly, the frequency of affect, rather than the intensity of affect, predicts non-self-reported measures of happiness (Diener et al., 2009). Establishing the quality of the continuum's degrees helps ensure that researchers correctly interpret the core ingredient that underlies a construct label.

OPERATIONALIZING THE CONTINUUM

As can be seen, defining the construct’s continuum is the first and most important part of continuum specification. This is because it is really part of defining the construct, which is fundamental (Podsakoff et al., 2016). However, once the continuum has been fully defined, one must then make sure it is operationalized properly in scale development. There are several parts to this, including generating items within the bounds of the continuum, choosing appropriate response options, assessing dimensionality with a suitable statistical method, and identifying the response process used by individuals along the continuum. Each of these is required for valid measurement and should be done alongside conventional validation procedures (e.g., generating items that fit the continuum can be done when the items are checked for content relevance). They are discussed in turn.


Generate items that fit the continuum’s poles and gradations. Continuum specification proposes that assessing the construct requires researchers to create and select items that span the proposed construct continuum. To ensure that the items fall within the continuum, one should first consider how the end poles of the construct were defined. Is the construct unipolar or bipolar? What is the meaning of the lower pole? Simply put, developers should include only items that lie within the continuum’s bounds. For example, say that counterproductive work behaviors (CWBs) have been defined as a unipolar construct, where the upper pole is the presence of these behaviors and the lower pole is their absence. The scale items should therefore contain examples of CWBs, and reverse-worded items that reflect positive work behaviors should be excluded. Failing to do this would result in measuring a separate, unintended construct. As a general rule, reverse-worded items are inappropriate for unipolar continua. By contrast, bipolar continua can (and usually should) be measured using reverse-worded items to cover their entire range. For example, suppose emotional valence has been defined as having a bipolar continuum with positive and negative affect at the poles. In that case, one should write and select both positive and negative emotion items to ensure that the construct’s full range has been sufficiently captured. The advice to include only items that lie within the continuum bounds is common sense, but it can be overlooked when the continuum is not taken into account in measurement. Another important aspect is to ensure that the items also match the nature of the continuum gradations, whether they be experienced intensity, behavioral extremity, belief strength, or some other type.
For example, if emotion has had its gradations defined as experienced intensities, then one should write items that reflect this (e.g., “I have strong feelings of happiness”). If the gradations have been defined as frequencies, then one should only write items that reflect frequencies (e.g., “I frequently feel happy”). In summary, one needs to ensure that the scale item content always matches how the continuum has been defined, both in terms of its polarity and its gradations. Traditionally, researchers have focused only on whether the item content reflects the defined construct content when evaluating scale items for content validity evidence. However, continuum specification shows that there are additional issues to consider. Validity evidence with regard to content should also address whether the items match the polarity of the continuum and the nature of its gradations, since these are aspects of how the construct has been defined. This evidence can be
provided when the scale developer (a) states how the continuum has been defined and (b) shows that each item fits this definition. The primary problem that can occur when this is not done is measurement contamination – when an unintended construct is being measured because the items reflect different kinds of end poles and/or gradations. Choose response options that match the continuum. In addition to the scale items, another aspect of continuum specification is the consideration of response options. Just like item content, researchers need to ensure that the response options match how the construct’s polarity has been defined. In doing so, they promote validity evidence with regard to scale content because what is being measured will actually match the theoretical construct. It is vital to recognize here that different verbal response options imply different polarities. For instance, some labels imply a bipolar continuum by directly naming opposing content (e.g., “extremely unhappy” to “extremely happy”). These should not be used if the continuum has been defined as unipolar (e.g., when “less happiness” does not mean “unhappiness”) (Russell & Carroll, 1999). Other labels may imply a more unipolar continuum (e.g., “not at all” to “extremely”). Moreover, some response options may be ambiguous with regard to polarity. This frequently happens when the item wording can be construed as either bipolar or unipolar. For instance, disagreeing with the conscientiousness item “I am hardworking” could imply either that the individual views themselves as lazy (bipolar) or as merely having a non-exceptional work ethic (unipolar). Item/response option combinations like this may be acceptable for measurement as long as there are other items that identify the construct as either unipolar or bipolar. Nevertheless, it is important to recognize that certain types of response options may be inherently ambiguous to participants.
If the issue of polarity is something that needs to be carefully tested (e.g., the continuum of normal and abnormal personality, or the continuum between CWB and OCB), then it is critical to align the response options to ensure an accurate assessment. Apart from polarity, the response options should also reflect the nature of the continuum’s gradations. For example, if the gradations have been defined as frequencies, and the items have been written this way, then the response options should also reflect frequencies (i.e., “not at all frequent” to “very frequent”). If the gradations and items are of a different type, such as experienced intensity, then the response options should reflect this as well. Response options reflecting a different kind of gradation would cause ambiguity and confusion for the respondents, introducing measurement error and undermining validity. Assess dimensionality with the appropriate statistical method. In addition to the scale’s content (e.g., items and response options), another essential part of validation is collecting evidence with regard to the scale’s internal structure (American Psychological Association et al., 2014). Specifically, this involves investigating features like the number of latent variables in the scale (dimensionality) and how its items are related to these variables. Continuum specification entails being aware that there are different ways of assessing dimensionality, and some are more appropriate than others depending on how the continuum has been defined. This part of continuum specification is actually fairly simple: if the construct is either unipolar or bipolar, then the typical approach of using factor analysis (FA) can be used (Brown, 2006). This will cover the large majority of cases, as most psychological constructs are either unipolar or bipolar. However, for the combinatorial constructs described earlier (i.e., those whose poles are defined as combinations of constructs), multidimensional scaling (MDS) offers a better method of providing internal structure validity evidence. Why is this the case? Combinatorial constructs have end poles that comprise different types of constructs within the same domain. For example, the RIASEC model of vocational interests has six different vocational interest types, but they are all types within a common domain of vocational interests. As Prediger (1982) proposes, the People-Things and Data-Ideas dimensions are combinatorial constructs. MDS enables researchers to sort the different types, typically within a two-dimensional space, and identify dimensions that combine different constructs within the same domain.
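As an illustration of how MDS recovers such a configuration, the sketch below runs classical MDS (the double-centering algorithm behind R’s cmdscale()) in Python on a RIASEC-style correlation matrix. The correlation values are invented for illustration and are not real interest data:

```python
import numpy as np

# Hypothetical RIASEC-style correlation matrix (illustrative values only):
# adjacent interest types correlate more highly than opposite types.
types = ["R", "I", "A", "S", "E", "C"]
corr = np.array([
    [1.00, 0.50, 0.20, 0.05, 0.20, 0.50],
    [0.50, 1.00, 0.50, 0.20, 0.05, 0.20],
    [0.20, 0.50, 1.00, 0.50, 0.20, 0.05],
    [0.05, 0.20, 0.50, 1.00, 0.50, 0.20],
    [0.20, 0.05, 0.20, 0.50, 1.00, 0.50],
    [0.50, 0.20, 0.05, 0.20, 0.50, 1.00],
])

# Convert correlations to dissimilarities, then apply classical MDS.
d2 = 2 * (1 - corr)                  # squared distances from a common mapping
n = len(corr)
J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
B = -0.5 * J @ d2 @ J                # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]       # largest eigenvalues first
coords = vecs[:, order[:2]] * np.sqrt(vals[order[:2]])

# The six types fall in a circular (circumplex) arrangement in two dimensions.
for t, (x, y) in zip(types, coords):
    print(f"{t}: ({x: .2f}, {y: .2f})")
```

Because adjacent types correlate more highly than opposite types, the recovered two-dimensional configuration is roughly hexagonal, with opposite types (e.g., Realistic and Social) farthest apart – the kind of combinatorial structure described above.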
The key difference between MDS and factor analysis is that MDS focuses on the relative distances of the different types with respect to one another, whereas factor analysis focuses on the absolute level of the correlations. Therefore, in MDS, it is possible to have two end poles of a continuum that are positively correlated with one another (e.g., Realistic interests and Social interests), with this positive correlation being relatively smaller than the correlations with other types (e.g., Realistic interests and Investigative interests). By contrast, factor analysis is unlikely to yield a dimensional model in which two end poles of a continuum are positively correlated with one another. As a practical example, conducting MDS on RIASEC data leads to combinatorial constructs (Rounds & Tracey, 1993), whereas conducting FA on RIASEC data leads to six unipolar dimensional constructs (Su et al., 2019). More generally, MDS
is applied to other combinatorial construct continua beyond vocational interests, as can be seen in the interpersonal circumplex (Wiggins, 1982) in personality research and basic human values (Schwartz, 1992) in values research. The type of statistical analysis that one uses to assess the dimensionality of a construct thus depends on its continuum. Therefore, when evaluating validity evidence of the internal structure, one should not assume straightforward equivalence between MDS and FA (or other variants). Accurately inferring the internal structure requires researchers to determine the construct’s polarity and identify which statistical technique is most appropriate. Assess the item response process. Early in psychological measurement research, psychometricians were aware that there are two fundamentally different ways of responding to scale items along the construct continuum (Coombs, 1964): ideal point responding and dominance responding (see Tay & Ng, 2018, for a review). Ideal point responding is commonly seen in personality, interest, and attitude scales. Here, individuals try to find a response option that best matches their own level (the “ideal point”), and individuals who are significantly higher or lower than the item tend not to endorse it. For example, individuals who are “extremely happy” or “extremely unhappy” are both less likely to endorse being “moderately happy.” As a result, a higher item score does not invariably mean a higher standing on the attribute, because respondents can disagree with a middle-level item for multiple reasons. By contrast, in dominance responding, as the term “dominance” suggests, individuals whose level is higher than the item location will always have a higher (or at least equal) probability of endorsing the item. This is the case in ability testing. For example, individuals with higher cognitive ability will always have a greater probability of correctly responding to a test item with a lower difficulty.
When we assume this kind of responding, a higher item score always means that an individual has a higher location on the continuum. More recent research has shown that individuals responding to typical self-reported behaviors, such as attitudes, personality, interests, and emotions, tend to use ideal point responding (Tay & Ng, 2018). The fascinating thing is that commonly applied statistical measurement techniques such as reliability estimation (e.g., alpha, omega) and dimensionality analyses (e.g., FA) assume that individuals use a dominance item response model. However, this may not be appropriate in many cases, especially when one is interested in assessing the entire construct continuum. When individuals use an ideal point response process, using standard factor analysis techniques to analyze the dimensionality of scale items can result in spurious dimensions emerging, as has been shown mathematically (Davison, 1977). A practical example can be seen when assessing the intensity of emotional valence on a bipolar continuum (i.e., happy to sad). Individuals using an ideal point response process tend to co-endorse low levels of happiness and sadness. At extreme levels of happiness or sadness, there are hardly any co-endorsements of happiness and sadness (Tay & Kuykendall, 2017). Because of the co-endorsements of low levels of happiness and sadness, conducting factor analysis on this bipolar continuum would yield two orthogonal dimensions that are often interpreted as positive emotion being orthogonal to negative emotion (Tay & Drasgow, 2012). It is important that the assumptions of the dimensionality analysis match the underlying item response process. This furnishes validity evidence with regard to response processes (American Psychological Association et al., 2014), which is typically understood in terms of whether the assessment indeed taps into the claimed underlying psychological or cognitive process. To do so, researchers must determine whether an ideal point or a dominance item response process better describes the collected scale data, especially when the construct continua concern self-reported typical behaviors such as attitudes, interests, personality, and values. This can be achieved by comparing the fit of ideal point item response models against that of dominance item response models (Tay, Ali, et al., 2011). When ideal point responding is found to occur, it is helpful to use ideal point response models (Tay & Ng, 2018), such as the Generalized Graded Unfolding Model (Roberts et al., 2004). Practically, when dominance item response models are misapplied to ideal point response processes, this can lead to an incorrect inference of the number of dimensions.
For example, it has been shown in organizational research that the bivariate model of emotions (positive emotions and negative emotions as two unipolar continua) may be a statistical artifact of applying dominance response models like factor analysis to a single underlying bipolar continuum when respondents use an ideal point response process (Tay & Drasgow, 2012). The other issue that can occur in misapplying dominance response models to ideal point response processes is that curvilinear effects become harder to detect. Indeed, simulations have shown that when responses are produced by an ideal point model, ideal point models have higher power and greater accuracy in estimating curvilinear effects as compared to dominance models (Cao et al., 2018). Given the substantial interest in psychology in curvilinear relationships, such as whether individual differences like personality and skill have decreasing utility for outcomes past a certain point (i.e., “too much of a good thing”) (e.g., Le et al., 2011; Rapp et al., 2013), it is critical to use appropriate modeling techniques. Critically, this provides stronger validity evidence with regard to relations to other measures (American Psychological Association et al., 2014).
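The contrast between the two response processes, and the dimensionality artifact it can create, can be simulated. In the sketch below (Python; parameter values are invented, and a simple Gaussian unfolding curve stands in for a formal ideal point IRT model such as the GGUM), a mild “happy” item and a mild “sad” item are generated from a single bipolar continuum under each response process:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = rng.normal(size=20_000)  # latent bipolar valence: negative = sad, positive = happy

def dominance_prob(theta, b, a=1.5):
    """Dominance (2PL-style): endorsement rises monotonically with theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def ideal_point_prob(theta, delta, a=1.0):
    """Ideal point (Gaussian unfolding sketch): endorsement peaks at theta = delta."""
    return np.exp(-a * (theta - delta) ** 2)

# Two mildly worded items located near the middle of the continuum,
# answered via an ideal point process.
mild_happy = (rng.random(theta.size) < ideal_point_prob(theta, +0.5)).astype(float)
mild_sad = (rng.random(theta.size) < ideal_point_prob(theta, -0.5)).astype(float)
r_ideal = np.corrcoef(mild_happy, mild_sad)[0, 1]

# The same items under a dominance process, scored from opposite ends.
dom_happy = (rng.random(theta.size) < dominance_prob(theta, +0.5)).astype(float)
dom_sad = (rng.random(theta.size) < dominance_prob(-theta, +0.5)).astype(float)
r_dom = np.corrcoef(dom_happy, dom_sad)[0, 1]

print(f"ideal point r = {r_ideal:.2f}, dominance r = {r_dom:.2f}")
```

Under dominance responding the two mild items correlate clearly negatively, consistent with a single bipolar factor; under ideal point responding they are nearly uncorrelated, because respondents near the midpoint co-endorse both items – a pattern standard factor analysis would misread as two separate dimensions.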

CONCLUSION

Scale creation and validation are a core part of survey research. In scale creation, researchers have often neglected the construct continuum in their definition, operationalization, and analysis of constructs. This is problematic because the continuum is what produces variation in scores. If a researcher comes across a scale that did not have its continuum defined, they can attempt to define it themselves post hoc. It is likely that the scale, by virtue of its items and response options, will be operating with an implicit polarity and type of gradations. However, the work of continuum specification is to make these attributes explicit, and this can be done after the fact, if necessary. This will not necessarily lead to the revision of the scale, but it may. If the researcher then sees that there are some items that are inappropriate, such as reverse-worded items in a measure of a unipolar construct, then these items will have to be removed. It is also possible that a different set of response options might appear more appropriate after the continuum has been defined. If any revisions are made, then some amount of validation will likely have to be done, as scale revisions do change our inferences about validity (Smith et al., 2000). However, the extent to which past validity evidence bears on the revised measure depends on how large the changes were. It is also possible that these revisions could help clarify topics and debates in particular areas. We see the potential of this for areas like the polarity of affect, the collection of discriminant evidence (which requires knowing the boundaries between constructs), and the growing distinction between normal and “dark” personality traits (e.g., Samuel & Tay, 2018). This chapter provides a foundational introduction to the process of continuum specification based on the pioneering work on the topic (Tay & Jebb, 2018). However, these ideas were developed further here to add more clarity to how continuum specification is realized in scale creation and relates to validation issues. Psychological measurement and scale creation can be advanced through better awareness of these issues. We hope that the integration of this concept into scale development and evaluation practices can improve the rigor of our scientific practice.

REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. Guilford Press.
Cao, M., Song, Q. C., & Tay, L. (2018). Detecting curvilinear relationships: A comparison of scoring approaches based on different item response models. International Journal of Testing, 18, 178–205.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–19.
Coombs, C. H. (1964). A theory of data. John Wiley.
Crano, W. D., & Prislin, R. (2006). Attitudes and persuasion. Annual Review of Psychology, 57, 345–74.
Davison, M. L. (1977). On a metric, unidimensional unfolding model for attitudinal and developmental data. Psychometrika, 42, 523–48.
DeVellis, R. F. (1991). Scale development: Theory and applications. Sage.
Diener, E., Larsen, R. J., Levine, S., & Emmons, R. A. (1985). Intensity and frequency: Dimensions underlying positive and negative affect. Journal of Personality and Social Psychology, 48, 1253–65.
Diener, E., Sandvik, E., & Pavot, W. (2009). Happiness is the frequency, not the intensity, of positive versus negative affect. In F. Strack, M. Argyle, & N. Schwarz (Eds.), Subjective well-being: An interdisciplinary perspective (pp. 119–40). Pergamon Press.
Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1, 104–21. https://doi.org/10.1177/109442819800100106
Holland, J. L. (1959). A theory of vocational choice. Journal of Counseling Psychology, 6, 35–45.
Jebb, A. T., Ng, V., & Tay, L. (2021). A review of key Likert scale development advances: 1995–2019. Frontiers in Psychology, 12, 637547. https://doi.org/10.3389/fpsyg.2021.637547
Le, H., Oh, I.-S., Robbins, S. B., Ilies, R., Holland, E., & Westrick, P. (2011). Too much of a good thing: Curvilinear relationships between personality traits and job performance. Journal of Applied Psychology, 96(1), 113–33.
McCrae, R. R., & John, O. P. (1992). An introduction to the five-factor model and its applications. Journal of Personality, 60, 175–215.
Melson-Silimon, A., Harris, A. M., Shoenfelt, E. L., Miller, J. D., & Carter, N. T. (2019). Personality testing and the Americans With Disabilities Act: Cause for concern as normal and abnormal personality models are integrated. Industrial and Organizational Psychology, 12(2), 119–32.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill.
Podsakoff, P., MacKenzie, S., & Podsakoff, N. (2016). Recommendations for creating better concept definitions in the organizational, behavioral, and social sciences. Organizational Research Methods, 19, 159–203.
Prediger, D. J. (1982). Dimensions underlying Holland’s hexagon: Missing link between interests and occupations. Journal of Vocational Behavior, 21, 259–87.
Rapp, A. A., Bachrach, D. G., & Rapp, T. L. (2013). The influence of time management skill on the curvilinear relationship between organizational citizenship behavior and task performance. Journal of Applied Psychology, 98(4), 668–77.
Roberts, B. W., Jackson, J. J., Fayard, J. V., Edmonds, G., & Meints, J. (2009). Conscientiousness. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behavior (pp. 369–81). Guilford Press.
Roberts, J. S., Fang, H., Cui, W., & Wang, Y. (2004). GGUM2004: A Windows-based program to estimate parameters in the generalized graded unfolding model. Applied Psychological Measurement, 30, 64–5.
Rounds, J., & Tracey, T. J. (1993). Prediger’s dimensional representation of Holland’s RIASEC circumplex. Journal of Applied Psychology, 78, 875–90.
Rounds, J., & Zevon, M. A. (1983). Multidimensional scaling research in vocational psychology. Applied Psychological Measurement, 7, 491–510.
Russell, J. A., & Carroll, J. M. (1999). On the bipolarity of positive and negative affect. Psychological Bulletin, 125, 3–30.
Samuel, D. B., & Tay, L. (2018). Aristotle’s golden mean and the importance of bipolarity for personality models: A commentary on “Personality traits and maladaptivity: Unipolarity versus bipolarity”. Journal of Personality.
Schwartz, S. H. (1992). Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. Advances in Experimental Social Psychology, 25, 1–65. https://doi.org/10.1016/S0065-2601(08)60281-6
Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12, 102.
Spector, P. E., Bauer, J. A., & Fox, S. (2010). Measurement artifacts in the assessment of counterproductive work behavior and organizational citizenship behavior: Do we know what we think we know? Journal of Applied Psychology, 95(4), 781–90.
Su, R., Tay, L., Liao, H. Y., Zhang, Q., & Rounds, J. (2019). Toward a dimensional model of vocational interests. Journal of Applied Psychology, 104(5), 690–714.
Tay, L., Ali, U. S., Drasgow, F., & Williams, B. A. (2011). Fitting IRT models to dichotomous and polytomous data: Assessing the relative model–data fit of ideal point and dominance models. Applied Psychological Measurement, 35, 280–95.
Tay, L., & Drasgow, F. (2012). Theoretical, statistical, and substantive issues in the assessment of construct dimensionality: Accounting for the item response process. Organizational Research Methods, 15(3), 363–84.
Tay, L., & Jebb, A. T. (2018). Establishing construct continua in construct validation: The process of continuum specification. Advances in Methods and Practices in Psychological Science, 1(3), 375–88.
Tay, L., & Kuykendall, L. (2017). Why self-reports of happiness and sadness may not necessarily contradict bipolarity: A psychometric review and proposal. Emotion Review, 9, 146–54. https://doi.org/10.1177/1754073916637656
Tay, L., & Ng, V. (2018). Ideal point modeling of noncognitive constructs: Review and recommendations for research. Frontiers in Psychology, 9, 2423.
Tay, L., Su, R., & Rounds, J. (2011). People-things and data-ideas: Bipolar dimensions? Journal of Counseling Psychology, 58, 424–40. https://doi.org/10.1037/a0023488
Wiggins, J. S. (1982). Circumplex models of interpersonal behavior in clinical psychology. In P. C. Kendall & J. N. Butcher (Eds.), Handbook of research methods in clinical psychology (pp. 183–221). Wiley.

18 Exploratory/Confirmatory Factor Analysis and Scale Development

Larry J. Williams and Andrew A. Hanna

A considerable portion of management research uses employee surveys in which respondents are asked to describe their experiences at their workplace via questionnaires. These surveys can be used to test theories, assess workplace climate, or evaluate interventions or changes in organizational policies. While theory describes the relations among the constructs examined, statistical analysis of the employee-provided data is ultimately used to reach conclusions about construct relations. A key to the success of the measurement process is that the items used to represent the constructs are, in fact, good representations of the constructs, which is to say they have construct validity. Within this context, exploratory and confirmatory factor analyses play an important role, as both techniques yield statistical information that can be used to judge the quality of the resulting items and their value as measures of their presumed underlying constructs. Success at improving construct validity in organizational research has been facilitated by methodologists who focus on what happens before the items of the questionnaire are presented to the sample whose responses will be included in the factor analysis used to develop and refine the scales to be used. For example, Podsakoff, MacKenzie, and Podsakoff (2016) have emphasized the importance of having clear conceptual definitions of constructs. Next in the research process is the development of specific items; if the judgment is made that the content of the items matches the definition of the construct, the researcher concludes that the items have content validity (see Colquitt, Sabey, Rodell, & Hill, 2019). It has long been recognized that there is value in combining items to form scales, even when concepts are clearly defined and items are demonstrated to have content validity. When multiple constructs are included in the research design, evidence is considered as to whether items that are intended to measure a common construct are highly correlated with each other. It is also desired that these same items not correlate highly with items of different constructs. Exploratory and confirmatory factor analyses (EFA, CFA) are data analysis tools for evaluating a matrix of correlations based on items measuring multiple constructs. Relevant evidence from such a factor analysis can inform decisions about (a) whether items that are intended to measure the same thing do so, (b) whether what they are measuring is distinct from other things being measured, and (c) ultimately, how many different things are being measured. Increased use of EFA and CFA in organizational research increases the importance of key decision points in their implementation. Statistical developments with both methods have created more choices for researchers using one or both techniques, and researchers have new software programs available to use. This leads to a challenge for those wanting to follow best practices as they evaluate items and develop scales for use in theory testing or workplace environment assessment. The purpose of this chapter is to provide an overview of current EFA/CFA methods as used for scale development, including an organization-based example that demonstrates these practices with readily available software. Our presentation of this example will highlight key differences and similarities between the models underlying EFA and CFA while advocating that both can be, should be, and typically are used in a confirmatory way.

EFA/CFA EXAMPLE: TRUSTWORTHINESS

Preliminary Analysis

In the following example, we use data collected by Hanna (2021) that measured the construct of trustworthiness. Extant research conceptualizes trustworthiness as a multidimensional construct consisting of three facets – ability (Abil), benevolence (Bene), and integrity (Inte) (Frazier, Johnson, Gavin, Gooty, & Snow, 2010). The trustworthiness measure we used includes nine items – three items reflecting each of the three dimensions – as utilized by Jones and Shah (2015). Our labels for the items will be Ab1-Ab3, Be1-Be3, and In1-In3. Our analyses for this example used the “lavaan,” “psych,” “GPArotation,” and “corrplot” R packages. These packages must be installed and activated to run the included syntax. We used R-based packages for our example because they are freely available, are increasingly popular, and can be used to conduct preliminary data management and evaluation as well as both exploratory and confirmatory factor analyses within the same general platform. We also checked the necessary properties of the data (for a comprehensive list, see Watkins, 2018). We next conducted a subjective examination of the correlations among items to ensure that conducting an EFA/CFA is appropriate. As noted in our introduction, because we hope to find support for three dimensions, we expect that the three ability items will be highly correlated with each other, as will the three benevolence items and the three integrity items. We also anticipate that the correlations between items across dimensions will be lower. This pattern of three distinct clusters of correlations is needed for the three factors to emerge and be supported in both the EFA and CFA. An examination of these correlations prior to the EFA/CFA can provide preliminary support for our proposed underlying dimensionality. More importantly, if the desired pattern of correlations is not obtained, these unexpected values can help us understand problems that emerge in the subsequent multivariate analysis. If beginning with a dataset that contains items not used in the EFA, the first step is to create a data frame with only the items to be included. To produce this correlation matrix, begin with the following syntax:

[1] names(Data_Full)
[2] TWData