Research Methods in Human Resource Management: Toward Valid Research-Based Inferences 1648020887, 9781648020889


Table of contents:
Contents
1. Perspectives on the Validity of Inferences from Research in Human Resource Management • Eugene F. Stone-Romero and Patrick J. Rosopa
2. Advances in Research Methods: What Have We Neglected? • Neal Schmitt
3. Research Design and Causal Inferences in Human Resource Management Research • Eugene F. Stone-Romero
4. Heteroscedasticity in Organizational Research • Amber N. Schroeder, Patrick J. Rosopa, Julia H. Whitaker, Ian N. Fairbanks, and Phoebe Xoxakos
5. Kappa and Alpha and Pi, Oh My: Beyond Traditional Interrater Reliability Using Gwet’s AC1 Statistic • Julie I. Hancock, James M. Vardaman, and David G. Allen
6. Evaluating Job Performance Measures: Criteria for Criteria • Angelo S. DeNisi and Kevin R. Murphy
7. Research Methods in Organizational Politics: Issues, Challenges, and Opportunities • Liam P. Maher, Zachary A. Russell, Samantha L. Jordan, Gerald R. Ferris, and Wayne A. Hochwarter
8. Range Restriction in Employment Interviews: An Influence Too Big to Ignore • Allen I. Huffcutt
9. We’ve Got (Safety) Issues: Current Methods and Potential Future Directions in Safety Climate Research • Lois E. Tetrick, Robert R. Sinclair, Gargi Sawhney, and Tiancheng (Allen) Chen
Biographies


Research Methods in Human Resource Management: Toward Valid Research-Based Inferences

IAP—INFORMATION AGE PUBLISHING P.O. BOX 79049 CHARLOTTE, NC 28271-7047 WWW.INFOAGEPUB.COM

Stone-Romero Rosopa

In this volume of Research in Human Resource Management we consider the overall validity of inferences stemming from empirical research in human resource management (HRM), industrial and organizational psychology, organizational behavior and allied disciplines. The chapters in this volume address the overall validity of inferences as a function of four facets of validity, i.e., internal, external, construct, and statistical conclusion. The contributions address validity issues for specific foci of study (e.g., interviews, safety, and organizational politics) as well as those that span multiple foci (e.g., neglected research methods, causal inferences in research, and heteroscedasticity in measured variables). The general objective of the chapters is to provide basic and applied researchers with “tools” that will help them to design and conduct empirical studies that have high levels of validity, improving both the science and practice of HRM.

RESEARCH METHODS IN HUMAN RESOURCE MANAGEMENT

Toward Valid Research-Based Inferences Edited by

Eugene F. Stone-Romero Patrick J. Rosopa

Cover figure: the validity of research results as a function of construct validity, internal validity, external validity, and statistical conclusion validity.

A VOLUME IN: RESEARCH IN HUMAN RESOURCE MANAGEMENT

Research Methods in Human Resource Management: Toward Valid Research-Based Inferences

A Volume in: Research in Human Resource Management Series Editors Dianna L. Stone James H. Dulebohn

Research in Human Resource Management
Series Editors: Dianna L. Stone (Universities of New Mexico, Albany, and Virginia Tech) and James H. Dulebohn (Michigan State University)

Diversity and Inclusion in Organizations (2020)
Dianna L. Stone, James H. Dulebohn, & Kimberly M. Lukaszewski

The Only Constant in HRM Today is Change (2019)
Dianna L. Stone & James H. Dulebohn

The Brave New World of eHRM 2.0 (2018)
James H. Dulebohn & Dianna L. Stone

Human Resource Management Theory and Research on New Employment Relationships (2016)
Dianna L. Stone & James H. Dulebohn

Human Resource Strategies for the High Growth Entrepreneurial Firm (2006)
Robert L. Heneman & Judith Tansky

IT Workers Human Capital Issues in a Knowledge Based Environment (2006)
Tom Ferratt & Fred Niederman

Human Resource Management in Virtual Organizations (2002)
Robert L. Heneman & David B. Greenberger

Innovative Theory and Empirical Research on Employee Turnover (2002)
Rodger Griffeth & Peter Hom

COMING SOON

Managing Team Centricity in Modern Organizations
James H. Dulebohn, Brian Murray, & Dianna L. Stone

Forgotten Minorities
Dianna L. Stone, Kimberly M. Lukaszewski, & James H. Dulebohn

Research Methods in Human Resource Management: Toward Valid Research-Based Inferences

Edited by

Eugene F. Stone-Romero Patrick J. Rosopa

INFORMATION AGE PUBLISHING, INC. Charlotte, NC • www.infoagepub.com

Library of Congress Cataloging-In-Publication Data The CIP data for this book can be found on the Library of Congress website (loc.gov). Paperback: 978-1-64802-088-9 Hardcover: 978-1-64802-089-6 E-Book: 978-1-64802-090-2

Copyright © 2020 Information Age Publishing Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the publisher. Printed in the United States of America

CONTENTS

1. Perspectives on the Validity of Inferences from Research in Human Resource Management ......... 1
   Eugene F. Stone-Romero and Patrick J. Rosopa

2. Advances in Research Methods: What Have We Neglected? ......... 5
   Neal Schmitt

3. Research Design and Causal Inferences in Human Resource Management Research ......... 39
   Eugene F. Stone-Romero

4. Heteroscedasticity in Organizational Research ......... 67
   Amber N. Schroeder, Patrick J. Rosopa, Julia H. Whitaker, Ian N. Fairbanks, and Phoebe Xoxakos

5. Kappa and Alpha and Pi, Oh My: Beyond Traditional Interrater Reliability Using Gwet's AC1 Statistic ......... 87
   Julie I. Hancock, James M. Vardaman, and David G. Allen

6. Evaluating Job Performance Measures: Criteria for Criteria ......... 107
   Angelo S. DeNisi and Kevin R. Murphy

7. Research Methods in Organizational Politics: Issues, Challenges, and Opportunities ......... 135
   Liam P. Maher, Zachary A. Russell, Samantha L. Jordan, Gerald R. Ferris, and Wayne A. Hochwarter

8. Range Restriction in Employment Interviews: An Influence Too Big to Ignore ......... 173
   Allen I. Huffcutt

9. We've Got (Safety) Issues: Current Methods and Potential Future Directions in Safety Climate Research ......... 197
   Lois E. Tetrick, Robert R. Sinclair, Gargi Sawhney, and Tiancheng (Allen) Chen

Biographies ......... 227

CHAPTER 1

PERSPECTIVES ON THE VALIDITY OF INFERENCES FROM RESEARCH IN HUMAN RESOURCE MANAGEMENT Eugene F. Stone-Romero and Patrick J. Rosopa

Empirical research in Human Resource Management (HRM) and the related fields of industrial and organizational psychology and organizational behavior has focused on such issues as recruiting, testing, selection, training, motivation, compensation, and employee well-being. A review of the literature on these and other topics suggests that less-than-optimal methods have often been used in HRM studies. Among the methods-related problems are the use of (a) measures or manipulations that have little or no construct validity, (b) samples of units (e.g., participants, organizations) that bear little or no correspondence to target populations, (c) research designs that have little or no potential for supporting valid causal inferences, (d) samples that are too small to provide for adequate statistical power, and (e) data analytic strategies that are inappropriate for the issues addressed by a study. As a result, our understanding of various HRM phenomena has suffered, and improved methods may serve to enhance both the science and practice of HRM and allied disciplines.



In order for the results of empirical studies to have a high level of validity, it is critical that the studies themselves have construct validity, internal validity, external validity, and statistical conclusion validity (Campbell & Stanley, 1963; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). Construct validity has to do with the degree to which the measures and manipulations used in an empirical study are faithful representations of underlying constructs. Internal validity reflects the degree to which the design of a study allows for valid inferences about causal connections between the variables considered by a study. External validity represents the extent to which the findings of a study generalize to different sampling particulars of units, treatments, research settings, and outcomes. Finally, statistical conclusion validity is the degree to which inferences stemming from the use of statistical methods are correct.

Valid research results are vital for both science and practice in HRM and allied fields. With respect to science, the confirmation of a theory hinges on the validity of the empirical studies that are used to support it. For example, research aimed at testing a theory that X causes Y is of little or no value unless it is based on studies that use randomized experimental designs. In addition, the results of valid research are essential for the development and implementation of HRM policies and practices. For example, attempts to reduce employee turnover will not meet with success unless an organization measures this criterion in a construct-valid manner.

PURPOSE OF THE SPECIAL ISSUE

In view of the above, the purpose of this Special Issue (SI) of Research in Human Resource Management is to provide researchers with resources that will enable them to improve the internal validity, external validity, construct validity, and statistical conclusion validity (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish, Cook, & Campbell, 2002) of research in HRM. Sound research in these fields should serve to improve both the science and practice of HRM. In the interest of promoting such research, the authors of the chapters in this SI specify research methods-related problems in HRM and offer recommendations for dealing with them.

The chapters in this volume are arranged in terms of the breadth of the issues they address. More specifically, the chapters that have the broadest scope are presented first, followed by those that have a narrower focus. Brief summaries of the chapters, in order of their appearance, are as follows:

Neal Schmitt (Michigan State University) provides a comprehensive contribution that considers such issues as the development and use of quantitative methods, estimates of reliability, IRT methods, Big Data, structural equation modeling, meta-analysis, hierarchical linear modeling, computational modeling, regression analysis, confirmatory factor analysis, the analysis of data from research using longitudinal designs, the timing of data collection, and effect size estimation. The topics covered by Schmitt deal with a number of facets of validity (e.g., internal, construct, statistical conclusion, and external).


Eugene F. Stone-Romero (University of New Mexico) explains the important connection between experimental design options (randomized-experimental, quasi-experimental, and non-experimental) and the validity of inferences about causal connections between variables. In the process, he shows why (a) randomized-experimental designs provide the firmest basis for causal inferences and (b) a number of so-called "causal modeling" techniques (e.g., causal-correlation, hierarchical regression, path analysis, and structural equation modeling) have virtually no ability to justify such inferences. In addition, he considers the importance of randomized-experimental designs for research aimed at (a) the testing of theories and (b) the development of HRM-related policies and practices. Stone-Romero's contribution focuses on the internal validity of research.

Amber N. Schroeder (University of Texas—Arlington), Patrick J. Rosopa (Clemson University), Julia H. Whitaker (University of Texas—Arlington), Ian N. Fairbanks (Clemson University), and Phoebe Xoxakos (Clemson University) describe how heteroscedasticity may manifest in organizational research. In particular, they discuss how heteroscedasticity may be substantively meaningful. They provide examples from research on stress interventions, aging and individual differences, skill acquisition and training, groups and teams, and organizational climate. In addition, they describe procedures that can be used to detect various forms of heteroscedasticity in statistical analyses commonly used in HRM.

Julie I. Hancock (University of North Texas), James M. Vardaman (Mississippi State University), and David G. Allen (Texas Christian University) note the importance of inter-rater reliability in HRM studies that involve two or more independent coders (e.g., a meta-analysis). These authors review various measures of inter-rater reliability, including percentage agreement, Cohen's kappa, Scott's pi, Krippendorff's alpha, and Gwet's AC1. Based on their comparative analysis of 440 articles that were coded for various characteristics, they provide evidence to suggest that Gwet's AC1 may be a useful index of inter-rater reliability beyond traditional indices (e.g., percentage agreement, Cohen's kappa). They also provide practical guidelines for HRM researchers when selecting an index of inter-rater reliability.

Angelo S. DeNisi (Tulane University) and Kevin R. Murphy (University of Limerick) discuss the difficulties associated with comparing appraisal systems when job performance criteria vary across studies. After reviewing common approaches for evaluating criteria, the authors describe a construct validation framework that can be used to establish criteria for criteria. The framework involves construct explication, multiple evidence sources, and the synthesis of evidence.

Liam P. Maher (Boise State University), Zachary A. Russell (Xavier University), Samantha L. Jordan (Florida State University), Gerald R. Ferris (Florida State University), and Wayne A. Hochwarter (Florida State University) discuss methodological issues in organizational politics. They discuss five constructs in the organizational politics literature—perceptions of organizational politics, political behavior, political skill, political will, and reputation. In addition to conceptual definitions and measurement issues, the authors provide critiques of each construct as well as directions for future research. The authors conclude with a discussion of the conceptual, research design, and data collection challenges that researchers in organizational politics face.


Allen I. Huffcutt (University of Wisconsin–Green Bay) discusses the problem of range restriction in HRM, especially in employment interviews. He demonstrates how serious this problem can be by simulating data that are unrestricted and free of measurement error. Then, he shows how validities change after systematically introducing measurement error, direct range restriction, and indirect range restriction. In addition, he provides a step-by-step demonstration of the calculations needed to obtain corrected correlation coefficients.

Lois E. Tetrick (George Mason University), Robert R. Sinclair (Clemson University), Gargi Sawhney (Auburn University), and Tiancheng (Allen) Chen (George Mason University) discuss methodological issues in the safety climate literature based on a review of 261 articles. Their review reveals a lack of consensus and an inadequate explication of the safety climate construct and its dimensionality. In addition, the authors discuss some common research design issues, including the low percentage of studies that involve interventions. The authors highlight the (a) importance of incorporating time in research studies involving multiple measurements and (b) increased use of various levels of analysis in safety climate research.

REFERENCES

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1976). The design and conduct of quasi-experiments and true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 223–326). Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

CHAPTER 2

ADVANCES IN RESEARCH METHODS What Have We Neglected? Neal Schmitt

My purpose in this paper is twofold. First, I trace and describe the phenomenal development of quantitative analysis methods over the past couple of decades. Then, I make the case that our progress in measurement, research design, and estimating the practical significance of our research has not kept pace with the development of analytic techniques and that more attention should be directed to these critical aspects of our research endeavors.

RESEARCH METHODS A HALF CENTURY AGO

In the 1960s, quantitative methods courses included correlation and regression analyses, analysis of variance, and an advanced course on factor analysis (this was exploratory factor analysis; confirmatory factor analysis did not arrive on the scene until the early 1980s). At that time, too, item response theory (IRT) had been described theoretically, but software packages that would allow for evaluating items, particularly with polytomous models, were not really available until the 1970s and 1980s.



At this time, the notion that all validities were specific to a situation was the accepted wisdom in the personnel selection area. Frank Schmidt and Jack Hunter introduced meta-analysis and validity generalization in the mid to late 1970s (Schmidt & Hunter, 1977). Hypothesis testing was standard practice too, and little attention was paid to the practical significance of statistically significant results. So, a person at that time was considered well trained if he or she was conversant with correlation and regression, analysis of variance, exploratory factor analysis, and perhaps nonparametric indices. This has changed radically in the intervening years.

DEVELOPMENT OF MODERN QUANTITATIVE ANALYSIS METHODS

The 1980s were distinguished by the rapid adoption of structural equation modeling (SEM) using LISREL (later AMOS, MPLUS, and other software tools) and the use of meta-analysis to summarize bodies of research on a wide variety of relationships between HR and OB constructs. Among SEM enthusiasts, there was even a misperception that confirmation of a proposed model of a set of relationships indicated that these variables were causally related, rather than simply that the data were consistent with a hypothesized set of relationships. Even after this error of interpretation was recognized, there was an enthusiastic adoption of SEM by researchers. Both meta-analysis and SEM brought a focus on the underlying latent constructs being measured and related, as opposed to the measured variables themselves. Developments in both SEM and meta-analysis became increasingly sophisticated. Meta-analysts were concerned about file-drawer problems, random- versus fixed-effects analyses, estimates of variance accounted for by various errors, moderator analyses, and the use of regression analyses of meta-analytically derived estimates of relationships. Specific applications of SEM such as multi-group analyses and tests for measurement invariance (Vandenberg & Lance, 2000) were soon widely applied, as were SEM analyses of longitudinal data (e.g., latent growth modeling; Willett & Sayer, 1994).

Certainly among the most frequently used analytic innovations have been those associated with levels research (Klein & Kozlowski, 2000). Multilevel modeling is used in a very large proportion of the articles now published in our journals. In one recent issue of Journal of Applied Psychology (February, 2017), Sonnentag, Pundt, and Venz used multilevel SEM to assess survey data on snacking behavior; Walker, Jaarsveld, and Skarlicki used multilevel SEM to study the impact of customer aggression on employee incivility; and Zhou, Wang, Song, and Wu examined perceptions of innovation and creativity using hierarchical linear modeling. Hierarchical linear modeling (HLM) has been used to study change, goal congruence, climate, and many other phenomena. It is almost as though I suddenly discovered the nested nature of most of the data I collect.
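To make the idea of modeling nested data concrete, the following is a minimal sketch of a two-level random-intercept model (employees nested in teams) estimated with Python's statsmodels; the variable names (satisfaction, workload, team) and the simulated data are hypothetical and are not drawn from any of the studies cited above.

    # Minimal random-intercept (multilevel) model: employees nested in teams.
    # Variable names and data are invented for illustration only.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n_teams, n_per_team = 30, 10
    team = np.repeat(np.arange(n_teams), n_per_team)
    team_effect = rng.normal(0, 0.5, n_teams)[team]     # shared team-level variance
    workload = rng.normal(0, 1, n_teams * n_per_team)
    satisfaction = 3 - 0.4 * workload + team_effect + rng.normal(0, 1, n_teams * n_per_team)
    df = pd.DataFrame({"team": team, "workload": workload, "satisfaction": satisfaction})

    # Random intercept for team; fixed slope for workload.
    result = smf.mixedlm("satisfaction ~ workload", df, groups=df["team"]).fit()
    print(result.summary())

The point of the sketch is simply that the team-level variance component is estimated explicitly rather than being ignored, which is what distinguishes multilevel models from ordinary regression on pooled data.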


We have also seen the development of new methods of analyzing longitudinal data (including the use of HLM for this purpose). As mentioned above, latent growth modeling via SEM has been used in many studies. Early methods of analyzing change usually involved predictors that did not change over time and were not directed at analyzing relationships in change across variables over time. With time-varying predictors and dynamic relationships, more complex methods of analysis are required. Good treatments of the differences among growth models are provided in book chapters by Ployhart and Kim (2013) and DeShon (2013). Examples of the analysis of dynamic change models are becoming more frequent (e.g., Chen, Ployhart, Cooper-Thomas, Anderson, & Bliese, 2011; Pitariu & Ployhart, 2010).

Computational modeling was described by Ilgen and Hulin (2000) nearly two decades ago but is now becoming rather commonly used as an alternative research method. See Grand (2017) for a computational model of the influence of stereotype threat on training/learning practices and the performance potential of employees and their organizations. Vancouver and Purl (2017) provide a computational model used to better understand the negative, positive, and null effects of self-efficacy on performance.

Missing data plague many of our applied studies, particularly when data are collected on multiple occasions. Traditionally, I think most of us have used listwise deletion of cases with missing data or replaced missing values with the mean of the variable. Listwise deletion has a disastrous effect on the available sample size: with N = 500 cases, K = 10 variables, and 10% randomly missing data, only about 175 complete cases remain. Mean replacement as a solution to the missing data problem has an obvious impact on the variability of a variable and produces biased parameter estimates. Full information maximum likelihood and EM-based imputation are much more powerful and can handle huge proportions of missing data in ways that do not produce biased estimates of parameters. Mention of these methods of treating missing data is only beginning to appear in our literature.
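The arithmetic behind the listwise-deletion figure above can be checked directly: if each of K variables is missing 10% of its values completely at random, the expected complete-case rate is .9 to the Kth power, about .35 for K = 10. The sketch below uses simulated data only and is an illustration, not part of the original analysis.

    # Expected toll of listwise deletion under 10% missingness per variable.
    import numpy as np

    rng = np.random.default_rng(0)
    N, K, p_missing = 500, 10, 0.10

    print("expected complete cases:", round(N * (1 - p_missing) ** K))  # about 174

    data = rng.normal(size=(N, K))
    mask = rng.random((N, K)) < p_missing      # True marks a missing value
    complete = ~mask.any(axis=1)               # rows with no missing values
    print("simulated complete cases:", int(complete.sum()))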


Adaptations to typical regression analyses have also been introduced and are increasingly common. Moderated and mediated regressions have been part of our repertoire for some time, but they continue to present challenges. Analyses of models that involve mediation are susceptible to inference problems (Stone-Romero & Rosopa, 2011) such as those mentioned above in connection with SEM. I will summarize these problems in connection with research design problems later in this paper. We also employ polynomial regression, splines, and the analysis of censored data sets.

Qualitative analyses have typically been disparaged by quantitative researchers, but the early stages of a job analysis, probably the oldest aspect of a selection study, are certainly qualitative in nature. Modern forms of qualitative research, such as various versions of text analysis, can be very quantitative, and even when such techniques are not used, qualitative analyses have become an increasingly valuable tool of organizational researchers.

Big Data produces opportunities and challenges associated with the analysis and interpretation of huge multidisciplinary data sets. Angrave, Charlwood, Kirkpatrick, Lawrence, and Stuart (2016), Cascio and Boudreau (2011), and others have detailed a number of challenges in the use and interpretation of the wide variety of big data available and the potential for analyses of these data to result in improved HR practices. The quality and accuracy of many Big Data files are often unknown; for example, it is rare that one would be able to assess the construct validity of Big Data indices as organizational researchers usually do. Big Data also introduces a whole new vocabulary (see Harlow & Oswald, 2016): terms like lasso (screening out noncontributing predictor variables), latent Dirichlet allocation (modeling the words in a text as attributable to a smaller set of topics), k-fold cross-validation (developing models on subsets of "training" data that are then evaluated on held-out "test" folds), crud factor (a general or nuisance factor), and many more. In many ways, analyses of Big Data seem like the "dust-bowl empiricism" that was decried by organizational psychologists a half century ago. Note, though, that most Big Data analysts do attend to theory, and much more effort has been devoted to cross-validation of findings than was true in early validation efforts.
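As a small, hedged illustration of two of the terms above, the sketch below uses scikit-learn's LassoCV, which combines the lasso penalty (shrinking the coefficients of noncontributing predictors to zero) with k-fold cross-validation (fitting on the training folds and evaluating on each held-out fold). The data are simulated and are not drawn from any study discussed in this chapter.

    # Lasso with 5-fold cross-validation: noncontributing predictors are screened out.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV

    X, y = make_regression(n_samples=300, n_features=50, n_informative=5,
                           noise=10.0, random_state=0)

    # Cross-validation chooses the penalty; the model is then refit on all data.
    lasso = LassoCV(cv=5, random_state=0).fit(X, y)
    print("chosen penalty:", round(lasso.alpha_, 3))
    print("predictors retained:", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1])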


Many other analysis techniques have challenged us, such as relative importance analysis, the use of interactive and power terms in regression to analyze difference scores, power analysis, spatial analyses, social network analyses, dyadic data analyses, and more. The last two to three decades have been an exciting time for those working in quantitative analysis. The variety of new techniques available to help understand data, the increased availability of data in social media outlets and elsewhere, and the availability of free software packages such as R are nothing short of a revolution. We are continually faced with the challenge of educating ourselves and others on the appropriate and productive use of these procedures. While we have much to celebrate and occupy ourselves, I would like to voice concerns about some issues that seem to have gone unnoticed by many researchers.

PROBLEMS THAT REQUIRE ATTENTION

Three concerns warrant greater attention than they seem to receive in our current literature. First, we have not been as concerned about the data themselves as about the techniques we use to summarize and analyze them. Certainly many of us are familiar with the phrase "garbage in, garbage out." Very little attention has been given to the development of new measures or even to the reliability or construct validity of those measures we do employ. A similar concern was voiced by Cortina, Aguinis, and DeShon (2017) after reviewing 100 years of research methods papers published in Journal of Applied Psychology. They said: "…we hope that the fascination with abstruse data analysis techniques gets replaced by fascination with appropriate research design, including top-notch measurement" (p. 283). Second, we have paid too little attention to research design. This is especially evident when we collect (or try to collect) longitudinal data. Third, in the interest of discovering statistically significant findings or the results of the latest novel analytic technique, we have lost sight of the practical significance of our results—in terms of reporting effect sizes that are meaningful to practitioners, explaining the nature of our results (witness the lack of impact of the selection utility analyses so popular a couple of decades ago), and in terms of addressing issues that concern OB/HR practitioners or our organizational clients. In the remainder of this chapter, I will describe the "state of the art" in these three areas and explain why I think they should receive more attention from research methodologists than is currently the case.

MEASUREMENT CONCERNS

Aside from IRT developments, there has been very little direction as to how to evaluate the items or scales we use. Even IRT is not very applicable with short scales. CFA has been used for the same purpose, but we have little guidance as to what constitutes good fit to a particular measurement model. Nye and Drasgow (2011) have tried to provide such guidance, and Meade, Johnson, and Braddy (2008) recommend the use of a change in the comparative fit index (with a cutoff value of .002) as a means of comparing the fit of alternative models or testing invariance. Too often we wave a set of alpha values at the scales we use, sometimes apologizing for those whose alphas are below .70, as evidence that our measures are acceptable. Sometimes journals even publish one item per scale so the reader can get some sense of the nature of the construct measured, but even the publication of one item has been found objectionable on proprietary grounds.

Clark (2006) decries the sentence often used to justify the use of a measure: "According to the literature, measure X's reliability is good and it has been shown to have validity" (p. 448). This statement is often made without a reference, but even with a reference, it often appears doubtful that the authors read the paper or papers they cite. Of course, there is often no mention as to how or against what the measure was validated. Or, as investigators, we write a set of items for a study and label them as some existing construct with no supporting investigation of their psychometric characteristics. Subsequent researchers or meta-analysts take the label for granted. This situation is even worse now that we have become enamored of Big Data, because we have little or no control over the nature of the data collected, and many times the data come from disciplines that have little appreciation for the quality of their measures (Angrave et al., 2016).

Let me give some examples. Forty or fifty years ago, there were several publications that provided guidelines on item writing, though most of those addressed multiple-choice ability items. Some guidelines addressed the use of double-barreled items, double negatives, or jargon (Edwards, 1957) in Likert-type items. I even remember one paper that experimentally manipulated some of these guidelines in constructing a final exam in an introductory psychology course (Dudycha & Carpenter, 1973).


We now take item writing (whether of multiple-choice or Likert items) for granted, with the possible exception of large test publishers whose main concern is the perceived fairness of test items to different groups of examinees. Even attempts to improve perceived fairness to different underrepresented groups have rarely been examined (for an exception, see Golubovich, Grand, Ryan, & Schmitt, 2014).

We do have some other developments in measurement—both methods of measurement and means of analyzing the measurement properties of our indices. Cognitive diagnosis models, multidimensional IRT models, and simulations/gaming are some examples. However, these techniques have not caught on to any great degree—perhaps because they are too challenging for many of us, or because psychometricians and quantitative data analysts do not speak the language of most psychologists, and there may be some level of arrogance among psychometricians about the relative incompetence of the rest of us. In any event, few of us read Psychometrika anymore, and I suspect the same may be true of Psychological Methods and educational journals like the Journal of Educational Measurement. Organizational Research Methods is still accessible to most organizational researchers, and that may account for its relatively high impact factor. Whatever the reason, there seems to be a segregation of quantitative analysts and measurement types from other researchers, particularly those who develop or use psychological measures.

In addition to a lack of attention to writing items, there is an overdependence on alpha as an index of reliability or unidimensionality. Cortina (1993) and Schmitt (1996) have demonstrated that alpha can be a poor index of reliability even when we have lots of items. Schmitt (Cortina [1993] provided a similar analysis) demonstrated that a six-item test with the item intercorrelations in Table 2.1 yielded an alpha of .86. Most of us would be happy with this alpha and proceed to do further analyses using this measure. If we bothered to look further (i.e., examine the item intercorrelations), it would be obvious that the six items address two dimensions. Further, examination of item content would almost certainly provide at least a tentative explanation of these two sets of items, but that almost never occurs. This example was constructed to make a point, but with more items and a more ambiguous set of correlations, this problem would likely go unrecognized.

A more modern and frequently used approach to assessing the dimensionality of our measures is to employ confirmatory factor analysis. Assessment of a unidimensional model of these intercorrelations would have yielded the following fit indices: chi-square = 401.62, df = 9, RMSEA = .47, NNFI = .29, CFI = .57. Most researchers would conclude that the alpha of .86 was not a good index of the unidimensionality of this measure and that a composite index of this set of six items is meaningless. However, we can also fool ourselves about the dimensionality of a set of items when using CFA—although probably not as easily. We are dependent on a set of "rules of thumb" as to whether a model fits our data, and indices of practical fit (Nye & Drasgow, 2011) are not helpful in this instance. Consider the item intercorrelations in Table 2.2, for which a four-factor model produces a perfect fit to the data.

TABLE 2.1.  Hypothetical Intercorrelations of a Six-Item Composite

Variable     1     2     3     4     5     6
1          1.0
2           .8   1.0
3           .8    .8   1.0
4           .3    .3    .3   1.0
5           .3    .3    .3    .8   1.0
6           .3    .3    .3    .8    .8   1.0
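As a hedged check on the Table 2.1 example, the sketch below uses only the correlation matrix shown above: the standardized alpha works out to about .86, while the eigenvalues point to two factors rather than one. It is a minimal illustration, not part of the original analysis.

    # Standardized alpha and eigenvalues for the Table 2.1 correlation matrix.
    import numpy as np

    R = np.full((6, 6), .3)
    R[:3, :3] = .8
    R[3:, 3:] = .8
    np.fill_diagonal(R, 1.0)

    k = R.shape[0]
    rbar = R[np.triu_indices(k, 1)].mean()        # mean inter-item correlation
    alpha = k * rbar / (1 + (k - 1) * rbar)       # standardized alpha
    print("alpha =", round(alpha, 2))             # 0.86

    print("eigenvalues:", np.round(np.linalg.eigvalsh(R)[::-1], 2))
    # Two eigenvalues exceed 1 (3.5 and 1.7), consistent with two dimensions.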

A one-factor model does pretty well too (chi-square = 73.66, df = 54, RMSEA = .04, NNFI = .98, CFI = .98). Most of us as authors, and most of us as reviewers, would be happy with this demonstration of unidimensionality. I agree that the difference between the within-factor and between-factor item correlations is small, but alpha for each of the four factors is .82 and the correlation between any two of the four sets of items is .55. Are these distinct and practically meaningful "factors"? Incidentally, the alpha for the 12-item composite here is .93. Clearly, both alpha and CFA tell us that one factor explains these data, but four distinct factors are responsible for the item intercorrelations. The point I am making is that the more sophisticated analysis of dimensionality does not do justice to the question any more than alpha does. A third way of looking at these data is to examine the item content, the item-total correlations, and the item intercorrelations, or to perform an exploratory factor analysis—something that few "sophisticated" data analysts ever do!

TABLE 2.2.  Hypothetical Intercorrelations of Data Representing Four Factors

Variable     1     2     3     4     5     6     7     8     9    10    11    12
1           1
2           .6    1
3           .6    .6    1
4           .5    .5    .5    1
5           .5    .5    .5    .6    1
6           .5    .5    .5    .6    .6    1
7           .5    .5    .5    .5    .5    .5    1
8           .5    .5    .5    .5    .5    .5    .6    1
9           .5    .5    .5    .5    .5    .5    .6    .6    1
10          .5    .5    .5    .5    .5    .5    .5    .5    .5    1
11          .5    .5    .5    .5    .5    .5    .5    .5    .5    .6    1
12          .5    .5    .5    .5    .5    .5    .5    .5    .5    .6    .6    1
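As a further hedged check, the sketch below rebuilds the Table 2.2 matrix and verifies two of the values quoted above: a composite alpha of about .93 for all 12 items and about .82 for each 3-item factor, even though four distinct factors underlie the items. Again, this is only an illustration based on the matrix shown, not the original analysis.

    # Alphas implied by the Table 2.2 correlation matrix (four 3-item factors).
    import numpy as np

    def standardized_alpha(R):
        k = R.shape[0]
        rbar = R[np.triu_indices(k, 1)].mean()
        return k * rbar / (1 + (k - 1) * rbar)

    R = np.full((12, 12), .5)
    for f in range(4):                    # items 0-2, 3-5, 6-8, 9-11 form the factors
        R[3*f:3*f+3, 3*f:3*f+3] = .6
    np.fill_diagonal(R, 1.0)

    print("12-item composite alpha:", round(standardized_alpha(R), 2))      # 0.93
    print("3-item factor alpha:", round(standardized_alpha(R[:3, :3]), 2))  # 0.82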


Yet another reason why quality measurement is important is highlighted in a paper by Murphy and Russell (2017). They point to a long history of frustration among organizational scientists in formulating, documenting, and testing moderator hypotheses and wonder whether it is time to discontinue our search for moderator variables. One reason moderators have been elusive is that the typical moderator variable has very low reliability. If a moderator is formed as the product of two measures, each with a reliability of .70 (the usual level considered marginally acceptable), their product has a reliability of .49. This low reliability, of course, reduces the power of any test of moderation and the magnitude of the effect associated with the moderation.

A third reason why we may be likely to ignore the measurement quality of our data is the advent of Big Data. Big Data analyses can be valuable in answering many of the questions we have about human behavior in ways we could only wish for a decade or so ago. However, it also means that we often accept data from many different sources and from people whose appreciation for measurement issues simply does not match that of members of our discipline, and there is usually no way we can check the psychometric characteristics of these measures. I do not know to what degree this may bias the results of Big Data analyses, but I do think it deserves attention.

INFORMATION ON RELIABILITY AS PRESENTED IN CURRENT RESEARCH ARTICLES

Overall, then, I do not believe we have paid much attention to the quality of our measures. We seem to think item writing is easy and that if we ask respondents to use a five-point Likert-type scale we will have a quality measure. Or, more often, researchers adapt a measure from previous work, occasionally taking a few items from some longer measure. Then we use alpha and CFA to justify our behavior.

To ascertain whether these practices are indeed standard, I examined the last three issues of the 2017 volumes of Journal of Applied Psychology, Personnel Psychology, and the Academy of Management Journal and tabulated the results in Table 2.3. In this table, I have described the construct the authors purported to measure; the evidence for reliability and, in some cases, the discriminant validity of the measure (convergent validity was rarely mentioned or assessed); the employment of rules of thumb to justify reliability; and the justification for use of the measure.

The table does document several positive features of the research reviewed. First, most indices of reliability are quite high and clearly exceed the usual minimum for alpha (i.e., .70) cited in the literature. Second, authors do routinely provide justification for their use of measures. That justification, though, is almost always limited to the fact that someone else used the same scale or that the current measure was modified from a measure used in an earlier study. Very rarely did authors present any evidence of the relationship between the original version of the measure and the modified measure.

TABLE 2.3.  Measure Adequacy as Described in Recent Issues of Three Journals

Journal | Construct(s) and type of measure | Reliability evidence | Rule of thumb or justification mentioned | Previous use or justification mentioned

JAP | Rudeness; Goal progress; Task performance; Psy. withdrawal; Interpersonal avoidance; Morning affect; Core self-evaluation (all self-report) | Alpha = .92, .86, .94, .77, .94, .92 & .93, .85; CFA test of distinctiveness (differences in fit indices) | No | Yes
JAP | Role conflict; Empowering help orientation; Emotional exhaustion (all self-report) | Alpha = .78 & .81; .65; .89 & .02 | No; Yes; No | Yes
JAP | Interviewer evaluations of candidates (interviewers); Self-verification (self-report) | Alpha = .88; test-retest = .59 | No | No; Yes
JAP | Entity theory; Social support; Self-efficacy; Feedback seeking (all self-report) | Alpha = .92, .71, .67, .78 | No | Yes
JAP | Anger; Empathy; Perceptions of treatment intentions (all self-report) | Alpha = .96; none; none | No | Yes
JAP | Behavioral integrity; Ethnic dissimilarity; Ethnic representation (all self-report) | Alpha = .78 & .94; .69; .96 | No; Yes; No | Yes; Yes; No
JAP | Perceived effort (supervisor); Perceived liking (supervisor); Procuticle adherence; Org. justice adherence; Conscientiousness; Agreeableness (self-report) | Alpha = .96 & .94; .92 & .91; .82; .83–.93; .77; .77; similar measures used in three studies | No | Yes
JAP | Ability; Benevolence; Integrity (evaluations of target) | Alpha = .90 & .90; .95 & .92; .95 & .93; CFA indicated three factors | No | Yes
JAP | Moral disengagement; Intent to ostracize; MD language; Moral identity; Other concern (all self-report) | Alpha = .76 & .82; .88 & .94; .92 & .87; .85 & .85; .83 & .90 | No | Yes
JAP | Team goal setting; Team agreeableness; Team emotional stability; Team extraversion; Team conscientiousness; Task cohesion (all self-report) | Alpha = .90, .73, .67, .86, .81, .86 | No | Yes
JAP | Org. politics; Political behavior; Psy. empowerment; Emotional exhaustion (self-report); Task performance (supervisor) | Alpha = .74, .83, .88, .92, .96 | No | Yes
PPsych | Ineffective interpersonal behavior (other rating); Performance (supervisors); Effective interpersonal behavior; Derailment potential; Promotability; Performance (supervisor) | Alpha = .94, .91, .92; factor analysis and CFA confirmed 3 factors | No | Yes; Yes; Yes; Yes; No; No
PPsych | Individual OCB (group leader); Organizational OCB (group leader); Group cohesiveness (group members); Job self-efficacy (group members) | Alpha = .89, .74, .83, .73 & .74 | No | Yes
PPsych | Commuting strain; Task significance; Family interference; Commuting means efficacy; Self-regulation at work (all self-report) | Alpha = .8 to .93; .78 to .93; .7 to .93; .86; .94 | No | Yes
PPsych | Five dimensions of role identity; Group cohesiveness (all self-report) | Alphas = .70 to .81; .61 | No | Yes (CFA for discriminant validity, test of significance and fit indices); Yes
PPsych | Part. in development (self-report); Development challenges (self-report); Develop. supervision (mgrs.); Leader self-efficacy (self-report); Mentor network (self-report); Leader efficacy (supervisor); Promotability (supervisor) | Alpha = .74, .83, .96, .94, .93, .85, .87 | No | Yes; Yes (constr. validity); Yes; Yes; No; Yes; Yes
PPsych | Leader-member exchange; Alumni goodwill (both self-report) | Alpha = .89; .73 | No | Yes (compared with full-length LMX); new items
PPsych | Affective org. commitment; Superv. transformational ldrship. (both self-report) | Alpha = .93; .98 | No | Yes
PPsych | Surface acting; Deep acting; Ego depletion; Self-efficacy for emotion regulation (self-report); Intentional harming of coworker (supervisor) | Alpha = .84 | No | Yes
PPsych | Core self-evaluation; Task mastery; Political knowledge; Social integration; Org. identification (all self-report) | Alpha = .81, .83, .73, .84, .86 | No | Yes (CFA for discriminant validity)
PPsych | Mastery orientation; Mastery ornt. variability; Post-trng. self-efficacy; Motivation to transfer (self-report); Declarative knowledge (test/quiz); Transfer; Opportunity to perform (self-report) | Alpha = .93, .94, .91, .89, none, .8 to .9, .69 to .86 | No (NA for declarative knowledge) | Yes (NA for declarative knowledge)
PPsych | Transformational ldrship. (supervisor); Family role identification (self-report) | Alpha = .96; .87 | No | Yes
PPsych | Unethical behavior (supervisor); Ostracism (self-report); Performance (supervisor); Performance (self-report); Relationship conflict (supervisor) | Alpha = .91/.95/.96; .96/.91/.97; .94; .70; .92/.96 | No | Yes
AMJ | Feedback seeking (raters); Curiosity (self-report); Change to artistic drafts (raters) | Alpha = .75; .81; kappa = .88 | No | No; Yes; No
AMJ | Report vagueness (raters) | Krippendorff's alpha = .82 | Yes | No
AMJ | Brand identity conflict; Brand identity enhancement; Intrinsic motivation; Perspective taking (all self-report) | Alpha = .72, .70, .92, .89 | No | Yes
AMJ | Company resources; Employee belief in cause; Corporate volunteer climate; Corp. volunteering intentions; Personal volunteering intentions (all self-report) | Alpha = .73, .82, .97, .96, .97 | No | Yes; conv. & disc. validity; conv. & disc. validity; Yes; Yes
AMJ | Target influence behavior; Self-reliance; Leadership evaluations; Communality; Competence (evaluator ratings and self-report) | Alpha = .93 (rwg = .77); .68; .88 & .82; no data; .88 | No | Yes; Yes; Yes; No; Yes
AMJ | Ethical relativism (self-report); Ethical idealism (self-report); Ethical leadership (peer or subordinates) | Alpha = .84, .81, .92 | No | Yes
AMJ | Econ. downturn perceptions; Negative mood; Positive mood; Construal of success (all self-report) | Alpha = .90, .77, .72, .79 | No | Yes
AMJ | Surface acting; Deep acting; Work engagement; Emotional exhaustion; Giving help; Receiving help; Positive affect; Negative affect (all self-report) | Alpha = .85, .84, .92, .90, .90, .88, .91, .80 | No | Yes

The frequent modification of scales is documented in Cortina et al. (under review). Beyond these positive features of the research, it is clear that organizational researchers have measured a wide variety of different constructs, most of which are not the typical individual difference measures that were the target of research in the selection arena. Human resource researchers, broadly defined, have clearly expanded the nature of the issues and constructs in which they are interested. This proliferation of measures may, however, make it more difficult to assess the commonality of research findings across studies and time; calling attention to this issue was not the purpose of our paper.

Almost all studies summarized in Table 2.3 use self-report instruments to assess the constructs of interest, and in many of these cases this is the only alternative. However, researchers frequently use supervisory responses or objective or archival data as the source of information about constructs of interest. There are fewer references to articles published in AMJ because many of the articles published in that journal employ archival data, for which coding accuracy or agreement is the applicable index and for which data are readily available for verification purposes.

Third, there is an almost universal reliance on alpha as an index of measurement reliability or adequacy. In some cases, this is complemented by a CFA of items assigned to multiple constructs to ascertain their discriminant validity. In only a few cases was a CFA employed to determine the unidimensionality of a measure. Most alpha values were quite high (in the .80s and .90s), and very few were below the .70 level that has been routinely suggested as the minimal acceptable level. In no case was this level cited as justification for the use of a measure. While not presented in the table, it was the case that almost all authors presented one or two items for each of their measures, but never the entire measure. Given the availability of supplementary publication outlets in most journals, it seems that publication of the entire measure should become standard practice.

Fourth, coded in the last column of the table was the justification cited by the authors for use of a measure. In almost all cases, the justification was the use of a measure by some other author to index the same or similar constructs. However, these citations rarely included the data from the original study that supported the measure, and in most cases, as mentioned above, there was a modification (usually a decrease in the number of items) of the original measure. In a few cases, data were reported that included a correlation between the original measure and the modified measure. In the case of these modifications, it seems particularly important that the careful reader have access to both the original and modified instruments, underscoring the value of supplementary publication outlets, if not the main article, for this purpose. In a few cases, the justification included a CFA of the measured variables with a description of that analysis related to questions about discriminant or convergent validity. While one or two representative items were often presented in these papers, there was no presentation of the item intercorrelations, item-total correlations, or content that might have further informed the reader about the nature of the construct measured and the degree to which individual items may or may not have represented that construct.

Because of the problems with alpha demonstrated in Table 2.1, it may be helpful to consider the inclusion of other indices of unidimensionality, though there has been minimal agreement as to what such an index might be (Hattie, 1985). When a bifactor model (a general factor plus uncorrelated specific factors) fits the data, it might be useful to present the omega indices described by Reise (2012). These indices represent the degree to which a set of items reflects a general factor, as well as the variance associated with individual specific factors and the variance due to a specific item. Such analyses, along with the examination of item content, might illuminate the nature of measures such as situational judgment measures, which typically display very low alphas but little evidence that more than a single general factor explains the data. Omega values as "reliability" measures would have been more appropriate given the multidimensionality of the data presented in Tables 2.1 and 2.2.

There is literature decrying the sole reliance on alpha (Sijtsma, 2009) and supporting the use of omega (Dunn, Baguley, & Brunsden, 2014) and other indices of internal consistency (Zinbarg, Revelle, Yovel, & Li, 2005). Sijtsma argued for the use of an index he labeled the greatest lower bound (GLB) as the preferred estimate of reliability. However, Zinbarg et al. (2005) showed that the GLB was almost always lower than the hierarchical form of omega. Omega that includes item loadings on a general factor as well as item loadings on group factors as true variance appears to be the best lower-bound estimate of reliability and the most appropriate index to use in correcting observed correlations between two variables for attenuation due to unreliability. Dunn et al. (2014) document the almost universal use of alpha as a measure of internal consistency in spite of the critical psychometric literature, including a paper by Cronbach himself (Cronbach & Shavelson, 2004). They also support the routine use of omega, along with the confidence interval for its estimate, and provide direction and an example of its calculation using the open-source statistical package R. McNeish (2018) provides a review of the use of alpha, like the one provided here in Table 2.3, for three different psychological journals. The results of that review are very similar in that almost all authors used alpha as the report of reliability. McNeish went on to compare the magnitude of alpha and five other reliability indices for measures included in two publicly available data sets. He found that alpha was consistently lower by about .02 to .10, depending most often on the variability of item loadings on a general factor. Aside from underrepresenting the reliability of a measure, these differences may be practically meaningful in applied instances when relationships are corrected for attenuation due to unreliability, as they routinely are in studies of the criterion-related validity of personnel selection measures (Schmidt & Hunter, 1998).

In the March 2018 issue of the Association for Psychological Science's Observer, Fried and Flake make four observations about measurement that are consistent with the data in Table 2.3 and this discussion. First, they encourage researchers to clearly communicate the construct targeted, how it is measured, and its source. Second, there should be a rationale for scale use and modifications. Third, if the only evidence you have of a measure's "validity" is its alpha, consider conducting a validity study to ascertain the scale's correlates. Finally, stop using alpha as the only evidence of a scale's adequacy. I would add that we should replace alpha with omega for composite measures.
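For readers who want a concrete starting point, the sketch below computes omega total and omega hierarchical from an assumed bifactor solution; the loadings are entirely hypothetical and chosen only for illustration, and Dunn, Baguley, and Brunsden (2014) give a fuller treatment with R code. Omega hierarchical treats only the general factor as reliable variance, while omega total also credits the group factors.

    # Omega total and omega hierarchical from a bifactor solution
    # (general factor plus uncorrelated group factors; standardized items).
    import numpy as np

    # Hypothetical standardized loadings: six items on a general factor
    # and on two group factors of three items each.
    general = np.array([.6, .6, .6, .5, .5, .5])
    group = np.array([[.4, .4, .4, 0, 0, 0],
                      [0, 0, 0, .5, .5, .5]])

    uniqueness = 1 - general**2 - (group**2).sum(axis=0)
    total_var = general.sum()**2 + (group.sum(axis=1)**2).sum() + uniqueness.sum()

    omega_total = (general.sum()**2 + (group.sum(axis=1)**2).sum()) / total_var
    omega_hier = general.sum()**2 / total_var          # general factor only
    print("omega total =", round(omega_total, 2), "omega hierarchical =", round(omega_hier, 2))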

INATTENTION TO RESEARCH DESIGN

I mentioned above that we have made significant advances in our analyses of longitudinal data. However, we have paid little attention to the research designs that produce our longitudinal data. When we study socialization, training, leadership, the impact of job satisfaction on life satisfaction (or the reverse), and a host of other time-related variables, it is important that our data collection efforts reflect the time periods in which the process we are studying is likely to occur. For example, if we are looking at the impact of training on some outcome variable, it makes little sense to evaluate such training before the effects of training are likely to have had their full impact. Likewise, it makes little sense to assess the impact of various socialization efforts many months or years after the employment of a group of employees. Similarly, investigating the impact of life satisfaction on job satisfaction among a group of long-tenured employees does not make much sense.

Perhaps this is well known or common sense. However, examples of a lack of consideration of the timing of data collection are not difficult to find in recently published articles. The following examples come from an unsystematic search of the last several issues of top-tier journals in our discipline.

Kaltiainen, Lipponen, and Holtz (2017) studied the longitudinal relationships between perceptions of process justice during a merger and subsequent cognitive trust. They did do an excellent job of describing when data were collected and what was transpiring in the organization at the time. However, they provided little concrete justification for the one-year separation between data collections. Did trust change more or less slowly than these one-year intervals? The authors recognize this limitation in their discussion, in that they state that they would like to assess these changes over shorter time periods. There is little theoretical or empirical justification (maybe none) that indicates when such changes might occur. Most data on these relationships are cross-sectional, but the authors do cite a meta-analysis of this relationship. Those meta-analytic data might be analyzed to determine whether the time interval employed in studies moderates the reported effect size. If there is moderation, then an appropriate time interval might be determined for use in subsequent longitudinal studies, but to my knowledge this is rarely if ever done when deciding on the timing of longitudinal data collections.

Barnes, Miller, and Bostock (2017) report an interesting study on the effect of web-based cognitive behavior therapy on insomnia and a variety of workplace outcomes (organizational citizenship behavior, interpersonal deviance, job satisfaction, negative affect). The researchers hypothesized that the therapy would have effects on workplace outcomes mediated by insomnia and evaluated these hypotheses with pre-post surveys separated by ten weeks. There was no mention of the appropriateness of this ten-week interval. There was support for some of their hypotheses, suggesting a rather short time frame within which this hypothesized mediation occurred.

Grand (2017) provides a computational model of the effects of stereotype threat during training and turnover on employee performance potential over time. This is a very interesting and thorough analysis of what happens over time in the presence of stereotype threat, based on realistic parameters. The analyses show the usual asymptote of employee learning, with the negative impact of stereotype threat remaining over time as a function of turnover among trained employees. However, there was no mention of the time interval during which these processes unfold, though I assume this could be shorter or longer based on the time it takes employees to reach their full potential.

Perhaps most illustrative of a lack of consideration of the timing of data collection is a study by Deng, Walter, Lam, and Zhao (2017) that just appeared in Personnel Psychology. These authors studied the effect of emotional labor on ego depletion and the treatment of customers. Data were collected in two surveys two months apart.


Data were collected in two surveys two months apart. There was no mention of the appropriateness of this time interval and, equally problematic, the possibility that job tenure might play a role in this process was not considered.

These are all excellent studies, but in each case, the time periods studied are not discussed (the exception was the Kaltianen et al. study, in which the authors cited the lack of more frequent measurement as a limitation). Time must be considered if we are to discover and adequately estimate the underlying processes we seek to explain.

To underscore this issue, I examined the articles published in the last year in two major journals (Journal of Applied Psychology and Personnel Psychology) and the last three issues in the 2017 volume of Academy of Management Journal. The shorter time frame for Academy of Management Journal was used because more papers were published in AMJ and more involved longitudinal designs in which time of data collection was a potential concern. Table 2.4 contains a brief description of the 46 studies reviewed in these three journals, including the major hypotheses evaluated, the time interval between data collections, support for the hypothesized effects, and any discussion of time.

In about half of these studies (N = 22), there was no discussion of the role that time might have played in the study results or whether the timing of multiple data collections was appropriate. In some of these studies, the variables studied might not have been sensitive to the precise time of data collection, or the time interval represented a reasonable space within which to expect an effect to occur (e.g., the effect of socialization during a probationary period). However, in most of these 22 cases, it would not be hard to assert that the time of measurement was a critical factor in finding or estimating the effect of some process (e.g., leader personality affecting leader performance), yet it was not mentioned in the description of the research. In those studies in which time was mentioned, it was almost always mentioned as a limitation of the study, sometimes with the suggestion that future research consider the importance of data collection timing. In one study in Personnel Psychology, there was an extensive discussion of the socialization process investigated and why the timing of data collections was appropriate.

A very large proportion of the papers published in Academy of Management Journal (AMJ) were longitudinal, and many involved the use of archival data that occasionally spanned one or more decades. In some of the archival studies, data were collected for many time periods, really assuring that any time-related effects would be observed. Like the other two journals, however, 7 of the 16 AMJ papers did not discuss the importance of time when it seemed to me that it should have been a relevant issue. The relatively greater concern with time in papers published in Academy of Management Journal may be a function of what seems to be a greater emphasis on theory in that journal. This theoretical emphasis should produce a concern that measurement time periods coincide with the theoretical processes being studied.

In none of the papers mentioned in Table 2.4 was the timing of the first data collection described.
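To illustrate the point in the simplest possible terms, the following sketch simulates a hypothetical training effect that peaks about three months after an intervention and then decays (the functional form and all parameter values are invented, not drawn from any study reviewed here); the observed standardized effect depends heavily on when the outcome happens to be measured.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300  # employees per group

# Hypothetical true effect (in SD units) that peaks at month 3 and then decays
def true_effect(month):
    return 0.6 * (month / 3) * np.exp(1 - month / 3)

for month in (1, 3, 6, 12):
    treated = rng.normal(true_effect(month), 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)  # equal n
    d = (treated.mean() - control.mean()) / pooled_sd
    print(f"outcome measured at month {month:2d}: observed d = {d:.2f}")
```

A design that measures the outcome only at month 12 would conclude that the intervention did very little, even though the process it set in motion was substantial a few months earlier.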

Assessment center feedback>self-efficacy>feedback seeking>career outcomes
First stage was 2.4 years after feedback; second stage was 15 years later

Team charter and team conscientiousness lead to task cohesion and team performance
10 weeks

Political behavior>task performance mediated by emotional exhaustion and psychological empowerment including moderator effects of political behavior on exhaustion

Study 1: Intrinsic motivation>organizational identification; Study 2: Need fulfillment>intrinsic motivation>organizational identification
Study 1: Six months; Study 2: Three stages with 4 weeks intervening

Job control & task-related stressor and social stressors>health and well-being

Unethical behavior, supervisor bottom line orientation, and shame>exemplification behavior

JAP

JAP

JAP

JAP

JAP

JAP

Supported; no discussion of time interval
Support was found for the first link in the hypothesized sequence and partial support for the mediation hypothesis. No discussion of time interval or extent of previous experience

All four hypotheses were supported. No discussion of time interval.

Hypothesis supported; no discussion of time.

Hypotheses were supported. No discussion of the timing of data collection. Times are averages across participants

Time interval = probation period. Role conflict>exhaustion moderated by type of help provided to newcomers

Rudeness affected all four outcomes. Hypo. Restricted to morning rudeness, but possibility of buildup or crossover effects are recognized

Hypo. Support & Discussion or rationale for Time Interval

Six months and two weeks

Unethical behavior. Shame; shame>exemplification; supervisor BLM moderated the latter relationship. Time issue was discussed

Five times over 10 years, but mid-point varied and last data collection was six years after the fourth period
In a general sense, hypotheses were confirmed. Data collection times were discussed and early periods were defended on the notion that that was when most job stress would occur.

Two months separating each of three surveys

Six months

Role conflict>emotional exhaustion moderated by helping (socialization)

JAP

Nine hours

Time Interval

Morning rudeness>task perf. & goal progress & interaction avoidance & Psych. Withdrawal

Hypothesized Effects

JAP

Journal

TABLE 2.4.  Longitudinal Research Designs in Articles Published Recently in Major Journals


Study 1: Intercultural dating>creativity

Distance and velocity disturbances>enthusiasm and frustration>goal commitment, effort and perf.

Intraindividual increases in org. valence>org. identification>job sat & intent to stay and personal valence constr.>org. identification>job satisfaction and intent to stay
One year pre- and post-merger

Leader extraversion, agreeableness, & conscientiousness>team potency belief and identification w. ldr>Performance moderated by power distance

Work engagement>work-family Interpersonal capitalization>family satisfaction and work-family balance

Participation in job crafting intervention>crafting toward interests and strength>person-job fit

JAP

JAP

JAP

JAP

JAP

JAP

Job demand>unhealthy eating in the evening and the interaction of job demands and sleeping was significant. Negative customer interaction>negative mood>unhealthy eating. Various points in a day were sampled; no discussion of multi-day effects.

Partial support for hypotheses. No discussion of the time interval separating data collection

Mixed support for hypotheses. Authors emphasized the need to collect data at multiple time points, but did not discuss the time interval between data collections

Disturbances both affected frustration and enthusiasm, but velocity had longer term effects—authors mentioned the limiting effect of time on the result

Hypothesis supported. No mention of timing.

8 weeks


Major mediation hypothesis unsupported. No discussion of timing.

Work engagement collected at work, but mediator and outcomes collected at the same time
Mediated effects were supported. Authors did discuss the problem of simultaneous collection of mediator and outcome data.

Three months

45 minute experiment

10 months

6–8 weeks after teams started and three months later
Promotive perf.>productivity and prohibitive perf.>safety. Promotive perf.>innovation>perf. gains. Prohibitive perf.>monitoring>safety gains. Timing of meas. was recognized as limitation

Team voice>team innovation & team monitoring>productivity and safety

JAP

Morning noon and evening of fifteen days. In a second study, four daily surveys were administered for four weeks.

Work demands>unhealthy eating buffered by sleep and mediated by self-regulations

JAP


Trust in direct ldrs.>direct ldr procedural justice>trust in top ldrs. & performance. Relationships moderated by vertical collectivism.

Recruitment source timing and diagnosticity>human capital

High performance leads to supportive or undermining behavior by peers mediated by peers' perceived benefit or threat
Eight weeks

Interaction of Job demands and control > death

Ambient discrimination > mentoring > organizational commitment, strain, insomnia, absenteeism. Mentoring activities moderated the discrimination—outcome relationship
4 weeks

JAP

JAP

JAP

PPsych

PPsych

Seven years

Time between receipt of information on jobs and recruitment varied

Three months

Data collected over two years and tied to specific company changes

Process justice & cognitive trust are reciprocally related through three stages of a merger

JAP

Hypo. Support & Discussion or rationale for Time Interval

Not seen as a longitudinal study; time difference was used to control for common method variance

Hypothesis was supported and there was a lengthy discussion of the implications of end-of-career data collection

Hypotheses were supported, but there was no mention of the time interval

Time was the major variable studied and it was related to human capital. Attribution is that students developed skills relevant to specific jobs.

Trickle model supported—direct ldr. trust leads to top ldr. trust mediated by direct ldr procedural justice. No discussion of length of time interval between data collections

Hypotheses confirmed. Data collections tied to specific changes hypothesized to result from merger. Discussed need to estimate relationships in a shorter time frame.

1 week between each of four data Most hypotheses were supported. Discusses lack of true longitudinal collections design.

Time Interval

Newcomers’ task and social info. Seeking>Mgrs. Perceptions of newcomer commitment to task master and social adjustment >mgrs. Provision of help>outcomes

Hypothesized Effects

JAP

Journal

TABLE 2.4.  Continued


LMX > higher salaries & responsibility in subsequent jobs as well as alumni goodwill
18 months

Emotional labor (surface and deep acting) > ego depletion > coworker harming
Two months

Vertical access, horizontal tie strength and core self-evaluation > newcomer learning and organizational identification

Customer mistreatment > Negative mood > employees' helping behavior
Daily before and after the closing of restaurants where participants worked
Hypothesized indirect effect supported. Daily data collection consistent with hypotheses.

Group cohesiveness will moderate OCBI and OCBO and self-efficacy change and mediation against job performance

Employee identification > Use of voice regarding work > managers’ valuation of voice

Company policies and employee passion for volunteering > corporate volunteering climate > Volunteering intentions and behavior

PPsych

PPsych

PPsych

PPsych

AMJ

AMJ

Extensive discussion of socialization and timing of surveys. Vertical access and core self-evaluations were related to outcomes; horizontal tie strength was not. Three-way interaction related to 3 of 4 outcomes.

4 weeks

Two months


Support for hypotheses but no discussion of timing of measurement

Support was found for the hypothesized mediation, but limitation of data collection timing was discussed

Wave 1 followed by Wave 2 three months later and a third wave after another 3 months
All hypotheses were confirmed. No discussion of the timing of data collection.

Time 1 (2 months before org. entry), Time 2 (6 months later), and Time 3 (two months after Survey 2)

Mention that the two month interval may have been too long thereby reducing magnitude of expected relationships.

Hypotheses were supported. No mention of time interval but it seems appropriate.

Mixed support and recognition of the lack of truly longitudinal design

PPsych

Two months

Job challenge and developmental experiences > leader self-efficacy and mngrs. network > promotability and leader effectiveness

PPsych

Time 2 data collected six months after program entry and a third wave 3 months later
Data were collected before, during, and after a program, so the timing of data collection spanned the totality of the participants' experience. Hypotheses were supported.

Culture beliefs > intercultural sensitivity rejection > cross-cultural adjustment

PPsych


CEO Power > board-chair separation and lead independent director appts.

Team based empowerment > team effectiveness moderated by team leader status

Follower's dependence on leader > Abusive supervision time 2 > abusive supervision and reconciliation time 3 moderated by follower's behavior
Three waves of data collection separated by 4 weeks

Social networks > information and communication technology use > entrepreneurial activity and profit

Top executive humility > Middle manager job satisfaction > middle manager turnover moderated by top mngmt. faultlines

Donors contributions > peer recognition of Russian theatres moderated by depth of involvement of external stakeholders

Economic downturns > Zero-sum construal of success > workplace helping
17 years

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

Hypotheses supported; no discussion of timing of data collection.

Intrinsic motivation mediator supported; perspective taking unsupported. Timing of data collection mentioned as a study limitation

Supported – no discussion of time period, but likely not needed

Hypo. Support & Discussion or rationale for Time Interval

7 years

1 year

Yearly over 7 years

First step of causal sequence was confirmed by longitudinal data; second step by experiment

Hypotheses supported. No discussion of the time period over which data were collected

Hypotheses were supported. No mention of time interval

Specifically hypothesized that effects would increase with time. When ICT use and family and community centrality were high entrepreneurial activity increased with time.

Timing of data collection matched followers’ performance reviews. Hypotheses supported in two studies

7 months before intervention and 37 months after
Hypotheses supported; time was sufficient for intervention to affect outcomes

Ten years

Identity conflict and identity enhancement > intrinsic motivation and perspective taking > performance
4 months

AMJ

Monthly performance for four years

Time Interval

Pay for performance > individual performance

Hypothesized Effects

AMJ

Journal

TABLE 2.4.  Continued


Supervisor liberalism > performance-based pay gap between gender groups

Daily surface acting at work > emotional exhaustion > next day work engagement moderated by giving and receiving help

Subordinate deviance > supervisor self -regulation / social exchange > abusive supervision

Risk aversion > guanxi activities

Team commitment and organizational commitment > Dominating, Integrating, Obliging, Avoiding conflict strategies

AMJ

AMJ

AMJ

AMJ

AMJ

Experiment and survey with no time interval

Cross-sectional survey

Two weeks in Study 1; two to four weeks in Study 2

Daily surveys for five days

25 years

Mixed support in the survey replication of an experiment. No mention of time.

No discussion of timing of data collection, but hypothesis supported

Indirect effect for self-regulation was supported, but not the indirect effect for social exchange. Emphasized their use of a cross-lagged research design, but did not discuss timing of data collection.

Hypotheses supported with giving help being a significant moderator. No mention of time

Hypothesis supported even after control variables are considered. No discussion of time period.



When studying a work-related issue, it seems that the first data collection should occur at employment or immediately before or after an important intervention that is the study focus. This was the case in some of the papers, but very often the timing of initial or subsequent data collection appeared to be a matter of convenience (e.g., every two months or every four weeks). On a positive note, it seems that a very large proportion of the papers, particularly in AMJ, were longitudinal. This was clearly not the case a couple of decades ago. It should also be noted that the data provided in Table 2.4 are partly a result of one reader's interpretation of the studies. In some of these studies, the authors may argue that time was considered, and/or it was irrelevant.

It is also the case that most studies employing longitudinal designs are instances of quasi-experimentation; hence, the causal inferences derived from these studies are often problematic (Shadish, Cook, & Campbell, 2002). These studies are almost always designed to test mediation hypotheses using hierarchical regression or SEM. These models often reflect a poor basis for making causal inferences, even though authors frequently imply, directly or indirectly, that they provide support for causal hypotheses. These inference problems and potential solutions have been described in a series of papers by Stone-Romero and Rosopa (2004, 2008, 2011). They make the case that causal inferences are not justified when data are not generated by an experimental design that tests the effects of both independent and mediator variables. Like earlier authors (e.g., James, Mulaik, & Brett, 2006), they point out that SEM findings (and analyses using hierarchical linear regression) may support a model, but that other models that include a different causal direction or unmeasured common causes may also be consistent with the data. A longitudinal design that includes theoretically and/or empirically supported differences in the timing of data collection would seem to obviate at least the problem of misspecified causal direction.

Given the importance of time-ordering of the independent, mediator, and outcome variables, as argued above, it is surprising that Wood, Goodman, and Cook (2008) found only 11% of the studies in their review of mediation research incorporated time ordering. Their results are consistent with the data in Table 2.4. The past decade since the Wood et al. review has produced very little change in longitudinal research; even when data are collected at multiple points in time, there is little or no justification of the time points selected. Those conducting longitudinal research are missing an opportunity to provide stronger justification of causal inference when they fail to design their research with careful consideration of time (Mitchell & James, 2001).

ESTIMATES OF EFFECT SIZE

Our sophisticated data analyses often do not provide an index of what difference the results of a study might make in important everyday outcomes or decision making. We have gotten good at reporting d statistics, and we use the Cohen (1977) guides for small, medium, and large effect sizes. The adequacy of the use of Cohen's d to communicate effect size, as well as other similar statistical indices, was identified as an urban legend in a recent book (Cortina & Landis, 2009).
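Because d will come up repeatedly below, a brief sketch of how it is computed from group statistics, and of how a correlation can be re-expressed on the same scale, may be useful; the group means, standard deviations, and sample sizes here are hypothetical.

```python
import numpy as np

# Hypothetical group statistics: trained vs. untrained employees on a
# performance rating (all values are made up for illustration).
m1, s1, n1 = 4.2, 0.9, 120   # trained
m2, s2, n2 = 3.9, 1.0, 115   # untrained

# Cohen's d using a pooled standard deviation
sd_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d_from_means = (m1 - m2) / sd_pooled

# Re-expressing a correlation as d (equal-n approximation)
r = 0.30
d_from_r = 2 * r / np.sqrt(1 - r**2)

print(f"d from group statistics: {d_from_means:.2f}")
print(f"d implied by r = {r}: {d_from_r:.2f}")
```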


As did Cohen, these authors point to the context of the research as an important factor in presenting and interpreting effect sizes. An effect size of .1 is awfully important if the outcome predicted is one's life. It might not be that impressive if it is one's level of organizational commitment (my apology to those who study organizational commitment). They also point to the strength (or lack thereof) of the research design that produces an effect. If the effect is easily produced, then it should be less likely to be dismissed as unimportant. If one needs to use a sledgehammer manipulation to get an effect, it is probably not all that practically important. Perhaps combining both of these ideas, Cortina and Landis describe the finding that taking aspirin accounts for 1/10 of one percent of the variance in heart attack occurrence, but such a small intervention with an important outcome makes it a significant effect (in my opinion and, it seems, that of the medical profession as well).

JAP does require a section of the discussion be devoted to the theoretical and practical significance of the study results, and most articles in other journals include one as well. However, this often appears to be a pro forma satisfaction of a publication requirement. Moreover, as mentioned above, many of our sophisticated data analyses do not translate into an effect size. Even when they do, unless these d statistics or similar effect sizes are in a metric related to organizationally or societally important outcomes, they are not likely to have much influence on decision makers. It is also interesting that the literature on utility (Cascio, 2000), which was oriented to estimating the effect of various behavioral interventions in dollar terms, has pretty much faded away. I also suspect that it would be hard for even a doctoral-level staff person in an organization to translate the results of a structural equation analysis or a multilevel analysis or even stepwise regressions into organizationally relevant metrics.

A good example, and probably an exception, of a combination of stepwise regression analyses of the impact of various recruitment practices and sources of occupational information is a paper by Campion, Ployhart, and Campion (2017). The usual regression-based statistics were used to evaluate hypotheses and then translated into the percent passing an assessment of critical skills and abilities under different recruitment scenarios. This information communicated quite directly with information users. These would be very important data, for example, for the military in evaluating the impact of lowering entrance qualifications of military recruits on subsequent failure rates in training or dropout rates. Incidentally, Campion et al. also reported the number of applicants who heard about jobs from various sources and the quality (in terms of assessed ability) of the applicants.

As for the previous research issues raised in this paper, I reviewed papers published in the same three journals (Journal of Applied Psychology, Personnel Psychology, and Academy of Management Journal) to ascertain the degree to which authors addressed the practical implications of their research in some quantifiable manner, or at least in a manner that readers could use to understand what might or should be changed in an organizational practice to benefit from the study findings.


Since all papers have the potential for practical application, I reviewed only the last 12 papers published in 2017 in these three journals. In most articles published in these three journals, there was a section titled "practical implications." I reviewed these sections as well as the authors' reports regarding their data in producing Table 2.5. The table includes a column in which the primary interest of the author(s) is listed. I then considered whether there was any presentation of a quantitative estimate of the impact of the variables studied on some outcome (organizational or individual).

Most papers presented their results in terms of correlations or multiple regressions, but many also presented the results of structural equation modeling or hierarchical linear modeling. There were only a few papers in which any quantitative index other than the results of these statistical analyses of the impact of a set of "independent" variables was presented. These indices were d (standardized mean difference) or odds ratios. These indices may also be deficient in that the metric to which they refer may or may not be organizationally relevant. For example, I might observe a .5 standard deviation increase in turnover intent, but unless I know how turnover intent is related to actual turnover in a given circumstance and how that turnover is related to production, profit, or the expense of recruiting and training new personnel, it is not easy to make the results relevant to an organizational decision maker. Of course, it is also the case that correlations can be translated to d, and that means and standard deviations can be used to compute d and, with appropriate available metrics, to convert results to some organizationally relevant metric. However, this was never done in the 36 studies reviewed.

Nearly all authors did make some general statements as to what they believed their study implied for the organization or phenomena they studied. Abbreviated forms of these statements are included in the last column of Table 2.5. As mentioned above, Journal of Applied Psychology includes a "practical implications" section in all articles. As is obvious in these statements, authors have given some thought to the practical implications of their work, and their statements relate to a wide variety of organizationally and individually relevant outcomes. What is not apparent in Table 2.5 is that these sections in virtually all papers rarely exceed one to three paragraphs of material and usually did not discuss how their statements would need to be modified for use or implementation in a local context.

Job insecurity

Flexible working arrangements

Environmental and climate change

Insomnia

Stereotype threat, training, and performance potential

Snacking at work

Customer behavior and service incivility

Perceptions of novelty and creativity

Authoritarian leadership

Gender transition and job attitudes and experiences

Gender and crying

JAP

JAP

JAP

JAP

JAP

JAP

JAP

JAP

JAP

JAP

JAP

PPsych Work demands, job control, and mortality

Workplace gossip

Nature of Phenomenon Studied

JAP

Journal

Odds ratios

NO

No

No

No

No

No

Yes—d

No

Odds ratios

No

No

No

Effect Size Estimates Practical Implications Suggested


Job demands and job control interacted to produce higher mortality. Organizations should seek to increase employee control over job tasks.

Crying was associated with lower performance and leader evaluations for males. Men should be cautious in emotional expression.

Gender transition related to job satisfaction, person-organization fit and lower perceived discrimination. Organizations should promote awareness and inclusivity

Negative effects of authoritarian leadership on performance, OCB and intent to stay moderated by power distance and role breadth self-efficacy

Organizations should encourage creativity and innovation and use employees with promotion focus to identify good ideas

Verbal aggression directed to an employee and interruptions lead to employee incivility

Individual, organizational, and situational factors affect what employees eat. Organizations should promote healthy organizational eating climate.

Stereotype threat affects learning, which has implications for human potential over time

Treatment for insomnia had positive effects on OCB and interpersonal deviance

Self-concordance of goals and climate change were related to petition signing behavior and intentions to engage in sustainable climate change behavior

Improve employees’ wellbeing and effectiveness. Flextime should be accompanied by some time structuring and goal setting

Risk that job performance and OCB will suffer and intent to leave will increase

Discussed gossip relationships with workplace deviance and promoting norms for acceptable behavior

TABLE 2.5.  Reports of Practical Impact of Research and Effect Sizes


Nature of Phenomenon Studied

No No

PPsych Role-based identity at work

PPsych Leader development

No No No No No

PPsych Emotional labor in customer contacts

PPsych Newcomer adjustment

PPsych Training transfer

PPsych Family role identification and leadership

AMJ

Curiosity and creativity

Organizations should consider training employees on the biases faced by women in leadership roles.

No

PPsych Status incongruence and the impact of transformational leadership

Study offers suggestions as to how to provide feedback and that curiosity be considered when selecting people into "creative" jobs. Creative workers must have time to consider revisions.

Organizations and individuals should promote family involvement as these activities enhance transformational leadership behavior

Expectations regarding transfer of training should take account of different learning trajectories and opportunities to perform.

Organizations should tailor their approach to newcomer socialization to individual needs.

Organizations should promote deep acting rather than surface acting in service employees to prevent harming behavior to clients and coworkers.

LMX quality relationships are related to career progress in new organizations and alumni good will. Orgs. should promote internal job opportunities

PPsych LMX leadership effects

Combinations of developmental exercises: formal training, challenging responsibilities, and developmental supervision best in developing leaders.

Provides role-based identity scales and suggests that employees who assume too many roles may experience burnout.

Provision of experiences that foster social adjustment increase benefits derived from international experiences

No

PPsych Cultural intelligence

Practices that promote balance satisfaction and effectiveness may enhance job attitudes and performance.

Practical Implications Suggested

High quality formal and informal mentoring relationships that offer social support reduce the negative impact of racism and lead to a number of positive job outcomes.

No

Effect Size Estimates

PPsych Mentoring as a buffer against discrimination No

PPsych Work family balance

Journal

TABLE 2.5.  Reports of Practical Impact of Research and Effect Sizes


Ambiguity in corporate communication in response to competition

Value of voice

Pay for performance

Identity conflict and sales performance

Board director appt and firm performance

Team-based empowerment

Abusive supervision

Entrepreneurs’ motivation shapes the characteristics and strategies of firms

Innovation and domain experts

Volunteering climate

Women entrepreneurs in India

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

AMJ

CEO and boards can be balanced in terms of power and this likely leads to positive firm level outcomes.

Managers can influence performance by reducing role conflict and increasing identity enhancement.

Employees indebted to a pay for performance plan will react positively to debt forgiveness but only in the short term.

Exercise of voice should be on issues that are feasible in terms of available resources. Speaking up on issues that are impossible to address will have negative impact on the manager and employee

Odds ratios and profit in rupees

No

No

No

No

Community and social networks lead to entrepreneurial activity and profit moderated by information and technology use

Fostering collective pride about volunteering leads to affective commitment and to volunteering intentions.

Experts are useful in generating potential problem solutions, but may interfere in selecting the best solution

Describes the process of organizing new firms and whether founders remain till the firm becomes operational or leave

Provides strategies for abused followers to reconcile with an abusive supervisor. Organizations should encourage leaders and followers to foster mutual dependence.

Percentage of same day appt.
High status leaders struggle with team-based empowerment, and specific leader behaviors facilitate or hinder delegation requests

No

d of selling intention

No

No

Likelihood of competitive actions
Use vague language in annual reports to reduce competitive entry in your market



The utility analyses developed by Schmidt, Hunter, McKenzie, and Muldrow (1979) and popularized by Cascio (2000) were directed to an expression of study results in dollar terms. This approach to utility received a great deal of attention a couple of decades ago, but interest in this approach has waned. Several issues may have been critical. First, expressing some variables in dollar terms may have seemed artificial (e.g., volunteering, team-based empowerment, OCBs, rudeness). Second, calculations underlying utility estimates devolved into some fairly arcane economic formulations (e.g., Boudreau, 1991), which in turn required assumptions that may have made organizational decision makers uncomfortable. Third, the utility estimates were based on judgments that some decision makers may have suspected were inaccurate (Macan & Highhouse, 1994), even though the consistency across judges was usually quite acceptable (Hunter, Schmidt, & Coggin, 1988). Finally, some estimates were so large (e.g., Schmidt, Hunter, & Pearlman, 1982) and the vagaries of organizational life so unpredictable (Tenopyr, 1987) that utility estimates were rarely realized.

It appears that HR personnel are facing a similar set of "so what" questions as they attempt to make sense of the Big Data analyses that are now possible and increasingly common. Angrave et al. (2016) report that HR practitioners who are faced with these data are enthused but feel no better informed about how to put them into practice than before they were informed about the data. This seems to be the situation that those working on utility analyses confronted in the 80s and 90s. Although many organizations have begun to engage with HR data and analytics, most seem not to have moved beyond operational reporting. Angrave et al. assert that four items are important if HR is to make use of Big Data analytics. First, there must be a theory of how people contribute to the success of the organization. Do they create, capture, and/or leverage something of value to the organization, and what is it? Second, the analyst needs to understand the data and the context in which they are collected to be able to gain insight into how best to use the metrics that are reported. Third, these metrics must help identify the groups of talented people who are most instrumental in furthering organizational performance. Finally, simple reports of relationships are not sufficient; there must be attention given to the use of experiments and quasi-experiments that show that a policy or intervention improves performance.

FIGURE 2.1.  Example of an Expectancy Chart Reflecting the Relationship between College GPA and Situational Judgment Scores
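Figure 2.1 is the kind of display recommended in the paragraph that follows. As a rough sketch, the percentages behind such a chart can be produced with nothing more than a grouping operation; the variable names, cut score, and simulated data below are hypothetical rather than taken from any study discussed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: situational judgment test (SJT) scores and college GPA
rng = np.random.default_rng(0)
sjt = rng.normal(50, 10, 500)
gpa = 2.8 + 0.02 * (sjt - 50) + rng.normal(0, 0.4, 500)
df = pd.DataFrame({"sjt": sjt, "success": gpa >= 3.0})

# Expectancy table: percent reaching a 3.0 GPA within each fifth of SJT scores
df["sjt_band"] = pd.qcut(df["sjt"], 5,
                         labels=["lowest", "2nd", "middle", "4th", "highest"])
expectancy = (df.groupby("sjt_band", observed=True)["success"]
                .mean().mul(100).round(1))
print(expectancy)
```

A table or bar chart of these percentages speaks directly to a decision maker in a way that a regression weight or fit index does not.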


Perhaps one takeaway or recommendation from this discussion is that authors use sophisticated statistics to answer theoretical questions and then use descriptive statistics, including percentages or mean differences in meaningful, organizationally relevant metrics, to communicate with the consumers of our research. Or, engage organizational decision makers in making the translations of these simple statistics to a judgment about practical utility. In this context, perhaps we should "reinvent" the expectancy tables suggested for this use in early industrial psychology textbooks (e.g., Tiffin & McCormick, 1965). See Figure 2.1 for an example.

SUMMARY AND CONCLUSIONS

In conclusion, there is much about which to congratulate ourselves regarding contributions to our science that have produced an explosion of quantitative analysis techniques that help us understand the data we collect, as well as help to educate our colleagues on their use and appropriate interpretation. While continuing our work in these areas, I also think we should pay close attention to measurement issues, to research design concerns (particularly in the context of longitudinal efforts), and to our ability to communicate our results in convincing ways to those who consume our research. These points have all been made by others, but they remain issues with which we must grapple. It may be time that editors and reviewers require that researchers present more information on the content of measures and the validation of those measures; that authors who investigate some process in longitudinal research explain why data were or were not collected at certain time points; and that authors provide indices of what impact the results of their research might have on organizational outcomes.

REFERENCES

Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and analytics: Why HR is set to fail the big data challenge. Human Resource Management Journal, 26, 1–12.
Barnes, C. M., Miller, J. A., & Bostock, S. (2017). Helping employees sleep well: Effects of cognitive behavior therapy for insomnia on work outcomes. Journal of Applied Psychology, 102, 104–113.
Boudreau, J. W. (1991). Utility analysis for decisions in human resource management. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology: Vol. 2 (pp. 621–746). Palo Alto, CA: Consulting Psychologists Press.
Campion, M. C., Ployhart, R. E., & Campion, M. A. (2017). Using recruitment source timing and diagnosticity to enhance applicants' occupation-specific human capital. Journal of Applied Psychology, 102, 764–781.
Cascio, W., & Boudreau, J. (2011). Investing in people: The financial impact of human resource initiatives (2nd ed.). Upper Saddle River, NJ: Pearson.
Cascio, W. F. (2000). Costing human resources: The financial impact of behavior in organizations. Cincinnati, OH: Southwestern.
Chen, G., Ployhart, R. E., Cooper-Thomas, H. D., Anderson, N., & Bliese, P. D. (2011). The power of momentum: A new model of dynamic relationships between job satisfaction changes and turnover intentions. Academy of Management Journal, 54, 159–181.

Clark, L. A. (2006). When a psychometric advance falls in the forest. Psychometrika, 71, 447–450.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
Cortina, J., Sheng, A., List, S. K., Keeler, K. R., Katell, L. A., Schmitt, N., Tonidandel, S., Summerville, K., Heggestad, E., & Banks, G. (under review). Why is coefficient alpha?: A look at the past, present, and (possible) future of reliability assessment. Journal of Applied Psychology.
Cortina, J. M., Aguinis, H., & DeShon, R. P. (2017). Twilight of dawn or of evening? A century of research methods in the Journal of Applied Psychology. Journal of Applied Psychology, 102, 274–290.
Cortina, J. M., & Landis, R. S. (2009). When small effect sizes tell a big story, and when large effect sizes don't. In C. E. Lance & R. J. Landis (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity, and fables in the organizational and social sciences. New York, NY: Routledge.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418.
Deng, H., Walter, F., Lam, C. K., & Zhao, H. H. (2017). Spillover effects of emotional labor in customer service encounters toward coworker harming: A resource depletion perspective. Personnel Psychology, 70, 469–502.
DeShon, R. P. (2013). Inferential meta-themes in organizational science research: Causal research, system dynamics, and computational models. In N. Schmitt & S. Highhouse (Eds.), Handbook of psychology: Vol. 12. Industrial and organizational psychology (pp. 14–42). New York, NY: Wiley.
Dudycha, A. L., & Carpenter, J. B. (1973). Effect of item format on item discrimination and difficulty. Journal of Applied Psychology, 58, 116–121.
Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105, 399–412.
Edwards, A. L. (1957). Techniques of attitude scale construction. New York, NY: Appleton-Century-Crofts.
Fried, E. I., & Flake, J. K. (2018). Measurement matters. Observer, 31, 29–31.
Golubovich, J., Grand, J. A., Ryan, A. M., & Schmitt, N. (2014). An examination of common sensitivity review practices in test development. International Journal of Selection and Assessment, 22, 1–11.
Grand, J. A. (2017). Brain drain? An examination of stereotype threat effects during training on knowledge acquisition and organizational effectiveness. Journal of Applied Psychology, 102, 115–150.
Harlow, L. L., & Oswald, F. L. (2016). Big data in psychology: Introduction to the special issue. Psychological Methods, 21, 447–457.
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164.

Hunter, J. E., Schmidt, F. L., & Coggin, T. D. (1988). Problems and pitfalls in using capital budgeting and financial accounting techniques in assessing the utility of personnel programs. Journal of Applied Psychology, 73, 522–528.
Ilgen, D. R., & Hulin, C. L. (Eds.). (2000). Computational modeling of behavioral processes in organizational research. Washington, DC: American Psychological Association Press.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational Research Methods, 9, 233–244.
Kaltianen, J., Lipponen, J., & Holtz, B. C. (2017). Dynamic interplay between merger process justice and cognitive trust in top management: A longitudinal study. Journal of Applied Psychology, 102, 636–647.
Klein, K. J., & Kozlowski, S. W. J. (Eds.). (2000). Multilevel theory, research and methods in organizations. San Francisco, CA: Jossey-Bass.
Macan, T. H., & Highhouse, S. (1994). Communicating the utility of human resource activities: A survey of I/O and HR professionals. Journal of Business and Psychology, 8, 425–436.
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternate fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592.
McNeish, D. (2018). Thanks coefficient alpha: We'll take it from here. Psychological Methods, 23, 412–433.
Mitchell, T. R., & James, L. R. (2001). Building better theory: Time and the specification of when things happen. Academy of Management Review, 26, 530–547.
Murphy, K. R., & Russell, C. J. (2017). Mend it or end it: Redirecting the search for interactions in the organizational sciences. Organizational Research Methods, 20, 549–573.
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96, 966–980.
Pitariu, A. H., & Ployhart, R. E. (2010). Explaining change: Theorizing and testing dynamic mediated longitudinal relationships. Journal of Management, 36, 405–429.
Ployhart, R. E., & Kim, Y. (2013). Dynamic growth modeling. In J. M. Cortina & R. S. Landis (Eds.), Modern research methods (pp. 63–98). New York, NY: Routledge.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667–696.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schmidt, F. L., Hunter, J. E., McKenzie, R., & Muldrow, T. (1979). Impact of valid selection procedures on workforce productivity. Journal of Applied Psychology, 64, 609–626.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1982). Assessing the economic impact of personnel programs on workforce productivity. Personnel Psychology, 35, 333–347.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120.
Sonnentag, S., Pundt, A., & Venz, L. (2017). Distal and proximal predictors of snacking at work: A daily-survey study. Journal of Applied Psychology, 102, 151–162.
Stone-Romero, E. F., & Rosopa, P. J. (2004). Inference problems with hierarchical multiple regression-based tests of mediating effects. Research in Personnel and Human Resources Management, 23, 249–290.
Stone-Romero, E. F., & Rosopa, P. J. (2008). The relative validity of inferences about mediation as a function of research design characteristics. Organizational Research Methods, 11, 326–352.
Stone-Romero, E. F., & Rosopa, P. (2011). Experimental tests of mediation models: Prospects, problems, and some solutions. Organizational Research Methods, 14, 631–646.
Tenopyr, M. L. (1987). Policies and strategies underlying a personnel research program. Paper presented at the Second Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, Georgia.
Tiffin, J., & McCormick, E. J. (1965). Industrial psychology. Englewood Cliffs, NJ: Prentice-Hall.
Vancouver, J. B., & Purl, J. D. (2017). A computational model of self-efficacy's various effects on performance: Moving the debate forward. Journal of Applied Psychology, 102, 599–616.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70.
Walker, D. D., van Jaarsveld, D. D., & Skarlicki, D. P. (2017). Sticks and stones can break my bones but words can also hurt me: The relationship between customer verbal aggression and employee incivility. Journal of Applied Psychology, 102, 163–179.
Willett, J. B., & Sayer, A. G. (1994). Using covariance structure analysis to detect correlates and predictors of change. Psychological Bulletin, 116, 363–381.
Wood, R. E., Goodman, J. S., & Cook, N. D. (2008). Mediation testing in management research. Organizational Research Methods, 11, 270–295.
Zhou, J., Wang, X. M., Song, L. J., & Wu, J. (2017). Is it new? Personal and contextual influences on perceptions of novelty and creativity. Journal of Applied Psychology, 102, 180–202.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternate conceptualizations of reliability. Psychometrika, 70, 1–11.

CHAPTER 3

RESEARCH DESIGN AND CAUSAL INFERENCES IN HUMAN RESOURCE MANAGEMENT RESEARCH

Eugene F. Stone-Romero

The validity of inferences derived from empirical research in Human Resource Management (HRM), Industrial and Organizational Psychology (I&O), Organizational Behavior (OB), and virtually all other disciplines is a function of such facets of research design as experimental design, measurement methods, sampling strategies, and statistical analyses (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish, Cook, & Campbell, 2002; Stone-Romero, 2009, 2010). Research design is "an overall plan for conducting a study" (Stone-Romero, 2010) that takes into account the factors of internal validity, external validity, construct validity, and statistical conclusion validity (Shadish et al., 2002).

Unfortunately, the HRM literature is replete with empirical studies that have highly questionable levels of internal validity, i.e., the degree to which the results of research allow for valid inferences about causal connections between variables. As is detailed below, the validity of such inferences is a function of the experimental designs used in research. In view of this, the major concern of this article is experimental design.



It determines not only the validity of causal inferences in research aimed at testing model-based predictions, but also the effectiveness of HRM policies and practices. In the interest of explicating the way in which experimental design affects the validity of causal inferences in research, this article considers the following issues: (a) invalid causal inferences in HRM research, (b) the importance of valid causal inferences in basic and applied research and the facets of validity in research, (c) formal reasoning procedures as applied to the results of empirical research, (d) the importance of experimental design for valid causal inferences, (e) the settings in which research is conducted, (f) experimental design options (randomized experimental, quasi-experimental, and non-experimental) for research, (g) other research design issues, (h) overcoming objections that have been raised about randomized experiments in HRM and related disciplines, and (i) conclusions and recommendations for basic and applied research and editorial policies.

INVALID CAUSAL INFERENCES IN THE HRM LITERATURE

An inspection of the literature in HRM and allied disciplines shows a pervasive pattern of unwarranted inferences about causal connections between and among variables. Typically, such inferences are found in reports of the findings of non-experimental studies. In such studies, assumed independent and dependent variables are measured, correlations between and/or among variables are determined, and causal inferences are generated on the basis of the observed correlations. Illustrative of the unwarranted inferences in the HRM literature are the very large number of articles that (a) have such titles as "The Impact of X on Y," "The Effects of X on Y," and "The Influence of X on Y," and (b) contain unwarranted inferences about causal relations between variables. Among the many hundred examples of this are the following.

First, on the basis of a non-experimental study of the relation between job satisfaction (satisfaction hereinafter) and job performance (performance hereinafter) and a so-called "causal correlational analysis," Wanous (1974) concluded that the results of his study indicated that performance causes intrinsic satisfaction and extrinsic satisfaction causes performance. As is explained below, the results of a causal correlational analysis do not allow for valid inferences about causality. More generally, the findings of any non-experimental study provide a very poor basis for justifying causal inferences.

Second, using the findings of a meta-analysis of 16 non-experimental studies of the relation between job attitudes and performance and a meta-analytic based regression analysis, Riketta (2008) argued that job attitudes are more likely to influence performance than the reverse. Both attitudes and performance were measured. As is detailed below, regression analyses do not afford a valid basis for causal inferences unless the analyses are based upon research that uses randomized experimental designs. Regrettably, Riketta's study did not use such a design. Instead, the design was non-experimental, making suspect any causal inferences.


Third, relying on a structural equation modeling (SEM)-based analysis of the results of three non-experimental studies that examined relations between (a) core self-evaluations and (b) job and life satisfaction, Judge, Locke, Durham, and Kluger (1998) concluded that "The most important finding of this study is that core evaluations of the self have consistent effects on job satisfaction, independent of the attributes of the job itself. That is, the way in which people see themselves affects how they experience their jobs and even their lives" (p. 30). In view of the fact that the researchers applied SEM to the findings of non-experimental studies, causal inferences are not justified.

As is explained in detail below in the section titled "Research Design Options," unless studies use randomized experimental designs, causal inferences are seldom, if ever, justified. Thus, such inferences were unwarranted in the just-described studies and thousands of other non-experimental studies in the HRM literature.

The findings of a study by Stone-Romero and Gallaher (2006) illustrate the severity of causal inference problems in empirical studies in HRM and allied disciplines. They performed a content analysis of 161 articles that were randomly sampled from articles published in the 1988, 1993, 1998, and 2003 volumes of journals that publish HRM-related articles (i.e., Personnel Psychology, Organizational Behavior and Human Decision Processes, the Academy of Management Journal, and the Journal of Applied Psychology). The studies reported in these articles used various types of experimental designs (i.e., non-experimental, quasi-experimental, and randomized-experimental). The articles were searched for instances of the inappropriate use of causal language in their title, abstract, and results and/or discussion sections. The search revealed one or more instances of unwarranted causal inferences in 79% of the 73 articles that were based on non-experimental designs, and 78% of the 18 articles that used quasi-experimental designs. Overall, the analysis of the 161 articles showed that causal inferences were unwarranted in a very large percentage of the research-based articles.

IMPORTANCE OF VALID INFERENCES IN BASIC AND APPLIED RESEARCH

Valid inferences about causal relations between variables (i.e., internal validity) are a function of experimental design. The major design options are randomized-experimental, quasi-experimental, and non-experimental. Convincing inferences about the degree to which a study's results generalize to and across populations (i.e., external validity) are contingent on the way sampling units are selected (e.g., random, non-random). Persuasive inferences about the nature of the constructs dealt with by a study (i.e., construct validity) are conditional on the way that the study's variables are manipulated or measured. Finally, valid inferences about relations between and among variables (i.e., statistical conclusion validity) are dependent on the appropriateness of the statistical analyses used in a study.


Of the four facets of validity, internal validity is the "sine qua non" in research probing causal connections between or among variables (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish et al., 2002). Unless the results of an empirical study show that an independent variable (i.e., an assumed cause) is causally related to a dependent variable (i.e., an assumed effect), it is of little consequence that the study has high levels of external, construct, or statistical conclusion validity.

Internal Validity in Basic Research

Internal validity is a crucial issue in both basic and applied research. In basic (theory testing) research it is vital to know if the causal links posited in conceptual or theoretical models are supported by research results. For example, if a theory posits that satisfaction produces changes in performance, research used to test the theory should show that experimentally induced changes in satisfaction lead to predicted changes in performance. None of the many hundred non-experimental studies or meta-analyses of this relation have produced any credible evidence of this causal connection (e.g., Judge, Thoresen, Bono, & Patton, 2001). On the other hand, there is abundant experimental evidence that demonstrates a causal connection between training and performance (e.g., Noe, 2017).

Internal Validity in Applied Research

In applied research it is vital to show that an organizational intervention (e.g., job enrichment) leads to hypothesized changes in one or more dependent variables (e.g., satisfaction, employee retention, performance). For example, unless research can adduce support for a causal relation between the use of a selection test and employee performance, it would make little or no sense for organizations to use the test.

FORMAL REASONING AND MODEL TESTING IN RESEARCH

It is instructive to consider model testing research in the context of formal reasoning techniques (Kalish & Montague, 1964). They allow one to use symbolic sentences along with formal reasoning principles to assess the validity of research-based conclusions. Let's consider this in the context of a simple research-related example. In it, (a) MC stands for "the model being tested is correct," (b) RC represents "the research results are consistent with the model," and (c) the symbol ~ denotes negation (e.g., ~MC means that the model is not correct). The general strategy employed in testing models (or theories) is to (a) formulate a model, (b) conduct an empirical study designed to test it, and (c) use the study's results to argue that the model is correct or incorrect. In terms of symbolic sentences, the reasoning that is almost universally used by researchers is as follows: The researcher (a) assumes that if the model is correct then the results of research will be consistent with the model, that is, MC → RC, (b) shows RC through empirical research, and (c) uses the RC finding to infer MC.


The inference is incorrect because it also may be true that ~MC → RC. For example, a researcher (a) assumes a model in which satisfaction causes performance, (b) conducts an empirical study that shows a .30 correlation between these variables, and (c) concludes that the model is correct. This is an invalid inference because the same correlation would have resulted from a model that posited performance to be the cause of satisfaction. In addition, it could have resulted from the operation of one or more confounding variables (e.g., worker pay that is positively contingent on performance).

For example, a very creatively designed randomized experimental study by Cherrington, Reitz, and Scott (1971) examined the relation between performance and satisfaction. Subjects were randomly assigned to one of two conditions. In the first part of the study they performed a task for one hour. Then, subjects in one condition received rewards that were positively contingent on performance, whereas in the other condition rewards were negatively contingent on performance. Subsequently, they performed the task for another hour. The researchers found that (a) across reward contingency conditions there was no relation between satisfaction and second-hour performance, (b) satisfaction was positively related to second-hour performance for subjects who received rewards that were positively contingent on performance, and (c) satisfaction was negatively related to second-hour performance for those who received negatively contingent rewards. These results are both interesting and important. They show that the satisfaction-performance relation is a function of reward contingency. More specifically, reward contingency was responsible for the correlation between satisfaction and performance. As Cherrington et al. concluded: “Our theory implies no cause-effect relationship between performance and satisfaction; instead, it stresses the performance-reinforcing as well as the satisfaction-increasing potential of contingent reinforcers” (p. 535).

It merits stressing that research results that are consistent with an assumed model (RC) have no necessary implications for its correctness. However, via the inference rule of Modus Tollens (Kalish & Montague, 1964), a study that showed ~RC would allow for the logical inference of ~MC; that is, if the study failed to provide support for the model, then the researcher could logically conclude that the model was incorrect. Of course, the latter inference would be contingent upon the study having high levels of both construct validity and statistical conclusion validity.

CAUSAL INFERENCES AND EXPERIMENTAL DESIGN

In an empirical study of the relation between X (an assumed cause) and Y (an assumed effect), there are three conditions that are vital to valid causal inferences: (a) X precedes Y in time, (b) X and Y are related to one another, and (c) there are no rival explanations of the relation between X and Y (for example, the possibility that a third variable, Z, is a cause of both X and Y and that there is no causal relation between X and Y). These requirements are most fully satisfied in randomized experimental research and are not satisfied as adequately in either non-experimental or quasi-experimental research, including longitudinal research (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish et al., 2002; Stone-Romero, 2009, 2010). Research that demonstrates that X and Y are related satisfies only one such condition. Thus, it does not serve as a sufficient basis for inferring that X causes Y. As the well-known adage states, “correlation does not imply causation.”
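To illustrate the role of rival explanations, the brief simulation below (a hypothetical example, not one of the studies discussed in this chapter) generates two measured variables, O1 and O2, that have no causal connection to one another but share an unmeasured cause Z. The sizable observed correlation is precisely the kind of rival explanation that non-experimental data cannot rule out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Z is an unmeasured confound (e.g., a reward contingency) that influences
# both O1 and O2; there is no causal path between O1 and O2 themselves.
z = rng.normal(size=n)
o1 = 0.6 * z + rng.normal(size=n)   # e.g., measured "satisfaction"
o2 = 0.6 * z + rng.normal(size=n)   # e.g., measured "performance"

r = np.corrcoef(o1, o2)[0, 1]
print(f"Observed correlation between O1 and O2: {r:.2f}")  # roughly .26
```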


Assume that a study provides evidence of a correlation between two measured variables, O1 and O2. As Figure 3.1 indicates, this finding might result from (a) O1 being a cause of O2 (Figure 3.1a); (b) O2 being a cause of O1 (Figure 3.1b); or (c) the relation between O1 and O2 being a non-causal function of a third unmeasured variable, O3 (Figure 3.1c).

FIGURE 3.1.  Possible causal relations among several observed variables.


So, evidence that O1 and O2 are correlated is insufficient to infer that there is a causal connection between these variables. Nevertheless, as was noted above, it is quite common for researchers in HRM and allied disciplines to base causal inferences on evidence of relations between variables (e.g., an observed correlation between two variables) as opposed to research that uses a sound experimental design. One vivid example of this is the body of research on the relation between satisfaction and organizational commitment (commitment hereinafter). On the basis of observed correlations between measures of these two variables and different types of statistical analyses: (a) some researchers (e.g., Williams & Hazer, 1986) have concluded that satisfaction causes commitment, (b) other researchers (e.g., Bateman & Strasser, 1984; Koslowsky, 1991; Weiner & Vardi, 1980) have inferred that commitment causes satisfaction, (c) still others (e.g., Lance, 1991) have reasoned that satisfaction and commitment are reciprocally related to one another, and (d) yet others have argued that the relation between satisfaction and commitment is unclear or spurious (Farkas & Tetrick, 1989). Another instance of invalid causal inferences relates to the correlation between job attitudes (attitudes hereinafter) and performance. As noted above, on the basis of a meta-analysis of the results of 16 non-experimental studies, Riketta (2008) inappropriately concluded that attitudes are more likely to influence performance than vice versa. The fact that the study was based on meta-analysis does nothing whatsoever to bolster causal inferences.

RESEARCH SETTINGS

Empirical research can be conducted in what have typically been referred to as “laboratory” and “field” settings (e.g., Bouchard, 1976; Cook & Campbell, 1976, 1979; Evan, 1971; Fromkin & Streufert, 1976; Locke, 1986). However, as John Campbell (1986) and others (e.g., Stone-Romero, 2009, 2010) have argued, the laboratory versus field distinction is not very meaningful. One important reason for this is that research “laboratories” can be set up in what are commonly referred to as “field” settings. For example, an organization can be created for the specific purpose of conducting a randomized-experimental study (Shadish et al., 2002, p. 274). Clearly, such a setting blurs the distinction between so-called laboratory and field research. To better characterize the settings in which research takes place, Stone-Romero (2009, 2010) recommended that they be categorized in terms of their purpose. More specifically, (a) special purpose (SP) settings are those that were created for the specific purpose of conducting empirical research and (b) non-special purpose (NSP) settings are those that were created for a non-research purpose (e.g., manufacturing, consulting, retailing). In the interest of clarity about research settings, the SP versus NSP distinction is used in the remainder of this chapter.


RESEARCH DESIGN OPTIONS IN EMPIRICAL STUDIES

In designing an empirical study, a researcher is faced with a number of options, including (a) the type of experimental design, (b) the number and types of participants, (c) the measures or manipulations of variables, (d) the study’s setting, and (e) the planned statistical analyses. With respect to experimental design, there are three general options, i.e., non-experimental, quasi-experimental, and randomized-experimental (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish et al., 2002; Stone-Romero, 2010). Table 3.1 provides a summary of the properties of the design options. Note that experimental design is the major determinant of the degree to which a study’s results allow for valid causal inferences.

Before describing various designs, a word is in order about the notation used in the examples that are described below. In these examples, the symbol => is used to denote implies or signifies, and the research design symbols are as follows: (a) X => either an assumed independent variable or the manipulation (treatment) of a variable, (b) ~X => the absence of a treatment, (c) Y => an assumed or actual dependent variable, (d) R => random assignment to treatment conditions, (e) ~R => non-random assignment to such conditions, and (f) Oi => the operational definition of an assumed independent, mediator, or dependent variable. Note, moreover, that no distinction is made here between the operational definition of a construct and the construct itself.

Randomized-Experimental Designs

The simplest method for conducting research that allows for valid causal inferences about the relation between two variables (e.g., X and Y) is a randomized-experimental study in which (a) X is manipulated at two or more levels, (b) sampling units (e.g., individuals, groups, organizations) are assigned to experimental conditions on a random basis, and (c) the dependent variable is measured. If there is a sufficiently large number of sampling units, randomization serves to equate the experimental conditions, on average, on all measured and unmeasured variables (Oi) prior to the manipulation of the independent variable or variables. As such, randomization rules out such threats to internal validity as selection, history, maturation, regression, attrition, instrumentation, and the interactive effects of these threats (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish et al., 2002).

TABLE 3.1.  Attributes of Three General Types of Experimental Designs

Attribute                        Non-Experimental      Quasi-Experimental     Randomized-Experimental
Independent variable             Measured, assumed     Manipulated            Manipulated
Dependent variable               Measured, assumed     Measured, assumed      Measured
Control of confounds             Typically very low    Low to moderate        Very high
Validity of causal inferences    Typically very low    Moderately high        Very high
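The “very high” control of confounds in the randomized-experimental column reflects the balancing property described above. The sketch below (an illustrative addition, not part of the original chapter) shows that when a sufficiently large pool of units is randomly assigned to conditions, the conditions end up nearly equivalent on a pre-existing attribute, whether or not that attribute is ever measured.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# A pre-existing attribute of the units (e.g., cognitive ability) that the
# researcher may not even have measured.
ability = rng.normal(loc=100, scale=15, size=n)

# Random assignment of the n units to two conditions.
condition = rng.permutation(np.repeat(["treatment", "control"], n // 2))

print("Mean ability, treatment:", round(ability[condition == "treatment"].mean(), 1))
print("Mean ability, control:  ", round(ability[condition == "control"].mean(), 1))
# With large n the two means are nearly identical, so a posttest difference
# between conditions cannot plausibly be attributed to this attribute.
```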


In their Table 1, Campbell and Stanley detail the threats to internal validity that are controlled by randomized-experimental designs. Designs such as the Solomon Four-Group allow the researcher to rule out virtually all threats to internal validity. As a result, such designs should be used if causal inferences are important in empirical research.

A study using a randomized-experimental design can test for not only the main effects of independent variables (e.g., X1, X2, and X3), but also their interactive effects (X1 × X2, X1 × X3, X2 × X3, and X1 × X2 × X3). Moreover, it may consider their effects on multiple dependent variables (Y1, Y2, . . . Yj). For example, a study may assess the effects of a job design manipulation (e.g., autonomy) on such dependent variables as satisfaction, absenteeism, motivation, and turnover.

There are two general categories of randomized-experimental designs. They are single independent variable designs and multiple independent variable designs.

Single Independent Variable Designs. One of the simplest and most useful of the randomized-experimental designs (diagrammed below) is the Solomon Four-Group design. In it (a) research units are randomly assigned to one of four conditions, (b) units in conditions A and C receive the treatment while those in B and D serve as no-treatment controls, (c) a single independent variable (X) is manipulated, and (d) its effects on the dependent variable (O) are determined via statistical methods. Note that the dependent variable is measured before the treatment period in two of the conditions (O1A and O1B) and after it in all four conditions (O2A, O2B, O2C, and O2D). Diagrammatically:

R    O1A    X     O2A
R    O1B    ~X    O2B
R           X     O2C
R           ~X    O2D

The results of a study using this design provide a convincing basis for concluding that the independent variable caused changes in the dependent variable. That is, they allow for ruling out all threats to internal validity. Note, however, that the same results could not be used to support the conclusion that X is the only cause of changes in the dependent variable. Other randomized-experimental research may show that the dependent variable is also causally affected by a host of other manipulated independent variables.

Multiple Independent Variable Designs. Randomized-experimental designs also can be used in studies that examine causal relations between multiple independent variables and one or more dependent variables. A study of this type can consider both the main and interactive effects of two or more independent variables (e.g., X1, X2, and X1 × X2). Thus, for example, a 2 × 2 factorial study could test for the main and interactive effects of room temperature and relative humidity on workers’ self-reports of their comfort level.
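A hypothetical sketch of such a study appears below (the numbers and the data-generating model are illustrative assumptions, not results from any actual experiment). Participants are randomly assigned to the four cells of the temperature × humidity design, and the main and interactive effects are then estimated from the cell means.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# Randomly assign n participants to the four cells of a 2 x 2 design:
# temperature (0 = cool, 1 = warm) and humidity (0 = low, 1 = high).
cells = np.array([(t, h) for t in (0, 1) for h in (0, 1)])
assignment = cells[rng.integers(0, 4, size=n)]
temp, humid = assignment[:, 0], assignment[:, 1]

# Assumed data-generating model for self-reported comfort:
# main effects of both factors plus a temperature-by-humidity interaction.
comfort = (5 - 1.0 * temp - 0.5 * humid - 1.5 * temp * humid
           + rng.normal(scale=1.0, size=n))

# Cell means and simple contrasts estimating the main and interactive effects.
m = {(t, h): comfort[(temp == t) & (humid == h)].mean() for t in (0, 1) for h in (0, 1)}
main_temp = ((m[1, 0] + m[1, 1]) - (m[0, 0] + m[0, 1])) / 2
main_humid = ((m[0, 1] + m[1, 1]) - (m[0, 0] + m[1, 0])) / 2
interaction = (m[1, 1] - m[1, 0]) - (m[0, 1] - m[0, 0])
print(f"Temperature effect: {main_temp:.2f}, humidity effect: {main_humid:.2f}, "
      f"interaction: {interaction:.2f}")
```

In an actual application the effects would typically be tested with a factorial analysis of variance; the contrasts above are shown only to make the logic of main and interactive effects transparent.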


Quasi-Experimental Designs

Quasi-experimental designs have three attributes: First, one or more independent variables (e.g., X1, X2, and X3) are manipulated. Second, assumed dependent and control variables are measured (O1, O2, . . . Ok) before and after the manipulations. Third, the studied units are not randomly assigned to experimental conditions. The latter attribute results in a very important deficiency, i.e., an inability to argue that the studied units were equivalent to one another before the manipulation of the independent variable(s). Stated differently, at the outset of the study the units may have differed from one another on a host of measured and/or unmeasured confounding variables (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish et al., 2002). Thus, any observed differences in measures of the assumed dependent variable(s) may have been a function of one or more confounding variables. There are five basic types of quasi-experimental designs. Brief descriptions of them are provided below.

Single Group Designs With or Without a Control Group. In this type of design an independent variable is manipulated and the assumed dependent variable is measured at various points in time before and/or after the manipulation. A very simple case of this type of design is the one-group pretest-posttest design:

O1A    X    O2A

An example of this design is a study in which performance is measured before (O1A) and after (O2A) a job-related training program (X). A major weakness of this and similar designs is that pretest versus posttest changes in the measured variable may have been a function of a host of confounds, including history, maturation, and regression (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish et al., 2002). Thus, the design does not allow for valid inferences about the causal connection between training and performance.

Multiple Group Designs Without Pretest Measures. In this type of design units are not randomly assigned to conditions, two or more groups are assigned to treatment and no-treatment conditions, and the assumed dependent variable is measured after the manipulation of the independent variable:

~R    X     O2A
~R    ~X    O2B

Although this design is slightly better than the just-described single group design, it is still highly deficient with respect to the criterion of internal validity.


The principal reason for this is that posttest differences in the assumed dependent variable may have resulted from such confounds as pre-treatment differences on the same variable or a host of other confounds (e.g., local history).

Multiple Group Designs With Control Groups and Pretest Measures. In this type of design (a) units are assigned to one or more treatment and control conditions on a non-random basis, (b) the independent variable is manipulated in one or more such conditions, and (c) the assumed dependent variable is measured before and after the treatment period. One example of this type of design is:

~R    O1A    X     O2A
~R    O1B    ~X    O2B

For instance, in a study of the effects of participation in decision making on commitment: (a) the treatment is implemented in one division of a company (i.e., Group A) while the other division (Group B) serves as a no-treatment control condition, and (b) commitment is measured before and after the treatment period in both groups. The hoped-for outcome is that O1A = O1B and O2A > O2B. Unfortunately, this pattern of results would not justify the inference that the treatment produced the difference in commitment. Although the design is an improvement over one in which there are no pretest measures, it is still quite deficient in terms of the internal validity criterion. Even if the pretest measures revealed that the groups did not differ from one another at the pretest period, a large number of confounds may have been responsible for any observed posttest differences. For example, the group that received the treatment may also have experienced a pay increase.

Time Series Designs. The time series design involves the measurement of the assumed dependent variable on a number of occasions before and after the group experiences the treatment. A simple illustration of this type of design is:

O1A   O2A   O3A   · · ·   O25A   X   O26A   O27A   O28A   · · ·   O50A

For example, performance may be measured at multiple points in time before and after the implementation of a job training intervention. Although this design allows for the ruling out of some confounds (e.g., maturation), it does not permit the ruling out of others (e.g., history). As a result, the design is relatively weak in terms of the internal validity criterion.

Regression Discontinuity Designs. This design entails (a) the measurement of the variable of interest at a pretest period, (b) the use of pretest scores to assign units to treatment versus control conditions, (c) the separate regression of posttest scores (O2) on pretest scores (O1) for units in the conditions, and (d) the comparison of slope and/or intercept differences for the two groups. An example of this type of design is:


~R    O1A    X     O2A
~R    O1B    ~X    O2B
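A minimal sketch of steps (c) and (d) is shown below, under the simplifying assumptions of a hypothetical cutoff-based assignment rule and a linear pretest-posttest relation. It is intended only to make the regression-and-comparison step concrete, not to endorse the design.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
cutoff = 0.0

# Pretest scores determine assignment: units scoring below the cutoff
# receive the treatment; the rest serve as the comparison group.
pretest = rng.normal(size=n)
treated = pretest < cutoff

true_effect = 0.5  # assumed, for illustration only
posttest = 0.8 * pretest + true_effect * treated + rng.normal(scale=0.5, size=n)

# Separate regressions of posttest on pretest within each group, followed by a
# comparison of the fitted values (intercept difference) at the cutoff.
slope_t, intercept_t = np.polyfit(pretest[treated], posttest[treated], 1)
slope_c, intercept_c = np.polyfit(pretest[~treated], posttest[~treated], 1)
gap = (slope_t * cutoff + intercept_t) - (slope_c * cutoff + intercept_c)
print(f"Estimated discontinuity at the cutoff: {gap:.2f}")
```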

Unfortunately, this design does not allow for confident inferences about the effect of the treatment on the assumed dependent variable. There are numerous reasons for this, including differential levels of attrition among members of the two groups (Cook & Campbell, 1976, 1979; Shadish et al., 2002).

Summary. As noted above, quasi-experimental designs may allow a researcher to rule out some threats to internal validity. However, as is detailed by Campbell and Stanley (1963, Table 2), other threats cannot be ruled out by these designs. As a result, internal validity is often questionable in research using quasi-experimental designs. Stated differently, these designs are inferior to randomized-experimental designs in terms of supporting causal inferences.

Non-Experimental Designs

In a non-experimental study the researcher measures (i.e., observes) assumed independent, mediator, moderator, and dependent variables (O1, O2, O3, . . . Ok). One example of this type of study is research by Hackman and Oldham (1976). Its purpose was to test the job characteristics theory of job motivation. In it the researchers measured the assumed (a) independent variables of task variety, autonomy, and feedback, (b) mediator variables of experienced meaningfulness of the work and knowledge of the results of work activities, (c) moderator variable of higher order need strength, and (d) dependent variables of work motivation, performance, and satisfaction. They then used statistical analyses (e.g., zero-order correlation, multiple regression) to test for relations between the observed variables. Results of the study showed strong support for hypothesized relations between the measured variables. Nevertheless, because the study was non-experimental, any causal inferences stemming from it would rest on a very shaky foundation. Note, moreover, that it is of no consequence whatsoever that the analyses were predicated on a theory! Thus, for example, the study’s results were incapable of serving as a valid basis for (a) inferring that job characteristics were the causes of satisfaction or (b) ruling out the operation of a number of potential (observed and unobserved) confounding variables. More generally, and contrary to the arguments of many researchers, causal inferences are not strengthened by invoking a theory prior to the time a study is conducted. As noted above, for example, (a) some theorists argue that satisfaction causes performance, (b) others contend that performance causes satisfaction, and (c) still others assert that the relation is spurious. Non-experimental research is incapable of determining which, if any, of these assumed causal models is correct.
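The causal-direction problem can be illustrated with a small simulation (a hypothetical example, not drawn from the studies discussed above). Here the data are generated so that performance causes satisfaction, yet an analyst who merely measures both variables and regresses performance on satisfaction, as a non-experimental test of the opposite model, still obtains a substantial coefficient.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Data-generating truth (unknown to the analyst): performance causes satisfaction.
performance = rng.normal(size=n)
satisfaction = 0.5 * performance + rng.normal(size=n)

# The analyst assumes the reverse model and regresses performance on satisfaction.
cov = np.cov(satisfaction, performance)
slope = cov[0, 1] / cov[0, 0]
print(f"Slope for the assumed 'satisfaction -> performance' model: {slope:.2f}")  # about .40
# The observational data are equally consistent with the wrong causal story,
# which is why measurement plus regression cannot settle causal direction.
```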


OTHER DESIGN ISSUES

This section deals with two strategies that are frequently used for making causal inferences: (a) meta-analysis-based summaries of randomized-experimental studies and (b) tests of mediation models.

Cumulating the Results of Several Studies

There are often instances in which multiple studies using randomized-experimental designs have examined causal relations between and among variables of interest. In such cases their findings can be cumulated using meta-analytic methods. This may be done in cases where research has considered either (a) simple causal models (e.g., Figures 3.1a or 3.1b) or (b) models involving mediation (e.g., Figures 3.1d, 3.1e, or 3.1g).

Tests of Simple Causal Models With the Results of a Meta-Analysis

The results of multiple randomized-experimental studies may be cumulated using meta-analytic methods. For example, Hosoda, Stone-Romero, and Coats (2003) meta-analyzed the results of 37 randomized-experimental studies of relations between (a) physical attractiveness and (b) various job-related outcomes (e.g., hiring, predicted job success, promotion, and job suitability). Overall, results showed a .37 mean weighted effect size (d) for 62 studied relations. These results allow for confident causal inferences about the impact of attractiveness on the dependent variables. Cumulating effect sizes through meta-analysis also has one very important consequence: It provides evidence on the degree to which causal relations between or among variables generalize across such dimensions as (a) types of sampling units, (b) research contexts, and (c) time periods.
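For readers unfamiliar with the mechanics, the sketch below shows one common way of computing a weighted mean effect size across experiments, using inverse-variance weights and a standard large-sample approximation to the variance of d. The numbers are hypothetical and are not the values analyzed by Hosoda et al. (2003).

```python
import numpy as np

# Hypothetical effect sizes (d) and per-group sample sizes from five experiments.
d = np.array([0.45, 0.30, 0.52, 0.21, 0.38])
n1 = np.array([40, 60, 35, 80, 50])
n2 = np.array([40, 55, 35, 75, 50])

# Standard large-sample approximation to the sampling variance of d.
var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
w = 1.0 / var_d                    # inverse-variance weights

d_bar = np.sum(w * d) / np.sum(w)  # weighted mean effect size
se = np.sqrt(1.0 / np.sum(w))      # standard error of the weighted mean
print(f"Weighted mean d = {d_bar:.2f} (SE = {se:.2f})")
```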


Mediation Models

Research using randomized-experimental designs also may be used in tests of models involving mediation (e.g., Pirlott & MacKinnon, 2016; Rosopa & Stone-Romero, 2008; Stone-Romero & Rosopa, 2008, 2010, 2011). For example, a researcher may posit that O1 → O2 → O3. Here, as is illustrated in Figure 3.1g, the effect of O1 on O3 is transmitted through the mediator, O2. The simplest way of testing such a mediation model is to conduct two randomized experiments, one that tests for the effects of O1 on O2 and the other that tests for the effects of O2 on O3 (Rosopa & Stone-Romero, 2008; Stone-Romero & Rosopa, 2008, 2010, 2011). If the results show support for both such predictions, the mediating effect of O2 can be deduced through the use of symbolic logic (Kalish & Montague, 1964, Theorem 26).

Research by Eden, Stone-Romero, and Rothstein (2015) is an example of a meta-analysis-based mediation study. It used the results of meta-analyses of two relations: The first involved the causal relation between leader expectations (LE) and subordinate self-efficacy (SE); for it, the average correlation was .58. The second considered the causal relation between subordinate self-efficacy (SE) and subordinate performance (SP); for it, the average correlation was .35. When combined, these correlations along with formal reasoning deductions provided support for the assumed mediation model. The reasoning is ((LE → SE) ∧ (SE → SP)) → (LE → SP) (see Theorem 26 of Kalish & Montague, 1964).

Whereas the results of meta-analyses of experimental studies may be used to support causal inferences for either simple (e.g., O1 → O2) or complex relations (e.g., O1 → O2 → O3), they do not justify such inferences in cases where the meta-analyses are based upon relations derived from non-experimental studies (e.g., Judge et al., 2001; Riketta, 2008). Stated somewhat differently, meta-analytic methods cannot serve as a basis for valid causal inferences when they involve the accumulation of the findings of two or more non-experimental studies.

INVALID CAUSAL INFERENCES BASED ON STATISTICAL STRATEGIES

The literature in HRM and related fields is replete with studies in which invalid inferences about causal relations are based on the results of statistical analyses as opposed to the use of randomized-experimental designs. These analyses are of several types, including: (a) finding a zero-order correlation between two or more measured variables (e.g., O1, O2, and O3), (b) using hierarchical regression to show that a study’s results are consistent with an assumed causal model, and (c) analyzing data with so-called “causal modeling” methods (e.g., cross-lagged correlation, causal-correlational analysis, path analysis, and SEM). Regrettably, the statistical methods used in a non-experimental or quasi-experimental study do not provide a valid basis for causal inferences (Bollen, 1989; Rogosa, 1987; Rosopa & Stone-Romero, 2008; Shadish et al., 2002; Stone-Romero & Rosopa, 2004, 2008). Stated somewhat differently, statistical methods are not an appropriate substitute for sound experimental design!

A number of researchers (e.g., Baron & Kenny, 1986) have advocated the use of hierarchical multiple regression (HMR) as a basis for inferring mediating effects (e.g., O1 → O2 → O3) using data from non-experimental studies. They contend that such analyses provide a basis for causal inferences about the direct (O1 → O3) and indirect (O1 → O2 → O3) effects of the assumed independent variable (O1) on the assumed dependent variable (O3). Figures 3.1d, 3.1e, and 3.1g show three of several possible models involving mediation for the measured variables of O1, O2, and O3. As is explained below, the results of a “causal analysis” with HMR or any other statistical technique cannot provide valid evidence on the correctness of any of these models.
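To make the critique easier to follow, the sketch below reproduces the three regression steps of the hierarchical strategy on simulated data (a hypothetical illustration of the procedure, not an endorsement of it): the total effect of O1 on O3, the effect of O1 on O2, and the joint regression of O3 on O2 and O1.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients (intercept first) for predictors X and outcome y."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(5)
n = 5_000
o1 = rng.normal(size=n)                        # assumed independent variable
o2 = 0.5 * o1 + rng.normal(size=n)             # assumed mediator
o3 = 0.2 * o1 + 0.4 * o2 + rng.normal(size=n)  # assumed dependent variable

c = ols(o1, o3)[1]                                   # Step 1: O3 regressed on O1
a = ols(o1, o2)[1]                                   # Step 2: O2 regressed on O1
b, c_prime = ols(np.column_stack([o2, o1]), o3)[1:]  # Step 3: O3 on O2 and O1
print(f"c = {c:.2f}, a = {a:.2f}, b = {b:.2f}, c' = {c_prime:.2f}")
# A nonzero a and b with c' smaller than c is taken as evidence of mediation,
# but the same coefficient pattern can arise under quite different causal
# structures, which is the inferential problem examined below.
```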


Stone-Romero and Rosopa (2004) conducted an experimental statistical simulation study to determine the effects of four concurrent manipulations on the ability of the Baron and Kenny HMR technique to provide evidence of mediation. In the assumed causal model (a) the effect of Z1 on Z3 was mediated by Z2, and (b) there also was a direct effect of Z1 on Z3. The variables in the simulation were r12, r13, r23, and N, where (a) r12 = the correlation between the assumed independent variable, Z1, and the assumed mediator variable, Z2; (b) r13 = the correlation between the assumed independent variable, Z1, and the assumed dependent variable, Z3; (c) r23 = the correlation between the assumed mediator variable, Z2, and the assumed dependent variable, Z3; and (d) N = sample size. The manipulations of r12, r13, and r23 (values of .1 to .9 for each) and N (values of 68, 136, 204, 272, 340, 408, 1,000, 1,500, 2,000, 2,500, and 3,000) resulted in 651 data sets. They were analyzed using the HMR technique. Results of 8,463 HMR analyses showed that: (a) if the model tested was not the true model, there would be a large number of cases in which there would be support for partial or complete mediation and the researcher would make highly erroneous inferences about mediation; (b) if the model tested was the true model, there would only be slight support for inferences about complete mediation and modest support for inferences about partial mediation. Overall, the HMR procedure did very poorly in terms of providing evidence of mediation (see Stone-Romero & Rosopa, 2004, for detailed information on the findings). Thus, the HMR technique is unlikely to provide consistent evidence to support inferences about mediation.

As noted by Stone-Romero and Rosopa (2004), there are many problems with the HMR strategy for making inferences about causal connections between variables. First, it relies on the interpretation of the magnitudes of regression coefficients as a basis for determining effect size estimates. However, as Darlington (1968) demonstrated more than 50 years ago, when multiple correlated predictors are used in a regression analysis it is impossible to determine the proportion of variance that is explained uniquely by each of them. Second, when applied to data from non-experimental research there is almost always ambiguity about causal direction. Third, there is the issue of model specification. Although a researcher may test an assumed causal model, he or she cannot be certain that it is the correct model. Moreover, the invocation of a theory may be of little or no consequence because there may be many theories about the causal connection between variables (e.g., the relation between satisfaction and performance). Fourth, the results of non-experimental research do not provide a basis for making causal inferences. In contrast, the findings of randomized-experimental studies do. Fifth, and finally, in non-experimental research there is always the problem of unmeasured confounds. These are seldom considered in HMR analyses. Even if they are, if the measures of confounds lack construct validity their effects cannot be controlled fully by an HMR analysis.


Stone-Romero and Rosopa (2004) are not alone in questioning the ability of HMR to provide credible evidence about causal connections between (or among) measured variables. A number of other researchers have comparable views. For example, Mathieu and Taylor (2006) wrote that “Research design factors are paramount for reasonable mediational inferences to be drawn. If the causal order of variables is compromised, then it matters little how well the measures perform or the covariances are partitioned. Because no [data] analytic technique can discern the true causal order of variables, establishing the internal validity of a study is critical . . . [and] randomized field experiments afford the greatest control over such concerns” (p. 1050). They went on to state that randomized experiments “remain the ‘gold standard’ [in empirical research] and should be pursued whenever possible” (p. 1050). These and similar views stand in sharp contrast to the generally invalid arguments of several authors (e.g., Baron & Kenny, 1986; Blalock, 1964, 1971; James, 2008; James, Mulaik, & Brett, 2006; Kenny, 1979, 2008; Preacher & Hayes, 2004, 2008).

Unfortunately, unwarranted inferences about causality on the basis of so-called “causal modeling” methods are all too common in publications in HRM and allied fields. For example, on the basis of a meta-analysis of the satisfaction-performance relation, Judge, Thoresen, Bono, and Patton (2001) argued that causal modeling methods can shed light on causal relations between these variables, especially in cases where mediation is hypothesized. They wrote that “Though some research has indirectly supported mediating influences [on the satisfaction-performance relation], direct tests are lacking. Such causal studies are particularly appropriate in light of advances in causal modeling techniques in the past 20 years” (p. 390). Contrary to the views of Judge et al., causal modeling techniques cannot provide a valid basis for causal inferences.

Another example of invalid causal inferences comes from Riketta’s (2008) meta-analytic study of the relations between job attitudes (attitudes hereinafter) and performance. As noted above, he cumulated the findings of 16 non-experimental studies to compute average correlations between attitudes and performance. They were used in what he described as a meta-analytic regression analysis. On the basis of it he wrote that “because the present analysis is based on correlational rather than experimental data, it allows for only tentative causal conclusions and cannot rule out some alternative causal explanations (e.g., that third variables inflated the cross-lagged paths; see, e.g., Cherrington, Reitz, & Scott, 1971; Brown & Peterson, 1993). Although the present analysis accomplished a more rigorous test for causality than did previous meta-analyses in this domain, it still suffers from the usual weakness of correlational designs. Experiments are required to provide compelling evidence of causal relations” (p. 478). Whereas Riketta was correct in concluding that experiments are needed to test causal relations, he was incorrect in asserting that his study provided a more rigorous test of causality than previous meta-analytic studies.

Brown and Peterson (1993) conducted an SEM-based test of an assumed causal model of the antecedents and consequences of salesperson job satisfaction. On the basis of its results they concluded that “Another important finding of the causal analysis is evidence that job satisfaction primarily exerts a direct causal effect on organizational commitment rather than vice versa” (p. 73). Unfortunately, this and other causal inferences were unwarranted because the study’s data came from non-experimental studies.


It is interesting to consider the views of James et al. (2006) with respect to testing assumed mediation models. They argue that “if theoretical mediation models are thought of as causal models, then strategies designed specifically to test the fit of causal models to data, namely, confirmatory techniques such as structural equation modeling (SEM), should be employed to test mediation models” (p. 234). Moreover, they strongly recommend testing alternative causal models in addition to the mediation model of primary interest. As they note, “The objective is to contrast alternative models and identify those that appear to offer useful explanations versus those that do not” (p. 243). However, they go on to write that the results of SEM analyses “for both complete and partial mediation models do not imply that a given model is true even though the pattern of parameter estimates is consistent with the predictions of the model. There are always other equivalent models implying different causal directions or unmeasured common causes that would also be consistent with the data” (p. 238). Unfortunately, for the reasons noted above, testing primary or alternative models with SEM or any other so-called “causal modeling” method does not allow researchers to make valid causal inferences, because when applied to data from non-experimental studies these methods cannot serve as a valid basis for inferences about cause.

Some researchers seem to believe that the invocation of a theory combined with the findings of a “causal modeling” analysis (e.g., SEM) is the deus ex machina of non-experimental research. Nothing could be further from the truth. One reason for this is that the same set of observed correlations between or among a set of measured variables can be used to support a number of assumed causal models (e.g., Figures 3.1a to 3.1g). In the absence of research using randomized experimental designs it is impossible to determine which, if any, of the models is correct. Clearly, so-called “causal modeling” methods (e.g., path analysis, hierarchical regression, cross-lagged panel correlation, and SEM) are incapable of providing valid evidence on causal connections between and among measured variables (Cliff, 1983; Freedman, 1987; Games, 1990; Rogosa, 1987; Rosopa & Stone-Romero, 2008; Spencer, Zanna, & Fong, 2005; Stone-Romero, 2002, 2008, 2009, 2010; Stone-Romero & Gallaher, 2006; Stone-Romero & Rosopa, 2004, 2008, 2010, 2011). Therefore, researchers interested in making causal inferences should conduct studies using either randomized-experimental or quasi-experimental designs.
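The equivalent-models point can be demonstrated directly. In the sketch below (a hypothetical illustration, not one of the analyses cited above), data are generated from the chain O1 → O2 → O3; the resulting correlations satisfy r13 = r12 × r23, but the reversed chain O3 → O2 → O1 implies exactly the same constraint, so both models would reproduce the observed covariances equally well.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Generate data from the chain O1 -> O2 -> O3.
o1 = rng.normal(size=n)
o2 = 0.6 * o1 + rng.normal(size=n)
o3 = 0.6 * o2 + rng.normal(size=n)

r = np.corrcoef(np.vstack([o1, o2, o3]))
r12, r23, r13 = r[0, 1], r[1, 2], r[0, 2]

# Both O1 -> O2 -> O3 and O3 -> O2 -> O1 imply that O1 and O3 are unrelated
# once O2 is partialled out, i.e., r13 = r12 * r23.
partial_13_given_2 = (r13 - r12 * r23) / np.sqrt((1 - r12**2) * (1 - r23**2))
print(f"r13 = {r13:.3f}, r12 * r23 = {r12 * r23:.3f}")
print(f"Partial correlation of O1 and O3 given O2: {partial_13_given_2:.3f}")
```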


In recent years, a number of researchers have championed the use of data analytic strategies for supposedly improving causal inferences in research using non-experimental designs. Two examples of this are propensity score modeling (e.g., Rosenthal & Rosnow, 2008) and regression-based techniques for approximating counterfactuals (e.g., Morgan & Winship, 2014). On their face, these techniques may appear elegant and sophisticated. However, the results of these regression-based strategies do not provide a valid basis for causal inferences because the data used by them come from non-experimental research. Another very serious limitation of the propensity score strategy and similar strategies is that they provide statistical controls for only a limited set of control variables. This leaves a host of unmeasured variables uncontrolled. In short, statistical control strategies such as propensity score analyses are a very poor substitute for research using randomized experimental designs.

OBJECTIONS TO RANDOMIZED EXPERIMENTS

Some researchers (e.g., James, 2008; Kenny, 2008) have argued that research based on randomized-experimental designs is not feasible for various reasons, including (a) some independent variables cannot be manipulated, (b) the manipulation of others may not be ethical, and (c) organizations will not permit randomized experiments.

Non-Manipulable Variables

Clearly, some variables are incapable of being manipulated by researchers, including (a) the actual ages, sexes, genetic makeup, physical attributes, and cognitive abilities of research participants, (b) the laws of cities, counties, states, and countries, and (c) the environments in which research units operate. Nevertheless, through creative research design it may be possible to manipulate a number of such variables. For example, in a randomized-experimental study of helping behavior by Danzis and Stone-Romero (2009) the attractiveness of a confederate (who requested help from research subjects) was manipulated through a number of strategies (e.g., the clothing, jewelry, makeup, and hairstyle of confederates). Results of the study showed that attractiveness had an impact on helping behavior. Attractiveness also can be manipulated in a number of other ways. For example, in a number of simulated hiring studies using randomized-experimental designs the physical attractiveness of job applicants was manipulated via photos of the applicants (see Stone, Stone, & Dipboye, 1992, for details). In addition, a randomized-experimental study by Krueger, Stone, and Stone-Romero (2014) examined the effects of several factors, including applicant weight, on hiring decisions. In it, the weight of applicants was manipulated through the editing of photos of them using Photoshop software. Overall, what the above demonstrates quite clearly is that randomized experiments are possible that involve independent variables that some researchers believe to be difficult or impossible to manipulate.

In an article that critiqued the use of research using randomized-experimental designs, James (2008) wrote that “If we limited causal inference to randomized experiments where participants have to be randomly sampled [sic] into values of a causal variable, then we would no longer be able to draw causal inferences about smoking and lung cancer (to mention one of several maladies)” (p. 361). Clearly, this argument is of little or no consequence because many variables can be manipulated. For example, a large number of randomized-experimental studies have shown the causal connection between smoking and lung cancer using human or non-human research subjects (El-Bayoumy, Iatropolous, Amin, Hoffman, & Wynder, 1999; Salaspuro & Salaspuro, 2004).


And, at the cellular level, thousands of randomized-experimental studies have linked a wide variety of chemical compounds to cancer and other diseases. Moreover, a quasi-experimental study by Salaspuro and Salaspuro (2004) used smoking and non-smoking subjects. The smokers smoked one cigarette every 20 minutes while the non-smokers served as controls. The researchers compared their salivary acetaldehyde (a known carcinogen) levels every 20 minutes for a 160-minute period. Results showed that smokers had seven times higher acetaldehyde levels than non-smokers. Given that only the smokers smoked the cigarettes, there was virtually no ethical issue with the study. Moreover, even though the study used a quasi-experimental design, causal inferences were possible because virtually all threats to internal validity were controlled through its design.

It merits adding that I have strong ethical objections to research that uses non-human subjects with the aim of inducing cancer or other diseases. Fortunately, there are various ethical alternatives to such studies. For example, research can use “tissue-on-a-chip models or microphysiological systems [which are] a fusion of engineering and advanced biology. Silicon chips are lined with human cells that mimic the structure and function of human organs and organ systems. They are used for disease modeling, personalized medicine, and drug testing” (e.g., Fabre, Livingston, & Tagle, 2014; Physicians Committee for Responsible Medicine, 2018). The availability of these methods casts further doubt on the validity of James’ (2008) arguments because it clearly demonstrates how randomized experiments can be used in ethical research in medicine. These methods also may have applicability in HRM research. For example, they may be used in studies of the effects of environmental toxins or substances (e.g., chemicals, asbestos, and particulates) in work settings.

Second, the fact that randomized experiments are not always possible should not lead researchers to operate on the assumption that they should not be used in cases where they are possible (see Evan, 1971, for many examples). That would be tantamount to arguing that since antibiotics cannot cure cancer, arthritis, and a host of other diseases, they should not be used in the treatment of diseases such as bronchitis, syphilis, sinus infections, strep throat, urinary tract infections, pneumonia, eye infections, and ear infections.

The View That Some Manipulations Are Unethical

There are many instances in which it would be unethical to subject research participants to treatments that produce psychological or physical harm. In terms of the former, for example, a researcher may be interested in assessing the effects of feedback valence (positive versus negative) on task-based esteem or mood state. However, it would be unethical to provide research participants with false negative feedback about themselves or their work. In contrast, it would not be unethical to ask them to role play a hypothetical worker and ask how they would react to different types of feedback. For example, Stone and Stone (1985) used a 2 × 2 randomized-experimental design to study the effects of feedback favorability and feedback consistency on self-perceived task competence and perceived feedback accuracy.


Participants in the study were asked to role play a hypothetical worker’s reactions to the feedback. Results showed main and interactive effects of the manipulated variables. The fact that role playing methods were used averted the ethical problems that would have arisen if the participants had been provided with false negative feedback. The upshot of the foregoing is that randomized experiments are indeed possible for variables that would be difficult or unethical to manipulate.

Of course, role playing and other simulation studies are not devoid of problems. For example, construct validity may be threatened by research that uses the role playing strategy. More specifically, it seems quite likely that the strength of a variable manipulated through a role play (e.g., performance feedback) in an SP setting would be lower than that of the same type of variable in an NSP context (e.g., performance feedback provided by a supervisor in a work organization). Nevertheless, if relations between simulated feedback and measured outcomes are found in an SP setting, they are likely to be underestimates of the relations that would be found in an NSP context. Moreover, if the purpose of a study is to determine causal connections between variables, the results of a randomized experiment in an SP setting would certainly be more convincing than the results of a non-experimental study in an NSP setting.

The View That Experimental Research Is Not Possible in Organizational Settings

Another objection that has been raised by some (e.g., James, 2008; Kenny, 2008) is that organizations will not allow researchers to conduct randomized experiments. This argument is of questionable validity: Although randomized experiments may not be allowed by some organizations, they are indeed possible. Eden and his colleagues (Davidson & Eden, 2000; Dvir, Eden, Avolio, & Shamir, 2002; Eden, 1985, 2003, 2017; Eden & Aviram, 1993; Eden & Zuk, 1995), for example, have conducted a large number of randomized experiments in organizations. In one such study, Eden and Zuk (1995) used a randomized-experimental design to assess the effects of self-efficacy training on seasickness of naval cadets in the Israeli Defense Forces. Results of the study showed support for the study’s hypotheses. Other researchers would do well to benefit from the high degree of creativity shown by Eden and his colleagues in conducting randomized experiments in NSP settings.

Further evidence of the feasibility of experimentation in organizational settings is afforded by the chapters in a book titled Organizational Experiments: Laboratory and Field Research (Evan, 1971). Moreover, Shadish et al. (2002) describe the many situations that are conducive to the conduct of randomized experiments in NSP settings (e.g., work organizations). Among these are circumstances when (a) the demand for a treatment exceeds its supply, (b) treatments cannot be delivered to all units simultaneously, (c) units can be isolated from one another, (d) there is little or no communication between or among units, (e) assignment to treatments can be granted on the basis of breaking ties with regard to a selection variable, (f) units are indifferent to the type of treatment they receive, (g) units are separated from one another, and (h) the researcher can create an organization within which the research will be conducted.


Finally, even if randomized experiments are not possible in NSP settings (e.g., work organizations), they may be possible in SP settings (Stone-Romero, 2010), including organizations created for the specific purpose of experimental research (Evan, 1971; Shadish et al., 2002). Thus, contrary to the arguments of several analysts (e.g., James, 2008; Kenny, 2008), researchers should consider randomized-experimental research when their interest is testing assumed causal models. Of course, if the sole purpose of a study is to determine if observed variables are related to one another, non-experimental studies are appropriate.

CONCLUSIONS

In view of the above, several conclusions are offered.

First, causal inferences require sound experimental designs. Of the available options, randomized-experimental designs provide the strongest foundation for such inferences, quasi-experimental designs afford a weaker basis, and non-experimental designs offer the weakest. Thus, whenever possible, researchers interested in making causal inferences should use randomized-experimental designs.

Second, data analytic strategies are never an appropriate substitute for sound experimental design. It is inappropriate to advance causal inferences on the basis of such “causal modeling” strategies as HMR, path analysis, cross-lagged panel correlation, and SEM. Thus, researchers should refrain from doing so. There is nothing wrong with arguing that a study’s results are consistent with an assumed causal model, but consistency is not a valid basis for implying the correctness of that model. The reason is that the results may be consistent with many other models, and there is seldom a valid basis for choosing one model over the others.

Third, researchers should acknowledge the fact that causal inferences are inappropriate when a study’s data come from research using non-experimental or quasi-experimental designs. Thus, they should not advance such inferences (see also Wood, Goodman, Beckmann, & Cook, 2008). Rather, they should be circumspect in discussing the implications of the findings of their research.

Fourth, randomized experiments are possible in both SP and NSP settings, and they are the “gold standard” for conducting research aimed at testing assumed causal models. Thus, they should be the first choice for research aimed at testing such models. Moreover, there are numerous strategies for conducting such experiments in NSP settings (Evan, 1971; Shadish et al., 2002). The many studies by Eden and his colleagues are evidence of this.

Fifth, researchers should not assume that statistical controls for confounds (e.g., in regression models) are effective in ruling out confounds. There are two reasons for this. One is that the measures of known confounds may lack construct validity. The other is that the researcher may not be aware of all confounds that may influence observed relations between assumed causes and effects.


Thus, it typically proves impossible to control for confounds in non-experimental research.

Sixth, the editors of journals in HRM and related disciplines should ensure that authors of research-based articles refrain from advancing causal inferences when their studies are based on experimental designs that do not justify them. As noted above, there is nothing wrong with arguing that a study is based upon an assumed causal model. For example, an author may argue legitimately that a study’s purpose is to test a model that posits a causal connection between achievement motivation and job performance. In the study, both variables are measured. If a relation is found between these variables, it would be inappropriate to conclude that the results of the study provided a valid basis for inferring that achievement motivation was the cause of performance. As noted above, research using non-experimental or quasi-experimental designs cannot provide evidence of the correctness of an assumed causal model.

Seventh, sound research methods are vital to both (a) the development and testing of theoretical models and (b) the formulation of recommendations for practice. Thus, progress in both such pursuits is most likely to be made through research that uses randomized-experimental designs (Stone-Romero, 2008, 2009, 2010). With respect to theory testing, randomized experiments are the best research strategy for providing convincing evidence on causal connections between variables. With little exception, the research literature on various topics (e.g., the satisfaction-performance relation) shows quite clearly that non-experimental research has done virtually nothing to provide credible answers about the validity of extant theories. On the other hand, well-conceived experimental studies (e.g., Cherrington et al., 1971) provide clear evidence on causal linkages between variables. With regard to recommendations for practice, it is important to recognize that a large percentage of studies in HRM and allied disciplines have used non-experimental designs. Because of this, it seems likely that many HRM-related policies and practices are based upon research that lacks internal validity. Thus, research using randomized-experimental designs has the potential to greatly improve HRM-related policies and practices.

Eighth, the language associated with some statistical methods may serve as a basis for invalid inferences about causal connections between variables. One example of this is analysis of variance. In a study involving two manipulated variables (e.g., A and B), an ANOVA would allow for valid inferences about the main and interactive effects of these variables on a measured dependent variable. However, if an ANOVA were used to analyze data from a study in which the independent variables were measured rather than manipulated (e.g., age, ethnicity, sex), it would be inappropriate to argue that these so-called “independent variables” affected the assumed dependent variable. It deserves adding that the same arguments can be made about the language associated with other statistical methods (e.g., multiple regression and SEM).


Ninth and finally, although this chapter’s focus was on HRM research, the just-noted conclusions have far broader implications. More specifically, they apply to virtually all disciplines in which the results of empirical research are used to advance causal inferences about the correctness of assumed causal models.

REFERENCES

Blalock, H. M. (1964). Causal inferences in nonexperimental research. New York, NY: W. W. Norton.
Bateman, T. S., & Strasser, S. (1984). A longitudinal analysis of the antecedents of organizational commitment. Academy of Management Journal, 27, 95–112.
Blalock, H. M. (1971). Causal models in the social sciences. Chicago, IL: Aldine.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
Bouchard, T. (1976). Field research methods: Interviewing, questionnaires, participant observation, systematic observation, and unobtrusive measures. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 363–413). Chicago, IL: Rand McNally.
Brown, S. P., & Peterson, R. A. (1993). Antecedents and consequences of salesperson job satisfaction: Meta-analysis and assessment of causal effects. Journal of Marketing Research, 30, 63–77.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.
Campbell, J. P. (1986). Labs, fields, and straw issues. In E. A. Locke (Ed.), Generalizing from laboratory to field settings: Research findings from industrial-organizational psychology, organizational behavior, and human resource management (pp. 269–279). Lexington, MA: Lexington Books.
Cherrington, D. J., Reitz, H. J., & Scott, W. E. (1971). Effects of contingent and noncontingent reward on the relationship between satisfaction and task performance. Journal of Applied Psychology, 55, 531–536.
Cook, T. D., & Campbell, D. T. (1976). The design and conduct of quasi-experiments and true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 223–326). Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Danzis, D., & Stone-Romero, E. F. (2009). Effects of helper sex, recipient attractiveness, and recipient femininity on helping behavior in organizations. Journal of Managerial Psychology, 24, 722–737.
Darlington, R. B. (1968). Multiple regression in psychological research and practice. Psychological Bulletin, 69, 161–182.
Davidson, O. B., & Eden, D. (2000). Remedial self-fulfilling prophecy: Two field experiments to prevent Golem effects among disadvantaged women. Journal of Applied Psychology, 85, 386–398.

Dvir, T., Eden, D., Avolio, B. J., & Shamir, B. (2002). Impact of leadership development on follower development and performance: A field experiment. Academy of Management Journal, 45, 735–744.
Eden, D. (1985). Team development: A true field experiment at three levels of rigor. Journal of Applied Psychology, 70, 94–100.
Eden, D. (2003). Self-fulfilling prophecies in organizations. In J. Greenberg (Ed.), Organizational behavior (2nd ed., pp. 91–122). Mahwah, NJ: Erlbaum.
Eden, D. (2017). Field experimentation in organizations. Annual Review of Organizational Psychology and Organizational Behavior, 4, 91–122.
Eden, D., & Aviram, A. (1993). Self-efficacy training to speed reemployment: Helping people to help themselves. Journal of Applied Psychology, 78, 352–360.
Eden, D., Stone-Romero, E. F., & Rothstein, H. R. (2015). Synthesizing results of multiple randomized experiments to establish causality in mediation testing. Human Resource Management Review, 25, 342–351.
Eden, D., & Zuk, Y. (1995). Seasickness as a self-fulfilling prophecy: Raising self-efficacy to boost performance at sea. Journal of Applied Psychology, 80, 628–635.
El-Bayoumy, K., Iatropolous, M., Amin, S., Hoffman, D., & Wynder, E. L. (1999). Increased expression of cyclooxygenase-2 in rat lung tumors induced by tobacco-specific nitrosamine-4-(3-pyridyl)-1-butanone: The impact of a high fat diet. Cancer Research, 59, 1400–1403.
Fabre, K. M., Livingston, C., & Tagle, D. A. (2014). Organs-on-chips (microphysiological systems): Tools to expedite efficacy and toxicity testing in human tissue. Experimental Biology and Medicine, 239, 1073–1077.
Evan, W. M. (1971). Organizational experiments: Laboratory and field research. New York, NY: Harper & Row.
Farkas, A. J., & Tetrick, L. E. (1989). A three-wave longitudinal analysis of the causal ordering of satisfaction and commitment on turnover decisions. Journal of Applied Psychology, 74, 855–868.
Freedman, D. A. (1987). As others see us: A case study in path analysis. Journal of Educational Statistics, 12, 101–128.
Fromkin, H. L., & Streufert, S. (1976). Laboratory experimentation. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 415–465). Chicago, IL: Rand McNally.
Games, P. A. (1990). Correlation and causation: A logical snafu. Journal of Experimental Education, 58, 239–246.
Hackman, J. R., & Oldham, G. R. (1976). Motivation through the design of work: Test of a theory. Organizational Behavior and Human Performance, 16, 250–279.
Hosoda, M., Stone-Romero, E. F., & Coats, G. (2003). The effects of physical attractiveness on job-related outcomes: A meta-analysis of experimental studies. Personnel Psychology, 56, 431–462.
James, L. R. (2008). On the path to mediation. Organizational Research Methods, 11, 359–363.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational Research Methods, 9, 233–244.
Judge, T. A., Locke, E. A., Durham, C. C., & Kluger, A. N. (1998). Dispositional effects on job and life satisfaction: The role of core evaluations. Journal of Applied Psychology, 83, 17–34.

Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction-job performance relationship: A qualitative and quantitative review. Psychological Bulletin, 127, 376–407.
Kalish, D., & Montague, R. (1964). Logic: Techniques of formal reasoning. New York, NY: Harcourt, Brace, & World.
Kenny, D. A. (1979). Correlation and causality. New York, NY: Wiley.
Kenny, D. A. (2008). Reflections on mediation. Organizational Research Methods, 11, 353–358.
Koslowsky, M. (1991). A longitudinal analysis of job satisfaction, commitment, and intention to leave. Applied Psychology: An International Review, 40, 405–415.
Krueger, D. C., Stone, D. L., & Stone-Romero, E. F. (2014). Applicant, rater, and job factors related to weight-based bias. Journal of Managerial Psychology, 29, 164–186.
Lance, C. E. (1991). Evaluation of a structural model relating job satisfaction, organizational commitment, and precursors to voluntary turnover. Multivariate Behavioral Research, 26, 137–162.
Locke, E. A. (1986). Generalizing from laboratory to field settings: Research findings from industrial-organizational psychology, organizational behavior, and human resource management. Lexington, MA: Lexington Books.
Mathieu, J. E., & Taylor, S. R. (2006). Clarifying conditions and decision points for mediational type inferences in organizational behavior. Journal of Organizational Behavior, 27, 1031–1056.
Morgan, S. L., & Winship, C. (2014). Counterfactuals and causal inference. New York, NY: Cambridge University Press.
Noe, R. A. (2017). Employee training and development (7th ed.). Burr Ridge, IL: McGraw-Hill/Irwin.
Physicians Committee for Responsible Medicine. (2018). Retrieved 14 December 2018 from: https://www.pcrm.org/ethical-science/animal-testing-and-alternatives/humanrelevant-alternatives-to-animal-tests
Pirlott, A. G., & MacKinnon, D. P. (2016). Design approaches to experimental mediation. Journal of Experimental Social Psychology, 66, 29–38.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36, 717–731.
Preacher, K. J., & Hayes, A. F. (2008). Contemporary approaches to assessing mediation in communication research. In A. F. Hayes, M. D. Slater, & L. B. Snyder (Eds.), The SAGE sourcebook of advanced data analysis methods for communication research (pp. 13–54). Thousand Oaks, CA: Sage.
Riketta, M. (2008). The causal relation between job attitudes and performance: A meta-analysis of panel studies. Journal of Applied Psychology, 93, 472–481.
Rogosa, D. (1987). Causal models do not support scientific conclusions: A comment in support of Freedman. Journal of Educational Statistics, 12, 185–195.
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and data analysis (3rd ed.). New York, NY: McGraw-Hill.
Rosopa, P. J., & Stone-Romero, E. F. (2008). Problems with detecting assumed mediation using the hierarchical multiple regression strategy. Human Resource Management Review, 18, 294–310.

Salaspuro, V., & Salaspuro, M. (2004). Synergistic effect of alcohol drinking and smoking on in vivo acetaldehyde concentration in saliva. International Journal of Cancer, 111, 480–483.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Spencer, S. J., Zanna, M. P., & Fong, G. T. (2005). Establishing a causal chain: Why experiments are often more effective than mediation analyses in examining psychological processes. Journal of Personality and Social Psychology, 89, 845–851.
Stone, D. L., & Stone, E. F. (1985). The effects of feedback consistency and feedback favorability on self-perceived task competence and perceived feedback accuracy. Organizational Behavior and Human Decision Processes, 36, 167–185.
Stone, E. F., Stone, D. L., & Dipboye, R. L. (1992). Stigmas in organizations: Race, handicaps, and physical attractiveness. In K. Kelley (Ed.), Issues, theory, and research in industrial/organizational psychology (pp. 385–457). Amsterdam, Netherlands: Elsevier Science Publishers.
Stone-Romero, E. F. (2002). The relative validity and usefulness of various empirical research designs. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 77–98). Malden, MA: Blackwell.
Stone-Romero, E. F. (2008). Strategies for improving the validity and utility of research in human resource management and allied disciplines. Human Resource Management Review, 18, 205–209.
Stone-Romero, E. F. (2009). Implications of research design options for the validity of inferences derived from organizational research. In D. Buchanan & A. Bryman (Eds.), Handbook of organizational research methods (pp. 302–327). London, UK: Sage.
Stone-Romero, E. F. (2010). Research strategies in industrial and organizational psychology: Nonexperimental, quasi-experimental, and randomized experimental research in special purpose and nonspecial purpose settings. In S. Zedeck (Ed.), Handbook of industrial and organizational psychology (pp. 35–70). Washington, DC: American Psychological Association Press.
Stone-Romero, E. F., & Gallaher, L. (2006, May). Inappropriate use of causal language in reports of non-experimental research. Paper presented at the meeting of the Society for Industrial and Organizational Psychology, Dallas, TX.
Stone-Romero, E. F., & Rosopa, P. J. (2004). Inference problems with hierarchical multiple regression-based tests of mediating effects. Research in Personnel and Human Resources Management, 23, 249–290.
Stone-Romero, E. F., & Rosopa, P. J. (2008). The relative validity of inferences about mediation as a function of research design characteristics. Organizational Research Methods, 11, 326–352.
Stone-Romero, E. F., & Rosopa, P. (2010). Research design options for testing mediation models and their implications for facets of validity. Journal of Managerial Psychology, 25, 697–712.
Stone-Romero, E. F., & Rosopa, P. (2011). Experimental tests of mediation models: Prospects, problems, and some solutions. Organizational Research Methods, 14, 631–646.
Wanous, J. P. (1974). A causal-correlational analysis of the job satisfaction and performance relationship. Journal of Applied Psychology, 59, 139–144.

Wiener, Y., & Vardi, Y. (1980). Relationships between job, organization, and career commitments and work outcomes: An integrative approach. Organizational Behavior and Human Performance, 26, 81–96.
Williams, L. J., & Hazer, J. T. (1986). Antecedents and consequences of satisfaction and commitment in turnover models: A reanalysis using latent variable structural equation methods. Journal of Applied Psychology, 71, 219–231.
Wood, R. E., Goodman, J. S., Beckmann, N., & Cook, A. (2008). Mediation testing in management research: A review and proposals. Organizational Research Methods, 11, 270–295.

CHAPTER 4

HETEROSCEDASTICITY IN ORGANIZATIONAL RESEARCH
Amber N. Schroeder, Patrick J. Rosopa, Julia H. Whitaker, Ian N. Fairbanks, and Phoebe Xoxakos

Variance plays an important role in theory and research in human resource management and related fields. Variance refers to the dispersion of scores or residuals around a mean or, more generally, a predicted value (Salkind, 2007, 2010). In the general linear model, the mean square error provides an estimate of the population error variance (Fox, 2016); as the mean square error decreases, in general, the variability of the population errors also decreases. In general linear models, it is assumed that the population error variance is constant across cases (i.e., observations in a sample). This assumption is known as homoscedasticity, or homogeneity of variance (Fox, 2016; King, Rosopa, & Minium, 2018; Rencher, 2000). When the homoscedasticity assumption is violated, this is referred to as heteroscedasticity, or heterogeneity of variance (Fox, 2016; Rosopa, Schaffer, & Schroeder, 2013). When heteroscedasticity is present in the general linear model, standard errors are incorrect, which can lead to biased Type I error rates and reduced statistical power (Box, 1954; DeShon & Alexander, 1996; White, 1980; Wilcox, 1997). This can threaten the statistical conclusion validity of a study (Shadish, Cook, & Campbell, 2002). Notably, heteroscedasticity has been found in a variety of organizational and psychological research contexts (Aguinis & Pierce, 1998; Antonakis & Dietz, 2011; Ostroff & Fulmer, 2014), thereby



prompting research regarding best practices for detecting changes in residual variance and mitigating its negative effects (Rosopa et al., 2013). In the present paper, we discuss how change in residual variance (i.e., heteroscedasticity) can be more than a violated statistical assumption. In some instances, heteroscedasticity can be of substantive theoretical importance. For instance, Bryk and Raudenbush (1988) proposed that heteroscedasticity may be an indicator of unmeasured individual difference moderators in studies where treatment effects are measured. Thus, the focus of this paper is twofold. First, we highlight five areas germane to human resource management and related fields in which changes in variance provide a theoretical and/or empirical contribution to research and practice. Namely, we describe how the examination of heteroscedasticity can contribute to the understanding of organizational phenomena across five research topics: (a) stress interventions, (b) aging and individual differences, (c) skill acquisition and training, (d) groups and teams, and (e) organizational climate. Second, we describe several data analytic approaches that can be used to detect heteroscedasticity. These approaches are discussed in the context of various statistical analyses that are commonly used in human resource management and related fields. We consider (a) testing for the equality of two independent means, (b) analysis of variance, (c) analysis of covariance, and (d) multiple linear regression.
SUBSTANTIVE HETEROSCEDASTICITY IN ORGANIZATIONAL RESEARCH
Even though error variance equality is an assumption of the general linear model, in some instances, heteroscedasticity may be more than a violated assumption; rather, it could be theoretically important. In the following sections, we provide examples of substantively meaningful heteroscedasticity in organizational research.
Stress Intervention
Stress management is a topic of interest in several psychological specialties, including organizational and occupational health psychology. For organizations, stress can result in decreased job performance (Gilboa, Shirom, Fried, & Cooper, 2008), increased absenteeism (Darr & Johns, 2008), turnover (Podsakoff, LePine, & LePine, 2007), and adverse physical and mental health outcomes (Schaufeli & Enzmann, 1998; Zhang, Zhang, Ng, & Lam, 2019). Thus, stress management interventions are often implemented by organizations with the objective of reducing stressors in the workplace (Jackson, 1983), teaching employees to better manage stressors, or reducing the negative outcomes associated with stressors (Ivancevich, Matteson, Freedman, & Phillips, 1990). Although several different stress interventions exist (e.g., cognitive-behavioral approaches, relaxation approaches, multimodal approaches; Richardson & Rothstein, 2008; van der Klink,


Blonk, Schene, & van Dijk, 2001), stress interventions have one common goal: to reduce stress and its negative consequences. Stress intervention research often examines the reduction in strain or negative health outcomes of those in a treatment group compared to those in a control group (Richardson & Rothstein, 2008; van der Klink et al., 2001). However, successful stress interventions may also result in less variability in stress-related outcomes for those in the treatment group compared to those in the control group, as has been demonstrated (although not explicitly predicted) in several studies (e.g., Bond & Bunce, 2001; Galantino, Baime, Maguire, Szapary, & Farrar, 2005; Jackson, 1983; Yung, Fung, Chan, & Lau, 2004). Thus, individual-level stress interventions (DeFrank & Cooper, 1987; Giga, Noblet, Faragher, & Cooper, 2003) may result in a reduction in the variability of reported strain (e.g., by reducing individual differences in perceiving stressors, coping with stress, or recovering from strain; LaMontagne, Keegel, Louie, Ostry, & Landsbergis, 2007), thereby contributing to heterogeneity of variance when comparing those who underwent the intervention to those who did not. This is consistent with the finding that treatments can interact with individual difference variables to contribute to differences in variability in outcomes (see e.g., Bryk & Raudenbush, 1988). Thus, heteroscedasticity could be the natural byproduct of an effective stress intervention, which provides an illustration of a circumstance in which heteroscedasticity may be expected when testing for the equality of two independent means. Figure 4.1 provides an example of two independent groups where the means differ between a control group and an experimental group that underwent an intervention designed to reduce strain. Notably, the variance is smaller for those who received the intervention compared to those in the control group (a brief simulation of this pattern is sketched after the figure).

FIGURE 4.1.  Plot of means for two independent groups (n = 100 in each group) with 95% confidence intervals, suggesting that the variability in the Intervention group is much smaller than the variability in the Control group.
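To make the pattern in Figure 4.1 concrete, the following minimal sketch generates strain scores for a hypothetical control group and a hypothetical intervention group in which the intervention lowers both the mean and the spread of strain. The values, group sizes, and distributional assumptions are invented for illustration; this is not the data underlying the figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical strain scores: the intervention lowers the mean *and* shrinks
# the spread, mirroring the pattern suggested by Figure 4.1.
control = rng.normal(loc=3.5, scale=1.0, size=100)
intervention = rng.normal(loc=2.8, scale=0.4, size=100)

for label, scores in [("Control", control), ("Intervention", intervention)]:
    print(f"{label:12s} mean = {scores.mean():.2f}  SD = {scores.std(ddof=1):.2f}")
```

In this kind of data, a difference in standard deviations across the two groups is itself informative about the intervention, over and above the mean difference.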

As highlighted by transactional stress theory (Lazarus & Folkman, 1984), individual perceptions play an important role in the stress response process. For example, Webster, Beehr, and Love (2011) found that the same work demand (e.g., workload, hours of work, job requirements) can be perceived as challenging by one employee and encumbering by another. Therefore, individual differences have the potential to produce heteroscedasticity in stress intervention effectiveness. More specifically, individual factors such as one's self-regulatory orientation (i.e., promotion- or prevention-focused; Byron, Peterson, Zhang, & LePine, 2018), self-efficacy (Panatik, O'Driscoll, & Anderson, 2011), and perceptions of others' expectations for perfectionism (Childs & Stoeber, 2012) have been linked to stress appraisal and subsequent stress outcomes. As such, stress interventions not specifically addressing relevant individual differences (e.g., variability in appraisals of efficacy-related stress) may be differentially effective across individuals, thereby resulting in heteroscedasticity in stress outcomes. In other words, within a treatment condition, post-intervention stress outcome variability may be impacted by individual difference variables (i.e., the intervention may be more effective for specific individuals, thereby resulting in a greater reduction in stress outcome variability for this subset of employees). As such, differences in variability may be an important index for measuring the effectiveness of a stress intervention.
Aging and Individual Differences
A second area in which heteroscedasticity may make a substantive contribution is aging research. Research on aging often utilizes simple linear regression to examine relations between age and various outcomes to determine whether abilities such as memory or reaction time decline as individuals age (i.e., a negative regression slope when predicting memory or a positive regression slope when predicting reaction time). For example, as age increases, visual acuity (Spirduso, Francis, & MacRae, 2005), fluid intelligence (Morse, 1993), decision-making ability (Boyle et al., 2012), and episodic memory (Backman, Small, & Wahlin, 2001) tend to decline. Heteroscedasticity appears to exist in this area (Baltes & Baltes, 1990), as research shows that for many tasks, older adults tend to have larger variations in performance than do younger adults (Spirduso, Francis, & MacRae, 2005). However, these declines may be more pronounced for certain individuals due to a variety of individual differences. Namely, Christensen et al. (1999) found that physical strength, depression, illness, gender, and education level explained variation in cognitive functioning among older adults. A number of other explanations for age-related changes in variability have been proposed, including increased opportunities for gene expression (Plomin & Thompson, 1986), environmental changes and life experiences, discrepancies in the rate of change in various biological systems (Spirduso et al., 2005), or the


greater prevalence of health-related problems in older adults (Backman et al., 2001; Baltes & Baltes, 1990). Each of these factors can lead to increased individual differences in older adults, which has important implications for organizational (Moen, Kojola, & Schaefers, 2017; Ng & Feldman, 2008; Taylor & Bisson, 2019) and aging (Colcombe, Kramer, Erickson, & Scalf, 2005; Froehlich, Beausaert, & Segers, 2016; Kotter-Grühn, Kornadt, & Stephan, 2016) research. Namely, when researchers examine how outcomes change as a function of age using a simple linear regression model, it may be appropriate to test for heteroscedasticity and model it accordingly (Rosopa et al., 2013). Figure 4.2 depicts a scatterplot of memory and age, along with the fitted line from the ordinary least squares regression of memory on age. In addition to a statistically significant negative slope, it should be evident that the residual variance changes as a function of age. Specifically, residual variance increases as age increases: when age is equal to 30, the residual variance is smaller than when age is equal to 50 or 70 (a brief simulation of this pattern is sketched after the figure).

FIGURE 4.2.  Simple linear regression predicting memory with age, suggesting that residual variance increases as age increases.
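The pattern in Figure 4.2 can likewise be illustrated with simulated data. The sketch below fits an OLS regression of a memory score on age and shows the residual spread widening across age bands; the variable names, sample size, and error structure are assumptions made for illustration, not the data underlying the figure.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical data in the spirit of Figure 4.2: memory declines with age and
# the error standard deviation grows in proportion to age.
n = 300
age = rng.uniform(20, 80, n)
memory = 60 - 0.3 * age + rng.normal(0, 0.05 * age, n)

fit = sm.OLS(memory, sm.add_constant(age)).fit()
resid = fit.resid

# Residual spread within coarse age bands makes the fanning-out visible.
for lo, hi in [(20, 40), (40, 60), (60, 80)]:
    band = resid[(age >= lo) & (age < hi)]
    print(f"age {lo}-{hi}: residual SD = {band.std(ddof=1):.2f}")
```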

Socioemotional selectivity theory (SST; Carstensen, 1995) suggests an age-related shift in social relationship motivation as a result of coping efforts related to declines in physical and cognitive abilities. Specifically, SST proposes that because older adults view their remaining time in life as shorter than younger adults do, they are motivated to gain emotional satisfaction from social relationships, as opposed to focusing on acquiring resources in their social interactions (which is more common in younger adults). However, previous work has demonstrated a discrepancy between chronological age and felt age, particularly among older adults, such that older adults often report feeling younger than their chronological age (Barak, 2009). Thus, it is possible that adults with the same chronological age may have varying perceptions of felt age, which could impact their strategies for seeking and maintaining social relationships. As such, for chronologically older adults, those with a lower felt age may react similarly to younger adults (i.e., by engaging in social relationships for instrumental purposes), whereas those with a higher felt age may respond more in line with SST (i.e., by focusing on emotional connectivity in interpersonal relationships). As such, there would be greater heteroscedasticity in motives for social interaction for older adults compared to younger adults due to greater variability in perceptions of time remaining in life (Carstensen, Isaacowitz, & Charles, 1999). Therefore, an examination of variance dispersion as a function of both chronological and felt age may provide an important theoretical contribution.
Skill Acquisition and Training
Research on skill acquisition and training is yet another area in which theory-consistent heteroscedasticity may be found. In general, training leads to an increase in mean performance across individuals such that both low- and high-performing individuals tend to demonstrate increased performance as a result of training (Ackerman, 2007). However, training may also impact variability in performance. A seminal theory in the training literature, Ackerman's (1987) theory of automatic and controlled processing, provides a framework by which individuals process information and develop skills. Through practice, tasks that require recurring skills and procedures may become automatic (i.e., they can be completed quickly and effortlessly with little to no thought). These tasks should have a performance asymptote (Campbell, McCloy, Oppler, & Sager, 1993) such that additional training beyond the performance plateau does not increase performance. Examples include driving, typing, and reading. Conversely, Ackerman (1987) defined controlled information processing as a much slower and more effortful form of information processing. Tasks that are more inconsistent (i.e., they require the use of multiple skills and/or problem-solving abilities) may require controlled information processing and conscious thought to complete. Even with training, these tasks require significant attention and effort (i.e., continued controlled information processing). Ackerman and colleagues (Ackerman, 1987; Ackerman & Cianciolo, 2000; Kanfer & Ackerman, 1989) conducted a series of studies demonstrating that for tasks requiring automatic processing, increased training leads to a decrease in performance variability among individuals, whereas for controlled tasks, increased training leads to an increase in individual performance variability. Thus, changes in the variability of performance are a function of the characteristics of the skill being trained. Specifically, because task automaticity decreases the impact of individual differences (e.g., intelligence, attention span, or working memory) on performance, the variability in performance may decrease among individuals completing tasks inducing automatic processing. On the other hand, for controlled


processing tasks, variability in performance across individuals may remain constant regardless of time spent in training (Ackerman, 1987), or increase in situations in which a lack of problem-solving skills causes some individuals to fall further behind in performance, in comparison to other trainees (Ackerman, 2007). Consistent with this supposition, across two air traffic controller tasks, Ackerman and Cianciolo (2000) demonstrated decreased performance variability for a task requiring automatic processing and increased performance variability on a controlled processing task. In sum, Ackerman (1987) theorized that task characteristics determine the role of individual differences in task performance. If a task is consistent and becomes automatic through training, individual differences will be less predictive of performance; for this reason, variability in performance decreases with increased training. If a task is inconsistent and requires controlled processing even after extensive training, individual differences will remain predictive of performance; for this reason, variability may remain fairly constant or even increase with additional training. It is important to note that we focus on changes in the variability of an outcome (i.e., performance) across groups while another predictor changes (i.e., amount of training). Although there may be positive slopes (i.e., a mean increase in performance with more training), the residuals around the predicted regression surface may not remain constant, but instead decrease (for automatic processing tasks) or increase (for controlled processing tasks) as a function of the amount of training. Thus, in this dual processing theory, heteroscedasticity may be implicit.
Groups and Teams
Another substantive domain where variability may have important implications is research on groups and teams. A variety of measurement techniques have been used to describe team composition variables, including examining means, minimum or maximum values, and team member variability (e.g., standard deviations) for variables of interest (Barrick, Stewart, Neubert, & Mount, 1998; Bell, 2007). Related to heteroscedasticity, studies examining team member diversity (i.e., score variability across individual team members) have positively linked variability in team member extraversion and emotional stability to team job performance (Neuman, Wagner, & Christiansen, 1999), and have linked dispersion in team member work values to individual team member performance (Chou, Wang, Wang, Huang, & Cheng, 2008). In addition, Barrick et al. (1998) found that variability in team member conscientiousness was inversely related to team performance, and De Jong, Dirks, and Gillespie (2016) demonstrated that team differentiation in terms of specialized skills and decision-making power moderated the positive relation between intrateam trust and team performance, such that stronger relations emerged for teams with greater team member differentiation. Thus, in each of these cases, a consideration of heteroscedasticity related to various team composition factors explained additional variance in relations of interest (a brief sketch of how such dispersion-based composition variables can be computed appears below).
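As one illustration of the dispersion approach discussed above, the sketch below computes the within-team standard deviation of a hypothetical trait score as a team-level composition variable. The data frame, variable names, and values are invented for illustration and are not drawn from the studies cited.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical member-level data: 40 teams x 5 members, each with a
# conscientiousness score.
members = pd.DataFrame({
    "team_id": np.repeat(np.arange(40), 5),
    "conscientiousness": rng.normal(3.5, 0.6, 200),
})

# Dispersion-based composition variables: the within-team standard deviation
# (and mean) of the trait, one row per team.
composition = (members.groupby("team_id")["conscientiousness"]
               .agg(team_mean="mean", team_sd="std")
               .reset_index())
print(composition.head())

# composition can then be merged with team-level outcomes, and team_sd entered
# as a predictor (or moderator) in subsequent models.
```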


Taking this a step further, Horwitz and Horwitz (2007) conducted a meta-analysis to examine how various types of team diversity impact team outcomes. Their findings indicated that task-related diversity (i.e., variability in attributes relevant to task completion, such as expertise) was positively related to team performance quality and quantity, whereas demographic diversity (i.e., dispersion in observable individual category memberships, such as in age and race/ethnicity subgroups) was unrelated to team performance. Notably, however, later work suggested that demographic diversity may in some cases be negatively related to group performance when subjective (but not objective) performance metrics are employed (van Dijk, van Engen, & van Knippenberg, 2012). Further, temporal examinations of team diversity suggested that demographic diversity within teams may become advantageous over time due to team members' shifting focus from surface-level attributes (i.e., demographics) to more task-relevant individual characteristics (Harrison, Price, Gavin, & Florey, 2002). Additionally, in an examination of the impact of group cohesion on decision-making quality as a function of groupthink (i.e., "a mode of thinking that people engage in when they are deeply involved in a cohesive ingroup, when the members' strivings for unanimity override their motivation to realistically appraise alternative courses of action"; Janis, 1972, p. 9), Mullen, Anthony, Salas, and Driskell (1994) demonstrated that team decision-making quality was positively related to group homogeneity in task commitment and inversely related to interpersonal attraction-related cohesion. Taken together, research on organizational groups and teams has benefited from an examination of the impact of heteroscedasticity in team composition. Thus, we encourage future work to continue to explore how heterogeneity of variance contributes to our understanding of phenomena related to organizational groups and teams, including the consideration of new perspectives, such as the real-time impact of diversity changes on team functioning (see e.g., dynamic team diversity theory; Li, Meyer, Shemla, & Wegge, 2018).
Organizational Climate
Heteroscedasticity is also a factor of interest in organizational climate research. Organizational climate has been defined as experience-based perceptions of organizational environments based on attributes such as policies, procedures, and observed behaviors (Ostroff, Kinicki, & Muhammad, 2013; Schneider, 2000). Although early climate research approached organizational climate broadly (i.e., a molar approach), later work examined climate through a more focused lens (see Schneider, Ehrhart, & Macey, 2013), emphasizing that different climate types can exist within an organization (e.g., customer service, safety, and innovation climates). Organizational climate has been a topic of considerable interest to organizational researchers, as various climate types have been linked to a number of work outcomes. For example, innovative organizational climate has been positively linked to creative performance (Hsu & Fan, 2010), perceived innovation (Lin & Liu, 2012), and organizational performance (Shanker, Bhanugopan, van der Heijden, & Farrell, 2017). Likewise, organizations with a more positive


customer service climate tend to have higher customer satisfaction and greater profits (Schneider, Macey, Lee, & Young, 2009), and meta-analytic data demonstrated a positive relation between safety climate and safety compliance (Christian, Bradley, Wallace, & Burke, 2009). Within organizational climate research, there has been a focus on understanding how variability in climate perceptions, both across individuals and across units within organizations, can influence associated organizational outcomes (Zohar, 2010). One way in which consensus in climate perceptions within an organization has been examined is by assessing climate strength, which Schneider, Salvaggio, and Subirats (2002) summarize quite succinctly as "within-group variability in climate perceptions [such that] the less within-group variability, the stronger the climate" (p. 220). Climate strength is an example of a dispersion model (see Chan, 1998), in which the model measures the extent to which perceptions of a construct vary, and within-group variability is treated as a focal construct (Dawson, González-Romá, Davis, & West, 2008). Climate strength has been described as a moderator of relations between organizational climate and organizational outcomes, such that the effect of a particular climate (e.g., safety climate) is stronger when climate strength is high (Schneider et al., 2002, 2009; Shin, 2012). Yet other work suggested that climate strength may be curvilinearly related to organizational performance in some contexts, such that performance peaks at moderate levels of climate strength (Dawson et al., 2008). In sum, organizational climate research has benefited from the consideration of heteroscedasticity as a meaningful attribute. Thus, we encourage researchers to move beyond the presumption that systematic differences in variance in organizational data should simply be viewed as a violated statistical assumption to be corrected and, rather, to consider whether heteroscedasticity may provide a meaningful contribution to underlying theory and empirical models.
Summary
The above sections reviewed various substantive research areas where a change in variance may be of theoretical or practical importance. For example, although a stress intervention may result in lower strain for those in a treatment group compared to those in a control group, a smaller variance for those in the treatment group compared to those in the control group could also be meaningful (see Figure 4.1). Because researchers may not typically test for changes in variance, we review extant data analytic procedures in the following section.
DATA ANALYTIC PROCEDURES
A variety of statistical approaches exist for conducting tests on variances or changes in residual variance. Because these tests would likely be used in tandem with commonly used statistical procedures (e.g., tests on two independent means, simple linear regression), we organize this section based on such approaches.


We discuss procedures that can be used in (a) tests of the equality of two independent means, (b) analysis of variance, (c) analysis of covariance, and (d) multiple linear regression. It deserves noting that the sections below are all special cases of the general linear model. That is, for each of n observations, a quantitative dependent variable (y) can be modeled using a set of p predictor variables (x1, x2, …, xp) plus some unknown population error term. In matrix form, tests on two independent means, analysis of variance, analysis of covariance, linear regression, moderated multiple regression, and polynomial regression are subsumed by the general linear model:

y = Xβ + ε (1)

where y is an n × 1 vector of observations on the dependent variable, X is an n × (p + 1) model matrix including a leading column of 1s, β is a (p + 1) × 1 vector of population regression coefficients, and ε is an n × 1 vector of population errors. β is typically estimated using ordinary least squares (OLS) estimation; the OLS estimates are b = (X′X)⁻¹X′y, and the OLS-based estimates of the population errors are known as residuals (Rencher, 2000). Here, the population errors have a mean of 0 and a constant variance (σ²). When the constant variance assumption is violated, this is known as heteroscedasticity (Fox, 2016; Rosopa et al., 2013). It deserves noting that the normality assumption is not required for Equation 1 to be valid. However, when the population errors follow a normal distribution, this allows for statistical inferences on the model and its coefficients, including hypothesis tests and confidence intervals (Rencher, 2000). For each of the four sections below, the model matrix (X) changes, and we describe various procedures that can be used to test whether the variance changes as a function of one of the columns in X or some other variable (e.g., fitted values). A brief sketch of this general setup appears below.
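The following sketch illustrates the setup in Equation 1 for a small, simulated data set (the variable names and generating values are assumptions made for illustration): a model matrix X with a leading column of 1s, OLS estimates of the coefficients, and the residuals and fitted values on which the procedures described below operate.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Hypothetical data: a quantitative outcome y, a two-level group factor, and a
# continuous predictor x.
df = pd.DataFrame({
    "group": np.repeat(["control", "treatment"], 50),
    "x": rng.normal(0, 1, 100),
})
df["y"] = 2 + 1.5 * (df["group"] == "treatment") + 0.8 * df["x"] + rng.normal(0, 1, 100)

# Model matrix X: a column of 1s, a dummy variable for group, and x.
X = sm.add_constant(pd.DataFrame({
    "treatment": (df["group"] == "treatment").astype(int),
    "x": df["x"],
}))

fit = sm.OLS(df["y"], X).fit()
residuals = fit.resid          # OLS-based estimates of the population errors
fitted = fit.fittedvalues
print(fit.params)
```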


Testing the Equality of Two Independent Means
In research involving two independent groups, the independent samples t statistic is often used to test whether two population means differ from one another. For example, a researcher might assess whether the mean on a dependent variable differs between the treatment and control group. Independent of this test on two means, a researcher may want to assess whether the variances of the two independent groups differ. If the population variances differ, this is commonly known as heterogeneity of variance or heteroscedasticity (Rosopa et al., 2013). In this situation, p = 1 for a dummy variable representing group membership in one of two groups, and X is n × 2. One approach for testing whether two population variances differ from one another is due to Hartley (1950). However, because Box (1954) demonstrated that this test does not adequately control Type I error, it is not recommended. Instead, two other tests are recommended here. Bartlett (1937) proposed a test that transforms independent variances. The test statistic is approximately distributed as χ² with degrees of freedom equal to the number of groups minus 1. However, Box (1954) noted that this test can be sensitive to departures from normality. In instances where the normality assumption is violated, Brown and Forsythe's (1974) procedure is recommended. This approach is a modified version of Levene's (1960) test. Specifically, a two-sample t test can be conducted on the absolute value of the residuals. However, instead of calculating the absolute value of the residuals from the mean, the absolute value of the residuals is calculated using the median for each group. For a review, Bartlett's (1937) and Brown and Forsythe's (1974) procedures are discussed in Rosopa et al. (2013) and Rosopa, Schroeder, and Doll (2016). Thus, although a researcher may be interested in testing whether the mean for one group differs significantly from the mean of another group, if the researcher also suspects that the variances differ as a function of group membership (see e.g., Figure 4.1), two statistical approaches are recommended: Bartlett's (1937) test or Brown and Forsythe's (1974) test can be used. If the test is statistically significant at some fixed Type I error rate (α), the researcher can conclude that the population variances differ from one another. It deserves noting that if a researcher finds evidence that the variances are not the same between the two groups (i.e., heteroscedasticity exists), the conventional Student's t statistic should not be used to test for mean differences. Instead, Welch's t statistic should be used; this procedure allows the variances to be estimated separately for each group and, with Satterthwaite's corrected degrees of freedom, provides a more robust test for mean differences between two independent groups regardless of whether the homoscedasticity assumption is violated (Zimmerman, 2004). These tests are sketched in code below.
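A minimal sketch of these procedures, using simulated scores for two hypothetical groups, is shown below. Note that scipy implements Brown and Forsythe's (1974) procedure as Levene's test with the group medians as centers, and Welch's t statistic is obtained by setting equal_var=False.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
control = rng.normal(3.5, 1.0, 100)       # hypothetical strain scores
intervention = rng.normal(2.8, 0.4, 100)

# Bartlett's (1937) test: powerful under normality, sensitive to non-normality.
b_stat, b_p = stats.bartlett(control, intervention)

# Brown and Forsythe's (1974) test: Levene's test using group medians.
bf_stat, bf_p = stats.levene(control, intervention, center="median")

# Welch's t statistic for the mean difference, which does not assume
# homoscedasticity (Satterthwaite-corrected degrees of freedom).
t_stat, t_p = stats.ttest_ind(control, intervention, equal_var=False)

print(f"Bartlett:       chi2 = {b_stat:.2f}, p = {b_p:.4f}")
print(f"Brown-Forsythe: F    = {bf_stat:.2f}, p = {bf_p:.4f}")
print(f"Welch t:        t    = {t_stat:.2f}, p = {t_p:.4f}")
```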


Analysis of Variance
In a one-way analysis of variance, the population means on the dependent variable are believed to differ (in some way) across two or more independent groups. Assuming that the population error term in Equation 1 is normally distributed, the test statistic is distributed as an F random variable (Rencher, 2000). However, in addition to tests on two or more means, a researcher may be interested in testing whether variance changes systematically across two or more groups. For example, with three groups, the variance may be large for the control group but small for treatment A and treatment B. With three independent groups, because there are two dummy variables for group membership, p = 2 and X is n × 3. In the case of a one-way analysis of variance, Bartlett's (1937) test and Brown and Forsythe's (1974) test are also suggested. However, Brown and Forsythe's (1974) test becomes, more generally, an analysis of variance on the absolute value of the residuals around the respective medians. Thus, if the χ² test or the F test, respectively, is statistically significant at α, this suggests that the variances differ among the groups. Note that with three independent groups there are three pairwise comparisons that can be conducted. However, there are only two linearly independent comparisons (i.e., two degrees of freedom). If a researcher were to conduct additional tests to isolate which of the three groups had significantly different variances, a Bonferroni correction procedure is recommended (see the sketch below). For a factorial analysis of variance, a researcher may be interested in main effects for each categorical predictor (i.e., marginal mean differences) as well as possible interactions between categorical predictors. However, independent of the mean differences, a researcher may also be interested in testing the main and interactive effects of the residual variances. O'Brien (1979, 1981) developed an approach that can be used to test the main and interactive effects of the variances in the cells of both one-way and factorial designs. See also an extension by Rosopa et al. (2016).
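The one-way case can be sketched as follows with three simulated groups (hypothetical values only). The omnibus tests on the variances are followed by pairwise Brown–Forsythe comparisons evaluated against a Bonferroni-adjusted alpha.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(6)
groups = {
    "control":     rng.normal(5.0, 1.2, 60),   # hypothetical outcome scores
    "treatment_a": rng.normal(4.2, 0.6, 60),
    "treatment_b": rng.normal(4.0, 0.5, 60),
}

# Omnibus tests on the variances across the three groups.
print("Bartlett:", stats.bartlett(*groups.values()))
print("Brown-Forsythe:", stats.levene(*groups.values(), center="median"))

# Pairwise follow-up tests with a Bonferroni-adjusted alpha (.05 / 3 comparisons).
alpha_adj = 0.05 / 3
for (name1, g1), (name2, g2) in combinations(groups.items(), 2):
    stat, p = stats.levene(g1, g2, center="median")
    flag = "significant" if p < alpha_adj else "not significant"
    print(f"{name1} vs {name2}: p = {p:.4f} ({flag} at the adjusted alpha)")
```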

Analysis of Covariance
In analysis of covariance, a researcher is typically interested in examining whether population differences on a dependent variable exist across multiple groups. However, a researcher may have one or more continuous predictors that they want to control statistically. Often, these continuous predictors (i.e., covariates) are demographic variables (e.g., employee's age), individual differences (e.g., spatial ability), or a pretest variable. Assuming the simplest analysis of covariance, where a researcher has two independent groups and one covariate, because there is one dummy variable representing group membership and one covariate (typically centered), p = 2 and the model matrix (X) is n × 3. Here, the continuous predictor is centered because in analysis of covariance researchers often are interested in the adjusted means on the dependent variable, where the adjustment is at the grand mean of the continuous predictor (i.e., covariate) (Fox, 2016). In analysis of covariance, residual variance can change as a function of not only the categorical predictor but also the continuous predictor (i.e., covariate). For instances where a researcher suspects that the residual variance is changing as a function of a categorical predictor, the procedures discussed above can be used. Specifically, the OLS-based residuals from the analysis of covariance can be saved. Then, either Bartlett's (1937) test or Brown and Forsythe's (1974) test can be used to determine whether the residual variance changes as a function of the categorical predictor (a brief sketch appears below). As noted above, with three or more groups, if additional tests are conducted to isolate which of the groups had significantly different variances, a Bonferroni correction procedure is recommended. In analysis of covariance, the residual variance could also change as a function of the continuous predictor. Here, a general approach is suggested, known as a score test (Breusch & Pagan, 1979; Cook & Weisberg, 1983). This is discussed in the next section on multiple linear regression.
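A brief sketch of the categorical-predictor check in an analysis of covariance appears below. The data are simulated, and the variable names (group, pretest, posttest) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "group": np.repeat(["control", "treatment"], 75),
    "pretest": rng.normal(50, 10, 150),        # hypothetical covariate
})
# Error SD is smaller in the treatment group, so heteroscedasticity is built in.
df["posttest"] = (5 + 0.6 * df["pretest"] + 4 * (df["group"] == "treatment")
                  + rng.normal(0, np.where(df["group"] == "treatment", 3, 6), 150))

# ANCOVA: posttest regressed on the group dummy and the grand-mean-centered covariate.
df["pretest_c"] = df["pretest"] - df["pretest"].mean()
fit = smf.ols("posttest ~ C(group) + pretest_c", data=df).fit()

# Save the OLS residuals and test whether their variance differs by group
# (Brown-Forsythe); Bartlett's test could be substituted under normality.
resid_by_group = [fit.resid[df["group"] == g] for g in df["group"].unique()]
print(stats.levene(*resid_by_group, center="median"))
```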

Multiple Linear Regression
In multiple linear regression, a researcher typically has many predictors of a continuous dependent variable. A researcher may have categorical predictors only (e.g., analysis of variance), continuous and categorical predictors (e.g., analysis of covariance), or categorical and continuous predictors along with functions of the predictors (e.g., quadratic terms, product terms). Although such a model can be increasingly complex, the overall model is still that shown in Equation 1. The model matrix (X) is n × (p + 1), where p denotes the number of predictors/regressors in the overall model. If the homoscedasticity assumption is violated, as noted above, it could be due to a categorical predictor. In such instances, similar to the analyses discussed above, Bartlett's (1937) test or Brown and Forsythe's (1974) test can be used. When conducting multiple comparisons on the residual variances, Bonferroni corrections are recommended. For instances where the residual variance changes as a function of a continuous predictor (e.g., a covariate in analysis of covariance), a general statistical approach is available, known as the score test. This test was independently developed in the econometrics (Breusch & Pagan, 1979) and statistics (Cook & Weisberg, 1983) literatures. It can detect various forms of heteroscedasticity (i.e., change in residual variance). The test requires fitting two regression models. In the first, the sum of squares error (SSE) from the full regression model of interest is obtained. Then, in the second, the squared OLS residuals from the first analysis are regressed on the variables purported to be the cause of the heteroscedasticity (e.g., a continuous predictor), and the sum of squares regression (SSR) from this auxiliary model is obtained. The test statistic, (SSR/2) ÷ (SSE/n)², is asymptotically distributed as χ² with degrees of freedom equal to the number of variables used to predict the squared OLS residuals. The score test is considered a general test for heteroscedasticity because it can detect whether the residual variance changes as a function of categorical predictors, continuous predictors, or predicted values (Kutner, Nachtsheim, Neter, & Li, 2005). Thus, it is a very flexible statistical approach and can be used not only for multiple regression models, but also for two independent groups, analysis of variance, analysis of covariance, and models with interaction terms and higher-order terms (e.g., quadratic or cubic terms) (Rosopa et al., 2013). A sketch of the score test appears below.
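The sketch below computes the score test in the (SSR/2) ÷ (SSE/n)² form described above for a simulated simple regression and, for comparison, calls statsmodels' het_breuschpagan, which reports a studentized Lagrange multiplier variant, so its value can differ somewhat from the classical score statistic. The data and variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0, 10, n)
y = 3 + 0.5 * x + rng.normal(0, 0.3 + 0.3 * x, n)   # hypothetical heteroscedastic data

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Score test: regress the squared OLS residuals on the suspected source of
# heteroscedasticity and form (SSR/2) / (SSE/n)^2.
e2 = fit.resid ** 2
aux = sm.OLS(e2, X).fit()
sse = np.sum(fit.resid ** 2)          # SSE from the original model
ssr = aux.ess                         # regression sum of squares from the auxiliary model
score = (ssr / 2) / (sse / n) ** 2
df_score = X.shape[1] - 1             # number of variables predicting the squared residuals
print("score statistic:", round(score, 2), "p =", 1 - stats.chi2.cdf(score, df_score))

# statsmodels' built-in Breusch-Pagan test (studentized LM form).
lm, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, X)
print("LM statistic:", round(lm, 2), "p =", lm_p)
```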

Summary
In this section, we reviewed statistical procedures commonly used in human resource management, organizational psychology, and related disciplines, and we discussed data-analytic procedures that can be used to detect changes in residual variance. It deserves noting that if a researcher finds evidence to support their theory that variance changes as expected, this also means that the homoscedasticity assumption in general linear models is violated. Thus, although a researcher may have found evidence that residual variance changes as a continuous predictor increases (see e.g., Figure 4.2), the use of OLS estimation in linear models is no longer optimal; the coefficient estimates are inefficient (Rencher, 2000). That is, although parameter estimates remain unbiased in the presence of heteroscedasticity, statistical inferences (e.g., hypothesis tests, confidence intervals) involving means, regression slopes, and linear combinations of regression slopes will be incorrect (Fox, 2016; Kutner et al., 2005). However, more general solutions are available, including weighted least squares regression (Kutner et al., 2005) and heteroscedasticity-consistent covariance matrices (Fox, 2016; Ng & Wilcox, 2009, 2011); both are sketched below. For brief reviews, see also Rosopa et al. (2013) and Rosopa, Brawley, Atkinson, and Robertson (2018).
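The two remedies just mentioned can be sketched as follows for the same kind of simulated heteroscedastic data: OLS point estimates paired with a heteroscedasticity-consistent (HC3) covariance matrix, and a weighted least squares fit. Here the weights are treated as known for simplicity; in practice they would typically be estimated, for example, from a model for the squared residuals.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 200
x = rng.uniform(0, 10, n)
y = 3 + 0.5 * x + rng.normal(0, 0.3 + 0.3 * x, n)   # hypothetical heteroscedastic data
X = sm.add_constant(x)

# (a) Keep the OLS slopes but base hypothesis tests and confidence intervals on
# a heteroscedasticity-consistent covariance matrix (HC3).
ols_hc3 = sm.OLS(y, X).fit(cov_type="HC3")
print("OLS estimates:", ols_hc3.params)
print("HC3 standard errors:", ols_hc3.bse)

# (b) Weighted least squares, with weights inversely proportional to the
# error variance (assumed known in this sketch).
weights = 1.0 / (0.3 + 0.3 * x) ** 2
wls = sm.WLS(y, X, weights=weights).fit()
print("WLS estimates:", wls.params)
```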

CONCLUSION
A major objective of this paper is to describe how heteroscedasticity can be more than just a statistical violation. Rather, differences in residual variance could be a necessary and implicit aspect of a theory or empirical study. We included examples from five organizational research domains in which heteroscedasticity may provide a substantive contribution, thus highlighting that although changes in residual variance are often viewed as statistically problematic, heteroscedasticity can also contribute meaningfully to our understanding of various organizational phenomena. Nevertheless, there are likely other topical areas germane to organizational contexts in which heteroscedasticity may occur (see e.g., Aguinis & Pierce, 1998; Bell & Fusco, 1989; Dalal et al., 2015; Grissom, 2000). Thus, we hope that this paper stimulates research that considers the impact of heteroscedasticity, as heterogeneity of variance can serve as an important explanatory mechanism that can provide insight into a variety of organizational phenomena. We encourage researchers to consider whether there is a theoretical basis for a priori expectations of heteroscedasticity in their data, as well as to consider whether unanticipated heterogeneity of variance may have substantive meaning. Stated differently, although homogeneity of variance is a statistical assumption of the general linear model, we suggest that researchers carefully consider whether changes in residual variance can be attributed to other constructs in a nomological network (Cronbach & Meehl, 1955). Overall, this can enrich both theory and practice in human resource management and allied fields.
REFERENCES
Ackerman, P. L. (1987). Individual differences in skill learning: An integration of psychometric and information processing perspectives. Psychological Bulletin, 102, 3–27. doi:10.1037//0033-2909.102.1.3
Ackerman, P. L. (2007). New developments in understanding skilled performance. Current Directions in Psychological Science, 16, 235–239. doi:10.1111/j.1467-8721.2007.00511.x
Ackerman, P. L., & Cianciolo, A. T. (2000). Cognitive, perceptual-speed, and psychomotor determinants of individual differences during skill acquisition. Journal of Experimental Psychology: Applied, 6, 259–290. doi:10.1037//1076-898X.6.4.259
Aguinis, H., & Pierce, C. A. (1998). Heterogeneity of error variance and the assessment of moderating effects of categorical variables: A conceptual review. Organizational Research Methods, 1, 296–314. doi:10.1177/109442819813002

Antonakis, J., & Dietz, J. (2011). Looking for validity or testing it? The perils of stepwise regression, extreme-scores analysis, heteroscedasticity, and measurement error. Personality and Individual Differences, 50, 409–415. doi:10.1016/j.paid.2010.09.014
Backman, L., Small, B. J., & Wahlin, A. (2001). Aging and memory: Cognitive and biological perspectives. In J. E. Birren & W. K. Schaie (Eds.), Handbook of the psychology of aging (pp. 349–366). San Diego, CA: Academic Press.
Baltes, P. B., & Baltes, M. M. (1990). Psychological perspectives on successful aging: The model of selective optimization with compensation. In P. B. Baltes & M. M. Baltes (Eds.), Successful aging: Perspectives from the behavioral sciences (pp. 1–34). New York, NY: Cambridge University Press.
Barak, B. (2009). Age identity: A cross-cultural global approach. International Journal of Behavioral Development, 33, 2–11. doi:10.1177/0165025408099485
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83, 377–391. doi:10.1037/0021-9010.83.3.377
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society, A160, 268–282. doi:10.1098/rspa.1937.0109
Bell, P. A., & Fusco, M. E. (1989). Heat and violence in the Dallas field data: Linearity, curvilinearity, and heteroscedasticity. Journal of Applied Social Psychology, 19, 1479–1482. doi:10.1111/j.1559-1816.1989.tb01459.x
Bell, S. (2007). Deep-level composition variables as predictors of team performance: A meta-analysis. Journal of Applied Psychology, 92, 595–615. doi:10.1037/0021-9010.92.3.595
Bond, F. W., & Bunce, D. (2001). Job control mediates change in a work reorganization intervention for stress reduction. Journal of Occupational Health Psychology, 6, 290–302. doi:10.1037//1076-8998.6.4.290
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 25, 290–302. doi:10.1214/aoms/1177728786
Boyle, P. A., Yu, L., Wilson, R. S., Gamble, K., Buchman, A. S., & Bennett, D. A. (2012). Poor decision making is a consequence of cognitive decline among older persons without Alzheimer's disease or mild cognitive impairment. PLOS One, 7, 1–5. doi:10.1371/journal.pone.0043647
Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47, 1287–1294. doi:10.2307/1911963
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69, 364–367. doi:10.2307/2285659
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental studies: A challenge to conventional interpretations. Psychological Bulletin, 104, 396–404. doi:10.1037//0033-2909.104.3.396
Byron, K., Peterson, S. J., Zhang, Z., & LePine, J. A. (2018). Realizing challenges and guarding against threats: Interactive effects of regulatory focus and stress on performance. Journal of Management, 44, 3011–3037. doi:10.1177/0149206316658349
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-Bass.

Carstensen, L. L. (1995). Evidence for a life-span theory of socioemotional selectivity. Current Directions in Psychological Science, 4, 151–156. doi:10.1111/1467-8721.ep11512261
Carstensen, L. L., Isaacowitz, D. M., & Charles, S. T. (1999). Taking time seriously: A theory of socioemotional selectivity. American Psychologist, 54, 165–181. doi:10.1037/0003-066X.54.3.165
Chan, D. (1998). Functional relationships among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246. doi:10.1037/0021-9010.83.2.234
Childs, J. H., & Stoeber, J. (2012). Do you want me to be perfect? Two longitudinal studies on socially prescribed perfectionism, stress and burnout in the workplace. Work & Stress, 26, 347–364. doi:10.1080/02678373.2012.737547
Chou, L., Wang, A., Wang, T., Huang, M., & Cheng, B. (2008). Shared work values and team member effectiveness: The mediation of trustfulness and trustworthiness. Human Relations, 61, 1713–1742. doi:10.1177/0018726708098083
Christensen, H., Mackinnon, A. J., Korten, A. E., Jorm, A. F., Henderson, A. S., Jacomb, P., & Rodgers, B. (1999). An analysis of diversity in the cognitive performance of elderly community dwellers: Individual differences in change scores as a function of age. Psychology and Aging, 14, 365–379.
Christian, M. S., Bradley, J. C., Wallace, J. C., & Burke, M. J. (2009). Workplace safety: A meta-analysis of the roles of person and situation factors. Journal of Applied Psychology, 94, 1103–1127. doi:10.1037/a0016172
Colcombe, S. J., Kramer, A. F., Erickson, K. I., & Scalf, P. (2005). The implications of cortical recruitment and brain morphology for individual differences in inhibitory function in aging humans. Psychology and Aging, 20, 363–375. doi:10.1037/0882-7974.20.3.363
Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70, 1–10. doi:10.2307/2335938
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Dalal, R. S., Meyer, R. D., Bradshaw, R. P., Green, J. P., Kelly, E. D., & Zhu, M. (2015). Personality strength and situational influences on behavior: A conceptual review and research agenda. Journal of Management, 41, 261–287. doi:10.1177/0149206314557524
Darr, W., & Johns, G. (2008). Work strain, health and absenteeism: A meta-analysis. Journal of Occupational Health Psychology, 13, 293–318. doi:10.1037/a0012639
Dawson, J. F., González-Romá, V., Davis, A., & West, M. A. (2008). Organizational climate and climate strength in UK hospitals. European Journal of Work and Organizational Psychology, 17, 89–111. doi:10.1080/13594320601046664
DeFrank, R. S., & Cooper, C. L. (1987). Worksite stress management interventions: Their effectiveness and conceptualisation. Journal of Managerial Psychology, 2, 4–10. doi:10.1108/eb043385
De Jong, B. A., Dirks, K. T., & Gillespie, N. (2016). Trust and team performance: A meta-analysis of main effects, moderators, and covariates. Journal of Applied Psychology, 101, 1124–1150. doi:10.1037/apl0000110

DeShon, R. P., & Alexander, R. A. (1996). Alternative procedures for testing regression slope homogeneity when group error variances are unequal. Psychological Methods, 1, 261–277. doi:10.1037/1082-989X.1.3.261
Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). Thousand Oaks, CA: Sage.
Froehlich, D. E., Beausaert, S., & Segers, M. (2016). Aging and the motivation to stay employable. Journal of Managerial Psychology, 31, 756–770. doi:10.1108/JMP-08-2014-0224
Galantino, M. L., Baime, M., Maguire, M., Szapary, P. O., & Farrar, J. T. (2005). Association of psychological and physiological measures of stress in health-care professionals during an 8-week mindfulness meditation program: Mindfulness in practice. Stress and Health, 21, 255–261. doi:10.1002/smi.1062
Giga, S. I., Noblet, A. J., Faragher, B., & Cooper, C. L. (2003). The UK perspective: A review of research on organisational stress management interventions. Australian Psychologist, 38, 158–164. doi:10.1080/00050060310001707167
Gilboa, S., Shirom, A., Fried, Y., & Cooper, C. (2008). A meta-analysis of work-demand stressors and job performance: Examining main and moderating effects. Personnel Psychology, 61, 227–271. doi:10.1111/j.1744-6570.2008.00113.x
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155–165. doi:10.1037/0022-006X.68.1.155
Harrison, D. A., Price, K. H., Gavin, J. H., & Florey, A. T. (2002). Time, teams, and task performance: Changing effects of surface- and deep-level diversity on group functioning. Academy of Management Journal, 45, 1029–1045. doi:10.2307/3069328
Hartley, H. O. (1950). The maximum F-ratio as a short-cut test for heterogeneity of variance. Biometrika, 37(3/4), 308–312.
Horwitz, S. K., & Horwitz, I. B. (2007). The effects of team diversity on team outcomes: A meta-analytic review of team demography. Journal of Management, 33, 987–1015.
Hsu, M. L. A., & Fan, H. (2010). Organizational innovation climate and creative outcomes: Exploring the moderating effect of time pressure. Creativity Research Journal, 22, 378–386. doi:10.1080/10400419.2010.523400
Ivancevich, J. M., Matteson, M. T., Freedman, S. M., & Phillips, J. S. (1990). Worksite stress management interventions. American Psychologist, 45, 252–261. doi:10.1037//0003-066X.45.2.252
Jackson, S. E. (1983). Participation in decision making as a strategy for reducing job-related strain. Journal of Applied Psychology, 68, 3–19. doi:10.1037//0021-9010.68.1.3
Janis, I. L. (1972). Victims of groupthink. Boston, MA: Houghton-Mifflin.
Kanfer, R., & Ackerman, P. L. (1989). Motivation and cognitive abilities: An integrative/aptitude-treatment interaction approach to skill acquisition. Journal of Applied Psychology, 74, 657–690. doi:10.1037//0021-9010.74.4.657
King, B. M., Rosopa, P. J., & Minium, E. W. (2018). Statistical reasoning in the behavioral sciences (7th ed.). Hoboken, NJ: Wiley.
Kotter-Grühn, D., Kornadt, A. E., & Stephan, Y. (2016). Looking beyond chronological age: Current knowledge and future directions in the study of subjective age. Gerontology, 62, 86–93. doi:10.1159/000438671
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models (5th ed.). New York, NY: McGraw-Hill.

84 •  SCHROEDER, ROSOPA, WHITAKER, FAIRBANKS, & XOXAKOS LaMontagne, A. D., Keegel, T., Louie, A. M., Ostry, A., & Landsbergis, P. A. (2007). A systematic review of the job-stress intervention evaluation literature. International Journal of Occupational and Environmental Health, 13, 268–280. doi:10.1179/ oeh.2007.13.3.268 Lazarus, R. S., & Folkman, S. (1984). Stress, appraisal, and coping. New York, NY: Springer. Levene, H. (1960). Robust tests for equality of variances. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and statistics (pp. 278–292). Stanford, CA: Stanford University Press. Li, J., Meyer, B., Shemla, M., & Wegge, J. (2018). From being diverse to becoming diverse: A dynamic team diversity theory. Journal of Organizational Behavior, 39, 956–970. doi:10.1002/job.2272 Lin, Y. Y., & Liu, F. (2012). A cross‐level analysis of organizational creativity climate and perceived innovation: The mediating effect of work motivation. European Journal of Innovation Management, 15, 55–76. doi:10.1108/14601061211192834 Moen, P., Kojola, E., & Schaefers, K. (2017). Organizational change around an older workforce. The Gerontologist, 57, 847–856. doi:10.1093/geront/gnw048 Morse, C. K. (1993). Does variability increase with age? An archival study of cognitive measures. Psychology and Aging, 8, 156–164. doi:10.1037/0882-7974.8.2.156 Mullen, B., Anthony, T., Salas, E., & Driskell, J. E. (1994). Group cohesiveness and quality of decision making: An integration of tests of the groupthink hypothesis. Small Group Research, 25, 189–204. doi:10.1177/1046496494252003 Neuman, G. A., Wagner, S. H., & Christiansen, N. D. (1999). The relationship between work-team personality composition and the job performance of teams. Group & Organization Management, 24, 28–45. doi:10.1177/1059601199241003 Ng, T. W., & Feldman, D. C. (2008). The relationship of age to ten dimensions of job performance. Journal of Applied Psychology, 93, 392–423. doi:10.1037/00219010.93.2.392 Ng, M., & Wilcox, R. R. (2009). Level robust methods based on the least squares regression estimator. Journal of Modern Applied Statistical Methods, 8, 384–395. Ng, M., & Wilcox, R. R. (2011). A comparison of two-stage procedures for testing leastsquares coefficients under heteroscedasticity. British Journal of Mathematical and Statistical Psychology, 64, 244–258. doi:10.1348/000711010X508683 O’Brien, R. G. (1979). A general ANOVA method for robust tests of additive models for variances. Journal of the American Statistical Association, 74, 877–880. doi:10.2307/2286416 O’Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psychological Bulletin, 89, 570–574. doi:10.1037//0033-2909.89.3.570 Ostroff, C., & Fulmer, C. A. (2014). Variance as a construct: Understanding variability beyond the mean. In J. K. Ford, J. R. Hollenbeck, & A. M. Ryan (Eds.), The nature of work: Advances in psychological theory, methods, and practice (pp. 185–210). Washington, DC: APA. doi:10.1037/14259-010 Ostroff, C., Kinicki, A. J., & Muhammad, R. S. (2013). Organizational culture and climate. In N. W. Schmitt, S. Highhouse, & I. B. Weiner (Eds.), Handbook of psychology: Industrial and organizational psychology (pp. 643–676). Hoboken, NJ: Wiley. Panatik, S. A., O’Driscoll, M. P., & Anderson, M. H. (2011). Job demands and work-related psychological responses among Malaysian technical workers: The moderating

Heteroscedasticity in Organizational Research  •  85 effects of self-efficacy. Work & Stress, 25, 355–370. doi:10.1080/02678373.2011.6 34282 Plomin, R. & Thompson, L. (1988). Life-span developmental behavioral genetics. In Baltes, P. B., Featherman, D. L., & Lerner, R. M. (Eds.), Life-span development and behavior (pp. 1–31). Hillsdale, NJ: Lawrence Erlbaum. Podsakoff, N. P., LePine, J. A., & LePine, M. A. (2007). Differential challenge stressorhindrance stressor relationships with job attitudes, turnover intentions, turnover, and withdrawal behavior: A meta-analysis. Journal of Applied Psychology, 92, 438–454. doi:10.1037/0021-9010.92.2.438 Rencher, A. C. (2000). Linear models in statistics. New York, NY: Wiley. Richardson, K. M., & Rothstein, H. R. (2008). Effects of occupational stress management intervention programs: A meta-analysis. Journal of Occupational Health Psychology, 13, 69–93. doi:10.1037/1076-8998.13.1.69 Rosopa, P. J., Brawley, A. M., Atkinson, T. P., & Robertson, S. A. (2018). On the conditional and unconditional Type I error rates and power of tests in linear models with heteroscedastic errors. Journal of Modern Applied Statistical Methods, 17(2), eP2647. doi:10.22237/jmasm/1551966828 Rosopa, P. J., Schaffer, M. M., & Schroeder, A. N. (2013). Managing heteroscedasticity in general linear models. Psychological Methods, 18, 335–351. doi:10.1037/a0032553 Rosopa, P. J., Schroeder, A. N., & Doll, J. L. (2016). Detecting between-groups heteroscedasticity in moderated multiple regression with a continuous predictor and a categorical moderator: A Monte Carlo study. SAGE Open, 6(1), 1–14. doi:10.1177/2158244015621115 Salkind, N. J. (2007). Encyclopedia of measurement and statistics. Thousand Oaks, CA: Sage. doi:10.4135/9781412952644 Salkind, N. J. (2010). Encyclopedia of research design. Thousand Oaks, CA: Sage. doi:10.4135/9781412961288 Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton-Mifflin. Schaufeli, W. B., & Enzmann, D. (1998). The burnout companion to study and practice. Philadelphia, PA: Taylor & Francis. Schneider, B. (2000). The psychological life of organizations. In N. M. Ashkanasy, C. P. M. Wilderom, & M. F. Peterson (Eds.), Handbook of organizational culture & climate (pp. xvii–xxi). Thousand Oaks, CA: Sage. Schneider, B., Ehrhart, M. G., & Macey, W. H. (2013). Organizational climate and culture. Annual Review of Psychology, 64, 361–388. doi:10.1146/annurevpsych-113011-143809 Schneider, B., Macey, W. H., Lee, W. C., & Young, S. A. (2009). Organizational service climate drivers of the American Customer Satisfaction Index (ACSI) and financial and market performance. Journal of Service Research, 12, 3–14. doi:10.1177/1094670509336743 Schneider, B., Salvaggio, A. N., & Subirats, M. (2002). Climate strength: A new direction for climate research. Journal of Applied Psychology, 87, 220–229. doi:10.1037/00219010.87.2.220 Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton-Mifflin.

86 •  SCHROEDER, ROSOPA, WHITAKER, FAIRBANKS, & XOXAKOS Shanker, R., Bhanugopan, R., van der Heijden, Beatrice I. J. M., & Farrell, M. (2017). Organizational climate for innovation and organizational performance: The mediating effect of innovative work behavior. Journal of Vocational Behavior, 100, 67–77. doi:10.1016/j.jvb.2017.02.004 Shin, Y. (2012). CEO ethical leadership, ethical climate, climate strength, and collective organizational citizenship behavior. Journal of Business Ethics, 108(3), 299–312. doi:10.1007/s10551-011-1091-7 Spirduso, W. W., Francis, K. L., & MacRae, P. G. (2005). Physical dimensions of aging (2nd ed.). Champaign, IL: Human Kinetics. Taylor, M. A., & Bisson, J. B. (2019). Changes in cognitive functioning: Practical and theoretical considerations for training the aging workforce. Human Resource Management Review. Advance online publication. doi:10.1016/j.hrmr.2019.02.001 van der Klink, J. J. L., Blonk, R. W. B., Schene, A. H., & van Dijk, F. J. H. (2001). The benefits of interventions for work-related stress. American Journal of Public Health, 91, 270–276. doi:10.2105/AJPH.91.2.270 van Dijk, H., van Engen, M. L., & van Knippenberg, D. (2012). Defying conventional wisdom: A meta-analytical examination of the differences between demographic and job-related diversity relationships with performance. Organizational Behavior and Human Decision Processes, 119, 38–53. doi:10.1016/j.obhdp.2012.06.003 Webster, J. R., Beehr, T. A., & Love, K. (2011). Extending the challenge-hindrance model of occupational stress: The role of appraisal. Journal of Vocational Behavior, 79, 505–516. doi:10.1016/j.jvb.2011.02.001 White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817– 838. doi:10.2307/1912934 Wilcox, R. R. (1997). Comparing the slopes of two independent regression lines when there is complete heteroscedasticity. British Journal of Mathematical and Statistical Psychology, 50, 309–317. doi:10.1111/j.2044- 8317.1997.tb01147.x Yung, P. M. B., Fung, M. Y., Chan, T. M. F., & Lau, B. W. K. (2004). Relaxation training methods for nurse managers in Hong Kong: A controlled study. International Journal of Mental Health Nursing, 13, 255–261. doi:10.1111/j.1445-8330.2004.00342.x Zhang, Y., Zhang, Y., Ng, T. W. H., & Lam, S. S. K. (2019). Promotion- and preventionfocused coping: A meta-analytic examination of regulatory strategies in the work stress process. Journal of Applied Psychology, 104(10), 1296–1323. doi:10.1037/ apl0000404 Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57(1), 173–181. doi:10.1348/000711004849222 Zohar, D. (2010). Thirty years of safety climate research: Reflections and future directions. Accident Analysis and Prevention, 42, 1517–1522. doi:10.1016/j.aap.2009.12.019

CHAPTER 5

KAPPA AND ALPHA AND PI, OH MY
Beyond Traditional Inter-rater Reliability Using Gwet's AC1 Statistic

Julie I. Hancock, James M. Vardaman, and David G. Allen

The aggregation of research results is an important and useful mechanism for better understanding a variety of phenomena, including those relevant to human resource management (HRM). For decades, aggregate studies in the form of meta-analyses have provided an overarching examination and synthesis of results. Such studies provide conclusions based on the amalgamation of previous works, enabling scholars to determine the consistency of findings and the magnitude of an effect, as well as affording more precise recommendations than may be offered based on the results of a solitary study (Aguinis, Dalton, Bosco, Pierce, & Dalton, 2011; Borenstein, Hedges, Higgins, & Rothstein, 2011). Similarly, content analyses are another mechanism by which to identify trends through examining the text, themes, or concepts present across existing studies. These aggregate studies can have real and significant implications for better understanding HRM issues (e.g., Allen, Hancock, Vardaman, & McKee, 2014; Barrick & Mount, 1991; Eby, Casper, Lockwood, Bordeaux, & Brinley, 2005; Hancock, Allen, Bosco, McDaniel, & Pierce, 2013; Pindek, Kessler, & Spector, 2017). However, in order to deduce meaningful conclusions, the data must be reliable and demonstrate construct validity.



These methods of study aggregation typically require the employment of multiple coders to systematically gather and categorize data into an appropriate coding scheme. The agreement among coders is a significant issue in these studies, as disagreement could constitute a threat to the validity of the results of aggregation studies. Consequently, inter-rater reliability (IRR) is calculated to determine the degree to which coders consistently agree upon the categorization of variables of interest (Bliese, 2000; LeBreton, Burgess, Kaiser, Atchley, & James, 2003). The most basic approach is the simple calculation of the percentage of agreements that coders have established, whereby the number of total actual agreements is divided by the total possible number of agreements. The simplicity of calculating percentage agreements makes it a commonly used index of IRR in the management literature. Although this method provides an easily calculable general indication of the degree to which coders agree, it can be misleading, failing to take into consideration the impact that chance may have on the reliability of agreement. Deviations from 100% agreement become less meaningful and may result in an inflated IRR, jeopardizing the construct validity of the measure and indicating that percentage agreement is only useful and meaningful in very specific conditions. Despite these potential shortcomings, IRR has traditionally been reported as the simple percentage of agreement among coders in the management literature (e.g., Barrick & Mount, 1991; Eby et al., 2005; Hancock et al., 2013; Hoch, Bommer, Dulebohn, & Wu, 2018; Judge & Ilies, 2002; Mackey, Frieder, Brees, & Martinko, 2017).

Reliability statistics such as Scott's pi (π) (1955), Cohen's kappa (κ) (1960) (e.g., Heugens & Lander, 2009; Koenig, Eagly, Mitchell, & Ristikari, 2011), and Krippendorff's alpha (α) (1980) (e.g., Tuggle, Schnatterly, & Johnson, 2010; Tuggle, Sirmon, Reutzel, & Bierman, 2010) have increasingly been identified as superior indices of IRR in comparison to simple percentage agreement and are beginning to appear in aggregate studies. However, these indices are also not without limitations. Each of the more sophisticated indicators has been derived in order to combat the shortcomings of its predecessors. Even so, neither π, κ, nor α will be appropriate in all circumstances. In particular, each has limitations in a variety of scenarios, including those where: (a) there are multiple coders but different combinations of coders for different cases, (b) there exist any number of categories, scale values, or measures, (c) there are missing data, (d) known prevalence (dichotomous coding) exists, (e) the data are skewed, and (f) any sample size must be accommodated. A search for another option for calculating inter-rater agreement across several disciplines elicited attention to the AC1 statistic for IRR established by Gwet (2001). This statistic "is a more robust chance-corrected statistic that consistently yields reliable results" (Gwet, 2002b, p. 5) as compared to κ, providing scholars with a more accurate measurement in each of those situations.


Thus, in this paper, we contribute to the management literature in several ways. First, we demonstrate the utility of the AC1 by offering a comparison of key characteristics of five IRR indices, such as the number of coders/observers, level of measurement, and sample size, as well as the number of categories, scale values, or measures each IRR index can accommodate. Further, we compare the degree to which each IRR index accommodates missing data, known prevalence (data coded 0 or 1), and skewed data, highlighting the contextual characteristics in which each index may be used appropriately. Next, we examine over 440 studies to provide a side-by-side, data-driven comparison of each of the five IRR indices discussed, showing the variation that exists based on synthesis characteristics and how the inferences and conclusions made by researchers as a result of IRR index selection may vary. Finally, we provide recommendations for the best indices to use for calculating IRR and address additional areas of practicality for AC1, specifically the value it holds for HRM practices. In so doing, we highlight the value of AC1 over other IRR indices in two specific situations: when examining dichotomous variables and when more than two coders are engaged in coding.

LITERATURE REVIEW

The degree to which data analysis and synthesis can lead to prescriptions for researchers and practitioners is dependent upon the level of accuracy and reliability with which coders of the data agree. IRR indices seek to provide some degree of trust and assurance in data that are coded and categorized by human observers, thus increasing the degree of confidence researchers have in data driven by human judgments (Hayes & Krippendorff, 2007) by improving construct validity. In their review of several IRR indices, Hayes and Krippendorff (2007) identify five properties that exemplify the nature of a good reliability index. First, agreement amongst two or more coders/observers working independently to ascribe categorizations to observations ought to be assessed without influence from the number of independent coders present or from variation in the coders involved. Thus, the individual coders participating in the codification of data should not influence coding agreement. Second, the number of categories to be coded should not bias the reliabilities. Thus, reliability indices should not be influenced in one direction or the other by the number of categories prescribed by the developer of the coding schemata. Third, the reliability metric should be represented on a "numerical scale between at least two points with sensible reliability interpretations" (Hayes & Krippendorff, 2007, p. 79). Thus, scales whereby a 0 indicates that zero agreement exists suggest a violation of the assumption of independence of coders and are thus ambiguous in their assessment of reliability. Fourth, Hayes and Krippendorff (2007) suggest that a good reliability index should "be appropriate to the level of measurement of the data" (p. 79). Thus, it must be suitable for comparisons across various types of data, not limited to one particular type of data. Finally, the "sampling behavior should be known or at least computable" (p. 79).


Extant IRR Measures

Each of the most prevalent IRR indices has pros and cons when compared using Hayes and Krippendorff's (2007) criteria. For example, although percentage agreement is easy to calculate, it skews agreement in an overly positive direction. Although α is complex to compute, it accommodates more complex coding schemes. The following sections review the utility and shortcomings of each approach, providing a better understanding of the circumstances under which a particular IRR index may be most appropriately utilized.

Percentage Agreement. A common IRR index in the management literature is simple percentage agreement. Percentage agreement assesses IRR by simply dividing the number of agreements two coders have by the number of potential matches that exist:

% Agreement = (Σc Occ / n) × 100

where Occ represents each agreement coincidence and n represents the total number of coding decisions. Percentages are typically calculated for each variable in a coding scheme and then averaged, such that the overall agreement is known, as is the agreement for each specific variable. In addition to being a straightforward calculation, percent agreement can provide researchers with insights into problematic variables within the data (McHugh, 2012). For example, if percentage agreement for a particular variable is only 40%, this suggests that the variable should be revisited to determine the underlying reason for low agreement. However, although this measure is easily calculable, it fails to fully satisfy a majority of the five reliability criteria set forth by Hayes and Krippendorff (2007) and can be somewhat misleading.

The simplicity of calculating percentage agreements makes it a commonly used index of IRR in the management literature. However, the degree to which it is meaningful is situationally specific, i.e., when there are two well-trained coders, in the presence of nominal data, with fewer rather than a greater number of categories, and a low chance that guessing will take place (Scott, 1955). Thus, it is not a sufficient and reliable measure by itself. Percentage agreement does not consider the role that chance might play in ratings, incorrectly assuming that all raters make deliberate, rational decisions in assigning their ratings. Perhaps more alarmingly, the failure of this metric to account for chance makes it possible for agreement to seem acceptable even if both coders guessed at their categorizations. For example, if two coders employ two differing strategies for categorizing items, one coder categorizes every item as "A" and the other coder often, but not always, categorizes an item as "A," simple percentage agreement would suggest that they


are in agreement when they are, in fact, utilizing different strategies for their categorizations or, more disturbingly, simply guessing. Additionally, this calculation is predisposed towards coding schemes with fewer categories, whereby a higher percentage agreement will be achieved by chance when there are fewer categories to code. Further, percentage agreement is interpreted on a scale from 0–100%, with 100% indicating complete agreement and 0% complete disagreement, which is not likely unless coders are violating the condition of independence. Consequently, deviations from 100% agreement (complete agreement in all categories) become less meaningful, as the scale is not meaningfully interpretable. Failure of simple percentage agreement calculations to adequately assess reliability substantially limits the construct validity of the assessments scholars are using to synthesize data and draw conclusions and has been deemed unacceptable in determining IRR for decades (e.g., Krippendorff, 1980; Scott, 1955). Thus, it is advisable that management scholars explore other, more reliable indices for assessing IRR; several other metrics attempt to do so.

Scott's Pi. In an attempt to overcome the limitations of percent agreement, π (Scott, 1955) was developed as a means by which IRR might be calculated above and beyond simple percentages. Although percentage agreement is based on the number of matches that coders obtain out of a particular number of potential matches, π takes into consideration the role played by chance agreement. The probability of chance is based on the cumulative classification of probabilities, not the probabilities of individual rater classification (Gwet, 2002a), and provides a chance-corrected agreement index for assessing IRR. This metric considers the degree to which coders agree when they do not engage in guessing. Further, Scott (1955) proposed that the previous categorizations of items by coders be examined by calculating the observed number of items each coder has placed into a particular category. For example, the total number of items placed into the same category by two coders would be compared to the total number of items to categorize. The assumption is that if each of the coders were simply categorizing items by chance, each coder would have the same distribution (Artstein & Poesio, 2008; Scott, 1955).

π = (Po − Pe) / (1 − Pe)

where

Po = Σc Occ / n

and

Pe = Σ pi²

where pi represents the proportion "of the sample coded as belonging to the ith category" (Scott, 1955), Occ = each agreement coincidence (the diagonal cells in the coincidence matrix), n = the total number of coding decisions, and c = each coincidence marginal.
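To make these calculations concrete, a minimal Python sketch of percent agreement and Scott's π for the two-coder, nominal-data case could look as follows; the function names and the toy 0/1 codes are illustrative assumptions rather than data from any study discussed here.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Simple percentage agreement between two coders of nominal codes."""
    n = len(coder_a)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / n * 100

def scotts_pi(coder_a, coder_b):
    """Scott's pi for two coders and nominal codes.

    Po is the observed proportion of agreement; Pe is chance agreement
    based on the pooled distribution of categories across both coders.
    """
    n = len(coder_a)
    po = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    pooled = Counter(coder_a) + Counter(coder_b)      # both coders' codes together
    pe = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (po - pe) / (1 - pe)

# Hypothetical coding of 10 articles: 1 = "variable present", 0 = "not present"
a = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
b = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(percent_agreement(a, b))    # 80.0
print(round(scotts_pi(a, b), 3))  # 0.375
```

Note how, on these toy codes, chance correction pulls π well below the raw 80% agreement.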

The interpretability of π, though not universally agreed upon, ranges from 0.0 to 1.0, where (a) zero suggests that coder agreement would not be any worse than if the coding process were random, (b) one indicates that there is perfect agreement among coders, and (c) a negative outcome indicates that coder agreement was worse than it would have been simply by chance. Thus, π satisfies Hayes and Krippendorff's (2007) second and third requirements for reliability. Like percentage agreement, however, π is traditionally limited to only nominal data and a maximum of two coders. Though it does overcome some of the issues faced by percent agreement, it does not satisfy all five requirements set forth by Hayes and Krippendorff (2007), making it useful in limited conditions and, like percentage agreement, limiting construct validity.

Cohen's Kappa. Cohen's κ (1960) was developed to improve upon the shortcomings of percentage agreement and π. Though a generalized version of measuring pairwise agreement was proposed by Fleiss (1971), allowing for the use of multiple coders, like both percentage agreement and π, κ is limited to two coders. Further, it is limited to nominal data, and it does not allow for coders to be substitutable; thus it is unreliable in situations when the exchanging of coders is necessary. κ may be used when guessing is likely to be prevalent among coders or if the coders lack the training necessary to provide adequate comparisons. Although the basic formula for calculation remains the same as π, it differs in its assumption of chance agreement. The assumption here is that a coder's prior distributions will influence their assignment of items into a particular category, such that the probability of each coder assigning that item into the same category must be calculated and summed. Consequently, each coder has their own distribution.

κ = (Po − Pe) / (1 − Pe)

where

Po = Σc Occ / n

and

Pe = (1/n²) Σ pmi

where n represents the number of cases and Σ pmi represents the sum of the marginal products (Neuendorf, 2002). Like π, κ ranges from 0.0 to 1.0; however, because zero is defined as it would be for a correlation, "Kappa, by accepting the two observers' proclivity to use available categories idiosyncratically as baseline, fails to keep κ tied to the data whose reliability is in question. This has the effect of punishing observers for agreeing on the frequency distribution of categories used to describe the given phenomena (Brennan & Prediger, 1981; Zwick, 1988) and allowing systematic disagreements, which are evidence of unreliability, to inflate the value of κ (Krippendorff, 2004a, b)" (Hayes & Krippendorff, 2007, p. 81). Thus, like the measures discussed above, κ also fails to satisfy the five requirements outlined by Hayes and Krippendorff (2007).

Krippendorff's Alpha. Krippendorff's α (1970) was developed in an attempt to fill the remaining voids in reliability calculations left by percentage agreement, π, and κ. This IRR index overcomes the data limitations of the previous three by allowing for more than two observers and for the computation of agreement among ordinal, interval, and ratio data, as well as nominal data (Hayes & Krippendorff, 2007). Although earlier measures correct for percent agreement, α instead calculates disagreements. Consequently, it is gaining popularity as a standard IRR index that addresses the limitations of earlier IRR indices, providing researchers with a metric that is able to overcome a variety of concerns.

α = 1 − Do / De

where

Do = (1/n) Σc Σk Ock · metricδ²ck

and

De = [1 / (n(n − 1))] Σc Σk nc nk · metricδ²ck

where Do is the observed disagreement among values assigned to units of analysis and De is the disagreement one would expect when the coding of units is attributable to chance rather than to the properties of these units. Ock, nc, nk, and n refer to the frequencies of values in coincidence matrices (see Krippendorff, 2011, p. 1 for further description).
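A minimal sketch of nominal α for the simplest case, assuming two coders and no missing data, is shown below; it builds the coincidence matrix from the paired codes, and the function name and example codes are hypothetical. Ordinal, interval, or ratio data would require a different metric δ than the 0/1 disagreement used here.

```python
from collections import Counter

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for the simplest case: two coders, nominal
    categories, no missing data. Do is the observed disagreement taken
    from the coincidence matrix; De is the disagreement expected by chance."""
    units = list(zip(coder_a, coder_b))
    o = Counter()                       # coincidence matrix o[(c, k)]
    for a, b in units:                  # each unit contributes both ordered pairs
        o[(a, b)] += 1
        o[(b, a)] += 1
    n_c = Counter()                     # marginal totals per category
    for (c, _k), count in o.items():
        n_c[c] += count
    n = sum(n_c.values())               # = 2 * number of units
    d_o = sum(count for (c, k), count in o.items() if c != k) / n
    d_e = (n * n - sum(v * v for v in n_c.values())) / (n * (n - 1))
    return 1 - d_o / d_e

a = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
b = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(round(krippendorff_alpha_nominal(a, b), 3))   # 0.406
```

For the same toy codes as before, nominal α comes out only slightly above Scott's π, reflecting α's small-sample correction rather than any difference in logic.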


Alpha is useful for multiple coders and is appropriate for various data types (Hayes & Krippendorff, 2007). However, it is not an efficient measure in certain contexts. For example, it is not appropriate for a paired double coding scheme (De Swert, 2012), nor is it an appropriate measure for particular datasets. Because α is based on the chance of agreement, it is difficult to utilize this measure of reliability with skewed data. Due to the binary nature of intensive content or meta-analytic data (where the choices of "0" not present and "1" present exist), many variables may be categorized as 0s and 1s, with several variables resulting in a low representation of 1s. Thus, the degree of skewness can be problematic in calculating α, as well as κ, because "The κ statistic is effected by skewed distributions of categories (the prevalence problem) and by the degree to which the coders disagree (the bias problem)" (Eugenio & Glass, 2004). Feinstein and Cicchetti (1990, p. 543) further articulate this problem:

In a fourfold table showing binary agreement of two observers, the observed proportion of agreement, Po, can be paradoxically altered by the chance-corrected ratio that creates κ as an index of concordance. In one paradox, a high value of Po can be drastically lowered by a substantial imbalance in the table's marginal totals either vertically or horizontally. In the second paradox, (sic) κ will be higher with an asymmetrical rather than symmetrical imbalance in marginal totals, and with imperfect rather than perfect symmetry in the imbalance. An adjustment that substitutes Kmax for κ does not repair either problem, and seems to make the second one worse.

Despite these difficulties in assessing accurate IRR in varying contexts, α and κ continue to be the most legitimate IRR indices in premier management research (e.g., Desa, 2012; Heugens & Lander, 2009; Kostova & Roth, 2002; Tuggle et al., 2010). The calculations for these values elicit a paradoxical outcome, dependent upon the degree to which they exhibit trait prevalence, or the presence of a particular trait within a population, and the conditional probabilities of the coder properly classifying that trait as either "present" or "not present" (typically 1 or 0). This issue, the prevalence problem, as well as the bias problem (Eugenio & Glass, 2004), create difficulties in the accuracy of reliability statistics when few categories exist or when there is a substantial difference in the marginal distribution amongst coders. Consequently, meta-analyses and content analyses that utilize a binary approach in their collection and synthesis of data, or that have any substantially over-represented category such that the data are skewed, suffer from low, and thus "unreliable," levels of agreement due to the calculations' emphasis on the outcome being a product of chance (Gwet, 2008).

FILLING THE VOID: THE AC1 STATISTIC

The search for another option for calculating inter-rater agreement across several disciplines elicited attention to the AC1 statistic for IRR established by Gwet (2001, 2002a). The AC1 inter-rater reliability statistic "is a more robust chance-corrected statistic that consistently yields reliable results" as compared to κ (Gwet, 2002b, p. 5).


Furthermore, Gwet (2008) investigated the influence of the conditional probabilities of the coders on the prevalence of a specific trait using π and κ as metrics for inter-rater reliability.

γ̂1 = (pa − pe) / (1 − pe)

where

pa = [1 / (1 − pm)] Σ(k=1..q) pkk

and

pe = [1 / (q − 1)] Σ(k=1..q) πk(1 − πk)

with

πk = (pk+ + p+k) / 2
pm = the relative number of subjects rated by a single rater (i.e., one rating is missing)
pk+ = the relative number of subjects assigned to category k by rater A
p+k = the relative number of subjects assigned to category k by rater B
pkk = the relative number of subjects classified into category k by both raters
pk = the probability of a randomly selected rater classifying a randomly selected subject into category k
q = the number of categories in the nominal rating scale

AC1 may be used with any number of coders and any number of categories, scale values, or measures. It can accommodate missing data and any sample size, and it accounts for trait prevalence. Although AC1 may only be used to calculate IRR with nominal data, a similar statistic, AC2, may be used to calculate IRR with ordinal, interval, or ratio scale data. In our own review, we utilized a coding team of more than two coders, but with two coders for each article, multiple categories, and nominal data which demonstrated trait prevalence, that is, a large amount of data that were coded "1" by both coders, a condition deemed problematic with the other forms of IRR calculation. Due to the lack of ordinal, interval, and ratio data in our coding schemata, AC2 is beyond the scope of this paper and is suggested as a more comprehensive measure for datasets comprised of data that are not nominal in nature.

Calculations suggest that both π and κ produce realistic estimates of IRR when the prevalence of a trait is approximately .50. The farther the trait prevalence is above or below a value of .50, the less reliable and accurate the indices.
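A minimal sketch of the two-rater AC1 computation implied by these formulas, assuming no missing ratings so that the pm adjustment drops out, might look as follows; the function name and toy codes are illustrative only.

```python
from collections import Counter

def gwet_ac1(coder_a, coder_b):
    """Gwet's AC1 for two raters and nominal categories, assuming no
    missing ratings (so pa reduces to the observed proportion of agreement)."""
    n = len(coder_a)
    categories = sorted(set(coder_a) | set(coder_b))
    q = len(categories)                 # number of categories on the nominal scale
    pa = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    count_a, count_b = Counter(coder_a), Counter(coder_b)
    # pi_k: average of the two raters' marginal proportions for category k
    pi_k = {k: (count_a[k] + count_b[k]) / (2 * n) for k in categories}
    pe = sum(p * (1 - p) for p in pi_k.values()) / (q - 1)
    return (pa - pe) / (1 - pe)

a = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
b = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(round(gwet_ac1(a, b), 3))   # 0.706
```

On the same toy codes used in the earlier sketches, AC1 comes out near .71, noticeably higher than π (about .38) or nominal α (about .41), consistent with the pattern described above for data with pronounced trait prevalence.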

TABLE 5.1.  Guidelines for Best Selecting an IRR Index

Percent Agreement: # Coders/Observers = 2; Number of Categories, Scale Values, or Measures = Limited; Level of Measurement = Nominal; Missing Data = No; Known Prevalence (0 or 1) = No; Sample Size = Any; Skewed Data = No; Formula: % Agreement = (Σc Occ / n) × 100

Scott's π: # Coders/Observers = 2*; Number of Categories, Scale Values, or Measures = Limited; Level of Measurement = Nominal; Missing Data = No; Known Prevalence (0 or 1) = No; Sample Size = Not Small; Skewed Data = No; Formula: π = (Po − Pe) / (1 − Pe)

Cohen's κ: # Coders/Observers = 2*; Number of Categories, Scale Values, or Measures = Any; Level of Measurement = Nominal; Missing Data = No; Known Prevalence (0 or 1) = No; Sample Size = Not Small; Skewed Data = No; Formula: κ = (Po − Pe) / (1 − Pe)

Krippendorff's α: # Coders/Observers = Any; Number of Categories, Scale Values, or Measures = Any; Level of Measurement = Nominal, Ordinal, Interval, and Ratio; Missing Data = Yes; Known Prevalence (0 or 1) = No; Sample Size = Any; Skewed Data = No; Formula: α = 1 − Do / De

Gwet's AC1: # Coders/Observers = Any; Number of Categories, Scale Values, or Measures = Any; Level of Measurement = Nominal; Missing Data = Yes; Known Prevalence (0 or 1) = Yes; Sample Size = Any; Skewed Data = Yes; Formula: γ̂1 = (pa − pe) / (1 − pe)

Gwet's AC2: # Coders/Observers = Any; Number of Categories, Scale Values, or Measures = Any; Level of Measurement = Ordinal, interval, and ratio ratings; Missing Data = Yes; Known Prevalence (0 or 1) = Yes; Sample Size = Any; Skewed Data = Yes; Formula: γ̂1 = (pa − pe) / (1 − pe)

*If more than 2 coders exist, an extension called Fleiss' kappa can be used to assess IRR.


TABLE 5.2.  Guidelines for Best Selecting an IRR Index

Percent Agreement: Accommodates Multiple Coders/Observers = No; Bias Due to Number of Categories, Scale Values, or Measures = Yes; Level of Measurement = Nominal; Accommodates Missing Data = No; Accommodates Known Prevalence (0 or 1) = No; Sample Size Restrictions = No; Accommodates Skewed Data = No

Scott's π: Accommodates Multiple Coders/Observers = No; Bias Due to Number of Categories, Scale Values, or Measures = No; Level of Measurement = Nominal; Accommodates Missing Data = No; Accommodates Known Prevalence (0 or 1) = No; Sample Size Restrictions = Yes; Accommodates Skewed Data = No

Cohen's κ: Accommodates Multiple Coders/Observers = No; Bias Due to Number of Categories, Scale Values, or Measures = Yes; Level of Measurement = Nominal; Accommodates Missing Data = No; Accommodates Known Prevalence (0 or 1) = No; Sample Size Restrictions = Yes; Accommodates Skewed Data = No

Krippendorff's α: Accommodates Multiple Coders/Observers = Yes; Bias Due to Number of Categories, Scale Values, or Measures = No; Level of Measurement = Nominal, Ordinal, Interval, and Ratio; Accommodates Missing Data = Yes; Accommodates Known Prevalence (0 or 1) = No; Sample Size Restrictions = No; Accommodates Skewed Data = No

Gwet's AC1: Accommodates Multiple Coders/Observers = Yes; Bias Due to Number of Categories, Scale Values, or Measures = No; Level of Measurement = Nominal; Accommodates Missing Data = Yes; Accommodates Known Prevalence (0 or 1) = Yes; Sample Size Restrictions = No; Accommodates Skewed Data = Yes

Gwet's AC2: Accommodates Multiple Coders/Observers = Yes; Bias Due to Number of Categories, Scale Values, or Measures = No; Level of Measurement = Ordinal, interval, and ratio ratings; Accommodates Missing Data = Yes; Accommodates Known Prevalence (0 or 1) = Yes; Sample Size Restrictions = No; Accommodates Skewed Data = Yes



The closer the trait prevalence is to 0 or 1, the more difficult it is to have confidence in the results of these two IRR indices. Thus, Gwet (2008) suggests that the biases established by such calculations negatively influence the overall statistics, leading to the possibility of underestimating the actual inter-rater reliability by up to 100% (Gwet, 2008, p. 40). Consequently, the inclusion of chance-agreement probabilities as a core component in calculating these metrics is inappropriate when utilizing data that rely upon a coding scheme that has few categories, such as those using a binary categorical classification system. The AC1 statistic adequately addresses the unique nature of data beyond the scope of α, κ, and π. Thus, in an effort to compare these IRR indices, we calculated each using a dataset of skewed data.

INTER-RATER RELIABILITY MEASUREMENT: A COMPARATIVE EXAMPLE

To explore the differences among IRR indices (and to demonstrate the utility of the AC1 statistic) within an actual coding context, we conducted a review of over 440 employee turnover articles in eleven major journals in the fields of management and psychology from 1958–2010 (Academy of Management Journal, Administrative Science Quarterly, Human Relations, Human Resource Management, Journal of Applied Psychology, Journal of Management, Journal of Management Studies, Journal of Organizational Behavior, Journal of Vocational Behavior, Organizational Behavior and Human Decision Processes, and Personnel Psychology). A more thorough description of the data can be found in Allen et al. (2014), the original study for which these data were coded. For each article, we coded 130 different variables, a majority of which were confounded by trait prevalence and, subsequently, demanded a look beyond the traditional IRR indices.

The appropriateness of the IRR index to be used should be assessed based on the degree to which the metric can accommodate the data. In this example, the data were nominal; thus percent agreement, π, κ, α, or AC1 could be appropriate, based solely on the type of data. However, a paired double coding scheme was utilized, whereby three coders alternated coding such that two coders independently coded each article. Discrepancies were resolved amongst the coders, with a fourth coder resolving any discrepancies that were unable to be resolved by them. Consequently, this excludes κ as an appropriate metric, given that this metric fails to support the substitution of coders. Percent agreement and π are also excluded, given that they do not provide a means by which to adequately assess agreement among more than two independent coders. Beyond those requirements, we are left to consider α and AC1, both of which accommodate any number of independent coders; neither is restricted by the number of categories, scale values, or measures present; and both can be interpreted on a numerical scale between two points, thus making either of these a suitable choice.

However, a common issue within coding schemes is that of known prevalence (or trait prevalence), whereby coders are identifying the presence (coded 1) versus


absence (coded 0) of a particular trait, phenomenon, etc. The data in this example are representative of this problem, which cannot be sufficiently accommodated by α. The problems with α in this situation are laid bare in our study. Take, for example, our coding of the retrospective study variable in Table 5.3. Despite there being 96% agreement between coders, α is calculated at .28. This meager value is the result of the dichotomous nature of the variable and the use of a rotated coding scheme, whereby (a) Coders 1 and 2 code a set of articles, (b) Coders 2 and 3 code a set of articles, and (c) Coders 1 and 3 code a set of articles. The α index cannot account for this coding scheme and underestimates the degree of IRR. By contrast, Table 5.3 demonstrates that AC1 more accurately measures IRR than α when a rotating coder design is employed. Specifically, Table 5.3 provides calculations of the different IRR indices for 25 of the 130 variables that were coded for within each of the 440 studies in our sample. Given the laborious nature of content and meta-analysis, these types of designs are increasingly common, highlighting the utility of AC1 as an IRR measure.

We calculated each of the five IRR indices for our data in order to compare across several theoretical and methodological variables that were assigned as either present or not present in a particular article. Table 5.3 shows a comparison of all five IRR indices for these coded variables. For these comparisons, it is clear that there is a substantial range of IRR coefficients. A similar pattern can be seen for all three comparisons: π, κ, and α are all relatively close in value (to the thousandth place), whereas percent agreement and AC1 tend to be substantially higher, with percent agreement consistently remaining the highest coefficient, followed by AC1.

Although there is a lack of commonality of agreement regarding acceptable levels of IRR for each of these variables (e.g., Krippendorff, 1980; Perreault & Leigh, 1989; Popping, 1988), the general body of literature suggests that IRR coefficient values of greater than .90 are acceptable in virtually all situations and values of .80 or greater are acceptable in most situations. Values below .80 are subject to disagreement of acceptability among scholars (Neuendorf, 2002); however, some scholars suggest that values between .60 and .80 are moderately strong and sometimes acceptable (e.g., Landis & Koch, 1977). Other scholars suggest .70 as the cutoff for reliability (e.g., Cronbach, 1980; Frey, Botan, & Kreps, 2000). However, due to the relatively conservative nature of π and κ, lower thresholds are at times deemed acceptable.

Using these acceptance guidelines, it is clear that the interpretation of acceptance varies based upon which IRR index is being used. For the coding of studies grounded in the theories of Porter and Steers (1973), Lee and Mitchell (1991), and Rusbult and Farrell (1983), the IRR is acceptable regardless of which metric is used, though for π, κ, and α acceptance is borderline for Porter and Steers, whereas percentage agreement and AC1 are clearly acceptable. However, for the remaining theoretical variables that were coded, π, κ, and α are not deemed acceptable, though AC1 and percentage agreement offer evidence of acceptable IRR among

TABLE 5.3.  Coded Variable IRR Comparison

[Table 5.3 reports, for each of 25 nominally coded variables, the coded variable, its category, the type of data, and five IRR coefficients: %Agreement, Scott's π, Cohen's κ, Krippendorff's α, and Gwet's AC1. The coded variables span four categories: Measures (existing measures without adaptation, existing measures adapted, idiosyncratic, single item measures, multi item measures), Setting (lab, field, simulation), Study Design (cross-sectional, ex post archival, longitudinal, repeated measures, retrospective, static cohort), and Theories (Hulin et al., Lee & Mitchell, Maertz, March & Simon, Mobley, Mobley et al., Muchinsky & Morrow, Porter & Steers, Price, Rusbult & Farrell, Steers & Mowday).]



coders. This can likely be attributed to the binary coding scheme that was used and offers evidence for the importance of choosing the right metric. In all but one of these instances, the percentage agreement is above 90% and is arguably inflated based on the lack of consideration of chance. This inflation is further demonstrated upon examination of the variables which coded for measures. Although percentage agreement remains inflated, the remaining four IRR indices fail to demonstrate IRR across the board (single item measures) or show low reliability as calculated by π, κ, and α, and a barely acceptable AC1 (i.e., existing measures adapted, existing measures without adaptation). Thus, the IRR index used has a substantial influence on the degree to which IRR is considered acceptable or not.

RESEARCH AND PRACTICE

Having examined five IRR indices, it is clear that there is no one specific IRR index that is best suited for every situation and that the selection of an appropriate metric is dependent upon the data and coding processes themselves. In conjunction with endorsements in other disciplines, we recommend the use of α in a majority of content and meta-analytic contexts. However, although it builds and improves upon the metrics commonly used to date, the importance of chance probability in the calculation of α makes it a poor choice in contexts where a prevalence or bias problem may exist. Consequently, we recommend the AC1 statistic as an alternative to α for management scholars engaging in analytical synthesis research involving large numbers of categories coded as a function of 1 "present" or 0 "not present." Perhaps just as importantly, we also strongly recommend the AC1 index in situations where more than two coders work together in rotating fashion. Alpha is not suited for this type of scheme and underestimates IRR in this situation.

Further, although AC1 is correlated with percent agreement, it offers construct validity assurances that percent agreement does not. Although percent agreement may be a methodologically sound index of IRR when used in very simple coding scenarios (e.g., two coders, nominal data, limited number of categories, no missing or skewed data), Tables 5.1 and 5.2 demonstrate that it has several limitations in more complex situations that the AC1 index does not. Specifically, AC1 is most valuable under circumstances such as when more than two coders are present, when there are multiple categories to code, when data are not nominal, when missing data must be accounted for, and when data are skewed. These situations demonstrate the value of the AC1 index over and above simple percent agreement. Given that the coding of dichotomous variables and the use of multiple coders working in rotating fashion are becoming increasingly common in meta-analytic and content studies, the AC1 index should become more prevalent. The AC1 index is an IRR measure that is appropriate for more complex situations, as it overcomes the role of chance and improves construct validity above and beyond the capability of simple percent agreement in many situations.
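As a self-contained numerical illustration of this prevalence effect, the sketch below compares percent agreement, Cohen's κ, and Gwet's AC1 on hypothetical, heavily skewed dichotomous codes; the vectors are invented for illustration and are not the turnover codes summarized in Table 5.3.

```python
from collections import Counter

def compare_indices(coder_a, coder_b):
    """Percent agreement, Cohen's kappa, and Gwet's AC1 on the same pair of
    dichotomous coding vectors, to show how trait prevalence affects them."""
    n = len(coder_a)
    cats = sorted(set(coder_a) | set(coder_b))
    q = len(cats)
    po = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    ca, cb = Counter(coder_a), Counter(coder_b)
    marg_a = {k: ca[k] / n for k in cats}
    marg_b = {k: cb[k] / n for k in cats}
    # Cohen's kappa: chance agreement from each coder's own marginal distribution
    pe_kappa = sum(marg_a[k] * marg_b[k] for k in cats)
    kappa = (po - pe_kappa) / (1 - pe_kappa)
    # Gwet's AC1: chance agreement from the averaged marginals
    pi_k = {k: (marg_a[k] + marg_b[k]) / 2 for k in cats}
    pe_ac1 = sum(p * (1 - p) for p in pi_k.values()) / (q - 1)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return round(po * 100, 1), round(kappa, 2), round(ac1, 2)

# Hypothetical skewed coding of 50 articles: the trait is present in nearly all
a = [1] * 47 + [0, 0, 1]
b = [1] * 47 + [0, 1, 0]
print(compare_indices(a, b))   # (96.0, 0.48, 0.96)
```

With roughly 96% raw agreement, κ falls below .50 while AC1 remains near .96, the kind of divergence attributed here to trait prevalence.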


Implications for Practice

IRR is not only imperative for establishing construct validity in management research; it also has uses within HRM practice. Interviews and performance evaluations, which tend to have multiple "coders," rely on agreement in order to make accurate assessments leading to job offer, promotion, and termination decisions. Implementing appropriate metrics for assessing IRR can aid organizations in ensuring that these decisions are legally permissible, demonstrating validity and reliability. For instance, ad hoc hiring committees are often made up of multiple members who rotate in and out of interviews. When assessing agreement, using α would inaccurately report lower agreement about candidates and open the organization up to legal questions about the validity of its hiring process. Employing AC1 would address this issue and put the firm's hiring process on more sound procedural footing. Further, many hiring decisions are also dichotomous (e.g., "acceptable candidate versus unacceptable candidate" or "qualified candidate versus unqualified candidate"). The use of the AC1 statistic in calculating agreement also has value here, as α and other forms of IRR underestimate agreement when the variable of interest is dichotomous. In this sense, the choice of IRR metric has real financial and legal consequences for HRM practitioners. Understanding the circumstances under which α is not appropriate and AC1 should be used instead could have significant practical implications for HR managers. Yes-versus-no decisions and committee decisions are common in organizational life, making an understanding of the AC1 statistic increasingly important in organizations.

This paper provides a review of the most commonly used IRR indices in the management literature, building upon the flaws that have been previously identified regarding the usefulness of percentage agreement, π, κ, and α as indicators of IRR. Although each of these reliability metrics provides important information about the agreement present amongst coders and is certainly prevalent in the management literature, they are only appropriate to use in certain contexts. Further, we highlight the utility of the AC1 statistic, which provides reliability information in a different context, above and beyond that provided by α. Specifically, although α is the best choice in a wide variety of circumstances, AC1 demonstrates utility over α when there is a rotation of multiple coders, as well as accommodating dichotomous data, common phenomena in both the HRM literature and in practice. Consequently, we are not suggesting that there is a best or worst metric to use, but instead that the choice of IRR index should be a function of the coding scheme used, the number of coders used in observing and classifying data, and the organization of the data. The validity of the conclusions we make as scholars relies upon the degree to which we are able to rely upon our measures, without which we cannot provide meaningful suggestions for research or practice.


REFERENCES

Aguinis, H., Dalton, D. R., Bosco, F. A., Pierce, C. A., & Dalton, C. M. (2011). Meta-analytic choices and judgment calls: Implications for theory building and testing, obtained effect sizes, and scholarly impact. Journal of Management, 37, 5–38.
Allen, D. G., Hancock, J. I., Vardaman, J. M., & McKee, D. L. N. (2014). Analytical mindsets in turnover research. Journal of Organizational Behavior, 35, S61–S86.
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34, 555–596.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 349–381). San Francisco, CA: Jossey-Bass.
Borenstein, M., Hedges, L. V., Higgins, P. T., & Rothstein, H. R. (2011). Introduction to meta-analysis. West Sussex, UK: John Wiley & Sons.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687–699.
Cohen, J. A. (1960). Coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cronbach, L. J. (1980). Validity on parole: How can we go straight. In W. B. Schrader (Ed.), New directions for testing and measurement: Measuring achievement over a decade (pp. 99–108). San Francisco, CA: Jossey-Bass.
Desa, G. (2012). Resource mobilization in international social entrepreneurship: Bricolage as a mechanism of institutional transformation. Entrepreneurship Theory and Practice, 36, 727–751.
De Swert, K. (2012). Calculating inter-coder reliability in media content analysis using Krippendorff's Alpha. Center for Politics and Communication, 1–15. Retrieved from https://www.polcomm.org/wp-content/uploads/ICR01022012.pdf
Eby, L. T., Casper, W. J., Lockwood, A., Bordeaux, C., & Brinley, A. (2005). Work and family research in IO/OB: Content analysis and review of the literature (1980–2002). Journal of Vocational Behavior, 66, 124–197.
Eugenio, B. D., & Glass, M. (2004). The kappa statistic: A second look. Computational Linguistics, 30, 95–101.
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378.
Frey, L., Botan, C. H., & Kreps, G. (2000). Investigating communication. New York, NY: Allyn & Bacon.
Gwet, K. (2001). Handbook of inter-rater reliability: How to estimate the level of agreement between two or multiple raters. Gaithersburg, MD: STATAXIS Publishing Company.
Gwet, K. (2002a). Inter-rater reliability: Dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter-rater Reliability Assessment Series, 2, 1–9.

Gwet, K. (2002b). Kappa statistic is not satisfactory for assessing extent of agreement between raters. Statistical Methods for Inter-rater Reliability Assessment Series, 1, 1–5.
Gwet, K. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Hancock, J. I., Allen, D. A., Bosco, F. A., McDaniel, K. R., & Pierce, C. A. (2013). Meta-analytic review of employee turnover as a predictor of firm performance. Journal of Management, 39, 573–603.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77–89.
Heugens, P. P., & Lander, M. W. (2009). Structure! Agency! (And other quarrels): A meta-analysis of institutional theories of organization. Academy of Management Journal, 52, 61–85.
Hoch, J. E., Bommer, W. H., Dulebohn, J. H., & Wu, D. (2018). Do ethical, authentic, and servant leadership explain variance above and beyond transformational leadership? A meta-analysis. Journal of Management, 44, 501–529.
Judge, T. A., & Ilies, R. (2002). Relationship of personality to performance motivation: A meta-analytic review. Journal of Applied Psychology, 87, 797–807.
Koenig, A. M., Eagly, A. H., Mitchell, A. A., & Ristikari, T. (2011). Are leader stereotypes masculine? A meta-analysis of three research paradigms. Psychological Bulletin, 137, 616–642.
Kostova, T., & Roth, K. (2002). Adoption of an organizational practice by subsidiaries of multinational corporations: Institutional and relational effects. Academy of Management Journal, 45, 215–233.
Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30, 61–70.
Krippendorff, K. (1980). Reliability. In K. Krippendorff, Content analysis: An introduction to its methodology (pp. 129–154). Beverly Hills, CA: Sage Publications.
Krippendorff, K. (2004a). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Krippendorff, K. (2004b). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30, 411–433.
Krippendorff, K. (2011). Computing Krippendorff's alpha-reliability. Retrieved from http://repository.upenn.edu/asc_papers/43
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
LeBreton, J. M., Burgess, J. R., Kaiser, R. B., Atchley, E. K., & James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6, 80–128.
Lee, T. W., & Mitchell, T. R. (1991). The unfolding effects of organizational commitment and anticipated job satisfaction on voluntary employee turnover. Motivation and Emotion, 15, 99–121.
Mackey, J. D., Frieder, R. E., Brees, J. R., & Martinko, M. J. (2017). Abusive supervision: A meta-analysis and empirical review. Journal of Management, 43, 1940–1965.
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22, 276–282.

Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Perreault, W. D., Jr., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135–148.
Pindek, S., Kessler, S. R., & Spector, P. E. (2017). A quantitative and qualitative review of what meta-analyses have contributed to our understanding of human resource management. Human Resource Management Review, 27, 26–38.
Popping, R. (1988). On agreement indices for nominal data. In Sociometric research (pp. 90–105). London, UK: Palgrave Macmillan.
Porter, L. W., & Steers, R. M. (1973). Organizational, work, and personal factors in employee turnover and absenteeism. Psychological Bulletin, 80, 151.
Rusbult, C. E., & Farrell, D. (1983). A longitudinal test of the investment model: The impact on job satisfaction, job commitment, and turnover of variations in rewards, costs, alternatives, and investments. Journal of Applied Psychology, 68, 429.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Tuggle, C. S., Schnatterly, K., & Johnson, R. A. (2010). Attention patterns in the boardroom: How board composition and processes affect discussion of entrepreneurial issues. Academy of Management Journal, 53, 550–571.
Tuggle, C. S., Sirmon, D. G., Reutzel, C. R., & Bierman, L. (2010). Commanding board of director attention: Investigating how organizational performance and CEO duality affect board members' attention to monitoring. Strategic Management Journal, 31, 946–968.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin, 103, 374–378.

CHAPTER 6

EVALUATING JOB PERFORMANCE MEASURES
Criteria for Criteria
Angelo S. DeNisi and Kevin R. Murphy

Research aimed at improving performance appraisals dates back almost 100 years, and there have been a number of reviews of this literature published over the years (e.g., Austin & Villanova, 1992; Bretz, Milkovich, & Read, 1992; DeNisi & Murphy, 2017; DeNisi & Sonesh, 2011; Landy & Farr, 1980; Smith, 1976). Each of these papers has chronicled the research conducted to help us better understand the processes involved in performance appraisals, and how this understanding could help to improve the overall process. Although these reviews were done at different points in time, the goal in each case was to draw conclusions concerning how to make appraisal systems more effective. However, while each of these reviews included studies comparing and contrasting different appraisal systems, they all simply accepted whatever criterion was used for those comparisons, and, based on those comparisons, made recommendations on how to conduct better appraisals. This is an issue, since many of the studies reviewed used different criterion measures for their comparisons. But this issue is even more serious when we realize that the reason why these studies and reviews have used different criterion measures is that there is no consensus on what is the "best" criterion measure to use when comparing appraisal
systems. Stated simply, if we want to make a statement about how "system A" was better than "system B," we need some criterion or criteria against which to compare the systems. Although this would seem to be a basic issue from a research methods point of view, the truth is that there have been many criterion measures used over time, but almost all of them are subject to serious criticism. Therefore, despite 100 years of research on performance appraisal, there is actually very little we can be certain about in terms of identifying the "best" approaches.
The present paper differs from those earlier review articles because it focuses specifically on the problem of criterion identification. Therefore, our review is not organized according to which rating formats or systems were compared; rather, we organized our review around which criterion measures were used to make those comparisons. Our goal, then, is not to determine which system is best, but to identify problems with the criterion measures used in the past, and to propose a somewhat different approach to try to identify a more useful and credible criterion measure. Therefore, we begin with a discussion of the various criterion measures that have typically been used in comparing and evaluating appraisal studies. In each case, we note the problems that have been identified with their use, and why they may not really be useful as criterion measures. We then move on to lay out a comprehensive framework for evaluating the construct validity of job performance measures that we believe can serve as the basis for more useful measures to be used in this research.
HISTORICAL REVIEW
Most measures of job performance, namely performance ratings (and sometimes rankings of employees in terms of performance), rely on the judgments of supervisors or other evaluators. The question of how to assess the validity of these judgments has been a recurring concern in performance appraisal research and has often been discussed under the heading of "the criterion problem" (cf., Austin & Villanova, 1992). Throughout the first 50 years of serious research in this area, performance measures of this sort were almost always evaluated relative to two types of criteria: (1) measures of agreement, or (2) measures based on the distributions of ratings (i.e., rater error measures). There are issues associated with each type of measure, and we shall discuss these in turn.
Agreement Measures
The reliance upon agreement measures as criteria for evaluating appraisal systems has a long history. Some type of inter-rater agreement measure has been used to evaluate appraisal systems from as early as the 1930s (e.g., Remmers, 1934), continuing through the 1950s (e.g., Bendig, 1953), the 1960s (e.g., Smith & Kendall, 1963), and the 1970s (e.g., Blanz & Ghiselli, 1972). The underlying assumption was that agreement indicated reliable ratings, and, since reliable ratings
are a prerequisite for valid ratings, agreement could be used as a proxy for validity and accuracy. But, in fact, the situation was much more complex. Viswesvaran, Ones, and Schmidt (1996) reviewed several methods of estimating the reliability (or the freedom from random measurement error) of job performance ratings and argued that inter-rater correlations provided the best estimate of the reliability of performance ratings (see also Ones, Viswesvaran, & Schmidt, 2008; Schmidt, Viswesvaran, & Ones, 2000). The correlations between ratings given to the same employees by two separate raters are typically low, however, and others (e.g., LeBreton, Scherer, & James, 2014; Murphy & DeShon, 2000) have argued that treating inter-rater correlations as measures of reliability makes sense only if you believe that agreements between raters are due solely to true scores and disagreements are due solely to random measurement error, a proposition that strikes us as unlikely.
A number of studies have examined the roles of systematic and random error in performance ratings, as well as methods of estimating systematic and random error (Fleenor, Fleenor, & Grossnickle, 1996; Greguras & Robie, 1998; Hoffman, Lance, Bynum, & Gentry, 2010; Hoffman & Woehr, 2009; Kasten & Nevo, 2008; Lance, 1994; Lance, Baranik, Lau, & Scharlau, 2009; Lance, Teachout, & Donnelly, 1992; Mount, Judge, Scullen, Sytsma, & Hezlett, 1998; Murphy, 2008; O'Neill, McLarnon, & Carswell, 2015; Putka, Le, McCloy, & Diaz, 2008; Saal, Downey, & Lahey, 1980; Scullen, Mount, & Goff, 2000; Woehr, Sheehan, & Bennett, 2005). In general, these studies suggest that there is considerably less random measurement error in performance ratings than studies of inter-rater correlation would suggest. For example, Scullen, Mount, and Goff (2000) and Greguras and Robie (1998) examined sources of variability in ratings obtained from multiple raters and found that the largest source of variance in ratings is due to raters, some of which is likely due to biases or general rater tendencies (e.g., leniency). There have been a number of advances in research on inter-rater agreement, some involving multi-level analyses (e.g., Conway, 1998) or the application of generalizability theory (e.g., Greguras, Robie, Schleicher, & Goff, 2003). Others have examined sources of variability in peer ratings (e.g., Dierdorff & Surface, 2007) and in multi-rater systems such as 360 degree appraisals (e.g., Hoffman, Lance, Bynum, & Gentry, 2010; Woehr, Sheehan, & Bennett, 2005). In all these cases, results indicated that substantial portions of the variability in ratings were due to systematic rather than random sources, undercutting the claim (e.g., Schmidt et al., 2000) that performance ratings exhibit a substantial amount of random measurement error.
Studies of inter-rater agreement moved from the question of whether raters agree to considering why and under what circumstances they agree or disagree. For example, there is a robust literature dealing with differences in ratings collected from different sources (e.g., supervisors, peers). In general, self-ratings were found to be typically higher than ratings from others (Valle & Bozeman, 2002), and agreement between subordinates, peers, and supervisors was typically modest, with uncorrected correlations in the .20s and .30s (Conway & Huffcutt, 1997; Valle & Bozeman, 2002).
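To make the variance-partitioning logic behind these generalizability-based arguments concrete, the sketch below (ours, not part of the original chapter, and using hypothetical data) decomposes a fully crossed ratee-by-rater matrix of ratings into ratee, rater, and residual components. The ratee share is a simple single-rater generalizability estimate, while the rater share captures the systematic rater main effects (e.g., leniency) that a raw inter-rater correlation would lump together with random error.

```python
# Minimal sketch (hypothetical data): decompose a fully crossed ratee x rater
# matrix of ratings into ratee, rater, and residual variance components,
# in the spirit of the generalizability analyses cited above.
import numpy as np

ratings = np.array([  # rows = ratees, columns = raters (1-5 scale)
    [4, 5, 4],
    [3, 4, 3],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 2],
], dtype=float)

n_ratees, n_raters = ratings.shape
grand = ratings.mean()
ratee_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# Mean squares from a two-way ANOVA without replication
ms_ratee = n_raters * np.sum((ratee_means - grand) ** 2) / (n_ratees - 1)
ms_rater = n_ratees * np.sum((rater_means - grand) ** 2) / (n_raters - 1)
resid = ratings - ratee_means[:, None] - rater_means[None, :] + grand
ms_resid = np.sum(resid ** 2) / ((n_ratees - 1) * (n_raters - 1))

# Expected-mean-square estimates of the variance components
var_resid = ms_resid
var_ratee = max((ms_ratee - ms_resid) / n_raters, 0.0)
var_rater = max((ms_rater - ms_resid) / n_ratees, 0.0)

total = var_ratee + var_rater + var_resid
print(f"ratee (true-score) share: {var_ratee / total:.2f}")
print(f"rater (systematic) share: {var_rater / total:.2f}")
print(f"residual share:           {var_resid / total:.2f}")
# The ratee share is a single-rater generalizability estimate; the rater
# share reflects systematic rater main effects (e.g., leniency) that a
# simple inter-rater correlation would treat as "error."
```

With more raters per ratee, the same components can be combined to project the reliability of averaged ratings, which is how generalizability analyses of multi-source rating systems typically proceed.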
However, given the potentially low levels of reliability for each source, it is likely that the level of agreement among sources is actually somewhat higher. Harris and Schaubroeck (1988) reported corrected correlations between sources in the mid .30s to low .60s. Viswesvaran, Schmidt, and Ones (2002) applied a more aggressive set of corrections and suggested that in ratings of overall performance and some specific performance dimensions, peers and supervisors show quite high levels of agreement. This conclusion, however, depends heavily on the assumption that almost half of the variance in performance ratings represents random measurement error, a conclusion that has been shown to be incorrect in studies of the generalizability of performance ratings.
Second, it is possible that raters agree about some things and disagree about others. For example, it is commonly assumed that raters are more likely to agree on specific, observable aspects of behavior than on more abstract dimensions (Borman, 1979). Roch, Paquin, and Littlejohn (2009) conducted two studies to test this proposition, and their results suggested that the opposite is true. Inter-rater agreement was actually higher for dimensions that are less observable or that are judged to be more difficult to rate. Roch et al. (2009) speculated that this seemingly paradoxical finding may reflect the fact that when there is less concrete behavioral information available, raters fall back on their general impressions of ratees when rating specific performance dimensions. Other studies (e.g., Sanchez & De La Torre, 1996) have reported that accuracy in observing behavior was positively correlated with accuracy in evaluating performance. That is, raters who had accurate recall of what they had observed also appeared to be more accurate in evaluating ratees. Unfortunately, however, accuracy in behavioral observation did not appear to be related in any simple way to the degree to which the behavior in question is observable or easy to rate.
Rater Error Measures
The most commonly used criterion measures in appraisal research, referred to as rater error measures, are related to the distributions of ratings. It has often been assumed that the absence of these errors is evidence for the validity and accuracy of performance ratings, although as we note later in this paper, this assumption does not seem fully tenable. The reliance upon rater error measures as criteria dates back to the earliest research on performance appraisals (Bingham, 1939; Kingsbury, 1922, 1933; Thorndike, 1920), and the three most common rater errors (leniency, range restriction, and halo error) have continued to influence the ways in which ratings data are analyzed (e.g., Saal, Downey, & Lahey, 1980; Sulsky & Balzer, 1988). Research dealing with these criterion measures has, in fact, accounted for a great deal of the research on performance appraisals through much of the 20th century. The sheer volume of this research makes it difficult to discuss this literature in its
entirety, so it is useful to deal separately with two major categories of "rater errors": (a) distributional errors across ratees, and (b) correlational errors within ratees.
Distributional Errors. Measures of distributional errors rely on the assumption that if distributions of performance ratings deviate from some ideal, this indicates that raters are making particular types of errors in their evaluations. Although, in theory, any distribution might be viewed as ideal, in practice, the ideal distribution for the purpose of determining whether or not rater errors have occurred has been the normal distribution. Thus, given a group of ratees, any deviation in their ratings from a normal distribution was seen as evidence of a rating error. This deviation could take the form of "too many" ratees being rated as excellent ("leniency error"), "too many" ratees being rated as poor ("severity error"), or "too many" ratees being rated as average ("central tendency error").
Obviously, the logic of this approach depends upon the assumption that the ideal (normal) distribution was correct, so that any other distribution that was obtained was due to some type of error, but this underlying assumption has been questioned on several grounds. First, the true distribution of the performance of the group of employees who report to a single supervisor is almost always unknown. If it were known, we would not need subjective evaluations and could simply rely upon the "true" ratings. Therefore, it is impossible to assess whether or not there are "too many" ratees who are evaluated at any point on the scale. Second, if there were an ideal distribution of ratings, there is no justification for the assumption that it is normal and centered around the scale midpoint (Bernardin & Beatty, 1984). Rather, organizations exert considerable effort to assure that the distribution of performance is not normal. Saal, Downey, and Lahey (1980) point out that a variety of activities, ranging from personnel selection to training, are designed to produce a skewed distribution of performance, so that most (if not all) employees should be at least above the midpoint on many evaluation scales. Finally, the use of distributional data as an indicator of errors assumes there are no true differences in performance across work groups (cf., Murphy & Balzer, 1989). In fact, a rater who gives subordinates higher than "normal" ratings may not be more lenient but may simply have a better group of subordinates who are actually doing a better job and so deserve higher ratings.
Furthermore, recent research has challenged the notion that job performance is normally distributed in almost any situation (Aguinis & O'Boyle, 2014; Aguinis, O'Boyle, Gonzalez-Mulé, & Joo, 2016; Joo, Aguinis, & Bradley, 2017). These authors argue that in many settings, a small number of high performers (often referred to as "stars") contribute disproportionately to the productivity of a group, creating a distribution that is far from normal. Beck, Beatty, and Sackett (2014) suggest that the distribution of performance might depend substantially on the type of performance that is measured, and it is reasonable in many cases to assume nearly normal distributions. The argument over the appropriate distributional assumptions is a complex one, but the very fact that this argument exists is
a strong indication that we cannot say with confidence that ratings are too high, or that there is too little variance in, or too much intercorrelation among, ratings of different performance dimensions absent reliable knowledge of how ratings should be distributed. In the eyes of some critics (e.g., Murphy & Balzer, 1989), the lack of reliable knowledge about the true distribution of performance in the particular workgroup evaluated by any particular rater makes distributional error measures highly suspect.
Correlational Errors. Measures of correlational error are built around a similar assumption: that there is some ideal level of correlation among the ratings that each supervisor assigns. Specifically, it is often assumed that different aspects or dimensions of performance should be independent, or at least should show low levels of intercorrelation. Therefore, when raters give ratings of performance that turn out to be correlated, this is thought to indicate a rating error. This inflation of the intercorrelations among dimensions is referred to as halo error. Cooper (1981b) suggests that halo is likely to be present in virtually every type of rating instrument. There is an extensive body of research examining halo errors in rating, and a number of different measures, definitions, and models of halo error have been proposed (Balzer & Sulsky, 1992; Cooper, 1981a, 1981b; Lance, LaPointe, & Stewart, 1994; Murphy & Anhalt, 1992; Murphy, Jako, & Anhalt, 1993; Nathan & Tippins, 1989; Solomonson & Lance, 1997). Although there was disagreement on a number of points across these proposals, there was substantial agreement on several important points. First, the observed correlation between ratings of separate performance dimensions reflects both actual consistencies in performance (referred to as "true halo," or the actual degree of correlation between two conceptually distinct performance dimensions) and errors in processing information about ratees or in translating that information into performance ratings (referred to as "illusory halo"). Clearly, the degree of true halo does not indicate any type of rating error but instead reflects the true covariance across different parts of a job; it is only the illusory halo that reflects a potential rater error (Bingham actually made the same point in 1939). Second, this illusory halo is driven in large part by raters' tendency to rely on general impressions and global evaluations when rating specific aspects of performance (e.g., Balzer & Sulsky, 1992; Jennings, Palmer, & Thomas, 2004; Lance, LaPointe, & Stewart, 1994; Murphy & Anhalt, 1992). Third, all agree that it is very difficult, if not impossible, to separate true halo from illusory halo. Even in cases where the expected correlation between two rating dimensions is known for the population in general (for example, in the population as a whole several of the Big Five personality dimensions are believed to be essentially uncorrelated), that does not mean that the performance of a small group of ratees on these dimensions will show the same pattern of true independence.
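For readers who have not worked with these indices, the sketch below (ours, not the original authors', and based on hypothetical data) shows how the traditional distributional and correlational indices are typically operationalized for a single rater: leniency as the mean rating relative to the scale midpoint, central tendency or range restriction as the spread of the ratings, and halo as the average intercorrelation among dimension ratings across ratees.

```python
# Minimal sketch (hypothetical data): traditional "rater error" indices
# for a single rater who has rated several ratees on several dimensions
# using a 1-5 scale.
import numpy as np

scale_min, scale_max = 1, 5
midpoint = (scale_min + scale_max) / 2

# rows = ratees, columns = performance dimensions
ratings = np.array([
    [4, 4, 5],
    [4, 5, 4],
    [3, 4, 4],
    [5, 5, 5],
    [4, 4, 4],
], dtype=float)

# Leniency/severity: mean rating relative to the scale midpoint
leniency = ratings.mean() - midpoint

# Central tendency / range restriction: spread of the ratings
spread = ratings.std(ddof=1)

# Halo: average intercorrelation among dimensions, computed across ratees
dim_corr = np.corrcoef(ratings, rowvar=False)
halo = dim_corr[np.triu_indices_from(dim_corr, k=1)].mean()

print(f"leniency index (mean minus midpoint): {leniency:.2f}")
print(f"rating SD (low = central tendency / restriction): {spread:.2f}")
print(f"mean dimension intercorrelation (halo index): {halo:.2f}")
# Note that nothing in these computations can distinguish "error" from
# genuine features of the ratee group (true skew in performance, true
# covariance among dimensions).
```

Nothing in these computations separates error from genuine features of the ratee group, which is exactly the interpretive problem taken up next.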
There is an emerging consensus that measures that are based on the distributions and the intercorrelations among the ratings given by an individual rater have proved essentially useless for evaluating performance ratings (DeNisi & Murphy, 2017; Murphy & Balzer, 1989). First, we cannot say with any confidence that a particular supervisor's ratings are too high or too highly intercorrelated unless we know a good deal about the true level of performance, and if we knew this, we would not need supervisory performance ratings. Second, the label "rater error" is misleading. It is far from clear that supervisors who give their subordinates high ratings are making a mistake. There might be several good reasons to give subordinates high ratings (e.g., to give them opportunities to obtain valued rewards, to maintain good relationships with subordinates), and raters who know that high ratings are not truly deserved might nevertheless conclude that it is better to give high ratings than to give low ones (Murphy & Cleveland, 1995; Murphy, Cleveland, & Hanscom, 2018). Finally, as we shall see, there is no evidence to support the assumption that rating errors have much to do with rating accuracy, an assumption that has long served as the basis for the use of rating error measures as criteria for evaluating appraisal systems.
Rating Accuracy Measures
Although not always formally acknowledged, researchers used agreement measures and rating error measures to evaluate appraisals because these were seen as proxies for rating accuracy. It was long assumed that we could not assess rating accuracy directly, and so we needed to use measures that could serve as reasonable proxies for accuracy. But, starting in the late 1970s, laboratory studies of performance ratings made it possible to assess rating accuracy directly. This stream of research focused largely on rater cognitive processes, following suggestions from Landy and Farr (1980), although the earliest research using accuracy measures actually predated the publication of this article. In it, Landy and Farr concluded that research on rating scale formats had not been very useful and suggested, instead, that research focus more on raters themselves and how they made decisions about which ratings to give. In order to study the cognitive processes involved in evaluating performance, researchers were forced to rely more broadly on laboratory research, where it might be possible to collect data that isolated particular processes. This move to the lab also made it possible to develop and use direct measures of rating accuracy as criteria, by developing "true scores" against which actual ratings could be compared. Much of this research began with Borman (1978) and continued through the work of Murphy and colleagues (e.g., Murphy & Balzer, 1986; Murphy, Martin, & Garcia, 1982) and DeNisi and colleagues (e.g., DeNisi, Robbins, & Cafferty, 1989; Williams, DeNisi, Meglino, & Cafferty, 1986). The ability to compute the accuracy of a set of ratings allowed Murphy and colleagues (Murphy & Balzer, 1989) to assess the relation between rating accuracy and rating error measures, but it also allowed for different criteria for comparing rating scales as well as rater training techniques, and it led to more complex ways of assessing rating accuracy.
Borman and his associates launched a sustained wave of research on rating accuracy, using videotapes of ratees performing various tasks, which could then be used as stimulus material for rating studies. Borman's (1977, 1978, 1979) research was based on the assumption that well-trained raters, observing these tapes under optimal conditions, could provide a set of ratings which could then be pooled and averaged (to remove potential individual biases and processing errors) to generate "true scores" which would be used as the standard against which all other ratings could be compared. That is, these pooled ratings, collected under optimal conditions, could be considered to be an accurate assessment of performance, which could then be used as criterion measures in subsequent research. Rating accuracy measures similar to those developed by Borman were widely used in appraisal studies focusing on rater cognitive processes (Becker & Cardy, 1986; Cardy & Dobbins, 1986; DeNisi, Robbins, & Cafferty, 1989; McIntyre, Smith, & Hassett, 1984; Murphy & Balzer, 1986; Murphy, Balzer, Kellam, & Armstrong, 1984; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982; Pulakos, 1986; Williams et al., 1986), but were also used in studies comparing different methods of rater training (e.g., Pulakos, 1986), and even in studies comparing different types of rating scales (e.g., DeNisi, Robbins, & Summers, 1997). A review of research on rating accuracy measures can be found in Sulsky and Balzer (1988).
Different Types of Accuracy. Attempts to increase the accuracy of performance ratings are complicated by the fact that there are many different types of accuracy. At a basic level, Murphy (1991) argued for making a distinction between behavioral accuracy and classification accuracy. Behavioral accuracy referred to the ability to discriminate between good and poor incidents of performance, while classification accuracy referred to the ability to discriminate between the best performer, the second-best performer, and so on. Murphy (1991) also argued that the purpose for which the ratings were to be used should dictate which type of accuracy was more important, but it seems clear that these measures answer different questions about rating accuracy and that both are likely to be important. At a more complex level, Cronbach (1955) had noted that there were several ways we could define the agreement between a set of ratings provided by a rater and a set of true scores. Specifically, he defined four separate components of accuracy: (1) Elevation, the accuracy of the average rating over all ratees and dimensions; (2) Differential Elevation, the accuracy in discriminating among ratees; (3) Stereotype Accuracy, accuracy in discriminating among performance dimensions across all ratees; and (4) Differential Accuracy, accuracy in detecting ratee differences in patterns of performance, such as diagnosing individual strengths and weaknesses. Research suggests that the different accuracy measures are not highly correlated (Sulsky & Balzer, 1988), so that the conclusions one draws about the accuracy of a set of ratings may depend more upon the choice of accuracy measures than on a rater's ability to evaluate his or her subordinates (Becker & Cardy, 1986).
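To show how these components are typically obtained in practice, the sketch below (ours, with hypothetical data) implements one common distance-based operationalization of Cronbach's four components, in which a rater's matrix of ratings (ratees by dimensions) is compared with a matrix of true scores such as the pooled expert ratings described above. Defined this way, the four squared components sum exactly to the overall mean squared discrepancy between the ratings and the true scores.

```python
# Minimal sketch (hypothetical data): a distance-based operationalization of
# Cronbach's (1955) accuracy components. Rows = ratees, columns = dimensions;
# "true" would typically be pooled expert ratings.
import numpy as np

obs = np.array([[4, 3, 5], [3, 3, 4], [2, 4, 3], [5, 4, 4]], dtype=float)
true = np.array([[4, 4, 5], [3, 2, 4], [2, 3, 3], [4, 4, 5]], dtype=float)

def accuracy_components(obs, true):
    # difference between grand means -> elevation
    d_grand = obs.mean() - true.mean()
    # differences between centered ratee means -> differential elevation
    d_ratee = (obs.mean(axis=1) - obs.mean()) - (true.mean(axis=1) - true.mean())
    # differences between centered dimension means -> stereotype accuracy
    d_dim = (obs.mean(axis=0) - obs.mean()) - (true.mean(axis=0) - true.mean())
    # double-centered residuals -> differential accuracy (ratee x dimension patterning)
    obs_resid = obs - obs.mean(axis=1, keepdims=True) - obs.mean(axis=0, keepdims=True) + obs.mean()
    true_resid = true - true.mean(axis=1, keepdims=True) - true.mean(axis=0, keepdims=True) + true.mean()
    return {
        "elevation": d_grand ** 2,
        "differential_elevation": np.mean(d_ratee ** 2),
        "stereotype_accuracy": np.mean(d_dim ** 2),
        "differential_accuracy": np.mean((obs_resid - true_resid) ** 2),
    }

components = accuracy_components(obs, true)
for name, value in components.items():
    print(f"{name:>24s}: {value:.3f}")  # lower values = more accurate

# Sanity check: the four components sum to the overall mean squared
# discrepancy between observed ratings and true scores.
assert np.isclose(sum(components.values()), np.mean((obs - true) ** 2))
```

Because each component isolates a different kind of discrepancy, a rater can look quite accurate on one component (e.g., Differential Elevation) while looking inaccurate on another, which is exactly the pattern of low intercorrelations among accuracy measures noted above.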
Several scholars had questioned the assumption that rater error measures were useful proxies for assessments of accuracy (e.g., Becker & Cardy, 1986; Cooper, 1981b; Murphy & Balzer, 1986). It was Murphy and Balzer (1989) who, using data from over 800 raters, provided the first direct empirical examination of this assumption. They reported that the relationship between any of the common rating errors and rating accuracy was either zero, or it was in the wrong direction (i.e., more rater errors were associated with higher accuracy). In particular, they reported that the strongest error-accuracy relationship was between halo error and accuracy, but that higher levels of halo were associated with higher levels of accuracy, not lower levels, as should have been the case.
Accuracy measures have proved problematic as criteria for evaluating ratings. First, different accuracy measures often lead to quite different conclusions about rating systems; several authors have suggested that the purpose of the appraisals should probably dictate which type of accuracy measure should be used to evaluate ratings (e.g., Murphy, 1991; Murphy & Cleveland, 1995). Furthermore, direct measures of accuracy are only possible in highly controlled settings, such as laboratory studies, making these measures less useful for field research. Finally, Ilgen, Barnes-Farrell, and McKellin (1993) raised wide-ranging questions about whether or not accuracy was the right goal in performance appraisal, and therefore whether it was the best criterion measure for appraisal research. This point was also raised elsewhere by DeNisi and Gonzalez (2004) and Ilgen (1993).
Alternative Criteria for Evaluating Ratings
Neither agreement indices, rater error measures, nor rating accuracy measures have proved satisfactory as criteria for evaluating ratings. A number of authors have proposed alternative criterion measures to be used in research. For example, DeNisi and Peters (1996) conducted one of the few field studies examining cognitive processes usually studied in the lab. These authors examined rater reactions to the appraisals they gave, as well as rating elevation and rating discrimination (between and within ratees), as criterion measures to evaluate the effectiveness of two different interventions intended to improve rater recall of performance information. Varma, DeNisi, and Peters (1996) relied upon the information recorded in rater diaries to generate proxies for rating accuracy (i.e., the extent to which ratings reflected what raters recorded in their diaries). But neither alternative seemed likely to replace other criterion measures, although they did point to some other possibilities.
Ratee (not rater) reactions to the appraisal process have a substantial history of use as criterion measures, dating back at least to the late 1970s (Landy, Barnes, & Murphy, 1978; Landy, Barnes-Farrell, & Cleveland, 1980). Specifically, this research focused on ratees' perceptions of the fairness of the appraisal process as a criterion measure, and others have adopted this approach as well (e.g., Taylor, Tracy, Renard, Harrison, & Carroll, 1995). Consistent with the larger body of literature on organizational justice, scholars have suggested that ratees' perceptions
about the fairness of the ratings they received, as well as of the rating process itself, are important criteria for evaluating the effectiveness of any appraisal system (cf., Folger, Konovsky, & Cropanzano, 1992; Greenberg, 1986, 1987). This focus was consistent with the recommendations of Ilgen et al. (1993) and DeNisi and Gonzalez (2004), and assumes that employees are most likely to accept performance feedback, to be motivated by performance-contingent rewards, and to view their organization favorably if they view the performance appraisal system as fair and honest.
In our view, perceptions of fairness should be thought of as a mediating variable rather than as a criterion. The rationale for treating reactions as a mediating variable is that performance ratings are often used in organizations as a means of improving performance, and reactions to ratings probably have a substantial impact on the effectiveness of rating systems. It is likely that performance feedback will lead to meaningful and useful behavior changes only if the ratee perceives the feedback (i.e., the ratings) received as fair, and accepts this feedback. Ratee performance may not actually improve, perhaps because of a lack of ability or some situational constraint, but increasing an incumbent's desire to improve and a willingness to try harder is assumed to be a key goal of performance appraisal and performance management systems. Unfortunately, feedback, even when accepted, is not always as effective as we had hoped it might be (cf., Kluger & DeNisi, 1996).
Conclusions
Our review of past attempts at identifying criterion measures for evaluating performance appraisals suggests that one of the reasons for the recurring failure in the century-long search for "criteria for criteria" is the tendency to limit this search to a single class of measures, such as inter-rater agreement measures, rater error scores, indices of rating accuracy, and the like. Although some type of ratee reaction measure may be more reasonable, this criterion is also narrow and deals with only one aspect of appraisals. Early in the history of research on criteria for criteria, Thorndike (1949) reminded us of the importance of keeping the "ultimate criterion" in mind. He defined this ultimate criterion as the "complete and final goal" of the assessment or intervention being evaluated (p. 121). In the field of performance appraisal, this "ultimate criterion" is an abstraction, in part because performance appraisal has many goals and purposes in most organizations (Murphy et al., 2018). Nevertheless, this abstraction is a useful one, in part because it reminds us that no single measure or class of measures is likely to constitute an adequate criterion for evaluating performance appraisal systems. Each individual criterion measure is likely to have a certain degree of criterion overlap with the ultimate criterion (i.e., each taps some part of the ultimate criterion), but each is also likely to suffer from a degree of criterion contamination (i.e., each measure is affected by things outside of the ultimate criterion). The search for a single operational criterion for criteria strikes us as pointless.
We propose to evaluate measures of job performance in the same way we evaluate other measures of important constructs, that is, through the lens of construct validation. In particular, we propose a framework for evaluating performance ratings that draws upon methods widely used to assess the construct validity of tests and assessments (American Educational Research Association, 2014).
CONSTRUCT VALIDATION AS A FRAMEWORK FOR ESTABLISHING CRITERIA FOR CRITERIA
Construct validation is most commonly associated with testing (American Educational Research Association, 2014). Specifically, there are many tests designed to measure constructs, such as Intelligence or Agreeableness. A construct is a label we use to describe a set of related behaviors or phenomena, and constructs do not exist in the literal sense. We cannot see Intelligence, although we can infer it, and constructs such as Agreeableness or Intelligence are extremely useful in helping us to understand behavior. The process of construct validation, then, involves assessing whether the measure (of Intelligence, for example) actually measures the construct (i.e., actually measures Intelligence), and in some sense, most assessments of validity (regardless of the specific approach used) can really be thought of as assessments of construct validity (Cronbach, 1990; Murphy & Davidshofer, 2005).
Evaluating the construct validity of any measure does not involve the simple and straightforward application of one method. Instead, it is a process of collecting information in support of construct validity and establishing what is referred to as a "nomological network." Establishing this network really involves testing a series of hypotheses concerning the construct. These hypotheses take the form of "if this instrument measures intelligence, then scores on this instrument should be positively (or negatively) related to other outcomes." As evidence in support of these hypotheses is collected, the case for construct validity builds. In the final analysis, we can never "prove" the construct validity of any measure, but we can amass enough data in support of construct validity that it becomes generally accepted as measuring what it is intended to measure.
This same approach can be used to assess the construct validity of performance appraisals (similar approaches have been proposed by Borman (1991) and Milkovich & Wigdor (1991) but never implemented to our knowledge; see also Stone-Romero, Alvarez, & Thompson (2009)). Basically, this approach involves collecting data to test hypotheses in support of the claim that performance ratings measure actual job performance. What kinds of data should we collect? There are several classes of data that would be useful. The framework we propose for evaluating job performance measures has three components: (1) construct explication, (2) multiple evidence sources, and (3) the accumulation and synthesis of relevant evidence to draw conclusions about the
extent to which job performance measures reflect the desired constructs and fulfill their desired purposes. That is, in order to evaluate performance ratings and performance appraisal systems, we have to first know what they are intended to measure and to accomplish, then collect the widest array of relevant evidence, then put that information together to draw conclusions about how well our performance measures reflect the constructs they are designed to reflect and achieve the goals they are designed to accomplish.
Construct Explication
Construct explication is the process of defining the meaning and the correlates of the construct one wishes to measure (Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2001). Applying this notion to performance appraisal systems involves answering three questions, two of which focus on performance itself and the last of which focuses on the purpose of performance appraisal systems in organizations: (1) what is performance? (2) what are its components? and (3) what are we trying to accomplish with a PA system? We can begin by drawing upon existing, well-researched models of the domain of job performance (Campbell, 1990; Campbell, McCloy, Oppler, & Sager, 1993) to answer the first two questions, although we also propose a general definition of job performance as the total value of a person's contributions to the organization over a defined period of time.
This broad definition, however, requires further explication. Campbell (1990) suggested that there were eight basic dimensions of job performance that applied to most jobs, so that job performance could be defined as how well an employee performed on each. These were: job-specific task proficiency (tasks that make up the core technical requirements of a job); non-job-specific task proficiency (tasks not specific to the job but required by all jobs in the organization); written and oral communications; demonstrating effort (how committed a person is to job tasks and how persistently and intensely they work at those tasks); maintaining personal discipline (avoiding negative behavior at work); facilitating team and peer performance (support, help, and development); supervision (influencing subordinates); and management and administration (non-supervisory functions of management, including goal setting). Subsequent discussions (e.g., Motowidlo & Kell, 2013) expanded the criterion space to include contextual performance (behavior that contributes to organizational effectiveness through its effects on the psychological, social, and organizational context of work, but is not necessarily part of any person's formal job description), counterproductive performance (behaviors that are carried out to hurt and hinder effectiveness and have negative expected organizational value), and adaptive performance (which includes the ability to transfer training/learning from one task to another, coping and emotional adjustment, and showing cultural adaptability).
Assessing the degree to which an appraisal instrument captures the critical aspects of job performance is largely an issue of content validity. Although content
validity has traditionally been used in connection with validating tests, it clearly applies to evaluating appraisal instruments as well. In the case of appraisal instruments, this would mean the extent to which the content of the appraisal instrument overlaps with defined performance on the job in question. Thus, the issue would be assessing whether or not the appraisal instrument captures all the aspects of job performance discussed above. This type of assessment is likely to rely on expert judgment, but there are many tools that can be applied to bring rigor to these judgments. Lawshe (1975) first proposed a quantitative approach for assessing the degree of agreement among those experts, resulting in the Content Validity Index (CVI). Subsequent research (e.g., Polit, Beck, & Owen, 2007) supports the usefulness of this index as a means of assessing content validity, and it could be used with regard to appraisal instruments as well.
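As a concrete illustration (ours, not part of the original chapter), the sketch below computes Lawshe's content validity ratio (CVR) for a set of hypothetical appraisal dimensions judged by a panel of experts, and summarizes them with a content validity index taken here as the mean CVR across dimensions, which is one common convention. The panel size, dimension names, and judgment counts are all invented for the example.

```python
# Minimal sketch (hypothetical judgments): Lawshe's (1975) content validity
# ratio (CVR) per appraisal dimension, plus a simple content validity index
# computed as the mean CVR across dimensions.
# CVR = (n_essential - N/2) / (N/2), where N = number of panelists.

panel_judgments = {
    # dimension: number of panelists (out of N) rating it "essential"
    "task proficiency": 9,
    "communication": 7,
    "counterproductive behavior": 8,
    "adaptive performance": 5,
}
N = 10  # hypothetical panel size

def cvr(n_essential, n_panelists):
    half = n_panelists / 2
    return (n_essential - half) / half

cvrs = {dim: cvr(n, N) for dim, n in panel_judgments.items()}
for dim, value in cvrs.items():
    print(f"{dim:>28s}: CVR = {value:+.2f}")

cvi = sum(cvrs.values()) / len(cvrs)
print(f"content validity index (mean CVR): {cvi:.2f}")
# CVR ranges from -1 (no panelist rates the dimension "essential") to +1
# (every panelist does); dimensions with low CVR values would be candidates
# for revision before the appraisal instrument is used.
```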
Addressing the third question requires knowledge of the context in which work is performed and the goals of the organization in creating and implementing the appraisal system (Murphy, Cleveland, & Hanscom, 2018). This involves consideration of the reasons why organizations conduct appraisals and the ways in which they use appraisal information. The model suggested by Cleveland, Murphy, and Williams (1989) is particularly useful in this regard. Those authors distinguish among: between-person distinctions (e.g., who gets a raise or is promoted); within-person distinctions (e.g., identification of training needs); systems maintenance (e.g., evaluating HR systems); and documentation (e.g., justification for personnel actions). Of course, in most organizations, appraisal information will be used for several (if not all) of these purposes, but it is important to assess the effectiveness of appraisal systems for each purpose for which information is used.
Evidence
There are many types of evidence that are relevant for evaluating performance measures, and we discuss a number of these, but it is surely the case that there are other types of evidence that could be collected as well. Perhaps the most basic type of evidence could be derived by simply examining the actual content of the rating scales used to assess performance. This content should be based upon a careful job analysis that provides clear and unambiguous definitions of performance dimensions that are related to the job in question. The basic dimensions suggested by Campbell (1990), and discussed above, would provide a good starting point, although adding aspects of performance such as contextual performance, counterproductive performance, and adaptive performance would help ensure a more complete view of a person's contribution to the organization. These dimensions might be expressed in terms of behaviors, goals, or outcomes, but arguing that personality traits or attitudes are related to these performance dimensions requires an extra step and an extra set of assumptions.
Evidence could also be collected by assessing the convergent validity of various measures of performance. The assessment of convergent validity is commonly a part of any base of evidence for construct validity and is concerned with the extent to which different measures, claiming to assess the same construct, are related to each other. In the case of performance appraisals, these "other" measures might include objective measures of performance, in situations where such measures are possible. In fact, there is evidence to suggest that performance ratings and objective performance measures are related (corrected correlations in the .30s and .40s), but not substitutable (e.g., Bommer, Johnson, Rich, Podsakoff, & MacKenzie, 1995; Conway, Lombardo, & Sanders, 2001; Heneman, 1986; Mabe & West, 1982). We could also approach convergent validation by comparing ratings of the same person, using the same scale, but provided by different raters. This could be viewed as the interrater agreement criterion discussed earlier, but those measures typically involved multiple raters at the same level. The notion of 360 degree ratings (or multi-source ratings) assumes that raters who have different relationships with a ratee might evaluate that ratee differently (otherwise there would be no reason to ask for ratings from different sources), and the level of agreement across sources is seen as an important component of the effectiveness of these systems (e.g., Atwater & Yammarino, 1992). In general, data suggest that ratings from different sources are related, but not highly correlated, so that the rating source has an important effect on ratings (e.g., Harris & Schaubroeck, 1988; Mount, Judge, Scullen, Sytsma, & Hezlett, 1998). Woehr, Sheehan, and Bennett (2005) also reported a strong effect for rating source, although they did find that the effects of performance dimensions were the same across sources. In both cases, there is surely some question about whether these different measures actually purport to measure the same things. Objective performance measures typically assess output only, and it is possible that an employee's performance is more than just the number of units sold or produced. Nevertheless, evidence that objective and subjective assessments of performance and effectiveness converge can represent an important aspect of the validation of an appraisal system.
It is also possible to assess construct validity by examining evidence of criterion-related validity. Performance ratings are among the most commonly used criteria for validating selection tests. There is a large body of data demonstrating that tests designed to measure job-relevant abilities and skills are consistently correlated with ratings of job performance (cf., Schmidt & Hunter, 1998; Woehr & Roch, 2016). We typically think of these data as evidence for the validity of the selection tests rather than for the performance ratings, but they can be used for both. That is, if there is a substantial body of evidence demonstrating that predictors of performance that should be related to job performance measures actually are related to performance ratings (and there is such a body of evidence), then performance ratings are likely to be capturing at least some part of the construct of job performance.
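The sketch below (ours, using made-up data and assumed reliability values) illustrates the simple computations behind such convergent and criterion-related evidence: correlating supervisor ratings with an objective output measure and with a selection test, and applying the standard correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The reliability values used here are illustrative assumptions, not estimates from any real data.

```python
# Minimal sketch (hypothetical data): convergent and criterion-related
# evidence as simple correlations, with the standard correction for
# attenuation applied using assumed reliability estimates.
import numpy as np

supervisor_ratings = np.array([3.2, 4.1, 2.8, 4.5, 3.9, 3.0, 4.8, 3.6])
objective_output = np.array([55, 52, 49, 63, 50, 47, 60, 58])     # e.g., units sold
test_scores = np.array([108, 104, 95, 110, 118, 106, 121, 100])   # selection test

def corrected_r(x, y, rxx, ryy):
    """Return the observed correlation and the correlation corrected for attenuation."""
    r = np.corrcoef(x, y)[0, 1]
    return r, r / np.sqrt(rxx * ryy)

# Assumed (illustrative) reliabilities for each measure
r_obs, r_corr = corrected_r(supervisor_ratings, objective_output, rxx=0.80, ryy=0.90)
print(f"ratings vs. objective output: r = {r_obs:.2f}, corrected r = {r_corr:.2f}")

r_obs, r_corr = corrected_r(supervisor_ratings, test_scores, rxx=0.80, ryy=0.85)
print(f"ratings vs. selection test:   r = {r_obs:.2f}, corrected r = {r_corr:.2f}")
# If a corrected value approaches or exceeds 1.0, the assumed reliabilities are
# probably too low, which echoes the concerns raised earlier about overly
# aggressive corrections for measurement error.
```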
Another way of gathering evidence about the construct validity of performance ratings is to determine whether ratings have consistent meanings across contexts or cultures. Performance appraisals are used in numerous countries and cultures; multinational corporations might use similar appraisal systems in many nations. The question of whether performance appraisals provide measures that can reasonably be compared across borders is therefore an important one. Ployhart, Wiechmann, Schmitt, Sacco, and Rogg (2003) examined ratings of technical proficiency, customer service, and teamwork given to fast food workers in Canada, South Korea, and Spain and concluded that the ratings show evidence of invariance. In particular, raters appeared to interpret the three dimensions in similar ways and to apply comparable performance standards when evaluating their subordinates. However, there was also evidence of some subtle differences in perceptions that could make direct comparisons across countries complex. In particular, raters in Canada perceived smaller relations between Customer Service and Teamwork than did raters in South Korea and Spain. On the whole, however, Ployhart et al. (2003) concluded that ratings from these countries reflected similar ideas about the dimensions and about the performance levels expected and could therefore be used to make cross-cultural comparisons. Similarly, there is evidence of measurement equivalence when performance ratings from more experienced and less experienced raters are compared (Greguras, 2005). Even though experience as a supervisor is likely to influence the strategies different supervisors apply to maximize the success of their subordinates, it appears that supervisors using a well-developed performance appraisal system are likely to agree regarding the meaning of performance dimensions and performance levels.
Also, we might use evidence regarding bias in performance ratings to evaluate the construct validity of these ratings. The rationale here is that if performance ratings can be shown to be strongly influenced by factors other than job performance, this would tend to argue against the proposition that performance ratings provide valid measures of job performance (Colella, DeNisi, & Varma, 1998). There is a substantial literature dealing with the question of whether or not performance ratings are biased by factors that are presumably unrelated to actual job performance, such as the demographic characteristics of ratees or the characteristics of work groups. A full review of that literature is beyond the scope of the present paper, but it is worth noting that there is evidence of some bias based on employee gender (i.e., bias against women; see review by Roberson, Galvin, & Charles, 2007); employee age (i.e., older workers are rated somewhat higher); race and ethnicity (i.e., minority group members tend to receive somewhat lower ratings); and disability status (e.g., Colella, DeNisi, & Varma, 1998; Czajka & DeNisi, 1988), as well as bias based on attributes viewed negatively by most of the population (e.g., obesity, low levels of physical attractiveness; see, for example, Bento, White, & Zacur, 2012). Although there is evidence that performance ratings show some biases, it is important to note that age, gender, race, and disability tend to have very small effects on performance ratings (Landy, 2010), and these factors may not be as important
as some have suggested. In fact, several review authors have concluded that bias is not a significant issue in most appraisals (e.g., Arvey & Murphy, 1998; Bass & Turner, 1973; Baxter, 2012; Bowen, Swim, & Jacobs, 2000; DeNisi & Murphy, 2017; Kraiger & Ford, 1985; Landy, Shankster, & Kohler, 1994; Pulakos, White, Oppler, & Borman, 1989; Waldman & Avolio, 1991). Studies using laboratory methods (e.g., Hamner, Kim, Baird, & Bigoness, 1974; Rosen & Jerdee, 1976; Schmitt & Lappin, 1980) are more likely to report demographic differences in ratings, especially when those studies involve vignettes rather than observations of actual performance, but these biases do not appear to be substantial in ratings collected in the field (see the meta-analysis results reported by Murphy, Herr, Lockhart, & Maguire, 1986). This is not to say that there are not situations where bias is very real and very serious (e.g., Heilman & Chen, 2005), but the general hypothesis that performance ratings are substantially biased against women, members of minority groups, older workers, or disabled workers does not seem credible (DeNisi & Murphy, 2017; Murphy et al., 2018). On the whole, the lack of substantial bias typically encountered in performance appraisals can be considered as evidence in favor of the construct validity of performance ratings.
Finally, evidence regarding employee reactions to appraisals and perceptions that the ratings are fair would be worth collecting. As noted earlier, the research focusing on ratee reactions and perceptions of fairness has a reasonably long history (e.g., Landy, Barnes, & Murphy, 1978; Landy, Barnes-Farrell, & Cleveland, 1980), and these reactions continue to be studied as an important part of the entire performance management process (cf., Folger, Konovsky, & Cropanzano, 1992; Greenberg, 1986, 1987; Greenberg & Folger, 1983; Taylor, Tracy, Renard, Harrison, & Carroll, 1995). But, since ratee reactions are seen as mediating variables relating to ratee motivation to improve performance and, ultimately, to actual performance improvement, data on reactions should be collected in conjunction with data on actual performance improvement.
Synthesis
Synthesizing evidence from all (or even many) of these sources is a non-trivial task. Therefore, as with all construct validation efforts, the process will take time and effort, and will not be a one-step evaluation process. Also, the construct validation process will involve continuing efforts to collect evidence so that we may become more and more certain about any conclusions reached. In any case, the process requires the accumulation of evidence and a judgment as to how strong a case has been made for construct validity. Since the final assessment will necessarily be a matter of judgment, it is clear that there are a number of issues that will need to be addressed.
One such issue is the determination of how much evidence is enough. Obviously, more evidence is always preferable, but collecting more evidence may not always be practical. Therefore, the question will remain as to how many "pieces" of evidence will be needed to make a convincing case. The actual amount of evidence
needed may also be a function of whether or not all the available evidence comes to the same conclusion. That is, it may be the case that relatively few bits of evidence are sufficient if they all indicate that the appraisal instrument has sufficient content validity. But what if there is no consensus with regard to the evidence? Therefore, another important issue in developing a protocol for evaluating the construct validity of performance measures is determining how to reconcile different streams of evidence that suggest different conclusions.
First, there must be a decision as to whether a case could be made for construct validity in the presence of any contradictory evidence. Then, assuming some contradictory evidence, a decision must be made concerning how to weigh different types of evidence. Earlier, in our discussion of traditional measures for evaluating appraisal instruments, we noted that rating errors were not a good proxy for rating accuracy, and probably not a good measure for evaluation at all. It would seem reasonable, then, that evidence relating to rating errors could be discounted in any analysis. But what about assessing other types of evidence, such as measurement equivalence, source agreement, or the absence of bias? How much weight to give each of these will ultimately be a judgment call, and the ability of anyone to make a case for construct validity will depend largely upon one's ability to make the case for some differential weighting.
But there may be one type of evidence that can be given some precedence in this process. We argue that, while organizations conduct appraisals for a number of reasons, ultimately they conduct appraisals in the hope of helping employees to improve their performance. Therefore, some deference should be shown to evidence that supports this improvement. That is, if there is evidence that implementing an appraisal system has resulted in a true improvement in individual performance, this should be given a fair amount of weight in supporting the construct validity of the system. Furthermore, evidence that the appraisal system has also resulted in true improvement in performance at the level of the firm should be given even more weight. We note, however, that evidence clearly linking improvements in individual-level performance with improvements in firm-level performance is extremely rare (cf., DeNisi & Murphy, 2017).
DIRECTIONS FOR FUTURE RESEARCH AND CONCLUSIONS
So, where do we go from here? We believe that one of the major reasons for the recurring failure in the century-long search for "criteria for criteria" is the tendency to limit this search to a single class of measures, such as inter-rater agreement measures, rater error scores, indices of rating accuracy, and the like. Some of these measures have serious enough problems that they probably should not be used at all, but, even if we accept that some of these measures provide us some insight
as to the usefulness of appraisal systems, they can only tell us part of the story. Instead, we have proposed reframing the criteria we use to evaluate measures of job performance in terms of the way we evaluate other measures of important constructs, that is, through the lens of construct validation. But the approach we have proposed suggests that the evaluation process will be complex. It requires collecting different types of data, where each data source can tell us something about the effectiveness of appraisal systems, but where only when we combine these different sources will we begin to get a true picture of effectiveness. We have discussed a number of such data sources, which we have termed sources of evidence of construct validity, and research needs to continue to identify and refine these sources of evidence.
Research needs to more fully examine issues of convergence across rating sources. There is evidence to suggest that ratings of the same person, from different rating sources, are correlated but are not substitutable. Is this because of measurement error or bias, or is it because raters who have different relationships with a ratee observe different behaviors? Perhaps peers, supervisors, subordinates, and other sources see similar things but apply different standards in evaluating what they see. Determining the source of the disagreement may help us to establish upper bounds on the level of agreement that could be expected, so that we can more accurately assess convergence across sources.
More information about the equivalence of ratings across cultures and contexts is also needed. This type of research may require special efforts to overcome the effects of language differences, as well as differences in definitions across cultures. For example, Farh, Earley, and Lin (1997) examined how American and Chinese workers viewed the idea of organizational citizenship behavior (OCB). They found that it was necessary to go beyond the mere translation of OCB scales developed in the West. Instead, they generated a Chinese definition of OCB and found that measures of this Chinese version of OCB displayed the same relations with various justice measures as the U.S.-based measures did. But they also found that the translated measures did not display the same relations. They concluded that citizenship behavior was as important for the Chinese sample as it was for the U.S. sample, but that the two groups defined citizenship in slightly different ways, and it was important to respect these differences when comparing results. Therefore, it may not be enough to simply translate appraisal instruments in order to compare equivalence across cultures. But, on the other hand, at some point, the conceptualizations may be so different as to suggest that there is really not any equivalence. These issues require a great deal of further research.
We noted that, although there is evidence of different types of bias in performance ratings, these biases actually explained only small amounts of variance in actual ratings. It is important to obtain clear estimates of how important bias may be for ratings in different settings. This too may allow us to set upper bounds to help interpret data on bias, but it will also help to identify cases where bias is more serious, and to determine what to do in such situations.
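One simple way to generate such estimates, sketched below with simulated data (ours, not the chapter's), is hierarchical regression: compare the variance in ratings explained by performance-relevant predictors alone with the variance explained once a demographic indicator is added, and treat the increment as a rough estimate of how much that characteristic matters in a given setting, holding measured performance constant.

```python
# Minimal sketch (simulated data): estimating the incremental variance in
# ratings associated with a demographic indicator, beyond a
# performance-relevant predictor, via hierarchical regression.
import numpy as np

rng = np.random.default_rng(0)
n = 500
objective_perf = rng.normal(size=n)        # performance-relevant predictor
group = rng.integers(0, 2, size=n)         # hypothetical demographic indicator
# Ratings driven mostly by performance, with a small group effect and noise
ratings = 0.6 * objective_perf - 0.15 * group + rng.normal(scale=0.8, size=n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])      # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_perf = r_squared(objective_perf, ratings)
r2_full = r_squared(np.column_stack([objective_perf, group]), ratings)

print(f"R^2, performance predictor only:   {r2_perf:.3f}")
print(f"R^2, adding demographic indicator: {r2_full:.3f}")
print(f"incremental variance due to group: {r2_full - r2_perf:.3f}")
# A small increment is consistent with the conclusion in the field studies
# cited above that bias, while real, typically explains little variance in
# ratings; a large increment would flag a setting where bias deserves
# closer attention.
```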
Finally, although we believe that further research is needed on several of these topics as a means of helping to establish the construct validity of appraisals, we also see it as especially important to generate evidence that appraisal systems matter. That is, we view it as critical that any attempt to evaluate appraisal systems includes data showing that feedback from appraisals changes behavior. We view reliance upon ratee reactions and perceptions of fairness as an important step in this process, but, ultimately, ratee reactions and perceptions would only serve to mediate the relationships between appraisal results and performance improvement. Whatever evidence exists, it might be difficult to establish the construct validity of appraisals that did not help employees to improve. Furthermore, beyond improving the performance of individual employees, it is also important to show how appraisal systems can help improve the performance of the firm itself. This would require demonstrating how changes in individual-level performance actually translate into changes in firm-level performance, and, as noted by DeNisi and Murphy (2017), data supporting such a relation are extremely rare and will require a great deal of effort to collect.
The search for "criteria for criteria" has been a long and disappointing one, in part because none of the particular measures (e.g., agreement, rater errors) that have been proposed have been fully adequate for the task. Other reviews of the appraisal literature focused upon ways to improve appraisals, but they did not consider potential problems with the criteria used to evaluate appraisal systems. The present review focused explicitly upon questions of the criteria used, and a critical review indicated that there were problems with most of the criterion measures used in the past. After completing this review, we believe it is time to abandon the search for one single criterion measure that could allow us to evaluate performance appraisal systems and to adopt the same approach we have adopted for validating most other measures. The evaluation of performance appraisal systems will involve the same sort of complex, ongoing system of collecting, weighing, and evaluating evidence that we routinely apply when asking whether, for example, a new measure of Agreeableness actually taps this construct. The good news is that we have all of the tools and training we need, as well as a well-established framework for validation. We look forward to applications of the construct validation framework to the important question of evaluating performance appraisal systems.
REFERENCES
American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Aguinis, H., & O'Boyle, E. (2014). Star performers in twenty-first-century organizations. Personnel Psychology, 67, 313–350.

126 • ANGELO S. DENISI & KEVIN R. MURPHY Aguinis, H., O’Boyle, E., Gonzalez-Mulé, E., & Joo, H. (2016). Cumulative advantage: Conductors and insulators of heavy-tailed productivity distributions and productivity stars. Personnel Psychology, 69, 3–66. Arvey, R., & Murphy, K. (1998). Personnel evaluation in work settings. Annual Review of Psychology, 49, 141–168. Atwater, L. E., & Yammarino, F. Y. (1992). Does self–other agreement on leadership perceptions moderate the validity of leadership and performance predictions? Personnel Psychology, 45, 141–164. Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77, 836–874. Balzer, W. K., & Sulsky, L. M. (1992). Halo and performance appraisal research: A critical examination. Journal of Applied Psychology, 77, 975–985. Bass, A. R., & Turner, J. N. (1973). Ethnic group differences in relationships among criteria of job performance. Journal of Applied Psychology, 57, 101–109. Baxter, G. W. (2012). Reconsidering the black-white disparity in federal performance ratings. Public Personnel Management, 41, 199–218. Beck, J. W., Beatty, A. S., & Sackett, P. R. (2014). On the distribution of job performance: The role of measurement characteristics in observed departures from normality. Personnel Psychology, 67, 531–566. Becker, B. E., & Cardy, R. L. (1986). Influence of halo error on appraisal effectiveness: A conceptual and empirical reconsideration. Journal of Applied Psychology, 71, 662–671. Bendig, A. W. (1953). The reliability of self-ratings as a function of the amount of verbal anchoring and the number of categories on the scale. Journal of Applied Psychology, 37, 38–41. Bento, R. F., White, L. F. & Zacur, S. R. (2012). The stigma of obesity and discrimination in performance appraisal: A theoretical model. International Journal of Human Resource Management, 23, 3196–3224. Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior at work. Boston, MA: Kent. Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6, 205–212. Bingham, W. V. (1939). Halo, invalid and valid. Journal of Applied Psychology, 23, 221– 228. Blanz, F., & Ghiselli, E. E. (1972). The mixed standard scale: A new rating system. Personnel Psychology, 25, 185–200. Bommer, W. H., Johnson, J. L., Rich, G. A., Podsakoff, P. M., & MacKenzie, S. B. (1995). On the interchangeability of objective and subjective measures of employee performance: A meta-analysis. Personnel Psychology, 48, 587–605. Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238–252. Borman, W. C. (1978). Exploring the upper limits of reliability and validity in job performance ratings. Journal of Applied Psychology, 63, 135–144. Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410–421.

Evaluating Job Performance Measures  •  127 Borman, W.C (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational Psychology (pp. 271–326). Palo Alto, CA: Consulting Psychologists Press. Bowen, C., Swim, J. K., & Jacobs, R. (2000). Evaluating gender biases on actual job performance of real people: A meta-analysis. Journal of Applied Social Psychology, 30, 2194–2215. Bretz, R. D., Milkovich, G. T., & Read, W. (1992). The current state of performance appraisal research and practice: Concerns, directions, and implications. Journal of Management, 18, 321–352. Campbell J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 1, pp. 687–732). Palo Alto, CA: Consulting Psychologists Press. Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-Bass. Cardy, R. L., & Dobbins, G. H. (1986). Affect and appraisal accuracy: Liking as an integral dimension in evaluating performance. Journal of Applied Psychology, 71, 672–678. Cleveland, J. N., Murphy, K. R., & Williams, R. E. (1989). Multiple uses of performance appraisal: Prevalence and correlates. Journal of Applied Psychology, 74, 130–135. Colella, A., DeNisi, A. S., & Varma, A. (1998). The impact of ratee’s disability on performance judgments and choice as partner: the role of disability-job fit stereotypes and interdependence of rewards. Journal of Applied Psychology, 83, 102–111. Conway, J. M. (1998). Understanding method variance in multitrait-multirater performance appraisal matrices: Examples using general impressions and interpersonal affect as measured method factors. Human Performance, 11, 29–55. Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance, 10, 331–360. Conway, J. M., Lombardo, K., & Sanders, K. C. (2001). A meta-analysis of incremental validity and nomological networks for subordinate and peer rating. Human Performance, 14, 267–303. Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston, MA: Houghton Mifflin Company. Cooper, W. H. (1981a). Conceptual similarity as a source of illusory halo in job performance ratings. Journal of Applied Psychology, 66, 302–307. Cooper, W. H. (1981b). Ubiquitous halo. Psychological Bulletin, 90, 218–244. Cronbach, L. J. (1955). Processes affecting scores on “understanding of others” and “assumed similarity.” Psychological Bulletin, 52, 177–193. Cronbach, L. J. (1990). Essentials of psychological testing. New York, NY: Harper and Row. Czajka, J. M., & DeNisi, A. S. (1988). The influence of ratee disability on performance ratings: The effects of unambiguous performance standards. Academy of Management Journal, 31, 394–404.

128 • ANGELO S. DENISI & KEVIN R. MURPHY DeNisi, A. S., & Gonzalez, J. A. (2004). Design performance appraisal to improve performance appraisal. In E. A. Locke (Ed.) The Blackwell handbook of principles of organizational behavior (Updated version, pp. 60–72). London, UK: Blackwell Publishers. DeNisi, A. S., & Murphy, K. R. (2017). Performance appraisal and performance management: 100 Years of progress? Journal of Applied Psychology, 102, 421–433. DeNisi, A. S., & Peters, L. H. (1996). Organization of information in memory and the performance appraisal process: evidence from the field. Journal of Applied Psychology, 81, 717. DeNisi, A. S., Robbins, T., & Cafferty, T. P. (1989). Organization of information used for performance appraisals: Role of diary-keeping. Journal of Applied Psychology, 74, 124–129. DeNisi, A. S., Robbins, T. L., & Summers, T. P. (1997). Organization, processing, and the use of performance information: A cognitive role for appraisal instruments. Journal of Applied Social Psychology, 27, 1884–1905. DeNisi, A. S., & Sonesh, S. (2011). The appraisal and management of performance at work. In S. Zedeck (Ed.), Handbook of industrial and organizational psychology (pp. 255–280). Washington, DC: APA Press. Dierdorff, E. C., & Surface, E. A. (2007). Placing peer ratings in context: systematic influences beyond ratee performance. Personnel Psychology, 60, 93–126. Farh, J., Earley, P.C., & Lin, S. 1997). Impetus for action: A cultural analysis of justice and organizational citizenship behavior in Chinese society. Administrative Science Quarterly, 42, 421–444. Fleenor, J.W., Fleenor, J.B., & Grossnickle, W.F. (1996). Interrater reliability and agreement of performance ratings: A methodological comparison. Journal of Business and Psychology, 10, 367–38. Folger, R., Konovsky, M. A., & Cropanzano, R. (1992). A due process metaphor for performance appraisal. Research in Organizational Behavior, 14, 129–129. Greenberg J. (1986) Determinants of perceived fairness of performance evaluations. Journal of Applied Psychology, 71, 340–342. Greenberg, J. (1987). A taxonomy of organizational justice theories. Academy of Management Review, 12, 9–22. Greguras, G. J. (2005). Managerial experience and the measurement equivalence of performance ratings. Journal of Business and Psychology, 19, 383–397. Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360-degree feedback ratings. Journal of Applied Psychology, 83, 960–968. Greguras, G. J., Robie, C., Schleicher, D. J., & Goff, M. (2003). A field study of the effects of rating purpose on the quality of multisource ratings. Personnel Psychology, 56, 1–21. Hamner, W. C., Kim, J. S., Baird, L., & Bigoness, W. J. (1979). Race and sex as determinants of ratings by potential employers in a simulated work-sampling task. Journal of Applied Psychology, 59, 705–711. Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisory, self-peer, and peer-subordinate ratings. Personnel Psychology, 41, 43–62.

Evaluating Job Performance Measures  •  129 Heilman, M. E., & Chen, J. J. (2005). Same behavior, different consequences: reactions to men’s and women’s altruistic citizenship behavior. Journal of Applied Psychology, 90, 431–441. Heneman, R. L. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. Personnel Psychology, 39, 811–826. Hoffman, B. J., Lance, C. E., Bynum, B., & Gentry, W. A. (2010). Rater source effects are alive and well after all. Personnel Psychology, 63, 119–151. Hoffman, B. J., & Woehr, D. J. (2009). Disentangling the meaning of multisource performance rating source and dimension factors. Personnel Psychology, 62, 735–765. Ilgen, D. R. (1993). Performance appraisal accuracy: An elusive and sometimes misguided goal. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Industrial and organizational perspectives (pp. 235–252). Hillsdale, NJ: Erlbaum. Ilgen, D. R., Barnes-Farrell, J. L., & McKellin, D. B. (1993). Performance appraisal process research in the 1980s: What has it contributed to appraisals in use? Organizational Behavior and Human Decision Processes, 54, 321–68. Jennings, T., Palmer, J. K., & Thomas, A. (2004). Effects of performance context on processing speed and performance ratings. Journal of Business and Psychology, 18, 453–463. Joo, H., Aguinis, H., & Bradley, K. J. (2017). Not all non-normal distributions are created equal: Improved theoretical and measurement precision. Journal of Applied Psychology, 102, 1022–1053. Kasten, R., & Nevo, B. (2008). Exploring the relationship between interrater correlations and validity of peer ratings. Human Performance, 21, 180–197. Kingsbury, F. A (1922). Analyzing ratings and training raters. Journal of Personnel Research, 1, 377–382. Kingsbury, F. A. (1933). Psychological tests for executives. Personnel, 9, 121–133. Kluger, A. N., & DeNisi, A. S. (1996). The effects of feedback interventions on performance: Historical review, meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119, 254–284. Kraiger, K., & Ford, J. K. (1985). A meta-analysis of ratee race effects in performance ratings. Journal of Applied Psychology, 70, 56–65. Lance, C. E. (1994). Test of a latent structure of performance ratings derived from Wherry’s (1952) theory of rating. Journal of Management, 20, 757–771. Lance, C. E., Baranik, L. E., Lau, A. R., & Scharlau, E. A. (2009). If it ain’t trait it must be method: (mis)application of the multitrait-multimethod design in organizational research. In C. E. Lance & R. L. Vandenberg (Eds.), Statistical and methodological myths and urban legends (pp. 227–360). New York, NY: Routledge. Lance, C. E., LaPointe, J. A., & Stewart, A. M. (1994). A test of the context dependency of three causal models of halo rater error. Journal of Applied Psychology, 79, 332–340. Lance, C. E., Teachout, M. S., & Donnelly, T. M. (1992). Specification of the criterion construct space: An application of hierarchical confirmatory factor analysis. Journal of Applied Psychology, 77, 437–452.

130 • ANGELO S. DENISI & KEVIN R. MURPHY Landy, F. J. (2010). Performance ratings: Then and now. In J.L. Outtz (Ed.). Adverse impact: Implications for organizational staffing and high-stakes selection (pp. 227–248). New York, NY: Routledge. Landy, F. J., Barnes, J., & Murphy, K. R. (1978). Correlates of perceived fairness and accuracy of performance appraisals. Journal of Applied Psychology, 63, 751–754. Landy, F. J., Barnes-Farrell, J., & Cleveland, J. (1980). Perceived fairness and accuracy of performance appraisals: A follow-up. Journal of Applied Psychology, 65, 355–356. Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107. Landy, F. J., Shankster, L. J., & Kohler, S. S. (1994). Personnel selection and placement. Annual Review of Psychology, 45, 261–296. Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563–575. LeBreton, J. M., Scherer, K. T., & James, L. R. (2014). Corrections for criterion reliability in validity generalization: A false prophet in a land of suspended judgment. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 478–500. Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280–290. McIntyre, R. M., Smith, D., & Hassett, C. E. (1984). Accuracy of performance ratings as affected by rater training and perceived purpose of rating. Journal of Applied Psychology, 69, 147–156. Milkovich, G. T., & Wigdor, A. K. (1991). Pay for performance. Washington, DC: National Academy Press. Motowidlo, S. J., & Kell, H. J. (2013). Job Performance. In N. W. Schmitt & S. Highhouse (Eds.), Comprehensive handbook of psychology, Volume 12: Industrial and organizational psychology (2nd ed., pp. 82–103). New York, NY: Wiley. Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait, rater, and level effects in 360-degree performance ratings. Personnel Psychology, 51, 557–576. Murphy, K. R, (1991). Criterion issues in performance appraisal research. Behavioral accuracy vs. classification accuracy. Organizational Behavior and Human Decision Processes, 50, 45–50. Murphy, K. R. (2008). Explaining the weak relationship between job performance and ratings of job performance. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 148–160. Murphy, K. R., & Anhalt, R. L. (1992). Is halo error a property of the rater, ratees, or the specific behaviors observed? Journal of Applied Psychology, 77, 494–500. Murphy, K. R., & Balzer, W. K. (1986). Systematic distortions in memory-based behavior ratings and performance evaluations: Consequences for rating accuracy. Journal of Applied Psychology, 71, 39–44. Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74, 619–624. Murphy, K. R., Balzer, W. K., Kellam, K. L., & Armstrong, J. (1984). Effect of purpose of rating on accuracy in observing teacher behavior and evaluating teaching performance. Journal of Educational Psychology, 76, 45–54.

Evaluating Job Performance Measures  •  131 Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational and goal-oriented perspectives. Newbury Park, CA: Sage. Murphy, K. R., Cleveland, J., & Hanscom, M. (2018). Performance appraisal and management Thousand Oaks, CA: Sage. Murphy, K., & Davidshofer, C. (2005). Psychological testing: Principles and applications (6th ed). Upper Sadddle River, NJ: Prentice Hall. Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900. Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K. (1982). Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 67, 320. Murphy, K. R., Herr, B. M., Lockhart, M .C., & Maguire, E. (1986). Evaluating the performance of paper people. Journal of Applied Psychology, 71, 654–661. Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). Nature and consequences of halo error: A critical analysis. Journal of Applied Psychology, 78, 218–225. Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67, 562–567. Nathan, B. R., & Tippins, N. (1989). The consequences of halo “error” in performance ratings: A field study of the moderating effect of halo on test validation results. Journal of Applied Psychology, 74, 290–296. O’Neill, T. A., McLarnon, M. J. W., & Carswell, J. J. (2015). Variance components of job performance ratings. Human Performance, 32, 801–824. Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2008). No new terrain: Reliability and construct validity of job performance ratings. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 174–179. Ployhart, R. E., Wiechmann, D., Schmitt, N., Sacco, J. M., & Rogg, K. (2003). The crosscultural equivalence of job performance ratings. Human Performance, 16, 49–79. Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity. Research in Nursing and Health, 30, 451–467. Pulakos, E. D. (1986). The development of training programs to increased accuracy in different rating tasks. Organizational Behavior and Human Decision Processes, 38, 76–91. Pulakos, E. D., White, L. A., Oppler, S. H., & Borman, W. C. (1989). Examination of race and sex effects on performance ratings. Journal of Applied Psychology, 74, 770–780. Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs in organizational research: Implications for estimating interrater reliability. Journal of Applied Psychology, 93, 959. Remmers, H. H. (1931). Reliability and halo effect of high school and college students’ judgments of their teachers. Journal of Applied Psychology 18, 619–630. Roberson, L., Galvin, B. M., & Charles, A. C. (2007). When group identities matter: Bias in performance appraisal. The Academy of Management Annals, 1, 617–650. Roch, S. G., Paquin, A. R., & Littlejohn, T. W. (2009). Do raters agree more on observable items? Human Performance, 22, 391–409. Rosen, B., & Jerdee, T. H. (1976). The nature of job-related age stereotypes. Journal of Applied Psychology, 61, 180–183.

132 • ANGELO S. DENISI & KEVIN R. MURPHY Saal, F. E., Downey, R. C., & Lahey, M. A. (1980). Rating the ratings: Assessing the quality of rating data. Psychological Bulletin, 88, 413–428. Sanchez, J. I., & De La Torre, P. (1996). A second look at the relationship between rating and behavioral accuracy in performance appraisal. Journal of Applied Psychology, 81, 3–10. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schmitt, N., & Lappin, M. (1980). Race and sex as determinants of the mean and variance of performance ratings. Journal of Applied Psychology, 65, 428–435. Schmidt, F. L.,Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability. Personnel Psychology, 53, 901–912. Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85, 956–970. Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton-Mifflin. Smith, P. C. (1976). Behaviors, results, and organizational effectiveness. In M. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago, IL: RandMcNally. Smith, P. C., & Kendall, L. M. (1963). Retranlsation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155. Solomonson, A. L., & Lance, C. E. (1997). Examination of the relationship between true halo and halo error in performance ratings. Journal of Applied Psychology, 82, 665–674. Stone-Romero, E. F., Alvarez, K., & Thompson, L. F. (2009). The construct validity of conceptual and operational definitions of contextual performance and related constructs. Human Resource Management Review, 19, 104–116. Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497–506. Taylor, M. S., Tracy, K. B., Renard, M. K., Harrison, J. K., & Carroll, S. J. (1995). Due process in performance appraisal: A quasi-experiment in procedural justice. Administrative Science Quarterly, 495–523. Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25–29. Thorndike, R. L. (1949). Personnel selection. New York, NY: Wiley. Valle, M., & Bozeman, D. (2002). Interrater agreement on employees’ job performance: Review and directions. Psychological Reports, 90, 975–985. Varma, A., DeNisi, A. S., & Peters, L. H. (1996). Interpersonal affect in performance appraisal: A field study. Personnel Psychology, 49, 341–360. Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2002). The moderating influence of job performance dimensions on convergence of supervisory and peer ratings of job performance: Unconfounding construct-level convergence and rating difficulty. Journal of Applied Psychology, 87, 345–354.

Evaluating Job Performance Measures  •  133 Waldman, D. A., & Avolio, B. J. (1991). Race effects in performance evaluations: Controlling for ability, education, and experience. Journal of Applied Psychology, 76, 897–901. Williams, K. J., DeNisi. A. S., Meglino, B. M., & Cafferty, T. P. (1986). Initial decisions and subsequent performance ratings. Journal of Applied Psychology, 71, 189–195. Woehr, D. J., & Roch, S. G. (2016).Of babies and bathwater: Don’t throw the measure out with the application. Industrial and Organizational Psychology: Perspectives on Science and Practice, 9, 357—361. Woehr, D. J., Sheehan, M. K., & Bennett, W. (2005). Assessing measurement equivalence across rating sources: A multitrait-multirater approach. Journal of Applied Psychology, 90, 592–600.

CHAPTER 7

RESEARCH METHODS IN ORGANIZATIONAL POLITICS
Issues, Challenges, and Opportunities

Liam P. Maher, Zachary A. Russell, Samantha L. Jordan, Gerald R. Ferris, and Wayne A. Hochwarter

Scientific inquiry has identified, casually discussed, informally examined, and vigorously investigated organizational politics phenomena for over a century (e.g., Byrne, 1917; Ferris & Treadway, 2012; Lasswell, 1936). In reality, politics work goes back even further if we consider the publication of Niccolo Machiavelli's The Prince, initially written in the 1500s and first published in 1903 (Machiavelli, 1952). Delineations of organizational politics are abundant in the existing literature (many would claim 'too many' with little overlapping agreement; Lepisto & Pratt, 2012). The expanding research base notwithstanding, the field has yet to offer an agreed upon theory-driven definition. Foundationally (and historically), organizational politics has been cast in a mostly pejorative or negative light, referring to the self-interested behavior of individuals, groups, or organizations (e.g., Ferris & Treadway, 2012). Ostensibly, this view has driven previous empirical research, with only a small number of exceptions (see Franke & Foerstl, 2018; Landells & Albrecht, 2013). Given its maturation, especially in the past three decades, significant reviews of the organizational politics literature exist (see Ferris, Harris, Russell, & Maher,
2018; Ferris & Hochwarter, 2011; Kacmar & Baron, 1999; Lux, Ferris, Brouer, Laird, & Summers, 2008), investigating a myriad of substantive relations. These reviews identified trends, critically examined foundational underpinnings, and noted inconsistencies and possible causes (Chang, Rosen, & Levy, 2009). Embedded in many of these summaries are critiques of research design as well as recommendations for addressing existing methodological deficiencies (Ferris, Ellen, McAllister, & Maher, 2019). However, to our knowledge, there has been no systematic examination of research method issues in organizational politics scholarship to date. Therefore, we offer a detailed critique of issues, challenges, and future directions of organizational politics research methods.

Scope of Organizational Politics Review

Given the exponential growth of published politics research, we thought it prudent to identify a potential starting point when designing our discussion. Specifically, the original perceptions of politics (POPs) model (Ferris, Russ, & Fandt, 1989) has spawned considerable research and remains influential (e.g., Ahmad, Akhtar, ur Rahman, Imran, ul Ain, 2017). Admittedly, other influential politics discussions surfaced before the initial Ferris et al. model (e.g., Madison, Allen, Porter, Renwick, & Mayes, 1980; Mayes & Allen, 1977). For example, Mintzberg's (1985) characterization of organizations as "political arenas" and Pfeffer's (1981) support of politics as a constructive and legitimate element of organizational realities preceded Ferris et al. (1989) and remain relevant in contemporary research (Cullen, Gerbasi, & Chrobot-Mason, 2018). For clarity, though, we use Ferris et al.'s (1989) discussion as our point of departure.

In this review, we identify notable methodological approaches guiding past discussions of organizational politics. Further, we offer suggestions for augmenting methodological approaches considered critical when scholars develop the next generation of politics research. In terms of scope, we argue that organizational politics is far from a unitary or singular construct, but instead reflects a multi-faceted area of inquiry. Indeed, organizational politics is quite complex, and some of the shortcomings of existing research likely derive from a failure to fully recognize this multi-faceted reality. Delineation of what each construct is and is not is essential when seeking to establish construct validity. Further, a clear definition and understanding of a construct's placement within its nomological network is key to successful measure development (Hinkin, 1998). Psychometrically sound measures must then be developed, tested, and established as possessing validity in order for research to progress our knowledge (MacKenzie, Podsakoff, & Podsakoff, 2011) of the organizational politics nomological network.

Although only subtle differences may be visible to some, the research domain classically consists of perceptions of organizational politics, political behavior, and political skill (Ferris & Hochwarter, 2011; Ferris, Perrewé, Daniels, Lawong, & Holmes, 2017; Ferris & Treadway, 2012). However, we also include in our present review and analysis two conceptually related constructs. 'Political will'
and 'reputation' are burgeoning areas of study that fit well within the organizational politics nomological network (Blom-Hansen & Finke, in press; Ferris et al., 2019). In concept, political will has been around for some time (Mintzberg, 1983; Treadway, Hochwarter, Kacmar, & Ferris, 2005). Historically, the term represented worker behaviors undertaken to sabotage the leader's directives (Brecht, 1937). In contemporary terms, conceptual advancements increased interest (Blickle, Schütte, & Wihler, 2018; Maher, Gallagher, Rossi, Ferris, & Perrewé, 2018), and publication of the Political Will Scale (PWS) helped develop empirical research in recent years (Kapoutsis, Papalexandris, Treadway, & Bentley, 2017).

Organizational reputation is not a new concept (Bromley, 1993; O'Shea, 1920). As an example, McArthur (1917) argued: "Reputation is something that you can't value in dollars and cents, but is mighty precious just the same…" (p. 63). However, for such a foundational construct, relatively little theory and research have been conducted on reputation in the organizational sciences (Ferris, Blass, Douglas, Kolodinsky, & Treadway, 2003; Ferris, Harris, Russell, Ellen, Martinez, & Blass, 2014). As far back as Tsui (1984), and extending to the present day (Ferris et al., 2019), reputation in organizations has been construed as less of an objectively scientific construct and more of a sociopolitical one (Ravasi, Rindova, Etter, & Cornelissen, 2018). Hence, reputation's inclusion as a facet of organizational politics is entirely appropriate (Munyon, Summers, Thompson, & Ferris, 2015; Zinko, Gentry, & Laird, 2016) given its influence (direct and indirect) on both tactics (Ferris et al., 2017) and presentation acuity (Smith, Plowman, Duchon, & Quinn, 2009).

PRIMARY CONSTRUCTS WITHIN ORGANIZATIONAL POLITICS

In the following sections, we review and analyze the five primary constructs and their associated measurement instruments within politics research. As we describe below, the study of politics research presents substantial, but imperfect, attempts to describe and measure important phenomena. Beyond summarizing trends, we highlight the areas for potential improvement by identifying limitations—many of which come from the authors of this chapter.

Perceptions of Organizational Politics

Definition and Conceptualization. The operationalization of the term 'perceptions of organizational politics' (POPs) is traced to the broader organizational politics literature (Stolz, 1955). Traditionally defined as non-sanctioned and illegitimate activities characterized by self-interest (Ferris & King, 1991; Ferris & Treadway, 2012; Mintzberg, 1983, 1985), political behavior in organizational contexts casts a pejorative light. As such, conceptualizations focus primarily on the dysfunctional and self-serving aspects of others' political behavior (Chang et
al., 2009; Guo, Kang, Shao, & Halvorsen, 2019; Miller, Rutherford, & Kolodinsky, 2008). Although similarities between POPs and the broader organizational politics construct exist, researchers note that perceptions are always subjective evaluations, whereas organizational politics are captured objectively (Ferris, Harrell-Cook, & Dulebohn, 2000; Ferris et al., 2019). Because perceptions ostensibly manufacture reality (Landry, 1969), what is seen is impactful (Lewin, 1936; Porter, 1976) and capable of explaining affective, cognitive, and behavioral outcomes at work (Ferris & Kacmar, 1992). Accordingly, we define POPs as an individual's idiosyncratic estimation and evaluation of others' self-serving, or egocentric, behavior at work (Ferris et al., 1989; Ferris et al., 2000; Ferris & Kacmar, 1992).

Ferris et al. (1989) developed one of the first theoretical models of POPs, which specified the antecedents, outcomes, and moderators within the nomological network of POPs. A subsequent review expanded these related constructs (Ferris, Adams, Kolodinsky, Hochwarter, & Ammeter, 2002). Although no one study has tested each proposed link, general support has been found for these two guiding models, which were the primary studies that established the POPs nomological network.

Despite the strong theoretical rationale for previous antecedent models, theorization concerning the link between POPs and organizational outcomes was largely absent before Chang et al.'s (2009) meta-analytic examination. Their study was one of the first to identify psychological mechanisms linking POPs to more distal work outcomes (i.e., turnover intentions and performance). Chang et al. (2009) found that psychological strain mediated the relation between perceptions of organizational politics and performance, such that as POPs increased, so did psychological strain, in turn reducing performance. Morale mediated the POPs–performance and turnover relation, albeit in a different fashion. Finally, one of the most significant findings was the wide credibility intervals surrounding the estimated effects of POPs on outcomes. This catalyzed the search for moderating effects, which has dominated the POPs literature over the past decade.

Measurement. In 1980, two independent sets of scholars made first efforts to assess political perceptions at work. Gandz and Murray (1980) asked employees to report on the amount of political communication existing in their organization, as well as its influence in shaping work environments. Respondents also reported the organizational levels where political activities were most prevalent and offered opinions on the effectiveness of these behaviors. Furthermore, respondents provided a specific situation indicative of "a good example of workplace politics in action" (Gandz & Murray, 1980, p. 240).

Madison et al. (1980) captured POPs through detailed interviews with chief executive officers, high staff managers, and supervisors. Specifically, participants answered questions, via face-to-face interviews, and reported on the frequency of politics across different functional areas. They also described, in an open-ended
fashion, their general perceptions of politics as either helpful or harmful to both the individual and the organization.

Almost a decade later, Ferris and Kacmar (1989) proposed a five-item unidimensional measure of general POPs. Shortly after that, Kacmar and Ferris (1991) developed the Perceptions of Organizational Politics Scale (POPS). The 12-item POPS contained three factors: (1) general political behavior, (2) going along to get ahead, and (3) pay and promotion. Subsequent attempts to validate the POPS' three-factor structure evidenced psychometric shortcomings (Kacmar & Carlson, 1997; Nye & Witt, 1993). In response, Kacmar and Carlson (1997) evaluated the contributions of each POPS item, removed items not functioning as intended, and developed new items, resulting in the 15-item extended POPS. Vigoda (2002) shortened this scale to a 6-item measure, which remains commonly used in research (e.g., Sun & Chen, 2017).

Since the development of the POPS (Ferris & Kacmar, 1989), and its subsequent extension (Kacmar & Carlson, 1997), scholars sought other ways to measure politics perceptions. For example, Hochwarter, Kacmar, Treadway, and Watson (2003) asked respondents to report experienced politics with a scale with endpoints ranging from 0 (no politics exist) to 100 (great levels of politics exist) (50 served as a midpoint – moderate levels of politics exist). Three organizational levels were examined: (a) at the highest levels in your organization; (b) at the level of your immediate supervisor, and (c) at the level of you and your coworkers. This approach allowed respondents to indicate an absolute level of viewed politics without a priori directionality. Documented differences across levels surfaced and uniquely predicted outcomes.

As an extension, Hochwarter, Kacmar, Perrewé, and Johnson (2003) developed a six-item assessment of POPs. Respondents were asked to respond to each item while considering three different organizational levels (e.g., current, one level up, highest level). Since its emergence, Hochwarter et al.'s scale remains in use given its short length and overall acceptable reliability across organizational levels (Dahling, Gabriel, & MacGowan, 2017; Rosen, Koopman, Gabriel, & Johnson, 2016).

Critique and Future Research Directions. Arguably, the most fundamental issue with current scholarship on POPs concerns its pervasively negative operationalization, conceptualization, and measurement (Ellen, 2014; Hochwarter, 2012). As alluded to above, the construct's negative orientation equates to its definitional overlap with the term 'organizational politics' from the broader politics literature (Ferris et al., 2019). Despite its positioning as a "dark side" phenomenon (Ferris & King, 1991), as well as a "hindrance" stressor (Chang et al., 2009; LePine, Podsakoff, & LePine, 2005), POPs also can serve neutral and positive functions in organizational contexts. This confusion threatens the construct validity of POPs. Thus, rather than assuming POPs is either positive or negative, future scholarship should expand its thinking to include neutral and positive operationaliza-
tions. For example, some scholars already have defined the construct as the active management of shared meaning (Ferris & Judge, 1991; Pfeffer, 1981), as well as the effort to restore justice, attain resources and benefits for others, and/or as a source of positive influence and change (Ellen, 2014; Hochwarter, 2012). These views represent an initial benchmark for the construct's future refinement and measurement. Furthermore, despite literature focusing on self-serving and proactive tactics, reactive and defensive political strategies are also viable (Ashforth & Lee, 1990; Valle & Perrewé, 2000).

Landells and Albrecht (2017) interpreted and categorized POPs into four levels. Those who perceived organizational politics as reactive regarded the behaviors as destructive and manipulative, whereas reluctant politics represented a "necessary evil" (Landells & Albrecht, 2017, p. 41). Furthermore, strategic behaviors accomplished goals, and integrated tactics benefited actors when central to successful company functioning, activity, and decision-making. These findings support claims for an expansion that captures a fuller content domain. We encourage the use of grounded theory investigations as theoretical starting points for improving conceptualizations and psychometric treatments.

Also concerning is the lack of theorizing regarding how POPs affect individual-, group-, and organizational-level outcomes. Although several conceptual models have begun to specify the direct effects of POPs (e.g., Aryee, Chen, & Budhwar, 2004; Ferris et al., 2002; Valle & Perrewé, 2000), few studies have offered theoretical support for possible processes that indirectly link POPs to employee and organizational outcomes. Exceptions include studies investigating morale (Rosen, Levy, & Hall, 2006) and need satisfaction (Rosen, Ferris, Brown, Chen, & Yan, 2014) as mediating mechanisms. Building on these studies, more substantial theorization needs to explain how and why POPs are associated with attitudes and behaviors at work (Chang et al., 2009) across organizational levels (Adams, Ammeter, Treadway, Ferris, Hochwarter, & Kolodinsky, 2002; Dipboye & Foster, 2002; Franke & Foerstl, 2018).

Whereas historically the organizational politics literature has focused predominantly on between-person variance in politics perceptions as a stable environmental factor (Rosen et al., 2016), it is highly possible that politics perceptions vary throughout the day, week, or more broadly across time. As research on experience sampling methods continues (Matta, Scott, Colquitt, Koopman, & Passantino, 2017), it would be beneficial for researchers also to consider within-person variation in politics perceptions, and the antecedents that may result in such variance. Assuming within-person variance exists, researchers would be drawing a broader picture as to how politics perceptions are developed and modified across time. Furthermore, given the importance of uncertainty in the larger politics literature, researchers also may want to consider whether within-person variability in politics perceptions is more harmful than consistently perceiving politics. Perhaps politics manifest in ways similar to justice perceptions. Specifically, variability of cues likely causes more disdain when inconsistent (sometimes good – sometimes bad) than when consistent (always bad) (Matta et al., 2017).
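Before turning to political behavior, the sketch below offers one concrete and purely illustrative way to gauge whether meaningful within-person variance exists: simulated daily POPs scores from a hypothetical experience-sampling design are partitioned into between- and within-person components via ICC(1). The sample sizes, data, and balanced design are assumptions for illustration only.

```python
# Illustrative sketch (hypothetical data): partitioning daily politics-perception
# scores into between- and within-person variance, as an experience-sampling
# design would require. ICC(1) is computed from a balanced one-way ANOVA.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_people, n_days = 100, 10
person_mean = rng.normal(loc=3.0, scale=0.6, size=n_people)                      # stable between-person differences
scores = person_mean[:, None] + rng.normal(scale=0.9, size=(n_people, n_days))   # daily fluctuation around each person's mean

long = pd.DataFrame({
    "person": np.repeat(np.arange(n_people), n_days),
    "pops": scores.ravel(),
})

grand_mean = long["pops"].mean()
group_means = long.groupby("person")["pops"].mean()
# Mean squares for a balanced design with k = n_days observations per person
ms_between = n_days * ((group_means - grand_mean) ** 2).sum() / (n_people - 1)
ms_within = ((long["pops"] - group_means.loc[long["person"]].to_numpy()) ** 2).sum() / (n_people * (n_days - 1))
icc1 = (ms_between - ms_within) / (ms_between + (n_days - 1) * ms_within)
print(f"ICC(1): {icc1:.2f} (share of variance that is between-person; the remainder is within-person)")
```

A low ICC(1) in real data would signal substantial day-to-day (within-person) variation in politics perceptions, which is precisely the kind of evidence that would justify the multilevel designs suggested above.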
Political Behavior

Definition and Conceptualization. As stated by Mintzberg (1983, 1985), organizations are political arenas in which motivated and capable individuals enact self-serving behavior. Although employees often perceive 'office politics' as being decisively negative, political behavior can produce organizational and interpersonal benefits when appropriately implemented (Treadway et al., 2005). Given widespread disagreement on the implications of politics, and more specifically political behavior, conceptualizations have varied over time and across studies (Kidron & Vinarski-Peretz, 2018; Lampaki & Papadakis, 2018).

Generally, researchers agree that political behavior is normal, and sometimes, an essential element of functioning (Zanzi & O'Neill, 2001). However, no agreed-upon definition that captures the complexity of political action exists (Ferris et al., 2019). Whereas most definitions posit political behavior as non-sanctioned activity within organizational settings (Farrell & Petersen, 1982; Gandz & Murray, 1980; Mintzberg, 1983; Schein, 1977), others focus on political behavior as an interdependent social enactor-receiver relationship (Lepisto & Pratt, 2012; Sharfman, Wolf, Chase, & Tansik, 1988). Furthermore, some researchers classify influence tactics (Kipnis & Schmidt, 1988; Kipnis, Schmidt, & Wilkinson, 1980; Yukl & Falbe, 1990), impression management (Liden & Mitchell, 1988; Tedeschi & Melburg, 1984), and even voice (Burris, 2012; Ferris et al., 2019) as relevant for effective operationalization of the politics construct.

Since its original operationalization, several conceptual models have emerged to explain potential antecedents of political behavior. The first, developed by Porter, Allen, and Angle (1981), argued that political behavior is, at least partially, a function of Machiavellianism, locus of control, need for power, risk-seeking propensity, and a lack of personal power. Just over a decade later, Ferris, Fedor, and King (1994) stated that political behavior is the result of Machiavellianism and locus of control, as seen in Porter et al.'s (1981) model, as well as self-monitoring, a propensity unique to the Ferris et al. (1994) model. Overall, empirical research investigating the antecedents of political behavior has been inconclusive (Grams & Rogers, 1990; Vecchio & Sussman, 1991), leading to calls for an expansion of the individual difference domain previously specified (Ferris, Hochwarter, Douglas, Blass, Kolodinsky, & Treadway, 2002b). In response, Treadway et al. (2005) conceptualized political behavior to include motivational and achievement need components, and Ferris et al. (2019) conceptualized general political behavior as one of the multiple other political actions that organizational members enact. We now briefly describe other forms of political action conceptualized as being part of political behavior in organizations.

Influence tactics are specific strategies employed to obtain desired goals. Despite general disagreement regarding what types of influence tactics exist (Kipnis
et al., 1980; Kipnis & Schmidt, 1988; Yukl & Tracey, 1992), an extensive body of literature has examined not only what tactics are most effective, but also the boundary conditions affecting tactic success (e.g., frequency of influence, direction of influence, power distance between enactor and receiver, reason for influence attempt). As part of this trend, several meta-analytic studies have begun to tease apart these direct and moderating implications (Barbuto & Moss, 2006; Higgins, Judge, & Ferris, 2003; Lee, Han, Cheong, Kim, & Yun, 2017; Smith et al., 2013).

Impression management reflects any political act designed to manage how one is perceived (Tedeschi & Melburg, 1984; Tedeschi, Melburg, Bacharach, & Lawler, 1984). Attempts at impression management fall into five primary categories, including ingratiation, self-promotion, exemplification, supplication, and intimidation (Jones & Pittman, 1982). Past work has categorized impression management into two dimensions (i.e., tactical-strategic and assertive-defensive; Tedeschi & Melburg, 1984). The tactical-strategic dimension considers whether short-term or long-term purposes guide impression management. Moreover, the assertive-defensive dimension determines if behavior escalates proactively or reactively to situational contingencies. Although the common intention of impression management is a favorable assessment, recent work reports that poorly executed tactics can be detrimental for one's social image (Bolino, Long, & Turnley, 2016).

Voice, a type of organizational citizenship behavior (OCB), is the expression of effective solutions in response to perceived problems to improve a given situation (Li, Wu, Liu, Kwan, & Liu, 2014; Van Dyne & LePine, 1998). Voice is essential for the management of shared meaning in organizational contexts (Ferris et al., 2019), and represents a mechanism to advertise and promote personal opinions and concerns (Burris, 2012). However, unlike many other forms of OCBs, voice can be maladaptive for individuals enacting the behavior, as well as for their coworkers and the organization as a whole (Turnley & Feldman, 1999). As such, employee voice exemplifies a form of informal political behavior (Ferris et al., 2019).

Measurement. Despite existing theoretical avenues within the political behavior literature, there is still considerable disagreement surrounding construct definition and use in scholarly practice (Ferris et al., 2019), which impedes construct validity. Given a general lack of operational and conceptual consensus, measures of political behavior also have been limited and quite inconsistent. Whereas some scholars have developed scales assessing general political behavior (Valle & Perrewé, 2000; Zanzi, Arthur, & Shamir, 1991), others have used impression management (Bolino & Turnley, 1999), influence tactics (Kipnis & Schmidt, 1988), and voice (Van Dyne & LePine, 1998) as proxies for political behavior in organizational settings.

The most commonly utilized measure of individual political behavior was developed by Treadway et al. (2005; α = .83). Six items captured general politicking behavior toward goal attainment, interpersonal influence, accomplishment shar-
ing, and ‘behind the scenes’ political activity. Despite its widespread use since the scale’s emergence, Treadway et al.’s (2005) measure has yet to undergo the empirical rigor that traditional scale developments endure (Ferris et al., 2019). Critique and Future Research Directions. Before empirical work on the construct can continue, researchers need to develop a concise and agreed upon operation of political behavior that includes traditional definitional components while taking into consideration the importance of intentionality (Hochwarter, 2012), goal-directed activity and behavioral targets (Lepisto & Pratt, 2012), and interpersonal dependencies (French & Raven, 1959). Furthermore, researchers need to decide whether to expand political behavior to include concepts like influence tactics, impression management, and voice, or if each construct is unique enough to hold an independent position within political behaviors’ nomological network. Once the construct is better defined, and its related constructs identified, researchers will want to use this conceptualization to help inform subsequent scale development attempts. We encourage researchers to cast a wide net when defining political behaviors and its potential underlying dimensions. Political behaviors reflect inherently non-sanctioned and self-serving actions (Mitchell, Baer, Ambrose, Folger, & Palmer, 2018), triggering ostensibly adverse outcomes. However, not all non-sanctioned behavior is aversive nor all self-serving behavior dysfunctional (Ferris & Judge, 1991; Zanzi & O’Neill, 2001). For example, egotistic behavior may not be intrinsic to the actor. Instead, contexts infused with threat often trigger self-serving motivations as a protective mechanism (Lafrenière, Sedikides, & Lei, 2016; Von Hippel, Lakin, & Shakarchi, 2005). For this reason, future research should expand conceptualizations and measurement to include constructs predisposed to neutral and positive implications as well (Ellen, 2014; Fedor, Maslyn, Farmer, & Bettenhausen, 2008; Ferris & Treadway, 2012; Hochwarter, 2012; Maslyn, Farmer, & Bettenhausen, 2017). Furthermore, political behavior is a broad term encapsulating activity enacted by different sources, including the self, others, groups, and organizations (Hill, Thomas, & Meriac, 2016). Given its possible manifestations across organizational levels, future research must redefine the construct within the appropriate and intended theoretical level. As part of this process, researchers also must consider whether political behavior is a level-generic (or level-specific) phenomenon, manifesting similarly (or differentially) across multiple hierarchies. The objectionable and surreptitious nature of political behavior (Wickenberg & Kylén, 2007) provokes the use of self-report measures prone to socially desirable responding. Diverse approaches, however, are likely unable to capture the extensiveness and frequency of political activity for the very same reasons. This conundrum is shared across disciplines (Reitz, Motti-Stefanidi, & Asendorpf, 2016; Zare & Flinchbaugh, 2019), as other-report indices are vulnerable to halo bias (Dalal, 2005). As researchers develop improved measures of political behavior, convergence or divergence must be determined to establish validity (Kruse, Chancellor, & Lyubomirsky, 2017).
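As a concrete illustration of the routine psychometric checks implied above, the sketch below computes coefficient alpha for a hypothetical six-item self-report political behavior measure and the self-other convergence of the resulting scale scores. The data are simulated; nothing here reproduces the Treadway et al. (2005) items or results.

```python
# Illustrative sketch (hypothetical data): two basic psychometric checks for a
# short political behavior measure -- internal consistency (coefficient alpha)
# for six self-report items, and self-other convergence of scale scores.
import numpy as np

rng = np.random.default_rng(11)
n, k = 300, 6
true_pb = rng.normal(size=n)                                          # latent political behavior (simulated)
self_items = true_pb[:, None] + rng.normal(scale=1.0, size=(n, k))    # six self-report items
other_score = true_pb + rng.normal(scale=1.2, size=n)                 # coworker-reported composite

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total score)
item_var = self_items.var(axis=0, ddof=1)
total_var = self_items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)

self_score = self_items.mean(axis=1)
convergence = np.corrcoef(self_score, other_score)[0, 1]

print(f"Coefficient alpha (self-report items): {alpha:.2f}")
print(f"Self-other convergence (r): {convergence:.2f}")
```

In real data, a respectable alpha paired with weak self-other convergence would reinforce the concerns raised above about relying solely on self-reports of political behavior.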

144 • MAHER, RUSSELL, JORDAN, FERRIS, & HOCHWARTER

Political Skill

Definition and Conceptualization. Approximately 40 years ago, two independent scholars concurrently introduced the political skill construct to the literature. Pfeffer (1981) defined political skill as a social effectiveness competency allowing for the active execution of political behavior and attainment of power. Mintzberg (1983, 1985) positioned the construct as the exercise of political influence (or interpersonal style) using manipulation, persuasion, and negotiation through formal power. Despite the extensiveness of political activity in organizational settings, these initial works made little progress beyond the definition and conceptualization stages. However, over the past few decades, researchers have acknowledged the prominence and importance of political acuity, social savviness, and social intelligence (Ferris, Perrewé, & Douglas, 2002; McAllister et al., 2018).

Ahearn, Ferris, Hochwarter, Douglas, and Ammeter (2004) provided an early effort at delineating the political skill construct (then termed "social skill"), defined as "the ability to effectively understand others at work, and to use such knowledge to influence others to act in ways that enhance one's personal and/or organizational objectives" (Ahearn et al., 2004, p. 311). Additionally, political skill was argued to encompass four critical underlying dimensional competencies, including (1) social astuteness, (2) interpersonal influence, (3) networking ability, and (4) apparent sincerity.

Social astuteness, or the ability to be self-aware and to interpret the behavior of others accurately, is necessary for effective influence (Pfeffer, 1992). Individuals possessing political skill are keen observers of social situations. Not only are they able to accurately interpret the behavior of others, but also they can adapt socially in response to what they perceive (Ferris, Treadway, Perrewé, Brouer, Douglas, & Lux, 2007). This "sensitivity to others" (Pfeffer, 1992, p. 173) provides politically skilled individuals the ability to understand the motivations of both themselves and others better, making them useful in many political arenas.

Interpersonal influence concerns "flexibility," or the successful adaptation of behavior to different personal and situational contingencies to achieve desired goals (Pfeffer, 1992). Individuals high in political skill exert powerful influence through subtle and convincing interpersonal persuasion (Ferris et al., 2005, 2007). Whereas Mintzberg (1983, 1985) defined political skill concerning influence and explicit formal power, Ahearn et al.'s (2004) definition does not include direct references to formal authority (Perrewé, Zellars, Ferris, Rossi, Kacmar, & Ralston, 2004). Instead, this view focuses on influence originating from the selection of appropriate communication styles relative to the context at hand, as well as successful adaptation and calibration when tactics are ineffective.

Politically skilled individuals also are adept at developing and utilizing social networks (Ferris et al., 2005, 2007). Not only are these networks secure in terms of their extensiveness, but also they tend to include more valuable and influential members. Such networking capabilities allow individuals high in political skill to
formulate robust and beneficial alliances and coalitions that offer further opportunities to maintain, as well as develop, an increasingly more extensive social network. Further, because these networks are strategically developed over time, the politically skilled are better able to position themselves so as to take advantage of available network-generated resources, opportunities, and social capital (Ahearn et al., 2004; Pfeffer, 2010; Tocher, Oswald, Shook, & Adams, 2012). The last characteristic politically skilled individuals possess is apparent sincerity. That is, they are or at least appear to be, genuine in their intentions when engaging in political behaviors (Douglas & Ammeter, 2004). Sincerity is essential given that influence attempts are only successful when the intention is devoid of ulterior or manipulative motives (Jones, 1990). Thus, perceived intentions may matter more than actual intentions, for inspiring behavioral modification and confidence in others. Subsequently, Ferris et al. (2007) provided a systematic conceptualization of political skill grounded in social-political influence theory. As part of this conceptualization, they characterized political skill as “a comprehensive pattern of social competencies, with cognitive, affective, and behavioral manifestations” (Ferris et al., 2007, p. 291). Specifically, they argued that political skill operated on self, others, and group/organizational processes. Their model identified five antecedents of political skill, including perceptiveness, control, affability, active influence, and developmental experiences. Munyon et al. (2015) extended this model to encapsulate the effect of political skill on self-evaluations and situational appraisals (i.e., intrapsychic processes), situational responses (i.e., behavioral processes), as well as evaluations by others and group/organizational processes (i.e., interpersonal processes). Recently, Frieder, Ferris, Perrewé, Wihler, and Brooks (in press) extended this meta-theoretical framework of social—political influence to leadership. Overall, research on political skill has generated considerable interest since its original refinement by Ferris et al. (2005). Within the last decade, multiple reviews and meta-analyses (Bing, Davison, Minor, Novicevic, & Frink, 2011; Ferris, Treadway, Brouer, & Munyon, 2012; Munyon et al., 2015) have reported on the effectiveness of political skill in work settings, both as a significant predictor as well as a boundary condition. Some notable outcomes include the effect of political skill on stress management (Hochwarter, Ferris, Zinko, Arnell, & James, 2007; Hochwarter, Summers, Thompson, Perrewé, & Ferris, 2010; Perrewé et al., 2004), career success and performance (Blickle et al., 2011; Gentry, Gilmore, Shuffler, & Leslie, 2012; Munyon et al., 2015), and leadership effectiveness (Brouer, Douglas, Treadway, & Ferris, 2013; Whitman, Halbesleben, & Shanine, 2013). Measurement. Ferris et al. (1999) provided a first effort at measuring the political skill construct by developing the six-item Political Skill Inventory (PSI). Despite acceptable psychometric properties and scale reliability across five studies, the PSI was not without flaws. Although the scale reflected social astuteness
Measurement. Ferris et al. (1999) provided a first effort at measuring the political skill construct by developing the six-item Political Skill Inventory (PSI). Despite acceptable psychometric properties and scale reliability across five studies, the PSI was not without flaws. Although the scale reflected social astuteness and interpersonal influence, they did not emerge as separate and distinguishable factors. The resulting unidimensionality and construct domain concerns triggered the development of an 18-item version, which retained the original scale name as well as three original scale items (Ferris et al., 2005). To develop the 18-item PSI, Ferris et al. (2005) generated an initial pool of 40 items to capture the full content domain of the political skill construct. After omitting scale items prone to socially desirable responding and those with problematically high cross-loading values, a final set of 18 items was retained, and as hypothesized, a four-factor solution emerged containing the social astuteness, interpersonal influence, networking ability, and apparent sincerity dimensions.
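To make the item-screening logic described above concrete, the sketch below simulates responses to a hypothetical multidimensional inventory, extracts oblique factors, and flags items with weak primary or high secondary loadings. It is a minimal illustration that assumes the Python factor_analyzer package; the simulated data, item names, and the .40/.30 retention cutoffs are our assumptions, not the actual procedure used by Ferris et al. (2005).

```python
# Minimal sketch: exploratory factor analysis with a cross-loading screen.
# Assumes the `factor_analyzer` package; data, item names, and cutoffs are
# illustrative, not the PSI development procedure itself.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(42)
n_respondents, n_factors, items_per_factor = 500, 4, 5  # hypothetical 20-item pool

# Simulate respondents' standing on four correlated latent dimensions.
latent = rng.multivariate_normal(np.zeros(n_factors),
                                 0.3 + 0.7 * np.eye(n_factors),
                                 size=n_respondents)

# Each item loads mainly on one factor, plus noise.
items = {}
for f in range(n_factors):
    for i in range(items_per_factor):
        items[f"item_{f}_{i}"] = 0.7 * latent[:, f] + 0.5 * rng.normal(size=n_respondents)
data = pd.DataFrame(items)

# Principal-axis factoring with an oblique rotation, as is common for correlated dimensions.
fa = FactorAnalyzer(n_factors=n_factors, method="principal", rotation="oblimin")
fa.fit(data)
loadings = pd.DataFrame(fa.loadings_, index=data.columns)

# Flag items whose primary loading is weak or whose secondary loading is high.
primary = loadings.abs().max(axis=1)
secondary = loadings.abs().apply(lambda row: row.sort_values().iloc[-2], axis=1)
flagged = loadings[(primary < 0.40) | (secondary > 0.30)]
print("Items a researcher might consider dropping:", list(flagged.index))
```

In an actual scale-development study, this screen would be paired with content-validity judgments and social desirability checks rather than applied mechanically.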


Critique and Future Research Directions. Coming up on its 15th anniversary, the PSI has been widely accepted as a sound psychometric measure by those well entrenched within the organizational politics field. Few conceptual squabbles exist among scholars, and the theoretical clarity paired with strong empirically established links to relevant constructs is evidence for strong construct validity. However, this measure has a few notable deficiencies.

The PSI inherently suffers from the drawbacks associated with self-reports. Certainly, self-reports are easy to obtain, and are considered the best way to measure psychological states, perceptions, and motives (McFarland et al., 2012). However, as a tool for assessing behavioral effectiveness, self-reports have some issues. Hubris, perceptual bias, and social desirability can lead individuals to overinflate estimates of their social abilities. Some individuals may believe, or are told erroneously, that they are likable and keen social agents, but in reality, they are social pariahs who annoy and infuriate their colleagues. Also problematic is the assumption that a single, invariant source can respond accurately to every dimension. For example, social astuteness and networking ability are largely perceptual and best obtained through self-reports; observers cannot provide an accurate account of what focal individuals perceive during social interaction. However, observers may be best suited to assess interpersonal influence and apparent sincerity. Influence represents a change in an attitude, judgment, or decision, and such cues are more amenable to assessment by an observer or trained rater. Apparent sincerity is in the eye of the beholder regardless of whether focal individuals intended to be, or believe they acted, sincerely (Silvester & Wyatt, 2018).

With these shortcomings in mind, scholars would advance the field by developing a behavioral measure that can assess political skill without relying solely on self-reports. Developing such a measure would contribute to further legitimizing the construct of political skill to those scholars and practitioners who are not intimately familiar with the organizational politics literature and doubt its merits. Furthermore, this measure need not replace the PSI entirely, but a stream of investigations that employed both a behavioral and a self-report measure could illuminate the utility or futility of how we currently measure political skill. Admittedly, this type of measurement requires added effort, likely complicating data collection processes. However, we are confident that value rests in doing so if only to confirm the utility of self-reports.

Another opportunity within the political skill literature is to evaluate the construct's developmental qualities. According to Ferris et al. (2005, 2007), political skill is a social competency that can be cultivated over time through social feedback, role modeling, and mentorship. Despite strong theoretical support, grounded in social learning theory (Bandura, 1986), little evidence for the development of political skill through observation and modeling exists. Further, if both genetic properties and situational factors affect political skill, then researchers need to consider which individuals are more or less receptive to organizational training, behavioral interventions, incentives, and role modeling techniques. Until empirical evidence is present, scholars should be cautious about discussing political skill as a learnable or trainable competency.

Political Will

Definition and Conceptualization. Political will is a construct commonly used in the popular press and governmental politics to describe a collective's willingness or unwillingness to expend resources towards a particular cause (Post, Raile, & Raile, 2010). The creation of new laws and political courses of action upsets the status quo, and in a world of diverse and often competing interests, politicians must be willing to expend resources to fight for their desired agenda. Similarly, Mintzberg (1983) argued that individual agents within organizations needed political skill and political will in order to execute their desired managerial actions successfully.

Over three decades ago, political will and political skill were introduced conceptually into the organization sciences. Despite the sustained interest in political skill, however, political will has attracted further inquiry only recently. This neglect is unfortunate given that both constructs were integral to Mintzberg's theoretical framework, and the omission of essential variables from measurement models biases the parameter estimates reported in previous studies. Treadway (2012) provided a theoretical application of political will and suggested instrumental (relational, concern for self, concern for others) and risk tolerance as underlying dimensions. Treadway defined political will as "the motivation to engage in strategic, goal-directed behavior that advances the personal agenda and objectives of the actor that inherently involves the risk of relational or reputational capital" (p. 533). In keeping with Mintzberg's conceptualization, Treadway focused on describing political will at the individual level of analysis. Nonetheless, he did acknowledge that political will embodies a group mentality towards a particular agenda.

Measurement. Scholars made a few early attempts to measure political will before the development of a validated psychometric measure. Treadway et al. (2005) first attempted to measure political will using need for achievement and intrinsic motivation as proxies. These constructs successfully predicted the activity level of political behavior.
Similarly, Liu, Liu, and Wu (2010) used need for achievement, and analogously, need for power to predict political behavior. In the same vein, Shaughnessy, Treadway, Breland, and Perrewé (2017) used the need for power as a proxy for political will, which predicted informal leadership. Lastly, Doldor, Anderson, and Vinnicombe (2013) used semi-structured interviews to explore what political will meant to male and female managers. Rather than focus on the trait-like qualities previously employed as proxies, they found that political will was more of an attitude about engaging in organizational politics. They also found that functional, ethical, and emotional appraisals shaped political attitudes.

Recently, Kapoutsis, Papalexandris, Treadway, and Bentley (2017) developed an eight-item measure called the Political Will Scale (PWS). Based on Treadway's (2012) conception of political will, they expected the scale to break out into the five dimensions of instrumental, relational, concern for self, concern for others, and risk tolerance. However, confirmatory principal axis factor analysis revealed two factors for this scale, which they labeled benevolent and self-serving. To date, only a handful of published studies have used this new measure. As an example, Maher et al. (2018) found that political will and political skill predicted configurations of impression management tactics. Moreover, moderate levels of political will were associated with the most effective configuration. Blickle et al. (2018) applied additional psychometric testing to the scale. Applying a triadic multisource design, they found support for the construct and criterion-related validity of the self-serving dimension of political will. However, they did not find justification for the benevolent dimension. Instead, they interpreted this dimension to be synonymous with altruistic political will.

Critique and Future Research Directions. Because the study of political will is in its nascent stage, lending a critical eye helps introduce ideas for remedying potential deficiencies. Establishing, expanding, and empirically testing the political will nomological network will help establish construct validity and advance research in this area. The sections that follow evaluate the state of the construct, with a focus on vetting current conceptualizations and measurement instruments.

To date, within the organization sciences, political will resides as an individual-level variable. Indeed, we take no issue with this stance. Mintzberg specifically discussed political will and political skill as individual attributes necessary to navigate workplace settings. However, scholars in political science have characterized political will as a group-level phenomenon (Post et al., 2010). Accordingly, scholars within the organization sciences should also conceptualize and explore political will at collective levels of analysis. Indeed, political will possesses attitude-based qualities (Doldor et al., 2013), and thus, can proliferate to others within similar social networks (Salancik & Pfeffer, 1978). Furthermore, scholars must examine how formal and informal leadership create unique political will profiles, and assess how these configurations might affect group outcomes. For example, teams with a high and consistent aggregate amount of political will may have a singular focus that leads to higher performance. It may also be true that having one team member or leader who takes care of the 'dirty work' enables other team members to complete work tasks without engaging in office politics.
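Any such team-level treatment of political will would require researchers to justify aggregating individual responses. The sketch below computes the usual aggregation indices, ICC(1), ICC(2), and mean r_wg, from standard one-way ANOVA components on simulated data; the data, column names, and the uniform null distribution used for r_wg are assumptions for illustration, not properties of the PWS.

```python
# Minimal sketch: aggregation statistics (ICC(1), ICC(2), mean r_wg) for a
# hypothetical team-level political will score. Data and column names are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
teams, members, points = 40, 6, 5  # 40 teams, 6 members each, 5-point response scale

team_effect = np.repeat(rng.normal(0, 0.5, teams), members)
scores = np.clip(np.round(3 + team_effect + rng.normal(0, 0.8, teams * members)), 1, points)
df = pd.DataFrame({"team": np.repeat(np.arange(teams), members), "pw": scores})

# One-way ANOVA components (equal team sizes assumed).
grand_mean = df["pw"].mean()
group_means = df.groupby("team")["pw"].mean()
k = members
ms_between = k * ((group_means - grand_mean) ** 2).sum() / (teams - 1)
ms_within = df.groupby("team")["pw"].var(ddof=1).mean()

icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)  # reliability of a single rating
icc2 = (ms_between - ms_within) / ms_between                          # reliability of the team mean

# r_wg per team against a uniform (rectangular) null distribution.
expected_var = (points ** 2 - 1) / 12.0
rwg = 1 - df.groupby("team")["pw"].var(ddof=1) / expected_var
print(f"ICC(1) = {icc1:.2f}, ICC(2) = {icc2:.2f}, mean r_wg = {rwg.mean():.2f}")
```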


Currently, scholars conceptualize political will as an individual characteristic. To date, instruments make no effort to test whether this characteristic differs across organizational situations and contexts. However, political scientists maintain that political will is issue specific (Donovan, Bateman, & Heggestad, 2013; Morgan, 1989). In keeping with this notion, we suggest that a novel and illuminating line of study would be to apply an event-oriented approach (Morgeson, Mitchell, & Liu, 2015) to studying political will. Under this design, scholars could examine how political will focuses resources and effort toward a particular cause, and track how these manifestations affect goals and change outcomes. Unlike team-level aggregation, this approach would require the development of a new measure rather than merely changing the referent in the existing measure of political will.

As with many constructs in the organizational politics literature, there is little consensus on the underlying theoretical foundations of political will. Clear conceptualizations and definitions are essential for any sound psychometric instrument, and this incongruence is a current affliction within the study of political will. Indeed, we applaud the advancement in theory and measurement by Treadway (2012) and Kapoutsis et al. (2017), as they represent the seminal works in the field. Previous proxy measures (i.e., need for achievement, need for power, intrinsic motivation) were rooted in constructs that are stable individual traits, and recent thinking more appropriately suggests that political will is a state-like attribute closely akin to an attitude. However, there are potential issues to confront concerning the more contemporary works mentioned above. For example, the multidimensional conceptualization of political will (Treadway, 2012) was not found to be supported (Kapoutsis et al., 2017). Notably, no items reflected risk tolerance, suggesting too narrow an operationalization. Similarly, Rose and Greeley (2006) suggest that political will represents a sustained commitment to a cause, as adversity and pushback are integral aspects of the process. This aspect of political will is also absent from the recent measure. Scholars should analyze the PWS dimensions in conjunction with scales of perseverance (e.g., grit; Duckworth & Quinn, 2009) and risk tolerance to see if they load onto a common factor.

As mentioned above, 'politics' is a loaded word that means different things to different people. Many see it as a toxic parasite requiring immediate extinction (Cantoni, 1993; Zhang & Lu, 2009). Conversely, others recognize its importance, necessity, and inevitability (Eldor, 2016). An in-depth debate regarding the positive and negative aspects of organizational politics is beyond the scope of this chapter. However, it is clear that definitional unanimity has evaded both scholars and study participants. Anecdotal evidence suggests that respondents consider workplace 'political behavior' to embody advocacy for a particular governmental candidate. There are two potential remedies to this issue.


First, we suggest that scholars define organizational politics within the survey instrument in use. This approach will focus participants' attention on organizational politics, not governmental politics. Second, scholars should avoid using any variant of the word 'politics' in the measures that they create, and instead use more specific language to illustrate the intended political situation or characteristic. When writing scale items, scholars advocate for clear language, so that interpretations are uniform across time, culture, and individual attributes (Clark & Watson, 1995). We find this practice particularly important given the ubiquity of, and lack of agreement about, the word 'politics.' Unfortunately, the PWS suffers from this issue, as all eight items employ a variant of the word politics. Thus, we suggest that scholars define politics in ways that are clearly understood by target samples.

Lastly, we must note that the initial validation of the PWS has produced mixed results. Blickle et al. (2018) found evidence that the self-serving dimension of political will did demonstrate descriptive and convergent validity. However, the benevolent dimension of political will did not correlate with altruism as the authors argued it would. We question whether altruism truly fits within the political framework, as acting politically on others' behalf does not have to be genuinely self-sacrificing (which highlights the need for greater conceptual agreement). These results, combined with the other issues raised in this review, warrant further construct validation research on the PWS.

Reputation

Definition and Conceptualization. Reputation is commonly discussed among the public across social and business contexts. Generally, a positive reputation is regarded as a desirable attribute. However, academic investigations of reputation are inconsistent, and our understanding of what exactly reputation is and how it functions is limited. Relevant research spans the social science disciplines (e.g., economics, management, psychology; Ferris et al., 2003). Like many other constructs in the organizational politics literature, disagreement regarding the definition of reputation has thwarted research. This discrepancy is due, primarily, to the different labels across fields, and in some cases even separate pockets of research within each field (e.g., individual, team, organizational, and industry-level within the management literature). These different markers and branches of research have fragmented the literature (Ferris et al., 2014).

To synthesize the existing research and create greater understanding among scholars, Ferris et al. (2014) provided a cross-level review of reputation. They found that it has three interacting features: (1) elements that inform reputation, (2) stakeholder perceptions, and (3) functional utility. That is, the characteristics of a focal entity interact with stakeholder perceptions to form the entity's reputation. Reputation then carries a particular value, which, if positive, can result in positive outcomes. Considering this, Ferris et al. (2014) proposed the following definition of reputation: "a perceptual identity formed from the collective perceptions of others, which is reflective of the complex combination of salient entity characteristics and accomplishments, demonstrated behavior, and intended images presented over some period of time as observed directly and/or reported from secondary sources, which reduces ambiguity about expected future behavior" (Ferris et al., 2014, p. 272).


An essential element noted in this definition of reputation is its perceptual nature. That is, the focal individual does not own reputation. Instead, stakeholders form perceptions of the focal individual based on prior behavior indicative of performance and character. Another defining element of this definition is the idea of saliency. This extension argues that one cannot merely be of high character and a high performer. Instead, stakeholders need to be aware of the focal individual and her or his behaviors and accomplishments.

The definition proposed by Ferris et al. (2014) corresponds strongly with the most frequently referenced individual-level conceptualization (i.e., Hochwarter et al., 2007). This view argues that two informing elements—character/integrity and performance/results—build reputation. Further, it proposes that perceptions develop from a history of observed individual behavior. Recently, Zinko et al. (2016) proposed that reputation has three dimensions (i.e., task, integrity, and social). This conceptualization is similar to Hochwarter et al. in that it emphasizes performance and character, but it also addresses saliency (Ferris et al., 2014). Zinko et al. (2016) proposed that actor popularity affects reputation distinctly from one's expertise in an area.

Little research has investigated the second feature of reputation (i.e., stakeholder perceptions). Instead, scholars have studied the functional utility of reputation. Briefly, positive outcomes evolve from one's favorable reputation as perceived by others, including more resources, power, and behavioral discretion (e.g., Bartol & Martin, 1990; Wade, Porac, Pollock, & Graffin, 2006). In sum, although there has been little investigation into individual reputation, scholars report characteristics (i.e., performance/results, character/integrity, saliency/prominence) that build functional utility, and lead to positive outcomes for the focal individual.

Measurement. Hochwarter et al. (2007) developed the most widely cited measure of personal reputation. As mentioned above, this measure reflects the view that reputation represents an observer's opinion of a focal individual based on a history of behavior related to character/integrity and performance/results. Scholars have used the measure with both other-report (e.g., Laird, Zboja, & Ferris, 2012) and self-report (e.g., Hochwarter et al., 2007) responses, and have found empirical support for the performance and character dimensions of reputation assessed by the measure (e.g., Liu, Ferris, Zinko, Perrewé, Weitz, & Xu, 2007). Studies have found a strong relation between self-reports and other-reports of this measure, suggesting that the measure is reliable and capable of use with either method (Hochwarter et al., 2007; Laird et al., 2012). Some studies of personal reputation have focused on just a single aspect, generally assessing only the performance dimension of reputation (e.g., Liu et al., 2007).
Relatedly, a single dimension of reputation is standard at the organization level (e.g., Bromley, 2000; Deephouse & Carter, 2005). Such measures have caused some discussion regarding the validity of measures that only capture part of the construct (e.g., Rindova, Williamson, & Petkova, 2010; Zinko et al., 2016).

Critique and Future Research Directions. More reputation research is needed in order to establish, expand, and empirically test the reputation nomological network. As mentioned above, the vast majority of reputation research is at the organization level, and although a few scholars have investigated individual-level reputation for decades, only recently has it gained considerable attention (George, Dahlander, Graffin, & Sim, 2016). Although reputation affects the lives of every employee, this limited research means we have little understanding of its role and function in organizational politics. Because there is more research at the organization level, and because reputation appears to function similarly across levels of analysis (Ferris et al., 2014), it makes sense to examine the construct as a whole, cross-reference, and integrate the literature to gain greater understanding. Using the literature across levels and fields of analysis as foundations, many areas require attention.

Hochwarter et al. (2007) developed a measure of personal reputation that has seen common use and has shown adequate predictive validity (e.g., Laird et al., 2012). However, this measure does not account for all of the dimensions frequently discussed in the reputation literature. Although it does capture the performance and character dimensions, it does not capture the saliency/prominence dimension of reputation acknowledged by Ferris et al. (2014). Borrowing from an organization-level analysis, several scholars have argued for the inclusion of a prominence dimension. For example, Rindova et al. (2010) wrote, "prominence reflects the organization's relative centrality in collective attention and memory" (p. 615), in that it assesses the size or distinctness of reputation. The prominence dimension is essential and addresses how well known an individual is relative to peers. Hinkin (1998) noted that to study a construct effectively, it is essential to use a measure that adequately represents the construct. A first step to support investigations is the development and validation of a new measure that captures all three frequently mentioned dimensions of reputation (i.e., performance, character, prominence) (Hinkin, 1998; Schriesheim, Powers, Scandura, Gardiner, & Lankau, 1993). Once developed, a measure that captures the appropriate dimensions can be used to expand existing research. Accordingly, future research should conduct second-order factor analyses to explore whether a higher-order factor underlies the three dimensions. Such evidence would promote theory development by identifying contexts most predictive of reputational effects. Johns (2001, 2006, 2018) has called for desperately needed research into contextual effects in management research. That is, more research is needed regarding when and under which circumstances "known" relations exist (or do not), and why.
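To illustrate the second-order factor analysis suggested above, the sketch below specifies a higher-order confirmatory factor model in which performance, character, and prominence indicators load on first-order factors that in turn load on a general reputation factor. It assumes the Python semopy package and lavaan-style model syntax; the indicator names and simulated data are placeholders rather than items from any published reputation measure.

```python
# Minimal sketch: higher-order CFA for a hypothetical three-dimension reputation measure.
# Assumes the `semopy` package; indicators and data are simulated placeholders.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(7)
n = 600
general = rng.normal(size=n)  # higher-order "reputation" factor
dims = {d: 0.7 * general + 0.7 * rng.normal(size=n) for d in ("perf", "char", "prom")}

data = pd.DataFrame({
    f"{d}{i}": 0.8 * dims[d] + 0.6 * rng.normal(size=n)
    for d in dims for i in (1, 2, 3)
})

model_desc = """
perf =~ perf1 + perf2 + perf3
char =~ char1 + char2 + char3
prom =~ prom1 + prom2 + prom3
reputation =~ perf + char + prom
"""

model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())           # loadings for first- and second-order factors
print(semopy.calc_stats(model))  # overall fit statistics
```

In practice, such a model would be compared against a single-factor and a correlated three-factor alternative before any claims about a general reputation factor are made.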


Each dimension likely acts somewhat differently within the reputation-related nomological network. That is, different antecedents and consequences of reputation likely have different relations with each dimension. Future research should not only investigate reputation as a whole, but also seek to understand how, when, and why each dimension of reputation is more or less influential on related constructs.

To date, research at the individual level has primarily focused on the informing elements and functional utility features of reputation. The third element (i.e., stakeholder perceptions) has received very little attention. Indeed, it seems as though researchers generally avoid this important element altogether. This inattention is concerning, as reputation is a "perceptual identity formed from the collective perceptions of others" (Ferris et al., 2014, p. 62) residing "in the minds of external observers" (Rindova et al., 2010, p. 614). Despite this acceptance, there has been no empirical investigation of the functional role of stakeholder characteristics in reputation formation. The variance in how others may perceive a focal individual is a central theme in reputation development. Indeed, individuals interpret the same information differently (Branzei, Ursacki-Bryant, Vertinsky, & Zhang, 2004), and attribute behaviors to different causes (Heider, 1958; Kelley, 1973). Still, although this variance in perception is well established, its effect on the consequences of reputation (e.g., autonomy, financial reward) has received little attention.

Related to how stakeholders may interpret informing elements differently, and tying back into the measurement of personal reputation, is the obvious concern with the method by which reputation is measured. Although Hochwarter et al.'s (2007) measure has received statistical support (and convergence across self- and other-report indices), assessments came from focal individuals exclusively (Ferris et al., 2014). Although obtaining multiple assessments of a single focal individual is generally more complicated, a reputation assessment from a single individual offers minimal insight.

BROAD SCALE CRITIQUE AND DIRECTIONS FOR FUTURE RESEARCH

In our above review and analysis of the organizational politics literature, we identified methodological issues requiring attention in each designated topic area. Our general findings suggest that there are some conceptual challenges that must be overcome in order for the field to move forward. In the remaining sections of this review, we identify potential roadblocks, as well as opportunities presumed to augment research in a manner that is projective rather than reflective. To date, scholars have identified, evaluated, and incorporated recommendations into contemporary research designs (Ferris et al., 2019). However, as old concerns exit the list, new ones that embody ever-changing research realities and expectations take their place. Our discussion focuses on these externalities, which we feel have the most significant potential to affect studies contributing to the next generation of politics research. Specifically, we dissect conceptual, research design, and data collection issues, and provide to the best of our ability some potential remedies to common challenges within the field of organizational politics.


Conceptual Challenges

The decades of research on organizational politics notwithstanding, the field still suffers from a fundamental issue of conceptual incongruence, which poses a threat to construct validity (Ferris et al., 2019; Lepisto & Pratt, 2012; McFarland et al., 2012). This shortcoming is not tremendously surprising, as accurately defining and capturing motives and behaviors that are inherently concealed, informal, murky, or downright dishonest is no easy task. Adding to this complexity is the perspective that the word politics itself is a well-known, yet misunderstood, term within the popular lexicon, and these preconceived notions from practitioners and scholars alike can contaminate conceptualizations and measurement.

Ideally, a researcher would approach his or her studies with a tabula rasa, or clean slate (Craig & Douglas, 2011; Fendt & Sachs, 2008). However, as objective as individuals aim to be, researchers' personal experiences may influence how constructs are conceptualized and evaluated. Perhaps it is true that the commercialization of greed that occurred in the 1980s, when current conceptualizations of POPs and other political constructs were established, influenced how scholars defined and measured constructs within organizational politics. It may also be the case that the more modern positive psychology movement (e.g., Luthans & Avolio, 2009) has led scholars to search for positive aspects of organizational politics (Byrne, Manning, Weston, & Hochwarter, 2017; Elbanna, Kapoutsis, & Mellahi, 2017). We offer no formal definition here, but we do suggest that future attempts to unify organizational politics under a common conceptual understanding acknowledge that much of what goes on in organizations is informal and social, and that this reality allows for many different outcomes, both good and bad.

A unifying definition of organizational politics should also consider the full breadth of different behaviors and motivations. We have treated 'the politician' as an omnibus term rather than defining and refining precisely what it means. Given the complex nature of political constructs, we advocate the development of multidimensional constructs with both first- and second-order levels. This strategy allows practitioners to look for general main effects, as well as more nuanced relationships (e.g., Brouer, Badaway, Gallagher, & Haber, 2015). Specifically, it might be helpful to establish profiles of, and related to, political behavior. From a research design standpoint, methods such as latent profile analysis (Gabriel, Campbell, Djurdjevic, Johnson, & Rosen, 2018; Gabriel, Daniels, Diefendorff, & Greguras, 2015), cluster analysis (Maher et al., 2018), and qualitative comparative analysis (QCA; Misangyi, Greckhamer, Furnari, Fiss, Crilly, & Aguilera, 2017; Rihoux & Ragin, 2008) represent data analysis techniques that are currently underutilized in the organizational politics literature.
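As a simplified illustration of the person-centered approach just described, the sketch below fits finite mixture models to hypothetical political skill and political will scores and selects the number of profiles by BIC. It uses scikit-learn's GaussianMixture as an accessible stand-in for dedicated latent profile software (e.g., Mplus); the variable names, profile labels, and data are our assumptions.

```python
# Minimal sketch: profile-style analysis of hypothetical political skill / political will
# scores using Gaussian mixtures as a stand-in for latent profile analysis.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Simulate three underlying profiles (e.g., "savvy", "willing but unskilled", "apolitical").
centers = np.array([[4.0, 4.0], [2.5, 4.2], [2.0, 1.8]])
X = np.vstack([rng.normal(c, 0.5, size=(150, 2)) for c in centers])
df = pd.DataFrame(X, columns=["political_skill", "political_will"])

# Compare solutions with 1-6 profiles using BIC (lower is better).
bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(df)
    bics[k] = gm.bic(df)
best_k = min(bics, key=bics.get)

final = GaussianMixture(n_components=best_k, covariance_type="full", random_state=0).fit(df)
df["profile"] = final.predict(df)
print("BIC by number of profiles:", {k: round(v, 1) for k, v in bics.items()})
print(df.groupby("profile")[["political_skill", "political_will"]].mean().round(2))
```

The resulting profile means can then be related to outcomes (e.g., impression management configurations), which is the kind of nuanced pattern that variable-centered main-effect models miss.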


We also assert that the complexity of political constructs calls for more nuanced explorations, and scholars should consider theorizing about and testing nonlinear and moderated nonlinear effects of politics (Ferris, Bowen, Treadway, Hochwarter, Hall, & Perrewé, 2006; Grant & Schwartz, 2011; Hochwarter, Ferris, Laird, Treadway, & Gallagher, 2010; Maslyn et al., 2017; Pierce & Aguinis, 2013; Rosen & Hochwarter, 2014).

Another important consideration for the future of organizational research is to examine context (see Johns, 2006, 2018). The majority of POPs research has come from scholars and samples from the United States (for a few notable exceptions, please see Abbas & Raja, 2014; Basar & Basim, 2016; Eldor, 2016; Kapoutsis et al., 2017). We feel that it is imperative to incorporate different viewpoints from all corners of the world as we move towards a shared conceptual understanding of organizational politics. Failure to do so creates issues of construct adequacy (Arafat, Chowdhury, Qusar, & Hafez, 2016; Hult et al., 2008). Indeed, we expect broad and salient similarities across cultures, but politics may look, act, and feel different across different contexts. In addition, the role of context likely also affects politics at a more localized level. That is, among others, the type of organization (e.g., for-profit vs. not-for-profit), industry (e.g., finance vs. social services), and hierarchical level (e.g., top management teams vs. line managers) likely affect the prevalence, type, and process of organizational politics. Contextualizing our research will provide an abundance of different avenues from which we can continue to evaluate the how, what, when, why, and effectiveness of political action under different circumstances. This approach will help illuminate theory and help build a greater conceptual understanding of politics.

Lastly, although there are ample avenues for investigations that employ the contemporary political constructs discussed in this chapter, organizational politics scholars should not rest on their laurels concerning the development of new theories and constructs. We encourage the inclusion and development of new theories that could help explain political phenomena. For example, the organizational politics literature is rooted in the idea that individuals are not merely passive agents, but instead enact and respond to their environment. The fields of leadership and organizational politics are inextricably linked, and much as the field of leadership has placed emphasis on leaders over followers (Epitropaki, Kark, Mainemelis, & Lord, 2017), organizational scholars have focused on the actions of the influencers rather than the targets of those influences. This perspective ignores a century-old stream of research that spans the social sciences and argues that there is individual variation in the extent to which individuals are affected by their environment (Allport, 1920; Belsky & Pluess, 2009). Incorporating individuals' susceptibility to social influence into theories and models of organizational politics would restore balance to the contemporary biased perspective, and help alleviate concerns of omitted variable bias.
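Returning to the curvilinear point raised at the start of this subsection, such models are straightforward to estimate. The sketch below fits a quadratic effect of perceived politics on strain, moderated by political skill, with statsmodels; the variable names, simulated data, and mean-centering choice are ours, purely for illustration.

```python
# Minimal sketch: testing a moderated curvilinear (quadratic x moderator) model
# with simulated data. Variable names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 400
pops = rng.uniform(1, 5, n)    # perceived organizational politics
skill = rng.uniform(1, 5, n)   # political skill (hypothesized buffer)

# True model: inverted-U effect of politics on strain, flattened at high skill.
strain = (1.5 * pops - 0.20 * pops**2
          - 0.15 * pops * skill + 0.02 * (pops**2) * skill
          + rng.normal(0, 0.5, n))
df = pd.DataFrame({"strain": strain, "pops": pops, "skill": skill})

# Mean-center predictors to ease interpretation of lower-order terms.
df["pops_c"] = df["pops"] - df["pops"].mean()
df["skill_c"] = df["skill"] - df["skill"].mean()

model = smf.ols(
    "strain ~ pops_c + I(pops_c**2) + skill_c + pops_c:skill_c + I(pops_c**2):skill_c",
    data=df,
).fit()
print(model.summary().tables[1])  # the quadratic-by-moderator term tests moderated curvature
```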


Research Design Challenges

We now discuss the particular aspects of research design that make it difficult to establish precise connections between theory and measurement (Van Maanen, Sorensen, & Mitchell, 2007). Issues of conceptual agreement aside, organizational politics inherently suffers from the fact that many of its core constructs are inconspicuous and, in some cases, intentionally hidden from organizational members. Suspicions of backroom deals, fraternizing and favoritism, informal reciprocal agreements, and leader political support are environmental stimuli that run the gamut from objective and observable to speculation and hearsay. From a design standpoint, this makes getting a consensus on observations difficult, as individuals vary in the extent to which they observed stimuli directly or had stimuli described to them by a primary or secondary source. Further complicating matters is that ulterior motives, impression management, and deception can also cloud perceptions. Multiple individuals could observe the same action, but perceptions of motivation and intent will shape how individuals attribute those actions (Bolino, 1999).

Perhaps because of these difficulties, scholars have applied a somewhat similar set of research designs to study organizational politics. Most designs focus on individual attributes and perceptions, which is essential research but ignores the fact that organizational politics is inherently an interpersonal and multilevel field. In order to combat these systemic issues, we provide recommendations on how to design studies so that instruments are better able to measure the theoretical underpinnings of organizational politics.

First, in order to ensure that study participants are evaluating the same phenomenon, we echo the call for assessing specific foci and stimuli (Maslyn & Fedor, 1998). Specifically, researchers should highlight specific aspects of the environment so that subjects are responding to the same stimuli. Although this is not a new philosophy, its use does not seem to have fully caught on in organizational politics research. We find it particularly important, as the ubiquity and ambiguity of politics make measuring one particular aspect of the political environment rather tricky. For example, the statement "it is pretty political around here" could be interpreted as referring to more proximal group dynamics, leader dynamics, or more distal organizational dynamics, all of which will have different antecedents, outcomes, and boundary conditions.

Furthermore, we advocate the study of specific events rather than general appraisals of climate. Event systems theory argues that novel, disruptive, and critical events affect organizations across time (Morgeson et al., 2015). Applying a design of this nature would represent a break from the conventional research design, and would provide much needed illumination of the ways in which organizational politics creates temporal (Hochwarter, Ferris, Gavin, Perrewé, Hall, & Frink, 2007; Kiewitz, Restubog, Zagenczyk, & Hochwarter, 2009; Saleem, 2015) and multilevel (Dipboye & Foster, 2002; Rosen, Kacmar, Harris, Gavin, & Hochwarter, 2017) effects upon organizations.


Specifically, concerning multilevel research, it is vital that we recognize the effects that leaders can have on their followers (Ahearn et al., 2004; Douglas & Ammeter, 2004; Frieder et al., 2019; Treadway et al., 2004). This line of research is promising, and this type of design can help bridge the gap that exists between macro- and micro-level research within organizational politics (Lepisto & Pratt, 2012). At the same time, scholars can employ multilevel modeling to examine within-person effects over time. Doing so would be of great interest to those who want to know how constructs like political will, political skill, and reputation grow over time. Furthermore, experience-sampling approaches, also known as diary studies (Gabriel, Koopman, Rosen, & Johnson, 2018; Larson & Csikszentmihalyi, 1983; Lim, Ilies, Koopman, Christoforou, & Arvey, 2018), could illuminate how political constructs affect intrapsychic processes and individual attributes throughout the day.
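A minimal sketch of the kind of within-person design described above appears below: daily perceptions of politics predicting daily strain, with random intercepts and slopes across persons. It uses statsmodels' MixedLM on simulated diary data; the 10-day design, person-mean centering, and variable names are assumptions for illustration only.

```python
# Minimal sketch: within-person (diary) model of daily politics perceptions and strain.
# Simulated data; variable names and the 10-day design are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
persons, days = 80, 10

person_id = np.repeat(np.arange(persons), days)
intercepts = np.repeat(rng.normal(3.0, 0.6, persons), days)  # between-person differences
slopes = np.repeat(rng.normal(0.4, 0.2, persons), days)      # person-specific reactivity
daily_politics = rng.uniform(1, 5, persons * days)
strain = intercepts + slopes * (daily_politics - 3) + rng.normal(0, 0.5, persons * days)

df = pd.DataFrame({"person": person_id, "politics": daily_politics, "strain": strain})

# Person-mean center the predictor to isolate the within-person effect.
df["politics_w"] = df["politics"] - df.groupby("person")["politics"].transform("mean")

md = smf.mixedlm("strain ~ politics_w", df, groups=df["person"], re_formula="~politics_w")
fit = md.fit(reml=True)
print(fit.summary())  # fixed within-person slope plus random intercept/slope variances
```

The same structure extends naturally to persons nested in teams, which would speak to the multilevel concerns raised above.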


As a whole, organizational politics suffers from an imbalance of the three-horned dilemma (Runkel & McGrath, 1972). Researchers aim to collect and analyze data that promote realism, generalizability, and precision. However, these three aims trade off against one another. Specifically, increasing precision dilutes generalizability and realism, increasing generalizability dilutes precision and realism, and increasing realism dilutes generalizability and precision. To date, organizational politics research has focused on generalizability and has room for improvement in both realism and precision.

With rare exceptions (e.g., Doldor et al., 2013; Landells & Albrecht, 2017), qualitative research in organizational politics is meager. This omission is especially disappointing given the nuance, complexity, innuendo, and richness associated with political theory. As an example, given that POPs are a perceptual construct, and that qualitative work is based on an epistemology that espouses various constructions of reality, it seems as if the marriage of politics research and qualitative designs would be kismet. Perhaps it is not surprising to see a field with such conceptual disagreement have such a void of quality grounded theory work at its foundation. To improve theoretical richness and provide a foundation for more targeted quantitative inquiry, we call for qualitative designs (Lincoln & Guba, 1985) such as ethnography, interviews, and historical analyses that provide a richer understanding than is possible through traditional survey research methods.

Whereas qualitative research would bring more realism to the field, employing experimental designs would enable organizational politics scholars to evaluate political theories with more precision. Experimental designs involve the manipulation of an independent variable to test its effect on dependent variables. These designs have many advantages, as they provide reliable inferences for causality, can be employed to examine questions that are not legally or ethically viable in field studies, and are easily administered (McFarland et al., 2012). Only a small number of studies within the organizational politics literature have employed experimental designs (e.g., Kacmar, Wayne, & Wright, 1996; van Knippenberg & Steensma, 2003). Indeed, we understand how the complexities of organizational politics make it challenging to design scenarios and manipulations that fully capture the intertwined network of factors that affect political action. However, many opportunities exist to apply experimental designs to probe subjects like influence susceptibility and effectiveness, decision making, and differential effects of political skill dimensionality.

Our review indicates that the organizational politics literature has some fundamental issues for scholars to address. The covert nature of politics necessitates the measurement of multiple perspectives. We would like to see the field move towards a balance among designs that promote generalizability, realism, and precision, as each type of research design has its virtues and drawbacks. In order to achieve this balance, we call for individual studies and multi-study packages that employ qualitative or experimental designs that can augment the quantitative field designs that are more commonly used. Indeed, organizational politics produces complex phenomena, and no one design can adequately address this intricacy. Mixed methods research continues to grow in its popularity and sophistication (Clark, 2019; de Leeuw & Toepoel, 2018; Jehn & Jonsen, 2010; Molina-Azorin, Bergh, Corley, & Ketchen, 2017), and these types of studies can be skillfully designed to compensate for the inherent shortcomings in different individual research designs. Lastly, although not a unique issue within our field, the complexity of organizational politics research requires replications to ensure that theory is accurately describing the political realities that individuals face at work. Replications, extensions, and the use of multi-study research packages help us demonstrate these patterns of results so that we can be more confident in the validity of our findings, or adjust our theory by exploring new contexts (Hochwarter, Ferris, & Hanes, 2011; Kacmar, Bozeman, Carlson, & Anthony, 1999; Li, Liang, & Farh, 2020).

Data Collection Challenges

Although scholars from all disciplines face practical challenges in collecting data, some challenges are inherent to collecting data on organizational politics. Managers and HR practitioners can be reluctant to grant access to their employees when they fear that 'political' questions may poison the well and prime their employees to think about injustices coming from leadership. In this case, the plurality of meaning assigned to the word politics must be navigated with data collection gatekeepers as well, which requires educating gatekeepers and explaining the purpose of the study in other terms. Alternative phrases such as 'rules of the game,' 'informal channels,' and 'social dynamics' may accurately describe the nature of the study and help avoid the politically charged word politics.

One partial remedy to this problem has been the use of student-recruited samples (Hochwarter, 2014; Wheeler, Shanine, Leon, & Whitman, 2013), as there are fewer barriers to access with these samples.
Despite the potential pitfalls of this data collection method, these samples can increase the generalizability of a study, perhaps more so than a sample drawn from a single organization. We encourage the appropriate use of these samples (see Wheeler et al., 2013, for guidelines), especially in conjunction with other data collection methods as part of a multi-study package, as student-recruited sampling methods have the potential to attenuate the weaknesses of other study designs (e.g., interviews, experiments, single-site field studies). In a similar vein, technology has enabled us to gather data from different online sources such as Amazon Mechanical Turk and Qualtrics. Although these data sources can potentially suffer from some of the ills plaguing poorly designed and executed student-recruited samples, understanding their virtues and limitations can help scholars add strength to their empirical studies (Cheung, Burns, Sinclair, & Sliter, 2017; Couper, 2013; Das, Ester, & Kaczmirek, 2020; Finkel, Eastwick, & Reis, 2015; Jann, Krumpal, & Wolter, 2019; Porter, Outlaw, Gale, & Cho, 2019).

No matter where the data are collected, organizational scholars will still run into the inherent problem that organizational politics constructs are measured in imperfect ways because many of the field's core constructs are not directly observable. Thus, we will close with a final appeal to use multiple sources of information to illuminate political phenomena. There is an old Hindu parable about a collection of blind men who individually feel parts of an elephant, and then collectively share their knowledge to reach a shared conceptualization of the elephant. Given the hidden and often invisible nature of politics constructs, we too must rely on multiple accounts to achieve a collective understanding. For example, few studies have attempted to use objective measures of performance when assessing the proposed relations with political skill (see Ahearn et al., 2004, for an exception). Subjective measures of performance can be problematic, as those high in political skill can influence others, and likely their subjective performance assessments as well. Thus, collecting both objective and subjective performance data and employing congruence analysis can not only help us understand the quality of our data, but also extract theoretical richness. The same is true for constructs such as self- and other-reported political skill, leader political behavior, and perceptions of organizational politics. Polynomial regression and other forms of congruence analysis can help determine if and why subjects are or are not seeing things the same way (Cheung, 2009; Edwards, 1994; Edwards & Parry, 1993). Differences in these scores may well predict different outcomes, which can add to our theoretical understanding of political phenomena.
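To make the congruence idea concrete, the sketch below regresses an outcome on self- and other-rated political skill plus their quadratic and product terms, in the spirit of the polynomial regression approach described by Edwards and Parry (1993); the simulated data, variable names, and derived surface coefficients are illustrative only.

```python
# Minimal sketch: polynomial (congruence) regression of an outcome on self- and
# other-rated political skill. Data and variable names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 350
self_r = rng.uniform(1, 5, n)                             # self-rated political skill
other_r = np.clip(self_r + rng.normal(0, 0.8, n), 1, 5)   # other-rated, loosely related

# Outcome rewards agreement at high levels; noise added.
outcome = 0.5 * self_r + 0.7 * other_r - 0.3 * (self_r - other_r) ** 2 + rng.normal(0, 0.6, n)
df = pd.DataFrame({"outcome": outcome, "s": self_r, "o": other_r})

model = smf.ols("outcome ~ s + o + I(s**2) + I(s*o) + I(o**2)", data=df).fit()
b = model.params.to_numpy()  # [intercept, s, o, s^2, s*o, o^2] in formula order

# Response-surface slopes/curvatures along the agreement (s = o) and disagreement (s = -o) lines.
a1, a2 = b[1] + b[2], b[3] + b[4] + b[5]
a3, a4 = b[1] - b[2], b[3] - b[4] + b[5]
print(model.summary().tables[1])
print(f"a1={a1:.2f}, a2={a2:.2f}, a3={a3:.2f}, a4={a4:.2f}")
```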


CONCLUSION

The organizational politics literature has been going strong for decades, yet it still suffers from some of the fundamental problems that we see with fledgling streams of research. At the core of almost every political construct is the issue of conceptual clarity and congruence. Without a sound theoretical basis, measures exist on unstable grounds, and fault lines are sure to divide and divert what could be a sound collective research stream. In this chapter, we have reviewed and critically examined the theoretical bases and associated measures of the five significant constructs in the field, as well as the conventional research designs that predominate in our literature. This exercise has led us to point out some of the virtues and drawbacks of currently established measures and methods, and to take some hard looks in the mirror at our work. We hope that our suggestions help inspire and guide future research so that the collective strength of this invaluable field continues to grow.

REFERENCES

Abbas, M., & Raja, U. (2014). Impact of perceived organizational politics on supervisory-rated innovative performance and job stress: Evidence from Pakistan. Journal of Advanced Management Science, 2, 158–162.
Adams, G., Ammeter, A., Treadway, D., Ferris, G., Hochwarter, W., & Kolodinsky, R. (2002). Perceptions of organizational politics: Additional thoughts, reactions, and multi-level issues. In F. Yammarino & F. Dansereau (Eds.), Research in multi-level issues, Volume 1: The many faces of multi-level issues (pp. 287–294). Oxford, UK: Elsevier Science.
Ahearn, K., Ferris, G., Hochwarter, W., Douglas, C., & Ammeter, A. (2004). Leader political skill and team performance. Journal of Management, 30, 309–327.
Ahmad, J., Akhtar, H., ur Rahman, H., Imran, R., & ul Ain, N. (2017). Effect of diversified model of organizational politics on diversified emotional intelligence. Journal of Basic and Applied Sciences, 13, 375–385.
Allport, F. (1920). The influence of the group upon association and thought. Journal of Experimental Psychology, 3, 159–182.
Arafat, S., Chowdhury, H., Qusar, M., & Hafez, M. (2016). Cross-cultural adaptation and psychometric validation of research instruments: A methodological review. Journal of Behavioral Health, 5, 129–136.
Aryee, S., Chen, Z., & Budhwar, P. (2004). Exchange fairness and employee performance: An examination of the relationship between organizational politics and procedural justice. Organizational Behavior and Human Decision Processes, 94, 1–14.
Ashforth, B., & Lee, R. (1990). Defensive behavior in organizations: A preliminary model. Human Relations, 43, 621–648.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice Hall.
Barbuto, J., & Moss, J. (2006). Dispositional effects in intra-organizational influence tactics: A meta-analytic review. Journal of Leadership & Organizational Studies, 12, 30–48.
Bartol, K., & Martin, D. (1990). When politics pays: Factors influencing managerial compensation decisions. Personnel Psychology, 43, 599–614.
Basar, U., & Basim, N. (2016). A cross-sectional survey on consequences of nurses' burnout: Moderating role of organizational politics. Journal of Advanced Nursing, 72, 1838–1850.
Belsky, J., & Pluess, M. (2009). Beyond diathesis stress: Differential susceptibility to environmental influences. Psychological Bulletin, 135, 885–908.

Bing, M., Davison, H., Minor, I., Novicevic, M., & Frink, D. (2011). The prediction of task and contextual performance by political skill: A meta-analysis and moderator test. Journal of Vocational Behavior, 79, 563–577.
Blickle, G., Ferris, G., Munyon, T., Momm, T., Zettler, I., Schneider, P., & Buckley, M. (2011). A multi-source, multi-study investigation of job performance prediction by political skill. Applied Psychology, 60, 449–474.
Blickle, G., Schütte, N., & Wihler, A. (2018). Political will, work values, and objective career success: A novel approach – The Trait-Reputation-Identity Model. Journal of Vocational Behavior, 107, 42–56.
Blom-Hansen, J., & Finke, D. (2020). Reputation and organizational politics: Inside the EU Commission. The Journal of Politics, 82(1), 135–148.
Bolino, M. (1999). Citizenship and impression management: Good soldiers or good actors? Academy of Management Review, 24, 82–98.
Bolino, M., Long, D., & Turnley, W. (2016). Impression management in organizations: Critical questions, answers, and areas for future research. Annual Review of Organizational Psychology and Organizational Behavior, 3, 377–406.
Bolino, M., & Turnley, W. (1999). Measuring impression management in organizations: A scale development based on the Jones and Pittman taxonomy. Organizational Research Methods, 2, 187–206.
Branzei, O., Ursacki-Bryant, T., Vertinsky, I., & Zhang, W. (2004). The formation of green strategies in Chinese firms: Matching corporate environmental responses and individual principles. Strategic Management Journal, 25, 1075–1095.
Brecht, A. (1937). Bureaucratic sabotage. The Annals of the American Academy of Political and Social Science, 189, 48–57.
Bromley, D. (1993). Reputation, image, and impression management. New York, NY: Wiley.
Bromley, D. (2000). Psychological aspects of corporate identity, image and reputation. Corporate Reputation Review, 3, 240–253.
Brouer, R., Badaway, R., Gallagher, V., & Haber, J. (2015). Political skill dimensionality and impression management choice and effective use. Journal of Business and Psychology, 30, 217–233.
Brouer, R., Douglas, C., Treadway, D., & Ferris, G. (2013). Leader political skill, relationship quality, and leadership effectiveness: A two-study model test and constructive replication. Journal of Leadership & Organizational Studies, 20, 185–198.
Burris, E. (2012). The risks and rewards of speaking up: Managerial responses to employee voice. Academy of Management Journal, 55, 851–875.
Byrne, D. (1917). Executive session. Nash's Pall Mall Magazine, 59, 49–56.
Byrne, Z., Manning, S., Weston, J., & Hochwarter, W. (2017). All roads lead to well-being: Unexpected relationships between organizational POPs, employee engagement, and worker well-being. In C. Rosen & P. Perrewé (Eds.), Power, politics, and political skill in job stress (pp. 1–32). Bingley, UK: Emerald.
Cantoni, C. (1993). Eliminating bureaucracy-roots and all. Management Review, 82, 30–33.
Chang, C., Rosen, C., & Levy, P. (2009). The relationship between perceptions of organizational politics and employee attitudes, strain, and behavior: A meta-analytic examination. Academy of Management Journal, 52, 779–801.

Cheung, G. (2009). A multiple-perspective approach to data analysis in congruence research. Organizational Research Methods, 12(1), 63–68.
Cheung, J., Burns, D., Sinclair, R., & Sliter, M. (2017). Amazon Mechanical Turk in organizational psychology: An evaluation and practical recommendations. Journal of Business and Psychology, 32, 347–361.
Clark, V. (2019). Meaningful integration within mixed methods studies: Identifying why, what, when and how. Contemporary Educational Psychology, 57, 106–111.
Clark, L., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319.
Couper, M. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Survey Research Methods, 7, 145–156.
Craig, C., & Douglas, S. (2011). Assessing cross-cultural marketing theory and research: A commentary essay. Journal of Business Research, 64, 625–627.
Cullen, K., Gerbasi, A., & Chrobot-Mason, D. (2018). Thriving in central network positions: The role of political skill. Journal of Management, 44, 682–706.
Dahling, J., Gabriel, A., & MacGowan, R. (2017). Understanding typologies of feedback environment perceptions: A latent profile investigation. Journal of Vocational Behavior, 101, 133–148.
Dalal, R. (2005). A meta-analysis of the relationship between organizational citizenship behavior and counterproductive work behavior. Journal of Applied Psychology, 90, 1241–1255.
Das, M., Ester, P., & Kaczmirek, L. (Eds.). (2020). Social and behavioral research and the internet: Advances in applied methods and research strategies. New York, NY: Routledge.
de Leeuw, E., & Toepoel, V. (2018). Mixed-mode and mixed-device surveys. In D. Vannette & J. Krosnick (Eds.), The Palgrave handbook of survey research (pp. 51–61). Cham, Switzerland: Palgrave Macmillan.
Deephouse, D., & Carter, S. (2005). An examination of differences between organizational legitimacy and organizational reputation. Journal of Management Studies, 42, 329–360.
Dipboye, R., & Foster, J. (2002). Multi-level theorizing about perceptions of organizational politics. In F. Yammarino & F. Dansereau (Eds.), The many faces of multi-level issues (pp. 255–270). Oxford, UK: Elsevier Science.
Doldor, E., Anderson, D., & Vinnicombe, S. (2013). Refining the concept of political will: A gender perspective. British Journal of Management, 24, 414–427.
Donovan, J., Bateman, T., & Heggestad, E. (2013). Individual differences in work motivation: Current directions and future needs. In N. Christiansen & R. Tett (Eds.), Handbook of personality at work (pp. 121–128). New York, NY: Routledge.
Douglas, C., & Ammeter, A. (2004). An examination of leader political skill and its effect on ratings of leader effectiveness. The Leadership Quarterly, 15, 537–550.
Duckworth, A., & Quinn, P. (2009). Development and validation of the Short Grit Scale (GRIT–S). Journal of Personality Assessment, 91, 166–174.
Edwards, J. (1994). The study of congruence in organizational behavior research: Critique and a proposed alternative. Organizational Behavior and Human Decision Processes, 58, 51–100.

Edwards, J., & Parry, M. (1993). On the use of polynomial regression equations as an alternative to difference scores in organizational research. Academy of Management Journal, 36, 1577–1613.
Elbanna, S., Kapoutsis, I., & Mellahi, K. (2017). Creativity and propitiousness in strategic decision making: The role of positive politics and macro-economic uncertainty. Management Decision, 55, 2218–2236.
Eldor, L. (2016). Looking on the bright side: The positive role of organizational politics in the relationship between employee engagement and performance at work. Applied Psychology, 66, 233–259.
Ellen III, B. (2014). Considering the positive possibilities of leader political behavior. Journal of Organizational Behavior, 35, 892–896.
Epitropaki, O., Kark, R., Mainemelis, C., & Lord, R. G. (2017). Leadership and followership identity processes: A multilevel review. The Leadership Quarterly, 28, 104–129.
Farrell, D., & Petersen, J. (1982). Patterns of political behavior in organizations. Academy of Management Review, 7, 403–412.
Fedor, D., Maslyn, J., Farmer, S., & Bettenhausen, K. (2008). The contribution of positive politics to the prediction of employee reactions. Journal of Applied Social Psychology, 38, 76–96.
Fendt, J., & Sachs, W. (2008). Grounded theory method in management research: Users' perspectives. Organizational Research Methods, 11, 430–455.
Ferris, G., Adams, G., Kolodinsky, R., Hochwarter, W., & Ammeter, A. (2002). Perceptions of organizational politics: Theory and research directions. In F. Yammarino & F. Dansereau (Eds.), Research in multi-level issues, Volume 1: The many faces of multi-level issues (pp. 179–254). Oxford, UK: Elsevier.
Ferris, G., Berkson, H., Kaplan, D., Gilmore, D., Buckley, M., Hochwarter, W., et al. (1999). Development and initial validation of the political skill inventory. Paper presented at the 59th annual national meeting of the Academy of Management, Chicago.
Ferris, G., Blass, R., Douglas, C., Kolodinsky, R., & Treadway, D. (2003). Personal reputation in organizations. In J. Greenberg (Ed.), Organizational behavior: The state of the science (pp. 211–246). Mahwah, NJ: Lawrence Erlbaum.
Ferris, G. R., Bowen, M. G., Treadway, D. C., Hochwarter, W. A., Hall, A. T., & Perrewé, P. L. (2006). The assumed linearity of organizational phenomena: Implications for occupational stress and well-being. In P. L. Perrewé & D. C. Ganster (Eds.), Research in occupational stress and well-being (Vol. 5, pp. 205–232). Oxford, UK: Elsevier Science Ltd.
Ferris, G., Ellen, B., McAllister, C., & Maher, L. (2019). Reorganizing organizational politics research: A review of the literature and identification of future research directions. Annual Review of Organizational Psychology and Organizational Behavior, 6, 299–323.
Ferris, G., Fedor, D., & King, T. (1994). A political conceptualization of managerial behavior. Human Resource Management Review, 4, 1–34.
Ferris, G., Harrell-Cook, G., & Dulebohn, J. (2000). Organizational politics: The nature of the relationship between politics perceptions and political behavior. In S. Bacharach & E. Lawler (Eds.), Research in the sociology of organizations (pp. 89–130). Stamford, CT: JAI Press.

164 • MAHER, RUSSELL, JORDAN, FERRIS, & HOCHWARTER Ferris, G., Harris, J., Russell, Z., Ellen, B., Martinez, A., & Blass, F. (2014). The role of reputation in the organizational sciences: A multi-level review, construct assessment, and research directions. In M. Buckley, A. Wheeler, & J. Halbesleben (Eds.), Research in personnel and human resources management (pp. 241–303). Bingley, UK: Emerald. Ferris, G., Harris, J., Russell, Z., & Maher, L. (2018). Politics in organizations. In N. Anderson, D. Ones, & H. Sinangi (Eds.), The handbook of industrial, work, and organization psychology (pp. 514–531). Thousand Oaks, CA: Sage. Ferris, G., & Hochwarter, W. (2011). Organizational politics. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (pp. 435–459). Washington, DC: APA. Ferris, G., Hochwarter, W., Douglas, C., Blass, F., Kolodinsky, R., & Treadway, D. (2002b). Social influence processes in organizations and human resource systems. In G. Ferris, & J. Martocchio (Eds.), Research in personnel and human resources management (pp. 65–127). Oxford, U.K.: JAI Press/Elsevier Science. Ferris, G., & Judge, T. (1991). Personnel/human resources management: A political influence perspective. Journal of Management, 17, 447–488. Ferris, G., & Kacmar, K. (1989). Perceptions of organizational politics. Paper presented at the 49th Annual Academy of Management Meeting, Washington, DC. Ferris, G., & Kacmar, K. (1992). Perceptions of organizational politics. Journal of Management, 18, 93–116. Ferris, G., & King, T. (1991). Politics in human resources decisions: A walk on the dark side. Organizational Dynamics, 20, 59–71. Ferris, G., Perrewé, P., Daniels, S., Lawong, D., & Holmes, J. (2017). Social influence and politics in organizational research: What we know and what we need to know. Journal of Leadership & Organizational Studies, 24, 5–19. Ferris, G., Perrewe, P., & Douglas, C. (2002). Social effectiveness in organizations: Construct validity and research directions. Journal of Leadership and Organizational Studies, 9, 49–63. Ferris, G., Russ, G., & Fandt, P. (1989). Politics in organizations. In R. Giacalone & P. Rosenfeld (Eds.), Impression management in the organization (pp. 143–170). Hillsdale, NJ: Erlbaum. Ferris, G., & Treadway, D. (2012). Politics in organizations: History, construct specification, and research directions. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory and research considerations (pp. 3–26). New York, NY: Routledge/ Taylor and Francis. Ferris, G., Treadway, D., Brouer, R., & Munyon, T. (2012). Political skill in the organizational sciences. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory and research considerations (pp. 487–528). New York, NY: Routledge/Taylor & Francis. Ferris, G., Treadway, D., Kolodinsky, R., Hochwarter, W., Kacmar, C., Douglas, C., & Frink, D. D. (2005). Development and validation of the political skill inventory. Journal of Management, 31, 126–152. Ferris, G., Treadway, D., Perrewé, P., Brouer, R., Douglas, C., & Lux, S. (2007). Political skill in organizations. Journal of Management, 33, 290–320.

Research Methods in Organizational Politics  •  165 Finkel, E., Eastwick, P., & Reis, H. (2015). Best research practices in psychology: Illustrating epistemological and pragmatic considerations with the case of relationship science. Journal of Personality and Social Psychology, 108, 275–297. Franke, H., & Foerstl, K. (2018). Fostering integrated research on organizational politics and conflict in teams: A cross-phenomenal review. European Management Journal, 36, 593–607. French, J., & Raven, B. (1959). The bases of social power. In D. Cartwright & A. Zander (Eds.), Group dynamics (pp. 150–167). New York, NY: Harper & Row. Frieder, R. E., Ferris, G. R., Perrewé, P. L., Wihler, A., & Brooks, C. D. (2019). Extending the metatheoretical framework of social/political influence to leadership: Political skill effects on situational appraisals, responses, and evaluations by others. Personnel Psychology, 72(4), 543–569. Gabriel, A., Campbell, J., Djurdjevic, E., Johnson, R., & Rosen, C. (2018). Fuzzy profiles: comparing and contrasting latent profile analysis and fuzzy set qualitative comparative analysis for person-centered research. Organizational Research Methods, 21, 877–904. Gabriel, A., Daniels, M., Diefendorff, J., & Greguras, G. (2015). Emotional labor actors: A latent profile analysis of emotional labor strategies. Journal of Applied Psychology, 100, 863–879. Gabriel, A., Koopman, J., Rosen, C., & Johnson, R. (2018). Helping others or helping oneself? An episodic examination of the behavioral consequences of helping at work. Personnel Psychology, 71, 85–107. Gandz, J., & Murray, V. (1980). The experience of workplace politics. Academy of Management Journal, 23, 237–251. Gentry, W., Gilmore, D., Shuffler, M., & Leslie, J. (2012). Political skill as an indicator of promotability among multiple rater sources. Journal of Organizational Behavior, 33, 89–104. George, G., Dahlander, L., Graffin, S., & Sim, S. (2016). Reputation and status: Expanding the role of social evaluations in management research. Academy of Management Journal, 59, 1–13. Grams, W., & Rogers, R. (1990). Power and personality: Effects of Machiavellianism, need for approval, and motivation on use of influence tactics. Journal of General Psychology, 117, 71–82. Grant, A., & Schwartz, B. (2011). Too much of a good thing: The challenge and opportunity of the inverted U. Perspectives on Psychological Science, 6, 61–76. Guo, Y., Kang, H., Shao, B., & Halvorsen, B. (2019). Organizational politics as a blindfold: Employee work engagement is negatively related to supervisor-rated work outcomes when organizational politics is high. Personnel Review, 48, 784–798. Heider, F. (1958). The psychology of interpersonal relations. New York, NY: Wiley. Higgins, C., Judge, T., & Ferris, G. (2003). Influence tactics and work outcomes: A meta‐ analysis. Journal of Organizational Behavior, 24, 89–106. Hill, S., Thomas, A., & Meriac, J. (2016). Political behaviors, politics perceptions and work outcomes: Moving to an experimental study. In E. Vigoda-Gabot & A. Drory (Eds.), Handbook of organizational politics: Looking back and to the future (pp. 369–400). Northampton, MA: Edward Elgar Publishing. Hinkin, T. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1, 104–121.

166 • MAHER, RUSSELL, JORDAN, FERRIS, & HOCHWARTER Hochwarter, W. (2012). The positive side of organizational politics. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory and research considerations (pp. 20–45). New York, NY: Routledge/Taylor and Francis. Hochwarter, W. (2014). On the merits of student‐recruited sampling: Opinions a decade in the making. Journal of Occupational and Organizational Psychology, 87, 27–33. Hochwarter, W., Ferris, G., Gavin, M., Perrewé, P., Hall, A., & Frink, D. (2007). Political skill as neutralizer of felt accountability—Job tension effects on job performance ratings: A longitudinal investigation. Organizational Behavior and Human Decision Processes, 102, 226–239. Hochwarter, W., Ferris, G., & Hanes, T. (2011). Multi-study packages in organizational science research. In D. Ketchen & D. Bergh (Eds.), Building methodological bridges: Research methodology in strategy and management (pp. 163–199). Bingley, UK: Emerald. Hochwarter, W., Ferris, G., Laird, M., Treadway, D., & Gallagher, V. (2010). Nonlinear politics perceptions–work outcome relationships: A three-study, five-sample investigation. Journal of Management, 36, 740–763. Hochwarter, W., Ferris, G., Zinko, R., Arnell, B., & James, M. (2007). Reputation as a moderator of political behavior-work outcomes relationships: A two-study investigation with convergent results. Journal of Applied Psychology, 92, 567–576. Hochwarter, W., Kacmar, C., Perrewé, P., & Johnson, D. (2003). Perceived organizational support as a mediator of the relationship between politics perceptions and work outcomes. Journal of Vocational Behavior, 63, 438–456. Hochwarter, W., Kacmar, K., Treadway, D., & Watson, T. (2003). It’s all relative: The distinction and prediction of political perceptions across levels. Journal of Applied Social Psychology, 33, 1995–2016. Hochwarter, W., Summers, J., Thompson, K., Perrewé, P., & Ferris, G. (2010). Strain reactions to perceived entitlement behavior by others as a contextual stressor: Moderating role of political skill in three samples. Journal of Occupational Health Psychology, 15, 388–398. Hult, G., Ketchen, D., Griffith, D., Chabowski, B., Hamman, M., Dykes, B., Pollitte, W., & Cavusgil, S. (2008). An assessment of the measurement of performance in international business research. Journal of International Business Studies, 39, 1064–1080. Jann, B., Krumpal, I., & Wolter, F. (2019). Social desirability bias in surveys – Collecting and analyzing sensitive data. Methods, Data, Analyses, 13, 3–6. Jehn, K., & Jonsen, K. (2010). A multimethod approach to the study of sensitive organizational issues. Journal of Mixed Methods Research, 4, 313–341. Johns, G. (2001). In praise of context. Journal of Organizational Behavior, 22, 31–42. Johns, G. (2006). The essential impact of context on organizational behavior. Academy of Management Review, 31, 386–408. Johns, G. (2018). Advances in the treatment of context in organizational research. Annual Review of Organizational Psychology and Organizational Behavior, 5, 21–46. Jones, E. (1990). Interpersonal perception. New York, NY: W.H. Freeman. Jones, E., & Pittman, T. (1982). Toward a general theory of strategic self-presentation. Psychological Perspectives on the Self, 1, 231–262. Kacmar, K., & Baron, R. (1999). Organizational politics: The state of the field, links to related processes, and an agenda for future research. In G. Ferris (Ed.), Research in personnel and human resources management (pp. 1–39). Stamford, CT: JAI Press.

Research Methods in Organizational Politics  •  167 Kacmar, K., & Bozeman, D., Carlson, D., & Anthony, W. (1999). An examination of the perceptions of organizational politics model: Replication and extension. Human Relations, 52, 383–416. Kacmar, K., & Carlson, D. (1997). Further validation of the perceptions of politics scale (POPs): A multiple sample investigation. Journal of Management, 23, 627–658. Kacmar, K., & Ferris, G. (1991). Perceptions of organizational politics scale (POPs): Development and construct validation. Educational and Psychological Measurement, 51, 193–205. Kacmar, K., Wayne, S., & Wright, P. (1996). Subordinate reactions to the use of impression management tactics and feedback by the supervisor. Journal of Managerial Issues, 8, 35–53. Kapoutsis, I., Papalexandris, A., Treadway, D., & Bentley, J. (2017). Measuring political will in organizations: Theoretical construct development and empirical validation. Journal of Management, 43, 2252–2280. Kelley, H. (1973). The process of causal attributions. American Psychologist, 28, 107–128. Kidron, A., & Vinarski-Peretz, H. (2018). The political iceberg: The hidden side of leaders’ political behaviour. Leadership & Organization Development Journal, 39, 1010– 1023. Kiewitz, C., Restubog, S., Zagenczyk, T., & Hochwarter, W. (2009). The interactive effects of psychological contract breach and organizational politics on perceived organizational support: Evidence from two longitudinal studies. Journal of Management Studies, 46, 806–834. Kipnis, D., & Schmidt, S. (1988). Upward-influence styles: Relationship with performance evaluations, salary, and stress. Administrative Science Quarterly, 33, 528–542. Kipnis, D., Schmidt, S., & Wilkinson, I. (1980). Intraorganizational influence tactics: Explorations in getting one’s way. Journal of Applied Psychology, 65, 440–452. Kruse, E., Chancellor, J., & Lyubomirsky, S. (2017). State humility: Measurement, conceptual validation, and intrapersonal processes. Self and Identity, 16, 399–438. Lafrenière, M., Sedikides, C., & Lei, X. (2016). Regulatory fit in self-enhancement and self-protection: implications for life satisfaction in the west and the east. Journal of Happiness Studies, 17, 1111–1123. Laird, M., Zboja, J., & Ferris, G. (2012). Partial mediation of the political skill-reputation relationship, Career Development International, 17, 557–582. Lampaki, A., & Papadakis, V. (2018). The impact of organisational politics and trust in the top management team on strategic decision implementation success: A middle manager’s perspective. European Management Journal, 36, 627–637. Landells, E., & Albrecht, S. (2013). Organizational political climate: Shared perceptions about the building and use of power bases. Human Resource Management Review, 23, 357–365. Landells, E., & Albrecht, S. (2017). The positives and negatives of organizational politics: A qualitative study. Journal of Business and Psychology, 32, 41–58. Landry, H. (1969). Creativity and personality integration. Canadian Journal of Counselling and Psychotherapy, 3, 5–11. Larson, R., & Csikszentmihalyi, M. (1983). The experience sampling method. New Directions for Methodology of Social & Behavioral Science, 15, 41–56. Lasswell, H. (1936). Politics: Who gets what, when, how? New York, NY: Whittlesey.

168 • MAHER, RUSSELL, JORDAN, FERRIS, & HOCHWARTER Lee, S., Han, S., Cheong, M., Kim, S. L., & Yun, S. (2017). How do I get my way? A meta-analytic review of research on influence tactics. The Leadership Quarterly, 28, 210–228. LePine, J., Podsakoff, N., & LePine, M. (2005). A meta-analytic test of the challenge stressor–hindrance stressor framework: An explanation for inconsistent relationships among stressors and performance. Academy of Management Journal, 48, 764–775. Lepisto, D., & Pratt, M. (2012). Politics in perspectives: On the theoretical challenges and opportunities in studying organizational politics. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory and research considerations (pp. 67–98). New York, NY: Routledge/Taylor and Francis. Lewin, K. (1936). Principles of topological psychology. New York, NY: McGraw-Hill. Li, C., Liang, J., & Farh, J. L. (2020). Speaking up when water is Murky: An uncertaintybased model linking perceived organizational politics to employee voice. Journal of Management, 46(3), 443–469. Li, J., Wu, L., Liu, D., Kwan, H., & Liu, J. (2014). Insiders maintain voice: A psychological safety model of organizational politics. Asia Pacific Journal of Management, 31, 853–874. Liden, R., & Mitchell, T. (1988). Ingratiatory behaviors in organizational settings. Academy of Management Review, 13, 572–587. Lim, S., Ilies, R., Koopman, J., Christoforou, P., & Arvey, R. (2018). Emotional mechanisms linking incivility at work to aggression and withdrawal at home: An experience-sampling study. Journal of Management, 44, 2888–2908. Lincoln, Y., & Guba, E. (1985). Naturalistic observation. Thousand Oaks, CA: Sage Publications. Liu, Y., Ferris, G., Zinko, R., Perrewé, P., Weitz, B., & Xu, J. (2007). Dispositional antecedents and outcomes of political skill in organizations: a four-study investigation with convergence, Journal of Vocational Behavior, 71, 146–165. Liu, Y., Liu, J., & Wu, L. (2010). Are you willing and able? Roles of motivation, power, and politics in career growth. Journal of Management, 36, 1432–1460. Luthans, F., & Avolio, B. (2009). Inquiry unplugged: building on Hackman’s potential perils of POB. Journal of Organizational Behavior: The International Journal of Industrial, Occupational and Organizational Psychology and Behavior, 30, 323–328. Lux, S., Ferris, G., Brouer, R., Laird, M., & Summers, J. (2008). A multi-level conceptualization of organizational politics. In C. Cooper & J. Barling (Eds.), The SAGE handbook of organizational behavior (pp. 353–371). Thousand Oaks, CA: Sage. Machiavelli, N. (1952). The prince. New York, NY: New American Library (The translation of Machiavelli’s The Prince by Luigi Ricci was first published in 1903). Madison, D., Allen, R., Porter, L., Renwick, P., & Mayes, B. (1980). Organizational politics: An exploration of managers’ perceptions. Human Relations, 33, 79–100. Maher, L., Gallagher, V., Rossi, A., Ferris, G., & Perrewé, P. (2018). Political skill and will as predictors of impression management frequency and style: A three-study investigation. Journal of Vocational Behavior, 107, 276–294. Maslyn, J., Farmer, S., & Bettenhausen, K. (2017). When organizational politics matters: The effects of the perceived frequency and distance of experienced politics. Human Relations, 70, 1486–1513. Maslyn, J., & Fedor, D. (1998). Perceptions of politics: Does measuring different foci matter? Journal of Applied Psychology, 83, 645–653.

Research Methods in Organizational Politics  •  169 Matta, F., Scott, B., Colquitt, J., Koopman, J., & Passantino, L. (2017). Is consistently unfair better than sporadically fair? An investigation of justice variability and stress. Academy of Management Journal, 60, 743–770. Mayes, B., & Allen, R. (1977). Toward a definition of organizational politics. Academy of Management Review, 2, 672–678. McArthur, J. (1917). What a company officer should know. New York, NY: Harvey Press. Miller, B., Rutherford, M., & Kolodinsky, R. (2008). Perceptions of organizational politics: A meta-analysis of outcomes. Journal of Business and Psychology, 22, 209–222. Mintzberg, H. (1983). Power in and around organizations. Englewood Cliffs, NJ: PrenticeHall. Mintzberg, H. (1985). The organization as political arena. Journal of Management Studies, 22, 133–154. Misangyi, V., Greckhamer, T., Furnari, S., Fiss, P., Crilly, D., & Aguilera, R. (2017). Embracing causal complexity: The emergence of a neo-configurational perspective. Journal of Management, 43, 255–282. Mitchell, M., Baer, M., Ambrose, M., Folger, R., & Palmer, N. (2018). Cheating under pressure: A self-protection model of workplace cheating behavior. Journal of Applied Psychology, 103, 54–73. Molina-Azorin, J., Bergh, D., Corley, K., & Ketchen, D. (2017). Mixed methods in the organizational sciences: Taking stock and moving forward. Organizational Research Methods, 20, 179–192. Morgan, L. (1989). “Political will” and community participation in Costa Rican primary health care. Medical Anthropology Quarterly, 3, 232–245. Morgeson, F., Mitchell, T., & Liu, D. (2015). Event system theory: An event-oriented approach to the organizational sciences. Academy of Management Review, 40, 515– 537. Munyon, T., Summers, J., Thompson, K., & Ferris, G. (2015). Political skill and work outcomes: A theoretical extension, meta‐analytic investigation, and agenda for the future. Personnel Psychology, 68, 143–184. Nye, L., & Witt, L. (1993). Dimensionality and construct validity of the perceptions of organizational politics scale (POPS). Educational and Psychological Measurement, 53, 821–829. O’Shea, P. (1920). Employees’ magazines for factories, offices, and business organizations. New York, NY: Wilson. Perrewé, P., Zellars, K., Ferris, G., Rossi, A., Kacmar, C., & Ralston, D. (2004). Neutralizing job stressors: Political skill as an antidote to the dysfunctional consequences of role conflict. Academy of Management Journal, 47, 141–152. Pfeffer, J. (1981). Power in organizations. Marshfield, MA: Pitman. Pfeffer, J. (1992). Managing with power: Politics and influence in organizations. Boston, MA: Harvard Business Press. Pfeffer, J. (2010). Power: Why some people have it and others don’t. New York, NY: HarperCollins Publishers. Pierce, J., & Aguinis, H. (2013). The too-much-of-a-good-thing effect in management. Journal of Management, 39, 313–338. Porter, L. (1976). Organizations as political animals. Presidential address, Division of Industrial-Organizational Psychology, 84th Annual Meeting of the American Psychological Association, Washington, DC.

170 • MAHER, RUSSELL, JORDAN, FERRIS, & HOCHWARTER Porter, L., Allen, R., & Angle, H. (1981). The politics of upward influence in organizations. In L. Cummings, & B. Staw (Eds.), Research in organizational behavior (pp. 109–149). Greenwich, CT: JAI Press. Porter, C., Outlaw, R., Gale, J., & Cho, T. (2019). The use of online panel data in management research: A review and recommendations. Journal of Management, 45, 319–344. Post, L., Raile, A., & Raile, E. (2010). Defining political will. Politics & Policy, 38, 653– 676. Ravasi, D., Rindova, V., Etter, M., & Cornelissen, J. (2018). The formation of organizational reputation. Academy of Management Annals, 12, 574–599. Reitz, A., Motti-Stefanidi, F., & Asendorpf, J. (2016). Me, us, and them: Testing sociometer theory in a socially diverse real-life context. Journal of Personality and Social Psychology, 110, 908–920. Rihoux, B., & Ragin, C. (2008). Configurational comparative methods: Qualitative comparative analysis (QCA) and related techniques (Vol. 51). Thousand Oaks, CA: Sage Publications. Rindova, V., Williamson, I., & Petkova, A. (2010). Reputation as an intangible asset: Reflections on theory and methods in two empirical studies of business school reputations. Journal of Management, 36, 610–619. Rose, P., & Greeley, M. (2006). Education in fragile states: Capturing lessons and identifying good practice. Brighton, UK: DAC Fragile States Group. Rosen, C., Ferris, D., Brown, D., Chen, Y., & Yan, M. (2014). Perceptions of organizational politics: A need satisfaction paradigm. Organization Science, 25, 1026–1055. Rosen, C., & Hochwarter, W. (2014). Looking back and falling further behind: The moderating role of rumination on the relationship between organizational politics and employee attitudes, well-being, and performance. Organizational Behavior and Human Decision Processes, 124, 177–189. Rosen, C., Kacmar, K., Harris, K., Gavin, M., & Hochwarter, W. (2017). Workplace politics and performance appraisal: A two-study, multilevel field investigation. Journal of Leadership & Organizational Studies, 24, 20–38. Rosen, C., Koopman, J., Gabriel, A., & Johnson, R. (2016). Who strikes back? A daily investigation of when and why incivility begets incivility. Journal of Applied Psychology, 101, 1620–1634. Rosen, C., Levy, P., & Hall, R. (2006). Placing perceptions of politics in the context of the feedback environment, employee attitudes, and job performance. Journal of Applied Psychology, 91, 211–220. Runkel, P., & McGrath, J. (1972), Research on human behavior: A systematic guide to method, New York, NY: Holt, Rinehart and Winston, Inc. Salancik, G., & Pfeffer, J. (1978). A social information processing approach to job attitudes and task design. Administrative Science Quarterly, 23, 224–253. Saleem, H. (2015). The impact of leadership styles on job satisfaction and mediating role of perceived organizational politics. Procedia-Social and Behavioral Sciences, 172, 563–569. Schein, V. (1977). Individual power and political behaviors in organizations: An inadequately explored reality. Academy of Management Review, 2, 64–72. Schriesheim, C., Powers, K., Scandura, T., Gardiner, C., & Lankau, M. (1993). Improving construct measurement in management research: Comments and a quantitative

Research Methods in Organizational Politics  •  171 approach for assessing the theoretical content adequacy of paper-and-pencil surveytype instruments. Journal of Management, 19, 385–417. Sharfman, M., Wolf, G., Chase, R., & Tansik, D. (1988). Antecedents of organizational slack. Academy of Management Review, 13, 601–614. Shaughnessy, B., Treadway, D., Breland, J., & Perrewé, P. (2017). Informal leadership status and individual performance: The roles of political skill and political will. Journal of Leadership & Organizational Studies, 24, 83–94. Silvester, J., & Wyatt, M. (2018). Political effectiveness at work. In C. Viswesvaran, D. Ones, N. Anderson, & H. Sinangil (Eds.), Handbook of industrial work and organizational Psychology (pp. 228–247). London, UK: Sage. Smith, A., Plowman, D., Duchon, D., & Quinn, A. (2009). A qualitative study of highreputation plant managers: Political skill and successful outcomes. Journal of Operations Management, 27, 428–443. Smith, A., Watkins, M., Burke, M., Christian, M., Smith, C., Hall, A., & Simms, S. (2013). Gendered influence: A gender role perspective on the use and effectiveness of influence tactics. Journal of Management, 39, 1156–1183. Stolz, R. (1955). Is executive development coming of age? The Journal of Business, 28, 48–57. Sun, S., & Chen, H. (2017). Is political behavior a viable coping strategy to perceived organizational politics? Unveiling the underlying resource dynamics. Journal of Applied Psychology, 102, 1471–1482. Tedeschi, J., & Melburg, V. (1984). Impression management and influence in the organization. In S. Bacharach & E. Lawler (Eds.), Research in the sociology of organizations (Vol. 3, pp. 31–58). Greenwich, CT: JAI Press. Tedeschi, J., Melburg, V., Bacharach, S., & Lawler, E. (1984). Impression management and influence in the organization. In S. Bacharach & E. Lawler (Eds.), Research in the sociology of organizations (Vol. 3, pp. 31–58). Greenwich, CT: JAI Press. Tocher, N., Oswald, S., Shook, C., & Adams, G. (2012). Entrepreneur political skill and new venture performance: Extending the social competence perspective. Entrepreneurship & Regional Development: An International Journal, 24, 283–305. Treadway, D. (2012). Political will in organizations. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory and research considerations (pp. 531–566). New York, NY: Routledge/Taylor & Francis Group. Treadway, D., Hochwarter, W., Ferris, G., Kacmar, C., Douglas, C., Ammeter, A., & Buckley, M. (2004). Leader political skill and employee reactions. The Leadership Quarterly, 15, 493–513. Treadway, D., Hochwarter, W., Kacmar, C., & Ferris, G. (2005). Political will, political skill, and political behavior. Journal of Organizational Behavior, 26, 229–245. Tsui, A. (1984). A role set analysis of managerial reputation. Organizational Behavior and Human Performance, 34, 64–96. Turnley, W., & Feldman, D. (1999). The impact of psychological contract violations on exit, voice, loyalty, and neglect. Human Relations, 52, 895–922. Valle, M., & Perrewé, P. (2000). Do politics perceptions relate to political behaviors? Tests of an implicit assumption and expanded model. Human Relations, 53, 359–386. Van Dyne, L., & LePine, J. (1998). Helping and voice extra-role behaviors: Evidence of construct and predictive validity. Academy of Management Journal, 41, 108–119.

172 • MAHER, RUSSELL, JORDAN, FERRIS, & HOCHWARTER Van Knippenberg, B., & Steensma, H. (2003). Future interaction expectation and the use of soft and hard influence tactics. Applied Psychology, 52, 55–67. Van Maanen, J., Sorensen, J., & Mitchell, T. (2007). The interplay between theory and method. Academy of Management Review, 32, 1145–1154. Vecchio, R., & Sussman, M. (1991). Choice of influence tactics: individual and organizational determinants. Journal of Organizational Behavior, 12, 73–80. Vigoda, E. (2002). Stress-related aftermaths to workplace politics: The relationships among politics, job distress, and aggressive behavior in organizations. Journal of Organizational Behavior, 23, 571–588. Von Hippel, W., Lakin, J., & Shakarchi, R. (2005). Individual differences in motivated social cognition: The case of self-serving information processing. Personality and Social Psychology Bulletin, 31, 1347–1357. Wade, J., Porac, J., Pollock, T., & Graffin, S. (2006). The burden of celebrity: The impact of CEO certification contests on CEO pay and performance. Academy of Management Journal, 49, 643–660. Wheeler, A., Shanine, K., Leon, M., & Whitman, M. (2014). Student‐recruited samples in organizational research: A review, analysis, and guidelines for future research. Journal of Occupational and Organizational Psychology, 87, 1–26. Whitman, M., Halbesleben, J., & Shanine, K. (2013). Psychological entitlement and abusive supervision: Political skill as a self-regulatory mechanism. Health Care Management Review, 38, 248–257. Wickenberg, J., & Kylén, S. (2007). How frequent is organizational political behaviour? A study of managers’ opinions at 491 workplaces. In S. Reddy (Ed.), Organizational politics—New insights (pp. 82–94). Hyderabad, India: ICFAI University Press. Yukl, G., & Falbe, C. (1990). Influence tactics and objectives in upward, downward, and lateral influence attempts. Journal of Applied Psychology, 75, 132–140. Yukl, G., & Tracey, J. (1992). Consequences of influence tactics used with subordinates, peers, and the boss. Journal of Applied Psychology, 77, 525–535. Zanzi, A., Arthur, M., & Shamir, B. (1991). The relationships between career concerns and political tactics in organizations. Journal of Organizational Behavior, 12, 219–233. Zare, M., & Flinchbaugh, C. (2019). Voice, creativity, and big five personality traits: A meta-analysis. Human Performance, 32, 30–51. Zhang, Y., & Lu, C. (2009). Challenge stressor-hindrance stressor and employees’ work-related attitudes, and behaviors: The moderating effects of general self-efficacy. Acta Psychologica Sinica, 6, 501–509. Zinko, R., Gentry, W., & Laird, M. (2016). A development of the dimensions of personal reputation in organizations. International Journal of Organizational Analysis, 24, 634–649.

CHAPTER 8

RANGE RESTRICTION IN EMPLOYMENT INTERVIEWS

An Influence Too Big to Ignore

Allen I. Huffcutt

The emergence of meta-analysis as a formal research technique in the early 1980s raised awareness of the need to consider the influence of various statistical “artifacts” in research (e.g., Hunter, Schmidt, & Jackson, 1982). Sampling error, for instance, artificially increases variability across coefficients, which could result in the conclusion that validity is highly specific to individual selection situations and not generalizable. Measurement error (particularly in performance criteria) and restriction in range (hereafter range restriction) reduce the magnitude of validity coefficients, thereby making selection approaches appear less effective than they really are in predicting job performance. Construct validity can also be affected by artifacts. For example, range restriction can artificially reduce correlations in a multitrait-multimethod (MTMM) analysis, lowering confidence that similar measures are assessing a common construct.

The effect of range restriction, potentially the most potent statistical artifact, is particularly troublesome with employment interviews. In most selection systems, assessment of candidates occurs sequentially. Completing an application blank first is standard practice. After that, it is common to administer measures that are relatively quick and reasonably inexpensive, such as those for ability and


personality. Procedures that are the most time intensive and/or expensive tend to be implemented at (or towards) the end, usually after a number of candidates have been eliminated. Interviews typically fall in this latter category. The later the interview is in the selection process, the greater the possibility for (and extent of) range restriction. The degree to which range restriction can diminish the magnitude of validity coefficients is illustrated in several of the larger and more prominent interview meta-analyses. For instance, McDaniel, Whetzel, Schmidt, and Maurer (1994) reported that the mean population validity of the Situational Interview or SI (Latham, Saari, Pursell, & Campion, 1980) rose from .35 to .50 after further correction for range restriction. Expressed as a percent-of-variance, SIs accounted for 12% of performance variance without the correction and 25% after. (As explained later, McDaniel et al.’s correction is most likely conservative because it was based on direct rather than indirect restriction.) Yet, most primary interview researchers—those actually conducting studies rather than meta-analytically summarizing them—fail to take range restriction into account. The lack of attention is evident with a quick search on PsycINFO (search date: 4-11-2018). Entering “job applicant interviews” as the subject (SU) term resulted in 1,472 entries. Adding “range restriction” as a second search term anywhere in the document reduced that number to only seven. Using the alternate term “restriction in range” resulted in the same number. Additional evidence comes again from the McDaniel et al. (1994) interview meta-analysis, where only 14 of the 160 total studies in their dataset reported range restriction information (see pp. 605–606). Such widespread lack of consideration is surprising in one respect because the mathematics behind range restriction, and the equations to correct for it, have been around for a long time. Building on the earlier work of Pearson (1903), for example, Thorndike (1949) presented the relatively straightforward procedure needed to correct for direct (i.e., Case II) restriction.1 The first meta-analysis book in Industrial-Organizational (I-O) psychology, Hunter et al. (1982), also outlined the correction procedure for direct range restriction and illustrated how to utilize it in selection research. Unfortunately, the issue of range restriction got more complex for employment interview researchers in the mid-2000s. Throughout essentially the entire history of selection research, restriction was largely presumed to be direct. Schmidt and colleagues (e.g., Hunter & Schmidt, 2004; Hunter, Schmidt, & Le, 2006) made the assertion that most restriction is actually indirect rather than direct. Thorndike (1949) provided the equations needed to correct for indirect (Case III) restriction as well, but they were generally not viable for selection contexts because too much of the needed information was unknown. Hunter et al. were able to simplify the mathematics to make the indirect correction more feasible, although it is still more complicated than direct. To distinguish their methodology from that of Thorndike, they named their procedure Case IV.


Interview researchers as a whole appear to have paid even less attention to the indirect form of restriction. Another search on PsycINFO (same date) combining “job applicant interviews” as the subject term with “indirect restriction” as a general term anywhere in the document yielded only three entries.2 The first was an interview meta-analysis that incorporated indirect restriction as its primary purpose (Huffcutt, Culbertson, & Weyhrauch, 2014a). The second was a reanalysis of the McDaniel et al. (1994) interview dataset using indirect methodology (Oh, Postlethwaite, & Schmidt, 2013; see also Le & Schmidt, 2006). The third was a general commentary on indirect restriction (Schmitt, 2007).

Failure to account for range restriction in a primary interview study (and in a meta-analysis, for that matter) can result in inaccurate or even mistaken conclusions. Consider a company that wants to switch from traditional, unstructured interviews to a new structured format such as an SI or a Behavior Description Interview or BDI (Janz, 1982), but isn’t completely sure doing so is worth the time and administrative trouble. If range restriction is present, which is likely, the resulting validity coefficient will be artificially low. It might even be low enough that the company decides not to make the switch.

One possible reason for the lack of consistent attention to range restriction among interview researchers is that they don’t have a good, intuitive feel for its effects. Graduate school treatment of range restriction, along with prominent psychometric textbooks (e.g., Nunnally, 1978), tends to focus only on the corrective formulas. Visual presentation, such as scatterplots, is often missing. Another potential reason is that some of the needed information (e.g., the unrestricted standard deviation of interview ratings in the applicant population) may not be readily available given the customized nature of interviews (e.g., as opposed to standardized ability measures). A final reason, one based on convenience and/or expense, is that interview researchers may not feel they have the time to refamiliarize themselves with the correction process, which is not always presented in a user-friendly manner, or to purchase meta-analytic software that they would only use periodically or perhaps even once (e.g., Schmidt & Le, 2014).

The overarching purpose of this manuscript is to provide a convenient, all-in-one reference for interview researchers to help them deal with range restriction. Subsumed under this purpose is an overview of the basic concepts and mechanics of range restriction (including the all-important difference between its direct and indirect forms), visual presentation of restriction effects via scatterplots to enhance intuitive understanding, and a summary of equations and procedures for their use in restriction correction. Further, realistic simulations are utilized to derive some of the most difficult parameters for interviews, and then these parameters are built into the correction equations in order to simplify them.

UNRESTRICTED INTERVIEW POPULATION

The starting point is creation of a hypothetical but realistic distribution of interview and performance ratings, one that is totally unrestricted. The focus is on


high-structure interviews, as they consistently show the highest validity and are considerably more standardized across situations than unstructured ones. For instance, although the content of questions varies, all SIs consist of hypothetical scenarios while all BDIs focus exclusively on description of past experiences. In contrast, the content, nature, and even format of unstructured interviews can vary immensely by interviewer and even by interview. Indeed, it is not surprising that unstructured interviews have been likened to a “disorganized conversation.”

A key parameter in this distribution is the population correlation between highly structured interview ratings and job performance. At the present time, the best available estimate appears to be the fully corrected (via indirect methodology) population value (i.e., rho) of .69 from Huffcutt et al. (2014a). They provided population estimates for four levels of structure (none to highly structured), and this value is for the highest level (see p. 303). This level includes virtually all SIs, and a majority of BDIs. (BDIs can be conducted using more of an intermediate level of structure, such as allowing interviewers to choose questions from a bank and to probe extensively; such studies usually reside at Level 3.) Their value of .69 is corrected for both unreliability in performance assessment and range restriction, but not for interview reliability. As explained in more detail below, such a correction results in “operational validity” rather than a construct-to-construct association, that is, the degree to which a predictor (in its actual, imperfect state) is associated with true performance.

To enhance realism (and out of the necessity of choosing a scaling), interview parameters from Weekley and Gier (1987) were utilized. They developed an SI to select entry-level associates in a national retail outlet. The sample question they provide (see p. 485) about an angry customer whose watch is late coming back from being repaired is cited regularly as an example of the SI format. Their final interview contained 16 questions, all rated using the typical five-point scale that has behavioral benchmarks at one, three, and five, resulting in a possible range of 16 to 80 with a midpoint of 48. Using Excel, a normal distribution was generated with a mean of 48.0 (the midpoint) and a standard deviation of 7.4 (the actual sd in their validity sample; see p. 486). In regard to sample size, 100 was chosen out of convenience. These parameters should be reasonably representative of high-structure interviews in general.3

On the performance side, the goal was to create a second distribution that correlated .69 with the original distribution of interview ratings. Using Excel, the interview distribution was copied, sufficient measurement error was added to reduce the correlation with the original distribution to .69, and then the result was rescaled to have a mean of 50.0 and a standard deviation of 10.0 (i.e., T scaling). Given the extremely wide variation in performance rating formats across studies, this particular scaling was chosen out of convenience assuming that it was reasonably representative.
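For readers who prefer a scripted workflow to Excel, the same kind of simulation can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual Excel procedure used here; the variable names, the use of NumPy, and the random seed are assumptions, and with only 100 cases the sample correlation will land near (rather than exactly at) .69 unless the added error is tuned as described above.

import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed
n, rho = 100, .69

# Interview ratings: 16 questions rated 1-5, modeled as normal with
# mean 48.0 and sd 7.4 (the Weekley & Gier parameters noted above)
interview = rng.normal(48.0, 7.4, n)

# Error-free performance: correlated .69 with interview ratings in
# expectation, then rescaled to T scores (mean 50, sd 10)
z_int = (interview - interview.mean()) / interview.std()
latent = rho * z_int + np.sqrt(1 - rho**2) * rng.normal(size=n)
perf_true = 50.0 + 10.0 * (latent - latent.mean()) / latent.std()

print(np.corrcoef(interview, perf_true)[0, 1])   # close to .69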


FIGURE 8.1.  Scatterplot between unrestricted structured interview ratings and error-free ratings of job performance. The sample size is 100 and the correlation is .69.

The resulting bivariate distribution is shown in Figure 8.1. Conceptually, this scatterplot illustrates the results of a selection situation where 100 applicants apply, all are interviewed with a highly structured interview, all are hired, and then error-free (i.e., true score) performance ratings are collected. (Implicit in this scenario is the lack of attrition, something that is built into upcoming scenarios, specifically Scenarios 2 and 4.) Now various scenarios that build upon this distribution are presented.

Scenario 1: All Applicants are Hired—No Restriction or Attrition

This scenario represents the infrequent but not unheard-of case where there is no restriction in range on the predictor, and all (or most) who are hired remain on the job. Technically speaking, no studies should fall into this category unless every person who applies is hired without any consideration of interview ratings and there is no prior selection of any kind (even from an application blank, since that can cause indirect restriction). Practically speaking, some interview studies could reasonably be considered to fall into it. For instance, in their Study 3, Latham et al. (1980) noted that “The situational interview was administered to 56 applicants for entry-level work in a pulp mill, all of whom were subsequently hired” (p. 425). Additional examples include Benz (1974) and McMurry (1947).

Unfortunately, even with no restriction (or attrition), a scatterplot of actual data with these parameters would not mirror that in Figure 8.1. The reason is that there is no such thing as error-free performance ratings in organizational contexts.


Performance ratings contain measurement error, a considerable amount actually (Hunter et al., 2006), which diminishes their correlation with interview ratings. Although various values of interrater reliability (i.e., IRR) for performance ratings can be found in the literature, the most common appears to be .52 (Viswesvaran, Ones, & Schmidt, 1996; see also Rothstein, 1990). Using the upcoming Formula 1 in reverse, measurement error in the performance ratings reduces the magnitude of their correlation with interview ratings from .69 to .50 (assuming an IRR of .52).

This level of association is illustrated as a scatterplot in Figure 8.2. Using Excel, the error-free performance distribution was copied and sufficient measurement error was added to reduce the correlation with the original interview distribution to .50. Finally, it was rescaled using T scaling again. Using the same scaling as in Figure 8.1 enhances comparison and highlights the effect of performance measurement error. To illustrate, visual inspection suggests that, around an interview rating of 50, the range of performance ratings increases from roughly 26 points to roughly 35 points.

Although the primary focus of this manuscript is on range restriction, the visible difference between these two figures highlights the importance of also correcting for measurement error in performance assessment. Readers may wonder why no correction was made for measurement error on the interview side, since clearly it is there as well (Conway, Jako, & Goodman, 1995). Measurement error in performance ratings is artifactual because these ratings (often made by supervisors or managers) do not reflect true performance on the job due to influences such as bias, contrast effects, and halo. Conversely, the

FIGURE 8.2.  Scatterplot between unrestricted structured interview ratings and ratings of job performance, but with measurement error in the performance ratings. The correlation is .50.


ratings interviewers make, while error-prone, are used to make actual selection decisions. Hence, the coefficient obtained by correcting for performance measurement error alone is often referred to as “operational validity” (Schmidt, Hunter, Pearlman, & Rothstein-Hirsh, 1985, p. 763). Interviewer ratings can be corrected as well, and doing so provides valuable (albeit theoretical) information on construct associations. To illustrate, Huffcutt, Roth, and McDaniel (1996) corrected for measurement error in both interview ratings and cognitive ability test scores (see p. 465) in order to assess the degree of construct saturation of the latter in the former.

Statistically, correcting for measurement error in performance ratings is accomplished as shown in Formula 1, where $r_o$ is the observed (actual) validity coefficient and $r_{yy}$ is the performance IRR (i.e., .52). Note that the correction involves the square root of the reliability. The correction returns the validity coefficient to its full population value. Readers are referred to Schmidt and Hunter (2015) for more information on this correction (see p. 112).

$r_c = \frac{r_o}{\sqrt{r_{yy}}} = \frac{r_o}{\sqrt{.52}} = \frac{r_o}{.72} = \frac{.50}{.72} = .69$  (1)
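Formula 1 reduces to a one-line calculation. The snippet below is an illustrative Python sketch (the function name and the default IRR argument are choices made here, not part of the chapter):

import math

def operational_validity(r_o: float, r_yy: float = .52) -> float:
    # Formula 1: divide the observed validity by the square root of the criterion IRR
    return r_o / math.sqrt(r_yy)

print(round(operational_validity(.50), 2))   # 0.69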

If an interview researcher has a study that fits this scenario (at least to a reasonable degree), the correction is simple: just divide the actual validity coefficient by .72. Situations where a new structured interview is being pilot tested with applicants (not incumbents) and is not used to make actual selection decisions would be particularly relevant, especially if a high majority of applicants are hired and retained once on the job.

Scenario 2: All Applicants Are Hired—No Restriction But There Is Attrition

Short of financial difficulties that necessitate layoffs, there are organizations and/or job areas where employees are rarely let go and don’t frequently leave (e.g., union shops, civil service). That situation is represented reasonably well by Scenario 1 (again, assuming a high majority of applicants are hired). In other work environments, some degree of attrition is common. Such attrition often comes from both the top and bottom of performance levels, as top performers get promoted or leave and bottom performers get let go or reassigned (Sackett, Laczo, & Arvey, 2002). There are exceptions, of course, such as when top or bottom employees leave but not both. Attrition typically results in restriction in performance ratings, which artificially lowers the validity coefficient.

Although the focus of this manuscript is on restriction in interview ratings, a general correction for attrition is offered. The type and degree of attrition no doubt vary considerably across situations, and modeling a broad spectrum of possibilities is well beyond the scope of this study. Out of necessity, one hopefully common and realistic scenario is modeled. Huffcutt, Culbertson, and Weyhrauch (2014b) posited


a general scenario of 10% attrition, 5% from the top and 5% from the bottom. Based on a simulation, they derived a range restriction ratio (u) of .80 (see p. 550), which is the standard deviation of the restricted ratings (R) divided by the standard deviation of the unrestricted ratings (P) in the population (i.e., $sd_R / sd_P$).

In regard to correction for attrition, a key question is whether it represents direct or indirect restriction. Given that all (or most) applicants are hired in this scenario and that attrition is an end-stage phenomenon, it seems reasonable to view it as direct. The formulas for direct restriction (Hunter & Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 215; see also Callender & Osburn, 1980, p. 549) are generally intended for the predictor (here interviews). However, as noted by Schmidt and Hunter (2015, p. 48), the effects of the predictor and the criterion on the validity coefficient are symmetrical; hence, the same formulas can be used to correct for attrition restriction by itself. If there happens to be both restriction on the predictor and attrition, then things get considerably more complicated (see Schmidt & Hunter, p. 48, for a discussion). This situation is addressed in Scenario 4.

In regard to procedure, the direct correction equation from Callender and Osburn (1980, p. 549) appears to be particularly popular and widely used (see Hunter & Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 37). It is presented as Formula 2. The key component in this equation is u, the range restriction ratio noted above.

$r_c = \frac{r_o}{\sqrt{(1 - u^2)r_o^2 + u^2}}$  (2)
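As a quick illustration (a sketch only; the function name and the example observed validity of .40 are hypothetical), Formula 2 can be scripted as follows:

import math

def correct_direct_restriction(r_o: float, u: float) -> float:
    # Formula 2: correct an observed correlation for direct range restriction,
    # where u is the restricted-to-unrestricted standard deviation ratio
    return r_o / math.sqrt((1 - u**2) * r_o**2 + u**2)

print(round(correct_direct_restriction(.40, .80), 2))   # 0.48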

Substituting .80 for u, the equation becomes:

$r_c = \frac{r_o}{\sqrt{(1 - .80^2)r_o^2 + .80^2}} = \frac{r_o}{\sqrt{.36 r_o^2 + .64}}$  (3)

The above correction restores the validity coefficient to what it would have been had there not been any attrition. To estimate operational validity, however, an additional correction needs to be made for measurement error in the performance ratings, which, fortunately, can be combined with the correction for direct restriction. That equation is presented as Formula 4. As before, .52 is used for the IRR of performance ratings.

$r_c = \frac{r_o}{\sqrt{.52}\,\sqrt{(1 - .64)r_o^2 + .64}} = \frac{r_o}{.72\sqrt{.36 r_o^2 + .64}}$  (4)
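For convenience, Formula 4 can also be expressed as a small function. This is an illustrative sketch; the function name and the example observed validity of .40 are hypothetical, and the defaults simply encode the values assumed in this scenario (u = .80, IRR = .52):

import math

def scenario2_correction(r_o: float, u: float = .80, r_yy: float = .52) -> float:
    # Formula 4: joint correction for attrition-based restriction (u = .80)
    # and measurement error in performance ratings (IRR = .52)
    return r_o / (math.sqrt(r_yy) * math.sqrt((1 - u**2) * r_o**2 + u**2))

print(round(scenario2_correction(.40), 2))   # roughly 0.66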

Using this equation, an interview researcher with a study that fits reasonably well with this scenario simply has to enter his/her observed correlation in the


last part of the above equation and do the computations to find an estimate of the corrected (population) correlation. Situations where a new structured interview is being pilot tested with applicants and is not used to make actual selection decisions, a high majority of applicants are hired and/or hiring is done without strong reference to merit, and there is moderate (but not extreme) attrition by the time performance ratings are collected (from both the top and bottom) would be particularly relevant.

Scenario 3: Hiring Based Solely on Interview Ratings—Direct Restriction, No Attrition

Case II (direct) restriction occurs when the new structured interview format being evaluated is actually used in a strictly top-down fashion to select employees. Technically speaking, to fall into this category, studies would have to have no prior selection whatsoever (again, even from an application blank). Practically speaking, some studies could reasonably be classified as such. To illustrate, Arvey, Miller, Gould, and Burch (1987) noted that “the interview was the sole basis for hiring decisions” (p. 3), while Robertson, Gratton, and Rout (1990) noted that applicants were placed “as a direct consequence of their performance in the situational interview” (p. 72). Although the number of studies in the interview literature with direct restriction is limited (see Huffcutt et al., 2014a, Table 1), there may be more in practice. One such source could be promotions (Hunter et al., 2006), a type of study that is reported far less often in the literature than initial selection but occurs frequently in the workplace.

The degree of direct restriction is a function of the selection ratio. If, say, 90% of interviewees are hired, the restriction would not be extensive. If 10% are hired, there should be considerably more restriction. To illustrate the progressive effects of direct restriction graphically, scatterplots were formed for hiring percentages of 90, 50, and 10, respectively. There no doubt are situations where the hiring percentage is outside these end values or between the middle value and an endpoint. Nonetheless, these three levels should illustrate a sufficiently broad spectrum of real employment scenarios.

To form these scatterplots, the structured interview ratings from the original distribution (with a rho of .69) were rank-ordered and the appropriate number were eliminated (e.g., the bottom 10% for 90% hiring). These data were used rather than the subsequent data shown in Figure 8.2 where measurement error is added to performance because this measurement error, in theory, has not occurred yet and won’t occur until after restriction in interview ratings has taken place (and the individuals have been on the job long enough to be evaluated). From the data remaining after the appropriate elimination, the actual correlation between interview ratings and performance was computed. Then, using Formula 1 in reverse (with an IRR of .52), the value of the validity coefficient with performance error induced was estimated. To portray this level of association visually, the error-free performance ratings for each hiring level were copied and


FIGURE 8.3.  Scatterplots illustrating the association between interview and job performance ratings with 90%, 50%, and 10% hiring respectively (and measurement error in performance ratings). The correlations are .44, .39, and .29 respectively.


sufficient measurement error was added to reduce the correlation with interview ratings to the estimated value with performance error induced. Finally, a scatterplot was created. The scatterplots for all three levels of hiring are shown in Figure 8.3.

The traditional way to correct for direct range restriction and performance measurement error is to do the corrections simultaneously, combining Formulas 1 and 2 as shown in Formula 5 below (Callender & Osburn, 1980, p. 549; Hunter & Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 215). This is essentially what was done in Formula 4 in the correction for attrition. As in that case, the key parameter is the range restriction ratio u, which here is the ratio of the restricted standard deviation of interview ratings to the unrestricted one.

$r_c = \frac{r_o}{\sqrt{r_{yy}}\,\sqrt{(1 - u^2)r_o^2 + u^2}}$  (5)
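Readers who want to see these mechanics in action can simulate them directly. The sketch below is not the N = 100 Excel dataset described above; it draws a much larger sample so that sampling error is negligible, treats .52 as the IRR of the unrestricted performance ratings, and imposes strict top-down selection on the interview scores. The resulting u and observed validity values will therefore differ somewhat from the single-sample figures reported for Figure 8.3, but each corrected value should land close to the population value of .69. Real applications should also mind the distinction between restricted and unrestricted criterion reliability taken up next.

import numpy as np

rng = np.random.default_rng(1)
n, rho, r_yy = 100_000, .69, .52

interview = rng.normal(48.0, 7.4, n)
z_int = (interview - interview.mean()) / interview.std()
perf_true = rho * z_int + np.sqrt(1 - rho**2) * rng.normal(size=n)
# observed performance ratings contain measurement error (IRR = .52)
perf_obs = np.sqrt(r_yy) * perf_true + np.sqrt(1 - r_yy) * rng.normal(size=n)

for hire_pct in (.90, .50, .10):
    cutoff = np.quantile(interview, 1 - hire_pct)
    hired = interview >= cutoff                      # strict top-down selection
    r_o = np.corrcoef(interview[hired], perf_obs[hired])[0, 1]
    u = interview[hired].std() / interview.std()
    r_c = r_o / (np.sqrt(r_yy) * np.sqrt((1 - u**2) * r_o**2 + u**2))   # Formula 5
    print(f"{hire_pct:.0%} hired: u = {u:.2f}, observed r = {r_o:.2f}, corrected r = {r_c:.2f}")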

Hunter et al. (2006) presented a two-step alternative based on “the little known fact that when range restriction is direct, accurate corrections for range restriction require not only use of the appropriate correction formula…but also the correct sequencing of corrections for measurement error and range restriction” (p. 596). In their method, the observed validity coefficient is corrected first for measurement error in performance ratings (since that occurs last) using the restricted IRR value (i.e., .52; denoted $r_{YY_R}$). That is accomplished using Formula 1. Then, the corrected coefficient is inserted into an accompanying restriction formula (Step 2 in their Table 1; see p. 599). To simplify the process, the formulas for these two steps are integrated into one, which is shown as Formula 6. Note that $U_X$ is the inverse of the range restriction ratio (i.e., $1/u_X$).

$r_c = \frac{U_X \left(r_o / \sqrt{r_{YY_R}}\right)}{\sqrt{1 + (U_X^2 - 1)\left(r_o / \sqrt{r_{YY_R}}\right)^2}}$  (6)
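Formula 6 is likewise easy to script. The function below is an illustrative sketch (its name and signature are choices made here, not Hunter et al.'s code); it takes the observed validity, $U_X$, and the restricted criterion IRR and returns the fully corrected estimate:

import math

def correct_case_ii(r_o: float, U_x: float, r_yy_r: float = .52) -> float:
    # Step 1: correct for criterion unreliability using the restricted IRR
    r1 = r_o / math.sqrt(r_yy_r)
    # Step 2: apply the direct (Case II) range restriction correction
    return (U_x * r1) / math.sqrt(1 + (U_x**2 - 1) * r1**2)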

Now to the results. For 90% hiring (top panel in Figure 8.3), the standard deviation with the bottom 10% of interview ratings removed is 6.0, resulting in a u value of .81 (i.e., 6.0/7.4) and a U value of 1.24 (i.e., 1/.81). The performance IRR value, as always, is .52. The validity coefficient drops to .44. Inserting these values into the above formula, as shown in Formula 7, returns the fully corrected value of .69 (which is important to confirm given that the validity coefficient was computed from the actual data after removal of the bottom 10%).

$r_c = \frac{1.24 \times \left(.44 / \sqrt{.52}\right)}{\sqrt{1 + (1.24^2 - 1)\left(.44 / \sqrt{.52}\right)^2}} = \frac{.756}{1.096} = .69$  (7)


If an interview researcher has a situation where a highly structured interview is used in a top-down fashion in selection, there is minimal preselection prior to the interview, a high majority of applicants are hired, and there is no (or minimal) attrition by the time that performance ratings are collected, this equation can be used to correct the observed validity coefficient. Isolating the relevant portion of this equation to do the computations, one simply inserts the observed validity coefficient in the last part of the equation in Formula 8 to get a reasonable estimate of the fully corrected value.

$r_c = \frac{(1.24 / .72)\, r_o}{\sqrt{1 + (1.24^2 - 1)\left(r_o^2 / .52\right)}} = \frac{1.72\, r_o}{\sqrt{1 + 1.03\, r_o^2}}$  (8)

For 50% hiring (middle panel in Figure 8.3), the standard deviation with the bottom half of interview ratings removed is 5.0, resulting in a u value of .67 (i.e., 5.0 / 7.4) and a U value of 1.49 (i.e., 1 / .67). The validity coefficient drops to .39. Inserting these values into Formula 6, as shown in Formula 9 below, returns the fully corrected value of .69.

$$r_c = \frac{1.49 \times (.39/\sqrt{.52})}{\sqrt{1 + (1.49^2 - 1)\,(.39/\sqrt{.52})^2}} = \frac{.804}{1.164} = .69 \qquad (9)$$

Extracting the relevant portion again, the correction equation becomes as shown in Formula 10. Interview researchers can use this equation when a highly structured interview is used in a top-down fashion in selection, there is minimal preselection prior to the interview, the proportion of applicants hired is in the ballpark of one-half, and there is no (or minimal) attrition by the time that performance ratings are collected.

$$r_c = \frac{(1.49/\sqrt{.52})\,r_o}{\sqrt{1 + (1.49^2 - 1)\,(r_o/\sqrt{.52})^2}} = \frac{2.07\,r_o}{\sqrt{1 + 2.35\,r_o^2}} \qquad (10)$$

Finally, for 10% hiring (bottom panel in Figure 8.3), the standard deviation with the bottom 90% of interview ratings removed is 3.5, resulting in a u value of .47 (i.e., 3.5 / 7.4) and a U value of 2.13 (i.e., 1 / .47). The validity coefficient drops to .29. Inserting these values into Formula 6, as shown in Formula 11, returns the fully corrected value of .69.

$$r_c = \frac{2.13 \times (.29/\sqrt{.52})}{\sqrt{1 + (2.13^2 - 1)\,(.29/\sqrt{.52})^2}} = \frac{.864}{1.257} = .69 \qquad (11)$$


Isolating the relevant portion once again, the result is shown in Formula 12. Interview researchers can use this equation when a highly structured interview is used in a top-down fashion in selection, there is minimal preselection prior to the interview, only a small percentage of applicants are hired, and there is no (or minimal) attrition by the time that performance ratings are collected.

$$r_c = \frac{(2.13/\sqrt{.52})\,r_o}{\sqrt{1 + (2.13^2 - 1)\,(r_o/\sqrt{.52})^2}} = \frac{2.95\,r_o}{\sqrt{1 + 6.77\,r_o^2}} \qquad (12)$$
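For convenience, the three simplified equations (Formulas 8, 10, and 12) can be wrapped in a small helper. The sketch below is our own illustration; the constants are simply those derived above, and the conditions listed in the preceding paragraphs (top-down use of a highly structured interview, minimal preselection, minimal attrition) still apply:

```python
import math

# Constants from Formulas 8, 10, and 12 (90%, 50%, and 10% hiring),
# all of which build in a restricted performance IRR of .52.
SHORTCUTS = {0.90: (1.72, 1.03), 0.50: (2.07, 2.35), 0.10: (2.95, 6.77)}

def quick_correction(r_obs, hiring_proportion):
    """Apply the simplified correction r_c = a*r / sqrt(1 + b*r^2)."""
    a, b = SHORTCUTS[hiring_proportion]
    return a * r_obs / math.sqrt(1 + b * r_obs**2)

# Illustrative use: an observed coefficient of .39 under roughly 50% hiring.
print(round(quick_correction(.39, 0.50), 2))  # ~.69
```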

A primary purpose of this manuscript is to highlight the effects of range restriction in a straightforward and understandable manner in order to enhance understanding of the importance of correcting for it. The findings for 10% hiring, which is a good benchmark in a number of selection situations, illustrate those effects nicely. Comparison of the third panel in Figure 8.3 with Figure 8.2 shows the considerable reduction in range after 90% of the candidates are eliminated.

Earlier a potential danger of range restriction was noted, specifically in reference to a company that wants to switch to a highly structured interview format but doesn’t feel that the resulting validity is high enough to justify doing so. An observed coefficient around .29 may not seem that much better (or any better) than the original selection process and might not be worth the administrative trouble of switching to the new format. Conversely, a fully corrected coefficient of .69 sounds very desirable and should generate a decision to switch as quickly as possible.

A key assumption of this scenario is that there is no (or minimal) attrition by the time that performance ratings are collected. What if there is more than minimal attrition? An important question then becomes the nature and/or status of those who leave. It is common for people to leave for reasons other than performance, including family, location, health, and retirement. If the performance of those departing does not differ substantially from that of those remaining, the effects on the correction process are probably negligible. On the other hand, if the departures are largely performance-related (typically top/bottom), then the effects would be more pronounced. In this case, the corrections offered for this scenario could still be done, but they would be conservative because there is performance-related restriction that would not be accounted for in the process. Attrition is incorporated into the next scenario.

Before moving on, two psychometric phenomena emerge from this scenario that are interesting scientifically. The first has to do with the progressive effects of 90%, 50%, and 10% hiring on the range of scores. Comparing the top panel in Figure 8.3 with Figure 8.2, some might be surprised at the degree to which the range of interview ratings is reduced at 90% hiring (relative to no restriction), specifically from 40 to 29. On the surface, a 10% elimination seems pretty minimal and shouldn’t result in such a noticeable drop. Univariate normal distributions contain considerably more data points in the middle than at the ends, and so do bivariate distributions.


Because of the relatively small concentration of data points at the low end of the distribution, it does not take much elimination of points from that region to reduce the overall range noticeably. Conversely, comparing the middle panel in Figure 8.3 with the top panel, the change in the scatterplot from 90% to 50% hiring may not be as pronounced as some might expect. Specifically, the range drops from 29 to only 20 even though four times as many points were eliminated (compared to 90% hiring). This time, the elimination occurred in the very dense scoring region leading up to the middle of the distribution. Because of that density, the drop in range is much more modest, in fact slightly less than the change from no restriction to 90% hiring. The drop in range from 50% to 10% hiring (second and third panels in Figure 8.3) is essentially the same because it is the same region, just on the back side of the center.

There is a potentially important implication of this phenomenon, one that should be explored further in future research. Given the low density in the high end of the distribution (just like in the low end), one would expect the range (and validity coefficient) to drop somewhat noticeably as the hiring ratio drops in relatively small increments below 10%. This issue is particularly important for jobs where a large number of individuals often apply (e.g., academic positions) and/or when unemployment is high. In both cases, a very limited number of individuals (sometimes only one) are hired.

The second phenomenon pertains to Hunter et al.’s (2006) two-step alternative procedure, which continues to be “little known” (p. 596) in the general meta-analytic community. Does it really lead to improved estimates over the traditional Callender and Osburn (1980)-type simultaneous correction? As a supplemental analysis, the computations were rerun for all three hiring percentages using the simultaneous approach. The corrected validity coefficient was in fact overestimated at all three hiring levels. Moreover, the degree of overestimation increased progressively as the hiring percentage decreased. The overestimation was .03 at 90% hiring (i.e., .72 vs. .69), .05 at 50% hiring (i.e., .74 vs. .69), and .07 at 10% hiring (i.e., .76 vs. .69). Clearly, the two-step procedure seems more accurate, particularly with lower hiring percentages.

Scenario 4: Hiring Based Solely on Interview Ratings—Direct Restriction with Attrition

This scenario involves restriction both on the predictor side from use in selection (here interview ratings) and on the performance side (from attrition). As noted by Schmidt and Hunter (2015), no methods currently exist for dealing with double restriction (see pp. 48–49). There is a method that could possibly be adapted, that of Alexander, Carson, Alliger, and Carr (1987), but their method is based on removal only of the lower portion of both distributions (e.g., from use of cutoff scores). That assumption is probably fine for the predictor, but attrition from both top and bottom is probably more likely with performance.


The effects of double restriction are illustrated using the Scenario 3 data with 50% hiring. Unlike the other scenarios, a correction formula is not offered because, again, one does not currently exist. That said, it is important for both practitioners and researchers to understand fully the debilitating effects of double restriction, especially since it is likely to be extremely common in practice.

Recalling the 50% case (the middle panel in Figure 8.3), the standard deviation with the bottom half of interview ratings removed is 5.0, resulting in a u value of 5.0/7.4 or .67, and the validity coefficient drops to .39. Those data were sorted by performance rating from highest to lowest, and then the top and bottom 5% were removed. Given that the starting sample size is 50, that corresponded to removal of the top five and bottom five sets of ratings and a final sample size of 40. The resulting distribution is shown in Figure 8.4.

Removal of the top and bottom 5% causes the validity coefficient to drop from .39 to .06. The standard deviation of interview ratings dropped only modestly, from 5.0 to 4.5. As expected, the standard deviation of performance ratings dropped more noticeably, from 11.1 to 7.1, although by itself such reduction does not appear sufficient to account for the somewhat drastic drop in validity (at least not fully). So why did the validity coefficient drop from a somewhat respectable .39 to something not that far from zero? Schmidt and Hunter (2015) provide valuable insight, namely that the regression line changes in complex ways when there is double restriction, including no longer being linear and homoscedastic.⁴ Inspection of Figure 8.4 suggests that the regression line, which retained essentially the same pronounced slope throughout the previous scenarios, is now almost flat.

FIGURE 8.4.  Scatterplot between interview and job performance ratings with 50% hiring and 10% attrition (5% from the bottom and top respectively). The correlation is .06.


Imagine an upward sloping rectangle, and then slicing off the bottom left and top right corners. Those two corners, a noticeable portion of which were removed because of 10% attrition, were largely responsible for the distinct upward slope. In addition, the peculiar shape of this distribution appears to violate virtually every known regression assumption about bivariate relationships (see Cohen & Cohen, 1983; Osborne, 2016), including being heteroscedastic. Given all these considerations, it is not surprising that no statistical formulas exist for correction of double restriction.

The implications of this illustration are of paramount importance for organizations. Head-to-head, 50% hiring with 10% attrition came out far worse than 10% hiring with no attrition (i.e., .06 vs. .29). It would appear that attrition, even at relatively low levels (e.g., 10%), has a powerful influence on validity when direct restriction is already present (and presumably indirect as well). Moreover, the assumption of 50% hiring with 10% attrition is probably conservative. There most likely are many employment situations where hiring is less than 50%, which should, in theory, make things even worse since the starting scatterplot and validity coefficient (before attrition effects) are already diminished, and/or where attrition is greater than 10%. Clearly, more research attention needs to be given to developing ways to deal with double restriction.

Scenario 5: Validation Data Is Collected From Incumbents—Indirect Restriction, No Attrition

As noted above, most restriction in selection is now presumed to be indirect. Indirect restriction can take various forms, including hiring based on another predictor before the interview is given (one that is correlated with interview ratings), testing of incumbents, and selection based on a composite of the interview with other predictors. Inspection of the frequency data presented by Huffcutt et al. (2014a), specifically their Table 1, suggests that testing of incumbents is by far the most frequent. Accordingly, that form of indirect restriction is the focus in this scenario, and the other forms are left for future research.

To provide a context, assume a company hears about the wonders of modern structured interview formats and decides to invest the resources to develop one. To evaluate it, they administer their new interview to a sample of current employees (i.e., incumbents) and then correlate those ratings with ratings of job performance. In regard to the latter, they could utilize the most recent company performance evaluations (i.e., administrative) or develop a new appraisal form specifically for the study and collect performance ratings on the spot.

A major challenge with this scenario is that information regarding the original selection process is rarely available. It is likely that the original selection measures correlate to some degree with the new structured interview, thus inducing indirect range restriction. An assumption of the indirect correction procedure is that the original process directly causes restriction only on the predictor and not on performance ratings (Schmidt & Hunter, 2015, p. 47). As noted by Hunter et al. (2006), this assumption is likely to be met to a close enough degree in selection studies.


If it is clear that this assumption does not hold, an alternative method has been developed (see Le, Oh, Schmidt, & Wooldridge, 2016). Denoted as “Case V” indirect correction, this method does not carry the above assumption, but it does require the range restriction ratio for the second variable as well. If that variable is job performance ratings, which is usually the case with selection, the range restriction ratio for it is extremely difficult to obtain empirically (Le et al., p. 981).

Correction for Case IV indirect restriction is a five-step process, clearly making it more involved than direct correction. Step 1 is to find/estimate the unrestricted reliability of the predictor in the applicant population (rXX_A). This, of course, is not known for interviews. Accordingly, the equation for estimating it is shown in Formula 13 (Schmidt & Hunter, 2015, p. 127), which involves the restricted reliability value (rXX_R) and the range restriction ratio (uX).

$$r_{XX_A} = 1 - u_X^2\,(1 - r_{XX_R}) \qquad (13)$$

Taking all three sources of measurement error into account (i.e., random response, transient, and conspect), Huffcutt, Culbertson, and Weyhrauch (2013) found a mean interrater reliability of .61 for highly structured interviews (see Table 3, p. 271).⁵ In regard to the range restriction ratio, Hunter and Schmidt (2004) recommend using a general value of .65 for all tests and all job families when the actual value is unknown (see p. 184). Inserting these two values, the equation becomes as shown in Formula 14. The pronounced difference between the restricted and unrestricted IRR values highlights yet another important psychometric principle, which is that reliability coefficients are influenced by range restriction as well.

$$r_{XX_A} = 1 - .65^2\,(1 - .61) = .84 \qquad (14)$$

Step 2 is to convert the actual range restriction ratio (uX) into its equivalent for true scores, unaffected by measurement error (i.e., uT). That equation is shown as Formula 15 (Schmidt & Hunter, 2015, p. 127), which involves the unrestricted applicant IRR value for the interview and the actual range restriction ratio. As indicated, the range restriction ratio for true scores is smaller than the actual one, which helps explain why indirect restriction tends to have a more detrimental effect than direct.

$$u_T = \sqrt{\frac{u_X^2 - (1 - r_{XX_A})}{r_{XX_A}}} = \sqrt{\frac{.65^2 - (1 - .84)}{.84}} = .56 \qquad (15)$$
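Steps 1 and 2 amount to two lines of arithmetic. As a quick check of the values above (a sketch using our own variable names):

```python
import math

# Quick check of Steps 1 and 2 with the chapter's working values.
r_xx_r, u_x = .61, .65                              # restricted interview IRR; general u value
r_xx_a = 1 - u_x**2 * (1 - r_xx_r)                  # Formula 13: ~.84
u_t = math.sqrt((u_x**2 - (1 - r_xx_a)) / r_xx_a)   # Formula 15: ~.56
print(round(r_xx_a, 2), round(u_t, 2))
```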


Step 3 is to correct the observed validity coefficient for measurement error in both the predictor and criterion using restricted reliability values (Schmidt & Hunter, 2015, p. 150). That computation is shown as Formula 16.

$$r_c = \frac{r_o}{\sqrt{r_{XX_R}\, r_{YY_R}}} = \frac{r_o}{\sqrt{.61 \times .52}} = \frac{r_o}{.56} \qquad (16)$$

Step 4 is to make the actual correction for indirect restriction, the equation for which is shown as Formula 17 (Schmidt & Hunter, 2015, p. 129). Note that this formula uses UT, which is the inverse of uT (i.e., 1/.56 = 1.79). Also note that the subscript “T” denotes true scores for the interview and “P” denotes true scores for performance.

$$\rho_{TP} = \frac{U_T\, r_c}{\sqrt{(U_T^2 - 1)\,r_c^2 + 1}} = \frac{1.79\, r_c}{\sqrt{(1.79^2 - 1)\,r_c^2 + 1}} = \frac{1.79\, r_c}{\sqrt{2.20\,r_c^2 + 1}} \qquad (17)$$

Because a correction was made for interview reliability, the value of rho that comes out of the above formula is actually the construct-level association between interviews and performance. Thus, the final step, Step 5, is to translate it back to operational validity by restoring measurement error in the interviews (Schmidt & Hunter, 2015, p. 155). It is important to note that the IRR value used for interviews in this final step should be its unrestricted version and not the restricted one. Using the value of .84 noted earlier, the computation becomes:

$$\rho_{XP} = \rho_{TP} \times \sqrt{r_{XX_A}} = \rho_{TP} \times \sqrt{.84} = .92\,\rho_{TP} \qquad (18)$$

Synthesizing Formulas 16–18 yields a single formula for correction of indirect restriction with highly structured employment interviews, as shown in Formula 19.

$$\rho_{XP} = \frac{.92 \times U_T \times (r_o/.56)}{\sqrt{(U_T^2 - 1)\,(r_o/.56)^2 + 1}} = \frac{(.92 \times 1.79/.56)\,r_o}{\sqrt{\bigl[(1.79^2 - 1)/.31\bigr]\,r_o^2 + 1}} = \frac{2.94\,r_o}{\sqrt{7.11\,r_o^2 + 1}} \qquad (19)$$

To illustrate, assume that a researcher does a concurrent validation of a new structured interview format and finds an observed validity coefficient of .27 (a very typical value). Using the last portion of Formula 19, the unrestricted operational validity of this new interview is estimated to be .64, as shown in Formula 20.


This value compares very favorably with Hunter et al.’s (2006) updated (via indirect correction) value of .66 for the validity of General Mental Ability (GMA) for medium complexity jobs (see p. 606).

$$\rho_{XP} = \frac{2.94\,r_o}{\sqrt{7.11\,r_o^2 + 1}} = \frac{2.94 \times .27}{\sqrt{7.11 \times .27^2 + 1}} = \frac{.79}{1.23} = .64 \qquad (20)$$
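Pulling Steps 1 through 5 together, a minimal sketch of the Case IV correction might look as follows. The function name is ours, and the default constants are simply the working values used in this scenario (restricted interview IRR of .61, restricted performance IRR of .52, and the general uX value of .65); substituting different parameter estimates is straightforward.

```python
import math

def case_iv_correction(r_obs, r_xx_r=.61, r_yy_r=.52, u_x=.65):
    """Sketch of the five-step Case IV correction (Formulas 13-18),
    returning the estimated unrestricted operational validity."""
    # Step 1 (Formula 13): unrestricted reliability of the predictor.
    r_xx_a = 1 - u_x**2 * (1 - r_xx_r)
    # Step 2 (Formula 15): true-score range restriction ratio.
    u_t = math.sqrt((u_x**2 - (1 - r_xx_a)) / r_xx_a)
    # Step 3 (Formula 16): correct for measurement error in both measures,
    # using the restricted reliabilities.
    r_c = r_obs / math.sqrt(r_xx_r * r_yy_r)
    # Step 4 (Formula 17): correct for indirect range restriction.
    big_u_t = 1 / u_t
    rho_tp = big_u_t * r_c / math.sqrt((big_u_t**2 - 1) * r_c**2 + 1)
    # Step 5 (Formula 18): restore interview measurement error to obtain
    # operational (rather than construct-level) validity.
    return rho_tp * math.sqrt(r_xx_a)

print(round(case_iv_correction(.27), 2))  # ~.64, in line with Formula 20
```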

Although this scenario is focused on incumbents, it should be reasonably accurate for the other two indirect situations identified by Huffcutt et al. (2014a). The first is where another predictor is used for actual hiring and then the interview is administered but not used, and the second is when hiring is based on a composite of the interview and another predictor (see their Table 1). Using Formula 19 is going to result in a much more accurate estimate of validity than no correction whatsoever. What this formula would not be appropriate for is correction of the final two restriction patterns identified by Huffcutt et al., namely hiring based on another predictor first and then on the interview (indirect then direct) and hiring based on the interview first and then on another predictor (direct then indirect).

DISCUSSION

The primary purpose of this manuscript is to provide a convenient, all-in-one reference for interview researchers to help them deal with range restriction. Was that purpose accomplished? The answer is a qualified “yes.” Selection researchers in a wide range of contexts should find the simplified formulas useful, especially given that the most difficult parameters are already estimated and incorporated. If all (or at least a high majority) of applicants are hired and retained, then Formula 1 provides an easy correction for performance measurement error. If the interview under consideration is used to make selection decisions in a top-down fashion, then researchers simply pick the hiring proportion that is closest to their own ratio (i.e., 90%, 50%, or 10%) and insert their observed validity coefficient into the corresponding formula (i.e., Formula 8, 10, or 12). If the interview is not used to make selection decisions, then the observed correlation can be inserted into Formula 19 for indirect correction.

Where the qualification manifests itself is when there is attrition. Due to the symmetrical effects of the predictor and the criterion on the validity coefficient, a modest level of attrition by itself (involving both the top and bottom segments of the performance distribution) can be corrected for using Formula 4. Unfortunately, when attrition is combined with any form of restriction, the impact on the validity coefficient is both devastating and uncorrectable. Developing methods to deal with attrition combined with restriction appears to be one of the most overlooked psychometric challenges in the entire selection realm.

One possible way to deal with attrition and restriction is a backwards graphical approach. A similar method is found when correcting a predictor variable for nonlinearity in a correlation/regression analysis.


There are references available (e.g., Osborne, 2016) that show various nonlinear patterns graphically and the associated equation to transform the data to become reasonably linear. Figure 8.4, for instance, portrays a very distinctive pattern that implies a specific combination of hiring and attrition, and it can be traced back to the original distribution with the full population value of the validity correlation. It might be possible to simulate other combinations of hiring and attrition and find different yet distinctive and identifiable patterns.

In perspective, some might be uncomfortable with the idea of simply inserting an observed validity coefficient into a formula that outputs a supposed estimate of the population value. Such concerns are both noted and appreciated. The best response at the present time is a reminder that the goal of a validity study is to assess how well a particular selection measure does at predicting job performance across the entire applicant pool, that is, with all potential applicants and not just a subset thereof. Results of the simulations in this manuscript, along with a sizable body of meta-analytic research, suggest that observed (uncorrected) estimates often are far too low. They simply do not reflect the true level of predictability of most selection measures in the entire applicant pool. Conversely, corrected estimates, even though imperfect, are likely to be considerably closer to the true underlying population value. The observed validity coefficient of .29 with 10% hiring illustrates nicely the dangers of no correction (when compared to the population value of .69). There is a real possibility that an organization would decide not to expend the time and resources needed to implement a new structured interview format for this modest level of validity.

Several directions for future research emerge from this work, some already noted. One of the most interesting psychometrically is the nonlinear effect of hiring percentages on the range of predictor scores and the resulting validity coefficient. For instance, the range, standard deviation, and validity coefficient could be computed for all values from 10% to 1% hiring. The regions from 90% to 50% hiring (and 50% to 10%) could also be fleshed out.

Also worthy of research attention are restriction patterns that combine direct restriction in some fashion with indirect. To illustrate, indirect then direct involves initial selection based on another predictor that is correlated with the interview and then, subsequently, the interview is used to make further selection decisions. Two considerations make this pattern particularly important. One is that it seems to occur somewhat regularly (see Huffcutt et al., 2014a, Table 1) in the literature and probably even more often in practice. The other consideration relates to an issue that has largely been ignored, which is up-front selection based on an application blank and letters of reference. At the present time, there is very little understanding of how much these two almost universal selection tools are actually used to reduce the applicant pool and how much they tend to correlate with structured interview ratings. If the correlation tends to be negligible, then their use really isn’t an issue. If the correlation is not negligible, however, then indirect restriction is induced, which complicates every scenario presented in this manuscript.


Finally, there are related topics that are likely to be affected by range restriction as well, and its influence in these areas should be explored. One such topic pertains to advances in data science and “big data” (Tonidandel, King, & Cortina, 2018). There is considerable interest in machine learning and the use of artificial intelligence in selection, yet very little understanding of how range restriction affects them and the decisions made from them (Rosopa, Moore, & Klinefelter, 2019). Another potential topic is missing data, a common phenomenon in selection analyses since values for two variables (predictor and criterion) are required. It is unclear how range restriction affects the calculation of correlation coefficients and multiple regression estimates when there are missing data.

On a closing note, this manuscript is focused on primary researchers with the hope of providing them with a convenient, all-in-one resource for dealing with range restriction. That said, the findings are just as applicable (and potentially useful) for meta-analysts as well. They need the same difficult parameters, which hopefully are presented sufficiently through the various scenarios. The main difference is that they work with mean validity coefficients instead of individual ones.

NOTES

1. Thorndike’s Case I correction is applicable to the relation between two variables (X1 and X2) when the actual restriction is on X1 but restriction information is available only for X2. He noted that this situation is unlikely to be encountered very often in practice.
2. Several other selection reanalyses have been done by Schmidt and colleagues, which, for whatever reason, did not appear in this search. See Oh et al. (2013, p. 301) for a summary.
3. The mean of their overall SI scores was actually above the midpoint of the scale. The midpoint was chosen, however, in an attempt to keep the distribution symmetrical. However, even with the mean at the midpoint, there was a small skew (which is not surprising given that a sample size of 100 is not overly large). There were also minor anomalies in subsequent distributions, such as with homoscedasticity. Liberty was taken in adjusting some of the data points to correct these anomalies (here to make the distribution highly symmetrical).
4. Homoscedasticity is the assumption that the variability of criterion scores (e.g., range) is reasonably consistent across the entire spectrum of predictor values. When violated, the distribution is said to be heteroscedastic, power is reduced, and Type I error rates are inflated (see Rosopa, Schaffer, & Schroeder, 2013, for a comprehensive review).
5. While random response error and transient error reflect variations in interviewee responses to essentially the same questions within the same and across interviews, respectively, conspect error reflects disagreements among interviewers in how they evaluate the same response information. As noted by Schmidt and Zimmerman (2004), panel interviews only control fully for conspect error since interviewers observe the same random response errors and there is no second interview.

REFERENCES

Alexander, R. A., Carson, K. P., Alliger, G. M., & Carr, L. (1987). Correcting doubly truncated correlations: An improved approximation for correcting the bivariate normal correlation when truncation has occurred on both variables. Educational and Psychological Measurement, 47, 309–315.
Arvey, R. R., Miller, H. E., Gould, R., & Burch, R. (1987). Interview validity for selecting sales clerks. Personnel Psychology, 40, 1–12. doi:10.1111/j.1744-6570.1987.tb02373.x
Benz, M. P. (1974). Validation of the examination for Staff Nurse II. Urbana, IL: University Civil Service Testing Program of Illinois, Testing Research Program.
Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model for validity generalization. Journal of Applied Psychology, 65, 543–558.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis of interrater and internal consistency reliability of selection interviews. Journal of Applied Psychology, 80, 565–579.
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2013). Employment interview reliability: New meta-analytic estimates by structure and format. International Journal of Selection and Assessment, 21, 264–276.
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014a). Moving forward indirectly: Reanalyzing the validity of employment interviews with indirect range restriction methodology. International Journal of Selection and Assessment, 22, 297–309. doi:10.1111/ijsa.12078
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014b). Multistage artifact correction: An illustration with structured employment interviews. Industrial and Organizational Psychology: Perspectives on Science and Practice, 7, 552–557.
Huffcutt, A., Roth, P., & McDaniel, M. (1996). A meta-analytic investigation of cognitive ability in employment interview evaluations: Moderating characteristics and implications for incremental validity. Journal of Applied Psychology, 81, 459–473.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Lee, H. (2006). Implications of direct and indirect range restriction for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594–612. doi:10.1037/0021-9010.91.3.594

Janz, T. (1982). Initial comparisons of patterned behavior description interviews versus unstructured interviews. Journal of Applied Psychology, 67, 577–580.
Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). The situational interview. Journal of Applied Psychology, 65, 422–427. doi:10.1037/0021-9010.65.4.422
Le, H., & Schmidt, F. L. (2006). Correcting for indirect range restriction in meta-analysis: Testing a new meta-analytic procedure. Psychological Methods, 11, 416–438.
Le, H., Oh, I.-S., Schmidt, F. L., & Wooldridge, C. D. (2016). Correction for range restriction in meta-analysis revisited: Improvements and implications for organizational research. Personnel Psychology, 69, 975–1008.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599–616.
McMurry, R. N. (1947). Validating the patterned interview. Personnel, 23, 263–272.
Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Oh, I.-S., Postlethwaite, B. E., & Schmidt, F. L. (2013). Rethinking the validity of interviews for employment decision making: Implications of recent developments in meta-analysis. In D. J. Svyantek & K. T. Mahoney (Eds.), Received wisdom, kernels of truth, and boundary conditions in organizational studies (pp. 297–329). Charlotte, NC: IAP Information Age Publishing.
Osborne, J. W. (2016). Regression & linear modeling: Best practices and modern methods. Thousand Oaks, CA: Sage.
Pearson, K. (1903). Mathematical contributions to the theory of evolution—XI. On the influence of natural selection on the variability and correlation of organs. Philosophical Transactions, 321, 1–66.
Robertson, I. T., Gratton, L., & Rout, U. (1990). The validity of situational interviews for administrative jobs. Journal of Organizational Behavior, 11, 69–76.
Rosopa, P. J., Moore, A., & Klinefelter, Z. (2019, April 5). Employee selection: Don’t let the machines take over. Poster presented at the meeting of the Society for Industrial and Organizational Psychology, National Harbor, MD.
Rosopa, P. J., Schaffer, M., & Schroeder, A. N. (2013). Managing heteroscedasticity in general linear models. Psychological Methods, 18, 335–351.
Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322–327.
Sackett, P. R., Laczo, R. M., & Arvey, R. D. (2002). The effects of range restriction on estimates of criterion interrater reliability: Implications for validation research. Personnel Psychology, 55, 807–825. doi:10.1111/j.1744-6570.2002.tb00130.x
Schmidt, F. L., & Hunter, J. E. (2015). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Thousand Oaks, CA: Sage.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Rothstein-Hirsh, H. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697–798.
Schmidt, F. L., & Le, H. (2014). Software for the Hunter-Schmidt meta-analysis methods (Version 2). Iowa City, IA: University of Iowa, Department of Management & Organizations.

Schmidt, F. L., & Zimmerman, R. D. (2004). A counterintuitive hypothesis about employment interview validity and some supporting evidence. Journal of Applied Psychology, 89, 553–581. doi:10.1037/0021-9010.89.3.553
Schmitt, N. (2007). The value of personnel selection: Reflections on some remarkable claims. The Academy of Management Perspectives, 21, 19–23.
Thorndike, R. L. (1949). Personnel selection. New York, NY: Wiley.
Tonidandel, S., King, E. B., & Cortina, J. M. (2018). Big data methods: Leveraging modern data analytic techniques to build organizational science. Organizational Research Methods, 21, 525–547.
Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574. doi:10.1037/0021-9010.81.5.557
Weekley, J. A., & Gier, J. A. (1987). Reliability and validity of the situational interview for a sales position. Journal of Applied Psychology, 72, 484–487.

CHAPTER 9

WE’VE GOT (SAFETY) ISSUES
Current Methods and Potential Future Directions in Safety Climate Research

Lois E. Tetrick, Robert R. Sinclair, Gargi Sawhney, and Tiancheng (Allen) Chen

Safety has been the focus of much research over the past four decades, given the social and economic costs of unsafe work. For instance, the International Labor Organization (2009) estimated that approximately 2.3 million workers die each year due to occupational injuries and illnesses, and additionally, millions incur non-fatal injuries and illnesses. More recently, the Liberty Mutual Research Institute for Safety (2016) estimated that US companies spend $62 billion in workers’ compensation claims alone. In the Human Resource Management and related literatures (e.g., Industrial Psychology), safety climate has been perhaps the most heavily studied aspect of workplace safety (Casey, Griffin, Flatau Harrison, & Neal, 2017; Hofmann, Burke, & Zohar, 2017). Several meta-analyses have established that safety climate is an important contextual antecedent of safety behavior and corresponding outcomes (e.g., Christian, Bradley, Wallace, & Burke, 2009; Clarke, 2010, 2013; Nahrgang, Morgeson, & Hofmann, 2011). However, the research included in these meta-analyses varies considerably in several methodological and conceptual qualities that may affect the inferences drawn from safety climate studies.



Understanding how these issues potentially influence safety climate research should improve researchers’ ability to conduct high-quality safety climate research. Better quality research can help advance theoretical perspectives on safety climate and inform evidence-based safety climate interventions. Therefore, the purpose of this review is to examine the research methods used in recent research on safety climate. Our review includes conceptual and measurement challenges in defining the safety climate construct, cross-level implications of safety climate at the individual, group/team, and organizational levels, research designs that limit causal inferences, and implications for external validity.

LITERATURE SEARCH PROCESS

To identify safety climate articles for this review, we searched the keyword of safety climate in the following databases: PsycINFO, Psychology and Behavioral Sciences Collection, PsycARTICLES, Business Source Complete, ERIC, and Health Source: Nursing/Academic Edition. We confined the keyword search to the abstract of articles, limited the results to peer-reviewed journal articles, and set a five-year time frame from 2013 to October 2018 (when the search was conducted). Our search resulted in a total of 1,002 articles. Figure 9.1 summarizes our literature search process.

Two authors conducted the first round of review by reading the abstracts and skimming the body of text to exclude articles that were not written in English or those that did not assess safety climate (e.g., articles on different types of climate, articles discussing safety issues but not climate). This first round of review yielded 284 relevant articles. In the second round of review, all four authors coded a set of 50 articles on the construct definition, methodology, analytics, results, study samples, and industries. The purpose of having all four authors code this set of 50 articles was to refine and reach consensus on the coding process. After consensus was reached, the rest of the articles were distributed to each author to code individually, excluding the review papers. During this second round of review, we excluded 23 articles because they mentioned safety climate but did not directly study it. Therefore, the final number of articles included was 261, which consisted of 230 empirical quantitative studies, 7 empirical qualitative studies, 6 conceptual papers, and 18 review papers.¹

CONCEPTUALIZATION AND MEASUREMENT OF SAFETY CLIMATE

Solid conceptual and operational definitions form the foundation of any research enterprise. As Shadish, Cook, and Campbell (1982, p. 21) discussed: “the first problem of causal generalization is always the same: How can we generalize from a sample of instances and the data patterns associated with them to the particular target constructs they represent?” Similarly, the AERA, APA, and NCME (1985)


FIGURE 9.1.  Literature search process.

standards for educational and psychological testing have long emphasized the central and critical nature of construct validity in psychological measurement, a view that has evolved into the perspective that all inferences about validity ultimately are inferences about constructs. Although the unitary view of validity as construct validity is not without critics (e.g., Kane, 2012; Lissitz & Samuelsen, 2007), the importance of understanding constructs is generally acknowledged as central to the research enterprise. In the specific case of safety climate, solid conceptual understanding of the definition of safety climate is both a foundational issue in the literature and an often-overlooked stage of the research process.


As we show below, the safety climate literature is plagued by problems associated with what Block (1995) called the jingle and jangle fallacies. Block described the jingle fallacy as a problem created when researchers use similar terms to describe different ideas and the jangle fallacy as when researchers use different terms to describe similar ideas. Both of these problems are pervasive in safety climate research, as are three additional concerns. First, safety climate studies often use inconsistent conceptual and operational definitions, providing conceptual definitions of safety climate that do not match their measurement process. Second, safety climate research often offers vague conceptual definitions of safety climate that do not suggest any particular measurement process. Third, there is a great deal of inconsistency in the conceptual scope of safety climate – especially related to whether/how researchers define the dimensions of safety climate. Each of these definitional issues creates conceptual ambiguity that leads to challenges in integrating findings across the body of safety climate literature as well as operational challenges in measuring safety climate. Our review focuses on six definitional and operational issues in understanding safety climate: the general definition of climate, distinguishing climate from related terms, the importance of aggregation, the climate-culture distinction, industry-specific versus universal/generic climate measurement, and the dimensionality of climate.

Safety Climate as a Strategic Organizational Climate

Ambiguity about the definition of safety climate is perhaps the fundamental problem running through the relevant literature. There is considerable variability in terms of how researchers conceptualize safety climate along with corresponding inconsistencies for safety climate measurement. In a review of the general climate literature, Ostroff, Kinicki, and Muhammad (2013) discussed the historical origins of the concept and placed it in a nomological network explaining how climate differs from the related term organizational culture. They also provided a heuristic model linking climate to individual outcomes such as job attitudes and performance behaviors as well as organizational outcomes such as shared attitudes and organizational effectiveness and efficiency. Drawing from prior work by James and Jones (1974) and Schneider (2000), Ostroff et al. (2013, p. 644) characterized climate as an “experientially based description of what people see and report happening to them in an organizational situation.”

Although Ostroff et al. (2013) discussed other climate constructs such as generic and molar climate, safety climate research usually treats safety climate as a specific example of what are referred to as strategic organizational climate constructs. Starting with the work of Schneider (1975), strategic organizational climate research focused on the idea of climate as having a particular referent, reflective of an organization’s goals and priorities (see also Schneider, 1990). Ostroff et al. (2013) cited literature on a wide range of these referents, including safety, service, sexual harassment, diversity, innovation, justice, citizenship behavior, ethics, empowerment, voice, and excellence.


Zohar (1980) is widely credited as the first researcher to describe safety climate as one of these strategic climates; he noted that “when the strategic focus involves performance of high-risk operations, the resultant shared perceptions define safety climate” (Zohar, 2010, p. 2009). Interestingly, Zohar’s (2010) review appeared nearly a decade ago. At that time, he characterized the literature as mostly focusing on climate measurement issues such as its factor structure and predictive validity with a corresponding need for greater attention to theoretical issues. Since then, multiple meta-analytic and narrative reviews have accumulated evidence supporting the predictive validity of climate perceptions, demonstrating the efficacy of climate-related interventions, and clarifying the theoretical pathways linking safety climate to safety-related outcomes (Beus, Payne, Bergman, & Arthur, 2010; Christian et al., 2009; Clarke, 2010; Clarke, 2013; Hofmann et al., 2017; Lee, Huang, Cheung, Chen, & Shaw, 2018; Leitão & Greiner, 2016; Nahrgang et al., 2011). Despite this progress, definitional ambiguities remain a problem in the literature with fundamental measurement issues about the nature of safety climate remaining unresolved. One fundamental definitional issue concerns the extent to which Zohar’s definition of safety climate is accepted in the literature. To address this question, we coded studies according to how they defined safety climate based on the citation used. A total of 86 studies (36.3%) cited Zohar, 25 studies (10.6%) cited Neal and Griffin in some combination, and 11 studies offered a definition without a citation (4.6%). It is important to note that whereas Griffin and Neal’s earlier work emphasized the individual level (e.g., Griffin & Neal, 2000; Neal, Griffin, & Hart, 2000), their later work emphasized both the individual and group level in a similar fashion to Zohar (e.g., Casey, et al., 2017; Neal & Griffin, 2006). Interestingly, 110 studies (46.4%) offered some other citation and 42 studies (17.7%) did not clearly define safety climate. Table 9.1 presents illustrative examples of the range of these definitions. As should be evident from the table, there are a wide range of approaches that vary in how precisely they define safety climate. Some key definitional issues include (1) whether safety climate is conceptualized as a group, individual, or multilevel construct and thus, involves shared perceptions; (2) what is the temporal stability of climate perceptions; and (3) whether safety climate narrowly refers to perceptions about the relative priority of safety or whether safety climate also encompasses perceptions about a variety of management practices that may inform perceptions about the relative priority of safety. Not shown in the table are examples from the literature of the many studies that do not offer an explicit definition, appearing to take for granted that there is a shared understanding of the meaning of safety climate, beyond something about the idea that safety is important (for example, Arcury, Grzywacz, Chen, Mora, & Quandt, 2014; Cox et al., 2017). Given that conceptual definitions should inform researchers’ methodological choices of what to measure, we see the lack of definitional precision in the safety climate literature as troubling. Future research

TABLE 9.1.  Illustrative Examples of Various Safety Climate Definitions

Clarke, S. (2013)

Climate perceptions represent the individual’s cognitive interpretations of the organizational context, bridging the effects of this wider context on individual attitudes and behaviour. In relation to safety, Zohar (1980) argued for the existence of a facet-specific climate for safety, which represents employees’ perceptions of the relative priority of safety in relation to other organizational goals. In subsequent work, safety climate has been operationalised as a group-level construct (Zohar, 2000), and so researchers have aggregated climate perceptions to represent the shared perceptions at this level. Safety climate can also be considered as an individual-level construct, where perceived safety climate represents ‘individual perceptions of policies, procedures and practices relating to safety in the workplace’ (p. 27)

Bennett, et al., (2014)

Safety climate differs from safety culture in that it is ‘the temporal state measure of safety culture, subject to commonalities among individual perceptions of the organization. It is therefore situationally based, refers to the perceived state of safety at a particular place at a particular time, is relatively unstable, and subject to change depending on the features of the current environment or prevailing conditions’ (p. 27)

Bergheim, et al., (2013)

Organizational climate is conceived as an empirically measurable component of culture and is linked to a number of important organizational outcomes (Zohar, 2002). In safety critical organizations such as air traffic control, the concept of safety climate is more often used than the more general organizational climate, to emphasize the importance of ensuring a focus on safety issues (Cox & Flin, 1998). … safety climate to describe how air traffic controllers perceive both management and group commitment to safety in their everyday work (Wiegmann, Zhang, & von Thaden, 2001). Thus, safety climate is often referred to as a state-like construct, providing a snapshot of selected aspects of an organization’s safety culture at a particular point in time (Mearns, Whitaker, & Flin, 2003). (p. 232)

should attend much more closely to definitional issues and strive toward consensus on the fundamental meaning of safety climate, particularly toward greater use of the original Zohar definition.

Distinguishing Safety Climate from Related Terms

Safety climate literature faces the challenge that there are a variety of related terms that refer to similar but conceptually different constructs. Although distinguishing these terms may be more of a conceptual than a methodological issue, the distinctions are important to developing clarity about what is measured in a study, an important concern given that some researchers use climate-related terms in potentially confusing ways. It is especially important to discuss distinctions between safety climate, psychosocial safety climate (PSC), and psychological safety.

TABLE 9.1.  Continued

Colley, et al., (2013)

‘‘Safety climate’’ refers to perceptions of organizational policies, procedures and practices relating to safety (p. 69)

Bell, et al. (2016).

Safety climate refers to the components of safety culture [7] that can be measured. Safety culture, in turn, determines how safety is managed by a team or organization. (p. 71)

Golubovich, et al. (2014)

Safety climate refers to employees’ perceptions of safety policies, procedures, and practices within their unit or organization (Zohar and Luria, 2005). (p. 759)

Curcuruto, et al., (2018)

Shared perceptions with regard to safety policies, procedures and practices. (p. 184)

Rodrigues, et al. (2015)

Zohar emphasised safety climate in the 1980s (Zohar 1980), and it has been defined as a descriptive measure that ‘can be regarded as the surface features of the safety culture discerned from the workforce’s attitudes and perceptions at a given point in time’ (p. 412)

Hicks, et al. (2016)

Safety climate has been conceptualized in this paper as consisting of management’s commitment to safety, safety communication, safety standards and goals, environmental risk, safety systems, and safety knowledge and training. (p. 20)

Hinde, T. et al. (2016).

“Safety culture” is described as “The product of individual and group values, attitudes, perceptions, competencies, and patterns of behaviour” (Health and Safety Commission, 1993...Safety and teamwork climates (the feelings and attitudes of everyone in a work unit) are two components of safety culture that are readily measured and amenable to improvement by focused interventions. (p. 251)

Huang, et al. (2016)

Safety climate, the degree to which employees perceive that safety is prioritized in their company (Zohar, 2010) (p. 248)

Dollard and Bakker (2010, p. 580) described PSC as the extent to which the organization has “policies, practices, and procedures aimed to protect the health and psychological safety of workers.” They elaborated (and empirically demonstrated) that PSC perceptions could be shared within an organizational unit (schools in their study) and, similar to definitions of safety climate, they characterized PSC as focused on perceptions about management policies, practices, and procedures that reflected the relative priority of employees’ psychosocial health. Thus, PSC expands the health focus of safety climate to include psychosocial stressors and outcomes in addition to the physical safety/injury prevention focus of safety climate. Numerous studies show that PSC is related to psychosocial outcomes (e.g., Lawrie, Tuckey, & Dollard, 2018; Mansour & Tremblay, 2018, 2019). Some research has linked PSC to safety related outcomes such as injuries and musculoskeletal disorders (Hall, Dollard, & Coward, 2010; Idris, Dollard, Coward, & Dormann, 2012). An understudied issue in this literature concerns the empirical


distinctiveness of safety climate and PSC. A few studies have examined measures of both safety and psychosocial safety in the same study (e.g., Hall et al., 2010; Bronkhorst, 2015; Bronkhorst & Vermeeren, 2016) but often using safety climate measures that are PSC measures adapted to safety. These studies have found that although PSC and safety climate measures are often highly correlated (i.e., r > .69 in Bronkhorst, 2015; Bronkhorst & Vermeeren, 2016; and Study 1 of Idris et al., 2012), they are structurally distinct with different patterns of correlates (Idris et al., 2012). However, more research is clearly required to determine the extent to which PSC and safety climate measures are distinct and the possible boundary conditions that might affect the degree to which they are related to each other and/or to various safety and health-related outcomes.

Edmondson (1999, p. 354) defined psychological safety as “a shared belief held by a work team that the team is safe for interpersonal risk taking.” Guerrero, Lapalme, and Séguin (2015) used the term “participative safety” to describe essentially the same idea, but the term psychological safety is much more commonly used and Edmondson’s approach appears to represent consensus in the now fairly extensive literature on the antecedents and outcomes of psychological safety (Newman, Donohue, & Eva, 2017). Some of the qualities of a psychologically safe work environment include mutual respect among coworkers, the ability to engage in constructive conflict, and comfort in expressing opinions and taking interpersonal risks (Newman et al., 2017). Thus, whereas safety climate focuses on perceptions about the organization’s relative priority for employees’ physical safety and PSC focuses on relative priorities for psychosocial health, psychological safety refers to employees’ general comfort in the interpersonal aspects of the workplace.

Another definitional issue in safety climate research is the distinction between psychological safety and individual-level perceptions of safety climate. Some researchers use the term psychological safety climate to refer to individual level perceptions about safety climate issues (cf. Clark, Zickar, & Jex, 2014; Nixon et al., 2015). These authors appear to have had the good intention of clearly labeling individual level safety climate perceptions with a term that highlights the individual nature of the construct (i.e., drawing on psychological climate literature such as James & James, 1989; James et al., 2008). However, other researchers have studied psychological safety as a component of safety culture (Vogus, Cull, Hengelbrok, Modell, & Epstein, 2016) or as an antecedent of safety outcomes (e.g., Chen, McCabe, & Hyatt, 2018; Halbesleben et al., 2013). In our view, it is theoretically appropriate to treat psychological safety as an antecedent of psychological climate, but researchers need to be wary of exactly how studies are using these various terms.

The terminological confusion between terms such as safety climate, psychological safety, psychological safety climate, and PSC represents a potential barrier to accumulating knowledge about and drawing clear distinctions between these constructs. At the very least, researchers are urged to use caution when citing


studies to ensure that they do in fact capture the construct of interest. However, further empirical research is needed to distinguish these terms.

Do We All Have to Agree? The Importance (or Not) of Aggregation

Zohar’s (1980) original depiction of climate explicitly included the idea of shared perceptions: “shared employee perceptions about the relative importance of safe conduct in their occupational behavior” (p. 96). As Hofmann, Burke, and Zohar (2017, p. 329) elaborate:

Key terms in this definition emphasize that it is a shared, agreed upon cognition regarding the relative importance or priority of acting safely versus meeting other competing demands such productivity or cost cutting. These safety climate perceptions emerge through ongoing social interaction in which employees share personal experiences informing the extent to which management cares and invests in their protection (as opposed to cost cutting or productivity).

In our anecdotal experience (the first two authors have been editors and associate editors of multiple journals), the issue of whether climate measures need to be shared raises considerable consternation among researchers, particularly during the peer review process. We have seen some reviewers assert that if the study does not include shared perceptions, it is not a study of climate; whereas other authors acknowledge that safety climate is a shared construct, but continue to study it at the individual level; and still others do not discuss its multilevel nature.

Irrespective of how researchers conceptually define safety climate, very few studies assess it at the group level. In fact, out of the 230 empirical, quantitative studies we reviewed, only 67 studies (29.1%) aggregated individual level data to test climate effects at unit or higher levels. Of these 67, 42 studies (62.7%) reported statistical evidence for the appropriateness of aggregation, including ICC(1) only (N = 12, 17.9%), rwg only (N = 1, 1.5%), ICC(1) and ICC(2) (N = 4, 6.0%), ICC(1) and rwg (N = 6, 9.0%), and all three measures (N = 19, 28.4%). These data suggest that more multilevel studies are needed, with improved reporting of statistical justification for aggregation.
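For researchers who do aggregate, the statistics just mentioned are straightforward to compute. The sketch below is our own illustration with made-up ratings; it assumes a single-item climate rating on a 5-point scale (so the uniform null distribution is used for rwg) and takes k as the average group size:

```python
import numpy as np

# Hypothetical safety climate ratings (1-5 scale) for three work units.
groups = {
    "unit_a": [4, 5, 4, 4, 5],
    "unit_b": [2, 3, 2, 3, 2],
    "unit_c": [4, 3, 4, 4, 3],
}
ratings = [np.asarray(v, dtype=float) for v in groups.values()]

# One-way ANOVA mean squares, with k taken as the average group size.
k = np.mean([g.size for g in ratings])
grand_mean = np.mean(np.concatenate(ratings))
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in ratings)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in ratings)
ms_between = ss_between / (len(ratings) - 1)
ms_within = ss_within / (sum(g.size for g in ratings) - len(ratings))

icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)  # ICC(1)
icc2 = (ms_between - ms_within) / ms_between                          # ICC(2)

# Single-item r_wg per unit, assuming a uniform null on a 5-point scale.
sigma2_eu = (5**2 - 1) / 12
rwg = {name: 1 - g.var(ddof=1) / sigma2_eu for name, g in zip(groups, ratings)}

print(round(float(icc1), 2), round(float(icc2), 2),
      {name: round(float(r), 2) for name, r in rwg.items()})
```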


disparity between how Zohar (1980) initially conceptualized safety climate and how many researchers appear to be operationalizing it in practice. One might go so far as to argue that, given that comparatively little research has been performed on safety climate as a group level construct, relatively little is known about it. On the other hand, both the general climate literature (e.g., Ostroff et al., 2013) and the safety climate literature (e.g., Clarke, 2013) explicitly acknowledge the conceptual relevance of individual safety climate perceptions to the study of climate. A common practice is to distinguish between organizational climate (a group level construct) and psychological climate (an individual level construct; cf. James & James, 1989; James et al., 2008). When designing a study, researchers should consider Ostroff et al.'s (2013) discussion of this distinction, as their model proposed that psychological climate is more directly relevant to individual level outcomes while organizational climate is more directly related to group level outcomes.

The individual-organizational level distinctions highlight the need to avoid atomistic and ecological fallacies (cf. Hannan, 1971) in safety climate research. Atomistic fallacies occur when results obtained at the individual level are erroneously generalized to the group level. On the other hand, ecological fallacies occur when group level results are used to draw conclusions at the individual level. Also, it is important to acknowledge that researchers often focus on the individual level because of practical constraints such as the lack of a work group/unit identifier that can be used as the basis of aggregation, the lack of a sufficient number of subunits to study climate, or a lack of proficiency in the multilevel methods needed to study climate across organizational levels. Given these issues, it may be appropriate for safety researchers interested in individual behavioral and attitudinal phenomena to focus on psychological climate perceptions as they relate to safety, although they should test for and/or attempt to rule out group level effects when possible. When researchers focus on individual level safety climate measurement, it is important to ensure that their theoretical rationale fits with the individual level formulation of climate.

One of the potential areas of confusion in this literature concerns the use of the term level. Although climate researchers distinguish between individual and organizational safety climate measures based on the level of analysis/measurement, safety climate researchers also use the term level to refer to particular climate stakeholders. For example, drawing from Zohar (2000, 2008, 2010), Huang et al. (2013) described group and organizational level climate as two distinct perceptions employees form about safety. In Huang et al.'s approach, the group level refers to one's immediate work unit, with measures typically focused on employees' perceptions of safety as a relative priority of their immediate supervisor. The organizational level refers to employees' perceptions of the global organization's (or top management's) relative priority for safety. But both group and organizational-level safety climate in Huang et al.'s model are usually measured with individual level perceptual measures.
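As a concrete illustration of keeping the individual and unit levels distinct, the following sketch (simulated data and hypothetical variable names, not an analysis from any study reviewed here) group-mean centers individual climate perceptions so that a random-intercept model estimates separate between-group (unit-level) and within-group (psychological climate) relations with safety behavior, which is one common way to guard against atomistic and ecological inferences.

```python
# A minimal sketch (simulated data) of separating unit-level and
# individual-level (psychological) safety climate effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_groups, n_per = 40, 10
group = np.repeat(np.arange(n_groups), n_per)

unit_climate = rng.normal(0, 1, n_groups)                 # true unit-level climate
climate = unit_climate[group] + rng.normal(0, 1, n_groups * n_per)
safety_behavior = (0.5 * unit_climate[group]              # between-group effect
                   + 0.2 * (climate - unit_climate[group])  # within-group effect
                   + rng.normal(0, 1, n_groups * n_per))

df = pd.DataFrame({"group": group, "climate": climate,
                   "safety_behavior": safety_behavior})
df["climate_group_mean"] = df.groupby("group")["climate"].transform("mean")
df["climate_within"] = df["climate"] - df["climate_group_mean"]

# Random-intercept model: the two fixed effects estimate the unit-level
# (between-group) and psychological (within-group) climate relations.
model = smf.mixedlm("safety_behavior ~ climate_group_mean + climate_within",
                    data=df, groups=df["group"]).fit()
print(model.summary())
```

Comparing the two fixed-effect estimates makes explicit which level of climate is doing the predictive work, rather than letting a single individual-level coefficient stand in for both.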

We’ve Got (Safety) Issues  •  207

Safety Climate versus Safety Culture

Although nearly four decades have passed since Zohar's initial conception of safety climate (Zohar, 1980), many researchers still blur the distinction between safety culture and safety climate or use what are essentially safety climate measures to study safety culture. For instance, Kagan and Barnoy (2013) referred to safety culture as "workers' understanding of the hazards in their workplace, and the norms and roles governing safe working [conditions]" (p. 273). Similarly, Pan, Huang, Lin, and Chen (2018) posited that safety culture can be characterized as including employee safety cognitions and behaviors, the safety management system, the safety environment, and individuals' stress recognition and competence. He et al. (2016, p. 230) noted that safety climate "can reflect the current state of the underlying safety culture." Hinde, Gale, Anderson, Roberts, and Sice (2016, p. 251) characterized safety climate as aspects "of safety culture that are readily measured and amenable to improvement by focused interventions." Hartmann, Meterko, Zhao, Palmer, and Berlowitz (2013) also described safety climate as modifiable aspects of the work environment. Referring to health care settings, Hong and Li (2017) measured safety climate as a dimension of patient safety culture, with other dimensions including teamwork climate, perception of management, job satisfaction, and work stress. Still others described safety climate as the measurable aspects of safety culture (e.g., Bell, Reeves, Marsden, & Avery, 2016; Martowirono, Wagner, & Bijnen, 2014), which is problematic in that it implies that other aspects of safety culture, such as artifacts or assumptions, cannot be measured. Finally, blurring the lines even further, Milijić, Mihajlović, Nikolić, and Živković (2014, p. 510) indicated that:

Safety climate is viewed as an individual attribute, which consists of two factors: "management's commitment to safety and workers' involvement in safety" (Dedobbeleer & Béland, 1991). On the other hand, safety culture refers to the term used to describe a way in which safety is managed at the workplace, and often reflects "the attitudes, beliefs, perceptions and values that employees share in relation to safety" (Cox & Cox, 1991).

Researchers also sometimes describe safety climate as a "snapshot" of the organization's safety culture. According to Bergheim et al. (2013, p. 232), "safety climate is often referred to as a state-like construct, providing a snapshot of selected aspects of an organization's safety culture at a particular point in time (Mearns, Whitaker, & Flin, 2003)." Similarly, Bennett et al. (2014) noted that, as compared to safety culture, safety climate is more contingent on the work environment and susceptible to change. Given the relative paucity of longitudinal research on safety climate (see below), little data exist concerning the temporal stability of safety climate. One recent study reported test-retest correlations greater than .50 for two safety climate measures and a corresponding ability of the climate measures to predict safety outcomes across a two-year time period (Lee, Sinclair, Huang, & Cheung, 2019). However, others have concluded that


the ability of safety climate to predict outcomes drops much more rapidly (Bergman, Payne, Taylor, & Beus, 2014). Of course, the stability of both safety climate scores and their ability to predict outcomes likely depends on the stability of the work environment, but relatively little research has directly addressed this issue. In our view, whether safety climate is a relatively stable phenomenon or a varying snapshot of culture remains unresolved.

Guldenmund (2000, p. 220) noted that "before defining safety culture and climate, the distinction between culture and climate has to be resolved." In the ensuing nearly two decades, although progress has been made in understanding the general conceptual distinctions between organizational climate and culture (cf. Ostroff et al., 2013), safety researchers are often careless in distinguishing culture and climate (Zohar, 2014). Some researchers assume that the climate-culture distinction rests on the idea that climate is easier to change than culture; others distinguish them in terms of the relative temporal stability of the two constructs. Still others treat climate as the measurable aspect of culture, even though other aspects of culture are likely measurable, albeit through different strategies than those used in climate assessments. These ambiguities highlight the critical need for further clarity in the conceptualization of climate. In fact, Hofmann, Burke, and Zohar (2017, p. 381) concluded:

In the context of safety research, there potentially is even greater conceptual ambiguity given the lack of a clear and agreed upon definition of safety culture, and where the definitions that have been put forth do not make reference to broader, more general aspects of organizational culture. In addition, many measures of safety culture use items and scales which resemble safety climate measures. This has led many authors to use the two constructs interchangeably. We believe this situation is unfortunate and suggest that any study of safety culture should be integrated with and connected to the broader, more general organizational culture as well as the models and research within this domain.

Industry Specific versus General Measures

One of the ongoing issues in safety climate measurement concerns the use of industry/context specific versus what are referred to as general or universal measures. The former refers to safety climate measures with item content designed to reflect safety issues in a specific industry; the latter refers to measures with items designed to generalize across a wide variety of contexts. Our review indicates that although general safety climate measures are more commonly used, industry specific measures appear to be more common in a few types of settings, including schools/education, health care, transportation, and offshore oil and gas production (Jiang et al., 2019, also reported these as the most common contexts in their meta-analysis).

One example of context specific measures comes from a series of studies on truck drivers by Huang and colleagues (Huang et al., 2018; Lee, Sinclair, Huang, & Cheung, 2019). Huang et al. (2013) argued that the need for a context specific

We’ve Got (Safety) Issues  •  209

measure reflects the unique safety concerns faced by lone workers such as truck drivers. They developed and validated a measure consisting of three organizational-level factors: (a) proactive practices, (b) driver safety priority, and (c) supervisory care promotion, and three group/unit level factors: (a) safety promotion, (b) delivery limits, and (c) cell phone (use) disapproval.

Another example comes from the literature on school climate. Zohar and Lee (2016) provided an example of a traditional safety climate study conducted in a school setting with school bus drivers. In addition to items measuring perceived management commitment to safety, they developed context specific items such as "management becomes angry with drivers who have violated any safety rule" and "department immediately informs school principal of driver complaint against disruptive child." Occupational health research has paid less attention to the school context than to contexts such as manufacturing; nevertheless, we conducted a separate review of the school climate literature, which located over 1,000 citations to school climate, including over 500 in 2013 alone. Although a full review of this literature is well beyond the scope of this article, it should be noted that safety issues are frequently mentioned in the school climate literature (Wang & Degol, 2016). However, rather than reflecting physical injuries from sources such as transportation incidents, slips, and strains, the predominant safety concern is the extent to which teachers and students are protected from physical and verbal violence. Moreover, much of this literature is concerned with student health and academic performance outcomes rather than teachers' occupational well-being. Thus, traditional safety climate measures may be insufficient to capture the unique challenges of this context.

Healthcare is another setting where context-specific measures are frequently used. Healthcare, however, encompasses a wide variety of practice areas and occupations, each with specific sets of safety challenges. Accordingly, researchers have measured a wide array of different aspects of safety climate such as error-related communication (Ausserhofer et al., 2013), hospital falls prevention (Bennett et al., 2014), communication openness and handoffs and transitions (Cox et al., 2017), forensic ward climate such as therapeutic hold and patients' cohesion and mutual support (de Vries, Brazil, Tonkin, & Bulten, 2016), and hospital safety climate items relating to issues such as availability of personal protective equipment and cleanliness (Kim et al., 2018). The variety of issues captured by these measures raises questions about whether healthcare should be treated as a single industry context by researchers seeking to understand the effects of context on safety climate.

Jiang et al. (2019) highlighted some of the reasons why general/universal or context-specific measures might be preferred. For example, industry-specific measures may have greater value in diagnosing safety concerns that are unique to a specific industry and therefore potentially more useful in guiding safety interventions (see also Zohar, 2014). General measures may have more predictive value if safety climate primarily reflects a general management commitment to


safety; if this is the case, safety interventions should focus on those broadly applicable concerns. General measures can also contribute to benchmarking norms that may be used across a wide variety of industries.

To test the possible distinctions between universal and industry-specific measures, Jiang et al. (2019) tested the relative predictive power of each type of measure in a meta-analytic review of 120 samples (N = 81,213). They found that each type of measure performed better in different situations. Specifically, the industry-specific measures were more strongly related to safety behavior and risk perceptions, whereas the universal measures were more strongly related to other adverse events such as errors and near misses. There were no differences between universal and industry-specific measures in their ability to predict accidents and injuries. It is important to note that Jiang et al. (2019) did not test whether the industries represented among the industry-specific measures differed from those represented among the universal measures. Jiang et al. (2019) reported that the most commonly studied industries in their review were construction, health care, manufacturing, transportation, and hospitality/restaurant/accommodations, with 19 studies described as "mixed context." Our review (which encompasses a different set of years than Jiang et al.) indicates that the industries that appeared to be most likely to use industry-specific measures were transportation, offshore oil and gas production, education, and hospital/health care. Thus, the comparison of industry-specific versus general measures may be somewhat confounded if some industries are more or less likely to be represented in the industry-specific group.

Researchers could address this by comparing both types of measures within the same industry. Keiser and Payne (2018) did just this, using both types of measures in the same setting (university research labs) and including context-specific measures for animal biological, biological, chemical, human subjects/computer, and mechanical/electrical labs. They concluded that although the context-specific measures appeared to be more useful in less safety-salient contexts, there were relatively few differences between the measures. However, they also noted that there appeared to be measurement equivalence problems with the general measure across the different settings they investigated. Of course, Keiser and Payne's findings may be unique to their organizational setting, given that university research labs likely differ in many ways from other types of safety-salient contexts. Thus, the evidence about whether researchers should use context-specific versus universal/general measures is mixed, so far suggesting at least some differences between the two types of measures in the settings in which they are most useful. This is clearly an issue that requires further research.

The Dimensionality of Safety Climate

Nearly 40 years after Zohar (1980) first offered a formal definition of safety climate, there seems to be little consensus on the dimensionality of safety climate measures. This has been a long-standing concern in the literature. Twenty years

We’ve Got (Safety) Issues  •  211

after Zohar’s original publication, Flin, Mearns, O’Connor, and Bryden (2000) identified 100 dimensions of safety climate used in prior literature. They narrowed these dimensions down to six themes (1) management/supervision, (2) safety system, (3) risk, (4) work pressure, (5) competence of the workforce, and (6) procedures/rules. Yet, measures continued to proliferate; in fact, 10 years after the Flin et al. (2000) publication Beus et al.’s (2010) meta-analytic review identified 61 different climate measures with varying numbers of dimensions. Our review suggests that little progress has been made and there continues to be a wide array of approaches to measuring safety climate. As noted above, one important distinction is between universal/generic and context-specific measures, with many alternatives within each of these categories. A related issue concerns the dimensions of those measures. For the purpose of this review, we did not compile a list of the dimensions used in various measures of safety climate. Rather, we focused on the methods used to ascertain the number of dimensions in individual studies. Factor analysis is a widely recognized approach to assessing dimensionality of a measure and therefore is an important step in measure development and construct validation. Factor analyses are especially important in a literature such as safety climate where there is a lack of clarity about the dimensionality of the construct. Therefore, we coded studies in terms of whether they used any factor analytic technique; if so, what technique they used. Across the 230 quantitative empirical studies, the most common factor analytic technique used was confirmatory factor analysis (CFA, K = 64; 27.8%). Approximately 22% of the studies used exploratory factor analysis (EFA) with half of them only using EFA (K = 25, 10.9%) and half using a combination of EFA and CFA (K = 24, 10.4%). That CFA was used separately or in some combination with EFA in 38.2% of the studies (K =88) is encouraging given that CFA requires researchers to specify an a priori measurement model. However, it is arguably more distressing that nearly half of the studies in our review (K = 112, 48.7%) did not report any form of factor analysis, 2 studies reported the use of an unspecified form of factor analysis (0.9%), and 3 studies (1.3%) reported using CFA but only on other measures than safety climate. Given the lack of clarity in the literature about the dimensionality of safety climate, the fact that just over 50% of the studies in our review either did not report factor analyses or provided unclear information about the factor analytic techniques used represents an important barrier to accumulating evidence about the dimensionality of safety climate measures. A related issue concerns how safety researchers interpret factor analytic results. Some researchers use unidimensional measures typically focusing on the core idea of perceived management commitment to safety (for example, Arcury et al., 2014; He et al., 2016). This approach is consistent with the argument that management commitment is the central concept in safety climate literature as well as with meta-analytic evidence showing that management commitment is among the best predictors of safety-related outcomes (Beus et al., 2010). However, good


methodological practice suggests that confirmatory factor analyses should be used to affirm the dimensionality of these measures. Other studies use multidimensional measures but treat them as unidimensional, combining scores across multiple dimensions into an overall construct (for example, McCaughey, DelliFraine, McGhan, & Bruning, 2013). In some cases, this may be justified by high correlations among the factors in question. However, other studies using multidimensional measures have treated the dimensions as separate scores (for example, Hoffmann et al. 2013; Huang et al., 2017). The fact that the studies using multidimensional measures often find differences in safety climate antecedents or outcomes across climate dimensions suggests that researchers who combine multidimensional measures into a single score may be missing information of diagnostic or theoretical value. In addition to the broad question of whether safety climate measures should be treated as multidimensional, questions can be raised about what dimensions should be included in safety climate measures. For example, five of the six themes in the Flin et al. (2000) review of the literature did not directly concern perceptions of management commitment: safety system, risk, work pressure, competence of the workforce, and procedures/rules. Two other commonly-mentioned themes relate to safety training (e.g., Golubovich, Chang, & Eatough, 2014; Graeve, McGovern, Arnold, & Polovich, 2017) and safety communication (e.g., Bell, Reeves, Marsden, & Avery, 2016; Cox et al., 2017). The question that can be raised about any of these dimensions is whether they should be treated as part of a unitary safety climate construct or whether they should be regarded as antecedents or even possibly consequences of safety climate. Referring to the general climate literature, Ostroff et al. (2013) treated organizational climate as a consequence of organizational structure and practices. To the extent that one adopts a strict definition of safety climate as relating to perceptions about the relative priority of safety, several of the commonly-used dimensions of safety climate might be viewed as causal antecedents of rather than indicators of safety climate. Huang et al. (2018) noted that safety communication has been treated as a part of safety climate, as a cause of safety climate, and even as a consequence of safety climate. They argued that the literature was ambiguous enough on this point that they treated safety communication as a correlate of safety climate. They found that communication both independently predicted safety performance and moderated the safety climate-safety performance relationship such that the benefits of (individual level) climate perceptions were stronger when workers also perceived good organizational practices. Future research needs to attend to the issue of whether perceptions about organizational policies and practices (such as training and communication) as well as working conditions (such as job stress or work pressure) should be viewed as indicators or as antecedents of safety climate. This is an important issue both in terms of striving to reach consensus on the nature of climate and in terms of researchers’ decisions about how to operationalize safety climate. Importantly,

We’ve Got (Safety) Issues  •  213

strong correlations among the dimensions may be an insufficient justification, as one would expect that policy and practice-based antecedents of safety climate should influence workers' perceptions of the extent to which their management is committed to safety.

We also noted several studies that used dimensions of safety climate not directly related to workers' perceptions about the relative priority of safety. A few examples include Arens, Fierz, and Zúñiga (2017), who combined measures of teamwork climate and patient safety climate; Hong and Li (2017), who used measures such as teamwork climate, stress recognition, and job satisfaction; Kim et al. (2018), who included a measure of the absence of job hindrances; and Ausserhofer et al. (2013), who assessed commitment to resilience. We discourage researchers from including such marginally relevant concepts directly in their safety climate measures. Rather, future research should carefully consider whether such variables might be better treated as antecedents or outcomes of safety climate or perhaps as moderators of the effects of climate on outcomes.

Conceptualizing and Operationalizing Safety Climate: A Progress Report

Safety climate has been a topic in occupational health research for nearly 40 years. In that time, several hundred studies have supported the general importance of safety climate in occupational health. Despite the voluminous literature on the topic, including 230 quantitative empirical studies in the past five years, problems remain in nearly every aspect of conceptualizing and operationalizing safety climate. These issues create challenges in drawing conclusions about the nature of safety climate as a construct. There seems to be relatively broad consensus about the idea of safety climate as reflecting workers' perceptions about the importance of safety issues. However, there also are wide inconsistencies regarding how researchers define and measure safety climate. Given that clarity about constructs is fundamental to scientific progress, these inconsistencies raise questions about how much we really know about the nature of safety climate. Some of these problems are likely to worsen as researchers begin to study similar ideas such as PSC and psychological safety and as interest grows in other health-related aspects of climate (e.g., Gazica & Spector, 2016; Mearns, Hope, Ford, & Tetrick, 2010; Sawhney, Sinclair, Cox, Munc, & Sliter, 2018; Sliter, 2013). Ultimately, these problems should be addressed through greater attention to the importance of construct validation and establishing a coherent nomological network for the safety climate construct.

METHODS/DESIGNS

As indicated above, the conceptualization and measurement of safety climate have several pitfalls that generate challenges for the design of studies seeking to examine the effects of safety climate, the antecedents of safety climate, as well as


the mediating and moderating effects of safety climate. In this section we review some of the methodological challenges and issues.

Interventions

For the period 2013–2018, only 6% of all of the articles we coded were intervention studies. Of these, four studies treated safety climate as an independent variable, eight studies treated safety climate as a dependent variable, and two studies treated safety climate as a mediator. Two of these intervention studies used random assignment, two used quasi-experimental designs, and two used cluster random assignment. Therefore, experimental or quasi-experimental designs were rare. Admittedly, these designs are difficult to implement in applied field settings, but their absence does limit our ability to make causal inferences about safety climate-related processes. Lee, Huang, Cheung, Chen, and Shaw (2018) reviewed 19 intervention studies that met their inclusion criteria; they reported that 10 of the 19 studies were quasi-experimental pre-post intervention designs and eight were based on mixed designs with between- and within-subjects components. Ten of the 19 studies were published in years preceding the period of our review, which raises the question of whether research designs are becoming stronger. That said, the results of both of these reviews support the ability of interventions to improve safety climate in applied settings across several industries. But they also highlight how rare such studies are and the corresponding need for more studies utilizing these designs.

Longitudinal versus Cross-sectional Designs

In our review, 21% of the studies we coded obtained data across multiple measurement occasions. Twenty-six studies included either two or three waves of data collection, although they did not necessarily measure all variables at all points in time. There were eight prospective studies and 15 intervention studies. However, about 80% of the studies were cross-sectional. This is considerably higher than the proportion of single-source cross-sectional designs reported in Spector's (2019) review. Forty-one percent of the studies in two occupational health psychology journals (Journal of Occupational Health Psychology and Work & Stress) used a single-source cross-sectional design (Spector & Pindek, 2016), and 38% of the studies in the Journal of Business and Psychology used a single-source cross-sectional design (Spector, 2019). The higher prevalence of cross-sectional designs in safety climate research may be a result of disciplinary differences or editorial policies and practices. There does appear to be a belief among many researchers that using a longitudinal design is preferred in establishing the validity of the research; however, adding additional measurement occasions raises a number of threats to validity and may still not allow strong causal inferences depending on other design issues (see Ferrer & Grimm, 2012; Ployhart & Ward, 2011; Stone-Romero, 2010).
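To make these design trade-offs concrete, the following sketch (simulated data and hypothetical variable names, not drawn from any study reviewed here) fits the kind of two-wave lagged regression used in prospective safety climate studies, predicting Time 2 safety behavior from Time 1 climate while controlling for Time 1 behavior. As noted above, such a model improves temporal ordering but does not by itself justify causal claims.

```python
# A minimal sketch (simulated data) of a two-wave lagged regression of the
# kind used in prospective safety climate studies.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 300
climate_t1 = rng.normal(0, 1, n)
behavior_t1 = 0.4 * climate_t1 + rng.normal(0, 1, n)
behavior_t2 = 0.5 * behavior_t1 + 0.2 * climate_t1 + rng.normal(0, 1, n)

df = pd.DataFrame({"climate_t1": climate_t1,
                   "behavior_t1": behavior_t1,
                   "behavior_t2": behavior_t2})

# Does Time 1 safety climate predict Time 2 safety behavior over and above
# Time 1 behavior? The lagged coefficient speaks to temporal precedence only.
model = smf.ols("behavior_t2 ~ climate_t1 + behavior_t1", data=df).fit()
print(model.params.round(2))
```

With at least three measurement occasions, the same logic extends to cross-lagged or growth models, which is one reason the text below emphasizes designs with three or more waves.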

We’ve Got (Safety) Issues  •  215

Many longitudinal studies do not make a case for the specific lag in measurement they included in their designs. In addition, if there are not at least three measurement occasions, then it is not possible to detect nonlinear trends. Unfortunately, the theories commonly used in safety climate research are silent on the most appropriate time lag to choose for a given research question. It may be the case that there is no perfect time lag, as changes in safety climate may be best explained by unique events, such as severe accidents or changes in organizational policy. Nevertheless, we echo calls by other scholars (e.g., Ployhart & Ward, 2011) to incorporate time into our research designs. This is especially important for understanding the time that it takes for a cause (e.g., an accident) to exert an effect (e.g., changes in safety climate). Other scholars (e.g., Spector, 2019) have suggested that we modify our measures to explicitly incorporate time. Many of our measures are so general that it is impossible to assess the sequencing of events. By including time-related content such as "in the last month" or "today," temporal ambiguity is reduced, if not eliminated.

Level of Analysis

As discussed above, many conceptualizations of safety climate suggest a group or organizational level of analysis. However, 70.4% of the studies we coded measured and/or analyzed safety climate at only the individual level of analysis. Only 67 studies (29.1%) took a group, organizational, or multi-level approach. As we point out in the previous section as well as in the section on future directions below, moving beyond the individual level of analysis is necessary to advance understanding of safety climate. More research at the group and organizational levels is needed to link safety climate to organizational level outcomes as well as to understand the relations of group- and organizational-level climate with individual level behaviors and outcomes.

Role of Safety Climate

Reflecting the varied definitions and theoretical frameworks for safety climate, there is considerable variability in whether safety climate is treated as an independent variable, a dependent variable, a moderator, or a mediator. Forty-two percent of the studies we coded treated safety climate as an independent variable and only 20% treated safety climate as a dependent variable. Leadership was the most prevalent predictor of climate. Only 5.2% of the studies treated safety climate as a moderator and 7.8% treated climate as a mediator, even though most of the studies treating safety climate as a mediator were cross-sectional designs, which some argue is not an appropriate design for testing mediation (Stone-Romero, 2010). However, Spector (2019) suggested that in mature areas of research, cross-sectional mediation models may be useful for ruling out plausible alternative explanations or identifying potential mediating mechanisms. Safety climate thus can be considered a dependent variable, independent variable, moderator, or


mediator depending on the research question; however, our review indicates that the research literature needs to investigate these roles to broaden our understanding of the development and effects of safety climate.

FUTURE RESEARCH AGENDA

Research on safety climate has made tremendous progress since the term safety climate was coined by Zohar in 1980. In the last five years, over 200 empirical studies have been published on the topic. These studies span such countries as China, Australia, Portugal, the USA, and Norway, as well as a range of industries, including healthcare, construction, education, manufacturing, and transportation. Despite the considerable interest and advancement in the field, much work remains to be done to improve the literature on safety climate. The remainder of this paper is devoted to a discussion of recommendations for conducting research on safety climate. Specifically, the focus of the future research agenda should be to improve theory, research designs, and measurement.

Perhaps the most fundamental need in the safety climate literature is to arrive at consensus on the conceptual and operational definition of safety climate. Although many studies define safety climate as shared perceptions with respect to safety (Zohar, 1980) or as shared perceptions of an organization's policies, procedures, and practices with regard to safety (Neal & Griffin, 2006), there is still considerable variability in how people define the construct. Additionally, safety climate and culture are sometimes still used interchangeably. Furthermore, agreement on the dimensions that comprise the construct of safety climate is currently lacking. Although much of the research has considered safety climate to be multidimensional in nature (e.g., Kines et al., 2011; Sawhney et al., 2018), some studies have operationalized it as a unidimensional construct (Drach-Zahavy & Somech, 2015). With respect to dimensionality, even though most safety climate measures share the common themes of management commitment to safety, supervisor support for safety, and safety communication (Kines et al., 2011; Neal & Griffin, 2006; Sawhney et al., 2018), there are other dimensions that are less frequently utilized. These include perceptions of causes of error, satisfaction with the safety program, safety justice, and the social status of the safety officer, to name just a few (Hoffmann et al., 2013; McCaughey, DelliFraine, McGhan, & Bruning, 2013; Schwatka & Rosecrance, 2016; Zohar, 1980).

The lack of consensus on the conceptualization of safety climate spills into the operationalization of the construct. As recently discussed by Beus, Payne, Arthur, and Muñoz (2019), this conceptual ambiguity makes it difficult to compare studies that use different definitions of safety climate. According to Guldenmund (2000), a construct's definition "sets the stage for ensuing research [and] is the basis for hypotheses, research paradigms, and interpretations of findings" (p. 227). With differing dimensions of safety climate, research can produce different findings. Future research can benefit from dispelling theoretical ambiguity by reaching consensus on the dimensions that comprise safety climate beyond management commitment to safety.
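As one illustration of the kind of dimensionality evidence this agenda calls for, the following sketch (simulated item responses; the third-party factor_analyzer package is used here only as an example, and a confirmatory model in an SEM package would provide the stronger a priori test) runs an exploratory factor analysis on a hypothetical six-item safety climate measure and inspects the rotated loadings and explained variance.

```python
# A minimal sketch (simulated data) of an exploratory dimensionality check on
# a hypothetical six-item safety climate measure. A confirmatory factor
# analysis would provide the a priori test recommended in the text; this
# sketch illustrates only the exploratory step.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(3)
n = 400
mgmt = rng.normal(0, 1, n)    # latent: management commitment to safety
comm = rng.normal(0, 1, n)    # latent: safety communication

items = {f"mgmt{i}": 0.8 * mgmt + rng.normal(0, 0.6, n) for i in range(1, 4)}
items.update({f"comm{i}": 0.8 * comm + rng.normal(0, 0.6, n) for i in range(1, 4)})
df = pd.DataFrame(items)

fa = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa.fit(df)
loadings = pd.DataFrame(fa.loadings_, index=df.columns,
                        columns=["factor1", "factor2"])
print(loadings.round(2))
print(fa.get_factor_variance())  # SS loadings, proportion, cumulative variance
```

Reporting this kind of evidence routinely, and then testing the implied structure confirmatorily, would make it far easier to compare dimensionality claims across studies.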

We’ve Got (Safety) Issues  •  217

The advancement of research on safety climate is also contingent upon theoretical development in the area. Currently, we have a few frameworks that are specific to safety climate, such as Zohar's (2000) multilevel model of safety climate and Neal, Griffin, and Hart's (2000) model of safety performance. Despite the existence of these theories, researchers rarely draw explicitly upon these frameworks, and therefore, these theories largely remain untested. The exception appears to be research examining the links of safety climate with safety behavior and safety outcomes, which has received abundant research attention. Theoretical development and testing will provide the necessary foundation for better understanding of predictors, mechanisms, and outcomes of safety climate. Currently, we have comparatively few studies on the antecedents of safety climate (e.g., Beus et al., 2015). Theories of safety climate can shed light on contextual factors that facilitate the emergence of such climate in the workplace. By better understanding the antecedents of safety climate, researchers and practitioners will be better equipped to intervene in order to enhance workplace safety.

Research on safety climate can further prosper if theoretical models of safety acknowledge the different levels of safety climate within an organization. Much of the research on safety climate continues to focus on the individual level, with relatively fewer studies exploring safety climate at the group or organizational level. At the same time, it remains unknown whether aggregated responses to safety climate measures maintain the properties of individual-level responses (Beus et al., 2019). Considering that organizational processes at different levels are often interconnected (Kozlowski & Klein, 2000), examining safety climate at different levels separately will give us only an incomplete picture of safety within an organization. Therefore, by explicitly modeling relations between safety climate and various outcomes at different levels of the organization, we decrease the risk of committing either the atomistic or ecological fallacy (Hannan, 1971). Beus et al. (2019) examined the validity of a newly developed safety climate measure across individual and group levels of the construct and reported that the associations between group-level safety climate and injuries/incidents were not substantially different from those using corresponding individual-level perceptions of safety climate. However, more studies that explore the interconnectedness of safety climate and criteria at varying levels are needed.

In addition to advances in theoretical development, research on safety climate can be bolstered by strengthening research methods and measurement. Based on our analysis of studies published over the last five years, safety climate researchers primarily rely on quantitative methods, with comparatively little attention given to qualitative studies. Although quantitative studies permit researchers to objectively test relations of safety climate with theoretically meaningful constructs, such as safety performance and accidents and injuries in the workplace, qualitative methods have been credited with allowing in-depth analysis of complex social phenomena (Patton, 2002).


In the case of safety climate, qualitative designs may be particularly useful in generating theory by illuminating the process through which employee perceptions regarding safety emerge, including the emergence of group level perceptions of safety. Bliese, Chan, and Ployhart (2007) described three sources that might affect the formation of group level perceptions: employees' individual experiences, shared group characteristics (e.g., group cohesion), and clustering of individual attributes by workgroup (e.g., demographics, backgrounds). For example, team faultlines, defined as the hypothetical lines that divide a team into subgroups on one or more attributes (Lau & Murnighan, 1998), could be viewed as a clustering of individual attributes by workgroup. Team faultlines create subgroups among employees; the norms and ideologies of those subgroups shape in-group members' perceptions. Eventually, such in-group and out-group dynamics could produce inaccuracies in quantitative measures. That is, even when quantitative measures demonstrate statistical justification for aggregation and thus suggest a shared perception among employees, the subgroups formed through faultlines may hold different perceptions of safety policies, procedures, and practices (see Beus, Jarrett, Bergman, & Payne, 2012). Qualitative research methods (e.g., action research, interviews, observational research) could reveal the nuances and contexts of a multilevel system (Aiken, Hanges, & Chen, 2018). Thus, using qualitative methods to understand an organization could enrich the understanding of the emergence of employee perceptions at both the individual level and higher levels. Similarly, such designs may be useful for gaining better insights into change in safety climate. Therefore, more safety climate studies should be undertaken using qualitative methodologies.

For research to move forward in the area of safety climate, more studies are needed that utilize strong designs. Studies on safety climate have predominantly relied on cross-sectional designs (e.g., Bodner, Kraner, Bradford, Hammer, & Truxillo, 2014; Smith, Eldridge, & DeJoy, 2016). Although cross-sectional designs allow investigations regarding the interrelatedness of different constructs, they are insufficient for establishing causal relations and may yield results influenced by common method biases (CMB; Podsakoff, MacKenzie, Lee, & Podsakoff, 2003; but also see Spector, 2019). To overcome CMB, some studies have employed prospective designs (e.g., Zohar, Huang, Lee, & Robertson, 2014) whereby variables are measured at two or more different time points. Although prospective designs may offer some advantages over cross-sectional designs, they do not necessarily remove CMB (Spector, 2019), test reverse causation hypotheses (Zapf, Dormann, & Frese, 1996), or allow causal inferences (Stone-Romero, 2010), as implied by some authors. To detect causal effects, studies are needed that employ experimental and quasi-experimental designs. Based on our review, only a handful of studies have utilized such designs in the safety climate literature (e.g., Cox et al., 2017; Graeve, McGovern, Arnold, & Polovich, 2017).

Although researchers have argued that safety climate can change over time, research has yet to explore this phenomenon. One way to explore such changes in

We’ve Got (Safety) Issues  •  219

safety climate is to utilize experience sampling methodology (ESM), which goes beyond between-subjects approaches. Specifically, ESM designs are equipped to assess how within-person perceptions of safety climate fluctuate from one day to another (Beal & Weiss, 2003). At the same time, ESM can help rule out explanations that are introduced by third variables (Uy, Foo, & Aguinis, 2010), thereby enhancing theory. Future studies may consider employing ESM designs to better understand the effect of safety climate on criteria of interest.

CONCLUSION

In the present paper, we reviewed trends within the last five years in the safety climate literature. Our review focused on safety climate, a mature area of research that extends over four decades and encompasses hundreds of studies. Despite the size of the literature, it still lacks consistent conceptualization and operationalization of constructs. Research needs to consider whether potentially important aspects of safety climate, such as the safety communication and training dimensions discussed above, are components of the construct's definition or are better treated as important antecedents or outcomes of safety climate. Additionally, research should explore alternative analytic approaches to examining the dimensions and progression of safety climate over time, including the stability of safety climate, nonlinear patterns of change in safety climate, and the relation of safety climate with potential antecedents and outcomes.

NOTE

1. Because it was not possible to cite all of the empirical studies, a list of the 237 empirical studies included in this review can be obtained from the first author. Please contact her at [email protected]

REFERENCES

Aiken, J. R., Hanges, P. J., & Chen, T. (2018). The means are the end: Complexity science in organizational research. In S.E. Humphrey & J. M. LeBreton (Ed.), The handbook of multilevel theory, measurement, and analysis. Washington, DC: American Psychological Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association. Arcury, T. A., Grzywacz, J. G., Chen, H., Mora, D. C., & Quandt, S. A. (2014). Work organization and health among immigrant women: Latina manual workers in North Carolina. American Journal of Public Health, 104(12), 2445–2452. Arens, O. B., Fierz, K., & Zúñiga, F. (2017). Elder abuse in nursing homes: Do special care units make a difference? A secondary data analysis of the Swiss Nursing Homes Human Resources Project. Gerontology, 63(2), 169–179. https://doi. org/10.1159/000450787 Ausserhofer, D., Schubert, M., Desmedt, M., Blegen, M. A., De Geest, S., & Schwendimann, R. (2013). The association of patient safety climate and nurse-related organizational factors with selected patient outcomes: A cross-sectional survey. In-

220 • TETRICK, SINCLAIR, SAWHNEY, & CHEN ternational Journal of Nursing Studies, 50(2), 240–252. https://doi.org/10.1016/j. ijnurstu.2012.04.007 Beal, D. J., & Weiss, H. M. (2003). Methods of ecological momentary assessment in organizational research. Organizational Research Methods, 6(4), 440–464. Bell, B. G., Reeves, D., Marsden, K., & Avery, A. (2016). Safety climate in English general practices: Workload pressures may compromise safety. Journal of Evaluation in Clinical Practice, 22(1), 71–76. https://doi.org/10.1111/jep.12437 Bennett, P. N., Ockerby, C., Stinson, J., Willcocks, K., & Chalmers, C. (2014). Measuring hospital falls prevention safety climate. Contemporary Nurse, 47(1–2), 27–35. https://doi.org/10.1080/10376178.2014.11081903 Bergheim, K., Eid, J., Hystad, S. W., Nielsen, M. B., Mearns, K., Larsson, G., & Luthans, B. (2013). The role of psychological capital in perception of safety climate among air traffic controllers. Journal of Leadership & Organizational Studies, 20(2), 232– 241. https://doi.org/10.1177/1548051813475483 Bergman, M. E., Payne, S. C., Taylor, A. B., & Beus, J. M. (2014). The shelf life of a safety climate assessment: How long until the relationship with safety–critical incidents expires? Journal of Business and Psychology, 29(4), 519–540. Beus, J. M., Dhanani, L. Y., & McCord, M. A. (2015). A meta-analysis of personality and workplace safety: Addressing unanswered questions. Journal of Applied Psychology, 100, 481–498. https://doi.org/10.1037/a0037916 Beus, J. M., Jarrett, S. M., Bergman, M. E., & Payne, S. C. (2012). Perceptual equivalence of psychological climates within groups: When agreement indices do not agree. Journal of Occupational and Organizational Psychology, 85(3), 454–471. Beus, J. M., Payne, S. C., Arthur Jr, W., & Muñoz, G. J. (2019). The development and validation of a cross-industry safety climate measure: resolving conceptual and operational issues. Journal of Management, 45(5), 1987–2013. Beus, J. M., Payne, S. C., Bergman, M. E., & Arthur, W. (2010). Safety climate and injuries: An examination of theoretical and empirical relationships. Journal of Applied Psychology, 95, 713–727. Bliese, P. D., Chan, D., & Ployhart, R. E. (2007). Multilevel methods: Future directions in measurement, longitudinal analyses, and nonnormal outcomes. Organizational Research Methods, 10(4), 551–563. Block, J. (1995). A contrarian view of the five-factor approach to personality description. Psychological Bulletin, 117, 187–215. Bodner, T., Kraner, M., Bradford, B., Hammer, L., & Truxillo, D. (2014). Safety, health, and well-being of municipal utility and construction workers. Journal of Occupational and Environmental Medicine, 56, 771–778. https://doi.org/10.1097/ JOM.0000000000000178 Bronkhorst, B. (2015). Behaving safely under pressure: The effects of job demands, resources, and safety climate on employee physical and psychosocial safety behavior. Journal of Safety Research, 55, 63–72. https://doi.org/10.1016/j.jsr.2015.09.002 Bronkhorst, B., & Vermeeren, B. (2016). Safety climate, worker health and organizational health performance: Testing a physical, psychosocial and combined pathway. International Journal of Workplace Health Management, 9(3), 270–289. Casey, T., Griffin, M. A., Flatau Harrison, H., Neal, A. (2017). Safety climate and culture: Integrating psychological and systems perspectives. Journal of Occupational Health Psychology, 22, 341–353.

We’ve Got (Safety) Issues  •  221 Christian, M. S., Bradley, J. C., Wallace, J. C., & Burke, M. J. (2009). Workplace safety: A meta-analysis of the roles of person and situational factors. Journal of Applied Psychology, 94, 1103–1127. Clarke, S. (2010). An integrative model of safety climate: Linking psychological climate and work attitudes to individual safety outcomes using meta-analysis. Journal of Occupational and Organizational Psychology, 83, 553–579. Clarke, S. (2013). Safety leadership: A meta-analytic review of transformational and transactional leadership styles as antecedents of safety behaviors. Journal of Occupational and Organizational Psychology, 86, 22–49. Clark, O. L., Zickar, M. J., & Jex, S. M. (2014). Role definition as a moderator of the relationship between safety climate and organizational citizenship behavior among hospital nurses. Journal of Business and Psychology, 29, 101–110. https://doi. org/10.1007/s10869-013-9302-0 Cox, S., & Cox, T. (1991). The structure of employee attitudes to safety: A European example. Work & Stress, 5(2), 93–106. https://doi.org/10.1080/02678379108257007 Cox, E. D., Jacobsohn, G. C., Rajamanickam, V. P., Carayon, P., Kelly, M. M., Wetterneck, T. B., ... & Brown, R. L. (2017). A family-centered rounds checklist, family engagement, and patient safety: A randomized trial. Pediatrics, 139(5), 1–10. https://doiorg.proxy.lib.odu.edu/10.1542/peds.2016-1688 Dedobbeleer, N., & Béland, F. (1991). A safety climate measure for construction sites. Journal of Safety Research, 22, 97–103. https://doi.org/10.1016/0022-4375(91)90017-P de Vries, M. G., Brazil, I. A., Tonkin, M., & Bulten, B. H. (2016). Ward climate within a high secure forensic psychiatric hospital: Perceptions of patients and nursing staff and the role of patient characteristics. Archives of Psychiatric Nursing, 30(3), 342– 349. https://doi.org/10.1016/j.apnu.2015.12.007 Dollard, M. F., & Bakker, A. B. (2010). Psychosocial safety climate as a precursor to conducive work environments, psychological health problems, and employee engagement. Journal of Occupational and Organizational Psychology, 83(3), 579–599. Drach-Zahavy, A., & Somech, A. (2015). Goal orientation and safety climate: Enhancing versus compensatory mechanisms for safety compliance? Group & Organization Management, 40, 560–588. https://doi.org/10.1177/1059601114560372 Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative science quarterly, 44(2), 350–383. Flin, R., Mearns, K., O’Connor, P., & Bryden, R. (2000). Measuring safety climate: Identifying the common features. Safety Science, 34, 177–192. Gazica M. W., & Spector P. E. (2016) A test of safety, violence prevention, and civility climate domain-specific relationships with relevant workplace hazards. International Journal of Occupational and Environmental Health, 22, 45–51. Golubovich, J., Chang, C. H., & Eatough, E. M. (2014). Safety climate, hardiness, and musculoskeletal complaints: A mediated moderation model. Applied Ergonomics, 45(3), 757–766. https://doi.org/10.1016/j.apergo.2013.10.008 Graeve, C., McGovern, P. M., Arnold, S., & Polovich, M. (2017). Testing an intervention to decrease healthcare workers’ exposure to antineoplastic agents. Oncology Nursing Forum, 44(1), E10–E19. https://doi.org/10.1188/17.ONF.E10-E19 Griffin, M. A., & Neal, A. (2000). Perceptions of safety at work: A framework for linking safety climate to safety performance, knowledge, and motivation. Journal of Occupational Health Psychology, 5, 347–358.


BIOGRAPHIES

ABOUT THE AUTHORS

Dr. David G. Allen is Associate Dean for Graduate Programs and Professor of Management, Entrepreneurship, and Leadership at the Neeley School of Business at Texas Christian University; Distinguished Research Environment Professor at Warwick Business School; and Editor-in-Chief of the Journal of Management. Professor Allen earned his Ph.D. from the Beebe Institute of Personnel and Employment Relations at Georgia State University. His teaching, research, and consulting cover a wide range of topics related to people and work, with a particular focus on the flow of human capital into and out of organizations. His award-winning research has been regularly published in the field's top journals, such as Academy of Management Journal, Human Relations, Human Resource Management, Journal of Applied Psychology, Journal of Management, Journal of Organizational Behavior, Organization Science, Organizational Research Methods, and Personnel Psychology, and he is the author of the book Managing Employee Turnover: Dispelling Myths and Fostering Evidence-Based Retention Strategies. Professor Allen is a Fellow of the American Psychological Association, the Society for Industrial and Organizational Psychology, and the Southern Management Association.



Tiancheng (Allen) Chen has a Master of Professional Studies degree in Industrial and Organizational Psychology from the University of Maryland, College Park, and is currently a doctoral student at George Mason University. Allen is also a student member of the Society for Industrial and Organizational Psychology and the Personnel Testing Council of Metropolitan Washington. Allen's research interests are leadership, teams, and organizational climate and culture. Allen co-authored a book chapter in the Handbook of Multilevel Theory, Measurement, and Analysis.

Dr. Angelo DeNisi is the Albert Harry Cohen Chair in Business Administration at Tulane University, where he also served a six-year term as Dean of the A.B. Freeman School of Business. After receiving his Ph.D. in Industrial/Organizational Psychology from Purdue University in 1977, he served as a faculty member at Kent State, the University of South Carolina, Rutgers, and Texas A&M University before moving to Tulane. His research interests include performance appraisal and performance management, as well as expatriate management, and his research has been funded by the National Science Foundation, the U.S. Army Research Institute, several state agencies, and several industry groups in the U.S. He has also served as President of the Society for Industrial and Organizational Psychology (SIOP) and as President of the Academy of Management (AOM); he has chaired both the Organizational Behavior and the Human Resources Divisions of the AOM, and he is a Fellow of the Academy of Management, SIOP, and the American Psychological Association. He has published more than a dozen book chapters and more than 80 articles in refereed journals, most of them in top academic journals such as the Academy of Management Journal (AMJ), the Academy of Management Review (AMR), the Journal of Applied Psychology (JAP), the Journal of Personality and Social Psychology, and Psychological Bulletin. His research has been recognized with awards from several divisions of the AOM, including the 2016 Herbert Heneman Lifetime Contribution Award from the Human Resources Division, and SIOP named him co-winner of the 2005 Distinguished Lifetime Scientific Contribution Award. He serves, or has served, on a number of editorial boards, including AMJ, AMR, JAP, Journal of Management, Entrepreneurship Theory and Practice, and Journal of Organizational Behavior. He was Editor of AMJ from 1994 to 1996 and was recently named Co-Editor of the SIOP Organizational Frontiers Series with Kevin Murphy.

Ian N. Fairbanks is a graduate student and teaching assistant at Clemson University. He is pursuing a Master of Science in Applied Psychology with an emphasis in industrial-organizational psychology. His research interests are in personality and individual differences, particularly in their application to personnel selection and training. He is a member of the Society for Industrial and Organizational Psychology.


Dr. Gerald R. Ferris is the Francis Eppes Professor of Management, Professor of Psychology, and Professor of Sport Management at Florida State University. Before accepting this chaired position, he held the Robert M. Hearin Chair of Business Administration and was Professor of Management and Acting Associate Dean for Faculty and Research in the School of Business Administration at the University of Mississippi from 1999–2000. Before that, he served as Professor of Labor and Industrial Relations, of Business Administration, and of Psychology at the University of Illinois at Urbana-Champaign from 1989–1999, and as the Director of the Center for Human Resource Management at the University of Illinois from 1991–1996. Ferris received a Ph.D. in Business Administration from the University of Illinois at Urbana-Champaign. He has research interests in the areas of social influence in organizations, performance evaluation, and reputation in organizational contexts. Ferris is the author of numerous articles published in such scholarly journals as the Journal of Applied Psychology, Organizational Behavior and Human Decision Processes, Personnel Psychology, the Academy of Management Journal, the Journal of Management, and the Academy of Management Review. Ferris served as editor of the annual series Research in Personnel and Human Resources Management from 1981–2003. He has authored or edited several books, including Political Skill at Work, Handbook of Human Resource Management, Strategy and Human Resources Management, and Method & Analysis in Organizational Research. Ferris has received many distinctions and honors, including the Heneman Career Achievement Award in 2001 and the Thomas A. Mahoney Mentoring Award in 2010, both from the Human Resources Division of the Academy of Management.

Dr. Julie I. Hancock is Assistant Professor at the G. Brint Ryan College of Business, University of North Texas. She holds a Ph.D. in Business Administration from the University of Memphis. Her primary research interests include the flow of people in organizations, collective turnover, perceived organizational support, and pro-social rule breaking. Her work on these topics has been published in Journal of Management, Journal of Organizational Behavior, Human Relations, and Human Resource Management Review. Dr. Hancock currently serves on the Academy of Management HR Division Executive Committee as a Representative-at-Large.

Wayne A. Hochwarter is the Jim Moran Professor of Organizational Behavior in the Department of Management, College of Business at Florida State University (FSU). He is also a Research Fellow in the Jim Moran Institute for Global Entrepreneurship at FSU and Honorary Research Professor at Australian Catholic University. Before moving to FSU in 2001, Hochwarter served on the faculties of Management at Mississippi State University and the University of Alabama. He received a Ph.D. in Management from FSU. Hochwarter has research interests in organizational leadership, power, influence, and workplace social dynamics, and his research has been published in such journals as Administrative Science Quarterly, the Journal of Applied Psychology, Organizational Behavior and Human Decision Processes, the Academy of Management Review, and the Journal of Management.


Dr. Allen I. Huffcutt is Caterpillar Professor at Bradley University in Peoria, Illinois. He publishes regularly in the employment interview literature in a variety of journals, including Human Resource Management Review, European Management Journal, International Journal of Selection and Assessment, and Personnel Assessment and Decisions. This research has addressed core issues such as reliability, validity, construct assessment, and ethnic group differences. His current research focus is on the cognitive processes that underlie responding to Behavior Description and Situational Interviews. In addition, he publishes research on methodological and measurement issues, including meta-analysis and artifact (e.g., range restriction) correction. Dr. Huffcutt has written various book chapters, including in the APA Handbook of Industrial and Organizational Psychology and the Encyclopedia of Industrial-Organizational Psychology. Finally, he is a Fellow of the Society for Industrial and Organizational Psychology and was recently recognized as one of the top two percent most influential authors (Aguinis et al., 2017) as measured by textbook citations. He reviews for a number of journals, including Human Performance, Journal of Business and Psychology, Personnel Assessment and Decisions, and Journal of Business Research.

Samantha L. Jordan is a Ph.D. candidate in Organizational Behavior/Human Resources Management at Florida State University. She received a B.S. degree in Psychology from the University of Florida. Jordan has research interests in organizational justice and inclusion, social influence and political processes, and individual differences (e.g., grit, narcissism). Her research has been published in such journals as Group & Organization Management, Human Resource Management Review, and the Journal of Leadership & Organizational Studies.

Dr. Liam P. Maher is Assistant Professor of Management in the College of Business and Economics at Boise State University. He received a Ph.D. in Management from Florida State University. His research interests include political skill, political will, leadership, and identity. His research can be found in Personnel Psychology, Annual Review of Organizational Psychology and Organizational Behavior, Journal of Vocational Behavior, Group & Organization Management, and the Journal of Leadership & Organizational Studies.

Dr. Kevin Murphy holds the Kemmy Chair of Work and Employment Studies at the University of Limerick. Professor Murphy earned his Ph.D. in Psychology from The Pennsylvania State University in 1979 and has served on the faculties of Rice University, New York University, Pennsylvania State University, and Colorado State University.


He is a Fellow of the American Psychological Association, the Society for Industrial and Organizational Psychology (SIOP), and the American Psychological Society, and the recipient of SIOP's 2004 Distinguished Scientific Contribution Award. He is the author of over one hundred and ninety articles and book chapters, and author or editor of eleven books, in areas ranging from psychometrics and statistical analysis to individual differences, performance assessment, and honesty in the workplace. He served as co-Editor of the Taylor & Francis (previously Erlbaum) Applied Psychology Series and has been appointed co-editor, with Angelo DeNisi, of the SIOP Organizational Frontiers Series. He has served as President of SIOP and as Editor of the Journal of Applied Psychology and of Industrial and Organizational Psychology: Perspectives on Science and Practice, and is a member of numerous editorial boards. Throughout his career, Dr. Murphy has worked to advance both research and the application of that research to solve practical problems in organizations. For example, he served as both a member and the Chair of the U.S. Department of Defense Advisory Committee on Military Personnel Testing, and has also served on five U.S. National Academy of Sciences committees, all of which dealt with problems in the workplace. He has carried out a number of research projects with military and national security organizations, dealing with problems ranging from training to applying research on motivation to problems of nuclear deterrence, and has worked with numerous private and public-sector organizations to build and evaluate their human resource management systems.

Dr. Patrick J. Rosopa is an Associate Professor in the Department of Psychology at Clemson University. His substantive research interests are in personality and cognitive ability, stereotypes and fairness in the workplace, and cross-cultural issues in organizational research. He also has quantitative research interests in applied statistical modeling in the behavioral sciences, including applications of machine learning and the use of computer-intensive approaches to evaluate statistical procedures. Dr. Rosopa's work has been supported by $4.1 million in grants from various organizations, including Alcon, BMW, and the National Science Foundation. Dr. Rosopa has published in various peer-reviewed journals, including Psychological Methods, Organizational Research Methods, Journal of Modern Applied Statistical Methods, Human Resource Management Review, Journal of Managerial Psychology, Journal of Vocational Behavior, Human Performance, and Personality and Individual Differences. In addition, he has co-authored a statistics textbook titled Statistical Reasoning in the Behavioral Sciences, published by Wiley in 2010 and 2018. Dr. Rosopa serves on the editorial boards of Human Resource Management Review and Organizational Research Methods. He also serves as Associate Editor (Methodology) for the Journal of Managerial Psychology. Dr. Rosopa is a member of the American Psychological Association, the Association for Psychological Science, and the Society for Industrial and Organizational Psychology.


Zachary A. Russell is Assistant Professor of Management in the Department of Management and Entrepreneurship, Williams College of Business at Xavier University. He received a Ph.D. in Management from Florida State University. His research interests include reputation, social influence, organizational politics, human resource practice implementation, and labor unions. His research has been published in Human Resource Management Review, Journal of Leadership & Organizational Studies, Journal of Organizational Effectiveness: People and Performance, and Journal of Labor Research.

Dr. Gargi Sawhney is Assistant Professor at the University of Minnesota Duluth. Her research interests fall within the realm of occupational stress and occupational safety. Dr. Sawhney's research has been published in various peer-reviewed outlets, including Journal of Business and Psychology, Journal of Occupational Health Psychology, and Journal of Positive Psychology.

Dr. Neal Schmitt is Emeritus Professor of Psychology and Management at Michigan State University. He was editor of the Journal of Applied Psychology from 1988–1994 and has served on a dozen editorial boards. He has received the Society for Industrial and Organizational Psychology's Distinguished Scientific Contributions Award (1999) and its Distinguished Service Contributions Award (1998). In 2014, he was named a James McKeen Cattell Fellow of the American Psychological Society (APS). He served as President of the Society for Industrial and Organizational Psychology in 1989–1990 and as President of Division 5 (Measurement, Evaluation, and Statistics) of the American Psychological Association (APA). Schmitt is a Fellow of APA Divisions 5 and 14, APA, and APS. He was also awarded the Heneman Career Achievement Award and the Career Mentoring Award from the Human Resources Division of the Academy of Management, and the Distinguished Career Award from the Research Methods Division of the Academy of Management. He has co-authored three textbooks: Staffing Organizations with Ben Schneider and Rob Ployhart, Research Methods in Human Resource Management with Richard Klimoski, and Personnel Selection with David Chan. He also edited the Handbook of Assessment and Selection, co-edited Personnel Selection in Organizations with Walter Borman and Measurement and Data Analysis with Fritz Drasgow, and has published approximately 250 peer-reviewed articles and book chapters. His current research centers on the effectiveness of organizations' selection procedures, college admissions processes, and the outcomes of these procedures. He is the current chair of the Defense Advisory Committee for Military Personnel Testing and chair of the Publications Committee of the International Testing Commission.


Dr. Amber N. Schroeder is an Assistant Professor of Psychology at The University of Texas at Arlington. She earned an M.S. and Ph.D. in Industrial-Organizational Psychology from Clemson University after completing a B.A. in psychology from Texas A&M University. She has published in elite peer-reviewed journals, including the Journal of Applied Psychology, Psychological Methods, Psychological Bulletin, Journal of Occupational Health Psychology, Journal of Managerial Psychology, and Computers in Human Behavior, among others. She has also served as a PI on a National Science Foundation grant-funded project and as a program evaluator on a U.S. Department of Education Race to the Top grant. Dr. Schroeder's research focuses primarily on the impact of technology use in organizational settings, with published articles on topics such as the use of web-based job applicant screening (i.e., cybervetting) and employee engagement in cybermisbehavior (e.g., online incivility), as well as on approaches for detecting and managing heteroscedasticity. Dr. Schroeder is a member of the Society for Industrial and Organizational Psychology (SIOP) and the Association for Psychological Science (APS).

Dr. Robert R. Sinclair is Professor of Industrial-Organizational Psychology at Clemson University. Prior to arriving at Clemson in 2008, he was a member of the faculty at Portland State University (2000–2008) and the University of Tulsa (1995–1999). Bob currently serves as the Founding Editor-in-Chief of Occupational Health Science and as an Associate Editor of the Journal of Business and Psychology, and is a founding member and past president of the Society for Occupational Health Psychology. Bob has published over 70 book chapters and articles in leading journals such as the Journal of Applied Psychology, Journal of Organizational Behavior, Journal of Occupational and Organizational Psychology, and Journal of Occupational Health Psychology. He has also published four edited volumes, including Contemporary Occupational Health Psychology: Global Perspectives on Research and Practice, Volume 2 (2012, with Houdmont & Leka) and Volume 3 (with Leka), Building Psychological Resilience in Military Personnel: Theory and Practice (2013, with Britt), and Research Methods in Occupational Health Psychology: Measurement, Design, and Data Analysis (2012, with Wang and Tetrick).

Dr. Eugene F. Stone-Romero is Research Professor at the Anderson Graduate School of Management, University of New Mexico. He is a Fellow of the Society for Industrial and Organizational Psychology, the Association for Psychological Science, and the American Psychological Association. He served as an Associate Editor of the Journal of Applied Psychology and as a member of numerous editorial boards. Stone-Romero received the Distinguished Career Award of the Research Methods Division of the Academy of Management in recognition of publications in the field of research methods. In addition, he received the Thomas Mahoney Career Mentoring Award from the Human Resource Division of the Academy of Management in recognition of lifelong mentoring of doctoral students in human resource management.


Moreover, he received the Kenneth and Mamie Clark Award of the American Psychological Association of Graduate Students (APAGS) in recognition of outstanding contributions to the professional development of ethnic minority graduate students. The results of his research have appeared in such outlets as the Journal of Applied Psychology, Organizational Behavior and Human Performance, Personnel Psychology, Journal of Vocational Behavior, Academy of Management Journal, Journal of Management, Educational and Psychological Measurement, Journal of Educational Psychology, Research in Personnel and Human Resources Management, Applied Psychology: An International Review, and the Journal of Applied Social Psychology. He is also the author of numerous book chapters dealing with issues germane to the related fields of research methods, human resource management, industrial and organizational psychology, and organizational behavior. Stone-Romero is the author of a chapter on research methods in the APA Handbook of Industrial and Organizational Psychology. Further, he is the author of a book titled Research Methods in Organizational Behavior, and the co-author of books titled Job Satisfaction: How People Feel about Their Jobs and How It Affects Their Performance and The Influence of Culture on Human Resource Management Processes and Practices.

Dr. Lois Tetrick is University Professor in the Industrial and Organizational Psychology Program at George Mason University. She is a former president of the Society for Industrial and Organizational Psychology and a founding member of the Society for Occupational Health Psychology. Dr. Tetrick is a fellow of the European Academy of Occupational Health Psychology, the American Psychological Association, the Society for Industrial and Organizational Psychology, and the Association for Psychological Science. Dr. Tetrick is a past editor of the Journal of Occupational Health Psychology and the Journal of Managerial Psychology, and served as Associate Editor of the Journal of Applied Psychology. Dr. Tetrick has edited several books, including The Employment Relationship: Examining Psychological and Contextual Perspectives with Jackie Coyle-Shapiro, Lynn Shore, and Susan Taylor; The Employee-Organization Relationship: Applications for the 21st Century with Lynn Shore and Jackie Coyle-Shapiro; the Handbook of Occupational Health Psychology (1st and 2nd editions) with James C. Quick; Health and Safety in Organizations with David Hofmann; Research Methods in Occupational Health Psychology: Measurement, Design and Data Analysis with Bob Sinclair and Mo Wang; and two volumes on cybersecurity response teams, Psychosocial Dynamics of Cybersecurity and Improving Social Maturity of Cybersecurity Incident Response Teams, with S. J. Zaccaro, R. D. Dalal, and colleagues. In addition, she has published numerous chapters and journal articles on topics related to her research interests in occupational health and safety, occupational stress, the work-family interface, psychological contracts, social exchange theory and reciprocity, organizational commitment, and organizational change and development.


Julia H. Whitaker is a doctoral student at The University of Texas at Arlington. She has an M.S. in experimental psychology from The University of Texas at Arlington, as well as a B.S. in psychology and a certificate in Human Resource Management from Indiana University–Purdue University Indianapolis. Her research interests include organizational technology use, with a special focus on the use of online information for employment purposes (i.e., cybervetting). She has co-authored a book chapter examining various forms of cyberdeviance (e.g., cyberloafing, cybercrime), and has presented at national conferences on topics such as applicant reactions to cybervetting procedures and rater cognition in cybervetting-based evaluation. Julia is a member of the Society for Industrial and Organizational Psychology.

Phoebe Xoxakos is a graduate student and teaching assistant in the Department of Psychology at Clemson University. She is pursuing a Ph.D. in Industrial-Organizational Psychology. Her research interests relate to diversity, including ways to enhance diversity climate; the effects of oppressive systems such as sexism, ageism, and racism; and quantitative methods. She is a member of the Society for Industrial and Organizational Psychology.