Applied Univariate, Bivariate, and Multivariate Statistics Using SPSS 111946580X, 9781119465805


English Pages 288 [208] Year 2018



SPSS Data Analysis for Univariate, Bivariate, and Multivariate Statistics Daniel J. Denis

This edition first published 2019 © 2019 John Wiley & Sons, Inc. Edition History All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions. The right of Daniel J. Denis to be identified as the author of the material in this work has been asserted in accordance with law. Registered Office John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA Editorial Office 111 River Street, Hoboken, NJ 07030, USA For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com. Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats. Limit of Liability/Disclaimer of Warranty In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. 
No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Library of Congress Cataloging‐in‐Publication Data Names: Denis, Daniel J., 1974– author. Title: SPSS data analysis for univariate, bivariate, and multivariate statistics / Daniel J. Denis. Description: Hoboken, NJ : Wiley, 2019. | Includes bibliographical references and index. | Identifiers: LCCN 2018025509 (print) | LCCN 2018029180 (ebook) | ISBN 9781119465805 (Adobe PDF) |   ISBN 9781119465782 (ePub) | ISBN 9781119465812 (hardcover) Subjects: LCSH: Analysis of variance–Data processing. | Multivariate analysis–Data processing. |   Mathematical statistics–Data processing. | SPSS (Computer file) Classification: LCC QA279 (ebook) | LCC QA279 .D45775 2019 (print) | DDC 519.5/3–dc23 LC record available at https://lccn.loc.gov/2018025509 Cover Design: Wiley Cover Images: © GarryKillian/Shutterstock Set in 10/12pt Warnock by SPi Global, Pondicherry, India Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

Contents

Preface

1 Review of Essential Statistical Principles
1.1 Variables and Types of Data
1.2 Significance Tests and Hypothesis Testing
1.3 Significance Levels and Type I and Type II Errors
1.4 Sample Size and Power
1.5 Model Assumptions

2 Introduction to SPSS
2.1 How to Communicate with SPSS
2.2 Data View vs. Variable View
2.3 Missing Data in SPSS: Think Twice Before Replacing Data!

3 Exploratory Data Analysis, Basic Statistics, and Visual Displays
3.1 Frequencies and Descriptives
3.2 The Explore Function
3.3 What Should I Do with Outliers? Delete or Keep Them?
3.4 Data Transformations

4 Data Management in SPSS
4.1 Computing a New Variable
4.2 Selecting Cases
4.3 Recoding Variables into Same or Different Variables
4.4 Sort Cases
4.5 Transposing Data

5 Inferential Tests on Correlations, Counts, and Means
5.1 Computing z‐Scores in SPSS
5.2 Correlation Coefficients
5.3 A Measure of Reliability: Cohen’s Kappa
5.4 Binomial Tests
5.5 Chi‐square Goodness‐of‐fit Test
5.6 One‐sample t‐Test for a Mean
5.7 Two‐sample t‐Test for Means

6 Power Analysis and Estimating Sample Size
6.1 Example Using G*Power: Estimating Required Sample Size for Detecting Population Correlation
6.2 Power for Chi‐square Goodness of Fit
6.3 Power for Independent‐samples t‐Test
6.4 Power for Paired‐samples t‐Test

7 Analysis of Variance: Fixed and Random Effects
7.1 Performing the ANOVA in SPSS
7.2 The F‐Test for ANOVA
7.3 Effect Size
7.4 Contrasts and Post Hoc Tests on Teacher
7.5 Alternative Post Hoc Tests and Comparisons
7.6 Random Effects ANOVA
7.7 Fixed Effects Factorial ANOVA and Interactions
7.8 What Would the Absence of an Interaction Look Like?
7.9 Simple Main Effects
7.10 Analysis of Covariance (ANCOVA)
7.11 Power for Analysis of Variance

8 Repeated Measures ANOVA
8.1 One‐way Repeated Measures
8.2 Two‐way Repeated Measures: One Between and One Within Factor

9 Simple and Multiple Linear Regression
9.1 Example of Simple Linear Regression
9.2 Interpreting a Simple Linear Regression: Overview of Output
9.3 Multiple Regression Analysis
9.4 Scatterplot Matrix
9.5 Running the Multiple Regression
9.6 Approaches to Model Building in Regression
9.7 Forward, Backward, and Stepwise Regression
9.8 Interactions in Multiple Regression
9.9 Residuals and Residual Plots: Evaluating Assumptions
9.10 Homoscedasticity Assumption and Patterns of Residuals
9.11 Detecting Multivariate Outliers and Influential Observations
9.12 Mediation Analysis
9.13 Power for Regression

10 Logistic Regression
10.1 Example of Logistic Regression
10.2 Multiple Logistic Regression
10.3 Power for Logistic Regression

11 Multivariate Analysis of Variance (MANOVA) and Discriminant Analysis
11.1 Example of MANOVA
11.2 Effect Sizes
11.3 Box’s M Test
11.4 Discriminant Function Analysis
11.5 Equality of Covariance Matrices Assumption
11.6 MANOVA and Discriminant Analysis on Three Populations
11.7 Classification Statistics
11.8 Visualizing Results
11.9 Power Analysis for MANOVA

12 Principal Components Analysis
12.1 Example of PCA
12.2 Pearson’s 1901 Data
12.3 Component Scores
12.4 Visualizing Principal Components
12.5 PCA of Correlation Matrix

13 Exploratory Factor Analysis
13.1 The Common Factor Analysis Model
13.2 The Problem with Exploratory Factor Analysis
13.3 Factor Analysis of the PCA Data
13.4 What Do We Conclude from the Factor Analysis?
13.5 Scree Plot
13.6 Rotating the Factor Solution
13.7 Is There Sufficient Correlation to Do the Factor Analysis?
13.8 Reproducing the Correlation Matrix
13.9 Cluster Analysis
13.10 How to Validate Clusters?
13.11 Hierarchical Cluster Analysis

14 Nonparametric Tests
14.1 Independent‐samples: Mann–Whitney U
14.2 Multiple Independent‐samples: Kruskal–Wallis Test
14.3 Repeated Measures Data: The Wilcoxon Signed‐rank Test and Friedman Test
14.4 The Sign Test

Closing Remarks and Next Steps
References
Index

Preface

The goals of this book are to present a very concise, easy‐to‐use introductory primer of a host of computational tools useful for making sense out of data, whether those data come from the social, behavioral, or natural sciences, and to get you started doing data analysis fast. The emphasis of the book is on data analysis and drawing conclusions from empirical observations, not on theory. Formulas are given where needed, but the focus is on concepts rather than on mathematical abstraction. We emphasize computational tools used in the discovery of empirical patterns and feature a variety of popular statistical analyses and data management tasks that you can immediately apply as needed to your own research.

The book features analyses and demonstrations using SPSS. Most of the data sets analyzed are very small and convenient, so entering them into SPSS should be easy. If desired, however, one can also download them from www.datapsyc.com. Many of the data sets were also first used in a more theoretical text written by the same author (see Denis, 2016), which should be consulted for a more in‐depth treatment of the topics presented in this book. Additional references for readings are also given throughout the book.

Target Audience and Level

This is a “how‐to” book and will be of use to undergraduate and graduate students, along with researchers and professionals who require a quick go‐to source to help them perform essential statistical analyses and data management tasks. The book assumes only minimal prior knowledge of statistics, providing you with the tools you need right now to help you understand and interpret your data analyses. A prior introductory course in statistics at the undergraduate level would be helpful, but is not required for this book.

Instructors may choose to use the book either as a primary text for an undergraduate or graduate course or as a supplement to a more technical text, referring to this book primarily for the “how to’s” of data analysis in SPSS. The book can also be used for self‐study. It is suitable for use as a general reference in all social and natural science fields and may also be of interest to those in business who use SPSS for decision‐making. References to further reading are provided where appropriate should the reader wish to follow up on these topics or expand his/her knowledge base as it pertains to theory and further applications.

An early chapter reviews essential statistical and research principles usually covered in an introductory statistics course, which should be sufficient for understanding the rest of the book and interpreting analyses. Brief sample write‐ups are also provided for select analyses to give the reader a starting point for writing up his/her own results for a thesis, dissertation, or publication. The book is meant to be an easy, user‐friendly introduction to a wealth of statistical methods while simultaneously demonstrating their implementation in SPSS. Please contact me at [email protected] or [email protected] with any comments or corrections.

Glossary of Icons and Special Features

When you see this symbol, it means a brief sample write‐up has been provided for the accompanying output. These brief write‐ups can be used as starting points to writing up your own results for your thesis/dissertation or even publication.

When you see this symbol, it means a special note, hint, or reminder has been provided or signifies extra insight into something not thoroughly discussed in the text.

When you see this symbol, it means a special WARNING has been issued that if not followed may result in a serious error.

Acknowledgments

Thanks go out to Wiley for publishing this book, especially to Jon Gurstelle for presenting the idea to Wiley and securing the contract for the book and to Mindy Okura‐Marszycki for taking over the project after Jon left. Thank you, Kathleen Pagliaro, for keeping in touch about this project and the former book. Thanks go out to everyone (far too many to mention) who has influenced me in one way or another in my views and philosophy about statistics and science, including undergraduate and graduate students whom I have had the pleasure of teaching (and learning from) in my courses taught at the University of Montana.

This book is dedicated to all military veterans of the United States of America, past, present, and future, who teach us that all problems are relative.


1 Review of Essential Statistical Principles

Big Picture on Statistical Modeling and Inference

The purpose of statistical modeling is both to describe sample data and to make inferences from that sample data to the population from which the data were drawn. We compute statistics on samples (e.g. sample mean) and use such statistics as estimators of population parameters (e.g. population mean). When we use the sample statistic to estimate a parameter in the population, we are engaged in the process of inference, which is why such statistics are referred to as inferential statistics, as opposed to descriptive statistics, where we are typically simply describing something about a sample or population. All of this usually occurs in an experimental design (e.g. where we have a control vs. treatment group) or nonexperimental design (where we exercise little or no control over variables).

As an example of an experimental design, suppose you wanted to learn whether a pill was effective in reducing symptoms from a headache. You could sample 100 individuals with headaches, give them a pill, and compare their reduction in symptoms to 100 people suffering from a headache but not receiving the pill. If the group receiving the pill showed a decrease in symptomology compared with the nontreated group, it may indicate that your pill is effective. However, to estimate whether the effect observed in the sample data is generalizable and inferable to the population from which the data were drawn, a statistical test could be performed to indicate whether it is plausible that such a difference between groups could have occurred simply by chance. If it were found that the difference was unlikely due to chance, then we may indeed conclude a difference in the population from which the data were drawn. The probability of data occurring under some assumption of (typically) equality is the infamous p‐value, which is usually compared against a significance level set at 0.05. If the probability of such data is relatively low (e.g.
less than 0.05) under the null hypothesis of no difference, we reject the null and infer the statistical alternative hypothesis of a difference in population means.

Much of statistical modeling follows a similar logic to that featured above – sample some data, apply a model to the data, and then estimate how well the model fits and whether there is inferential evidence to suggest an effect in the population from which the data were drawn. The actual model you will fit to your data usually depends on the type of data you are working with. For instance, if you have collected sample means and wish to test differences between means, then t‐test and ANOVA techniques are appropriate. On the other hand, if you have collected data in which you would like to see if there is a linear relationship between continuous variables, then correlation and regression are usually appropriate. If you have collected data on numerous dependent variables and believe these variables, taken together as a set, represent some kind of composite variable, and wish to determine mean differences on this composite dependent variable, then a multivariate analysis of variance (MANOVA) technique may be useful. If you wish to predict group membership into two or more categories based on a set of predictors, then discriminant analysis or logistic regression would be an option. If you wish to take many variables and reduce them down to fewer dimensions, then principal components analysis or factor analysis may be your technique of choice. Finally, if you are interested in hypothesizing networks of variables and their interrelationships, then path analysis and structural equation modeling may be your model of choice (not covered in this book). There are numerous other possibilities as well, but overall, you should heed the following principle in guiding your choice of statistical analysis:

The type of statistical model or method you select often depends on the types of data you have and your purpose for wanting to build a model. There usually is not one and only one method that is possible for a given set of data. The method of choice will often be dictated by the rationale of your research. You must know your variables very well, along with the goals of your research, to diligently select a statistical model.

SPSS Data Analysis for Univariate, Bivariate, and Multivariate Statistics, First Edition. Daniel J. Denis. © 2019 John Wiley & Sons, Inc. Published 2019 by John Wiley & Sons, Inc.
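As a hedged illustration of the headache‐pill logic described in this chapter, the following Python sketch (Python rather than SPSS, and not taken from the book) asks how often a difference in group means at least as large as the observed one would arise if the group labels were assigned purely by chance. It uses a permutation test as a stand‐in for the t‐test one would typically run in SPSS. The scores, group sizes, seeds, and the function name permutation_p_value are all assumptions made for illustration.

```python
import random


def _mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(treated, control, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    This is an illustrative substitute for a t-test, not the book's method.
    """
    rng = random.Random(seed)
    observed = _mean(treated) - _mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random, as if the pill made no difference.
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_treated]) - _mean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical symptom-reduction scores, simulated for illustration only.
data_rng = random.Random(42)
treated = [data_rng.gauss(5.0, 2.0) for _ in range(100)]
control = [data_rng.gauss(3.0, 2.0) for _ in range(100)]

observed, p_value = permutation_p_value(treated, control)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

A small p‐value here plays the same role as a small “Sig.” value in SPSS output: it suggests that a difference this large would rarely arise from chance assignment alone.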

1.1 Variables and Types of Data

Recall that variables are typically of two kinds – dependent or response variables and independent or predictor variables. The terms “dependent” and “independent” are most common in ANOVA‐type models, while “response” and “predictor” are more common in regression‐type models, though their usage is not tied to any particular methodology. The classic function statement Y = f(X) tells the story – input a value for X (independent variable), and observe the effect on Y (dependent variable). In an independent‐samples t‐test, for instance, X is a variable with two levels, while the dependent variable is a continuous variable. In a classic one‐way ANOVA, X has multiple levels. In a simple linear regression, X is usually a continuous variable, and we use the variable to make predictions of another continuous variable Y. Most of statistical modeling is simply observing an outcome based on something you are inputting into an equation estimated from the sample data.

Data come in many different forms. Though there are rather precise theoretical distinctions between different forms of data, for applied purposes, we can summarize the discussion into the following types for now: (i) continuous and (ii) discrete. Variables measured on a continuous scale can, in theory, achieve any numerical value on the given scale. For instance, length is typically considered to be a continuous variable, since we can measure length to any specified numerical degree. That is, the distance between 5 and 10 inches on a scale contains an infinite number of measurement possibilities (e.g. 6.1852, 8.341364, etc.). The scale is continuous because it assumes an infinite number of possibilities between any two points on the scale and has no “breaks” in that continuum. On the other hand, if a scale is discrete, it means that between any two values on the scale, only a select number of possibilities can exist.
As an example, the number of coins in my pocket is a discrete variable, since I cannot have 1.5 coins. I can have 1 coin, 2 coins, 3 coins, etc., but between those values there does not exist an infinite number of possibilities. Sometimes data are also categorical, which means values of the variable are mutually exclusive categories, such as A or B or C or “boy” or “girl.” Other times, data come in the form of counts, where instead of measuring something like IQ, we are only counting the number of occurrences of some behavior (e.g. number of times I blink in a minute).

Depending on the type of data you have, different statistical methods will apply. As we survey what SPSS has to offer, we identify variables as continuous, discrete, or categorical as we discuss the given method. However, do not get too caught up with definitions here; there is always a bit of “fuzziness” in learning about the nature of the variables you have. For example, if I count the number of raindrops in a rainstorm, we would be hard pressed to call this “count data.” We would instead just accept it as continuous data and treat it as such. Many times you have to compromise a bit between data types to best answer a research question. Surely, the average number of people per household does not make sense, yet census reports often give us such figures on “count” data. Always remember, however, that the software does not recognize the nature of your variables or how they are measured. You have to be certain of this information going in; know your variables very well, so that you can be sure SPSS is treating them as you had planned.

Scales of measurement are also classified as nominal, ordinal, interval, and ratio. A nominal scale is not really measurement in the first place, since it simply assigns labels to the objects we are studying. The classic example is that of numbers on football jerseys. That one player has the number 10 and another the number 15 means nothing more than that the numbers are labels distinguishing the two players. If the numbers do reflect an ordering of magnitudes, but the differences between those magnitudes are unknown or imprecise, then we have measurement at the ordinal level. For example, that one runner finished first and another second constitutes measurement at the ordinal level. Nothing is said of the time difference between the first and second runner, only that there is a “ranking” of the runners. If differences between numbers on a scale represent equal lengths, but an absolute zero point still cannot be defined, then we have measurement at the interval level. A classic example of this is temperature in degrees Fahrenheit – the difference between 10 and 20° represents the same amount of temperature distance as that between 20 and 30°; however, zero on the scale does not represent an “absence” of temperature.
When we can ascribe an absolute zero point in addition to inferring the properties of the interval scale, then we have measurement at the ratio scale. The number of coins in my pocket is an example of ratio measurement, since zero on the scale represents a complete absence of coins. The number of car accidents in a year is another variable measurable on a ratio scale, since it is possible, however unlikely, that there were no accidents in a given year. The first step in choosing a statistical model is knowing what kind of data you have, whether they are continuous, discrete, or categorical and with some attention also devoted to whether the data are nominal, ordinal, interval, or ratio. Making these decisions can be a lot trickier than it sounds, and you may need to consult with someone for advice on this before selecting a model. Other times, it is very easy to determine what kind of data you have. But if you are not sure, check with a statistical consultant to help confirm the nature of your variables, because making an error at this initial stage of analysis can have serious consequences and jeopardize your data analyses entirely.
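The four scales of measurement can be made concrete with a short Python sketch. This is only an illustration under stated assumptions, not material from the book: the data values and variable names are hypothetical, and the pairing of each scale with a summary (mode, median, mean, ratio) follows the usual textbook convention.

```python
import statistics

# Hypothetical data, one small list per scale of measurement.
jersey_numbers = [10, 15, 10, 7, 10]    # nominal: numbers are labels only
finishing_places = [1, 2, 3, 4, 5]      # ordinal: rank order only
temperatures_f = [10.0, 20.0, 30.0]     # interval: equal steps, no true zero
coins_in_pocket = [0, 2, 4, 8]          # ratio: equal steps and a true zero

# Nominal: only counting labels makes sense (e.g. the most frequent label).
most_common_jersey = statistics.mode(jersey_numbers)

# Ordinal: order is meaningful, so the median is interpretable.
middle_place = statistics.median(finishing_places)

# Interval: differences are meaningful, so a mean is interpretable.
mean_temperature = statistics.mean(temperatures_f)


def fahrenheit_to_celsius(degrees_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32.0) * 5.0 / 9.0


# Interval: ratios are NOT meaningful, because they change with the unit.
ratio_f = temperatures_f[1] / temperatures_f[0]
ratio_c = fahrenheit_to_celsius(temperatures_f[1]) / fahrenheit_to_celsius(temperatures_f[0])

# Ratio: a true zero makes "twice as many" a meaningful statement.
coin_ratio = coins_in_pocket[3] / coins_in_pocket[2]

print(most_common_jersey, middle_place, mean_temperature)
print(ratio_f, ratio_c, coin_ratio)
```

The point of the temperature comparison is that 20° is “twice” 10° only in Fahrenheit units; after converting to Celsius the ratio is different, which is why ratios are reserved for scales with an absolute zero.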

1.2 Significance Tests and Hypothesis Testing

In classical statistics, a hypothesis test is about the value of a parameter we are wishing to estimate with our sample data. Consider our previous example of the two‐group problem regarding trying to establish whether taking a pill is effective in reducing headache symptoms. If there were no difference between the group receiving the treatment and the group not receiving the treatment, then we would expect the parameter difference to equal 0. We state this as our null hypothesis:

Null hypothesis: The mean difference in the population is equal to 0.

The alternative hypothesis is that the mean difference is not equal to 0. Now, if our sample means come out to be 50.0 for the control group and 50.0 for the treated group, then it is obvious that we do not have evidence to reject the null, since the difference of 50.0 – 50.0 = 0 aligns directly with expectation under the null. On the other hand, if the means were 48.0 vs. 52.0, could we reject the null? Yes, there is definitely a sample difference between groups, but do we have evidence for a population difference? It is difficult to say without asking the following question:

What is the probability of observing a difference such as 48.0 vs. 52.0 under the null hypothesis of no difference?

When we evaluate a null hypothesis, it is the parameter we are interested in, not the sample statistic. The fact that we observed a difference of 4 (i.e. 52.0 – 48.0) in our sample does not by itself indicate that in the population, the parameter is unequal to 0. To be able to reject the null hypothesis, we need to conduct a significance test on the mean difference of 48.0 vs. 52.0, which involves computing (in this particular case) what is known as a standard error of the difference in means to estimate how likely such differences are to occur in theoretical repeated sampling. When we do this, we are comparing an observed difference to a difference we would expect simply due to random variation. Virtually all test statistics follow the same logic. That is, we compare what we have observed in our sample(s) to variation we would expect under a null hypothesis or, crudely, what we would expect under simply “chance.” Virtually all test statistics have the following form:

Test statistic = observed/expected

If the observed difference is large relative to the expected difference, then we garner evidence that such a difference is not simply due to chance and may represent an actual difference in the population from which the data were drawn. As mentioned previously, significance tests are not only performed on mean differences, however. Whenever we wish to estimate a parameter, whatever the kind, we can perform a significance test on it.
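The observed/expected logic for the 48.0 vs. 52.0 example can be sketched in Python as follows. This is a minimal illustration under loudly stated assumptions, not the book's or SPSS's computation: the standard deviations (10 in each group) and the sample sizes are hypothetical, the function name mean_difference_test is invented here, and the p‐value uses a large‐sample normal approximation rather than the t distribution that a t‐test would use.

```python
import math
from statistics import NormalDist


def mean_difference_test(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return (observed difference, standard error, test statistic, p-value).

    Test statistic = observed / expected, where "expected" is the standard
    error of the difference in means. The p-value is a two-sided
    normal approximation (an assumption; a t distribution is more usual).
    """
    observed = mean_b - mean_a
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    statistic = observed / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return observed, standard_error, statistic, p_value


# Same observed difference of 4, with hypothetical SDs of 10 in each group.
large_sample = mean_difference_test(48.0, 52.0, 10.0, 10.0, 100, 100)
small_sample = mean_difference_test(48.0, 52.0, 10.0, 10.0, 10, 10)

for label, result in (("n = 100 per group", large_sample),
                      ("n = 10 per group", small_sample)):
    observed, standard_error, statistic, p_value = result
    print(f"{label}: difference = {observed:.1f}, SE = {standard_error:.3f}, "
          f"statistic = {statistic:.3f}, p = {p_value:.4f}")
```

Under these assumed values, the same sample difference of 4 yields a much larger test statistic with 100 cases per group than with 10, because the variation expected by chance (the standard error) shrinks as the sample grows.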
Hence, when we perform t‐tests, ANOVAs, regressions, etc., we are continually computing sample statistics and conducting tests of significance about parameters of interest. Whenever you see output such as “Sig.” in SPSS with a probability value underneath it, it means a significance test has been performed on that statistic; the value reported under “Sig.” is, as mentioned already, the p‐value. When we reject the null at, say, p