The PSPP Guide: An Introduction to Statistical Analysis

Second Edition

Christopher P. Halter
CreativeMinds Press Group, San Diego, CA


Copyright © 2017 by Christopher P. Halter. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the CreativeMinds Press Group.

ISBN: 0692866043
ISBN-13: 978-0692866047

CONTENTS

Chapter 1 An Introduction to the Guide Second Edition  1
  Notes about the statistics guide  1
  Notes about the data  2
  The Philosophy Behind This Book and the Open Source Community  3

Chapter 2 Overview of Statistical Analysis in Social Science  4
  Why use statistics in Social Science research?  4
  What is Continuous and Categorical Data?  5
  Parametric versus Non-Parametric Data  10
  Confidence Intervals (CI)  11
  P-Value  12
  Effect Size  14
  Effect Size Calculations  15

Chapter 3 The PSPP Statistical Analysis Environment  17
  What is PSPP?  17
  Data Visualization  20

Chapter 4 Getting Started with PSPP  22
  Preparing the Data and Making Decisions  22
  Creating Your Variable Codebook  22
  Creating Variable/Data Names in PSPP  25
  Entering Data Directly into PSPP  29
  Opening Data Files with PSPP (.sav files)  30
  Importing Data Files into PSPP from a Spreadsheet (.ods files)  32

Chapter 5 Descriptive Statistics  36
  What are descriptive statistics?  36
  Creating Descriptive Statistics in PSPP for Categorical Data  36
  Creating Visual Representations for Categorical Data  38
  Creating Descriptive Statistics in PSPP for Continuous Data  39
  Creating Visual Representations for Continuous Data  40
  Exploring the Data  44

Chapter 6 Graphs: Scatterplot, Histogram, Barchart  48
  Scatterplots  48
  Histograms  49
  Barcharts  51

Chapter 7 Relationship Analysis with Chi-Square  54
  Chi-Square Analysis (Categorical Differences)  54
  Using the Chi-Square Function in PSPP  54
  Interpreting Output Tables: Chi Square  57
  Chi-Square Crosstabs Table Analysis  60

Chapter 8 Relationship Analysis with t-Test  68
  t-Test Analysis (Continuous Differences, two groups)  68
  One Sample t-Test using PSPP  69
  Independent Samples t-Test using PSPP  71
  Paired Samples t-Test using PSPP  78

Chapter 9 Relationship Analysis with ANOVA  81
  Analysis of Variance (ANOVA)  81
  One-Way ANOVA  81
  Interpreting Output Tables: One-Way ANOVA  84
  Introduction to Planned Contrasts  87
  Conducting One-Way ANOVA with Planned Contrasts  87
  ANOVA with Planned Contrasts for Orthogonal Polynomial Trends  99
  Analyzing the ANOVA Output Tables for Orthogonal Polynomial Trends  102

Chapter 10 Univariate Analysis: General Linear Model (GLM)  104
  Using Univariate Analysis for the General Linear Model (GLM)  105
  Using Univariate Analysis for Two-Way (Factorial) ANOVA  107

Chapter 11 Associations with Correlation  110
  Correlation Analysis with PSPP  110

Chapter 12 Associations with Regression (Linear)  114
  Regression Analysis with PSPP  114
  Interpreting Output Tables: Regression  117

Chapter 13 Associations with Regression (Binomial Logistic)  119
  A Simple Binomial Logistic Regression Example with Categorical Data  121
  A Simple Binomial Logistic Regression Example with Continuous Data  124

Chapter 14 Reliability  127
  Reliability Using PSPP for Agreement  128
  Reliability Using PSPP for Accuracy  130

Chapter 15 Factor Analysis  132
  What is Factor Analysis?  132
  Determining the Number of Factors to Extract  133
  Conducting Factor Analysis with PSPP  136

Chapter 16 Why is Statistics So Confusing?  143
  The Research Process  143
  Exploring the data  144
  The General Linear Model (GLM)  146
  One-Way ANOVA with Confidence Intervals  147
  One-Way ANOVA with Contrasts for Trends  148
  Our Findings from the Data  148

Chapter 17 Concluding Thoughts  149

Resources  152
  Analysis Memos  153
  High School & Beyond Codebook  155
  High School & Beyond Sample Data Set  156
  Reliability for Agreement Codebook  164
  Reliability for Agreement Dataset  165
  Reliability for Accuracy Codebook  166
  Reliability for Accuracy Dataset  167
  Test Scores Codebook  168
  Test Scores Dataset  169
  Effect Size Tables  170
  Box & Whisker Plots Using OpenOffice Spreadsheets  174
  Additional Resources  183

References  185

Index  186


ACKNOWLEDGMENTS

Whether knowingly or unknowingly, those of us using technology owe a great deal to the open source software community. It is through projects such as PSPP, OpenOffice, Linux, and others that useful applications can be freely distributed. The programmers who make up this community of professionals offer their time and effort for nothing more than the ability to share something worthwhile with the rest of us.

THANK YOU.


Chapter 1 An Introduction to the Guide Second Edition

Notes about the statistics guide

So let's get this out of the way right from the start. This is NOT a math book. The PSPP Guide will not contain beautiful, complex statistical equations. It will not explain the formulas and mathematics behind the statistical tests. It will not provide step-by-step mathematical guidance to reproduce the statistical results by hand.

So what is the purpose of this book? I am glad you asked. The purpose of this guide is to assist the novice social science and educational researcher in interpreting statistical output data using the PSPP Statistical Analysis application. Through the examples and guidance, you will be able to select the statistical test that is appropriate for your data, apply the inferential test to your data, and interpret a statistical test's output table.

The Guide covers the uses of some of the most commonly used statistical tests and discusses some of their limitations, i.e. Chi-square, t-Test, ANOVA, Correlation, and Regression (Linear and Binomial). The ANOVA description includes procedures for conducting the One-Way ANOVA with Planned Contrasts so that you may test a specific hypothesis concerning group interactions, as well as the General Linear Model (GLM) for other types of ANOVA analysis. Exploratory Factor Analysis has been included in this guide as a valuable procedure for data reduction. The use of Reliability tests will be discussed as a way to verify the reliability of coding data between researchers.

Statistical tests are designed to handle either parametric or non-parametric data. The differences between these types of data will be discussed in a later chapter. The majority of the tests included in PSPP are designed for parametric data analysis and will be the focus of this book. PSPP also contains a handful of non-parametric data tools, but these will not be discussed here.

The sample window views and output tables shown in this guide were mainly created from PSPP 0.10.x, the graphical user interface version of PSPP called PSPPIRE.

PSPP is officially described as a "replacement" application for IBM's Statistical Package for the Social Sciences (SPSS). However, PSPP does not have any official acronym expansion. The developers of PSPP have offered some suggestions, such as:

• Perfect Statistics Professionally Presented.
• Probabilities Sometimes Prevent Problems.
• People Should Prefer PSPP.

The examples shown in this guide represent a subset of the data obtained in the 1988-2000 High School & Beyond (HS&B) study commissioned by the Center on Education Policy (CEP). The sample dataset contains 200 cases and is intended to provide statistical analysis practice, not to draw any conclusions about the sample population.

Notes about the data

The High School & Beyond study was commissioned by the Center on Education Policy (CEP) and conducted by researcher Harold Wenglinsky. The study was based on the statistical analyses of a nationally representative, longitudinal database of students and schools from the National Educational Longitudinal Study of 1988-2000 (NELS). The study focused on a sample of low-income students from inner-city high schools. The study compared achievement and other education-related outcomes for students in different types of public and private schools, including comprehensive public high schools (the typical model for the traditional high school); public magnet schools and "schools of choice"; various types of Catholic parochial schools and other religious schools; and independent, secular private schools. The High School and Beyond (HS&B) study included two cohorts: the 1980 senior class and the 1980 sophomore class.

The Philosophy Behind This Book and the Open Source Community

This book began as my own attempt to find a practical way to teach introductory statistical analysis to doctoral students in the field of social sciences. The course would often come early in the training of our students, prior to the start of their own data collection or research study. They would purchase a license for one of the major proprietary statistical analysis packages, typically a six-month or one-year license. Unfortunately, by the time they had data to analyze the software license would have expired.

So began my search for an alternative that would be useful in learning basic analysis skills and capable of performing basic statistical analysis tests. This brought me to PSPP, a part of the GNU Project. This open source community has developed a powerful software package that is effective and easy to use. Another key feature of the open source group is that the software is distributed free of charge.

This guide is not intended to be a course on statistics or the mathematics behind statistical analysis. With the advent of statistical analysis applications, anyone with a computer can run statistical analysis on any dataset. The intention of this guide is to provide the novice researcher with a step-by-step guide to using these powerful analysis tools and the confidence to read and interpret output tables in order to guide their own research.


Chapter 2 Overview of Statistical Analysis in Social Science

Why use statistics in Social Science research?

Everyone loves a good story. Rich narratives, interesting characters, and the unfolding of events draw us into the story, anticipating how it ends. Qualitative research methods are well suited to uncover these stories. However, we should not ignore the power and use of quantitative methods.

One of the assumptions made about quantitative research methods is that they merely deal with numbers. And let's face it, to many of us, numbers are quite boring. A well-constructed data table or beautifully drawn graph does not capture the imagination of most readers. But appropriately used quantitative methods can uncover subtle differences, point out inconsistencies, or bring up more questions that beg to be answered. In short, thoughtful quantitative methods can help guide and shape the qualitative story. This union of rich narratives and statistical evidence is at the heart of any good mixed methods study. The researcher uses the data to guide the narrative. Together these methods can reveal a more complete and complex account of the topic.

What is Continuous and Categorical Data?

Within statistical analysis we often talk about data as being either continuous or categorical. This distinction is important since it guides us towards appropriate methods that are used for each kind of data set. Depending on the kind of data you have, there are specific statistical techniques available for you to use with that data.

Mark Twain is often credited with popularizing the description of the three types of lies as "lies, damn lies, and statistics", a phrase he attributed to Benjamin Disraeli (1804-1881). This phrase is still often used in association with our view of statistics. This may be due to the fact that one could manipulate statistical analysis to give whatever outcome is being sought. Poor statistics have also been used to support weak or inconsistent claims. This does not mean that the statistics are at fault, but rather the researcher who used statistical methods inappropriately. As researchers we must take great care in employing proper methods with our data.

Continuous data can be thought of as "scaled data". The numbers may have been obtained from some sort of assessment or from some counting activity. A common example of continuous data is test scores. If we can plot the data with a line graph, then it is probably continuous data. Some examples of continuous data include:

• The time it takes to complete a task
• A student's test scores
• The time of day that someone goes to bed
• The weight or height of a 2nd grade student

All of these examples can be thought of as rational numbers. For those of us who have not been in an Algebra class for a number of years, rational numbers can be represented as fractions, which in turn can be represented as decimals. Rational numbers can also be represented as whole numbers.

A subset of this sort of data can be called discrete data. Discrete data is obtained from counting things. It is represented by whole numbers. Some examples of discrete data include:

• The number of courses a student takes each year
• The number of people living in a household
• The number of languages spoken by someone
• The number of turns taken by an individual


Categorical data is another type of important statistical data, and one that is often used in social science research. As the name implies, categorical data is comprised of categories or the names of things that we are interested in studying. In working with statistical methods we often transform the names of our categories into numbers. An example of this process is when we collect information about the primary languages spoken at home by students in our class. We may have found that the primary home languages are English, Spanish, French, and Cantonese. We may convert this data into numbers for analysis.

Language Spoken   Code
English           1
Spanish           2
French            3
Cantonese         4
Primary Home Language Code Book

In the above example the numbers assigned to the categories do not signify any differences or order in the languages. The numbers used here are "nominal", or used to represent names. Another example of categorical data could be the grade level of a high school student. In this case we may be interested in high school freshmen, sophomores, juniors, and seniors. Assigning a numerical label to these categories may make our analysis simpler.

Grade Level   Code
Freshman      1
Sophomore     2
Junior        3
Senior        4
High School Grade Levels example

In this example the numbers are again just representing the names of the high school levels, however they do have an order. "Freshman" comes prior to "Sophomore", which is prior to "Junior", and then "Senior". This sort of categorical data can be described as ordinal, or representing an order.

What kind of data comes from questionnaires?

Many questionnaires will contain open-ended or free-response questions as well as Likert-scale questions. This latter data source can be very beneficial in social science research and provide the researcher with a wealth of information and data. Likert-scale responses can be used to measure preference or agreement with a statement. But are the responses on a questionnaire categorical or continuous?

The answer to this question is not as easy as it might seem. Different sectors of the field will have different responses to this question. You will find some referring to this data as categorical and others referring to it as continuous. The disagreement seems to stem from our own definition of continuous data.

When we think about continuous data, a key feature is that the numbers have a measured or scaled distance from one another. So when we examine the points along our scale we can find a precise way to describe the difference from one point to the next. For example, the height of students in a classroom could be scaled data. When we describe one student as being 4 feet tall, another as 3 feet tall, and a third as 3.5 feet tall, the differences between those points are measured and precise. We can infer a standard difference between the three.

If we have data such as the one used in the high school level example, the difference between a freshman, a sophomore, and a junior is not as precise. The students at all three of those levels will have various skills and abilities, they will have taken different course sequences, and they have differing credits gained in school. Even within these groups there are still differences in courses, skills, and credits achieved. So there is not a defined scale that makes one group different from the next or provides for differences within the grouping.

So when we have a questionnaire that asks about one's feelings on a specific topic, are the differences between those responses along a measured scale, and do the responses have similarities in meaning? A typical Likert-scale questionnaire might ask a range of questions in which the responses can be "disagree, neither disagree or agree, and agree". We would provide a numbered response scale to make the selections easier for our participants.


Simple Likert-scale direction

Please respond to the following questions using the scale provided:
  Disagree – 1
  Neither disagree or agree – 2
  Agree – 3

Sample Likert Scale Levels

So does this sort of data have a measured, scaled difference between the responses? That is a question each researcher must answer and justify before employing any statistical analysis method.

How do we tell them apart and why is it important?

It is important to recognize the sort of data that is being used in the research analysis process. A researcher should ask:

• Does my data represent information that is continuous (a rational number) or is it categorical (names and labels)?
• Does this data represent test scores or evaluations?
• Does this data represent something that has been counted?
• Is the interval between the data points a regular, measured interval?
• Could this data be represented with a linear graph?
• Does this data represent the names of something?
• Do the data points represent the order of objects?
• Are the data points opinions?
• Are there irregular differences between the data points?

Depending on whether a researcher is using categorical or continuous data, there are specific statistical methods available. Below are some of the most common statistical methods in social science research and the data associated with the method.


Common Statistical Methods

Statistical Method        Representation and Use

Descriptive Statistics
  Normal Distribution     Graphs
  Central Tendencies      Mean, Median, Mode
  Variance                Standard Deviation
  Charts and Graphs       Histogram, Pie Chart, Stem & Leaf Plots, Scatterplots

Inferential Statistics
  Chi-square              Differences or relationships between categorical data
  t-Test                  Differences or relationships between continuous data with two groups
  ANOVA                   Differences or relationships between continuous data with more than 2 groups
  Correlation             Associations between continuous data
  Regression              Modeling associations between continuous data
  Factor Analysis         Factor grouping within categorical or continuous data. Data reduction.


Parametric versus Non-Parametric Data

Our data can also be classified as either parametric or non-parametric. This term refers to the distribution of the data points. Parametric data will have a "normal distribution" that is generally shaped like the typical bell curve. Non-parametric data does not have this normal distribution curve.

Normal Distribution Curve of parametric data

Nonparametric data by Carl Boettiger on Flickr https://www.flickr.com/photos/cboettig/9019872976

Depending on the distribution of your data, various statistical analysis techniques are available to use. Some methods are designed for parametric data while other methods are better suited for non-parametric data distributions.


Sample Statistics based on Data Distribution

                               Parametric                        Non-Parametric
Data Distribution              Normal                            Any
Variance within Data           Homogenous                        Any
Typical Data Type              Continuous (Ratio or Interval)    Continuous or Categorical (Ordinal or Nominal)
Benefits of the data           More powerful, able to draw       Simpler to use
                               stronger conclusions

Statistical Tests
  Correlations                 Pearson                           Spearman
  Relationships with 2 groups  t-Test                            Mann-Whitney or Wilcoxon Test
  Relationships with >2 groups ANOVA                             Kruskal-Wallis or Friedman's Test

In choosing a statistical method we must consider both the character of the data as well as the distribution of our data. The character, or data type, can be described as nominal, ordinal, ratio, or interval. The distribution can be described as parametric or non-parametric. These data features will lead us to selecting the most appropriate statistical method for our analysis of the data. Throughout this guide the examples will come from our sample data set that contains both categorical and continuous data that is generally parametric in nature.

Confidence Intervals (CI)

Much of inferential statistics is based on measurements and manipulations of the means. When we have some sample data, the measures of central tendency, such as mean, median, and mode, are very simple to calculate and compare. When these measures are compared across categories, factors, or groupings we begin to find differences in their means. Often we take sample data to represent some larger population. We can calculate the sample mean with certainty, but the true mean of the population being represented cannot be known for certain through the sample data. This is when confidence intervals come into consideration.


The confidence interval is a calculation of the range of values for the true mean. We can know with a certain amount of "confidence", typically at the 95% confidence level, that the true mean will lie within the specified confidence interval. For example, we may find that the mean of our sample is 52.65 for some measure, and the calculated confidence interval could be from 51.34 to 53.95. Therefore, given that the sample mean is 52.65, we can state with 95% confidence that the true mean lies somewhere between 51.34 and 53.95 for the population.
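For reference (this guide does not work through the math), the 95% confidence interval for a sample mean that most statistical packages report is computed essentially as

CI_{95} = \bar{x} \pm t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}

where \bar{x} is the sample mean, s is the sample standard deviation, n is the sample size, and t is the critical value from the t distribution. The critical value depends on the sample size, which is why two samples with the same mean can have different confidence intervals.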

P-Value

What is a P-value? In statistical analysis the way we measure the significance of any given test is with the P-value. This value indicates the probability of obtaining the same statistical test result by chance. Our calculated p-values are compared against some predetermined significance level. The most common significance levels are the 95% Significance Level, represented by a p-value of 0.05, and the 99% Significance Level, represented by a p-value of 0.01. A significance level of 95%, or 0.05, indicates that we are accepting the risk of being wrong 1 out of every 20 times. A significance level of 99%, or 0.01, indicates that we risk being wrong only 1 out of every 100 times. The most common significance level used in the Social Sciences is 95%, so we are looking for p-values < 0.05 in our test results. However, in statistical analysis we are not looking to prove our test hypothesis with the p-value. We are actually trying to reject the Null Hypothesis.

What is the Null Hypothesis? In statistical testing the results are always comparing two competing hypotheses. The null hypothesis is often the dull, boring hypothesis stating that there is no difference between the test populations or conditions. The null hypothesis tells us that whatever phenomenon we were observing had no or very little impact. On the other hand we have the alternative, or researcher's, hypothesis. This is the hypothesis that we are rooting for, the one that we want to accept in many cases. It is the result we often want to find, since it often indicates that there are differences between populations or conditions, so we can take that next step to explain those differences or examine them more closely.

When we perform a statistical test, the p-value helps determine the significance of the test and the validity of the claim being made. The claim that is always "on trial" here is the null hypothesis. When the p-value is found to be statistically significant, p < 0.05, or highly statistically significant, p < 0.01, then we can conclude that the differences, relationships, or associations found in the observed data are very unlikely to occur if the null hypothesis is actually true. Therefore the researcher can "reject the null hypothesis". If you reject the null hypothesis, then the alternative hypothesis must be accepted. And this is often what we want as researchers.

The only question that the p-value addresses is whether or not the experiment or data provide enough evidence to reasonably reject the null hypothesis. The p-value, or calculated probability, is the estimated probability of rejecting the null hypothesis of a study question when that null hypothesis is actually true. In other words, it measures the probability that you will be wrong in rejecting the null hypothesis. And all of this is decided based on our predetermined significance level, in most cases the 95% level, or p < 0.05.

Let's look at an example. Suppose your school purchases a SAT Prep curriculum in the hopes that this will raise the SAT test scores of your students. Some students are enrolled in the prep course while others are not. At the end of the course all your students take the SAT test and their resulting test scores are compared. In this example our null hypothesis would be that "the SAT prep curriculum had no impact on student test scores". This result would be bad news considering how much time, effort, and money was invested in the test prep. The alternative hypothesis is that the prep curriculum did have an impact on the test scores, and hopefully the impact was to raise those scores. Our predetermined significance level is 95%. After using a statistical test, suppose we find a p-value of 0.02, which is indeed less than 0.05. We can reject the null hypothesis. Now that we have rejected the null hypothesis, the only other option is to accept the alternative hypothesis, specifically that the scores are significantly different. This result does NOT imply a "meaningful" or "important" difference in the data. That conclusion is for you to decide when considering the real-world relevance of your result. So again, statistical analysis is not the end point in research, but a meaningful beginning point to help the researcher identify important and fruitful directions suggested by the data.

It has been suggested that the idea of "rejecting the null hypothesis" has very little meaning for social science research. The null hypothesis always states that there are "no differences" to be found within your data. Can we really find NO DIFFERENCES in the data? Are the results that we find between two groups ever going to be identical to one another? The practical answer to these questions is "No". There will always be differences present in our data. What we are really asking is whether or not those differences have any statistical significance. As we discussed previously, our statistical tests are aimed at producing the p-value that indicates the likelihood of having the differences occur purely by chance. And the significance level of p = 0.05 is just an agreed upon value among many social scientists as the acceptable level to consider as statistically significant. And finding that the differences within the data are statistically significant may just be a function of having a large enough sample size to make those differences appear meaningful.

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. P-values very close to the cutoff (0.05) are considered to be marginal, so you could go either way. But keep in mind that the choice of significance levels is arbitrary. We have selected a significance level of 95% because of the conventions used in most Social Science research. I could have easily selected a significance level of 80%, but then no one would take my results very seriously.

Relying on the p-value alone can give you a false sense of security. The p-value is also very sensitive to sample size. If a given sample size yields a p-value that is close to the significance level, increasing the sample size can often shift the p-value in a favorable direction, i.e. make the resulting value smaller. So how can we use p-values and still have a sense of the magnitude of the differences? This is where Effect Size can help.

Effect Size

Whereas statistical tests of significance tell us the likelihood that experimental results differ from chance expectations, effect-size measurements tell us the relative magnitude of those differences found within the data. Effect sizes are especially important because they allow us to compare the magnitude of results from one population or sample to the next. Effect size is not as sensitive to sample size since it relies on the standard deviation in the calculations. Effect size also allows us to move beyond the simple questions of "does this work or not?" or "is there a difference or not?" and ask the question "how well does some intervention work within the given context?"

Let's take a look at an example that could happen, and has happened, to many of us when conducting statistical analysis. Suppose we compare two data sets; perhaps we are looking at SAT assessment scores between a group of students who enrolled in a SAT prep course and another group of students who did not enroll in the prep course. Suppose that the statistical test revealed a p-value of 0.043. We should be quite pleased, since this value would be below our significance level of 0.05 and we could report that a statistical difference exists between the group of test takers enrolled in the prep course and those who were not enrolled in the course. But what if the calculated p-value was 0.057? Does this mean that the prep course is any less effective?

So here is the bottom line. The p-value calculation will help us decide if a difference or association has some significance that should be explored further. The effect size will give us a sense of the magnitude of any differences to help us decide if those differences have any practical meaning and are worth exploring. So both the p-value and the effect size can be used to assist the researcher in making meaningful judgments about the differences found within our data.

Effect Size Calculations

Currently PSPP does not contain effect size functions for all of the statistical analyses. At the time of this book, with PSPP version 0.10.x, effect size is available for Chi Square (Crosstabs) analysis, Correlations, and Regressions. For complete effect size calculations see the appendix.

Determining the Magnitude of Effect Size

Once we have calculated the effect size value we must determine if this value represents a small, medium, or large effect. Jacob Cohen (1988) suggested various effect size calculations and magnitudes in his text Statistical Power Analysis for the Behavioral Sciences. The values in the effect size magnitude chart can be thought of as a range of values, with the numbers in each column representing the midpoint of that particular range. For example, the effect size chart for Phi suggests a small, medium, and large effect size for the values of 0.1, 0.3, and 0.5 respectively. We could think of these as ranges, with the small effect for Phi ranging from 0.0 to approximately 0.2, the medium effect size ranging from approximately 0.2 to 0.4, and the large effect size ranging from approximately 0.4 and higher.


Suggested Effect Size Magnitude Chart

Effect Size Calculation   Statistics Test                         Small Effect   Medium Effect   Large Effect
Phi or Cramer's Phi       Chi Squared                             0.1            0.3             0.5
Cohen's d                 t-Test (Paired & Independent)           0.2            0.5             0.8
Eta Squared               ANOVA                                   0.01           0.06            0.14
r                         Correlation                             0.1            0.3             0.5
r squared                 Correlation and t-Test (Independent)    0.01           0.09            0.25

Values from Cohen (1988) Statistical Power Analysis for the Behavioral Sciences
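For reference only (the complete calculations appear in the appendix of this guide), the formulas most commonly given for the effect sizes in the chart are

\phi = \sqrt{\chi^{2}/N} \qquad d = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{pooled}} \qquad \eta^{2} = \frac{SS_{between}}{SS_{total}}

where N is the total number of cases in the chi-square table, s_{pooled} is the pooled standard deviation of the two groups being compared, and the SS terms are the between-groups and total sums of squares from the ANOVA output.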

The importance of effect size can be best summed up by Gene Glass, as cited in Kline's Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research (Washington, DC: American Psychological Association, 2004, p. 95):

  "Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them."
  -Gene V. Glass


Chapter 3 The PSPP Statistical Analysis Environment

What is PSPP?

PSPP is a program for the statistical analysis of sampled data. It is particularly suited to the analysis and manipulation of very large data sets. The PSPP online guide states that in addition to statistical hypothesis tests such as t-Tests, analysis of variance, and non-parametric tests, PSPP can also perform linear regressions and is a very powerful tool for recoding and sorting of data and for calculating metrics such as skewness and kurtosis.

PSPP website: http://www.gnu.org/software/pspp/pspp.html


Compatibility

The official PSPP documentation states that PSPP is designed as a free replacement for SPSS. That is to say, it behaves as experienced SPSS users would expect, and their system files and syntax files can be used in PSPP with little or no modification and will produce similar results.

PSPP supports numeric variables and string variables up to 32,767 bytes long. Variable names may be up to 255 bytes in length; this means that variable names must be less than 64 characters. There are no artificial limits on the number of variables or cases. The free PSPP application is not limited in any way.

Graphic User Interface (GUI)

Users familiar with other software may prefer the graphic user interface, which allows you to define data without needing to become familiar with the PSPP syntax. Data can be entered from the keyboard, imported from spreadsheet applications, or imported from existing files. There is a spreadsheet-type data entry window for the entry and viewing of data and its metadata.

PSPP Data View Window


Drop-down menus provide access to all the supported statistical analyses and transformations, in addition to operations such as loading and saving of the data and syntax files. You can use the features via interactive dialog boxes that indicate the options and required parameters of each command. The drop-down menus and dialogs are useful for many analyses. PSPP also supports the syntax mechanism, providing a more powerful and flexible means of controlling PSPP. However, for the novice researcher, or most researchers for that matter, the Graphic User Interface (GUI) will be a more familiar and comfortable workspace, similar to many other applications that we use every day.

Output Window

There is also a non-interactive output window. The output window is generated when the user conducts any of the analysis or visualization functions. Each successive operation is appended to the output window. To switch between the data window and the output table window use the "Windows" menu.

Sample PSPP Output Window


Data Visualization

PSPP can generate plots or graphs to help with the visualization of the data distribution. Among the types of plots that can be displayed are pie charts, normal probability plots, and histograms. These complement descriptive statistics and help determine the most appropriate type of analysis for the data. The data selection capabilities of PSPP make it simple to generate plots from a subset of variables or data.

Sample PSPP Pie Chart

Sample PSPP Histogram


Plots and graphs created by PSPP are formatted in standard file formats such as postscript or PNG, so as to allow easy export and import into reports or other documents. The output tables can be exported as .png, .html, .txt, .jpg, and several other format types.


Chapter 4 Getting Started with PSPP

Preparing the Data and Making Decisions

Now that we have data collected from our study it is time to perform some analysis to address the research questions posed. One of the first choices made has to do with entering your data into an application so that you can actually do the analysis. Depending on the kind of data you have collected there are many choices available. We will focus on using continuous and categorical data sets with the PSPP statistical analysis application.

Creating Your Variable Codebook

Whether you plan to perform data entry into a spreadsheet first or to enter the data directly into PSPP, you will need to create a codebook for your data. The codebook is used as a planning tool and a quick reference guide to your data. There are some questions that must be addressed prior to data entry.


Continuous Data
• How large is your data set?
• Will all the data be manually entered into the spreadsheet or PSPP?
• How many decimal places are required for your data?
• How will you "name" the data for easy reference?
• Are there any outliers in the data?
• How will you handle outliers?

Categorical Data
• What are the value names for each data item?
• How will you represent each value name with an integer value?
• Is your data nominal or ordinal? How will this guide the decision for selecting values?

Data Types and Analysis Methods
• Text data (interviews, open ended questionnaires, field notes, focus group transcripts, writing samples, etc.): Word processing application (Word or OpenOffice), HyperResearch
• Video or audio recording data: HyperResearch, Hypertranscribe, Quicktime, Transana, InqScribe, StudioCode
• Categorical and continuous data: Microsoft Excel, OpenOffice Spreadsheets, PSPP, SPSS, SAS

With any categorical data, we begin by converting the labels, such as male/female or private school/public school, into numerical values that can be manipulated in the analysis application. In the case of our "High School and Beyond" (hsb2.sav) sample dataset, the codebook is represented in the table below.


CODEBOOK for High School and Beyond Data:

Variable Name   Variable Type   Variable Label         Variable Value
gender          Categorical     Gender                 0=male, 1=female
race            Categorical     Race                   1=Hispanic, 2=Asian, 3=African American, 4=White
ses             Categorical     Socioeconomic status   1=low, 2=middle, 3=high
schtyp          Categorical     School type            1=public, 2=private
prog            Categorical     Program type           1=general, 2=academic, 3=vocational
read            Continuous      Reading score
write           Continuous      Writing score
math            Continuous      Math score
science         Continuous      Science score
socst           Continuous      Social Studies score

Some of the data in our sample set are nominal in nature. There is not any order to the labels and an order should not be implied from the values used. For example, the value for socioeconomic status has been listed “low, middle, and high” with values of “1, 2, and 3” assigned respectively. This does not imply the “low” is first, “middle” is second, and “high” is third. These labels could have been placed in any order and assigned any value. On the other hand if our sample data had involved grade level or degrees attained, then we might be able to assign values based on an order, so this would represent ordinal data.


Sample CODEBOOK for Schooling Data:

Variable Name   Variable Type   Variable Label    Variable Value
id              Categorical     Student ID
gender          Categorical     Gender            0=male, 1=female
grade           Categorical     Grade Level       1=First grade, 2=Second grade, 3=Third grade, 4=Fourth grade
degree          Categorical     Degree Attained   1=Not a High School Graduate, 2=High School Diploma, 3=Bachelors degree, 4=Graduate degree

Creating Variable/Data Names in PSPP

When PSPP is launched, the user is presented with a blank table. At the bottom of the table there are two choices: Data View and Variable View. Data View allows you to enter data or to see the data that has previously been entered. Variable View allows you to define your variables for the data.

PSPP Data View Window


PSPP Variable View Window

Before data can be entered into PSPP you must define your variables. This is where your data codebook is useful, since we have already defined the types of data and the values that will be needed for our analysis.

• In what order should the variables be entered?
• How will you group the values so that they make sense and allow you to use them more efficiently?

In the PSPP environment, the variable window presents each variable in a row. Each column will define a specific aspect of that variable, such as the name of the variable, the data type for the variable (typically deciding if it is a string or numerical), the width or number of digits to show, the decimal places required, etc. String variables cannot be used in any of PSPP's calculations. Any variable that will be used in a statistical calculation must be Numeric.

Variable Type Dialogue Box


Once a variable name has been entered the other columns are activated so that you can define that variable more completely. The data view window then displays each variable or data type as a column, and each row will be a case. The case is the finest level of detail available for the analysis. In the hsb2 data set each case represents a single participant in the study.

Defining Continuous Data

In defining continuous data, enter the short name of your variable, such as math (to represent a math score). For the data type, click on the popup window icon (shown as a rectangle in the cell) and select "numeric" for a continuous data entry. Within the popup box you can also select the maximum length of the value and the decimal places required. Once the variable has been defined, you can change or edit these values by clicking in the cell to activate the selection arrows. Click on the up or down arrows to increase or decrease the value.

Variable View Defining Decimal Places

Defining Categorical Data

When a categorical value is entered within PSPP, the "value" must also be defined. The value is the numerical integer that will be used to represent the data in PSPP and allow for the application to perform statistical analysis with that data.


When the codebook was created we decided beforehand the values that would be assigned for each categorical data type. As shown previously for continuous data entry, once the variable name has been entered the other cells will become active so that they can be defined as well. Be sure to select “numeric” from the “variable type” option in the popup menu. The two basic types of variables that you will use are numeric and string. Numeric variables may only have numbers assigned. String variables may contain letters or numbers, but even if a string variable happens to contain only numbers, statistical operations on that variable will not be allowed. Use the “string” data if you plan to enter information that will not be used in your statistical analysis. Even if you have categorical data that will be analyzed, we are representing that data with numbers and assigning a “label” to the number.


Dialogue Box to Define the Variable Type

To define the values to be used for our categorical data, click the rectangle in the "Value" cell to show the popup menu (see the figure below). In this menu enter the value name, such as male or female, and then the value to be assigned to each name. The value names must be entered one at a time and the "+Add" button clicked. Once all the values are entered you can click "OK" to close the menu. Be sure to enter the correct numerical representation into the "Value" box and the label that is to be used in the "Value label" box. In addition to "Naming" the variable, defining it as Numeric data, and including "Labels" for the data if it is categorical, we must also define the data's "Measure" as Scale for continuous data or as either Nominal or Ordinal for categorical data.



Dialogue Box to Define Values
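For readers curious about the syntax mechanism mentioned in Chapter 3, the same definitions can also be written as commands. The sketch below uses values from the High School and Beyond codebook; VARIABLE LABELS, VALUE LABELS, and VARIABLE LEVEL are standard PSPP commands, but check the PSPP manual for the exact options supported by your version.

* Attach descriptive labels to variables that already exist in the file.
VARIABLE LABELS
  gender 'Gender'
  ses    'Socioeconomic status'
  math   'Math score'.

* Attach value labels to the categorical codes from the codebook.
VALUE LABELS
  gender 0 'male' 1 'female'
  /ses   1 'low' 2 'middle' 3 'high'.

* Mark the categorical variables as nominal and the test score as scale.
VARIABLE LEVEL gender ses (NOMINAL) / math (SCALE).

EXECUTE.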

Entering Data Directly into PSPP

The data can be entered directly into PSPP in the Data View screen. Each row will be one of the cases for which you have collected data. When entering categorical data be sure to enter the values that you have defined in the codebook and not the name or label for the data. After the data has been entered, use the "Value Label" button along the top menu bar to switch the view between variable values and variable labels. This will allow you to switch between the numeric values of your data and the labels describing that data (see the figures below). This is an important note worth repeating: when entering categorical values directly into PSPP, you must enter the numerical value and NOT the name or the label used. Having your data codebook handy will help with data entry.


The "Value Label" button switches the view between data values and data labels or names

Data View Window Displayed with Data Value

Opening Data Files with PSPP (.sav files)

The PSPP statistical application is able to natively use and save .sav file formats. This means that files created within SPSS, or any other compatible application, can be easily opened with PSPP without having to convert the file. Using File > Open from the top menu, select the .sav data file from your computer.

PSPP Open Command


PSPP Open Dialogue Box

The file will open with the variable names and data.

PSPP Data View window with Data

PSPP can also open ".txt" files that contain delimited text, ".sps" syntax files that are also created by SPSS, and ".por" files, which are SPSS portability files used to exchange data.
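The same operation is available through syntax as well; a minimal sketch (the file name is simply a placeholder for your own data file):

* Open an existing .sav system file.
GET FILE='hsb2.sav'.

* List the first five cases to confirm that the data loaded as expected.
LIST /CASES=FROM 1 TO 5.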


Importing Data Files into PSPP from a Spreadsheet (.ods files)

You may find that entering your data into a spreadsheet has some advantages for you over entering it directly into the PSPP table. It may also be the case that your data is already contained in a spreadsheet. PSPP is able to import this data, however it cannot import directly from every spreadsheet file format. PSPP is set up to import data from open source spreadsheets, mainly created by OpenOffice, Gnumeric, and LibreOffice, in the Open Document Spreadsheet (.ods) file format.

Setting Up the Spreadsheet

Getting the spreadsheet set up and the data entered is a simple process. There are a few key points to keep in mind:

• In the spreadsheet, the columns contain each variable or data type and the rows represent each case in the study. This is similar to the way PSPP displays the data.
• The first row of the spreadsheet will be the variable names, with row 2 containing the first data set. Variable names should be short and must be less than 64 characters long.
• Categorical data must be entered as its numerical value and not the name. The codebook you created will come in handy for this process.
• Enter all the data.

OpenOffice Spreadsheet with data labels shown as numerical values rather than names


Saving the File Using OpenOffice

Prepare this data file to import into PSPP by saving the file as an ".ods" file. Use the menu at the top of the screen to select File > Save or Save As. From the pull-down menu select the Open Document Spreadsheet (.ods) format.

Importing Data Into PSPP

Now we are ready to import the spreadsheet data into PSPP. From the file menu at the top of the window select File > Import Data. A dialogue box will appear for you to select the file to import into PSPP. Navigate the dialogue box to select the file containing your data.

PSPP Import menu

Once the file begins to import, there will be several popup windows to help format the data so that PSPP can use it. In the first screen be sure that “All Cases” is selected. If you entered the variable names in row 1 on the spreadsheet be sure to check the box “Use first row as variable names”.

PSPP Importing Data

In the next dialogue box you will be presented with the "Variables" window and the "Data Preview" window. The Variables window will allow you to define the variable Types, such as either Numeric or String, along with Labels for the Categorical data, and the Measure column to indicate if the data is Scale, Nominal, or Ordinal.

PSPP Select Data to Import

The default setting in PSPP is for every variable to be labeled a numeric value without any value labels, or names. For all the categorical data we must be sure to also include a value label for these numeric values. Again, this is where we will use the codebook to match numeric values to each label used for the categorical data. You can choose to define all the variables during the import process or simply allow PSPP to import the data and then go into the “Variable View” window of PSPP and adjust all the variable definitions to match your codebook.

Variable View window in PSPP

Using your data codebook as a reference, enter each variable value and the value label. Be sure to use the "+Add" button to add the value and name, clicking "OK" when completed. Use your codebook to enter the data's numeric value in the "Value" box and the label or name of the data in the "Value Label" box. In the example we have entered a value of "0" for the label "males" and a value of "1" for the label "females". After each value and label is entered, click the "Add" button to create the label. This will allow you to view the category names instead of the assigned numerical values.


Value Label Dialogue Box

The imported data will appear in the data view window of PSPP. To switch between the label names and the label values use the “Value Label” button at the top of the screen.

Switching between Numeric and Label views
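If you prefer to script the import, PSPP also provides a GET DATA command for Open Document spreadsheets. This is only a sketch; the file name is a placeholder and the available subcommands can differ between PSPP versions, so consult the PSPP manual before relying on it.

* Import an OpenOffice/LibreOffice spreadsheet, reading variable names from row 1.
GET DATA
  /TYPE=ODS
  /FILE='scores.ods'
  /READNAMES=ON.

* Review the imported variables and their definitions.
DISPLAY DICTIONARY.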


Chapter 5 Descriptive Statistics

What are descriptive statistics?

Descriptive statistics is a method to characterize the data in order to make decisions about the nature, tendencies, and appropriate inferential statistics that can be used to analyze the data. In descriptive statistics we look at various measures such as mean, median, mode, and standard deviations.

Creating Descriptive Statistics in PSPP for Categorical Data

Categorical data is best described by exploring the frequencies within the data. The frequencies will display the percentages of each category within the data set.


Use the PSPP menu Analyze > Descriptive Statistics > Frequencies

PSPP Descriptive Statistics Frequencies Menu

From the dialogue box select the categorical data needed and move them into the “Variable(s)” window. None of the “Statistics” are required for this sort of data. Click “OK”. In order to select the variables, click and highlight the variable, such as gender, SES, etc., from the window on the left side of the dialogue window and click the arrow to move it into the “Variable(s)” window. More than one variable can be entered into the Variables window.


Frequencies Selection Window


The Output window will display a frequencies and percentages table for the selected categorical data.

Frequencies Output Table
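The equivalent syntax command is short; a minimal sketch using two categorical variables from the sample codebook:

* Frequency and percentage tables for two categorical variables.
FREQUENCIES /VARIABLES=gender ses.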

Creating Visual Representations for Categorical Data

Visual representations such as pie charts and bar graphs can be a valuable tool in analyzing a data set. The chart can display features of the data set that may not be as evident in the numerical representation. For categorical data, pie charts may be the most useful visualization. Creating either type of chart within PSPP is accomplished by using the menu Analyze > Descriptive Statistics > Frequencies. In the dialogue box click on the "Charts" button.

Frequencies Charts


In the Charts dialogue box select the pie chart checkbox for categorical data.

Creating Pie Charts and Sample Pie Chart
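In syntax form the pie chart can be requested on the same command; a sketch (the PIECHART subcommand is documented in the PSPP manual, but verify it against your version):

* Frequency table plus a pie chart for school type.
FREQUENCIES /VARIABLES=schtyp
  /PIECHART.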

Creating Descriptive Statistics in PSPP for Continuous Data

Continuous data lends itself to descriptive statistics that focus on variation and measures of central tendency. Mean, median, mode, range, maximum and minimum values, standard deviation, etc. can provide valuable information about the continuous data. Using the PSPP menu select Analyze > Descriptive Statistics > Descriptives.

Descriptive Menu Dialogue Window

In the dialogue box select the checkboxes for the descriptive statistics to be displayed in the output table. In this example we have selected mean, median, minimum, maximum, range, standard deviation, and variance.


Available Descriptive Statistics

In the output window the descriptive statistics table is generated.

Descriptives Output Table
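The matching syntax, as a minimal sketch with the reading and math scores from the sample dataset:

* Descriptive statistics for the continuous test score variables.
DESCRIPTIVES /VARIABLES=read math
  /STATISTICS=MEAN STDDEV VARIANCE RANGE MINIMUM MAXIMUM.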

Creating Visual Representations for Continuous Data

Visual representations such as histograms can be a valuable tool in analyzing a continuous data set. The chart can display features of the data set that may not be as evident otherwise. Creating a histogram within PSPP is accomplished by using the menu Analyze > Descriptive Statistics > Frequencies. In the dialogue box click on the "Charts" button after the data has been entered into the Variables window.


Frequencies Window

In the Charts dialogue box select the histogram chart. A histogram works best for displaying continuous data. By selecting “Superimpose normal curve” you will be able to inspect your data for a normal distribution.

Chart Dialogue Box with Sample Histogram
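In syntax form the histogram with a superimposed normal curve can be requested through FREQUENCIES. This sketch uses math as a placeholder variable name; the NORMAL keyword asks for the overlaid curve, though the exact subcommand options may vary by version.

  FREQUENCIES /VARIABLES=math /HISTOGRAM=NORMAL.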


Interpreting Output Tables: Descriptives
In this example we have a sample descriptive output table for both the Reading Test scores and the Math Test scores from our data set. The table was created by using the Analyze > Descriptive Statistics > Frequencies menu. Both Math and Reading scores were moved to the "Variable(s)" window, and in the Charts dialogue the histogram with the "Superimpose normal curve" checkbox was selected. Each variable selected is shown in a row of the output table.

Sample Frequencies output table for math & reading scores

1. The "Mean" gives us the average score in the data. This output table shows the mean score for both the math and reading tests. We also find that the means of the two tests are almost the same.
2. The "Minimum" and "Maximum" columns give the lowest and highest score for each test. Here we notice that the maximum was about the same on each test and that the Reading test had a slightly lower minimum score.
3. "Standard Deviation" (Std Dev column) gives a measure of the variation of the test scores around the mean. A score can be described as falling within 1, 2, or 3 standard deviations of the mean. Standard deviations follow the 68-95-99.7 rule in statistics: about 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and about 99.7% falls within three standard deviations.


Standard Deviation Diagram

In reviewing the math and reading scores, the standard deviation for the math scores is smaller than the standard deviation for the reading scores; that is, there is less variance in the math scores, which is also evident by comparing the values in the "Variance" column of the output table. We would find that 68% of the math scores are between 43.28 and 62.02. In comparison, 68% of the reading scores are between 41.98 and 62.48 (a quick arithmetic check of these intervals appears after this list).
4. Kurtosis describes the "peakedness" of the data. A kurtosis value of zero represents data that resembles a normally distributed data set. Positive values represent data with a leptokurtic distribution, or very high peaks, and negative values represent data with a platykurtic distribution, or one that is flatter. In this example we find that both data sets have a slightly negative kurtosis, indicating some "flatness" to the histogram.
5. Skewness gives us information about the distribution of data around the mean. A skewness value of zero indicates data evenly distributed and balanced around the mean. A positive skewness value indicates data weighted more heavily to the right of the mean, and a negative skewness value indicates data weighted to the left of the mean. In this example both sets of scores have a slightly positive skewness value.
6. A histogram of the data can also reveal important features about the scores prior to conducting any inferential analysis.
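As a quick arithmetic check of those intervals: the math range of 43.28 to 62.02 implies a mean of about 52.65 (the midpoint of the interval) and a standard deviation of about 9.37 (half its width), since 52.65 - 9.37 = 43.28 and 52.65 + 9.37 = 62.02. The reading interval of 41.98 to 62.48 can be checked the same way, giving a mean of about 52.23 and a standard deviation of about 10.25.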


Exploring the Data
Another important step in the data analysis process is to explore your data as a way to visualize any trends, oddities, patterns, or clues to appropriate inferential statistics. The exploration process begins with the "Explore" command: Analyze > Descriptive Statistics > Explore.

Explore Menu Command

In the Explore dialogue window we have the ability to select dependent variables along with any factors to consider in our exploration. In the example below the researcher is interested in the math scores and in how these scores differ among the school program types; in this case academic, general, and vocational programs.

Initial Dialogue window to select Dependent List and Factors

With the “Statistics” button in the Dialogue window we can select additional measures to explore the data further.


Explore Statistics Check Box window
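For syntax users, the Explore dialogue corresponds to the EXAMINE command. The sketch below assumes the dependent variable is named math and the program-type factor is named prog; both names are placeholders for your own codebook, and the available STATISTICS keywords may vary by version.

  EXAMINE VARIABLES=math BY prog
    /STATISTICS=DESCRIPTIVES EXTREME.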

Any number of dependent variables can be entered into the Dependent List window, and any number of factors can be entered into the Factor List window. The more items entered into these selection windows, the longer the output becomes, since each dependent variable is explored by each factor selected. The first set of output tables produced by the Explore command contains important descriptive statistics for the selected dependent variable. These descriptive data points include the five highest and lowest values, percentile points, measures of central tendency, standard deviation, skewness, kurtosis, and so on.

Output Tables for Dependent Variable

The second set of output tables produced by the Explore command contains the same descriptive statistics for the dependent variable, but separated into the factor categories that you chose. The five highest and lowest values, percentile points, measures of central tendency, standard deviation, skewness, kurtosis, and so on are reported for each factor.


Summary Output Table for Dependent Variable by Factor

When we examine the output tables for the dependent variable separated by the factor, the Extreme Values table now shows the five highest and lowest values within each factor. In our example the math scores are separated by program type.


We can also compare the percentiles across the selected factors or categories, and we can examine the descriptive statistics within each category.

Percentiles Output Table for Dependent Variable by Factor

In the exploration of the data we can now compare each factor with the descriptive statistics for the overall dependent variable.

Descriptive Output Table for Dependent Variable by Factor

An exploration of the data will give us insights into the values, associations, and relationships that may be present within the data. We can begin to consider the appropriate inferential statistics that may provide deeper insights into the data.


Pages not included in preview

Chapter 7 Relationship Analysis with Chi-Square

Chi-Square Analysis (Categorical Differences)
The Chi-Square (or Crosstabs) analysis method is used to find differences between categorical data items. The PSPP output includes a Crosstabs table that can be compared across the rows and down the columns for differences in the data.

Using the Chi-Square Function in PSPP
Using the menu, select Analyze > Descriptive Statistics > Crosstabs

PSPP Chi Square (Crosstabs) Function


In the dialog box select one of the categorical variables for the row and another categorical variable for the column.


Using the “Statistics” button in the dialogue window select the Chi square test (Chisq) as well as Phi to calculate the effect size.


Chi Square Selection Window
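The same analysis can be run from the syntax editor with the CROSSTABS command. This sketch assumes the two categorical variables are named race and prog, placeholders for the names in your own codebook.

  CROSSTABS /TABLES=race BY prog
    /STATISTICS=CHISQ PHI.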

In the output window review the "Chi-square tests" output table and find the Pearson Chi-Square value. The significance level of those differences is shown in the significance column (Asymp. Sig.). In the Symmetrical Measures table we will find both Phi and Cramer's V, which are measures of effect size.


Chi Square Output Table

We can also find the calculated p-value for the Chi-Square test in the "Chi-square tests" output table in the column labeled "Asymp. Sig. (2-tailed)". This column displays the calculated two-tailed p-value for the test. In our example above the calculated p-value is 0.04, which is less than our significance level of 0.05 (but not by much). It should be noted that the PSPP output tables will only show calculated p-values to two decimal places, so any calculated value that appears as ".00" should be reported as "p < 0.01".
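If you want to verify the effect size by hand, the Cramer's V reported in the Symmetrical Measures table follows the standard formula V = sqrt( chi-square / (N x (k - 1)) ), where chi-square is the Pearson Chi-Square value, N is the number of valid cases, and k is the smaller of the number of rows and columns in the crosstab. For a 2 x 2 table this reduces to Phi.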


Interpreting Output Tables: Chi Square
In this example we have two Chi-Square output tables, also called Crosstabs. The output table on the left asks whether there are differences in Program Type enrollment (Academic, General, and Vocational) based on Race. The output table on the right asks whether there are differences in Program Type enrollment based on socioeconomic status, or SES level (low, middle, and high).

Chi Square Output Tables

1. In examining the Chi-Square output we are interested in the "Chi-square tests" table. The values we use can be found in the first row of the table, labeled "Pearson Chi-Square".
2. The second and third columns of the Chi-square tests output table contain the Chi-square value and the degrees of freedom (df). We can determine that the differences in program enrollment by race are not significant, since the p-value equals 0.56, as shown in the "Asymp. Sig." column of the PSPP output table.


Statistic                        Value    df    Asymp. Sig. (2-tailed)
Pearson Chi-Square                4.86     6    .56
Likelihood Ratio                  4.91     6    .55
Linear-by-Linear Association       .50     1    .48
N of Valid Cases                   200

Chi-square tests Output for Race X Program Type

3. In the table for the Chi-Square analysis of SES and Program Type, the "Asymp. Sig. (2-tailed)" column gives us the calculated p-value. A value of less than 0.05 can be considered significant. In this example the differences based on SES levels are significant.

Pages not included in preview

One Sample t-Test using PSPP
Using the menu, select Analyze > Compare Means > T-Test. Be sure to select the correct t-Test for your data set. In this case we are using the One Sample t-Test. With the One Sample t-Test we investigate the differences between one group's performance on a measure compared to some known average on the SAME measure.

PSPP One Sample t-Test Menu

When using the One Sample t-Test select the test variable. Be sure to move the measure to be compared into the test variable window.

One Sample t-Test window

We will also enter the known average for this measure. The known average would come from a source outside of your own data collection, such as a norm referenced test or some national assessment measure that has a published average for the entire test group.


One Sample t-Test with known average of “48”
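The equivalent syntax is sketched below; read is a placeholder name for the reading score variable, and TESTVAL supplies the known average being tested against.

  T-TEST /TESTVAL=48 /VARIABLES=read.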

The resulting output table shows the significance level (p-value), along with the mean difference and the confidence interval for the mean difference. In this case we find that the difference between our measure and the known mean is statistically significant.

Output table for Reading score compared to known mean of 48

The table also reveals that the mean difference between our sample and the known mean is 4.23 points, with a 95% Confidence Interval for that difference of 2.80 to 5.66 points above the known mean.


Independent Samples t-Test using PSPP
Using the menu, select Analyze > Compare Means > T-Test. Be sure to select the correct t-Test for your data set. In this case we are using the Independent Samples t-Test. With Independent Samples we investigate the differences between TWO groups on the SAME measure.

PSPP Independent Samples t-Test Menu

When using the Independent Samples t-Test select the test variable and the groups. Be sure to define the two groups from within the variable. Click “Define Groups” and enter the codes used for the two groups to be compared. Check your codebook to be sure you are using the correct values for your comparison groups.

Independent t-Test Sample Window (select a continuous variable as the test variable and a categorical variable for the groups)

Define the two groups in your categorical data set that will be compared. In this example we have selected SES and will compare groups 1 and 3, which in the codebook correspond to "low" SES students and "high" SES students, respectively.


Defining Groups for t-Test (click the "Define Groups" button and enter the group values, or use the pull-down menu)
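In syntax form the comparison described above looks roughly like the sketch below. The names ses and read are placeholders for the grouping and test variables in your own codebook, and the values 1 and 3 are the group codes being compared (low and high SES in this example).

  T-TEST /GROUPS=ses(1,3) /VARIABLES=read.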

Determine whether the variances of the two groups differ significantly. When conducting an Independent Samples t-Test, we use Levene's Test for Equality of Variances to determine if the difference in variation is significant. The Levene's Test value directs us to which line of the output table to use in our analysis. If Levene's Test is not significant (the "Sig." value is greater than .05), then equal variances can be assumed; the two variances are not significantly different and we read the values in the first line of the table. If Levene's Test is significant (the "Sig." value is less than .05), then equal variances are not assumed; the two variances are significantly different and we read the values from the second line of the table. Levene's Test is simply a measure of how much variation is present in the data set. Data with a lot of variation calls for one version of the test, while data with very little variation can use the other; PSPP handles this for us and produces both results. This is one of the few times a researcher hopes for a result that is not significant, because a non-significant result indicates that the data set has roughly equal variances.


t-Test Output Window showing Levene's Test

In reviewing the Independent Samples output table be sure to use the row indicated by the Levene's test. The first row, "Equal variances assumed", is used when the value for Levene's test is NOT significant. The second row, "Equal variances not assumed", is used when the value for Levene's test IS significant. In this example the Levene's Test significance is 0.074, higher than 0.05, therefore we would look at the results from the "Equal variances assumed" row of the output table.

t-Test output table for differences on reading scores by low & high SES

The t-Test output tables, as shown in the Independent Samples Test table above, also display the calculated p-value for the test in the column labeled "Sig. (2-tailed)". We can determine from this column whether the differences are statistically significant. In this example we find that the difference in mean reading scores between the high and low SES students is statistically significant, with a p-value less than 0.001. In the output table the p-value is shown as ".000", therefore we report this value as "p < 0.001". We also notice that the 95% Confidence Interval for the actual difference in the mean scores of these two groups is between 4.25 and 12.20 points.


Interpreting Output Tables: Independent Samples t-Test
In this example we have an Independent Samples t-Test output table used to determine whether there are differences in Science Test scores between the low SES level students and the high SES level students.

t-Test output table for science scores & SES

1. We begin by looking at the "Levene's Test for Equality of Variances". If the significance value (the "Sig." column of the table) is greater than 0.05, meaning that the differences in variance are not significant, then we use the first row of the output table. If the value is less than 0.05, meaning that the differences in variance are significant, then we use the second row of the output table.

                                              Levene's Test for Equality of Variances
                                              F        Sig.      t         df        Sig. (2-tailed)
science score   Equal variances assumed       .86      .36      -3.92     103.00     .00
                Equal variances not assumed                     -3.89      95.85     .00

t-Test Output Table for SES and Science Scores (Levene's test)

In this example the value is 0.36, which is greater than our selected significance level of 0.05, therefore we can use the first row, labeled "Equal variances assumed".
2. The "Sig. (2-tailed)" column will also give us the calculated p-value. A value of less than 0.05 can be considered significant. In this example the calculated p-value of ".00" indicates that p < 0.01. As noted earlier, the PSPP output tables will only show calculated p-values to two decimal places. Therefore any calculated value that appears as ".00", as displayed in our example output table above, should be reported as "p < 0.01".


                                              Levene's Test for Equality of Variances
                                              F        Sig.      t         df        Sig. (2-tailed)
science score   Equal variances assumed       .86      .36      -3.92     103.00     .00
                Equal variances not assumed                     -3.89      95.85     .00

t-Test Output Table for SES and Science Scores (t-value)

3. The Confidence Interval also gives us an important measure of statistical significance. When using the t-Test to compare the means of two groups, the hypothetical difference between the means under the null hypothesis would be zero (0).
4. The output table for the Independent Samples t-Test shows the interval for the mean difference at the 95% confidence level. The range displayed shows the upper and lower bounds of what we believe the difference to be within this sample. In this example (see the table below) we note that the mean difference ranges from 3.80 to 11.69 for our two selected student groups.

(In the table below, the F and Sig. columns come from Levene's Test for Equality of Variances; the remaining columns belong to the t-test for Equality of Means, including the 95% Confidence Interval of the Difference.)

                                              F      Sig.     t       df       Sig.         Mean         Std. Error    95% CI    95% CI
                                                                               (2-tailed)   Difference   Difference    Lower     Upper
science score   Equal variances assumed       .86    .36      3.92    103.00   .00          7.75         1.99          3.80      11.69
                Equal variances not assumed                   3.89     95.85   .00          7.75         1.99          3.80      11.70

Output for Independent Samples with Confidence Interval

If the confidence interval DOES NOT contain zero (0), we may reject the null hypothesis that the mean difference is zero and note that the differences are statistically significant. If the confidence interval had CONTAINED zero (0), then we would likely have failed to reject the null hypothesis and concluded that there were no statistically significant differences between the means.
5. Now that we have determined that the difference in science assessment scores between high SES students and low SES students is statistically significant (p