Elements of Statistics: A Hands-On Primer [1 ed.] 1527500349, 9781527500341

This book represents a crucial resource for students taking a required statistics course who are intimidated by statistical symbols, formulae, and daunting equations.




Elements of Statistics

Elements of Statistics: A Hands-on Primer

By Raghubar D. Sharma

Elements of Statistics: A Hands-on Primer
By Raghubar D. Sharma

This book first published 2017

Cambridge Scholars Publishing
Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Copyright © 2017 by Raghubar D. Sharma
Cover image: From Chaos to Statistics © Jaspal Singh Cheema, 2017

All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN (10): 1-5275-0034-9
ISBN (13): 978-1-5275-0034-1

For Anjana, Vivek, Shilpa and Pratik

Dedicated to my high school mathematics teacher, Mr. C.L. Sharma, whose gentle teaching style instilled a love for mathematics in me and my brothers.

TABLE OF CONTENTS

List of Illustrations
Introduction

Chapter One: Basic Math and Symbols Used in Statistics
    Learning Objectives
    Introduction
    Basic Math Needed to Learn Statistics
    The Order of Operations
    Decimals and Fractions
        Truncation
        Exponents
        Square Root
        Logarithms
    Some Common Symbols Used in Statistics
        Summation (Σ) and Constant (c)
        Summation (Σ), Constant (c), and a Variable
    Exercises for Practice

Chapter Two: Statistical Thinking: Levels of Measurement and Variables
    Learning Objectives
    Introduction
    Types of Numbers and Levels of Measurement
    Types of Variables
        Continuous Variables
        Discrete or Qualitative or Categorical Variables
    Why Different Levels of Measurement Matter
    Exercises for Practice

Chapter Three: Graphs and Charts: Use and Misuse of Visuals
    Learning Objectives
    Introduction
    Grouped Data
        Cumulative Frequency
        Cumulative Percentage
    Graphs and Charts
        Rules for Creating Graphs and Charts
        Pie Chart and Double-counting
    Misuse of Graphic Representation
        Eliminating Zero from the Y-axis
        Stretching or Shrinking the Axis
        Avoiding Misuse of Graphic Representation
    Exercises for Practice

Chapter Four: Central Tendency: Average Tends to Be a Central Number
    Learning Objectives
    Introduction
    Central Tendency
    Measures of Central Tendency
    Mode
        Bimodality and Multimodality
    Median
        Median of Grouped Data
        Some Limitations of the Median
    Mean
        Mean of Grouped Data
    Weighted Mean: The Mean of Means
    Skewness, Mean and Median
        Symmetrical Distribution
        Positively or Right-Skewed Distribution
        Negatively or Left-Skewed Distribution
    Summary Table: Level of Measurement and Type of Central Tendency
    Exercises for Practice

Chapter Five: Variability: Measures of Dispersion
    Learning Objectives
    Introduction
    Dispersion or Variability
        The Range
        The Deviation
        The Mean or Average Deviation
        Variance and Standard Deviation
        Variance
        Calculating Variance
        Standard Deviation
        Standard Deviation of Grouped Data
        An Alternate Method
    Coefficient of Variation
    Practical Use of the Coefficient of Variation
    Exercises for Practice

Chapter Six: Probability
    Learning Objectives
    Introduction
    The Concept of Randomness
    Moving beyond Descriptive Statistics
    Probability Statement
    Rules of Probability
    Theoretical and Empirical Probability
        Theoretical Probability
        Empirical Probability
    Law of Large Numbers and Probability
    The Multiplication Rule
        Independent Events
        Dependent Events
    The Addition Rule
        Mutually Exclusive Events
        Non-Mutually Exclusive Events
    Odds
        Odds Ratios
    Calculating Probability from Odds
        Probability Ratio
    Role of Probability in Statistical Inference
    Exercises for Practice

Chapter Seven: Sampling
    Learning Objectives
    Introduction
    A Sample
        Sampling Frame
        Sampling Error
        Non-Sampling Error
    Types of Samples
        Probability Samples
        Advantages of Probability Sampling
    Sample Size
        Non-Probability Samples
    Exercises for Practice

Chapter Eight: The Sampling Distribution and the Normal Curve: Generalizing from a Sample to the Population
    Learning Objectives
    Introduction
    Sampling Distribution of Means
        Central Limit Theorem
        Normal Curve
        Salient Features of the Sampling Distribution of Means
        Standard Error of the Sample Mean
    Confidence Intervals
        Calculating a Confidence Interval
    Exercises for Practice

Chapter Nine: Normal Distribution and its Relationship with the Standard Deviation and the Standard Scores
    Learning Objectives
    Introduction
    Standardized Scores
        Conversion of a Raw Score to a Standard Score (z-Score)
        Calculating the z-Score
        Conversion of a z-Score into a Raw Score
        Finding the Probability of an Event Using the z-Score and the Normal Curve
        Tails of a Curve
    Exercises for Practice

Chapter Ten: Examining Relationship
    Learning Objectives
    Introduction
    Cross-tabulation
        Univariate Tables
        Bivariate Tables
        Multivariate Tables
    Test of Significance
        Hypothesis Testing
        Assumptions
        Independence
        Normality
        Randomness
        One-tailed and Two-tailed Tests
        Level of Significance
        Level of Confidence
        p-Values
        Calculating a p-Value
        Use of p-Values
        The Number of Degrees of Freedom
        Type 1 Error
        Type 2 Error
    Four Steps for Testing a Hypothesis
        Step 1: Hypothesis Statement
        Step 2: The Test of Significance
        Step 3: Calculation of the Test Statistic
        Step 4: Decision Rule
    Exercises for Practice

Chapter Eleven: Tests of Significance for Nominal Level Variables
    Learning Objectives
    Introduction
    Research Question, the Level of Measurement and Choice of a Test
    Visual Evaluation of the Relationship between Two Variables
    The Chi-square (χ²): A Test of Significance
        Uses of the Chi-square (χ²) Test
        Some Requirements for a Chi-square (χ²) Test
        Calculation Steps for the Chi-square (χ²)
        The Chi-square (χ²): A Test of Dependency
        Expected Frequencies
        The Chi-square (χ²): A Test of "Goodness of Fit"
        Expected Frequencies
    Measuring the Strength of Association
        Phi (φ)
        Interpretation of Phi
        Cramer's V
        Lambda (λ)
    Exercises for Practice

Chapter Twelve: Tests of Significance for Ordinal Level Variables
    Learning Objectives
    Introduction
    Kruskal's Gamma (γ)
    Spearman's Rho (ρs) or Spearman's Rank Correlation Coefficient
        Significance Level of Spearman's Rho (ρs)
    Somers' D
    Kendall's Tau-b (τb)
        Significance Level of Somers' D and Kendall's Tau-b (τb)
    Exercises for Practice

Chapter Thirteen: Tests of Significance for Interval Level Variables
    Learning Objectives
    Introduction
    Generalizing from a Sample to Its Population
    The t-Distribution
    The t-test
        Standard Error
        Calculating the t Value
    The t-test for Comparing Two Sample Means
    Exercises for Practice

Chapter Fourteen: Measuring Relationship between Two Interval Level Variables
    Learning Objectives
    Introduction
    Coefficient of Correlation (Pearson's r)
        Calculation of the Coefficient of Correlation
        The Significance of Sample Size
        Summary: Interpretation of the Correlation Coefficient
    Coefficient of Determination (R-Squared)
    Linearity
    An Ingenious Utility of Correlation
    Exercises for Practice

Chapter Fifteen: Power of a Statistical Test
    Learning Objectives
    Introduction
    Power of a Test
        Calculating the Power of a Test
        Factors Affecting the Power of a Test
        The Effect of Sample Size on Power
        The Effect of the Significance Level (α) on Power
        The Effect of the Directional Nature of the Alternate Hypothesis
        Parametric and Non-parametric Tests and Power
        Type 1 Error, Type 2 Error and Power
    Exercises for Practice

Chapter Sixteen: Analysis of Variance (ANOVA)
    Learning Objectives
    Introduction
    Analysis of Variance (ANOVA)
    One-way Analysis of Variance
        Total Sum of Squares (SST)
        Sum of Squares within Groups (SSW)
        Sum of Squares between Groups (SSB)
    ANOVA Table
    Simpler Method
        Total Sum of Squares (SST)
        Sum of Squares within Groups (SSW)
        Sum of Squares between Groups (SSB)
    Limitation of ANOVA
    Exercises for Practice

Chapter Seventeen: Regression Analysis: A Prediction and Forecasting Technique
    Learning Objectives
    Introduction
    Regression Equation and its Application for Forecasting
        Regression Coefficients
    Variance Explained: The Second Major Use of Regression
        Calculation of the Coefficient of Determination (R²)
        Interpretation of R-Square
    The Correlation Coefficient (r) and the Regression Coefficient (b)
    Exercises for Practice

Solutions to Exercises for Practice
Appendices: Tables
References
Statistical Procedure and Tests by Appropriate Level of Measurement
Index

LIST OF ILLUSTRATIONS

Figures

Figure 3-1: Vertical Bar Graph
Figure 3-2: Horizontal Bar Graph
Figure 3-3: Line Graph
Figure 3-3A: Vertical Bar Graph
Figure 3-4: Pie Chart
Figure 3-5: Vertical Bar Graph with Zero on Y-Axis
Figure 3-6: Vertical Bar Graph without Zero on Y-Axis
Figure 3-7: Vertical Bar Graph without Stretched Axis
Figure 3-8: Vertical Bar Graph with Stretched X-Axis
Figure 4-1: Mode in a Pie Chart
Figure 4-2: Mode in a Vertical Bar Graph
Figure 4-3: Mode in a Line Graph
Figure 4-4: Bimodal Graph
Figure 4-5: Symmetrical Distribution
Figure 4-6: Positively or Right-Skewed Distribution
Figure 4-7: Individual Total Income, Canada, 2010
Figure 4-8: Negatively or Left-Skewed Distribution
Figure 4-9: Death Rates by Age, Canada, 2000-02
Figure 8-1: Normal Curve or Gaussian Curve
Figure 8-2: Normal Distribution of Observations between 1, 2, 3, and 4 Standard Deviations
Figure 9-1: Normal Curve with Mean = 30 and Standard Deviation = 4
Figure 9-2: Mary's Score (166)
Figure 9-3: John's Score (145)
Figure 9-4: Area under the Normal Curve between Two Scores
Figure 14-1: Wages Earned by Number of Hours Worked
Figure 14-2: Death Rates by Age, Canada, 2005
Figure 17-1: Regression Line

Tables

Table 1-1: Calculations for a. Sum of Squared Deviations of X and Y, b. Sum of Products of X and Y, and c. Sum of Product of Summations of X and Y
Table 3-1: A Frequency Distribution of Books by Subject
Table 3-2: A Frequency Distribution of Number of Persons by Income
Table 3-3: A Frequency Distribution of Number of Persons by Income
Table 3-4: Workforce by Employment Equity Status
Table 4-1: Responses for Perceiving Pollution in a City
Table 4-2: Number of Persons by Age Interval
Table 4-3: Frequency, Midpoint and Estimated Total Years by Age Interval
Table 4-4: Calculation of Weighted Mean
Table 4-5: Symmetrical, Positively Skewed and Negatively Skewed Distributions
Table 4-6: Measure of Central Tendency and Level of Measurement
Table 5-1: Calculation of the Mean Deviation
Table 5-2: Calculation of Variance
Table 5-3: Calculation of Standard Deviation of Grouped Data
Table 5-4: Standard Deviation of Grouped Data by Simpler Method
Table 7-1: Calculation of the Sample Size for each Stratum
Table 7-2: Calculation of Sample Size from the Percentage Distribution of Population by Age in each Stratum
Table 10-1: Number of Families by Type
Table 10-2: Income by Sex
Table 10-3: Income by Sex and Age
Table 11-1: Number of Voters by Party Preference
Table 11-2: Number of Voters by Party Preference and Sex of Respondents
Table 11-3: Percentage of Voters by Party Preference and Sex of Respondents
Table 11-4: Party Affiliation by Religion
Table 11-4a: Party Affiliation by Religion, Calculation of Expected Frequencies
Table 11-4b: Party Affiliation by Religion, Expected Frequencies
Table 11-4c: Party Affiliation by Religion, Calculation of Chi-square
Table 11-5: Sales Patterns before the Introduction of a New Cereal
Table 11-5a: Calculation of Expected Frequencies after the Introduction of a New Cereal
Table 11-5b: Expected Sales Patterns after the Introduction of a New Cereal
Table 11-5c: Sales Patterns before and after the Introduction of a New Cereal, Calculations of the Chi-Square
Table 11-6: Took Remedial Course and Did Well on the Test, Observed Frequencies (fo)
Table 11-6a: Took Remedial Course and Did Well on the Test, Calculations for Expected Frequencies (fe)
Table 11-6b: Took Remedial Course and Did Well on the Test, Expected Frequencies (fe)
Table 11-6c: Calculation for Chi-Square
Table 11-7: Attitude toward Abortion by Religion
Table 12-1: Support for Abortion (y) by Religiosity (x)
Table 12-2: The Gross Domestic Product and the Human Development Index of Selected Nations
Table 12-3: Support for Charitable Giving by Church Attendance
Table 12-3a: Calculations of Concordant Cells
Table 12-3b: Calculation of Discordant Cells
Table 12-3c: Calculation of Ties (Ty) on the Dependent Variable
Table 12-3d: Calculation of Ties (Tx) on the Independent Variable
Table 14-1: Number of Kilometres Walked (x) per Day in a Month and Number of Kilograms Reduced (y)
Table 14-2: Wages Earned by Number of Hours Worked
Table 15-1: Critical Value of z to Reject H0 at a Given α-Level
Table 15-2: Decision on H0 and Type of Error
Table 16-1: Reduction in Blood Pressure by Treatment
Table 16-2: ANOVA Table
Table 16-3: Reduction in Blood Pressure in Groups Using Medication (Group 1), Diet (Group 2), and Exercise (Group 3)
Table 17-1: Number of Kilometres Walked (x) per Day in a Month and Number of Kilograms of Weight Reduced (y)
Table 17-2: Number of Kilometres Walked (x), Actual Number of Kilograms Reduced (y), and Predicted Number of Kilograms Reduced (ŷ)

INTRODUCTION

Statistics has become a form of logic or rhetoric that everyone needs to learn to navigate the modern world. Though this book is primarily aimed at undergraduate students who are required to take at least one compulsory statistics course before graduation, it is also a valuable guide for anyone who has little or no background in statistics and wants to become statistically literate. Without the pretension of the famous book that statistics can be learned without tears, or that you don't need to understand symbols, formulae, and equations, this book will prepare you to understand basic statistics and to complete your statistics course without anxiety. It has been written with the conviction that you don't need to be a mathematician to learn statistics. It is a crucial resource for students taking a required statistics course who are intimidated by statistical symbols, formulae, and daunting equations.

The application of statistics in social research has become imperative. A gap usually exists between the time when students take their first statistics course and when they engage in their first serious research project (typically during an internship, a research methods course, or their final year project or thesis). Because of this gap, students often don't remember basic statistics well enough to apply it effectively in their research. Hence, there is a need for a "desk reference," "refresher," or "core concept" text: an Elements of Statistics for burgeoning researchers, à la Strunk and White's Elements of Style. This book will serve as an excellent desk reference, refresher, or core-concept text for the budding researcher. It will also be helpful while interning or working as a research assistant or research associate.

Those who feel left out when their colleagues, supervisors, or bosses use statistics will benefit from this book. This particular group of people has been on my mind for a long time. When I was employed as a workforce data analyst for an employment equity office in the early 1990s, I routinely prepared reports and presentations for senior management. The basic education of my manager was grade 11, and her boss, an assistant deputy minister (ADM), had grade 12 with an accounting certification. They both had missed the opportunity to learn statistics in school. One day, my manager returned from her boss's office after discussing a report I had prepared and told me that the ADM preferred "circles," not "hills," in the report. After some pondering, I realized that her boss preferred pie charts over bar graphs. This report was on employment equity designated groups, which included Aboriginal peoples, women, and persons with disabilities, where a respondent could be counted more than once. For example, the same person could be counted as a woman, an Aboriginal person, and a person with a disability. In Chapter 3, I explain that a pie chart is not appropriate where there is double-counting of respondents or observations. Because my manager had little to no statistical literacy, it would have been futile to explain to her why pie charts are not a good idea for this kind of data. But since then, I have felt that I should write a book on statistics that could help people like my manager and her boss.

Today, even if you work in non-statistical areas such as policy, communications, and journalism, you need to have some knowledge of statistics. Nowadays, statistical literacy is as important as literacy itself. This book is written in a self-help, hands-on learning style so the reader can easily attain the skills needed to achieve a basic understanding of statistics and be comfortable with presentations loaded with statistics. The book follows an easy-to-comprehend format: it gives a strong foundation in the basics, while worked calculations elaborate on them in sequences designed for students and general readers who have never taken a statistics course. Simply put, it is a hands-on primer. The idea is that when you're reading it, you won't need a calculator or a computer.

The book contains 17 chapters along with statistical tables in the appendix. Chapter 1 provides a refresher on basic math and statistical symbols, while Chapter 2 discusses the levels of data measurement and types of variables. Chapter 3 deals with the visual representation of data; it also cautions researchers on the use and misuse of graphics. Chapters 4 and 5 discuss the measures of central tendency and variability, respectively. Chapter 6 familiarizes the reader with the basic concepts of probability. As researchers are always required to work with samples, Chapter 7 is devoted to methods for selecting appropriate samples. The next three chapters deal with important concepts that a researcher must know before applying a statistical technique to data. Chapter 8 discusses the sampling distribution and the normal curve, particularly with respect to generalizing from a sample to the population. Chapter 9 elucidates the relationship between the normal distribution and standard scores; it also covers the conversion of raw scores into standard scores, an essential requirement for comparative research. Chapter 10 examines relationships between variables as well as the procedure and essential concepts for testing a hypothesis.


Chapters 11, 12, and 13 are devoted to tests of significance for the nominal-, the ordinal-, and the interval/ratio-level variables, respectively. Chapter 14 discusses the correlation coefficient, and Chapter 15 is devoted to the power of a statistical test. Chapter 16 provides the basics on analysis of variance (ANOVA), and Chapter 17 focuses on regression. Thus, the book ends with a comprehensive survey of applied statistics and fills the lacunae left by the majority of statistics books. All exercises and examples in the book have been developed by the author. Due care has been taken to credit the sources used in the book. Any omission in referencing is, of course, unintentional and once pointed out will be rectified in the next edition.

CHAPTER ONE

BASIC MATH AND SYMBOLS USED IN STATISTICS

Learning Objectives

In order to learn statistics, you need some knowledge of basic mathematics, which most of you have already acquired during your grade and high school years. Because memories tend to fade over time, this chapter serves as a refresher of basic mathematical operations. It also introduces some commonly used statistical symbols. Specifically, you will learn about:
    • the order of operations;
    • fractions, decimals, exponents, and logarithms; and
    • the most frequently used statistical symbols.

Introduction

H.G. Wells once said, "Statistical thinking one day will be as necessary for the efficient citizenship as the ability to read and write."¹ Arguably, that day has arrived, as we are bombarded daily with statistics from television commentators, newspapers, popular multimedia, and advertising billboards. Terms and phrases such as batting average, the chance of winning an election, outliers, and median income are all statistics that are routinely used in popular media. Yet many people, including university students, think that statistics is not relevant to their learning. Statistics is not taken seriously because we continuously hear that statistics is not an objective science. Yet experts in various fields often use statistics to refute each other's claims (Haan², 2008), and politicians frequently use statistics during election campaign debates to score points over one another. The idea that statistics can be used to lie was made popular by Darrell Huff³ in 1954. In fact, Huff's book, How to Lie with Statistics, does not propagate lying with statistics; it illustrates how statistics can be misused.

¹ Wells, H.G. 1903. Mankind in the Making. London: Chapman & Hall, p. 204.
² Haan, Michael. 2008. Introduction to Statistics for Canadian Social Scientists. Don Mills, Ontario: Oxford University Press.
³ Huff, Darrell, and Irving Geis (illustrator). 1954. How to Lie with Statistics. New York: W.W. Norton and Company Inc.

The following example shows that the average is meaningless without reference to the spread (variability) of the data. Let's say that five students in a class brought $5, $6, $2, $2, and $60 each to buy lunch. If you used only the average, you would say that on average a student brought $15 for lunch. The use of the average suggests that students brought more-than-sufficient money for lunch, whereas at least two, or 40%, of the students had enough money only to buy a soft drink. In statistical terms, the $60 is an outlier, which increases the spread of the data. Using an average without considering the spread of data around the mean value can create a misleading impression.

Another misconception is that to learn basic statistics you need to know advanced mathematics, when, in fact, knowledge of basic math is sufficient to understand most statistics. In the next section, we review the necessary math needed to learn statistics. Though some equations in statistics textbooks may look daunting, you can easily understand and apply basic statistics without attempting to solve these intimidating equations.
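As an optional aside, if you already have Python at hand, the lunch-money example is easy to reproduce; this minimal sketch uses only the statistics module that ships with Python.

    import statistics

    money = [5, 6, 2, 2, 60]          # dollars each student brought for lunch
    print(statistics.mean(money))     # 15 — pulled upward by the $60 outlier
    print(statistics.median(money))   # 5  — closer to what a typical student brought

The median, taken up in Chapter 4, is one way to guard against the misleading impression an outlier-inflated average can create.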

Basic Math Needed to Learn Statistics

You can learn to apply advanced-level statistics without learning advanced-level math. These days, most statistical calculations are done by computer software. You can learn to interpret statistics produced by statistical software such as SPSS and SAS without learning advanced-level mathematics. Even if your basic math is rusty, the following review will be sufficient to learn essential statistics.

The Order of Operations

The order of operations is the basic principle that governs the sequence of operations, which is: bracket (parenthesis), exponent, division, multiplication, addition, and subtraction. You may be familiar with BEDMAS, an acronym you likely learned in grade or high school to memorize the sequence of operations, where:

B stands for Bracket;
E stands for Exponent;
D stands for Division;
M stands for Multiplication;
A stands for Addition; and
S stands for Subtraction.

Let's say we have to solve the following equation to find the value of x:

x = 2 × (3 + 5)² ÷ 4 + 6 − 7

Applying BEDMAS, we first evaluate the expression in the bracket: (3 + 5) = 8. Next, we find the value of the exponent: 8² = 64. The next step is division: 64 ÷ 4 = 16. After division comes multiplication: 2 × 16 = 32. After multiplication comes addition: 32 + 6 = 38. The final step is subtraction: 38 − 7 = 31.

Note: If there is more than one operation within the bracket, follow the same sequence. For example, if the expression in the bracket is (5 + 6 ÷ 2 × 4 − 1), then perform the operations in the following sequence: first, divide 6 by 2 = 3; next, multiply 3 by 4 = 12; then, add 5 to 12 = 17; and then subtract 1 from 17 = 16. The value in the bracket is 16.
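Python follows the same precedence rules (with division and multiplication of equal rank evaluated left to right, which gives the same result here), so the examples above can be checked directly; this is a minimal sketch, not part of the worked example itself.

    # BEDMAS check: bracket, exponent, division/multiplication, addition/subtraction
    x = 2 * (3 + 5)**2 / 4 + 6 - 7
    print(x)                    # 31.0
    print(5 + 6 / 2 * 4 - 1)    # 16.0 — the bracketed expression from the note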

Decimals and Fractions

It is necessary to be comfortable with decimals because many statistics and statistical relationships are expressed in decimals. You can use a calculator to calculate a decimal from a fraction. One way to convert a fraction to a decimal is to divide the numerator by the denominator. For example, 8/16 is equal to 8 divided by 16, which is equal to 0.5. A fraction can also be expressed as a percentage by converting the fraction to a decimal and then multiplying it by 100: 0.5 × 100 = 50%.

You will also learn that the concept of probability is central to statistical prediction. Probability is expressed in decimals, and the chance of an event happening is expressed as a percentage. For example, if there is a 30% chance of catching a fish from a river, you could say that the probability of catching a fish is 0.3.

The following is a refresher from grade or high school mathematics on fractions, decimals, and percentages, which you can calculate by hand or with a basic calculator.

Finding the Percent of a Number
Example: find 92% of 28.
    • Multiply the number by the percent: 28 × 92 = 2576.
    • Divide the total by 100: 2576 ÷ 100.
    • To find the answer, move the decimal point two places to the left: 25.76.

Finding Percentage
Example: what percent is 28 of 92?
    • Divide the first number by the second: 28 ÷ 92 = 0.3043.
    • Multiply the answer by 100: 0.3043 × 100.
    • Move the decimal point two places to the right: 30.43%.

Converting a Fraction to a Decimal
Example: convert 1/3 to a decimal.
    • Divide the numerator of the fraction by the denominator: 1 ÷ 3 = 0.3333.

Converting a Fraction to a Percent
    • After converting a fraction to a decimal, simply multiply by 100 and move the decimal point two places to the right: 0.3333 × 100 = 33.33%.

Converting a Percent to a Fraction
Example: convert 75% to a fraction.
    • Remove the percent sign from 75.
    • Make a fraction with the percent as the numerator and 100 as the denominator: 75/100 = 75 ÷ 100 = 0.75.

Converting a Decimal to a Percent
Example: convert 0.75 to a percent.
    • Multiply the decimal by 100: 0.75 × 100 = 75.
    • Add a percent sign to 75: 75%.

Converting a Percent to a Decimal
Example: convert 75% to a decimal.
    • Divide the percent by 100: 75 ÷ 100 = 0.75.

Rounding Decimals
The rounding of decimals makes calculation easier and the presentation of numbers clearer. One reason we usually need to round a number is that a decimal may extend endlessly. For example, 1/3 results in 0.3333333333. In the rounding of decimals, we need to consider two questions:

1. How many places should we carry the decimal point?
2. How do we decide the value of the last retained digit?

The answer to the first question is that it depends on the type of data. Some demographic data, such as survival rates, may extend to the sixth decimal place, whereas in many other situations you might decide to keep only one decimal place. Conventionally, we round to two decimal places. The answer to the second question is that we retain the value of the last decimal place if the digit after the retained decimal place is less than 5. For example, 2.344 is rounded to 2.34. We increase the value of the retained decimal place by 1 if the next digit is 5 or greater. For example, 3.765 is rounded to 3.77.

Some statisticians suggest that in a dataset, numbers ending with 5 after the decimal point should be rounded up half of the time and down the other half of the time. For example, say the first number ending with 5 after the decimal point is 2.235; it is rounded down to 2.23. Say the second number ending with 5 after the decimal point is 6.475; it is rounded up to 6.48. However, I would suggest that for numbers ending with 5 after the decimal point, you round up every time.
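Each conversion above is one or two arithmetic operations, so it is easy to verify by machine. The sketch below assumes Python; its decimal module reproduces the "always round 5 up" convention recommended here, whereas the built-in round() rounds half to even on binary floats and can therefore surprise.

    from decimal import Decimal, ROUND_HALF_UP

    print(28 * 92 / 100)    # 25.76     -> 92% of 28
    print(28 / 92 * 100)    # 30.434... -> 28 is about 30.43% of 92
    print(1 / 3)            # 0.3333... -> fraction to decimal
    print(75 / 100)         # 0.75      -> percent to decimal

    # Textbook rounding: keep two places, rounding a trailing 5 upward.
    print(Decimal("2.344").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 2.34
    print(Decimal("3.765").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 3.77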

Truncation

When we retain the decimal place just as it is, without changing the value of the decimal place, it is called truncation. Some computer programs truncate numbers without changing the value of the retained decimal place; for example, 2.334 and 2.337 are both retained as 2.33. In other words, both numbers are truncated to 2.33.
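Truncation, unlike rounding, simply drops the extra digits. One way to mimic it in Python (a sketch; the helper name truncate is ours, not a library function):

    import math

    def truncate(x, places=2):
        # Keep `places` decimal places and drop the rest without rounding.
        factor = 10 ** places
        return math.trunc(x * factor) / factor

    print(truncate(2.334))  # 2.33
    print(truncate(2.337))  # 2.33 — the 7 is dropped, not rounded up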

Exponents

In statistics, exponents are used quite routinely; therefore, it is important to familiarize yourself with them. Basically, an exponent indicates the number of times a number is multiplied by itself. For example, 2³ means that 2 is multiplied 3 times: 2 × 2 × 2. In the expression 2³, the 2 is called the base and the 3 is called the exponent, or power; the expression is commonly read as 2 raised to the power of 3.

Multiplication and Division of Two Exponents with Identical Bases

If two exponents with identical bases are multiplied, the rule is to add the exponents. For example, to multiply 2⁴ by 2³, you add the exponents 4 and 3. Thus, 2⁴ × 2³ becomes 2⁴⁺³, or 2⁷. Its value is: 2 × 2 × 2 × 2 × 2 × 2 × 2 = 128. If two exponents with identical bases are divided, the rule is to subtract the exponents. For example, to divide 2⁵ by 2³, you subtract the exponent 3 from 5. Thus, 2⁵ ÷ 2³ becomes 2⁵⁻³, or 2². Its value is: 2 × 2 = 4.
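These identities are easy to confirm in Python, where ** is the exponent operator (a quick check, nothing more):

    print(2**4 * 2**3, 2**(4 + 3))   # 128 128 — add exponents when multiplying
    print(2**5 / 2**3, 2**(5 - 3))   # 4.0 4   — subtract exponents when dividing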


Square Root

The square root of a number is the reverse operation of squaring that number. For example:

Square of n = n². If n = 3, then 3² = 3 × 3 = 9.
Square root of n = √n. If n = 9, then √9 = √(3 × 3) = 3.

Logarithms

You may have learned in high school that a logarithm is a special type of exponent. In practice, the base of a logarithm is usually either 10 or the number e ≈ 2.718. When the base is 10, it is called the common logarithm; when the base is 2.718, it is called the natural logarithm. What this means is that if we raise 10 to the power of 3 (10³), the answer is 10 × 10 × 10 = 1000, and, hence, the common log of 1000 is 3. Similarly, if we raise 2.718 to the power of 3 (2.718³), the answer is about 20 (2.718 × 2.718 × 2.718 ≈ 20.01), which means the natural log of 20 will be about 3, or, to be precise, 2.9957. The symbol ≈ stands for "approximately equal to."
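Python's math module covers both the square root and the two logarithms discussed above (a short check):

    import math

    print(math.sqrt(9))      # 3.0
    print(math.log10(1000))  # 3.0       — common (base-10) logarithm
    print(math.log(20))      # 2.9957... — natural (base-e) logarithm
    print(math.e)            # 2.71828..., the base of the natural logarithm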

Some Common Symbols Used in Statistics

The symbols used in statistical equations may intimidate a non-mathematical person, and this fear of symbols sometimes discourages people from learning statistics. The fear is unnecessary: once you understand the meaning of the symbols, it disappears and learning statistics becomes easy. Most statistics textbooks use X and Y as symbols for variables. Basically, a variable is a characteristic (such as sex or social class) or a quantity (such as age or income). A variable varies between its categories. For example, sex can take the value of male or female, and class might vary between the lower, middle, or upper class. Similarly, age might take any value from one day old to 100 years old, and income could vary from 0 dollars to billions of dollars. The symbol N is used for the number of persons or cases in a population, and the symbol n is used for the number of persons or cases in a sample. Generally, uppercase letters (X, Y, Z) are used to represent population characteristics, and lowercase letters (x, y, z) are used to denote sample characteristics.


The most dreaded symbol for a person unfamiliar with statistics is the Greek uppercase sigma, which is written as Σ and denotes the adding up, or summing, of numbers. For example, Σ(X₁, X₂, X₃) indicates that we are adding the quantities represented by the symbols X₁, X₂, and X₃. Simply put, if X₁ = 2, X₂ = 3, and X₃ = 4, then Σ(X₁, X₂, X₃) = 2 + 3 + 4 = 9. It is that simple. You will see sigma written as follows:

     N
     Σ Xᵢ
    i=1

In the above example, N = 3. Xᵢ indicates that X takes i values. Because X takes three values (2, 3, and 4), in this example i runs from 1 to 3. The summation sign, Σ, is the most frequently used symbol in statistics. The following rules will be helpful to understand its use.

Summation (Σ) and Constant (c)

Written in symbols: Σc = N × c. Σc means that the sum of a constant is equal to the number of times the constant appears in the series multiplied by the value of the constant: if c = 10 and N = 6, then N × c = 10 × 6 = 60. It is the same as: 10 + 10 + 10 + 10 + 10 + 10 = 60.

Summation (Σ), Constant (c), and a Variable

Written in symbols: ΣcXᵢ. The symbols above indicate that you first multiply each value of variable X by the constant and then add them up. If c = 10, X₁ = 3, X₂ = 4, and X₃ = 5, then ΣcXᵢ = 10 × 3 + 10 × 4 + 10 × 5 = 30 + 40 + 50 = 120.

Summation (Σ) and Two Variables (X and Y)

a. Σ(X − Y)² (Sum of Squared Deviations of X and Y)
b. ΣXY (Sum of Products of X and Y)
c. ΣXΣY (Product of Summations of X and Y)

Table 1-1 provides the calculations for a., b., and c. for two variables X and Y with three values.

Table 1-1: Calculations for a. Sum of Squared Deviations of X and Y, b. Sum of Products of X and Y, and c. Sum of Product of Summations of X and Y

         X     Y     (X − Y)         (X − Y)²       XY
    1    4     6     (4 − 6) = −2    (−2)² = 4      (4 × 6) = 24
    2    7     2     (7 − 2) = 5     (5)² = 25      (7 × 2) = 14
    3    6     8     (6 − 8) = −2    (−2)² = 4      (6 × 8) = 48
    Σ    17    16                    a. Σ(X − Y)² = 33   b. ΣXY = 86

c. ΣXΣY = 17 × 16 = 272

where:
a. is the sum of squared deviations of X and Y;
b. is the sum of products of X and Y; and
c. is the product of the summations of X and Y.

With only this much knowledge of math and statistical symbols, you will be able to understand most of statistics.
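To check the table (and the two summation rules above) by machine, here is a compact Python sketch using plain lists:

    X = [4, 7, 6]
    Y = [6, 2, 8]

    print(sum((x - y)**2 for x, y in zip(X, Y)))  # 33  -> a. Σ(X − Y)²
    print(sum(x * y for x, y in zip(X, Y)))       # 86  -> b. ΣXY
    print(sum(X) * sum(Y))                        # 272 -> c. ΣXΣY

    c, N = 10, 6
    print(N * c)                          # 60  -> Σc, a constant summed N times
    print(sum(c * x for x in [3, 4, 5]))  # 120 -> ΣcXᵢ with c = 10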

Exercises for Practice

Please solve the following equations:

1. 40 − 30 − (−5) + 3 =
2. (50 − 30) + 2 × 5 =
3. 5 × (10 − 8) − 3 =
4. 27 ÷ 3² × 5 =
5. 5 × 3 − 9 ÷ 3 =
6. 9 ÷ 3 − 5 × 3 =
7. √25 =
8. 5² + 4³ =
9. 5⁴ − 5² =
10. Find the common log of 100 (Hint: 10^? = 100).
11. Find the natural log of 7.388 (Hint: 2.718^? ≈ 7.388).
12. Use the values of X and Y in the following table and fill in the values for the question marks (?).

         X    Y    (X − Y)    (X − Y)²    XY
    1    5    3    ?          ?           ?
    2    4    2    ?          ?           ?
    3    3    6    ?          ?           ?
    Σ    ?    ?               a. Σ(X − Y)² = ?   b. ΣXY = ?

c. ΣXΣY = ? × ? = ?

CHAPTER TWO

STATISTICAL THINKING: LEVELS OF MEASUREMENT AND VARIABLES

Learning Objectives

In this chapter, you will learn that in statistics a particular type of number has a particular meaning and purpose. Specifically, you will learn about:
    • types of numbers;
    • levels of measurement;
    • the basic unit of statistics, the variable; and
    • the importance of levels of measurement in statistics.

Introduction

Statistics, in terms of data collection and analysis, has existed for a long time. The Romans took a census of able-bodied men to estimate the potential number of army recruits. But thinking in terms of probability and the laws governing probability is new; people began to see the world less in terms of chance or fatalism. According to Hacking⁴ (1975), a Canadian philosopher, the publication of numbers increased in the 19th century. The exponential growth in printed numbers led to the adoption of statistical thinking, which made the use of statistics pervasive in the modern world.

During election campaigns, you often hear TV anchors saying that the prediction of the electorate voting for a particular party will be true 19 times out of 20. When they say that an event will be true 19 times out of 20, they are making a probability statement. This casual use of probability implies that a common person understands it. People use statistics daily to frame an argument, just like rhetoric or logic. Statistics has become part of everyday life. It is used in every discipline, whether it be the social sciences, physical sciences, engineering, medicine, or biology. You just can't escape anymore by saying, "I don't plan to use statistics, so why should I learn it?" The world has become so number-oriented that even if you don't use statistics, you can't avoid the media, politicians, colleagues, friends, and common folks quoting statistics to you. It seems that it is now essential to understand statistics to function in society. Statistics has become a fundamental part of our functional literacy. Learning statistics amounts to learning a new way of thinking. The purpose of this book is to help you acquire basic statistical literacy so you can appreciate this new way of thinking.

⁴ Hacking, Ian. 1975. The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference. London: Cambridge University Press. Quoted in Haan, Michael. 2013. An Introduction to Statistics for Canadian Social Scientists. Don Mills, Ontario: Oxford University Press.

Types of Numbers and Levels of Measurement Numbers are used in different ways; for example, a number 7 and a number 9 on the uniforms of two baseball players are symbols. We can’t say who is a better player from these numbers, but we can identify the players. The number used for runs made by each player is a different type of number than the number on each uniform. This number can inform us which player played better. Basically, the number used for the runs represents quantity, whereas the number on the uniforms is just like a name. A number that is used similar to a name is a called a nominal-level number. Anything that has quantity or quality that varies is called variable. For example, sex varies between two categories: male and female. Sex is a nominal-level variable because its categories, male and female, are just labels for men and women, respectively. Similarly, if uniform number was a variable, it would be called a nominal-level variable. Categories of a variable are commonly called response categories or observations. When we can’t mathematically measure the difference between the response categories of a variable, it is called a nominal-level variable. Because we can’t measure the difference between two categories of sex (i.e., male and female), sex is a nominal-level variable. In other words, male and female are just names that we give to two genders. The other three levels of measurement are the ordinal level, the interval level, and the ratio level. At the ordinal level, as the name suggests, we can order response categories or observations of a variable. For example, social class is an


ordinal-level variable because we can order its response categories into the lower, middle, and upper classes. But we can't exactly measure the difference between the lower class and the middle class or between the middle class and the upper class. The third type of variable is called an interval-level variable. The distance between the response categories of an interval-level variable can be measured because its categories form measurable intervals. For example, temperature is an interval-level variable. We know that –20 degrees Celsius is colder than 0 degrees Celsius. But 0 here does not mean that temperature does not exist. We can't say that 20 degrees Celsius is twice as warm as 10 degrees Celsius because 0 does not represent "no temperature." Similarly, the year 0 in our calendar does not imply that time did not exist before the year 0. We use 1 B.C. and 1 A.D., which surround the year 0. Zero is arbitrarily fixed in interval-level variables. A variable in which 0 means "none" is called a ratio-level variable. For example, income is a ratio-level variable because an income of 0 dollars means no income. We can also measure the difference between the categories of a ratio-level variable. We can say a person who earns $30,000 is richer by $10,000 than a person who earns $20,000. Similarly, weight, height, and age are ratio-level variables. We can add, subtract, multiply, or divide values of both interval-level and ratio-level variables. The only difference is that in interval-level variables, 0 is arbitrarily fixed, and in ratio-level variables, 0 means "none." Because of the absence of an absolute zero, the values of interval-level variables can't be used substantively in multiplication or division. For example, 20 degrees Celsius is not twice as warm as 10 degrees Celsius, and a score of 700 on the Scholastic Aptitude Test (SAT) is not twice a score of 350. On the other hand, the values of a ratio-level variable have a substantively meaningful 0 value, interpreted as "none." Therefore, a ratio-level variable can be converted into ratios by multiplying or dividing its values. In other words, we can make ratio statements, such as a person aged 70 years is twice as old as a person aged 35 years, and an income of $25,000 is half as much as an income of $50,000. Students sometimes get confused when differentiating between the interval and ratio levels of measurement due to the over-emphasis on 0 meaning "none," especially if no zero values are observed (e.g., no respondent will be less than age 18 in a sample of adults). Similarly, if cherries picked by pickers are measured in kilograms, it would be unlikely to observe a worker who picked 0 kilograms of cherries. In grouped data, the variable "income" could be ordinal level, not ratio level. In other words, differentiating the level of measurement between interval-level and


ratio-level variables becomes nuanced, and statisticians who work with social or economic data commonly use the term interval level for both interval-level variables and ratio-level variables. The following summary outlines the properties of the four variable levels:

Level: Nominal
Summary Properties: A variable whose response categories are used as names. Distance between its response categories cannot be measured. Response categories cannot be ordered from low to high and vice versa. Examples: sex, hair colour, religion, and ethnic status.

Level: Ordinal
Summary Properties: A variable whose response categories can be ordered. Distance between its response categories cannot be measured. Examples: social class, a letter grade on an exam, and conservatism.

Level: Interval
Summary Properties: A variable whose response categories can be ordered. Exact distance between response categories can be measured. Zero is arbitrarily fixed. Examples: temperature and the calendar.

Level: Ratio
Summary Properties: A variable whose response categories can be ordered. Exact distance between response categories can be measured. Zero means nothing or none. Examples: income, age, and weight.

Apart from the difference in the nature of 0, interval-level and ratio-level variables have the same properties and are treated alike. In most areas, these two categories are collapsed into one, and researchers usually refer to both types of variables as interval-level variables. In social sciences, for example, the issue of a "true 0" versus an "untrue 0" rarely arises, but it is important to be aware of the difference between interval-level and ratio-level variables.


Types of Variables

A variable in statistics is similar to an atom in chemistry or physics or a cell in biology. It is a building block of statistical analysis. We must learn the definitions of the various types of variables because we will be using their names quite frequently in this book. You have already learned about levels of measurement of variables; therefore, now is the best time to introduce you to the various types of variables.

Continuous Variables

A variable is called continuous if its response categories or observations can assume any (infinite) number of real values. For example, time is a continuous variable because it can assume values of 2.0, 2.1, 2.2, … 2.9 seconds. Similarly, age, income, height, and weight are continuous variables; they differ from discrete variables, whose response categories assume fixed values. Interval-level and ratio-level variables are continuous variables: their response categories are not fixed but can take any value, and the distance between these values can be measured.

Discrete or Qualitative or Categorical Variables

Discrete variables are also called qualitative or categorical variables because the response categories or observations of these variables assume fixed values. For example, sex, family size, and the number of cars a family owns are discrete variables. You can't use decimal points for these variables (e.g., you can't say 1.2 males or 1.3 females). The following five types of variables are discrete variables because they have fixed response categories:

1. Nominal-level variables are discrete variables. For example, sex has two fixed values: male and female.
2. Ordinal-level variables are also discrete variables. For example, social class has fixed values such as the lower, middle, and upper classes.
3. Dummy variables are "proxy" variables. A variable is called a dummy variable when the response categories of "no" and "yes" of a qualitative variable are assigned the quantitative values of 0 and 1, or when the response categories of a qualitative variable are assigned values such as 1, 2, 3, 4, 5 to transform it into a categorical variable. For


example, names of regions of a country are assigned numbers: North = 1, South = 2, East = 3, West = 4, and Northwest = 5.
4. Preference variables are discrete variables in which response categories increase from 0 to 9 or decrease from 9 to 0. For example, a survey may solicit a respondent's preference for or satisfaction with a perfume using categories such as 1. Most disliked; 2. Disliked; 3. Neither liked nor disliked; 4. Liked; or 5. Most liked.
5. Multiple response variables are variables in which response categories can assume more than one value. For example, a survey questionnaire soliciting responses on the use of technology may give respondents the discretion to select more than one category from the following six categories:

1. Computer
2. Cell phone
3. Email
4. Facebook
5. Twitter
6. Internet

Why Do Different Levels of Measurement Matter?

Because specific statistical techniques can be used only to calculate the relationship between variables of specific levels of measurement, it is pertinent to know the level of measurement of variables when planning data analysis. We know that nominal-level variables have few or no arithmetic properties. Therefore, we can only identify an association between two nominal-level variables. For example, the place of residence, with response categories of London, Berlin, and Paris, and the type of beverage preferred, with response categories of tea, coffee, and hot chocolate, are two nominal-level variables. If our data show that tea is more popular in London, coffee in Berlin, and hot chocolate in Paris, we can only say that the place of residence is associated with the type of beverage preferred. Ordinal-level variables have some arithmetic properties, as their response categories can be ordered from low to high and vice versa. The strength and the direction of the relationship between two ordinal-level variables can be determined. But the rate of change can't be accurately measured. For example, social class, with response categories of the lower, middle, and upper classes, can be ranked from lower to upper; and


conservatism with response categories of low and high levels of conservatism can be ranked or ordered from low to high. When we say that conservatism is strongly associated with social class and it increases with an increase in social class, we are referring to the strength of the relationship by saying that the association between these variables is strong and to the direction of the relationship by saying that conservatism increases with an increase in social class. But we can’t find the rate of change in their relationship because we can’t quantify how much conservatism increases with an increase in social class. Interval-level and ratio-level variables are the most versatile for statistical analysis because we can exactly measure the difference between their categories. We can find a relationship as well as measure the direction and the strength of the relationship between interval-level and ratio-level variables. The rate of change between two interval-level or ratio-level variables can also be determined. For example, there is a positive relationship between education and income (i.e., income increases with an increase in education). When we say the relationship is positive, we are alluding to the direction of the relationship. We can also calculate the strength of the relationship between interval-level and ratio-level variables with the help of statistical techniques. We can also determine the rate of change because we can find out the exact amount of an increase in income for an increase in education. Interval-level and ratio-level data can be treated the same for most statistical analyses. Therefore, as far as statistical analysis is concerned, there are basically three levels of variables: nominal, ordinal, and interval/ratio. From this chapter on, we will use the term interval level for both interval-level and ratio-level variables. The material discussed in these two chapters is sufficient to embark on a journey of learning statistics—to use the proverbial statement, statistics is not rocket science; it is not hard to learn.


Exercises for Practice

Q1:

What is the level of measurement (nominal, ordinal, interval, or ratio) of each of the following variables?
1. Social class
2. Gender
3. Hair colour
4. Income in dollars
5. Age in years
6. Temperature in Celsius
7. Height in inches
8. Ethnic origin
9. Letter grade on exam
10. Place of birth

Q2: What is the difference between an interval-level variable and a ratio-level variable?

Q3: Are the following variables continuous or discrete?
1. Age in years
2. Social class
3. Percent grade on exam
4. Height in inches
5. Gender
6. Ethnic origin

CHAPTER THREE
GRAPHS AND CHARTS: THE USE AND MISUSE OF VISUALS

Learning Objectives

In this chapter, you will learn about the reduction of data and the visual representation of data. Specifically, you will learn about:
- frequency distribution;
- data grouping and percentiles;
- visual representation of data by graphs and charts; and
- misuse of graphic representation.

Introduction

Data is the plural form of the singular noun datum. In common parlance, however, people use data as a singular noun. In this book, we will also use data as a singular noun. Data collected in the form of responses or observations on a sample or a population is unwieldy and difficult to comprehend in raw form. Therefore, we need to reduce data into a comprehensible format called a frequency table or frequency distribution. Frequencies indicate the number of times an item, observation, or response category has occurred in a sample or a population. A frequency distribution is a simple tally of the observations or response categories of a variable. Let's say you want to construct a frequency distribution of books by subject on a library shelf. You assume that the books are placed randomly on the shelf. You pick up a book, recognize its subject, and then mark a slash next to the subject of the book in the column labelled Tally. You use four slashes to count four books and a strike-through line for the fifth book, as shown in Table 3-1. These are called tallies. Next, count the


number of tallies for each subject and write a number in the column labelled Frequency (f) for the subject. The lowercase f is used to indicate frequency.

Table 3-1 A Frequency Distribution of Books by Subject

Subject      | Tally             | Frequency (f)
Arts         | //// //// ///     | 13
Science      | //// ////         | 9
Medicine     | //// //// //// /  | 16
Engineering  | //// ////         | 10
Total        |                   | 48

The subject of a book is a nominal-level variable. Frequency tables are usually constructed for nominal- and ordinal-level variables, but they can be constructed for all types of data. Table 3-2 is an example of a frequency table constructed on the interval-level variable of income.

Table 3-2 A Frequency Distribution of Persons by Income

Income             | Frequency (f)
$30,000 and over   | 14
$20,001–$30,000    | 20
$10,001–$20,000    | 26
Less than $10,000  | 30
Total              | 90

Grouped Data

You'll notice that the income data in Table 3-2 has been grouped in intervals of $10,000. The intervals are not constructed arbitrarily. There is a conventional method that should be followed to construct intervals to group data. Let's say we have the following data on 10 persons by age:

Age: 18, 19, 17, 19, 21, 20, 18, 19, 19, 25

We take the following steps to group this data into intervals of equal size:

1. Sort the data from low to high so you can easily locate the lower limit of the lowest interval and the upper limit of the highest interval.

17, 18, 18, 19, 19, 19, 19, 20, 21, 25


The lower limit of the lowest interval is 17 or below 17. The upper limit of the highest interval is 25 or above 25.

2. Find the range (i.e., the difference between the highest value and the lowest value), and add 1 to the range.

Range = 25 – 17 = 8
8 + 1 = 9

3. Divide this number (the range + 1) by a value that divides it evenly (3 is a factor of 9 and will give a whole number).

9 ÷ 3 = 3

The interval size is 3.

4. Next, list numbers that are multiples of this interval size:

3, 6, 9, 12, 15, 18, 21, 24, 27

Begin with an interval that contains the lowest number of the distribution and stop at an interval that contains the highest number of the distribution. In this data, the lowest number is 17 and the highest number is 25. Starting with 15 for the interval size of 3, our frequency distribution is shown in Table 3-3 below. The 15–17 interval includes only one number, 17; therefore, we put one slash for the 15–17 interval. There are seven values in the 18–20 interval: 18, 18, 19, 19, 19, 19, 20. We assign seven slashes to this interval. There is only one number, 21, in the 21–23 interval, and only one number, 25, in the 24–26 interval; therefore, these two intervals get one slash each.

Table 3-3 A Frequency Distribution of Persons by Age

Age    | Tally    | Frequency (f) | Cumulative Frequency (cf) | Cumulative Percentage
15–17  | /        | 1             | 1                         | 10
18–20  | //// //  | 7             | 8                         | 80
21–23  | /        | 1             | 9                         | 90
24–26  | /        | 1             | 10                        | 100
Total  |          | 10            |                           |


In Table 3-3 above, two new columns have been added: the cumulative frequency (cf) and the cumulative percentage. The cumulative frequencies are needed to calculate the cumulative percentages. Here is how cumulative frequencies and cumulative percentages are calculated.

Cumulative Frequency

The cumulative frequency is the sum of all the previous frequencies up to the current point or interval. Frequencies for each interval are added from the lowest interval to the highest. In our example, the lowest age interval is 15–17 and includes only one person. The next age interval is 18–20 and includes seven people. We add the 1 from the 15–17 interval and the 7 from the 18–20 interval; the cumulative frequency for this interval is 8. Similarly, we add the frequency of the age interval 21–23 to the cumulative frequency up to the previous interval: 8 plus 1 equals a cumulative frequency of 9 for the age interval 21–23. We add the 1 frequency of the age interval 24–26 to the cumulative frequency of 9 from the previous interval to arrive at a cumulative frequency of 10 for the age interval 24–26. The cumulative frequency of the highest interval, 10, includes all the frequencies.

Cumulative Percentage

A cumulative percentage is a percentage based on cumulative frequencies. For example, the total frequency in Table 3-3 is 10, and the cumulative frequency in the 18–20 age group is 8; thus, 8 ÷ 10 × 100 = 80%. This means that 80% of the persons are below the age of 21. Looking at Table 3-3, we can say that in this distribution 10% of the persons are under the age of 18; 80% are under the age of 21; 90% are under the age of 24; and all 100% are under the age of 27.
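To make the steps above concrete, here is a minimal sketch in Python (plain standard library; the variable names are ours, not the book's) that reproduces the frequency, cumulative frequency, and cumulative percentage columns of Table 3-3 from the raw ages:

ages = [18, 19, 17, 19, 21, 20, 18, 19, 19, 25]
intervals = [(15, 17), (18, 20), (21, 23), (24, 26)]  # interval size 3, bounds as above

total = len(ages)
cumulative = 0
for low, high in intervals:
    f = sum(1 for a in ages if low <= a <= high)  # frequency (f)
    cumulative += f                               # cumulative frequency (cf)
    pct = 100 * cumulative / total                # cumulative percentage
    print(f"{low}-{high}: f={f}, cf={cumulative}, cum%={pct:.0f}")

Running this prints the same values as Table 3-3: frequencies 1, 7, 1, 1 and cumulative percentages 10, 80, 90, and 100.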

Graphs and Charts

Graphs and charts are visual aids that make it easy and quick to understand data. Bar graphs, line graphs, and pie charts are the basic types of graphical presentation of data; others are variants of these three types. Data is tabulated in the form of frequency tables, and then graphs are created using the data in the columns or rows of these tables. Graphs can be easily created with the help of Microsoft Excel or other software. But we


must remember the rules and conventions for creating graphs and charts. These rules and conventions are discussed in the following section.

Rules for Creating Graphs and Charts

Vertical Bar Graph

1. X-Axis: The horizontal axis on the graph is called the x-axis, or the abscissa. Response categories or observations should always appear on the x-axis. For example, the subject of books shown in Figure 3-1 and sex in Figure 3-2 should appear on the x-axis.
2. Y-Axis: The vertical axis on the graph is called the y-axis, or the ordinate. Frequencies or their variants (such as numbers, percentages, proportions, rates, etc.) should always be plotted on the y-axis. For example, the number of books for the graph in Figure 3-1 appears on the vertical axis, that is, the y-axis.
3. Title: The graph's title should be brief and clearly convey the relationship between the x-axis and the y-axis. For example, in Figure 3-1, the title "Number of Books by Subject" clearly states a relationship between the x-axis (the subject of books) and the y-axis (the number of books).
4. Scale: As shown in Figure 3-1, the scale is related to the y-axis. It should appear in equal increments (such as 2, 4, 6, 8 or 10, 20, 30, 40) but not in unequal increments (such as 2, 10, 100). It is preferable to start the scale with 0, but it is acceptable to start the scale with a non-zero number.
5. Legend: The legend identifies response categories. The legend at the bottom of the graph in Figure 3-2 identifies the categories of sex of respondents.


Figure 3-1 Vertical Bar Graph: Number of Books by Subject
[Vertical bar graph; y-axis: Number, scale 0–18; x-axis: Subject of Books (Arts, Science, Medicine, Engineering)]

Horizontal Bar Graph

When the names of categories are long and need more space to fit the text, the y-axis (vertical) can be used for labelling the names of the categories, and the x-axis can be used for frequencies or percentages. This shifting of axes shows the bars horizontally; hence, it is called a horizontal bar graph. Because the drug names in Figure 3-2 are long, the y-axis has been used for them.

Figure 3-2 Horizontal Bar Graph
[Horizontal bar graph; y-axis categories: Cocaine; Hashish or Marijuana; Heroin, Opium or Morphine; Lysergic Acid Diethylamide (LSD); Solvent. X-axis scale: 0–40. Legend: Boys, Girls]


Line Graph

Line graphs are drawn to show a trend or change in a variable over a time period. The line graph follows the same rules as the bar graph. The major difference is that one should always draw a line graph for time series data. A time series is a set of observations obtained on a single variable over time (e.g., daily counts of crime against property or against persons). The measurement for time could be hourly, daily, weekly, monthly, or annually. Though people often draw bar graphs with time series data (see Figure 3-3A), it is an imaginary line over the tops of these vertical bars that gives you an idea of the change in the variable over time. In this sense, a vertical bar graph with a time series on the x-axis is perceived as a line graph. Figure 3-3 is a line graph drawn from Statistics Canada data on the crime rate from 1962 to 2011. You can clearly make out from this graph that the crime rate in Canada increased from 1962 to 1991 and then started to decline. Although a graph is primarily a visual aid, it can also serve as a preliminary analytical tool. You can also see that the line in Figure 3-3 gives a better perception of the change over time than the bars in Figure 3-3A.

Figure 3-3 Line Graph: Crime Rate, Canada, 1962 to 2011
[Line graph; y-axis: Crime Rate, scale 0–12,000; x-axis: Year, 1962–2010]

Source: Statistics Canada, Canadian Centre for Justice Statistics, Uniform Crime Reporting Survey. Available online.


Figure 3-3A Vertical Bar Graph
[Vertical bar graph of the same crime-rate data as Figure 3-3]

Source: Statistics Canada, Canadian Centre for Justice Statistics, Uniform Crime Reporting Survey. Available online.

Pie Chart

The same rules that apply to bar graphs also apply to pie charts. The main difference is that pie charts should not be drawn on data in which there is multiple counting of response categories. A pie chart is drawn by dividing the 360 degrees of a circle according to the proportions of the response categories or observations in the data-set. Let's say that in a workforce of 60 persons, 42 were men and 18 were women. In other words, 70% were men and 30% were women. Because a circle has 360 degrees, men are represented by a 252-degree angle (360 × 70% = 252) and women are represented by a 108-degree angle (360 × 30% = 108). The pie chart of this workforce by sex is shown in Figure 3-4.
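As a quick check of this arithmetic, here is a small Python sketch (the names are ours) that converts each category's share of the workforce into its slice of the 360 degrees of a circle:

workforce = {"Men": 42, "Women": 18}  # counts from the example
total = sum(workforce.values())
for group, count in workforce.items():
    angle = 360 * count / total  # proportion of the circle
    print(group, f"{count / total:.0%}", f"{angle:.0f} degrees")

This prints 70% and 252 degrees for men, and 30% and 108 degrees for women, matching Figure 3-4.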


Figure 3-4 Pie Chart
[Pie chart of the workforce by sex: Men 70% (252 degrees), Women 30% (108 degrees)]

Pie Chart and Double-Counting in Response Categories

We were able to draw a pie chart in Figure 3-4 because there was no double-counting in the response categories (male and female) of the variable (sex). Everyone was counted just once, either as a male or as a female. If the same workforce were designated into employment equity or affirmative action groups (such as women, racial minorities, Aboriginal peoples, and persons with disabilities), there would be a chance of double-counting. A person could belong to more than one group. For example, a person could be counted once as a racial minority and again as a woman. The actual number of persons and the number of designated statuses would not be the same due to this double-counting. Let's say that among the 20 women in Table 3-4, there are 5 racial minorities, 3 Aboriginal peoples, and 7 persons with disabilities. The designated statuses add up to 60, whereas the number of persons is only 45. This is because 15 women have been counted more than once: once among the women and again among the racial minorities, Aboriginal peoples, or persons with disabilities. This is multiple-counting of groups. If you calculate the percentage of each designated group based on the actual 45 persons, the total will exceed 100% (see the last column of Table 3-4). A pie chart based on these percentages would be an erroneous representation of the data. The total of the categories in a pie should not exceed 100%.


Table 3-4 Workforce by Employment Equity Status

Designated Group           | Designated Statuses in Workforce | Women with Other Designated Status | Actual Total Persons | Designated Status Among 45 Persons
Women                      | 20                               |                                    |                      | 44.4%
Racial Minorities          | 10                               | 5                                  |                      | 22.2%
Aboriginal Peoples         | 5                                | 3                                  |                      | 11.1%
Persons with Disabilities  | 15                               | 7                                  |                      | 33.3%
Non-Designated             | 10                               | 0                                  |                      | 22.2%
Total                      | 60                               | 15                                 | 45                   | 133.3%
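A short Python sketch (with hypothetical names) makes the problem in Table 3-4 visible: percentages computed on the 45 actual persons sum to well over 100%, so they cannot be slices of one pie:

designated = {"Women": 20, "Racial Minorities": 10, "Aboriginal Peoples": 5,
              "Persons with Disabilities": 15, "Non-Designated": 10}
actual_persons = 45  # 60 designated statuses minus 15 double-counted women

total_pct = sum(100 * n / actual_persons for n in designated.values())
print(f"{total_pct:.1f}%")  # 133.3%, which exceeds 100%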

As mentioned, there are basically three types of graphic presentation: the bar graph, the line graph, and the pie chart. Software like Microsoft Excel offers other variants of these graphs, such as an area graph, where the area under the line of a line graph is filled with a colour; a stock graph, where bars are represented by lines; a doughnut-shaped graph, where a pie or a circle is presented in the form of a doughnut; and three-dimensional variants of bar graphs.

Misuse of Graphic Representation

Frequently, graphs are used in the commercial world to entice customers to buy a product or to convince clients that the item presented in the graph is worth their consideration. In this process, graphs are sometimes presented in ways that mislead. This is done primarily by manipulating the graph's x-axis or y-axis.

Eliminating Zero from the Y-Axis

In his book How to Lie with Statistics, Darrell Huff describes a manipulation of the vertical axis and calls it the "Gee Whiz Graph." The manipulation is carried out by eliminating 0 from the scale of the y-axis. Figures 3-5 and 3-6 illustrate this manipulation with graphs that compare the population size of three hypothetical small towns. The scale of the y-axis in Figure 3-5 includes 0. Visually, it appears from this graph as if there is almost no difference in the population size of the three towns. Because all three towns have populations within a very small range (19,000 to 19,250), the differences are understated by the graph when 0 is included on the y-axis. The visual in Figure 3-6 for the same data is dramatically changed by eliminating 0 on the y-axis and replacing it with 18,850. The very small differences in the population size of the three towns shown in Figure 3-5 are dramatically exaggerated in Figure 3-6 without changing the actual data. This is one way to exaggerate or understate small differences in data. The other method to change the visual of a graph without changing the actual data is stretching or shrinking an axis.

Footnote: Huff, Darrell, and Irving Geis (illustrator). How to Lie with Statistics. New York: W.W. Norton & Company Inc., 1954.

Figure 3-5 Vertical Bar Graph with Zero on Y-Axis: Population of Three Towns
[Vertical bar graph; y-axis: 0–25,000; x-axis: Town A, Town B, Town C]


Figure 3-6 Vertical Bar Graph without Zero on Y-Axis: Population of Three Towns
[Vertical bar graph; y-axis: 18,850–19,250; x-axis: Town A, Town B, Town C]

Stretching or Shrinking the Axis

Another manipulation is carried out by stretching or shrinking either the horizontal x-axis or the vertical y-axis without changing the data. When you compare Figures 3-7 and 3-8 (which represent the same data), you'll note that in Figure 3-8 the horizontal x-axis has been stretched and the vertical y-axis has been shrunk slightly. The result is that the differences in the populations of the three hypothetical towns in Figure 3-8 appear less dramatic compared with those in Figure 3-7, without any change in the data.


Figure 3-7 Vertical Bar Graph without Stretched Axis
[Vertical bar graph; y-axis: 0–120; x-axis: Town A, Town B, Town C]

Figure 3-8 Vertical Bar Graph with Stretched X-Axis
[Vertical bar graph; y-axis: 0–120; x-axis: Town A, Town B, Town C]

Avoiding the Misuse of Graphic Representation

The above-mentioned manipulations are contrary to honest presentation. You have to be watchful of these manipulations not only when drawing graphs but also when interpreting graphs prepared by


others. The elimination of 0 on the y-axis can be corrected by starting the y-axis scale with a frequency of 0. The stretching or shrinking of axes can be avoided by adhering to a convention. One simple convention is to keep the vertical and the horizontal axes the same size so that the graph area looks like a square. According to Runyon and Haber (1980), however, most statisticians agree that the height of the vertical axis should be 0.75 of the horizontal axis, with an acceptable proportion being between 0.70 and 0.80. Whatever proportion you choose between these boundaries, keep it consistent throughout your presentation.

Exercises for Practice

Exercise 1: Use the following graph to answer Q1 to Q3.

Birth Rate by Country
[Vertical bar graph; y-axis: Birth Rate, scale 0 to 50 in increments of 5; x-axis: Country. Values: Japan 8.1, Canada 10.3, United States 13.4, India 19.9, Niger 46.1]

Q1: What is the largest number on the vertical axis?

Q2: Of the given five countries, which country has the second-largest birth rate?

Q3: Which country has the highest birth rate, which country has the lowest birth rate, and what is the difference between the highest and the lowest birth rates?

Footnote: Runyon, Richard P., and Audrey Haber. Fundamentals of Behavioural Statistics, 4th ed., Don Mills, Ontario: Addison-Wesley Publishing Company, 1980.


Exercise 2: Use the following vertical bar graph to answer Q4 to Q6.

Books Sold per Month
[Vertical bar graph; y-axis: Number, scale 0 to 35; x-axis: Month (January to June)]

Q4: In which month is the number of books sold half of the number of books sold in March?
A. January
B. April
C. February
D. June

Q5: What does the scale on the left that begins with 0 and ends with 35 represent?
A. the months in the graph
B. number of books sold in a month
C. number of days in each month
D. number of days each month that books were sold

Q6: Which month shows a 400% increase in sales from April?
A. March
B. January
C. June
D. February


Exercise 3: Use the following pie chart to answer Q7 to Q10.

Time Spent per Day by an Average Person by Activity
[Pie chart: Sleeping 33%, Working 21%, Gossiping 17%, Eating 13%, Commuting 8%, Toileting 8%]

Q7. Approximately how many hours a day are spent sleeping?
A. 6 hours
B. 9 hours
C. 8 hours
D. 10 hours
(Hint: Number of hours = Percent × 24; e.g., 8% will be 0.08 × 24 = 1.92 hours, or approximately 2 hours.)

Q8. Approximately how many hours are spent gossiping?
A. 4 hours
B. 2 hours
C. 5 hours
D. 6 hours

Q9. Approximately how many hours are spent working?
A. 6
B. 4
C. 7
D. 5


Q10. On which two activities does an average person spend the same amount of time?
A. commuting and eating
B. sleeping and working
C. toileting and commuting
D. commuting and gossiping

CHAPTER FOUR
CENTRAL TENDENCY: AVERAGE TENDS TO BE A CENTRAL NUMBER

Learning Objectives

In this chapter, we will discuss measures of central tendency. Specifically, you will learn about:
- the three measures of central tendency, i.e., the mode, the median, and the mean;
- how to calculate measures of central tendency from ungrouped and grouped data;
- bimodality and multimodality; and
- the relationship of the skewness and symmetry of a distribution with its measures of central tendency.

Introduction

We frequently refer to the average income of a nation, the average weight of Americans, and the average height of Canadians. To obtain an average, you add all the values of a distribution and then divide this total by the total number of values in the distribution. For example, if you're interested in the average weight of 25 children, you weigh each child, add the weights of all 25 children, and divide by 25. The average, or the mean, can be calculated only for interval-level variables such as income, weight, and height. We frequently misuse the term average in phrases such as an average Joe or an average day. It is statistically wrong to use the term average this way, because the quality of a person or a day cannot be quantified; quality is not an interval-level variable. There are two important traits of any frequency distribution: first, the numbers of a distribution tend to cluster around a value that lies at the


centre of its lowest and its highest values; and second, the numbers in the distribution tend to be dispersed around the central value in a way that can be statistically measured. The first quality is called the central tendency, and the second quality is called the dispersion, or variability. These two qualities of a distribution, combined with the probability of an event occurring, constitute the basis of quantitative statistics. This chapter focuses on the central tendency, and the next two chapters will discuss dispersion and probability, respectively.

Central Tendency

A measure of central tendency is a central point of a distribution that can be used as a proxy for the entire sample or the entire population. For example, countries take a census of their populations, collect data on income, and then calculate an average from the total income of the country. By doing so, countries reduce a large data-set on income into a single value called the average income, or per capita income. This number indicates how much, on average, a citizen of a country earns. This average is also used to compare the incomes of nations. In other words, the average or mean is a representative of the entire distribution and can be used to compare two or more distributions.

Measures of Central Tendency

There are three measures of central tendency: the mode, the median, and the mean. The mode is the number that appears most frequently in a distribution, the median is the middle number that splits the distribution into two equal halves, and the mean is the average number.

Mode

The mode is the most frequent score in a distribution. For example, in the following distribution of the ages of 10 individuals:

Age: 18, 19, 17, 19, 21, 20, 18, 19, 19, 25

19 is the most frequently occurring number. Therefore, 19 is the mode of this distribution. To find the mode, it is advisable to first sort the distribution from the lowest to the highest number and then find the most frequent number. For example, if you first sort this distribution on


age as 17, 18, 18, 19, 19, 19, 19, 20, 21, 25, you'll be able to locate 19 easily as the most frequent number in the distribution.

- In data tabulated in a table form, the mode is the highest frequency or percentage. For example, in Table 4-1, the response "Normal" has the highest frequency and the highest percentage. Therefore, the response "Normal" is the mode.

Table 4-1 Responses for Perceiving Pollution in a City

Response   | Number  | Percent
Very high  | 21,358  | 18%
High       | 17,521  | 15%
Normal     | 32,140  | 28%
Low        | 22,324  | 19%
Very Low   | 23,261  | 20%
Total      | 116,604 | 100%

- Graphically, as shown in Figures 4-1, 4-2, and 4-3, the mode is the biggest slice of a pie chart, the highest bar in a bar graph, or the highest point in a line graph.

Figure 4-1 Mode in a Pie Chart
[Pie chart of the responses in Table 4-1; the largest slice, "Normal," is the mode]

Figure 4-2 Mode in a Vertical Bar Graph
[Vertical bar graph; y-axis: Number of Responses (0–35,000); x-axis: Very high, High, Normal, Low, Very Low; the highest bar, "Normal," is the mode]

Figure 4-3 Mode in a Line Graph
[Line graph; y-axis: Number of Responses (0–35,000); x-axis: Very high, High, Normal, Low, Very Low; the highest point, "Normal," is the mode]

The mode is not an arithmetic measure. It is an observation. Though the mode can be located in data of all levels of measurement, it is most often used with nominal-level data. For example, if you want to know which hair colour has been sold most frequently by a drugstore, you'll use the mode as the measure of central tendency, because the mean and the median of a nominal-level variable are meaningless. In this case, you can determine which colour is bought most frequently but not which colour divides the distribution into two equal halves or which colour is the average colour.
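For data coded as values, the mode can be located with a few lines of Python; this sketch (our variable names) uses the standard library's Counter to tally the ages used earlier:

from collections import Counter

ages = [18, 19, 17, 19, 21, 20, 18, 19, 19, 25]
value, count = Counter(ages).most_common(1)[0]  # most frequent value and its tally
print(value, count)  # 19 occurs 4 times, so the mode is 19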


Bimodality and Multimodality

Some variables can have two or more values that are equally most frequent in a distribution. For example, in many countries, household size is bimodal. In Figure 4-4, you can see that the distribution of household size in this hypothetical population is bimodal. If the frequency distribution of an opinion poll on an issue is bimodal, you could predict that opinion is polarized. Similarly, if there are more than two peaks in a graph, you would identify it as a multimodal distribution. If it is a graph of an opinion poll, you could predict that there are several divergent views about the issue. Bimodality or multimodality indicates that a distribution is not normal. The normal curve has only one mode. The normal distribution is an important concept that you will learn about in detail in Chapter 9.

Figure 4-4 Bimodal Graph: Household Size
[Vertical bar graph; y-axis: frequency (0–40); x-axis: 2 Persons, 3 Persons, 4 Persons, 5 Persons, 6 and More Persons; modes at 3 Persons and 4 Persons]

Median

The median is the middle value of a frequency distribution of a variable sorted from the lowest to the highest value. For example, the following distribution has nine values; the fifth value is the median that splits the distribution into two equal halves. The fifth value is 7; therefore, the median of this frequency distribution is 7.

Age of boys: 4, 5, 6, 7, 7, 8, 9, 10, 11


If you have a similar distribution of the ages of 10 persons, you take the average of the fifth and the sixth values of the distribution to find the middle value. For example, in the following distribution of ten persons, 19 is the fifth value and 20 is the sixth value; therefore, (19 + 20) ÷ 2 = 19.5 is the median.

Age of girls: 17, 18, 18, 19, 19, | 20, 21, 21, 25, 26 (the median, 19.5, falls between the fifth and sixth values)
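The middle-value rule can be written as a small Python function, a sketch under our own naming, that handles both the odd and the even cases shown above:

def median(values):
    s = sorted(values)          # sort from lowest to highest
    n = len(s)
    mid = n // 2
    if n % 2 == 1:              # odd number of values: take the middle one
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2  # even: average the two middle values

print(median([4, 5, 6, 7, 7, 8, 9, 10, 11]))             # 7
print(median([17, 18, 18, 19, 19, 20, 21, 21, 25, 26]))  # 19.5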

Median of Grouped Data

Many times, data is grouped into intervals. Let's say that the ages of 103 persons were available only by the following age groups:

Table 4-2 Number of Persons by Age Interval

Age    | Frequency (f) | Cumulative Frequency
5–9    | 10            | 10
10–14  | 17            | 27
15–19  | 31            | 58
20–24  | 22            | 80
25–29  | 23            | 103
Total  | 103           |

The median is the middle value in a frequency distribution. In the distribution in Table 4-2, the total number of frequencies is 103, and the median value will be the age of the 52nd person. The age of the 52nd person will divide the distribution into two equal halves. The values of the age of 51 persons will be below the age of the 52nd person, and the values of the age of 51 persons will be above the age of this 52nd person. The 52nd person has to be in the 15–19 interval because cumulative frequencies corresponding to this interval have reached up to the 58th person. We call this 15–19 interval the median interval. Looking at the cumulative frequencies, we see that the cumulative frequencies have reached up to the 27th person in the interval below the median interval. Because 27 + 25 = 52; the age of the 25th person in the median interval (15–19) will be the median age of this distribution. We have to figure out the age of this 25th person in the median interval (15–19) to find out the median of this distribution. It has to be above the lower limit of the median interval (15 years) and less than the upper limit of the median interval (19 years). Next we need to figure out what proportion of 5 years of the


median interval we should add to its lower limit (15 years) to reach the age of the 25th person. The number we need to add to 15 years is half of the total frequencies (N/2, that is, 103/2) minus the cumulative frequencies reached before the median interval (27), divided by the number of frequencies (31) of the median interval (15–19). This gives the proportion of the interval size that needs to be added to the lower limit (15) of the median interval. Because the interval size (i) is 5, we multiply the proportion by 5. The formula can be written as:

Median = L + {(N/2 – CF) ÷ f} × i,

where L is the lower limit of the interval containing the median (15), N is the total number of frequencies of the distribution (103), CF is the number of cumulative frequencies of the interval preceding the median interval (27), f is the frequency of the median interval (31), and i is the interval size (5).

Median = 15 + {(103/2 – 27) ÷ 31} × 5
       = 15 + {(51.5 – 27) ÷ 31} × 5
       = 15 + {24.5 ÷ 31} × 5
       = 15 + (0.79 × 5)
       = 15 + 3.95 = 18.95 is the median.

- The median (or, for that matter, the mode and the mean) of large data-sets is difficult to calculate manually; most of the time you'll use a computer to calculate the median of a large data-set.
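Here is a minimal Python sketch of the grouped-median formula just derived, applied to Table 4-2; the interval bounds, frequencies, and names are taken from (or invented for) this example:

intervals = [(5, 9, 10), (10, 14, 17), (15, 19, 31), (20, 24, 22), (25, 29, 23)]

N = sum(f for _, _, f in intervals)  # 103
cf = 0
for low, high, f in intervals:
    if cf + f >= N / 2:              # first interval whose cf reaches N/2 is the median interval
        L, i = low, high - low + 1   # lower limit (15) and interval size (5)
        print(L + ((N / 2 - cf) / f) * i)  # 15 + (51.5 - 27)/31 * 5 = 18.95...
        break
    cf += f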


Some Limitations of the Median

It follows from the above discussion that:
- The median, the middle score of a frequency distribution of a variable, cannot be calculated for nominal-level data.
- Though the median can be calculated for ordinal-level data, it may not always be meaningful.
- The median can be calculated without restriction for interval-level and ratio-level data.

Mean

The mean is also known as the arithmetic average. It is the sum of all the observations in a distribution divided by the total number of observations. The sum of all the observations is denoted by the Greek letter sigma (Σ), which was discussed in Chapter 1. The formula for the mean is:

x̄ = Σxᵢ ÷ n,

where x̄ is the mean, xᵢ stands for the observations, and n is the total number of observations. Let's say we have six observations denoted by i, and i goes from 1 to 6. These observations are x₁, x₂, x₃, x₄, x₅, x₆, and their values are x₁ = 2, x₂ = 1, x₃ = 4, x₄ = 6, x₅ = 5, x₆ = 8.

Observations = 2, 1, 4, 6, 5, 8; n = 6
Sum (Σxᵢ) = 2 + 1 + 4 + 6 + 5 + 8 = 26
x̄ = 26 ÷ 6 = 4.3
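The same calculation in Python is one line of arithmetic; this sketch uses the six observations above:

observations = [2, 1, 4, 6, 5, 8]
mean = sum(observations) / len(observations)  # 26 / 6
print(round(mean, 1))  # 4.3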

Mean of Grouped Data

The mean of grouped data is easier to conceptualize than the median. If we have to calculate the mean age of the same 103 persons, we take the following steps:


Table 4-3 Frequency, Midpoint, and Estimated Total Years by Age Interval

Age    | Frequency (f) | Midpoint (x) | Estimated Total Years (f × x)
5–9    | 10            | 7            | 70
10–14  | 17            | 12           | 204
15–19  | 31            | 17           | 527
20–24  | 22            | 22           | 484
25–29  | 23            | 27           | 621
Total  | 103           |              | 1906

STEP 1: Find the Midpoint of Each Interval

First, calculate the midpoint of each interval, as shown in the third column of Table 4-3. To find the midpoint of an interval, add its lower and upper limits and divide by 2. For example, for the 5–9 age interval, the midpoint is (5 + 9) ÷ 2 = 7 years. The five single years in this interval are 5, 6, 7, 8, 9; and 7 is the middle number of these five numbers; therefore, the midpoint of the interval is 7. We will assume that all children in this interval are 7 years old. Similarly, calculate the midpoints of all the other intervals.

STEP 2: Estimate the Total Number of Years of Age for Each Interval

The estimated total number of years of age for the 10 children in the 5–9 interval is 7 × 10 = 70 years. Similarly, we calculate the estimated total number of years in the other intervals, assuming that all 17 children in the 10–14 interval are 12 years old; all 31 persons in the 15–19 interval are 17 years old; all 22 persons in the 20–24 interval are 22 years old; and, finally, all 23 persons in the 25–29 interval are 27 years old.

STEP 3: Find the Sum of the Estimated Years of Age

As shown in Table 4-3, the sum of the estimated number of years of age of the 103 persons is 1906 years.

STEP 4: Calculate the Mean

The mean age of these 103 persons is 1906 divided by 103, as shown below:

x̄ = Σxᵢ ÷ n = 1906 ÷ 103 = 18.5
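The four steps can be condensed into a short Python sketch (our names) that recomputes Table 4-3 and the grouped mean:

intervals = [(5, 9, 10), (10, 14, 17), (15, 19, 31), (20, 24, 22), (25, 29, 23)]

total_years = sum(((low + high) / 2) * f for low, high, f in intervals)  # midpoints times frequencies: 1906
N = sum(f for _, _, f in intervals)                                      # 103
print(round(total_years / N, 1))                                         # 18.5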

Weighted Mean: The Mean of Means

The weighted mean is the mean of several means. The sample size of each group must be known to calculate the weighted mean. Let's say we have four classrooms of grade 8 students in a school, and the mean score (x̄ᵢ) and the class size (nᵢ) for each classroom are as follows:

Class A: x̄₁ = 65, n₁ = 40
Class B: x̄₂ = 70, n₂ = 35
Class C: x̄₃ = 80, n₃ = 20
Class D: x̄₄ = 75, n₄ = 30

The principal wants to know the mean score of all the grade 8 students in the school. Can you just add these four means and divide by 4 (65 + 70 + 80 + 75 = 290; 290 ÷ 4 = 72.5) and tell the principal that the mean score for all grade 8 students is 72.5? The answer is no, because the mean is influenced by the number of observations (n) in a distribution. If the number of students in each class were the same, you could just add the four means and divide by 4. But because the number of students in each class is not the same, we need to take the number of students in each classroom into consideration to calculate a weighted mean. The following steps accomplish this:

STEP 1: Find the Total Score for Each Class

Multiply the mean (x̄ᵢ) of each classroom by its number (nᵢ) of students to find the total score (x̄ᵢ × nᵢ) for each classroom. These total scores are given in the last column of Table 4-4.

STEP 2: Find the Sum of the Scores of All the Groups

Add the scores of each class to find the total score of the four classes.


STEP 3: Calculate the Weighted Mean from the Sum of the Scores

Now divide the total score by the total number of students in the four classrooms to calculate the weighted mean. This gives a weighted mean of 71.2, which is the true reflection of the mean score of all the grade 8 students in the four classrooms.

Table 4-4 Calculation of Weighted Mean

Classroom | Mean (x̄ᵢ) | Number (nᵢ) | Total Scores (x̄ᵢ × nᵢ)
A         | 65         | 40          | 2600
B         | 70         | 35          | 2450
C         | 80         | 20          | 1600
D         | 75         | 30          | 2250
Total     | 290        | 125         | 8900

Weighted Mean = 8,900 ÷ 125 = 71.2
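A sketch of the same computation in Python (our names), weighting each class mean by its class size:

means = [65, 70, 80, 75]
sizes = [40, 35, 20, 30]
weighted = sum(m * n for m, n in zip(means, sizes)) / sum(sizes)
print(round(weighted, 1))  # 8900 / 125 = 71.2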

Skewness, Mean, and Median

When frequencies are evenly distributed around the middle value (the mean), the distribution looks symmetrical, as shown in Figure 4-5. A distribution becomes skewed when most of the values are distributed toward one side; the mean of the distribution is pulled toward the skew (tail). Skewness occurs when a distribution is not symmetrical. The symmetry or skewness of data affects the measures of central tendency. Table 4-5 provides hypothetical data for three types of distribution, namely, the symmetrical, the positively skewed, and the negatively skewed. Let's use these three hypothetical distributions to show the influence of skewness on the measures of central tendency:


Table 4-5 Symmetrical, Positively Skewed, and Negatively Skewed Distributions

Symmetrical Distribution
Age    | f  | Age × f
4      | 1  | 4
5      | 1  | 5
6      | 3  | 18
7      | 6  | 42
8      | 3  | 24
9      | 1  | 9
10     | 1  | 10
Total  | 16 | 112
Mean 7.0; Median 7.0; Mode 7.0

Positively or Right-Skewed Distribution
Age    | f  | Age × f
6      | 5  | 30
7      | 4  | 28
8      | 3  | 24
9      | 1  | 9
10     | 1  | 10
Total  | 14 | 101
Mean 7.2; Median 7.0; Mode 6.0

Negatively or Left-Skewed Distribution
Age    | f  | Age × f
4      | 1  | 4
5      | 1  | 5
6      | 3  | 18
7      | 4  | 28
8      | 5  | 40
Total  | 14 | 95
Mean 6.8; Median 7.0; Mode 8.0

Symmetrical Distribution

Order from low to high all the age values of the symmetrical distribution given in Table 4-5:

4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10

- The total sum of these values is 112, and the total number (n) of children is 16. The mean, x̄ = Σxᵢ ÷ n, is 112 ÷ 16 = 7.
- There are 16 values; the average of the eighth and ninth values is the median. These values are 7 and 7, respectively. The average of 7 and 7 is 7; therefore, the median is also 7.
- The most frequently occurring value in this distribution is also 7; therefore, the mode is also 7.

In a perfectly symmetrical distribution, the mean, median, and mode are the same. The numbers of frequencies on the left and the right side of the mean are also the same (Figure 4-5).


Figure 4-5 Symmetrical Distribution
[Vertical bar graph; y-axis: Frequency (0–7); x-axis: Age (4–10)]

Positively or Right-Skewed Distribution

Order from low to high all the age values of the positively skewed distribution given in Table 4-5:

6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 10

- The total of these values is 101, and the total number of children (n) is 14. The mean of this distribution is Σxᵢ ÷ n = 101 ÷ 14 = 7.2.
- Because there are 14 values, the average of the seventh and eighth values, which are 7 and 7, respectively, is the median. The average of 7 and 7 is 7; therefore, the median is 7.
- The median is not affected by the skewness, but the mean is pulled toward the skew (tail). In a positively skewed distribution, the skew (tail) is toward the higher values; therefore, the mean (7.2) is higher than the median of 7.0 (Figure 4-6).


Figure 4-6 Positively or Right-Skewed Distribution
[Vertical bar graph; y-axis: Frequency (0–5); x-axis: Age (6–10); the median lies below the mean]

There are many variables whose distributions are positively skewed; for example, as shown in Figure 4-7, income in Canada is a positively skewed variable.

Figure 4-7 Individual Total Income, Canada, 2010: Positively or Right-Skewed
[Vertical bar graph; y-axis: number of persons (0–25,000,000)]

Source: Statistics Canada. 27 June 2012 (modified). CANSIM, Table 111-008. Available online.


Negatively or Left-Skewed Distribution

Order from low to high all the age values of the negatively skewed distribution given in Table 4-5:

4, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8

- The total of these values is 95, and the total number of children (n) is 14. The mean of this distribution is Σxᵢ ÷ n = 95 ÷ 14 = 6.8.
- There are 14 values; the average of the seventh and eighth values, which are 7 and 7, respectively, is the median. The average of 7 and 7 is 7; therefore, the median is 7.0.
- As mentioned earlier, the median is not affected by the skewness, but the mean is pulled toward the skew (tail). In a negatively skewed distribution, the skew (tail) is toward the smaller values; therefore, the mean of 6.8 is lower than the median of 7.0 (Figure 4-8).

Figure 4-8 Negatively or Left-Skewed Distribution
[Vertical bar graph; y-axis: Frequency (0–5); x-axis: Age (4–8); the mean lies below the median]

In real life, death rates by age are negatively skewed because the number of deaths in the population increases with increasing age. Figure 4-9 shows that the distribution of Canadian death rates by age is negatively or left-skewed.
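A short Python sketch (standard library only) confirms the pattern in Table 4-5: the mean sits above the median in the right-skewed distribution and below it in the left-skewed one:

from statistics import mean, median

symmetrical  = [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10]
right_skewed = [6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 10]
left_skewed  = [4, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8]

for name, data in (("symmetrical", symmetrical),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)):
    print(name, round(mean(data), 1), median(data))
# symmetrical 7.0 7.0; right-skewed 7.2 7.0; left-skewed 6.8 7.0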


Figure 4-9 Death Rates by Age, Canada, 2000–02: Negatively or Left-Skewed Distribution
[Line graph; y-axis: death rate (0–0.7); x-axis: age, 0 years to 105 years]

Source: Statistics Canada. 31 July 2006 (Modified). Table 2a Complete Life Table Canada 2000 to 2002 Male. Available online.

Summary Table: Level of Measurement and Type of Central Tendency

Use Table 4-6 to decide which measure of central tendency to use according to the level of measurement of the variable.

Table 4-6 Measure of Central Tendency and Level of Measurement

Type of Variable             | Best Measure of Central Tendency
Nominal                      | Mode
Ordinal                      | Median
Interval/Ratio (Skewed)      | Median
Interval/Ratio (Not skewed)  | Mean

Exercises for Practice

Q1.

A restaurant has six employees. Their weekly salaries are $60, $60, $80, $90, $70, and $30. Find the mode, median, and mean for the salary of these six employees.

Q2.

The ages of 15 factory employees are:


45, 40, 46, 49, 35, 43, 42, 46, 36, 28, 32, 47, 48, 30, 36

Age    | Frequency
45–49  |
40–44  |
35–39  |
30–34  |
25–29  |

A. Complete the frequency column in the table above.
B. Find the modal interval.
C. Find the interval that contains the median.

Q3.

The following table gives the selling prices of nine houses:

Price per House | Number of Houses | Total Price
$200,000        | 1                |
$300,000        | 4                |
$400,000        | 3                |
$1,900,000      | 1                |
Total           | 9                |

A. Find the mean price of the nine houses.
B. Find the median price of the nine houses.
C. Of these two measures, which is the better measure of central tendency for this distribution?

CHAPTER FIVE
VARIABILITY: MEASURES OF DISPERSION

Learning Objectives

This chapter deals with the concept of the dispersion of a distribution of a variable. You will learn the following three measures of dispersion:
- the range;
- the variance; and
- the standard deviation.

Introduction

The comparison of the average of one distribution with the average of another becomes meaningless without a reference to the spread of their values. For example, the average personal income in the United States increased from $22,600 in 1979 to $39,600 in 2007, yet Americans showed their displeasure with income inequality through the Occupy Wall Street protests because a U.S. budget paper showed that over these years the income of the top 1% of the American population increased by 275%, whereas the income of the bottom 20% grew by only 18%. This means that income inequality in the United States increased over these years. In other words, income has become more dispersed in the United States because there is now a wider gap between the incomes of the poorest and the richest persons. This shows the importance of reporting dispersion or variability in data, along with the mean or the median, to give an honest picture. In other words, we need to know how dispersed a distribution is to better understand its typical value (central tendency). For example, in the morning, when we check the Weather Channel before leaving the house, we want to know not only the average temperature for that day but also the day's high and low temperatures, that is, the range of the temperature. We want to know the variability in temperature to fully prepare for the weather of the day.

Footnotes: "United States Personal Income per Capita 1929-2008: Inflation Adjusted (2008$)," Demographica, 2008. http://www.demographia.com/db-pc1929.pdf. Sharma, Raghubar D. Poverty in Canada, Don Mills, ON: Oxford University Press, 2010, p. 135.

Dispersion or Variability

Dispersion or variability refers to the amount of variation in the values of a variable.
- The more the values of a distribution are dispersed around the mean, the higher the variability.
- The greater the dispersion of the values of a distribution, the less informative is the mean.

The popular measures of dispersion are the range, the mean deviation, the variance, and the standard deviation.

The Range

The range is the difference between the highest and the lowest values of a distribution.

Range = H – L, where H is the highest value and L is the lowest value.

Let's say we have to find the range of the ages of the following 9 college students:

23, 21, 19, 22, 18, 19, 18, 24, 23

We take the following steps:

STEP 1: Sort the Data from the Lowest to the Highest Value

18, 18, 19, 19, 21, 22, 23, 23, 24


STEP 2: Find the Lowest and the Highest Values in the Distribution, and Subtract the Lowest Value from the Highest Value

L = 18 and H = 24; Range = 24 – 18 = 6

- The range is a crude measure of variability.
- It is used infrequently.
- It is heavily influenced by extreme values; hence, it's not a very useful measure of variability.
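In Python, the range is a one-liner; this sketch uses the nine ages above:

ages = [23, 21, 19, 22, 18, 19, 18, 24, 23]
print(max(ages) - min(ages))  # 24 - 18 = 6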

The Deviation

A deviation is the distance between any value and the mean of a variable.

Deviation = x – x̄

Let's say the mean IQ of a group of students is 107 and Mary's IQ is 120; the deviation of Mary's IQ score from the mean is:

x = 120; x̄ = 107
x – x̄ = 120 – 107 = 13.

If John's IQ is 95, the deviation of John's IQ score from the mean is 95 – 107 = –12. A deviation can be a positive or a negative number. You will soon find out that the sum of the deviations of a distribution from its mean is 0; therefore, the deviation by itself is not a useful concept. The deviation gives the relative difference of an individual's score from the mean of a group. For example, if the average IQ of the population is 90 and Stephen Hawking's IQ is 180, one can say that Stephen Hawking's IQ is double the population average.

- Because the sum of all deviations from the mean is 0, the deviation is not a useful measure of dispersion without further manipulation.



The Mean or Average Deviation

The sum of the absolute values of the deviations from the mean, divided by the total number of observations, is the mean deviation or average deviation. A deviation becomes an absolute deviation when you ignore its negative or positive sign (see Table 5-1).

Mean deviation = Σ|xi – x̄| / n

The vertical bars around xi – x̄ in the formula indicate that the positive or negative signs of the deviations are to be ignored while adding up the deviations of the distribution. Let's consider a variable xi with the values 2, 3, 4, 7. As shown in Table 5-1, the mean of these four values is 4, and the deviations from the mean are –2, –1, 0, and 3. Because the total of these deviations is 0, we ignore the signs, consider the absolute deviations 2, 1, 0, 3, and sum them up. The sum of these absolute deviations is 6. That sum of 6 divided by the total number of observations, 4, is 1.5. The mean deviation of this distribution is 1.5. Table 5-1 provides a systematic calculation of the mean deviation.

Table 5-1 Calculation of the Mean Deviation

i       xi    Deviation (xi – x̄)    Absolute Deviation |xi – x̄|
1.       2    2 – 4 = –2             2
2.       3    3 – 4 = –1             1
3.       4    4 – 4 = 0              0
4.       7    7 – 4 = 3              3
Total   16    0                      6

Mean (x̄) = 16 ÷ 4 = 4;  Mean deviation = 6 ÷ 4 = 1.5

• The higher the value of the mean deviation, the greater the dispersion of the values of a variable.

Variance and Standard Deviation

The absolute values of deviations used to calculate the mean deviation have limited mathematical utility because the half of the deviations that are negative no longer retain their real relationship with the mean. Once we decide to use a negative deviation as a positive value, its real relationship with the mean is lost. Moreover, we are giving the same treatment to the negative and the positive values, which is mathematically unreasonable. There is a better way to give the same treatment to the negative as well as the positive values: square them. This leads to another measure of dispersion called the variance.

Variance

The variance is the sum of squared deviations divided by the total number of observations. Its formula is written as follows:

s² = Σ(xi – x̄)² / n    (sample);    σ² = Σ(Xi – μ)² / N    (population)

• A lowercase s squared (s²) is used as the symbol for the variance of a sample; the lowercase Greek letter sigma squared (σ²) is used as the symbol for the variance of a population. A lowercase n is used for the total observations of a sample, and an uppercase N is used for the total observations of a population.
• The numerator, Σ(xi – x̄)², in the equation is the sum of all the squared deviations; it is commonly called the sum of squares.

Calculating Variance

We apply the following steps to calculate the variance from the data on a variable xi with n observations given in Table 5-2:
1. Subtract the mean (x̄) from each value of the distribution to obtain the deviations (xi – x̄).
2. Square each deviation to obtain the squared deviations (xi – x̄)².
3. Sum the squared deviations to obtain the sum of squares, Σ(xi – x̄)².
4. Divide the sum of squares by the number of observations, n.



Table 5-2 Calculation of Variance

i       xi    Deviation (xi – x̄)    Squared Deviation (xi – x̄)²
1.       2    2 – 4 = –2             4
2.       3    3 – 4 = –1             1
3.       4    4 – 4 = 0              0
4.       7    7 – 4 = 3              9
Total   16    0                      14

n = 4;  x̄ = 16 ÷ 4 = 4;  Σ(xi – x̄)² = 14

The variance:

s² = Σ(xi – x̄)² / n = 14 / 4 = 3.5

Standard Deviation

The standard deviation is another improvisation on the variance: it is simply the square root of the variance. To calculate the standard deviation, we first calculate the variance and then take one more step: we find the square root of the value of the variance. The standard deviation of the above variable is the square root of the variance, 3.5:

s = √(Σ(xi – x̄)² / n)

s = √3.5 = 1.87
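To see these measures side by side, here is a minimal sketch in plain Python (not from the book) that reproduces the numbers above for the values 2, 3, 4, 7. Note that it uses the population-style divisor n, as the formulas in this chapter do; library routines such as statistics.stdev default to the n – 1 divisor instead.

```python
values = [2, 3, 4, 7]
n = len(values)

mean = sum(values) / n                                   # 16 / 4 = 4.0
value_range = max(values) - min(values)                  # 7 - 2 = 5
mean_deviation = sum(abs(x - mean) for x in values) / n  # 6 / 4 = 1.5
variance = sum((x - mean) ** 2 for x in values) / n      # 14 / 4 = 3.5
std_dev = variance ** 0.5                                # sqrt(3.5) ~ 1.87

print(mean, value_range, mean_deviation, variance, round(std_dev, 2))
# 4.0 5 1.5 3.5 1.87
```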

Standard Deviation of Grouped Data

Let's say the age of 10 persons is grouped in five-year age groups as shown in Table 5-3 and we have to calculate the standard deviation of the age of these 10 persons. Because we need to calculate deviations from the mean, first we calculate the mean of this distribution using the following steps:

STEP 1: Find the Midpoints of Each Interval
To find the midpoint of an interval, add the lower and the upper limits of the interval and divide the sum by 2. For example, for the 5–9 years age interval, the midpoint is (5 + 9) ÷ 2 = 7. The five single years in this interval are 5, 6, 7, 8, 9, and 7 is the middle number of this interval. Because we don't know the precise ages of the four persons in this interval, we will use the midpoint as a proxy for the age of each person. We will assume that every child in the 5–9 years age interval is 7 years old.

STEP 2: Estimate the Total Years of Age for Each Interval
Assuming that all the children in the 5–9 years interval are 7 years old, the estimated total years of age for the four children in this interval is 7 × 4 = 28 years. Similarly, we assume that all three children in the 10–14 years interval are 12 years old, both children in the 15–19 years interval are 17 years old, and the person in the 20–24 years interval is 22 years old. Multiply the midpoint of each interval by its frequency to find the estimated years of age for each interval.

STEP 3: Find the Sum of the Estimated Years of Age
Add the estimated years of all the age groups. As shown in the fourth column of Table 5-3, the total number of estimated years is 120.

STEP 4: Calculate the Mean
The mean age of these persons is the total number of estimated years divided by the total number of frequencies: 120 ÷ 10 = 12 years.

Table 5-3 Calculation of Standard Deviation of Grouped Data

Age     Frequency (f)   Midpoint (xi)   Estimated Years (f × xi)   (xi – x̄)        (xi – x̄)²   (xi – x̄)² × f
5–9     4               7               28                         7 – 12 = –5      25           100
10–14   3               12              36                         12 – 12 = 0      0            0
15–19   2               17              34                         17 – 12 = 5      25           50
20–24   1               22              22                         22 – 12 = 10     100          100
Total   10                              120                                         150          250

Once we know the mean, we can use the following steps to calculate the standard deviation:



STEP 5: Find the Deviations from the Mean for Each Age Interval
Subtract the mean (x̄) from the midpoint of each interval to obtain the deviation (xi – x̄) for each interval.

STEP 6: Find the Square of the Deviations for Each Age Interval
Square each deviation (xi – x̄) to obtain the squared deviations (xi – x̄)².

STEP 7: Find the Square of the Deviations for Each Person
Assuming that each person in an interval has the same squared deviation, multiply the squared deviation of each interval by its frequency to find the squared deviations for the persons in the interval.

STEP 8: Find the Sum of Squares
Add the squared deviations for all persons to obtain the sum of squares, Σ(xi – x̄)² × f (see the last column of Table 5-3).

STEP 9: Calculate the Variance
Divide the sum of squares by the total number of frequencies using the following formula to obtain the variance:

s² = Σ(xi – x̄)² × f / n

s² = 250 / 10 = 25

STEP 10: Find the Standard Deviation
Take the square root of the variance:

s = √25 = 5
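As a cross-check on Steps 1 through 10, here is a minimal Python sketch, assuming the grouped data are represented as (midpoint, frequency) pairs as in Table 5-3.

```python
groups = [(7, 4), (12, 3), (17, 2), (22, 1)]  # (midpoint, frequency) per interval

n = sum(f for _, f in groups)                                 # 10 persons
mean = sum(m * f for m, f in groups) / n                      # 120 / 10 = 12.0
sum_of_squares = sum((m - mean) ** 2 * f for m, f in groups)  # 250.0
variance = sum_of_squares / n                                 # 25.0
std_dev = variance ** 0.5                                     # 5.0

print(mean, sum_of_squares, variance, std_dev)  # 12.0 250.0 25.0 5.0
```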

An Alternate Method

You can use a simpler method in which you do not have to calculate the mean of the distribution and take deviations from the mean to find the sum of squares. The formula for the simpler method is as follows:

s² = [Σfxi² – (Σfxi)² / Σf] / Σf,

where f stands for the frequencies and xi for the midpoints.

Table 5-4 Standard Deviation of Grouped Data by Simpler Method

Age     Frequency (f)   Midpoint (xi)   f × xi   xi²    f × xi²
5–9     4               7               28       49     196
10–14   3               12              36       144    432
15–19   2               17              34       289    578
20–24   1               22              22       484    484
Total   10                              120      966    1690

From Table 5-4:

Σf = 10
Σfxi² = 1690
(Σfxi)² = (120)² = 14,400

Now we can find the variance by plugging these three values into the formula:

s² = [Σfxi² – (Σfxi)² / Σf] / Σf
s² = (1690 – 14,400 / 10) / 10
s² = (1690 – 1,440) / 10
s² = 250 / 10 = 25

Take the square root of the variance to find the standard deviation:

s = √25 = 5

In real life, you'll rarely calculate the standard deviation from grouped data by hand. Most of the time, you'll use your computer for such calculations. Because the purpose of this book is to give you an understanding of the concepts behind the statistics, you've been shown the actual calculations involved in determining the standard deviation from grouped data.
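In that spirit, the computational formula is equally easy to sketch in Python; the groups variable below is the same assumed (midpoint, frequency) representation used earlier.

```python
groups = [(7, 4), (12, 3), (17, 2), (22, 1)]  # (midpoint, frequency) pairs

sum_f = sum(f for _, f in groups)             # Σf   = 10
sum_fx = sum(m * f for m, f in groups)        # Σfx  = 120
sum_fx2 = sum(m * m * f for m, f in groups)   # Σfx² = 1690

variance = (sum_fx2 - sum_fx ** 2 / sum_f) / sum_f  # (1690 - 1440) / 10 = 25.0
std_dev = variance ** 0.5                           # 5.0
print(variance, std_dev)                            # 25.0 5.0
```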

Coefficient of Variation (CV)

The coefficient of variation is the ratio of the standard deviation to the mean:

CV = standard deviation / mean

This measure is useful for comparing the degree of variability of two distributions even when the means of the distributions are quite different. Let's calculate and compare the coefficients of variation of the distributions given in Tables 5-2 and 5-3. The standard deviation for the distribution in Table 5-2 was 1.87 and the mean was 4:

s = 1.87; x̄ = 4
CV = 1.87 / 4 = 0.47

The standard deviation for the distribution in Table 5-3 was 5 and the mean was 12:

s = 5; x̄ = 12
CV = 5 / 12 = 0.42



Though the means of the two data sets are quite different (4 versus 12), the CV values of 0.47 and 0.42 indicate that the degrees of variation in the two data sets are not very different.
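A minimal sketch of this comparison; cv is an illustrative helper name, not a standard library function.

```python
def cv(std_dev, mean):
    # Coefficient of variation: dispersion per unit of mean.
    return std_dev / mean

print(round(cv(1.87, 4), 2))  # 0.47 for the Table 5-2 data
print(round(cv(5, 12), 2))    # 0.42 for the Table 5-3 data
```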

Practical Use of the Coefficient of Variation

In the investment world, the coefficient of variation is used to assess the amount of risk relative to the amount of return from an investment such as a stock. A lower ratio of the standard deviation to the mean return implies less risk per unit of expected return, that is, a more favourable risk-return tradeoff. The risk-return tradeoff is the principle that low risk is associated with low potential returns and high risk is associated with high potential returns.

Exercises for Practice

Exercise 1: Ten students are graded on a test. They receive the following scores: 6, 7, 8, 7, 5, 8, 7, 5, 8, 9

Q1. Calculate the mean.
Q2. Find the range of this distribution.
Q3. Calculate the total absolute deviation.
Q4. Calculate the variance.
Q5. Find the standard deviation.

Exercise 2: Use the following grouped data on the number of hours spent watching television to answer Q6 to Q11. (The last three columns are to be completed in Q7–Q9.)

Hours   Midpoint (xi)   Frequency (f)   f × xi   (xi – x̄)   (xi – x̄)²   (xi – x̄)² × f
5–9     7               1               7
10–14   12              4               48
15–19   17              2               34
20–24   22              3               66
Total   58              10              155



Q6. Find the mean.
Q7. Complete the column for the deviations from the mean (xi – x̄).
Q8. Complete the column for the squared deviations (xi – x̄)².
Q9. Complete the last column for the squared deviations multiplied by the frequencies (xi – x̄)² × f.
Q10. Calculate the variance (s²).
Q11. Calculate the standard deviation (s).
Q12. Calculate the coefficient of variation (CV).

CHAPTER SIX

PROBABILITY

Learning Objectives

Probability is crucial to the learning of inferential statistics. In this chapter, you will learn about:
• randomness;
• rules of probability;
• the difference between theoretical and empirical probability;
• the law of large numbers;
• the multiplication and addition rules;
• odds and probability; and
• the role of probability in inferential statistics.

Introduction

In the previous three chapters, we discussed descriptive statistics. The visual representation of data through graphs and charts and the measures of central tendency and variability are helpful for describing data. But the use of statistics goes beyond the description of data. Statistics is a tool that helps us to test our hunches and hypotheses and to make inferences from data. Let's consider a hypothetical example: we want to test whether the size of a human's head is related to the IQ of that person. We test four men with larger head circumferences and four men with smaller head circumferences. We find that the average IQ of the men with the larger heads is 107 and that of the men with the smaller heads is 105. From our experiment, can we assume that the men with the larger heads are smarter? The answer is no! There are two problems: First, the sample size of eight persons is too small to generalize to all men in the population. Second, it may be by chance that the men with the larger heads in our sample happen to have a slightly higher mean IQ than those with the



smaller heads. The first problem deals with sampling and sample size, which we will discuss in Chapter 7. The second problem pertains to probability. It is important to understand probability theory to understand the effect of the chance factor in decision making. Randomness is key to the probability of a sample being representative of the population from which it is drawn.

The Concept of Randomness

It is difficult and expensive to collect data on the entire population of a city, a large community, or a country. We have to rely on samples that are representative of the population from which they are drawn. In other words, we have to select a sample in such a way that it is representative of its population. One way to achieve this is to use the random sampling technique. Random sampling is a technique used to select a sample from a population in such a way that each member of the population has an equal probability of being selected. For example, when we pull a playing card from a well-shuffled deck, we are trying to draw the card randomly, because shuffling the deck well gives each card an equal chance of being chosen. A randomly selected sample is representative of its population, and you can generalize from the characteristics of a random sample to those of the population. As mentioned, we will discuss sampling in the next chapter. For now, you need to know that equal probability means that each member of the population has an equal chance of being selected into the sample.

Moving beyond Descriptive Statistics

Probability is the likelihood or chance of an event happening. Probability and probability theory are tools we can use to determine the likelihood of one event taking place relative to all other possible events. For example, it is common sense that in a toss of a coin the likelihood of the coin landing heads up is 1 out of 2 events (heads or tails). However, saying that there are 19 out of 20 chances that a party will win an election with 45% of the votes is a more complicated probability statement. The ability to make a probability statement allows us to move from descriptive statistics to inferential statistics. The probability statement is a shift from description to prediction. It allows us to generalize from a sample to the entire population from which the sample has been selected. In other words, it is a shift from that which



exists in the form of sample data to that which does not exist, such as data on an entire population. In a way, we move from a frequency distribution to a probability distribution. Let's take a random sample of 100 residents of the city of Paris and find that on average a resident drinks five cups of coffee a day. The distribution of this sample is a frequency distribution from which we calculated the mean, five cups of coffee a day. But when we generalize it to the entire population of Paris and say that on average a resident of Paris drinks five cups of coffee a day, we are implying the frequency distribution of the entire population of Paris, which is unknown. This unknown frequency distribution is a probability distribution, and our statement is a probability statement. In statistics, we attach a confidence level to a probability statement. When we say that there is a 95% chance that Paris residents on average drink five cups of coffee a day, we attach a confidence level to a probability statement.

Probability Statement

A probability statement is expressed as a chance that an event will or will not occur. For example, saying that there are 19 out of 20 chances that it will rain today is a probability statement.

Rules of Probability

The probability of the occurrence of an event is always between 0 and 1. Zero probability means the event will not occur, a probability of 0.5 means there is a 50% chance the event will occur, and a probability of 1 means the event will definitely occur, as there is a 100% chance the event will occur.
1. If an event will not occur, the probability that the event will occur is 0.
2. If an event will definitely occur, the probability that the event will occur is 1.
3. The probability of the occurrence of an event varies between 0 and 1.
4. The sum of the probabilities of all possible outcomes of an experiment is 1.



Theoretical and Empirical Probability

Theoretical Probability

The theoretical probability of an event is the number of ways the event can occur divided by the total number of possible occurrences. It is the probability of events that come from a sample space of equally likely outcomes. The total set of equally likely possible outcomes (occurrences) is called the sample space. For example, in a toss of a coin there are two equally likely possible outcomes—heads or tails—therefore, the sample space is 2. In rolling a fair die, there are six equally likely possible outcomes—1, 2, 3, 4, 5, and 6—therefore, the sample space is 6. Theoretical probability can thus also be defined as the number of possible ways an event can occur divided by the sample space. Let's say we have to figure out the theoretical probability of rolling a 3 with a fair die. There are six equally likely outcomes (1, 2, 3, 4, 5, and 6) and a 3 can occur only once; thus, the theoretical probability of rolling a 3 on a fair die is 1 ÷ 6 = 0.17. The formula for theoretical probability is as follows:

P(E) = (Number of ways event E can occur) / (Total number of possible occurrences in the sample space)

P(rolling 3 on a die) = 1/6 = 0.17

Empirical Probability

The empirical probability of an event is based on the actual number of times the event occurs during an experiment or data collection, that is, on actual observations. Theoretically, if we toss a coin 10 times, heads should occur 5 times, as the theoretical probability of heads is 0.5. But in actual practice, the coin may land heads up 3 times out of 10 tosses. In this case, the empirical or actual probability of heads is:

P(heads) = 3/10 = 0.3

Similarly, let’s say that in a survey, 30 students were asked to choose one of the three primary colours as their favourite colour: 2 chose red, 18 chose yellow, and 10 chose blue. If we have to find out the probability that a student’s favourite colour is yellow, the answer is:

Out of 30 students, 18 chose yellow; thus, the probability is 18 ÷ 30 = 0.6. The formula for empirical probability is as follows:

P(E) = (Number of times event E occurs) / (Total number of observed occurrences)

P(yellow) = 18/30 = 0.6

Law of Large Numbers and Probability

Statisticians have found that as the sample size increases, empirical probability and theoretical probability converge; hence, observed probability distributions increasingly resemble theoretical distributions. For example, when you toss a coin, theoretically heads should occur 50% of the time. But in actual practice, this would not hold for a small number of trials. It may be that for:
• 10 tosses you get heads 4 times, or 40% of the time;
• 100 tosses you get heads 48 times, or 48% of the time;
• 10,000 tosses you get heads 5,021 times, or 50.21% of the time.

In this sense, the law of large numbers implies that as the size of a sample or the number of trials increases, the mean calculated from the sample (x̄) converges toward the population mean (μ). You will see in the next chapters that this feature of probability is very helpful in predicting the population mean (μ) from a sample mean (x̄), that is, in generalizing from a sample to the population. A small simulation of this convergence follows.
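Here is a minimal simulation sketch in Python; the seed value is arbitrary, fixed only so that a run is reproducible.

```python
import random

random.seed(1)  # arbitrary seed for a reproducible run
for trials in (10, 100, 10_000):
    # Count heads: each toss lands heads with probability 0.5.
    heads = sum(random.random() < 0.5 for _ in range(trials))
    print(trials, heads / trials)
# The exact proportions vary with the seed, but the larger the number
# of tosses, the closer the empirical proportion clusters around 0.5.
```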

The Multiplication Rule

In the case of unrelated events, the probabilities of the two events are multiplied to figure out the probability that both occur. These events could be independent of or dependent on each other.

Independent Events

If one event does not influence the outcome of another event, the events are said to be independent. For example, if you roll a die twice and you get a 6 the first time and a 6 again the next time, the two events are independent of each other. Independent events are unrelated



because the occurrence of event A does not influence the occurrence of event B.

Dependent Events

When one event influences the probability of the occurrence of another event, the events are said to be dependent. For example, if from a well-shuffled deck of playing cards you draw the king of diamonds and, without replacing it, you then draw the queen of clubs, the probability of drawing the queen of clubs becomes dependent on the drawing of the king of diamonds. The probability of drawing the king of diamonds and the queen of clubs will be different because you have changed the total number of occurrences—that is, the sample space—from 52 cards to 51 cards by not replacing the king of diamonds. In both the independent and the dependent cases, the probabilities of the two events are multiplied to figure out the probability that both occur. In the first example, the probability of getting a 6 on the first roll of a die is 1 out of 6:

P(A) = 1/6

The probability of getting a 6 on the second roll of the die is also 1 out of 6:

P(B) = 1/6

The probability of rolling two 6s in a row is:

P(A and B) = 1/6 × 1/6 = 1/36 ≈ 0.028

In other words, there are about 3 chances out of 100 of rolling two 6s in a row. In the second example, the probability of drawing the king of diamonds is:

P(A) = 1/52 = 0.019

If the king of diamonds is returned to the deck, then the probability of drawing the queen of clubs is also:

P(B) = 1/52 = 0.019



The probability of drawing the king of diamonds and then the queen of clubs, if the king of diamonds is returned to the deck before the second draw, is:

P(A and B) = 0.019 × 0.019 = 0.00036

If the king of diamonds is not returned to the deck, the probability of then drawing the queen of clubs is:

P(B) = 1/51 = 0.020

In this case, the probability of drawing the king of diamonds and the queen of clubs is:

P(A and B) = 1/52 × 1/51 = 0.019 × 0.020 = 0.00038
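A minimal sketch of these calculations with exact fractions; the small discrepancies from the 0.00036 and 0.00038 above come from rounding 1/52 and 1/51 to three decimals before multiplying.

```python
from fractions import Fraction

# Independent events: two 6s in a row on a fair die.
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)
print(p_two_sixes, float(p_two_sixes))           # 1/36, about 0.028

# With replacement: king of diamonds, then queen of clubs.
print(float(Fraction(1, 52) * Fraction(1, 52)))  # about 0.00037

# Without replacement: the second card is drawn from 51 cards.
print(float(Fraction(1, 52) * Fraction(1, 51)))  # about 0.00038
```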

The Addition Rule

If two events (A and B) are mutually exclusive, the probabilities of the two events are added to figure out the probability that event A or event B will occur.

Mutually Exclusive Events

The addition rule is applicable to mutually exclusive events. Two events are mutually exclusive when it is impossible for them to occur together. For example, it is impossible for a single coin toss to yield both heads and tails. To determine the probability of obtaining one or the other of two mutually exclusive events, we simply add their individual probabilities. For example, the probability that a toss of a coin will result in a heads or a tails is:
• P(heads) = 0.5
• P(tails) = 0.5
• P(heads or tails) = 0.5 + 0.5 = 1

Important Note: Unrelated independent or dependent events are written as "(A and B)," and mutually exclusive events are written as "(A or B)."



Non-mutually Exclusive Events

Two events are not mutually exclusive if both can occur together. For example, being a woman and being a parent are non-mutually exclusive because a person can be a woman as well as a parent at the same time. When two events can occur simultaneously, they are considered to be non-mutually exclusive. With non-mutually exclusive events, there is a possibility of double-counting (e.g., the same person can be counted once among women and again among parents). This means that it is necessary to subtract those who have been double-counted. You need to take the following three steps when calculating the probability of non-mutually exclusive events:
1. Calculate the probability of event A.
2. Calculate the probability of event B.
3. Subtract the probability of duplications.

P(A or B) = P(A) + P(B) – P(A and B)

Let's say that out of 200 students in a school, 40 students play only baseball, 60 students play only tennis, and 70 students play both baseball and tennis. The probability of drawing at random a baseball or a tennis player is as follows:

Probability of being a baseball player = P(A) = (40 + 70) ÷ 200 = 110 ÷ 200 = 0.55
Probability of being a tennis player = P(B) = (60 + 70) ÷ 200 = 130 ÷ 200 = 0.65
Probability of being both a baseball and a tennis player = P(A and B) = 70 ÷ 200 = 0.35
Probability of being a baseball or a tennis player = P(A or B) = P(A) + P(B) – P(A and B) = 0.55 + 0.65 – 0.35 = 0.85

The probability of somebody being a baseball or a tennis player is 0.85.
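A minimal sketch of this example in Python:

```python
total = 200
p_baseball = (40 + 70) / total   # 110 / 200 = 0.55
p_tennis = (60 + 70) / total     # 130 / 200 = 0.65
p_both = 70 / total              # 0.35

# Addition rule: subtract the double-counted two-sport players.
p_either = p_baseball + p_tennis - p_both
print(round(p_either, 2))        # 0.85
```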

Odds

Though odds and probability are used interchangeably in common parlance, they are different mathematically. The odds of an event are the ratio between the probability of the event happening and the probability of it not happening. In this sense, odds are the number of successes you can expect for every failure.

Odds = P(event happening) / (1 – P(event happening)) = P / (1 – P) = proportion of successes / proportion of failures

Let's say that out of 100 students who are admitted to a law school, 60 are males and 40 are females. The probability of being admitted to the law school for a male is 60 ÷ 100 = 0.6 and that for a female is 40 ÷ 100 = 0.4. The odds of being admitted to the law school for males and females are as follows:

Odds for males = 0.6 / (1 – 0.6) = 0.6 / 0.4 = 1.5
Odds for females = 0.4 / (1 – 0.4) = 0.4 / 0.6 = 0.67

Putting it in simple English: the odds are that on average 1.5 males per female, and 0.67 females per male, are admitted to the law school.

Odds Ratio

Now, if we want to find the relative odds in the above scenario, we calculate the odds ratio:

Odds ratio = 1.5 / 0.67 = 2.24

This means that a male student has 2.24 times higher odds of being admitted to the law school than does a female student.
• While the values of a probability are always between 0 and 1, the values of odds can range from 0 to ∞ (infinity). If the probability of the occurrence of an event is 1, the odds of its occurring are infinite.

Calculating Probability from Odds

To find the probability from the odds, just reverse the formula:

Probability = Odds / (1 + Odds)

Probability for males = 1.5 / (1 + 1.5) = 1.5 / 2.5 = 0.6
Probability for females = 0.67 / (1 + 0.67) = 0.67 / 1.67 = 0.4

Probability Ratio

If you take the ratio of the probabilities of being admitted to the law school for the two sexes:

Probability ratio = P(m) / P(f) = 0.6 / 0.4 = 1.5,

you can say that a male has 1.5 times the chance of being admitted to the law school that a female has. Probability is an important concept for inferential statistics. Inferential statistics are used in the testing of hypotheses, which you will learn about in forthcoming chapters.
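A minimal sketch of the conversions in this section; odds and prob are illustrative helper names, not standard functions. The 2.25 it prints differs slightly from the 2.24 above because the text rounds the female odds to 0.67 before dividing.

```python
def odds(p):
    # Odds: probability of happening per unit probability of not happening.
    return p / (1 - p)

def prob(o):
    # Probability recovered from odds.
    return o / (1 + o)

odds_m, odds_f = odds(0.6), odds(0.4)
print(round(odds_m, 2), round(odds_f, 2))              # 1.5 and 0.67
print(round(odds_m / odds_f, 2))                       # odds ratio: 2.25
print(round(prob(odds_m), 1), round(prob(odds_f), 1))  # 0.6 and 0.4
```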

Role of Probability in Statistical Inference

The concept of probability is crucial to inferring from a sample about the population and generalizing from a sample to the population. Let's say that an employer has to implement pay equity, and she asks her statistician to find out how much women earn in comparison with men in her large corporation. The statistician selects a random sample of salaries of men and women and finds from the sample that on average a woman makes only 75% of a man's salary. The employer does not accept the statistician's findings and thinks that the gender difference in salary is much narrower than this. She hypothesizes that in the corporation's entire workforce, women make 90% of men's salaries. However, the employer's intuition is not enough; a statistical test is necessary to decide how different the sample value is from the hypothesized value. You will learn about the importance of probability in hypothesis testing in forthcoming chapters. But before discussing tests of inference, we need to understand another important concept based on probability that is used in deciding whether to retain or reject a hypothesis. This concept is the normal distribution, which we will discuss in the next chapters.



Exercises for Practice

Q1. What is the probability of getting a 5 in a single toss of a die?
a. 1/6
b. 5/6
c. 5
d. 1/5

Q2. Two playing cards are drawn from a well-shuffled deck of 52 cards. What is the probability they are both kings if the first card is replaced?
a. 4/52 × 4/52
b. 4/52
c. 1/52
d. 1/52 × 1/52

Q3. Two playing cards are drawn from a well-shuffled deck of 52 cards. What is the probability they are both queens if the first queen is not replaced?
a. 4/52 × 3/52
b. 4/52 × 4/52
c. 4/52
d. 4/52 × 3/51

Q4. A ball is drawn at random from a bag containing 5 red balls, 7 white balls, and 8 blue balls. What is the probability that the ball is blue?
a. 2/5
b. 1/5
c. 1/8
d. 3/5

Q5. In Q4 above, what is the probability that the ball is not white?
a. 7/20
b. 7/12
c. 13/20
d. 1/5



Q6. Find the probability of getting at least a 5 or a 6 in a toss of a fair die.
a. 1/6
b. 1/36
c. 2/6
d. 1/6

Q7. One bag contains 8 oranges and 12 peaches; another bag contains 10 oranges and 20 peaches. If one fruit is drawn from each bag, find the probability that both are oranges.
a. 9/10
b. 2/15
c. 3/15
d. 1/30

Q8. In Canada, 60% of youth talk on their cellphones while driving. If two youths who were driving were chosen randomly, what is the probability that both of them were talking on their cellphones while driving?
a. 2/60
b. 3/5
c. 9/25
d. 2/6

Q9. Which one is the sample space when two dice are rolled?
a. (1, 3), (4, 5), (2, 2), (3, 6)
b. (1), (4), (2), (3)
c. (3), (5), (4), (6), (1), (4)
d.

Q10. What are the odds of rolling a 2 on a die?
a. 0.17
b. 0.83
c. 0.17/0.83
d. 0.17 + 0.83

CHAPTER SEVEN

SAMPLING

Learning Objectives

It is time-consuming and expensive to collect information on an entire population because populations are usually large. Statisticians use samples to infer and generalize about the population. In this chapter, you will learn about:
• the sample;
• the sampling frame;
• sampling and non-sampling errors; and
• probability and non-probability samples.

Introduction

Sampling is inherent to all animals. Animals sniff samples of their surroundings to assess the entire atmosphere; we sample food to get an idea of an entire dish. Sampling in statistics is essential because collecting data on an entire population is not only cumbersome but also costly. For these two reasons, researchers rely on samples to infer about an entire population. A population is a complete or theoretical set of individuals, objects, or observations. Most of the time a population is a hypothetical population. When a pharmaceutical company tests a drug on a sample of individuals, it expects the drug to be used by everybody in the country, and when the drug becomes generic, it is expected to be used by the whole world. In this sense, a population is not a concrete concept; it is a hypothetical concept. But when a sample of students is selected from a school, the student population of the school is definite; it is not hypothetical. Whether hypothetical or definite, population size is usually large. It would take a long time to collect data on the entire population of a



country. A census is undertaken every 10 years in most countries to collect information on the entire population. In reality, even the census of a nation cannot collect every piece of information required for research or planning. That is why in many countries during the census a short questionnaire is administered to every citizen to collect the basic information and a long questionnaire is administered to collect detailed data on a small sample of the population. In fact, most of the research is conducted on small samples and the findings are generalized to the entire population.

A Sample

A sample is a subset of a population. Samples are selected for statistical testing when populations are too large. Ideally, a sample should have all the characteristics of the population from which it is selected. If a sample is not representative of the population due to bias toward one or more characteristics of the population, generalization from the sample findings to the entire population will not be acceptable. Most statistical analysis revolves around the question: How well does the sample statistic represent the population parameter? A single measure of some attribute of a sample—such as the mean, the variance, or the standard deviation—is called a statistic; the corresponding measure for a population is called a parameter. A sample is selected from a sampling frame.

Sampling Frame

A sampling frame is a list of units (people, families, households, or observations) of the population from which a sample is selected. A complete sampling frame is crucial for the representativeness of a sample. An incomplete sampling frame may cause sampling error.

Sampling Error

A sampling error is the difference between the sample statistic and the population parameter. As mentioned earlier, a measure such as the mean, the variance, or the standard deviation of a sample is called a statistic, and that of a population is called a parameter. The difference between the sample mean and the population mean is the sampling error of the mean. When we use sample data, there is always a risk that the sample statistic may not be close enough to the population parameter; hence, the conclusion drawn from a sample may not apply to the entire population. There is always a



risk of some sampling error. The sampling error can be reduced by increasing the sample size.

Non-sampling Errors

Non-sampling errors in a survey are errors that arise during data collection activities; they are different from sampling error. Non-sampling errors arise due to, for example, false information provided by respondents, data entry errors, biased questions in the questionnaire, or an incomplete sampling frame. It is virtually impossible to eliminate all non-sampling errors. Increasing the sample size reduces the sampling error but does not reduce non-sampling errors.

Types of Samples

There are basically two types of samples: the probability sample and the non-probability sample. Different situations call for different strategies for selecting a probability or a non-probability sample. The aim of probability sampling is to have a lower sampling error so that the sample is representative of its population. Let's briefly discuss some important probability and non-probability sampling strategies.

Probability Samples

Primarily, there are four types of probability samples: random, systematic, stratified or hierarchical, and multistage cluster.

1. Random Sample

In random sampling, every unit or respondent is given an equal probability of selection. Suppose we have a population of 80 factory workers and we want to select a sample of 10 workers. Each worker then has a 10 ÷ 80 = 0.125 probability of selection. In other words, each worker has a 1 in 8, or 12.5%, chance of selection. The probability of selection of a unit or a respondent is called the sampling fraction. Statisticians use random tables to select a random sample. These tables are found at the end of most statistics books. To select a random sample with the help of a random table, you start with a number in the table chosen at random and then select numbers according to the predetermined sample size. If you have decided on a sample size of 10, then select the 10 individuals from the sampling frame corresponding to the random



numbers selected from the random table. Random numbers can also be generated with the help of a computer.

Sampling Fraction

A sampling fraction is the ratio between the sample size and the population size. If the sample size is n and the population size is N, then the sampling fraction is n/N. As mentioned, in random sampling the sampling fraction is the probability of selection of a unit or a respondent.

2. Systematic Sample

A systematic sample is an approximation of a random sample. You choose a number at random as a starting point on your sampling frame and then choose every nth unit (e.g., every fourth person from a telephone directory of employees). Make sure that the sampling frame (the list of individuals, units, or observations) has no special ordering. For example, if every third person in the sampling frame is a female, the sample will be nonrandom, as the ordering will increase or decrease the probability of selection of male or female persons.

3. Stratified or Hierarchical Sample

Strata are bands, levels, divisions, sections, branches, or echelons of populations or observations. Sometimes you know that your target population is disproportionately distributed across various sub-areas of a city or across sub-categories of a variable such as age groups. You need to select proportionately from each stratum or group to make your sample representative of the source population. In stratified or hierarchical sampling, you stratify your population by the relevant criteria—which could be region, ethnicity, or age—and then randomly select a predetermined proportion of respondents from each category. Because each stratum contains an unequal proportion of the required population, selecting random samples proportionately from each stratum makes the total sample more representative of the population. The following two examples will clarify this concept. Let's say that you have to select a random sample of 1,000 non-English-speaking persons from a city of an English-speaking country with a population of 350,000. Assume there are four areas in the city: North, South, East, and West. You know from the census numbers that the non-



English-speaking population is disproportionately distributed in these four areas (strata). To select a stratified sample:
• Calculate the percentage of the non-English-speaking population from the census population in each stratum.
• Multiply the already decided sample size (e.g., 1,000) by the percentage of the non-English-speaking population in each stratum to find the number of persons to be selected from that stratum.
• Randomly select that number of non-English-speaking persons from each stratum, as shown in Table 7-1.

Table 7-1 Calculation of the Sample Size for Each Stratum

Stratum   Non-English-Speaking Population (N)   Share of Census Population (%)   Sample Size
North     75,000                                 21.4%                            1,000 × 21.4% = 214
South     100,000                                28.6%                            1,000 × 28.6% = 286
East      45,000                                 12.9%                            1,000 × 12.9% = 129
West      130,000                                37.1%                            1,000 × 37.1% = 371
Total     350,000                                100.0%                           Sample = 1,000

For the second example, let's say that you have to select a stratified random sample of 1,000 persons from four age groups. Once you find the percentage distribution of each age group in the population from the census numbers, you can determine the sample size for each age group, as shown in Table 7-2. A small allocation sketch follows the table.

Table 7-2 Calculation of Sample Size from the Percentage Distribution of Population by Age in Each Stratum

Stratum       Age           Census Population (%)   Sample Size
Children      0–15          25.0%                    1,000 × 25.0% = 250
Youth         16–24         5.0%                     1,000 × 5.0% = 50
Working Age   25–64         60.0%                    1,000 × 60.0% = 600
Elderly       65 and over   10.0%                    1,000 × 10.0% = 100
Total                       100.0%                   Sample = 1,000



4. Multistage Cluster Sample

A cluster could be a community of widely dispersed population, a street of a city, or a neighbourhood. Multistage cluster sampling is used when a complete list of the population is not available but complete lists of clusters (groups) are available. In simple cluster sampling, you collect data from each cluster of units consecutively. In multistage cluster sampling, you first select sub-clusters randomly and then select individuals randomly from each sub-cluster. For example, if a complete list of the population of a country is not available, you randomly select some cities from the country. This is the first stage of a multistage cluster sampling. Next, you select blocks randomly from the selected cities (second stage), and then select households from the selected blocks (third stage). This is a three-stage cluster sampling. Using this method, you don't need to create a list of potential respondents for all the cities in the country. You need to create lists of potential respondents only for the selected blocks, and from these lists you can select samples of households. This reduces sampling and travel costs. The main advantage of multistage cluster sampling is that it is cost-effective; a disadvantage is that the sampling error becomes higher, as each cluster has its own sampling error.

Advantages of Probability Sampling

There is an element of probability sampling in the above four sampling strategies because they utilize random selection. Probability sampling is preferable because it has the following advantages:
• Probability sampling allows generalizing from a sample statistic to the population parameter.
• You can apply parametric inferential tests to find out whether there is a statistically significant difference between a sample statistic and the population parameter.

Sample Size

The sample size is the number of observations in a sample. It is an important feature of empirical studies that aim to make an inference about a population from a sample.
• The larger the size of the sample, the lower the sampling error.
• There are diminishing returns in increasing the sample size, because increasing the sample size beyond a certain point may not proportionally decrease the sampling error.
• The required sample size varies with the heterogeneity of the population. A more diverse population requires a larger sample size to increase the probability of selection of each characteristic.
• The sample size has to be large to satisfy the assumption of normality that is necessary for the application of many parametric inferential techniques. The assumption of normality will be discussed in forthcoming chapters.
• A large sample may be necessary if the number of non-responses is large for certain variables in the study. For example, some respondents may not answer questions on personal variables such as income or age.

Non-probability Samples

You may not be interested in generalizing your findings from the sample to the entire population. In that case, randomness of the sample will not interest you; you may want to select a quick and dirty sample just to get information on a topic from some respondents. There are broadly three techniques for selecting a non-probability sample.

1. Convenience Sampling

In this type of non-probability sampling, respondents who are conveniently available to recruit are selected for a study. Using volunteers for a clinical trial, choosing five students from a class, or choosing the first five names from a voting list are prime examples of convenience sampling. Most pilot studies use convenience sampling because in a pilot study a researcher is interested only in acquiring basic knowledge about the data to be collected, without going through the trouble of randomization.

2. Snowball Sampling

Some populations, such as homeless persons or undocumented immigrants, are difficult to locate. In such cases, you ask the contacted person to refer you to another person. For example, it would be impossible to get a list of names and whereabouts of homeless persons to study homelessness in a city. In this case, the use of the snowball sampling



technique is preferable. You contact some homeless persons on the street and, if they agree, you interview them. At the end of the interview, you ask for their help in providing the names and whereabouts of some homeless persons who may be their friends. You continue doing this until your sample reaches what you think is the optimum size. Because your sample size grows as you go, like a snowball collecting snow, this is called the snowball sampling technique. Becker's9 (1963) study of marijuana users, which became the foundation for the famous labelling theory in sociology, used the snowball technique to collect data.

3. Quota Sampling

Quota sampling is stratified sampling without random selection. A researcher divides a target population into strata or groups and then decides on the number of persons to be selected from each group. In other words, the researcher assigns a quota, based on his or her discretion, to each stratum or group and selects respondents according to this quota. Many times quota sampling is used because a researcher is afraid of missing some segment of the population. A researcher collecting information on religious beliefs among faith groups may be afraid of missing a particular micro-minority group because of their smaller numbers. In quota sampling, the aim is to select groups according to their proportional representation in the population. For example, you may know that Mennonites are a micro-minority in your target population. You may decide that 2% of your sample should consist of Mennonites to reflect their proportion in the target population. The main risk in non-probability sampling is that the sample may not be representative of the population, because an equal probability of selection is not given to each member of the population. Therefore, you may not be able to generalize your results from your sample to the population.

9 Becker, Howard S., ed. The Other Side: Perspectives on Deviance. New York: The Free Press, 1964.



Exercises for Practice

Q1. Which of the following statements is correct?
1. The sampling error is the difference between the sample size and the population size.
2. The sampling error is the difference between the sample statistic and the population parameter.
3. The sampling error is the difference between the techniques of sampling.
4. The sampling error is the difference between the sample characteristics and the population characteristics.

Q2. Suppose we have a population of 1,000 students and we want to select a sample of 100 students. Each student in a random sample would have which probability of selection?
1. 0.001
2. 0.01
3. 0.1
4. 1.0

Q3. The difference between the sampling frame and the sampling fraction is:
1. A sampling frame is a ratio between the sample size and the population size and a sampling fraction is the sample size.
2. A sampling frame is a list of units and a sampling fraction is the ratio between the number of units and the sample size.
3. A sampling frame is a list of units and a sampling fraction is the ratio between the sample size and the population size.
4. A sampling frame is the ratio between the sample size and the population size and a sampling fraction is a list of units.

Q4. You want to conduct a survey on homeless persons and you are worried about getting a large enough sample size; which sampling technique would you use and why?



Q5. The Ministry of Finance wants to conduct a survey of racial minorities among its employees. There are five divisions in the ministry, each division has six branches, and each branch has three units. You want to draw a representative sample of the ministry's workforce and want to make sure that each division, branch, and unit has a sufficient number of respondents. What sampling strategy will you use and why?

Q6. A faculty of music at a university wants to create a new orchestra. There are students enrolled in 10 different instrument types. The faculty wants each instrument type to be proportionately represented. Which of the sampling techniques would you advise the faculty to use and why?

Q7. When will a researcher use quota sampling, and what is the risk of using quota sampling?

Q8. When you have a complete sampling frame and your main concern is representativeness, which sampling technique will you use and why?

Q9. Which of the following includes all examples of probability samples?
1. Cluster, snowball, and quota samples
2. Simple random, snowball, and quota samples
3. Cluster, stratified, and systematic random samples
4. Cluster, snowball, and stratified random samples

Q10. Which of the following includes all examples of non-probability samples?
1. Convenience, cluster, and quota samples
2. Convenience, snowball, and quota samples
3. Simple random, quota, and cluster samples
4. Simple random, stratified random, and cluster samples

CHAPTER EIGHT

THE SAMPLING DISTRIBUTION AND THE NORMAL CURVE: GENERALIZING FROM A SAMPLE TO ITS POPULATION

Learning Objectives

In this chapter, you will learn concepts that are crucial to generalizing from a sample to its population. These concepts are:
• the sampling distribution of means;
• the central limit theorem;
• the standard error; and
• confidence intervals.

Introduction

Due to the difficulty of finding the mean of a large population, statisticians have devised techniques to find out how close a sample mean is likely to be to its population mean. This helps us to estimate the population mean from a sample and, hence, to generalize about a population's characteristics from those of a sample. Three concepts are crucial for generalizing from a sample to its population: the sampling distribution, the central limit theorem, and the normal distribution. Once you comprehend these three concepts, you'll be ready to understand statistical inference. Just as vocabulary and grammar are essential to the learning of a language, these three concepts, along with the measures of central tendency and variability, are essential to the learning of statistics. You already know the mean and the standard deviation; once you comprehend the concepts outlined in this chapter, you'll have the required foundation for learning statistics.



Sampling Distribution of Means

Let's say that you work for an addiction research organization and you've been asked to design an anti-smoking program that targets adolescents aged 13 to 19. You need to find out how many cigarettes on average an adolescent smoker smokes per week. You know from the census that the total adolescent population in the country is 4 million. You also know that various surveys conducted before your research estimate that 10% (400,000) of these 4 million adolescents are smokers. One option involves contacting every adolescent smoker, asking how many cigarettes each of them smokes, and then calculating the average number of cigarettes smoked by an adolescent. This method would be time-consuming, however; even if you interviewed 1,000 adolescent smokers per day, it would take 400 days (i.e., more than a year) to complete your survey. Not only would it be silly to spend more than a year just to find an average, but it would also be very expensive. Therefore, you decide to take a random sample of the target population, that is, adolescent smokers. First, you'll consider the sample size. A larger sample will give you a mean closer to the population mean, but a larger sample will also be expensive. To be cautious, you could take a series of same-sized samples, find the mean of each sample, and then take the mean of those sample means. You can rest assured that these means will differ from one another, and taking the mean of these means will still leave you unsure how close your mean of means is to the population mean. Another thing you can do with the means of a large number of samples is present them in a vertical bar graph, that is, a histogram. This distribution of means is called a sampling distribution of means. A sampling distribution of means has very important qualities: it follows the central limit theorem.

Central Limit Theorem

The central limit theorem states that if you take a sufficiently large number of random samples from a population, with replacement of each sample to the population, then the distribution of the sample means will be approximately a normal distribution. The central limit theorem also states that the sampling distribution of any statistic will be normally or nearly normally distributed if the sample size is large enough. As a rough rule of thumb, most statisticians believe that a sample size of 30 is large enough. The simulation sketch below illustrates the idea.
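A minimal simulation sketch, assuming a uniform population purely for illustration; the sample size of 30 follows the rule of thumb above.

```python
import random

random.seed(7)           # arbitrary seed for a reproducible run
population_mean = 0.5    # mean of the flat Uniform(0, 1) population

# Draw 1,000 samples of size 30 and record each sample's mean.
sample_means = [
    sum(random.random() for _ in range(30)) / 30
    for _ in range(1_000)
]

grand_mean = sum(sample_means) / len(sample_means)
print(round(grand_mean, 3))  # very close to 0.5, the population mean
# A histogram of sample_means would look roughly bell-shaped even though
# the population itself is flat, not normal.
```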



Normal Curve

A normal curve, also called a bell curve or a Gaussian curve, is bell-shaped, as shown in Figure 8-1.

Figure 8-1 Normal Curve or Gaussian Curve
[Figure: a symmetric bell-shaped curve; the horizontal axis runs from –4 to 4.]

In school, when people refer to "marking a test on the curve," they mean that a graph representing the class's marks would look like a bell-shaped curve, that is, a normal curve. Many variables in real life are normally distributed. For example, scores on entrance tests, the blood pressure of a population, and the incomes of households are found to be normally distributed. The normal curve has distinct features that are very useful in statistical inference.

1. A normal curve has the same mode, mean, and median. Remember that the mode is the most frequent value and is the highest point on a graph. This highest point is also the median in the case of a normal distribution—one half of the values of the distribution are on the left side of this highest point, and the other half are on the right side. This highest point is also the mean of the normal distribution. It follows that one half of the values of the distribution are on the left side of the mean and the other half are on the right side of the mean. In other words, 50% of the values



are below the mean (to the left) and 50% of the values are above the mean (to the right).

2. It follows from the above discussion that a normal curve is a symmetrical curve: both halves around a vertical line at the mean match perfectly because exactly the same number of values of the distribution falls in each half.

3. Each band of standard deviations around the mean contains the same proportion of the values of the distribution. This is called the 68–95–99.7% rule:
a. Approximately 68% (34% on each side of the mean) of all the values of the distribution fall within one standard deviation of the mean.
b. Approximately 95% (47.5% on each side of the mean) of all the values of the distribution fall within two standard deviations of the mean.
c. Approximately 99.7% (49.85% on each side of the mean) of all the values of the distribution fall within three standard deviations of the mean.

Figure 8-2 shows the exact percentages of the observations that fall between one, two, three, and four standard deviations (σ) above and below the mean.

Figure 8-2 Normal Distribution of Observations between 1, 2, 3 and 4 Standard Deviations (σ)

[Figure: a bell curve with the area divided into standard deviation bands; on each side of the mean, the band from the mean to ±1σ holds 34.1% of observations, from ±1σ to ±2σ 13.6%, from ±2σ to ±3σ 2.1%, and beyond ±3σ 0.1%.]



Salient Features of the Sampling Distribution of Means

The sampling distribution of means has the following important features:
1. It is asymptotically (approaching) normal.
2. The standard deviation of the sampling distribution of means is called the standard error. The standard error is equal to the population standard deviation (σ) divided by the square root of the sample size (n): σ/√n.
3. The mean of the sample means is more likely to be an accurate estimate of the population mean than any one sample mean.
4. Because the sampling distribution of the means is normally distributed, the mean of the sampling distribution is equal to the true population mean (μ).
5. Because the sample means are tightly clustered around the true population mean, the standard deviation of a sampling distribution is small. Therefore, we can be confident that the mean of any random sample will be very close to the population mean.
6. Using statistical techniques, with the help of the standard error of a sample mean, we can find out whether a sample mean is a close approximation of the population mean.

Standard Error of the Sample Mean

The standard deviation of a sampling distribution is called the standard error of the mean. It is written as σx̄, and it is a measure that helps us to find out how close our sample mean is to the population mean. The larger the value of the standard error, the farther the sample mean is from the population mean, and vice versa. We calculate the standard error of the sample mean with the following equation when the standard deviation of the population is known:

σx̄ = σ/√n,

where σx̄ is the standard error, n is the size of the sample, and σ is the standard deviation of the population.
• Assuming that the population is normally distributed, with the help of the standard error of the sample mean we can calculate how close a sample mean is to the true population mean.

92

Chapter Eight

For a normally distributed population, we can assume that the standard error of the population (ıxࡃ ) is equal to the standard error of a sample: sxࡃ =

௦ ξ௡

,

where sxࡃ is the standard error of the sample, s is the standard deviation of the sample, and n is the sample size. x Because a sample will rarely have the same mean as that of the population, we attach a measure of confidence with which we can say that the sample mean is close to the true population mean. This is achieved with the help of the confidence interval.
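As a quick check on the formula, here is a minimal Python sketch; the sample values are invented for illustration:

    import math

    def standard_error(values):
        # Estimate the standard error of the mean, s / sqrt(n), using
        # the sample standard deviation (n - 1 in the denominator).
        n = len(values)
        mean = sum(values) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
        return s / math.sqrt(n)

    # Hypothetical sample of 10 observations
    sample = [32, 41, 28, 35, 39, 30, 44, 37, 33, 36]
    print(round(standard_error(sample), 2))  # about 1.5 for these numbers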

Confidence Intervals

Because the sample mean is rarely the same as the population mean, there is always a chance of error in the estimate of the population mean. Statisticians have devised a way to say, with some measurable level of confidence, how close the sample mean is to the population mean. To accomplish this, they attach a specific probability to the statement that the sample mean is "close to" or "differs from" the population mean. The standard error is used to indicate the confidence level with which one can make such a statement. Recall that the standard error is obtained by dividing the population standard deviation by the square root of the sample size (n).

Think of the sampling distribution as a normal distribution and a sample mean as an observation. Just as we know that 68% of the observations fall within ±1 standard deviation of the mean in a normal distribution, we can be certain that approximately 68% of the means will fall within ±1 standard error of the population mean in a sampling distribution. Similarly, we can say that approximately 95% of all means will fall within ±2 standard errors of the mean, and 99.7% of all the means will fall within ±3 standard errors of the mean. This plus or minus (±) defines the upper and lower limits, called the confidence limits, and the interval between these two limits is called the confidence interval.

Let's clarify further. Say we're testing whether the sample mean falls within two standard errors of the population mean, our sample mean is 50, and the standard error is 10. In this case, we're testing whether the sample mean is within 50 ± 2 × 10, that is, between 50 minus 20 and 50 plus 20, so between 30 and 70. Because approximately 95% of the means fall within two standard errors, we are testing at the 95% confidence level, where the confidence limits are 30 and 70 and the confidence interval is from 30 to 70.
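The arithmetic of the example can be sketched in a couple of lines of Python:

    def confidence_interval(mean, std_error, z=2.0):
        # Confidence limits are mean ± z standard errors.
        margin = z * std_error
        return mean - margin, mean + margin

    low, high = confidence_interval(50, 10, z=2.0)
    print(low, high)  # 30.0 70.0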

Calculating Confidence Interval

We can use the confidence level to attach a specific probability to the statement that a sample mean is close to the population mean. In other words, we can indicate with a level of confidence that a sample mean is close to the population mean. To indicate a confidence level, we use the standard error, because the standard error (σ/√n) is nothing but the standard deviation per observation or per person, n being the number of observations or the number of persons in the sample. We can use the standard error almost the same way as the standard deviation: rather than saying that an observation falls within one, two, or three standard deviations of the mean, we say "with a certain confidence level" that the sample mean falls within one, two, or three standard errors of the population mean.

Let's say that we select many random samples of the heights of men and the mean of the means of these samples is 6′0″. The means of most of the random samples will hover around 6′0″. If the standard deviation of this sampling distribution is 2″, most sample means will lie within 6′0″ plus or minus 2″, that is, between 5′10″ and 6′2″. Mean heights of 4′8″ or 7′0″ will be outliers. These means will fall in the tails of the sampling distribution (a normal distribution), and we will say that mean heights of 4′8″ and 7′0″ are significantly different from the population mean of 6′0″. The normal curve and the concepts of the confidence level, the confidence limits, and the confidence interval are crucial to inferential statistics and hypothesis testing, which you'll learn about in the forthcoming chapters.


Exercises for Practice

Choose a correct answer for the following 10 multiple-choice questions.

Q1. The central limit theorem states that:
1. the means of sufficiently large-sized samples drawn from a population without replacement will be normally distributed.
2. the means of sufficiently large-sized samples from a population with replacement will be normally distributed.
3. the sampling distribution of any statistic will be normally distributed or nearly normally distributed if the number of the samples is sufficiently large irrespective of their size.
4. the sampling distribution of any statistic will be normally distributed or nearly normally distributed if the samples are drawn randomly irrespective of their size.

Q2. The standard error of the sample mean is equal to:
1. the population standard deviation (σ) divided by the square root of the sample size (n): σ/√n.
2. the population standard deviation (σ) divided by the square root of the population size (N): σ/√N.
3. the sample standard deviation (s) divided by the square root of the population size (N): s/√N.
4. the sample size (n) divided by the square root of the population size (N): n/√N.

Q3. In a normal distribution curve:
1. the upper and lower limits of the standard error are called the confidence interval, and the interval within these two limits is called the confidence limit.
2. the upper and the lower limits of the standard error are called the confidence level, and the interval within these two limits is called the confidence limit.
3. the upper and the lower limits of the standard error are called the confidence level, and the interval within these two limits is called the confidence interval.
4. the upper and the lower limits of the standard error are called the confidence limits, and the interval within these two limits is called the confidence interval.

Q4. As the sample size of a random sample increases, the sample mean is:
1. less likely to be closer to the population mean.
2. more likely to be farther away from the population mean.
3. likely to remain the same.
4. more likely to be closer to the population mean.

Q5. As the size of random samples increases, the shape of the sampling distribution of the means tends to:
1. be wide and flat.
2. approach normality.
3. be more skewed.
4. remain the same.

Q6. As you increase the size of a random sample, the standard error of the mean will:
1. increase.
2. decrease.
3. remain the same.
4. become unstable.

Q7. The sampling distribution of the sample means is:
1. the distribution of samples of various sizes.
2. the distribution of the different possible values of the sample means.
3. the distribution of the values of the objects or individuals in the population.
4. the distribution of the data values in a given sample.

Q8. The central limit theorem states that the sampling distribution of the sample means is approximately normal under the following condition:
1. The sample size has to be large.
2. The population size has to be large.
3. The sample variance has to be small.
4. The population variance has to be large.

Q9. If samples of size 100 are selected from a non-normal population with mean 20 and standard deviation 10, the distribution of possible values of the sample means will be:
1. approximately normal, with mean 20 and standard deviation 10.
2. approximately normal, with mean 2 and standard deviation 10.
3. approximately normal, with mean 100 and standard deviation 10.
4. None of the above.

Q10. A random sample of 100 observations is to be drawn from a normal population with a mean of 20 and a standard deviation of 10. The mean of the distribution of the possible values of the sample means will be:
1. 20.
2. 10.
3. 100.
4. 2.5.

CHAPTER NINE

NORMAL DISTRIBUTION AND ITS RELATIONSHIP WITH THE STANDARD DEVIATION AND THE STANDARD SCORES

Learning Objectives

In this chapter, we will discuss an important utility of the normal distribution in statistics. Specifically, you will learn about:

• the standard deviation and the normal curve; and
• the standardized scores or z-scores.

Introduction

We saw in the last chapter that the standard deviation (SD) provides standardized cut-points on each side of the mean, which falls at the centre of the normal curve. We also know from the 68–95–99.7% rule that:

• approximately 68% of all the observations fall within ±1 standard deviation of the mean;
• approximately 95% of all the observations fall within ±2 standard deviations of the mean; and
• approximately 99.7% of all the observations fall within ±3 standard deviations of the mean.

Let's illustrate this concept with a concrete example. Let's say that one summer you were hired as a lifeguard at a large public swimming pool. To ensure that you remain alert while on duty, your supervisor asked you to count the number of persons in the pool every ten minutes and write an hourly average in a notebook. After every hour, you added your 6 counts, divided by 6, and took an hourly average. You took 1,000 samples over the summer, and you calculated 1,000 hourly means.

When you plotted these 1,000 means, it turned out they were normally distributed. Next, you calculated the mean and the standard deviation of these means. Let's say that you found out that the mean of the means was 30 and the standard deviation of the means was 4. As shown in Figure 9.1, in the normal curve:

• ±1 standard deviation of the mean would be equal to ±4;
• ±2 standard deviations of the mean would be equal to ±8; and
• ±3 standard deviations of the mean would be equal to ±12.

Figure 9.1 Normal Curve with Mean = 30 and Standard Deviation = 4

Figure 9.1 shows that more than 68% of the observations are within ±1 SD. In other words, about 32% of the observations are outside the range of ±1 SD. Similarly, because 95% of the observations fall within ±2 SD, only 5% of the observations will fall outside the range of ±2 SD. Finally, because more than 99.7% of all the observations fall between ±3 SD, only 0.3% of the observations are left out of this range.

In our example, the mean is 30 and the standard deviation is 4. Combining the idea of the level of confidence and the normal distribution, we can say that 68% of the time the mean of our sample will fall within the population mean μ ± 1 SD, or 30 ± 4, that is, between 26 and 34. We can also say with 95% confidence that the sample mean will fall within the population mean μ ± 2 SD, or 30 ± 8, that is, between 22 and 38; and we can say with 99.7% confidence that the sample mean will fall within the population mean μ ± 3 SD, or 30 ± 12, that is, between 18 and 42.
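A short simulation of the lifeguard example shows the rule at work. The sketch below uses invented attendance numbers, not data from the book:

    import random

    random.seed(1)

    # 1,000 hourly means, each the average of 6 ten-minute counts
    means = []
    for _ in range(1000):
        counts = [random.gauss(30, 10) for _ in range(6)]
        means.append(sum(counts) / 6)

    grand_mean = sum(means) / len(means)
    sd = (sum((m - grand_mean) ** 2 for m in means) / (len(means) - 1)) ** 0.5

    for k in (1, 2, 3):
        inside = sum(1 for m in means if abs(m - grand_mean) <= k * sd)
        print(f"within ±{k} SD: {inside / len(means):.1%}")
    # Prints roughly 68%, 95%, and 99.7%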

Standardized Scores

To compare any two things, you use a third thing as a reference. Let's say you want to buy a compact car. You owned a Honda Civic, which is your reference point. When you're looking at a Toyota Corolla, you'd compare it with the Civic you owned. Let's say that a Hyundai Elantra is the second car you're considering. You'll also compare the Elantra with the Civic you owned. Of the two prospective cars, you decide to buy the one you think is better than your Civic. In essence, you have mentally converted both prospective cars into the Civic you owned and then tried to find out which of the two "mentally converted Civics," the Corolla or the Elantra, is better. Standardized scores work on this logic.

Conversion of a Raw Score to a Standard Score (z-Scores)

The concept of a standard score is easy to grasp. Every raw score in a distribution can be converted to an equivalent standard score. The standard score is also called a z-score. A standard score for an observation (raw score) is obtained by subtracting the population mean (μ) from the raw score (X) and dividing the difference by the population standard deviation (σ). The formula for converting a raw score X into a z-score is as follows:

z = (X − μ) / σ,

where X is an observation, μ is the mean of the population from which X is drawn, and σ is the standard deviation of the population. Here, we are using an uppercase X because it is an element of a population, not that of a sample.

As shown above, the z-score is calculated from the known population parameters: the population mean (μ) and the population standard deviation (σ). Because population size is usually large, the true population mean and the true population standard deviation are not usually known, except in such cases as standardized tests, where the raw scores of the entire participant population who took the test are known. When these parameters are unknown, the sample standard deviation (s) is substituted for the population standard deviation (σ) and the sample mean (x̄) is substituted for the population mean (μ). The formula for the z-score changes to:

z = (x − x̄) / s.

We can convert any score into a standard score (z-score). Statisticians have prepared z-tables that give the area under the curve between the mean and any z-score (Appendix Table 1A) as well as the area in the tail of the curve beyond any z-score (Appendix Table 1B).

Uses of z-Scores

z-scores have two important uses. First, we can compare two scores and find out how many persons in the distribution scored below or above the score of a person. Second, we can find out whether the difference between a score and the mean is statistically significant.
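In code, the conversion and the table lookup collapse into two calls. A sketch using Python's standard library, where NormalDist plays the role of Appendix Table 1A (the numbers come from the LSAT example that follows):

    from statistics import NormalDist

    def z_score(x, mu, sigma):
        # Standard score: how many SDs the raw score is from the mean
        return (x - mu) / sigma

    z = z_score(166, mu=150, sigma=30)
    print(round(z, 2))                    # 0.53
    print(round(NormalDist().cdf(z), 4))  # about 0.70 of scores fall below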

Calculating the z-Score

Let's say that 130,000 students in a country took the Law School Admission Test (LSAT) in 2016, and the mean of their scores was 150 and the standard deviation was 30. The population mean and the population standard deviation are known in this case because the 130,000 test takers are the entire population of LSAT participants in the country. Let's say that Mary scored 166, and we want to know how many students scored higher and how many students scored lower than Mary. Here are the values we would need to plug in to the z-score formula shown above:

X = 166, μ = 150, σ = 30

z = (X − μ) / σ = (166 − 150) / 30 = 16 / 30 = 0.53

Appendix Table 1A provides the area between the mean and the z-score. The first column gives the z-score to the first decimal point, and the first row gives the value of the z-score to the second decimal point. For example, for our z-score of 0.53, we would look in Appendix Table 1A for the area between the mean and the z-score against 0.5 in the first column and under 0.03 in the first row. The answer is 0.2019.

The mean divides the area under the curve into two equal halves (Figure 9-2). Half of the area (0.5) under the curve lies below the mean, and 0.2019 lies between the mean and the z-score; therefore, 0.5 + 0.2019 = 0.7019, or 70.19%, of the area under the curve is below the z-score value of 0.53. We can say that 70.19% of the 130,000 students scored less than Mary. In other words, out of the total 130,000 students, 0.7019 × 130,000 = 91,247 students scored below Mary's score of 166. We can also say that 130,000 − 91,247 = 38,753 students scored above Mary's score of 166.

Figure 9-2 Mary's Score (166)


Now let’s look at another example. Let’s say that John scored 145, which is below the mean. His z-score would be: X = 145 ȝ = 150 ı = 30 zൌ z= =

௑ିȝ

ı ଵସହିଵହ଴ ିହ ଷ଴

ଷ଴

= –0.17


We look for the area under the curve in Appendix Table 1A against 0.1 in the first column for the first decimal point and under 0.07 in the first row for the second decimal point. The answer is 0.0675. Because John's z-score is negative (−0.17), it is below the mean. Hence, it will be on the left side of the mean. Negative z-scores lie on the left side of the mean, and positive z-scores lie on the right side of the mean. To calculate the area covered under the curve up to the z-score of −0.17, you have to subtract 0.0675 from 0.5: 0.5 − 0.0675 = 0.4325. We can say that 43.25% of the 130,000 students scored lower than John. In other words, 0.4325 × 130,000 = 56,225 students scored below John's score of 145. We can also say that 130,000 − 56,225 = 73,775 students scored above John's score of 145.

Figure 9.3 John's Score (145)

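Both percentile calculations can be verified in a few lines; a minimal sketch:

    from statistics import NormalDist

    def below_above(score, mu, sigma, population):
        # Expected numbers of test takers below and above a raw score
        p_below = NormalDist(mu, sigma).cdf(score)
        below = round(p_below * population)
        return below, population - below

    print(below_above(166, 150, 30, 130_000))  # Mary: about 91,000 below
    print(below_above(145, 150, 30, 130_000))  # John: about 56,000 below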

Conversion of a z-Score into a Raw Score

The z-score can be converted back to a raw score. Suppose we want to know, with 95% confidence, what LSAT score any student will get in the above population with the mean of 150 and the standard deviation of 30. We have to work backwards with the same formula to determine this. The formula z = (X − μ) / σ can be rewritten as:

X = z × σ + μ.

We know that 0.95, or 95%, of the area under the curve lies between approximately ±2 (−2 and +2) standard deviations (Figure 9.1). The first thing we have to


find out is the value of the z-score corresponding to this band. When 95% of the area lies in the middle of the curve, 5% of the area is in the two tails of the curve; in other words, 2.5%, or 0.025, of the area is in each tail. The corresponding z-score cuts off an area of 0.475 on each side of the mean, 0.475 being half of 0.95, or 0.5 − 0.025 = 0.475. Looking up the value of z in Appendix Table 1A, we find that the value of z that corresponds to the area of 0.475 under the curve is 1.96: the area 0.475 appears against 1.9 in the first column and under 0.06 in the first row.

Now we can find the upper and lower limits of the raw score of any student, with 95% confidence, in a distribution with μ = 150 and σ = 30. We know from Appendix Table 1A that the value of z is 1.96 at the 95% confidence level, and from our data we know that the mean (μ) score is 150 and the standard deviation (σ) is 30. We plug these values into the rewritten z-score formula:

X = z × σ + μ.

The upper limit of the raw score X will be: 1.96 × 30 + 150 = 58.8 + 150 = 208.8.
The lower limit of the raw score X will be: −1.96 × 30 + 150 = −58.8 + 150 = 91.2.

We can say with 95% confidence that if the population mean is 150 and the population standard deviation is 30, the LSAT score of any randomly chosen student will be between 91.2 and 208.8.

We can also calculate the percentage of candidates who will score less than a specific raw score by finding the area under the curve up to that score. Let's find the z-score for a raw score of 185:

z = (X − μ) / σ = (185 − 150) / 30 = 35 / 30 = 1.17.

As we can see from Appendix Table 1A, the z value of 1.17 corresponds to 0.3790 of the area under the curve above the mean; the mean corresponds to 0.5 of the area under the curve; therefore, the total area under the curve up to the z-score of 1.17 is 0.5 + 0.3790 = 0.8790. We can say that 87.9% of students will score below 185 and, therefore, that only 100.0 − 87.9 = 12.1% of students will score above 185. In other words, the scores of 12.1% of the students will be in the right tail of the curve. We can use this to conclude that the test does not allow a large number of students to achieve extreme scores. Therefore, the test is a good test in terms of difficulty.
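The back-conversion can be sketched the same way, with inv_cdf replacing the reverse table lookup:

    from statistics import NormalDist

    mu, sigma = 150, 30
    z95 = NormalDist().inv_cdf(0.975)  # about 1.96 for a central 95% band

    # X = z * sigma + mu gives the raw-score limits
    print(-z95 * sigma + mu, z95 * sigma + mu)  # about 91.2 and 208.8

    # Share of students scoring below 185
    print(round(NormalDist(mu, sigma).cdf(185), 3))  # about 0.88 (0.879 via the rounded table value)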

Finding Probability of an Event Using the z-Score and the Normal Curve

Let's say that last year those students who scored between 144 and 160 on the LSAT were eligible to get into a law school. Therefore, this year we want to know the probability of randomly selecting a student who would score between 144 and 160. First, we have to find the values of the z-scores corresponding to the LSAT scores of 144 and 160 in a population with the mean of 150 and the standard deviation of 30, and then we can figure out the area under the curve between these two z-scores:

z = (X − μ) / σ

z-score for the LSAT score of 144: z = (144 − 150) / 30 = −6 / 30 = −0.20
z-score for the LSAT score of 160: z = (160 − 150) / 30 = 10 / 30 = 0.33


As shown in Appendix Table 1A, the area covered under the curve between the z-score of −0.20 and the mean is 0.0793 (below the mean), and the area covered between the mean and the z-score of 0.33 is 0.1293 (above the mean). The total area covered between the two scores is 0.0793 + 0.1293 = 0.2086, that is, 20.86%. We can say that there is about a 21% chance that a student will score between 144 and 160. Figure 9.4 gives a graphical representation of the calculated areas under the curve between the LSAT scores of 144 and 160 with the mean of 150 and the standard deviation of 30.

Figure 9.4 Area Under the Normal Curve between Two Scores

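The interval probability is one subtraction of cumulative areas; a sketch:

    from statistics import NormalDist

    lsat = NormalDist(mu=150, sigma=30)
    p = lsat.cdf(160) - lsat.cdf(144)  # P(144 < X < 160)
    print(round(p, 2))                 # about 0.21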

These are some of the uses of standardized scores and the normal curve. We will learn in the next few chapters that the concept of the normal curve is central to inferential statistics.

Tails of the Curve

It is time to introduce the term "the tail of the curve." The two ends of the curve are called "tails." We will soon learn about one-tailed and two-tailed tests. You know from Figure 9.1 that 95% of the observations under the normal curve fall between approximately ±2 standard deviations, and 5% of the observations fall in the tails of the curve. It follows that, for ±2 standard deviations, 47.5% of the observations are above the mean and 47.5% of the observations are below the mean. It also follows that 2.5% of the observations are in the left tail of the curve and 2.5% of the observations are in the right tail of the curve. We also know from our z-table that the value of the z-score that corresponds to an area of 0.475 under the curve is 1.96. We can say that 95% of the values under the normal curve will fall between the z-scores of −1.96 and 1.96.
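The 1.96 cut-off itself can be recovered programmatically rather than from the z-table; a sketch:

    from statistics import NormalDist

    # z-scores leaving 2.5% in each tail (95% in the middle)
    upper = NormalDist().inv_cdf(0.975)
    print(round(-upper, 2), round(upper, 2))  # -1.96 1.96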

Exercises for Practice

Q1. What is the area between the mean and the z values given below?
a. z = 2.31
b. z = −1.23
c. z = 2.97
d. z = −0.69
e. z = 1.00

Q2. What is the percentile (i.e., the percentage) of students scoring below the z-scores of the following five students?
John    z = 1.56
Shilpa  z = 2.67
Anwar   z = −0.84
Mary    z = 1.91
Jesse   z = −2.63

Q3. Now find the percentages of students who scored above the z-scores of the same five students.
John    z = 1.56
Shilpa  z = 2.67
Anwar   z = −0.84
Mary    z = 1.91
Jesse   z = −2.63

Q4. If the mean of a standardized test is 63 and the standard deviation is 9, what is the percentile rank of a student who scored 79?

Q5. If Albert Einstein's IQ was 165, the mean IQ of the population was 105, and the population standard deviation was 20, what was the percentile rank of Einstein's IQ?

Q6. If Stephen Hawking's IQ was 169, the mean IQ of the population was 118, and the standard deviation was 17, how much smarter, based on their percentile ranks, was Hawking than Einstein?

Q7. Last year the mean score of a standardized test was 160 and the standard deviation was 15. What is the probability that a student will get a score between 140 and 180?

Q8. If the mean of a standardized test is 145 and the standard deviation is 5, find with 95% confidence what score any student will get.

CHAPTER TEN

EXAMINING RELATIONSHIPS

Learning Objectives

This chapter forms a bridge between univariate and bivariate statistics. You will learn important concepts associated with hypothesis testing. Specifically, you will learn:

• cross-tabulation;
• an introduction to the test of significance;
• hypothesis testing;
• one-tailed and two-tailed tests;
• the level of significance and the level of confidence;
• p-values;
• the number of degrees of freedom;
• type 1 error and type 2 error; and
• steps for testing a hypothesis.

Introduction

Researchers usually start with a hunch, which they formalize as a hypothesis to test. Before collecting data, researchers reflect on the expected relationships between the variables that they intend to explore and use this reflection to develop the hunch. There is always a possibility of finding unexpected relationships between the variables, which may surprise a researcher at the time of data analysis. This is a fun part of statistical exploration. There are three primary tools for searching for relationships between variables: cross-tabulation, graphs, and statistical techniques.


Cross-Tabulation

Cross-tabulation is a process of constructing tables with one or more variables. When you construct a table with only one variable at a time, it is called a univariate table.

Univariate Tables

Univariate tables are frequency tables that are constructed with only one variable. You have seen this type of table in Chapter 3. A univariate table shows frequencies (numbers) or percentages of each category of a variable. For example, Table 10-1 gives the number of families by type. It is a univariate table because it provides information on only one variable (family type).

Table 10-1 Number of Families by Type

Type of Families          Number of Families   Percent of Families
Opposite-Sex Families                  7,000                  87.5
Same-Sex Families                        600                   7.5
Single-Parent Families                   400                   5.0
Total Families                         8,000                 100.0

• A univariate table shows one variable at a time.
• The purpose of a univariate table is to describe a variable.

Bivariate Tables

Bivariate tables show the relationship between two variables. Rows of a bivariate table are assigned to the categories of one variable, and columns are assigned to the categories of the other variable. Though it does not make a difference which of the two variables you assign to rows or columns, it is advisable to set up your table so that the dependent variable is placed in the rows and the independent variable is placed in the columns. By setting up your table this way, you will be able to compare change in the dependent variable for each category of the independent variable.

Let's say you're studying income by sex. Whatever you're studying is your dependent variable. Because you're interested in differences in income by sex, income is the dependent variable and sex is the independent variable, which is the reason why in Table 10-2 income is assigned to rows and sex is assigned to columns. In this table, you can easily compare the change in income for each category of sex. There are more women in the lowest income category, and there are more men in the highest income category. This is true for the number as well as the percentage of men and women. There is a small difference by sex in the middle income group. You can observe from this bivariate table that there is inequity in income distribution by sex.

Bivariate and multivariate tables are also called contingency tables; by analyzing a contingency table, you can find out whether the dependent variable is dependent, or contingent, on the independent variable. Table 10-2 shows that the amount of income earned depends on the respondent's sex. You are likely to make more money if you are male. In other words, income changes with a change in sex.

• A bivariate table compares two variables: the dependent and the independent variables.
• A bivariate table is the first step in searching for relationships between two variables. For example, we can infer from Table 10-2 that income is related to the sex of respondents.

Table 10-2 Income by Sex

Income               Male    %   Female    %
Less than $20,000      50   23       70   34
$20,000 to $50,000     70   32       75   37
$50,000 or more       100   45       60   29
Total                 220  100      205  100
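Tables like 10-2 are usually produced from respondent-level records rather than typed by hand. A sketch with pandas, using invented records and column names:

    import pandas as pd

    # Hypothetical respondent-level data
    df = pd.DataFrame({
        "sex": ["Male", "Female", "Male", "Female", "Male"],
        "income": ["$50,000 or more", "Less than $20,000",
                   "$20,000 to $50,000", "$20,000 to $50,000",
                   "$50,000 or more"],
    })

    # Dependent variable (income) in rows, independent variable (sex) in columns
    counts = pd.crosstab(df["income"], df["sex"])
    percents = pd.crosstab(df["income"], df["sex"], normalize="columns") * 100
    print(counts)
    print(percents.round(0))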

Multivariate Tables

When a table is constructed with more than two variables, it is called a multivariate table. A multivariate table usually has one dependent variable and two or more independent variables. Table 10-3 is a multivariate table with three variables: income as a dependent variable, and sex and age as independent variables.

Table 10-3 Income by Sex and Age

                               Male                        Female
                     Under 40    40 and over     Under 40    40 and over
Income                #    %      #    %          #    %      #    %
Less than $20,000    30   24%    20   21%        35   33%    35   35%
$20,000 to $50,000   50   40%    20   21%        40   38%    35   35%
$50,000 or More      45   36%    55   58%        30   29%    30   30%
Total Families      125  100%    95  100%       105  100%   100  100%

If you look at the percent distributions in Table 10-3, you find that in the "40 and over" age group more men are concentrated in the highest income category ($50,000 or more), whereas women are roughly equally distributed across the three income categories. The sex difference in income could be due to the age distribution, as older persons tend to have higher income and there are relatively more men than women among the highest income earners ($50,000 or more) in the "40 and over" age group. In other words, the income differences between men and women may not be entirely due to gender discrimination; they could also be due to age distribution.

• Analysis of multivariate tables is used to further elaborate the relationship between the dependent and the independent variables.
• With the help of multivariate tables, we can find the influence of an independent variable on the dependent variable by controlling for the influence of another independent variable. In the above table, we looked at the influence of age on income by controlling for the influence of sex.

Test of Significance

Statisticians use a formal decision-making rule to test a hypothesis. Some tests are used to evaluate differences between our expectations and the actual observations. For example, we may expect and hypothesize that there is no gender gap in the earned income of men and women. When we collect actual data, we may see that on average women earn 25% less than men. We can use a statistical test of significance to find out if this observed difference in income by gender differs significantly from our expectation.

A test of significance can also be used to test whether differences between two groups, or between a sample and the population, are just by chance. A test of significance in multivariate analysis can also be used to decide which variables should be kept in further analysis. Primarily, a test of significance is used to:

• test how representative a sample characteristic is of the population characteristic;
• evaluate differences between the expected and the observed values; and
• test whether differences are actual or by chance.

Tests of significance are discussed in the next four chapters.

Hypothesis Testing

A hypothesis is a statement about the relationship between two or more variables based on an educated guess. It could be as mundane as stating that if I drank beet juice every day, my systolic blood pressure would be lower than 139; or it could be very elaborate, requiring large-scale data collection. For example, a large-scale family planning project was run by Johns Hopkins University in India in the 1960s to test the hypothesis that if the survival of an infant can be assured, couples will have fewer children and, hence, fertility will decline. The idea was that fertility is associated with child mortality. The research suggested that although couples desired a smaller family size, they ended up having more children because they did not expect all the children to survive due to high infant mortality. To test this hypothesis, 25 villages were divided into five groups to collect data. In the first group of villages, medical care was provided to mothers and their children; in the second group, maternity and health care was provided only to mothers; in the third group, medical care was given only to children; in the fourth group, only extensive education on family planning was provided; and the fifth group of villages was kept as a control group in which no services were provided. The data was collected for almost 25 years to obtain a large enough sample of births and child deaths. A hypothesis like this may require a long time and intensive resources.

Traditionally, we set up a null hypothesis to test a research hypothesis. The research hypothesis is also called an alternate hypothesis. The following four steps are taken to test a hypothesis:

1. Null hypothesis: Based on the same idea as "you are innocent until proven guilty," a null hypothesis is a default position. For example, if your hypothesis is that the gender gap in income is statistically significant, your null hypothesis will be that the gender gap in income is not statistically significant.

2. Research hypothesis or alternate hypothesis: If the null hypothesis is rejected, you accept the alternate condition of your research, i.e., the alternate hypothesis. In statistics, you can only support your hypothesis by rejecting the null hypothesis. In other words, first you show that the null hypothesis is untenable, and then you accept your notion, which is the alternate hypothesis. You assume that your alternate hypothesis is wrong until you find evidence to the contrary through a statistical test. As mentioned, it's like saying a person is innocent until proven guilty. Because you always hypothesize based on some hunch, which could reflect your bias, you test the null hypothesis to cut down the bias. In the above example, the alternate hypothesis, or the research hypothesis, is that the gender gap in income is statistically significant.

3. Finding evidence with the help of a test statistic: You will learn in the next four chapters that a statistic is a value that is calculated from the sample data. This value is evaluated by comparing it with the distribution of the statistic.

4. Decision rule: Finally, the calculated value of the statistic is compared with a set of given values of the statistic's distribution, and a decision is made either to reject or not to reject the null hypothesis.

There are several concepts that are relevant to tests of statistical significance. They include independence, randomness, normality, one- and two-tailed tests, the level of significance, the level of confidence, the p-value, degrees of freedom, and type 1 and type 2 errors. We must understand these concepts before embarking on applying tests of significance.

Assumptions

Our data must meet three assumptions for testing a hypothesis with a statistical test of significance. These assumptions pertain to independence, normality, and randomness.


Independence

We assume that the observations in our sample are independent of each other. For example, if data has been collected by the snowball sampling technique, in which you ask one respondent to refer you to other respondents, the observations would not be independent: the probability of selection of a respondent depends on the first respondent, who referred the other respondents to you. As described in Chapter 7 on sampling, researchers use the snowball sampling technique to collect data on persons without a fixed address, such as homeless persons and delinquent gangs.

Normality

We assume that the sample is sufficiently large and the population from which the sample is drawn is normally distributed. If the population is normally distributed, a sufficiently large random sample can also be assumed to be normally distributed.

Randomness

It is assumed that the sample has been selected randomly. Because a random sample is representative of the normally distributed population from which it is drawn, we can compare a sample statistic (e.g., the sample mean) with the corresponding population parameter (e.g., the population mean).

One-Tailed and Two-Tailed Tests

In a test of significance, a researcher needs to decide whether to use a one-tailed or a two-tailed test. If the direction of the difference is known in advance, the researcher uses a one-tailed test. For example, when you are testing a hypothesis that the mean income of your sample will be more than the population mean of $50,000, you will apply a one-tailed test because you already know the direction of the test, which is more than $50,000. In this case, your null hypothesis will be that the mean income of your sample is not more than $50,000 (the population mean).

When the direction of the difference is not known, you use a two-tailed test. Often, researchers do not know in advance whether the sample mean will be above or below the population mean. In this case, your null hypothesis will be that the mean income of your sample is equal to the population mean, and your alternate hypothesis will be that it is not equal. Because you don't know whether the sample mean would be more or less than the population mean, the direction of the difference is not known to you. Hence, you will apply a two-tailed test.

Level of Significance

The level of significance is the level of risk one is willing to take. This value is also called the alpha (α) value. Testing at a 0.05 level of significance means that you're willing to take the risk of making a wrong decision 5 times out of 100. In other words, you're taking the risk of going wrong 1 time out of 20. An alpha (α) of 0.01 would mean that you're willing to take the risk of going wrong 1 time out of 100.

Level of Confidence

The level of significance and the level of confidence are both probabilities that vary between 0 and 1 and can easily be confused with each other. The level of confidence is 1 minus the level of significance. The level of confidence is usually expressed as a percentage. When you are testing a hypothesis at a level of significance of 0.05, you are testing it with a level of confidence of 1 − 0.05 = 0.95, or 95%. Similarly, when you are testing a hypothesis at a level of significance of 0.01, you are testing it with a level of confidence of 1 − 0.01 = 0.99, or 99%.

p-Values

A p-value is another number that is associated with a test of significance. The p-value corresponds to the area in the tail of the normal curve beyond the value of a statistic. Every value of a statistic has a corresponding probability, or p-value. A p-value is the probability of obtaining a value of the statistic at least as extreme as the one observed, assuming the null hypothesis is true. For example, the probability of obtaining a z-score at least as extreme as ±1.96 is 0.05; in this case, 0.05 is the (two-tailed) p-value for obtaining a z-score of 1.96. The smaller the area left in the tail of the normal curve, the larger the value of the statistic. The following calculation of the p-value will make this clear.

Calculating the p-Value

• Calculate the z-score.
• Find the value of the probability (area under the curve) corresponding to the calculated value of the z-score in Appendix Table 1A.
• The p-value is 0.5 minus the probability in Appendix Table 1A corresponding to the calculated value of the z-score. (You can also directly find p-values corresponding to calculated values of z-scores in Appendix Table 1B.)

For example:

1. Suppose your calculated z-score is 2.0.
2. The value of the probability in Appendix Table 1A corresponding to the z-score of 2.0 is 0.4772.
3. For a one-tailed test, the p-value for obtaining the z-score of 2.0 is 0.5 − 0.4772 = 0.0228.
4. For a two-tailed test, the p-value for obtaining the z-score of 2.0 is 0.0228 + 0.0228 = 0.0456.
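The same arithmetic without a table; a sketch:

    from statistics import NormalDist

    def p_value(z, two_tailed=True):
        # Area of the standard normal beyond |z|
        one_tail = 1 - NormalDist().cdf(abs(z))
        return 2 * one_tail if two_tailed else one_tail

    print(round(p_value(2.0, two_tailed=False), 4))  # 0.0228
    print(round(p_value(2.0, two_tailed=True), 4))   # 0.0455 (0.0456 via the table, a rounding difference)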

Use of p-Values

p-values are commonly used to test a null hypothesis. As mentioned, a null hypothesis generally states that there is no statistically significant difference between two groups or between a sample characteristic and the population characteristic. The smaller the p-value, the less likely it is that the observed value of the statistic would occur by chance, provided the null hypothesis is true. A p-value of 0.05 or less is generally considered to indicate that a finding is statistically significant.

The Number of Degrees of Freedom

The number of degrees of freedom is the most elusive and confusing concept for a new learner. It is denoted as df. Let's try two simple ways to understand the concept of degrees of freedom.

The number of degrees of freedom is the number of observations in a sample minus the number of population parameters that must be estimated from the sample. (Parameters are summary characteristics of a population, such as the mean or the median.) Let's say that we have to calculate the mean of a sample of 100 observations. In this case, there is one parameter, the mean, and 100 observations (n), so the number of degrees of freedom will be n − 1, or 100 − 1 = 99. If we have to calculate the means of two samples, then the df will be (n1 − 1) for the first sample and (n2 − 1) for the second sample. The number of degrees of freedom for the two samples together will be (n1 − 1) + (n2 − 1) = (n1 + n2 − 2). If the size of the first sample is 50 and the size of the second sample is 30, the df will be (50 + 30 − 2) = 78.

If you have a contingency table, using the same concept of n − 1, you multiply the number of rows minus one (r − 1) by the number of columns minus one (c − 1); that is, the df is equal to (r − 1)(c − 1). Suppose you have a table with four rows and three columns: the degrees of freedom are (4 − 1)(3 − 1) = 3 × 2 = 6.

There is another way to understand the concept of degrees of freedom. The degrees of freedom are the minimum number of values that must be known, once the mean is fixed, to determine all the data. For example, if there are five values and their mean is known, you need to know at least four of the five values to complete the data. Suppose the mean of these five values is 7; the sum of the five values is then 5 × 7 = 35. If the four known values are 5, 6, 8, and 9, then you automatically know that the fifth value will be 35 minus the sum of these four values: 35 − (5 + 6 + 8 + 9) = 35 − 28 = 7. In other words, if the mean (x̄) of a sample of size n = 5 is known, you must know at least four of the five values to figure out the fifth value; only one value can remain unknown. In a sample of size 5, the number of degrees of freedom is df = n − 1, or 5 − 1 = 4.

Type 1 Error

A type 1 error is the act of rejecting a null hypothesis when it should be retained. In other words, a type 1 error is the rejection of a true null hypothesis. For example, in the trial of a person accused of a crime, the null hypothesis is that the accused is not guilty (innocent), and the alternate hypothesis is that the accused is guilty. In this case, a type 1 error will occur if the accused is not guilty but is found guilty. This becomes more likely if we set the level of significance (α), the tail area of the curve, too wide: the wider the tail area, the more lenient α is. For example, α will be more lenient if we say that, to find an accused guilty, only a simple majority of 6 out of 10 jurors must give a guilty verdict, compared with saying that the accused will be found guilty only if all 10 jurors agree. The probability of getting 6 jurors to agree on a verdict is higher than the probability of getting all 10 jurors to agree. In this case, the chance of our calculated value of a statistic (z or t) being larger than the critical value will be high. In other words, the chance of rejecting a true null hypothesis will be high, which means that the probability of making a type 1 error will be high. The value in the z-table or t-table with which we compare our calculated value is called the critical value or the tabulated value.


In the case of gender differences in income, our null hypothesis is that there is no statistically significant difference between male and female income, and the alternate, or research, hypothesis is that there is a statistically significant difference between male and female income. If there was actually no statistically significant gender difference in income but we proclaimed one, we have rejected a true null hypothesis that should have been retained. Hence, we have made a type 1 error.

Type 2 Error

A type 2 error is failing to reject a null hypothesis when it should be rejected. In other words, a type 2 error is accepting a false null hypothesis. For example, α will be more stringent if we say the verdict will be acceptable only if all 10 jurors find the accused guilty. In this case, we have set the critical value too far away from the mean, and the tail area under the curve will be too small. A value of α equal to 0.01 is more stringent than a value of α equal to 0.05: it makes the null hypothesis more difficult to reject. When α is too stringent, say 0.0001, it will be very difficult for the calculated value of a statistic (z or t) to exceed the critical value, and the chance of failing to reject the null hypothesis will be very high. In other words, the probability of a type 2 error will be very high.

In the gender difference in income example, if we proclaimed no statistically significant gender difference in income but there actually was one, we have failed to reject a false null hypothesis. In other words, we have accepted a false null hypothesis. Hence, we have made a type 2 error.

• A type 1 error is rejecting a true null hypothesis, and a type 2 error is accepting a false null hypothesis.

Four Steps for Testing a Hypothesis

We already know how to calculate the z-score. Let's test a hypothesis using the z-score on data obtained from a sample of residents of New York City. Let's say that, according to a city census, the average income of New York households is $33,000 and its standard deviation is $1,500. The average income of a sample drawn from the city's households is $36,000. We want to know if the average income of the sample is significantly higher than that of the population of New York. We need to follow four steps to test our hypothesis.

Step 1: The Hypothesis Statement

Null Hypothesis (H0): The average income of the sample (x̄) is not significantly higher than the average income of the population (μ). The statement can be written as: x̄ ≤ μ.

Alternate Hypothesis (H1): The average income of the sample (x̄) is significantly higher than the average income of the population (μ). The statement can be written as: x̄ > μ.

Step 2: The Test of Significance

The test requires choosing a test statistic, the level of significance, and a two-tailed or a one-tailed test. We choose as follows:

• Test statistic: the z-statistic.
• Level of significance: α = 0.05.
• One-tailed versus two-tailed test: We choose a one-tailed test because we are testing whether the sample average income is more than the population average income, which means we know the direction of the test.

Step 3: Calculating the Test Statistic

We plug the values of the population mean, the sample mean, and the population standard deviation into the following formula to calculate the z-statistic:

z = (x̄ − μ) / σ,

where μ is the mean of the population, x̄ is the mean of the sample, and σ is the standard error (the standard deviation of the population).

z = (36,000 − 33,000) / 1,500 = 3,000 / 1,500 = 2.0

Step 4: The Decision Rule

According to the decision rule, you reject the null hypothesis if the calculated value of the z-statistic is higher than the critical value of the z-statistic given in the z-table corresponding to the p-value of 0.05, the area in the tail of the curve. In a one-tailed test, a p-value of 0.05 implies that 0.05 of the area under the curve is in its tail; therefore, 0.5 − 0.05 = 0.45 is the area between the mean and the z-score. The value of z that corresponds to 0.45 in Appendix Table 1A is 1.65. In other words, the critical value of z is 1.65 for the level of significance of 0.05. Since the calculated z value of 2.0 is greater than the tabulated (critical) z value of 1.65, we reject the null hypothesis. In other words, we accept the alternate hypothesis that there is a statistically significant difference between the sample average income and the population average income.

This chapter sets the stage for inferential statistics. Inferential statistics pertains to making inferences from data by applying a test of significance. The following chapters are devoted to tests of significance according to the level of measurement of the data.
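The four steps condense into a few lines of code. A sketch of the New York example, following the book's convention here of treating σ as the standard error:

    from statistics import NormalDist

    mu, sigma = 33_000, 1_500   # population mean and standard error per the text
    sample_mean = 36_000
    alpha = 0.05                # one-tailed test

    # Step 3: the test statistic
    z = (sample_mean - mu) / sigma

    # Step 4: compare with the one-tailed critical value
    critical = NormalDist().inv_cdf(1 - alpha)   # 1.645 (the table rounds to 1.65)
    print(z, round(critical, 2), z > critical)   # 2.0 1.64 True -> reject H0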

Exercises for Practice

Q1. It is advisable to set up your table so that:
a. the dependent variable is in the rows and the independent variable is in the columns.
b. the dependent variable is in the columns and the independent variable is in the rows.
c. the independent variable is not in the columns.
d. None of the above.

Q2. To find the influence of a variable on the dependent variable by controlling for the influence of another variable, you have to construct a:
a. univariate table.
b. bivariate table.
c. multivariate table.
d. None of the above.

Q3. Primarily, tests of significance are used to:
a. test how representative a sample characteristic is of the population characteristic.
b. evaluate the difference between the expected and the observed values.
c. test whether differences are actual or by chance.
d. All of the above.

Q4. When you are testing a hypothesis that the mean income of your sample will be more than the population mean of $30,000, you will apply a:
a. two-tailed test.
b. one-tailed test.
c. directional test.
d. non-directional test.

Q5. Find the p-value for the z-score of 1.81 from Appendix Table 1A or Appendix Table 1B.

Q6. Find the degrees of freedom for:
a. a sample of 50 observations and one parameter.
b. a contingency table with 5 rows and 3 columns.

Q7. Which of the following statements is correct?
a. A type 1 error is failing to reject the null hypothesis when it should be retained.
b. The probability of a type 1 error increases when the value of α is too stringent.
c. A type 1 error is rejecting the null hypothesis when it should be retained.
d. None of the above.

Q8. What will be the level of confidence when a researcher is testing a hypothesis at the level of significance of 0.10?

CHAPTER ELEVEN

TESTS OF SIGNIFICANCE FOR NOMINAL-LEVEL VARIABLES

Learning Objectives

In this chapter, we start bivariate statistics, which involves finding a relationship between two variables. Specifically, you will learn:

• that the choice of a statistical test depends upon the level of measurement of the data;
• to calculate and interpret the chi-square (χ²), a test frequently used when the dependent variable or the independent variable is a nominal-level variable;
• two more measures derived by a simple modification of the chi-square: the phi (φ) and Cramer's V; and
• to calculate the lambda (λ), a measure of the proportional reduction of error (PRE).

Introduction

In previous chapters, we worked with only one variable as you learned about univariate statistics. In Chapter 2, we discussed the four levels of measurement of data: nominal, ordinal, interval, and ratio. In Chapter 3, we introduced frequency distributions. The conversion of raw data into frequency tables is also called data reduction because a large amount of data is reduced into a smaller version in the form of frequency tables. Then, we presented data in graphs and charts to visually comprehend the shape and change of data. In Chapter 4, we learned to find a typical value of a distribution with the help of measures of central tendency (the mode, the median, and the mean). In Chapter 5, we looked at the spread of the values of a variable with the help of the measures of dispersion, that is, the range, the variance, and the standard deviation. But we were still dealing with only one variable at a time. In the last chapter, we briefly introduced univariate, bivariate, and multivariate tables. Moving forward, we are going to deal with finding relationships between two variables, or bivariate statistics.

Sometimes the use of terms such as univariate or bivariate can make a novice learner feel intimidated. Rather than saying univariate and bivariate, we could just say one-variable statistics and two-variable statistics. Whenever we learn a new discipline, we are introduced to new vocabulary; once you understand the meaning of the new words, learning the new subject becomes easy.

As you have learned, the normal distribution is pertinent only to interval-level and ratio-level data. In this book, we use the term interval level for both interval-level and ratio-level data. A specific statistical test is applicable to a specific level of data. At the end of the book, we provide a summary template describing which test is appropriate for which level of data. In this chapter, we will discuss tests that can be used to find an association between two nominal-level variables.

The Research Question, the Level of Measurement, and the Choice of a Test

Our choice of a test depends on whether the level of measurement of our data is nominal, ordinal, or interval. What kind of data we should collect depends on the research question we need to tackle. For example, if our research topic is ethnic discrimination, then our first question will solicit the name of the ethnic group to which each respondent belongs. Because we are only soliciting names of ethnic groups, this variable will be a nominal-level variable. Our second question will ask about experiencing discrimination. If responses to the experience of discrimination by various ethnic groups are solicited as "yes" or "no," then this variable will also be a nominal-level variable. However, if we are interested in knowing about conservatism by ethnicity, then the question will be about the level of conservatism. In this case, conservatism will be an ordinal-level variable because conservatism may differ in degrees (e.g., liberal, moderately conservative, and highly conservative). If the purpose of the research is to find out the association between the level of conservatism and social class, then both variables will be ordinal-level variables. Finally, if the purpose of the research question is to find out an association between ethnicity and income, then the name of the ethnic group will be a nominal-level variable and income will be an interval-level variable. The level of measurement of data depends on our choice of a research question, and our choice of a statistical test to find a relationship between two variables depends on the level of measurement of the data.

In short, the following considerations are important in choosing a statistical test:

• the representativeness of the sample: Is the sample a random sample?
• the level of measurement of the variables: Is each variable nominal, ordinal, or interval level?
• the type of variables: Which variable is the dependent variable and which variable is the independent variable?
• finally, which test of significance or measure of association will be appropriate for the variables under consideration?

A template is provided at the end of this book to help you choose an appropriate test according to the level of measurement of the data.

Visual Evaluation of a Relationship between Two Variables

It is useful to create a cross-tabulation of two variables before setting up a test to find an association between them. Tables created by a cross-tabulation are called contingency tables. Always put the dependent variable in the rows of a table and the independent variable in the columns. The following example illustrates the use of contingency tables before applying a test of significance. Let's say we want to find out whether voting patterns are related to the sex of voters from the following table:

Table 11-1 Number of Voters by Party Preference

Party          Frequency
Liberal              150
Conservative         100
Democratic            50
Total                300

Table 11-1 is a univariate (one-variable) table; the only thing we can learn from this table is that half of the voters voted for the Liberal Party and the rest of the votes were split between the other two parties, with the Conservative Party taking double the Democratic Party's votes. We have

to cross-tabulate this data by party preference and sex to find a relationship between sex and voting by party preference. Let's say that voting by sex was as given in Tables 11-2 and 11-3:

Table 11-2 Number of Voters by Party Preference and Sex of Respondents

Party          Male   Female   Total
Liberal         100       50     150
Conservative     70       30     100
Democratic       25       25      50
Total           195      105     300

When you study Table 11-2, it seems that men overwhelmingly vote for the Liberal Party. But Table 11-3 seems to refute this: it shows that, just like men, close to half of women also voted for the Liberal Party. Similarly, the raw numbers in Table 11-2 give the impression that more men than women are Conservative voters and that both sexes are Democratic voters in equal numbers. Table 11-3 shows that men are more likely than women to vote for the Conservative Party, whereas women are more likely than men to vote for the Democratic Party.

Table 11-3 Percentage of Voters by Party Preference and Sex of Respondents

Party          Male   Female   Total
Liberal         51%     48%     50%
Conservative    36%     29%     33%
Democratic      13%     24%     17%
Total          100%    100%    100%

It's hard to say from these tables whether there is a statistically significant association between party preference and the sex of a voter. We need to apply a statistical test to find this out. The chi-square (χ²) is a test of statistical significance that can be applied when one or more variables in a study are nominal level. Because both party preference and sex are nominal-level variables, we can use the chi-square (χ²) to test for an association between these two variables.
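If you are following along in software, a cross-tabulation like Tables 11-2 and 11-3 can be produced directly from raw records. The following is a minimal sketch in Python (assuming the pandas library is installed); the individual voter records are hypothetical, constructed only to match the frequencies in Table 11-2.

```python
import pandas as pd

# Hypothetical voter records matching the counts in Table 11-2
voters = pd.DataFrame({
    "sex":   ["Male"] * 195 + ["Female"] * 105,
    "party": (["Liberal"] * 100 + ["Conservative"] * 70 + ["Democratic"] * 25 +
              ["Liberal"] * 50 + ["Conservative"] * 30 + ["Democratic"] * 25),
})

# Contingency table of raw counts (Table 11-2)
counts = pd.crosstab(voters["party"], voters["sex"], margins=True)
print(counts)

# Column percentages (Table 11-3): each column sums to 100%
percentages = pd.crosstab(voters["party"], voters["sex"], normalize="columns") * 100
print(percentages.round(0))
```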


The Chi-Square (χ²): A Test of Significance

The chi-square was introduced by Karl Pearson.10 It is a non-parametric test, which means that the data is not required to fit a normal distribution. Though the chi-square can be used for data of all levels of measurement (nominal, ordinal, and interval), it has its limitations. Because the value of the chi-square is always a positive number, you cannot determine from its value whether the scores of one variable will increase or decrease when the scores of the other variable increase. Hence, it cannot measure the direction of a relationship between two variables. It also cannot measure the strength of the relationship. It can only tell us whether the relationship between two variables is statistically significant or not.

Uses of the Chi-Square (χ²) Test

1. The chi-square can be used on cross-tabulated data to test whether membership in one category is related to membership in another category. For example, using the data in Table 11-2, you can test whether being a member of a particular sex (male or female) is related to voting preference for a particular party.
2. The chi-square can also be used to test the independence of cross-tabulated variables. For example, you can test whether voting preference for a particular party depends on a voter's sex.
3. It is also used to test goodness of fit, that is, to test whether observed data fits an expectation. For example, an auto manufacturer may have sold two small cars and three medium-sized cars for every one large car. When the company introduces a new brand of cars, management expects the same pattern in the sales of the new brand. Using the chi-square, the manufacturer can test whether the observed sales pattern meets the expected sales pattern.

10 Pearson, Karl. “Contributions to the Mathematical Theory of Evolution—I. On the Dissection of Asymmetrical Frequency-Curves.” Philosophical Transactions CLXXXV (1894): 80. Quoted in Haan, Michael.


Some Requirements for a Chi-Square Test

1. As shown in Table 11-2, you need data in the form of a frequency table.
2. There should be only one category for each possible case, which means there should be no duplication. For example, as mentioned in Chapter 3 in the case of designated-group data, the same person could be counted in three categories: women, Aboriginal persons, and persons with disability. The chi-square should not be used if there is a possibility that a respondent or an observation belongs to more than one category. Each observation is required to be independent of the others.
3. Each cell of the contingency table should have at least five observations (counts). As a convention, at least 80% of the cells of the table should have counts of five or more, and no cell of the table should have a count of 0.
4. Data should have been collected using a probability sampling technique.

Calculation Steps for the Chi-Square (χ²)

As mentioned, the chi-square (χ²) test is used to test for independence between two variables. For example, the test can be used to find out whether the IQ of children depends on their parents' income, whether the size of a house depends on the size of the family, or whether the size of an office depends on the position of an employee. The chi-square (χ²) is also used to compare an observed pattern of frequencies with an expected pattern of frequencies. For example, it can be used to determine whether the affiliation of respondents with a particular political party depends on their religion. The following discussion shows stepwise calculations for the chi-square (χ²) as a test of dependency and as a test of “goodness of fit.”

The Chi-Square (χ²): A Test of Dependency

Using the data in the following contingency table, let's test whether a respondent's party affiliation depends on his or her religion.

Table 11-4 Party Affiliation by Religion (Observed Frequencies)

Party          Protestant   Catholic   Other   Row Total
Liberal             7           5         8        20
Conservative        6           7        12        25
Democratic          5           8         7        20
Column Total       18          20        27        65

Step 1: Formulation of Hypothesis
- Null hypothesis (H0): Party affiliation does not depend on the respondent's religion.
- Alternate hypothesis (H1): Party affiliation is related to the respondent's religion.

Step 2: Statistical Test: Since party affiliation and religion are nominal variables, and the data is in the form of a frequency table, the chi-square (χ²) is an appropriate test of the independence of party affiliation and religion.

Step 3: Significance Level: We test at a conventional level of significance of α = 0.05. In other words, we will reject or fail to reject the null hypothesis with 95% confidence. This implies that we are prepared to take only 1 chance in 20 of making a wrong decision.

Step 4: Sampling Distribution: The calculated value of the chi-square will be compared with the values of the chi-square in the sampling distribution in Appendix Table 2A.

Step 5: Calculations: The chi-square is calculated with the help of the following formula:

χ² = Σ [(Observed Frequency − Expected Frequency)² / Expected Frequency] = Σ [(fo − fe)² / fe]

Expected frequencies are the frequencies that we expect to find in each cell of a contingency table. For example, if a school board has a regulation that each class in a school should have no more than 20 students, then a class size of 20 is the expected frequency for each class.


Expected Frequencies

The formula for the chi-square requires an expected frequency corresponding to the observed frequency of each cell of the table. The expected frequencies for the observed frequencies of each cell of Table 11-4 are calculated in Table 11-4a and given in Table 11-4b. To calculate the expected frequency for a cell of a contingency table, simply multiply the row total by the column total corresponding to the cell and divide by the total number of frequencies (n) in the table:

Table 11-4a Party Affiliation by Religion, Calculation of Expected Frequencies

Party          Protestant       Catholic         Other            Row Total
Liberal        (20 × 18) ÷ 65   (20 × 20) ÷ 65   (20 × 27) ÷ 65      20
Conservative   (25 × 18) ÷ 65   (25 × 20) ÷ 65   (25 × 27) ÷ 65      25
Democratic     (20 × 18) ÷ 65   (20 × 20) ÷ 65   (20 × 27) ÷ 65      20
Column Total        18               20               27             65

Table 11-4b Party Affiliation by Religion, Expected Frequencies

Party          Protestant   Catholic   Other   Row Total
Liberal            5.5         6.2       8.3      20.0
Conservative       6.9         7.7      10.4      25.0
Democratic         5.5         6.2       8.3      20.0
Column Total      18.0        20.0      27.0      65.0
Note: Totals may not add up due to rounding.

Now we can pair the calculated expected frequency of each cell with its observed frequency and apply the chi-square formula.

Table 11-4c Party Affiliation by Religion, Calculation of Chi-Square

Cell                             fo     fe    (fo − fe)   (fo − fe)²   (fo − fe)²/fe
Cell 1 Liberal Protestant         7    5.5       1.5         2.25          0.41
Cell 2 Liberal Catholic           5    6.2      −1.2         1.44          0.23
Cell 3 Liberal Other              8    8.3      −0.3         0.09          0.01
Cell 4 Conservative Protestant    6    6.9      −0.9         0.81          0.12
Cell 5 Conservative Catholic      7    7.7      −0.7         0.49          0.06
Cell 6 Conservative Other        12   10.4       1.6         2.56          0.25
Cell 7 Democratic Protestant      5    5.5      −0.5         0.25          0.05
Cell 8 Democratic Catholic        8    6.2       1.8         3.24          0.52
Cell 9 Democratic Other           7    8.3      −1.3         1.69          0.20
                                                                       Σ = 1.85

χ² = Σ [(fo − fe)² / fe] = 1.85

Step 6: Number of Degrees of Freedom (df): The formula for the number of degrees of freedom is:
- df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
- df = (3 − 1)(3 − 1) = 2 × 2 = 4

Step 7: Decision:
- Calculated value of χ² = 1.85
- df = 4
- α = 0.05
- The tabulated (critical) value of χ² for df = 4 and α = 0.05 from Appendix Table 2A is 9.49.
- The calculated value of the chi-square of 1.85 does not exceed the tabulated (critical) value of 9.49; therefore, we fail to reject H0. In our data, party affiliation does not depend on the religion of the respondent. In other words, there is no statistically significant relationship between party affiliation and religion.
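The hand calculation above can be checked in software. The following is a minimal sketch (assuming Python with SciPy installed); scipy.stats.chi2_contingency computes the expected frequencies, the chi-square statistic, the degrees of freedom, and a p-value from the observed table.

```python
from scipy.stats import chi2_contingency

# Rows: Liberal, Conservative, Democratic; columns: Protestant, Catholic, Other
observed = [
    [7, 5, 8],
    [6, 7, 12],
    [5, 8, 7],
]

chi2, p, df, expected = chi2_contingency(observed)
print(round(chi2, 2), df)  # about 1.86 with df = 4 (the hand calculation rounds to 1.85)
print(round(p, 3))         # well above 0.05, so we fail to reject H0
```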

Chi-Square (χ²): A Test of “Goodness of Fit”

As mentioned, the chi-square is also used to test “goodness of fit,” that is, to find out whether observed data fits an expectation. Suppose a cereal manufacturer sold 80 small boxes and 70 medium-sized boxes for every 60 large boxes before the introduction of a new cereal. After introducing the new cereal, the manufacturer sold 70 small boxes and 80 medium-sized boxes for every 90 large boxes. Let's use the chi-square to find out whether the observed pattern of sales after the introduction of the new cereal fits the pattern of sales expected by the manufacturer. The procedure for calculating the chi-square is the same; only the interpretation changes. You take the same steps to calculate the chi-square.

Step 1: Formulation of Hypothesis
- Null hypothesis (H0): The observed pattern of cereal sales by box size is not different from the expected pattern of cereal sales.
- Alternate hypothesis (H1): The observed pattern of cereal sales by box size is significantly different from the expected pattern of cereal sales.

Step 2: Statistical Test: Chi-square

Step 3: Significance Level: α = 0.05

Step 4: Sampling Distribution: The calculated value of the chi-square is compared with the values of the chi-square in the sampling distribution in Appendix Table 2A.

Table 11-5 Sales Patterns before and after the Introduction of a New Cereal, Size of the Cereal Boxes (Observed Frequencies)

Introduction of New Cereal   Small   Medium   Large   Row Total
Before                         80       70      60       210
After                          70       80      90       240
Column Total                  150      150     150       450

Step 5: Calculations: Calculate the values of the expected frequencies to plug into the following chi-square formula:

χ² = Σ [(Observed Frequency − Expected Frequency)² / Expected Frequency] = Σ [(fo − fe)² / fe]

Expected Frequencies

Table 11-5a below shows the calculations of the expected frequencies after the introduction of the new cereal, and Table 11-5c shows the calculations for the chi-square.

Table 11-5a Calculations of the Expected Frequencies after the Introduction of a New Cereal

Introduction of New Cereal   Small               Medium              Large               Row Total
Before                       (210 × 150) ÷ 450   (210 × 150) ÷ 450   (210 × 150) ÷ 450      210
After                        (240 × 150) ÷ 450   (240 × 150) ÷ 450   (240 × 150) ÷ 450      240
Column Total                       150                 150                 150               450

Table 11-5b Expected Sales Patterns after the Introduction of a New Cereal

Introduction of New Cereal   Small   Medium   Large   Row Total
Before                         70       70      70       210
After                          80       80      80       240
Column Total                  150      150     150       450

Table 11-5c Sales Patterns before and after the Introduction of a New Cereal, Calculations of the Chi-Square

Cell                  fo    fe   (fo − fe)   (fo − fe)²   (fo − fe)²/fe
Cell 1 Before Small   80    70      10          100       100 ÷ 70 = 1.43
Cell 2 Before Medium  70    70       0            0         0 ÷ 70 = 0
Cell 3 Before Large   60    70     −10          100       100 ÷ 70 = 1.43
Cell 4 After Small    70    80     −10          100       100 ÷ 80 = 1.25
Cell 5 After Medium   80    80       0            0         0 ÷ 80 = 0
Cell 6 After Large    90    80      10          100       100 ÷ 80 = 1.25
                                                              χ² = 5.36

Step 6: Number of Degrees of Freedom (df): The formula for the number of degrees of freedom is:
- df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
- df = (2 − 1) × (3 − 1) = 1 × 2 = 2

Step 7: Decision
- Calculated value of χ² = 5.36
- df = 2
- α = 0.05
- The tabulated (critical) value of χ² for df = 2 and α = 0.05 in Appendix Table 2A is 5.99.

The calculated value of 5.36 does not exceed the tabulated (critical) value of 5.99; therefore, we fail to reject H0. This means that the pattern of sales observed after the introduction of the new cereal is not statistically different from the pattern expected on the basis of sales before its introduction. In other words, we fail to reject the null hypothesis that there are no statistically significant differences in the patterns of sales before and after the introduction of the new cereal.
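As a check, the same 2 × 3 comparison can be run in software. The following is a minimal sketch assuming Python with SciPy installed.

```python
from scipy.stats import chi2_contingency

# Rows: before, after; columns: small, medium, large boxes
sales = [
    [80, 70, 60],
    [70, 80, 90],
]

chi2, p, df, expected = chi2_contingency(sales)
print(round(chi2, 2), df)  # about 5.36 with df = 2; the critical value at alpha = 0.05 is 5.99
```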

Measuring the Strength of Association

There are several measures of association that can be applied to nominal-level variables. Two measures, phi (φ) and Cramer's V, are derived by a simple modification of the chi-square. A third measure, lambda (λ), is a measure of proportional reduction of error (PRE).

Phi (φ)

Phi is a modification of the chi-square. It is used to estimate an association between two nominal-level variables that each have only two response categories. Let's say we want to determine whether math students who took a remedial course did well on the final test. Table 11-6 gives the number of students who took the remedial course and did well on the final test. Both are nominal variables with “yes” and “no” response categories.

Step 1: Contingency Table: Create a contingency table with “Took Remedial Course” as one variable and “Did Well on the Test” as the other. For phi, it does not matter which variable is in the columns and which is in the rows.

Table 11-6 Took Remedial Course and Did Well on the Test, Observed Frequencies (fo)

                        Did Well on the Test
Took Remedial Course    Yes     No     Total
Yes                      10     90      100
No                       90    310      400
Total                   100    400      500

Step 2: Calculation of Expected Frequencies:
- From the observed frequencies in Table 11-6, calculate the expected frequencies with the formula (Column Total × Row Total) ÷ Total Number of Respondents (N), as shown in Table 11-6a.

Table 11-6a Took Remedial Course and Did Well on the Test, Calculations for Expected Frequencies (fe)

                        Did Well on the Test
Took Remedial Course    Yes                      No                        Total
Yes                     100 × 100 ÷ 500 = 20     100 × 400 ÷ 500 = 80       100
No                      400 × 100 ÷ 500 = 80     400 × 400 ÷ 500 = 320      400
Total                   100                      400                        500

- Pair the expected frequency of each cell with its observed frequency, as shown in Table 11-6b.

Table 11-6b Took Remedial Course and Did Well on the Test, Expected Frequencies (fe)

                        Did Well on the Test
Took Remedial Course    Yes     No     Total
Yes                      20     80      100
No                       80    320      400
Total                   100    400      500

Step 3: Calculation of the Chi-Square: Calculate the chi-square as shown in Table 11-6c.

Table 11-6c Calculation of the Chi-Square

Took Remedial Course   Did Well on the Test    fo     fe    (fo − fe)   (fo − fe)²   (fo − fe)²/fe
Yes                    Yes                     10     20       −10         100           5.00
No                     Yes                     90     80        10         100           1.25
Yes                    No                      90     80        10         100           1.25
No                     No                     310    320       −10         100           0.31
                                                                     Chi-square (χ²) = 7.81

χ² = Σ [(fo − fe)² / fe] = 7.81

Step 4: Calculation of Phi (φ): Divide the value of the chi-square (7.81) by the number of respondents (500) and take the square root of the result:

φ = √(χ²/N) = √(7.81/500) = √0.0156 = 0.125

Interpretation of Phi

Value of Phi                Strength of Association
Between 0.00 and 0.10       Weak
Between 0.10 and 0.30       Moderate
Greater than 0.30           Strong

In the above hypothetical example, the calculated value of the chi-square is 7.81. The degrees of freedom are (c − 1) × (r − 1) = (2 − 1) × (2 − 1) = 1 × 1 = 1. The tabulated (critical) value of the chi-square from Appendix Table 2A for 1 degree of freedom and α = 0.05 is 3.84. Because the calculated value of 7.81 exceeds the tabulated value of 3.84, we reject the null hypothesis that there is no association between taking a remedial course and doing well on the test. This means that there is a statistically significant association between taking a remedial course and doing well on the test. The phi value of 0.125, however, indicates that this association is only moderate.

Some Points to Remember
- Phi is used for 2 × 2 tables (tables that have 2 columns and 2 rows).
- The value of phi varies between 0 and 1 for 2 × 2 tables; phi can exceed 1 in larger tables.
- Because it is a measure for nominal data, phi indicates only the strength of a relationship. We cannot predict the direction of a relationship with phi. In other words, with the help of phi we can only say whether the relationship between two variables is weak, moderate, or strong, but we cannot say whether one variable will increase or decrease when the other increases.
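A minimal sketch of the phi calculation, assuming Python with SciPy installed, is shown below. Passing correction=False disables the Yates continuity correction so the statistic matches the hand calculation of 7.81 above.

```python
from math import sqrt
from scipy.stats import chi2_contingency

# Table 11-6: remedial course (rows) by test result (columns)
table = [[10, 90], [90, 310]]

chi2, p, df, expected = chi2_contingency(table, correction=False)
n = sum(sum(row) for row in table)

phi = sqrt(chi2 / n)  # phi = sqrt(chi-square / N)
print(round(phi, 3))  # 0.125
```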

Cramer's V

Cramer's V is similar to phi: it is also used to determine the strength of a relationship between two nominal variables. Its advantage over phi is that it overcomes phi's 2 × 2 table limitation: Cramer's V can be used for a contingency table of any size. Though its formula looks different, it is quite similar to that of phi. It adjusts the denominator so that the value of the measure remains between 0 and 1. This adjustment is made by multiplying the number of observations (N) by the number of rows or the number of columns (whichever is smaller) minus 1.

As you know, the formula for phi is φ = √(χ²/N). In the case of Cramer's V, we multiply N in the denominator by the number of rows minus 1 or the number of columns minus 1, whichever is smaller. The formula is as follows:

V = √(χ² / (N × min(r − 1, c − 1)))

The value of the chi-square calculated from the data in Table 11-5 in Step 5 was 5.36.

Step 6: Calculate Cramer's V: Table 11-5 has 2 rows and 3 columns. Its number of rows minus 1 is (2 − 1) = 1, and its number of columns minus 1 is (3 − 1) = 2. The rows term (r − 1) is smaller, so we use N × (r − 1) in the denominator of the formula for Cramer's V. N in Table 11-5 is 450. Now, let's plug these values into the formula:

V = √(χ² / (N × min(r − 1, c − 1)))
  = √(5.36 / (450 × (2 − 1)))
  = √(5.36 / 450)
  = √0.0119
  = 0.109 ≈ 0.11

Value of Cramer's V         Strength of Association
Between 0.00 and 0.10       Little if any association
Between 0.11 and 0.30       Low
Between 0.31 and 0.50       Moderate
Greater than 0.50           Strong

The value of 0.11 for Cramer's V indicates that there is a low association between the patterns of sales before and after the introduction of the new cereal.
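The same calculation, as a minimal sketch assuming Python with SciPy installed, applied to the cereal-sales table used above:

```python
from math import sqrt
from scipy.stats import chi2_contingency

sales = [[80, 70, 60], [70, 80, 90]]

chi2, p, df, expected = chi2_contingency(sales)
n = sum(sum(row) for row in sales)
rows, cols = len(sales), len(sales[0])

# Cramer's V = sqrt(chi-square / (N * min(r - 1, c - 1)))
v = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
print(round(v, 2))  # about 0.11
```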

Lambda (λ)

Lambda is also used to measure an association between two nominal variables. Lambda varies between 0 and 1. It is called a measure of proportional reduction of error (PRE) because it measures the percentage increase in our ability to predict a dependent variable by knowing the independent variable. For example, if the sex of a person is the independent variable and height is the dependent variable, lambda tells us how much our ability to predict height improves when we know the sex of the person. In other words, how much error in the prediction of the dependent variable is reduced by knowledge of the independent variable? Let's say we want to find the association between a person's attitude toward abortion and his or her religion from the data in Table 11-7.

Table 11-7 Attitude toward Abortion by Religion

Attitude       Catholic   Protestant   Other   No Religion   Row Totals
Favours            8          10         12        15            45
Neutral            5           8          7        10            30
Opposed           10           5          8        12            35
Column Total      23          23         27        37           110

The formula for lambda is:

λ = (E1 − E2) / E1

Step 1: Calculate E1, the total number of observations minus the mode of the row totals (i.e., the largest row total):

E1 = 110 − 45 = 65

Step 2: Calculate E2, the total of each column minus the mode of the respective column (i.e., the largest frequency in that column):

E2 = (first column total − first column mode) + (second column total − second column mode) + (third column total − third column mode) + (fourth column total − fourth column mode)
   = (23 − 10) + (23 − 10) + (27 − 12) + (37 − 15)
   = 13 + 13 + 15 + 22
   = 63

Step 3: Calculate Lambda:

λ = (E1 − E2) / E1 = (65 − 63) / 65 = 2/65 = 0.031

Step 4: Interpretation: Multiply the value of lambda by 100 to find the percentage improvement in the prediction of the dependent variable gained by taking the independent variable into account. In this case, by knowing the religion of a respondent, the prediction of his or her attitude toward abortion improves by 0.031 × 100 = 3.1%.

Non-parametric measures do not use the normal distribution for their interpretation, but they are roughly based on ideas borrowed from parametric measures, which do. You will learn in the analysis of variance (ANOVA) that a ratio of the differences between groups to the differences within groups is used to interpret the significance of differences. Here, E1 is equivalent to the total difference, E2 is equivalent to the within-group difference, and E1 − E2 is equivalent to the between-group difference. In this sense, lambda is a ratio of the between-group difference (E1 − E2) to the total difference. You will see that in the analysis of variance a similar ratio (the F-ratio) is used to decide whether to reject or fail to reject a null hypothesis.
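A minimal sketch of the same calculation in plain Python (no external libraries), built directly from the columns of Table 11-7:

```python
# religion -> attitude counts [Favours, Neutral, Opposed]
columns = {
    "Catholic":    [8, 5, 10],
    "Protestant":  [10, 8, 5],
    "Other":       [12, 7, 8],
    "No Religion": [15, 10, 12],
}

row_totals = [sum(col[i] for col in columns.values()) for i in range(3)]
n = sum(row_totals)

e1 = n - max(row_totals)                                    # prediction errors ignoring religion
e2 = sum(sum(col) - max(col) for col in columns.values())   # prediction errors knowing religion
lam = (e1 - e2) / e1
print(round(lam, 3))  # about 0.031
```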

Exercises for Practice

Q1. The chi-square test can be used to test:
  a. whether membership in one category is related to membership in another category.
  b. the independence of variables.
  c. goodness of fit.
  d. All of the above.

Q2. To use the chi-square test:
  a. each observation is required to be independent of the others.
  b. the data is not required to fit a normal distribution.
  c. Both a and b.
  d. None of the above.

Q3. For the chi-square test:
  a. at least 80% of the cells should have counts of more than 5.
  b. no cell of the table should have 0 counts.
  c. Both a and b.
  d. None of the above.

Q4. Select a true statement:
  a. Phi is used for tables that have more than 2 rows.
  b. The value of Phi cannot exceed 1 in large tables.
  c. We can predict the direction of change with Phi.
  d. None of the above.

Q5. The advantage of Cramer's V over Phi is that:
  a. it can be used for any size of contingency table.
  b. the value of the measure remains between 0 and 1.
  c. Both a and b.
  d. None of the above.

Q6. Calculate the chi-square by completing the templates of Tables E6a, E6b, and E6c to test whether the number of rooms in the house depends on the number of persons in the family.
  a. Formulate the null hypothesis (H0)
  b. Formulate the alternate hypothesis (H1)
  c. Decision at α = 0.05

Table E6 Observed Values

                      Number of Families
Size of the House     1 to 2 Persons   3 to 4 Persons   5 or More Persons   Row Total
1-3 Rooms                   10                5                 10              25
4-5 Rooms                    5               10                  5              20
6 or More Rooms              5               15                  5              25
Column Total                20               30                 20              70

Table E6a Calculation of Expected Values

                      Number of Families
Size of the House     1 to 2 Persons   3 to 4 Persons   5 or More Persons   Row Total
1-3 Rooms                                                                       25
4-5 Rooms                                                                       20
6 or More Rooms                                                                 25
Column Total                20               30                 20              70

Table E6b Expected Values

                      Number of Families
Size of the House     1 to 2 Persons   3 to 4 Persons   5 or More Persons   Row Total
1-3 Rooms                                                                       25
4-5 Rooms                                                                       20
6 or More Rooms                                                                 25
Column Total                20               30                 20              70

Table E6c Calculations of the Chi-Square

Cell                                        fo   fe   (fo − fe)   (fo − fe)²   (fo − fe)²/fe
Cell 1 1-3 Rooms, 1 to 2 Persons
Cell 2 1-3 Rooms, 3 to 4 Persons
Cell 3 1-3 Rooms, 5 or More Persons
Cell 4 4-5 Rooms, 1 to 2 Persons
Cell 5 4-5 Rooms, 3 to 4 Persons
Cell 6 4-5 Rooms, 5 or More Persons
Cell 7 6 or More Rooms, 1 to 2 Persons
Cell 8 6 or More Rooms, 3 to 4 Persons
Cell 9 6 or More Rooms, 5 or More Persons
                                                                              Σ =

Q7. To determine the relationship between religion and marital status from the following contingency table:

                 Married   Divorced   Row Totals
Catholic            40        10          50
Protestant          25         5          30
Jewish              15         5          20
Column Totals       80        20         100

  a. Find the value of the chi-square
  b. Formulate the null hypothesis (H0)
  c. Formulate the alternate hypothesis (H1)
  d. Decision at α = 0.05
  e. Calculate the value of Cramer's V and find the strength of the relationship.

Q8. From the following table, calculate lambda (λ) and find out how much the prediction of party affiliation can be improved by knowing the religion of the respondent.

                    Religion
Party          Protestant   Catholic   Other   Row Total
Liberal             15          20        20        55
Conservative        10          40        15        65
Democratic          10          10        25        45
Column Total        35          70        60       165

CHAPTER TWELVE

TESTS OF SIGNIFICANCE FOR ORDINAL-LEVEL VARIABLES

Learning Objectives

In this chapter, you will learn measures of association suitable for ordinal-level data. Specifically, you will learn the following four measures:
- Kruskal's gamma (γ);
- Spearman's rho (ρs);
- Somers' D; and
- Kendall's Tau-b (τb).

Introduction

We discussed in Chapter 2 that we can order the response categories or observations of an ordinal-level variable (e.g., social class is an ordinal-level variable because its response categories, lower, middle, and upper class, can be ranked from low to high or vice versa). Because of this property, we can ascertain whether the dependent variable increases or decreases as a related independent variable increases or decreases. In other words, we can measure the direction of change between related ordinal-level variables. For this reason, ordinal-level data is more desirable than nominal-level data. As mentioned in the last chapter, the chi-square can be used to find an association between two nominal-level variables, between a nominal-level and an ordinal-level variable, or between two ordinal-level variables. There are, however, several other measures designed specifically for ordinal-level variables. The following four are commonly used to determine the statistical association between two ordinal-level variables:
- Kruskal's gamma (γ);
- Spearman's rho (ρs);
- Somers' D; and
- Kendall's Tau-b (τb).

Kruskal's Gamma (γ)

Let's say we want to measure the association between a respondent's religiosity and approval of abortion from the data in Table 12-1. The table gives the values of an independent variable, religiosity (x), and a dependent variable, support for abortion (y). Since both variables are ordinal level, we can use gamma (γ) to measure the association between them.

Table 12-1 Support for Abortion (y) by Religiosity (x)

Support for Abortion (y)   Low Religiosity   High Religiosity   Row Totals
Low Support                       5                 10              15
High Support                     15                 20              35
Column Totals                    20                 30              50

Step 1: Formulation of Hypothesis
- Null hypothesis (H0): The level of religiosity does not influence the level of support for abortion.
- Alternate hypothesis (H1): The level of religiosity influences the level of support for abortion.

To calculate gamma (γ), we need to identify the case pairs that are ranked in the same order (Ns) on both variables and the case pairs that are ranked in a different order (Nd) on both variables.

Step 2: Same Order Case Pairs (Ns)

In Table 12-1, the pairs in the same order (Ns) combine “low religiosity and low support for abortion” (5) with “high religiosity and high support for abortion” (20):

Ns = 5 × 20 = 100

Step 3: Different Order Case Pairs (Nd)

The case pairs in a different order combine “low religiosity and high support for abortion” (15) with “high religiosity and low support for abortion” (10):

Nd = 15 × 10 = 150

Step 4: Calculation of Gamma (γ)

γ = (Ns − Nd) / (Ns + Nd), where Ns is the number of same order pairs and Nd is the number of different order pairs.

γ = (100 − 150) / (100 + 150) = −50/250 = −0.20

Step 5: Interpreting Gamma (γ) Values

Value                          Strength of Association
Negative Relationship
Between 0.00 and −0.10         Weak
Between −0.11 and −0.30        Moderate
Greater than −0.30             Strong
Positive Relationship
Between 0.00 and 0.10          Weak
Between 0.11 and 0.30          Moderate
Greater than 0.30              Strong

The value of gamma (γ) can vary between −1.0 (perfect negative association) and +1.0 (perfect positive association). If the relationship between an independent and a dependent variable is positive, the value of gamma will be positive. The value of our gamma (γ) is −0.20; therefore, the relationship between these two variables is negative and moderate. It means that as religiosity increases, support for abortion decreases moderately.

Gamma is also a measure of proportional reduction of error (PRE). An equivalent interpretation of our result is that we would make 20% fewer errors in predicting support for abortion if we knew a person's level of religiosity. In other words, our prediction of support for abortion improves by 20% when we know a person's level of religiosity.
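A minimal sketch of the gamma calculation for the 2 × 2 table above, in plain Python. For a 2 × 2 table, gamma reduces to (Ns − Nd) / (Ns + Nd), where Ns and Nd are the concordant and discordant pair counts.

```python
table = [[5, 10],    # low support:  low religiosity, high religiosity
         [15, 20]]   # high support: low religiosity, high religiosity

ns = table[0][0] * table[1][1]  # same order pairs: 5 * 20 = 100
nd = table[0][1] * table[1][0]  # different order pairs: 10 * 15 = 150
gamma = (ns - nd) / (ns + nd)
print(gamma)  # -0.2
```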

Spearman's Rho (ρs), or Spearman's Rank Correlation Coefficient

It is cumbersome to calculate gamma (γ) for a table larger than 2 × 2. For such data, it is advisable to use another measure for ordinal-level data: Spearman's rho (ρs), also called Spearman's rank correlation coefficient. Because Spearman's rho (ρs) uses the ranks of observations, it is an ideal measure for ordinal-level data that can only be ranked. The following formula for Spearman's rho (ρs) is relatively easy to work with:

ρs = 1 − (6 × ΣD²) / (n × (n² − 1)),

where D is the difference between an observation's ranks on the two variables and n is the total number of observations.

Let's say we want to know whether there is a relationship between the rankings of the gross domestic product (GDP) and the Human Development Index (HDI) of the following five nations. We take the following simple steps to calculate Spearman's rho:

Step 1: Formulation of Hypothesis
- Null hypothesis (H0): There is no relationship between a nation's GDP and its Human Development Index.
- Alternate hypothesis (H1): There is a positive relationship between a nation's GDP and its Human Development Index.

Step 2: Ranking the Independent Variable (x) and the Dependent Variable (y)

Create a column Rank1 for ranking the independent variable (x) and a column Rank2 for ranking the dependent variable (y), and then assign ranks to the values of both variables as shown in Table 12-2.

Step 3: Calculate the Values of D (i.e., the difference between Rank1 and Rank2)

Subtract the values of Rank2 from the values of Rank1, as shown in Table 12-2.

Step 4: Calculate the Squared Values of D

Square each value of D, as shown in the last column of Table 12-2.

Step 5: Add the squared values of D to calculate ΣD².

Table 12-2 The Gross Domestic Product and the Human Development Index of Selected Nations

          GDP (x)   Rank1   HDI (y)   Rank2   D = Rank1 − Rank2   D²
Japan     $50,000     1      0.960      2        1 − 2 = −1       (−1)² = 1
USA       $45,000     2      0.956      3        2 − 3 = −1       (−1)² = 1
Canada    $40,000     3      0.966      1        3 − 1 =  2        (2)² = 4
Russia    $15,000     4      0.817      4        4 − 4 =  0        (0)² = 0
China      $7,000     5      0.772      5        5 − 5 =  0        (0)² = 0

ΣD² = 6

Step 6: Calculation of Spearman's Rho (ρs)

There are five nations in the study; therefore, n is 5. Plug the number of cases (n) and the calculated value of ΣD² into the formula:

ρs = 1 − (6 × ΣD²) / (n × (n² − 1))
   = 1 − (6 × 6) / (5 × (5² − 1))
   = 1 − 36 / (5 × (25 − 1))
   = 1 − 36 / (5 × 24)
   = 1 − 36/120
   = 1 − 0.3
ρs = 0.7

Interpretation of Spearman's Rho (ρs)

We can use the following guide to interpret the strength of Spearman's rho (ρs):

Value         Strength of Association
0–0.19        Very weak
0.20–0.39     Weak
0.40–0.59     Moderate
0.60–0.79     Strong
0.80–1.00     Very strong

The value of Spearman's rho (ρs) can vary between −1.0 and +1.0. Our value of 0.7 indicates that there is a strong, positive relationship between the gross domestic product (GDP) and the Human Development Index (HDI): as the GDP increases, the HDI also increases.
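A minimal sketch of the same calculation, assuming Python with SciPy installed. scipy.stats.spearmanr ranks the raw values itself, so we can pass the GDP and HDI figures directly.

```python
from scipy.stats import spearmanr

gdp = [50000, 45000, 40000, 15000, 7000]   # Japan, USA, Canada, Russia, China
hdi = [0.960, 0.956, 0.966, 0.817, 0.772]

rho, p_value = spearmanr(gdp, hdi)
print(round(rho, 2))  # 0.7
```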

Significance Level of Spearman's Rho (ρs)

The statistical significance of Spearman's rho (ρs) can be assessed by calculating a t value with the following formula and then comparing it with the critical value of t in the t-table given in Appendix Table 3A:

t = ρs × √((n − 2) / (1 − ρs²))
  = 0.7 × √((5 − 2) / (1 − 0.7²))
  = 0.7 × √(3 / (1 − 0.49))
  = 0.7 × √(3 / 0.51)
  = 0.7 × √5.9
  = 0.7 × 2.43
t = 1.70

There are five countries, and two degrees of freedom are lost in the estimation; therefore, the degrees of freedom are df = 5 − 2 = 3. The calculated value of t (1.70) does not exceed the tabulated (critical) value of t (2.35) corresponding to df = 3 and one-tailed α = 0.05 in Appendix Table 3A. Hence, we fail to reject H0 and conclude that, although the observed correlation between the GDP and the HDI is strong, with only five cases it is not statistically significant.

To find out how much error would be reduced in our prediction of the HDI by using the GDP as an independent variable, we square the value of Spearman's rho (ρs). The square of 0.7 is 0.49; therefore, the error in predicting the HDI from the GDP is reduced by 49%.

In the above example of Spearman's rho (ρs), no two observations shared the same rank on either variable (i.e., there were no tied ranks). Somers' D and Kendall's tau-b are two measures for ordinal-level data that adjust for tied ranks. The calculations for these two measures are simple but a bit more unwieldy than those for Spearman's rho. Although in real life you will use a computer to calculate Somers' D and Kendall's tau-b, the following easy-to-follow hand calculations are provided.

Somers' D

Let's say we want to know from the data in Table 12-3 whether there is a relationship between the frequency of church attendance and support for charitable giving. Both variables are ordinal level, and there are tied ranks on the dependent variable (support for charitable giving): among the supporters, those who attend church occasionally and those who attend rarely are tied (5 and 5), as are the occasional and rare attenders among the opposers (9 and 9). Therefore, we decide to calculate Somers' D using the following formula:

d = (Ns − Nd) / (Ns + Nd + Ty),

where Ns represents concordant pairs, or same order case pairs; Nd represents discordant pairs, or different order case pairs; and Ty represents ranks tied on the dependent variable (i.e., support for charitable giving in our example).

Step 1: Formulation of Hypothesis
- Null hypothesis (H0): The frequency of church attendance does not influence support for charitable giving.
- Alternate hypothesis (H1): The frequency of church attendance influences support for charitable giving.

Table 12-3 Support for Charitable Giving by Church Attendance

Support for          Church Attendance
Charitable Giving    Regularly     Occasionally   Rarely        Total
Support              13 (Cell 1)    5 (Cell 2)     5 (Cell 3)     23
Neutral               9 (Cell 4)    7 (Cell 5)     6 (Cell 6)     22
Oppose                5 (Cell 7)    9 (Cell 8)     9 (Cell 9)     23
Total                27            21             20              68

Step 2: Calculation of the Concordant Pairs, or Same Order Case Pairs (Ns)

The concordant pairs for a cell are calculated by adding the frequencies of the cells that are below and to the right of the cell and then multiplying the sum by the frequency of the cell. In Table 12-3, Cells 5, 6, 8, and 9 are below and to the right of Cell 1; therefore, they are concordant with Cell 1. The frequencies of Cells 5, 6, 8, and 9 are 7, 6, 9, and 9, respectively; these four frequencies are added (7 + 6 + 9 + 9 = 31) and then multiplied by the frequency in Cell 1, which is 13; therefore, the total of concordant observations for Cell 1 is 31 × 13 = 403. The calculations for the rest of the cells are shown in Table 12-3a.

Table 12-3a Calculations of Concordant Cells

                        Concordant Cells      Total of Concordant Observations   Total
Cell 1                  (Cells 5, 6, 8, 9)    7 + 6 + 9 + 9 = 31                 13 × 31 = 403
Cell 2                  (Cells 6, 9)          6 + 9 = 15                          5 × 15 = 75
Cell 3                  None                  0                                   0
Cell 4                  (Cells 8, 9)          9 + 9 = 18                          9 × 18 = 162
Cell 5                  (Cell 9)              9                                   7 × 9 = 63
Cell 6                  None                  0                                   0
Cell 7                  None                  0                                   0
Cell 8                  None                  0                                   0
Cell 9                  None                  0                                   0
Total Concordant (Ns)                                                             703

Step 3: Calculation of the Discordant Pairs, or Different Order Case Pairs (Nd)

The discordant pairs for a cell are calculated by adding the frequencies of the cells that are below and to the left of the cell and then multiplying the sum by the frequency of the cell. For example, in Table 12-3, there are no cells below and to the left of Cell 1; therefore, the discordant pairs for Cell 1 number 0. Cells 4 and 7 are below and to the left of Cell 2; therefore, they are discordant cells for Cell 2. The frequencies of Cells 4 and 7 are 9 and 5, respectively, and 9 + 5 = 14. This sum is then multiplied by the frequency in Cell 2, which is 5; therefore, the discordant pairs number 14 × 5 = 70. The calculations for the rest of the discordant pairs are shown in Table 12-3b.

Table 12-3b Calculations of Discordant Cells

                        Discordant Cells      Total of Discordant Observations   Total
Cell 1                  None                  0                                   0
Cell 2                  (Cells 4, 7)          9 + 5 = 14                          5 × 14 = 70
Cell 3                  (Cells 4, 5, 7, 8)    9 + 7 + 5 + 9 = 30                  5 × 30 = 150
Cell 4                  None                  0                                   0
Cell 5                  (Cell 7)              5                                   7 × 5 = 35
Cell 6                  (Cells 7, 8)          5 + 9 = 14                          6 × 14 = 84
Cell 7                  None                  0                                   0
Cell 8                  None                  0                                   0
Cell 9                  None                  0                                   0
Total Discordant (Nd)                                                             339

Step 4: Calculation of Ties (Ty) on the Dependent Variable

The cells to the right of a cell in the same row are considered tied with that cell. For example, in Table 12-3, Cell 1 is tied with Cells 2 and 3; Cell 2 is tied with Cell 3; Cell 4 is tied with Cells 5 and 6; Cell 5 is tied with Cell 6; Cell 7 is tied with Cells 8 and 9; and Cell 8 is tied with Cell 9. The calculations for ties are shown in Table 12-3c.

Table 12-3c Calculations of Ties (Ty) on the Dependent Variable

                  Tied Cells      Total of Tied Observations   Total
Cell 1            (Cells 2, 3)    5 + 5 = 10                   13 × 10 = 130
Cell 2            (Cell 3)        5                             5 × 5 = 25
Cell 3            None            0                             0
Cell 4            (Cells 5, 6)    7 + 6 = 13                    9 × 13 = 117
Cell 5            (Cell 6)        6                              7 × 6 = 42
Cell 6            None            0                             0
Cell 7            (Cells 8, 9)    9 + 9 = 18                    5 × 18 = 90
Cell 8            (Cell 9)        9                              9 × 9 = 81
Cell 9            None            0                             0
Total Ties (Ty)                                                 485

Step 5: Calculating Somers' D

Plug the values of Ns, Nd, and Ty into the formula:

d = (Ns − Nd) / (Ns + Nd + Ty) = (703 − 339) / (703 + 339 + 485) = 364/1527

d = 0.24

Step 6: Interpreting Somers' D

Value                     Strength of Association
Between 0.00 and 0.10     Weak
Between 0.11 and 0.30     Moderate
Greater than 0.30         Strong

The value of Somers' D can vary between −1.0 (perfect negative association) and +1.0 (perfect positive association). If the relationship between the independent and the dependent variable is positive, the value of Somers' D will be positive. Our value of Somers' D is 0.24; therefore, the relationship between the two variables is positive and moderate. It means that charitable giving increases moderately with an increase in church attendance.
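A minimal sketch in plain Python of the same procedure, for a table whose rows (dependent variable) and columns (independent variable) are both ordered as in Table 12-3:

```python
def somers_d(table):
    rows, cols = len(table), len(table[0])
    ns = nd = ty = 0
    for i in range(rows):
        for j in range(cols):
            # concordant: cells below and to the right
            ns += table[i][j] * sum(table[a][b] for a in range(i + 1, rows)
                                                for b in range(j + 1, cols))
            # discordant: cells below and to the left
            nd += table[i][j] * sum(table[a][b] for a in range(i + 1, rows)
                                                for b in range(j))
            # tied on the dependent (row) variable: cells to the right in the same row
            ty += table[i][j] * sum(table[i][b] for b in range(j + 1, cols))
    return (ns - nd) / (ns + nd + ty)

table = [[13, 5, 5], [9, 7, 6], [5, 9, 9]]
print(round(somers_d(table), 2))  # about 0.24
```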

Kendall's Tau-b

Kendall's tau-b takes one further step: it also accounts for ties on the independent variable. To calculate ties (Tx) on the independent variable, you consider the cells below a cell in the same column. For example, in Table 12-3, Cell 1 is tied with Cells 4 and 7; Cell 2 is tied with Cells 5 and 8; Cell 3 is tied with Cells 6 and 9; Cell 4 is tied with Cell 7; Cell 5 is tied with Cell 8; and Cell 6 is tied with Cell 9. Because there are no cells below Cells 7, 8, and 9, they are not tied with any cell. The values of Tx for the data in Table 12-3 are calculated in Table 12-3d.

Table 12-3d Calculations of Ties (Tx) on the Independent Variable

                  Tied Cells      Total of Tied Observations   Total
Cell 1            (Cells 4, 7)    9 + 5 = 14                   13 × 14 = 182
Cell 2            (Cells 5, 8)    7 + 9 = 16                    5 × 16 = 80
Cell 3            (Cells 6, 9)    6 + 9 = 15                    5 × 15 = 75
Cell 4            (Cell 7)        5                              9 × 5 = 45
Cell 5            (Cell 8)        9                              7 × 9 = 63
Cell 6            (Cell 9)        9                              6 × 9 = 54
Cell 7            None            0                             0
Cell 8            None            0                             0
Cell 9            None            0                             0
Total Ties (Tx)                                                 499

Step 5: Calculating Kendall's Tau-b

Plug the values of Ns, Nd, Ty, and Tx into the formula:

Tau-b = (Ns − Nd) / √((Ns + Nd + Ty) × (Ns + Nd + Tx))
      = (703 − 339) / √((703 + 339 + 485) × (703 + 339 + 499))
      = 364 / √(1527 × 1541)
      = 364 / √2,353,107
      = 364 / 1533.98

Tau-b = 0.24

Step 6: Interpreting Tau-b

Value                     Strength of Association
Between 0.00 and 0.10     Weak
Between 0.11 and 0.30     Moderate
Greater than 0.30         Strong

The value of tau-b can also vary between −1.0 (perfect negative association) and +1.0 (perfect positive association). If the relationship between the independent and the dependent variable is positive, the value of tau-b will be positive. Our value of tau-b is 0.24, indicating that the relationship between the two variables is positive and moderate. Our tau-b thus gives the same result as Somers' D (i.e., charitable giving increases moderately with an increase in church attendance).
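A minimal sketch checking this result, assuming Python with SciPy installed. scipy.stats.kendalltau expects one (x, y) pair per respondent, so the contingency table is first expanded into 68 coded observations.

```python
from scipy.stats import kendalltau

table = [[13, 5, 5], [9, 7, 6], [5, 9, 9]]  # rows: support level; columns: attendance

x, y = [], []
for i, row in enumerate(table):         # i codes the dependent variable (rows)
    for j, count in enumerate(row):     # j codes the independent variable (columns)
        x.extend([j] * count)
        y.extend([i] * count)

tau, p_value = kendalltau(x, y)  # the default variant is tau-b
print(round(tau, 2))  # about 0.24
```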

Significance Level of Somers' D and Kendall's Tau-b

The statistical significance of gamma (γ), Somers' D, and tau-b can be assessed by calculating a z value with the following formulas and then comparing it with the critical value of z in the z-table given in Appendix Table 1A.

Value of z for tau-b:

z = Tau-b × √((Ns + Nd) / (N × (1 − Tau-b²)))

Value of z for Somers' D:

z = Somers' D × √((Ns + Nd) / (N × (1 − Somers' D²)))

N is the total number of observations. For our Somers' D and tau-b, the z is:

z = 0.24 × √((703 + 339) / (68 × (1 − 0.24²)))
  = 0.24 × √(1042 / (68 × (1 − 0.0576)))
  = 0.24 × √(1042 / (68 × 0.9424))
  = 0.24 × √(1042 / 64.08)
  = 0.24 × √16.26
  = 0.24 × 4.03

z = 0.97

As explained in the last chapter, the tabulated (critical) value of z for a two-tailed test at α = 0.05 in Appendix Table 1A is 1.96. Our calculated value does not exceed the tabulated value; we fail to reject H0 and conclude that although we found a moderate association between the frequency of church attendance and support for charitable giving, it is not statistically significant.

Exercises for Practice

Q1. Select a true statement:
  a. gamma is used for nominal data.
  b. gamma is not a directional measure.
  c. Just like lambda, gamma ranges between 0 and 1.
  d. None of the above.

Q2. A gamma value of 0.60 indicates that the relationship between the two variables is:
  a. weak.
  b. moderate.
  c. strong.
  d. None of the above.

Q3. Which of the following measures uses ranking?
  a. Spearman's rho
  b. gamma
  c. lambda
  d. Cramer's V

Q4. Which of the following ordinal-level measures adjusts for the possibility of tied ranks on the dependent variable only?
  a. Cramer's V
  b. Somers' D
  c. Kendall's Tau-b
  d. gamma

Q5. Which of the following ordinal measures uses tied ranks on both the dependent and the independent variable?
  a. Cramer's V
  b. Somers' D
  c. Kendall's Tau-b
  d. gamma

Q6. The following table gives hypothetical data on support for the legalization of marijuana by party affiliation:

Support for Legalizing Marijuana   Liberal   Conservative   Row Totals
Low Support                           20          15            35
High Support                          10          25            35
Column Totals                         30          40            70

Note: Assume the Conservatives will give low support and the Liberals will give high support.

  a. Find the value of gamma (γ)
  b. Formulate the null hypothesis (H0)
  c. Formulate the alternate hypothesis (H1)
  d. Interpret the relationship between party affiliation and support for the legalization of marijuana.

Q7. The following table gives the rankings of top American and Canadian universities in 2013 and 2014 from the Times Higher Education World Reputation Rankings:

              Ranking 2013   Ranking 2014
Harvard             1              1
MIT                 2              2
Stanford            3              6
Cambridge           4              3
Oxford              5              4
Berkeley            6              5
Princeton           7              7
Yale                8             10
CalTech             9             11
UCLA               10              8
Toronto            20             16
McGill             33             31

  a. Calculate Spearman's rho using the rankings of the universities for 2013 and 2014 to determine whether there is a correlation between the rankings of these two years.
  b. How much error will be reduced in your prediction by knowing the ranking of universities in the previous year?

Q8. The following table gives data on the frequency of church attendance and support for euthanasia:

Support for Euthanasia by Church Attendance

Support for     Church Attendance
Euthanasia      Regularly   Occasionally   Rarely   Total
Support             11           3            3       17
Neutral              7           5            4       16
Oppose               3           7            7       17
Total               21          15           14       50

Complete Tables A, B, C, and D and:
  a. calculate Somers' D
  b. interpret Somers' D
  c. calculate Tau-b
  d. interpret Tau-b

Table A Calculations of Concordant (Same Direction) Cells

                                       Concordant Cells   Total of Concordant Observations   Total
Cell 1: Support and Regularly                                                                 253
Cell 2: Support and Occasionally                                                               33
Cell 3: Support and Rarely
Cell 4: Neutral and Regularly                                                                  98
Cell 5: Neutral and Occasionally                                                               35
Cell 6: Neutral and Rarely
Cell 7: Oppose and Regularly
Cell 8: Oppose and Occasionally
Cell 9: Oppose and Rarely
Total Concordant Observations (Ns)                                                            419
Note: The concordant cells are below and to the right of the cell.

Table B Calculations of Discordant (Different Direction) Cells

                                       Discordant Cells   Total of Discordant Observations   Total
Cell 1: Support and Regularly                                                                   0
Cell 2: Support and Occasionally                                                               30
Cell 3: Support and Rarely                                                                     66
Cell 4: Neutral and Regularly
Cell 5: Neutral and Occasionally                                                               15
Cell 6: Neutral and Rarely                                                                     40
Cell 7: Oppose and Regularly
Cell 8: Oppose and Occasionally
Cell 9: Oppose and Rarely
Total Discordant Observations (Nd)                                                            151
Note: The discordant cells are below and to the left of the cell.

Table C Calculations of Ty

                                       Tied Cells (Ty)    Total of Tied Observations         Total
Cell 1: Support and Regularly                                                                  66
Cell 2: Support and Occasionally                                                                9
Cell 3: Support and Rarely                                                                      0
Cell 4: Neutral and Regularly                                                                  63
Cell 5: Neutral and Occasionally                                                               20
Cell 6: Neutral and Rarely
Cell 7: Oppose and Regularly                                                                   42
Cell 8: Oppose and Occasionally                                                                49
Cell 9: Oppose and Rarely
Total Ties (Ty)                                                                               249
Note: For the dependent variable (y), a cell is tied with the cells located to its right.

Table D Calculation of Tx

                                       Tied Cells (Tx)    Total of Tied Observations         Total
Cell 1: Support and Regularly                                                                 110
Cell 2: Support and Occasionally                                                               36
Cell 3: Support and Rarely                                                                     33
Cell 4: Neutral and Regularly                                                                  21
Cell 5: Neutral and Occasionally                                                               35
Cell 6: Neutral and Rarely                                                                     28
Cell 7: Oppose and Regularly
Cell 8: Oppose and Occasionally
Cell 9: Oppose and Rarely
Total Ties (Tx)                                                                               263
Note: For the independent variable (x), a cell is tied with the cells located below it.

Q9. Use the value of Tau-b from Q8 to calculate the z value and find out whether the calculated z value is statistically significant at α = 0.05.

CHAPTER THIRTEEN

TESTS OF SIGNIFICANCE FOR INTERVAL/RATIO-LEVEL VARIABLES

Learning Objectives

In this chapter, we will discuss the measures of association suitable for interval/ratio-level variables. Specifically, you will learn about:
- the data assumptions required when applying a test of significance to generalize from a sample to its population;
- the t-distribution and the t-test for comparing a sample mean with the population mean; and
- the t-test for comparing two samples.

Introduction

We discussed in Chapter 2 that the main difference between interval-level and ratio-level variables is the absence or presence of an absolute zero. We learned that temperature is an interval-level variable because a temperature of 0° does not imply that temperature does not exist. We know that −20° Celsius is colder than 0° Celsius, but 0° Celsius does not mean there is no temperature. A variable in which 0 implies “none” is called a ratio-level variable. Income is a ratio-level variable: an income of 0 dollars means no income. Apart from the absence or presence of an absolute zero, interval-level and ratio-level variables have similar properties; therefore, going forward, we will use the term interval level for both interval-level and ratio-level variables. One of the main features of interval-level data is that we can calculate its mean and, therefore, apply statistical tests to ascertain whether:

1. the sample mean is close enough to the population mean to generalize from the sample to its population;
2. the difference in the means of two samples is statistically significant; and
3. two variables under study are significantly related to each other.

Generalizing from a Sample to Its Population

In Chapter 10, we used the z-test as a test of significance to find out whether a sample mean is significantly different from or similar to its population mean. We can also use a test of significance to compare the means of two samples and to generalize from a sample to the population if the data meets the following five assumptions:

1. Independence: The observations are independent if the selection of one observation (respondent) does not affect the selection of another observation. As mentioned in Chapter 8, data collected by the snowball sampling technique does not fulfill this condition. For example, when you ask the first homeless respondent to refer his or her friends for an interview, the observations become dependent on the respondent who provides the names and locations of the other homeless persons to be interviewed.
2. Normality: The population is normally distributed.
3. Sample size: The sample size is sufficiently large.
4. Randomness: The sample has been selected using a random sampling technique.
5. Interval-level data: The data is interval level, so its mean (average) can be calculated.

The t-Distribution

William Sealy Gosset,11 who published under the pseudonym “Student,” introduced the t-distribution. Gosset determined an exact relationship between small samples and the normal curve. The t-distribution is not a single curve but an infinite family of curves, one for each sample size greater than or equal to 2. If you draw the sampling curves of the t-statistic for sample sizes of 2, 3, 4, 5, and so forth, the t-distribution will increasingly resemble a normal distribution. There is hardly any difference between the t-distribution for a sample size of 50 and the normal curve.

11 Student. “The Probable Error of a Mean.” Biometrika 6 (1908): 1–25.

The t-test, also called Student's t-test, is generally used to test a hypothesis when the sample size is smaller than 30; the z-test is usually used when the sample size is larger than 30. This is a rule of thumb rather than a strict requirement: in the literature you will find researchers using the t-test for sample sizes larger than 30, and you will see t-distribution tables in the appendices of statistics books with degrees of freedom of more than 120. Because the t-distribution for larger sample sizes closely resembles the z-distribution, you can use a t-test for a sample larger than 30. You should not, however, use a z-test for a smaller sample, because a small sample does not satisfy the condition of the central limit theorem (i.e., that the sampling distribution of a sufficiently large sample will increasingly resemble a normal distribution).
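The convergence of the t-distribution toward the normal curve can be seen numerically. The following is a minimal sketch assuming Python with SciPy installed; it prints the two-tailed 0.05 critical value for increasing degrees of freedom, which shrinks toward the z value of 1.96.

```python
from scipy.stats import t, norm

for df in (2, 5, 10, 30, 120):
    print(df, round(t.ppf(0.975, df), 3))  # 4.303, 2.571, 2.228, 2.042, 1.98
print("z:", round(norm.ppf(0.975), 3))     # 1.96
```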

The t-Test

The value of t is the sample mean (x̄) minus the population mean (μ), divided by the standard error of the sample mean:

t = (x̄ − μ) / (sx/√(n − 1))

Standard Error

The denominator of the above equation, the standard deviation of the sample divided by the square root of the sample size minus 1, is called the standard error of the sample mean; it estimates the standard error we would compute if the population standard deviation (σ) were known. We can calculate the population standard deviation exactly only if every value in the population is known. Because it is impossible to know every value of a large population, we estimate its standard deviation with the help of the sample standard deviation. Because the sample's estimate tends to be slightly lower than the actual standard deviation (σ) of the population, we subtract 1 from the sample size (n) to slightly increase the value of the standard error. We already know from our discussion of sampling distributions in Chapter 8 that the sampling error decreases as the sample size increases. When a sample is small, it contains a smaller proportion of the population, and the likelihood of the sample being representative of the population decreases.

The symbol for the standard error of the sample mean is sx̄, and its formula is sx̄ = sx/√n, where sx is the standard deviation of the sample and n is the sample size. Because the sample size (n) appears in the denominator of this equation, subtracting 1 from n slightly increases the size of the standard error. Sample means do not vary as much as the individual values in the population, so the standard deviation calculated from the sample tends to underestimate the population's variability; that is why we increase the size of the standard error of a sample by subtracting 1 from n.

Calculating the t Value

Though the population mean is rarely known, there are instances when the mean of a small population can be determined. For example, the mean grade point average (GPA) of a university's entire student population can be calculated. Let's say that the mean GPA for a university's entire student population is 3.0 and the mean GPA for its sociology students is 2.7. The number of sociology students is 26, and the standard deviation of their GPAs is 0.5. We want to know whether the sociology GPA is significantly lower than the university's average GPA.

x̄ = 2.7
μ = 3.0
sx = 0.5
n = 26,

where x̄ is the sample mean, μ is the population mean, sx is the sample standard deviation, and n is the sample size.

Step 1: Formulation of Hypothesis
- Null hypothesis (H0): The average GPA of sociology students is not significantly lower than the average GPA of all the university's students.
- Alternate hypothesis (H1): The average GPA of sociology students is significantly lower than the average GPA of all the university's students.

Step 2: Statistical Test: Since the sample size is under 30 and the population mean is known, we will use the t-test. The t-test is appropriate when the sample size is smaller than 30.

Step 3: Significance Level: Because we are trying to show that the average GPA of sociology students is significantly lower than the average GPA of all the university's students, we know the direction of the test; therefore, we will use a one-tailed test. We will test our hypothesis at α = 0.05.

Step 4: Calculating the Degrees of Freedom: There is one sample, so the number of values free to vary is n − 1. Therefore, the degrees of freedom are df = n − 1 = 26 − 1 = 25.

Step 5: Calculating the t Value: Plug the values of the sample mean (x̄), the population mean (μ), the sample standard deviation (sx), and the sample size (n) into the formula:

t = (x̄ − μ) / (sx/√(n − 1))
  = (2.7 − 3.0) / (0.5/√(26 − 1))
  = −0.3 / (0.5/√25)
  = −0.3 / (0.5/5)
  = −0.3 / 0.1
t = −3.0

Step 6: Decision:
- If you look up the critical value of t in Appendix Table 3A under a one-tailed test for df = 25 and α = 0.05, the value is 1.71.


- Because our calculated value of t is −3.0, we use the tabulated (critical) value on the left side of the curve, which is −1.71.
- The calculated value of −3.0 exceeds the critical (tabulated) value of −1.71 in magnitude; therefore, we reject the null hypothesis that the mean GPA of sociology students is not significantly lower than the university average. In other words, we conclude that the mean GPA of sociology students is significantly lower than the mean GPA of all the university's students.

Tip: Always bear in mind that if the magnitude of tcalculated exceeds the magnitude of ttabulated, we reject the null hypothesis.
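A minimal sketch of this one-sample calculation in plain Python, using the summary values above (the optional SciPy line shows how the critical value could be looked up instead of using Appendix Table 3A):

```python
from math import sqrt

x_bar, mu, s_x, n = 2.7, 3.0, 0.5, 26

t = (x_bar - mu) / (s_x / sqrt(n - 1))
print(t)  # -3.0

# Critical value via SciPy instead of the appendix table:
# from scipy.stats import t as t_dist
# print(round(t_dist.ppf(0.05, df=n - 1), 2))  # about -1.71 (one-tailed, left side)
```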

The t-Test for Comparing Two Sample Means

The t-test can be used to compare the means of two samples to find out whether the difference between them is statistically significant. Let's say that this time you want to find out whether the difference between the mean GPA of engineering students and the mean GPA of sociology students is statistically significant. Let's assume that the means, standard deviations, and sizes of the two samples are as follows:

Sociology: x̄1 = 2.7, sx1 = 0.5, n1 = 26
Engineering: x̄2 = 3.2, sx2 = 0.6, n2 = 37,

where x̄1 is the mean GPA of sociology students, x̄2 is the mean GPA of engineering students, sx1 and sx2 are the corresponding standard deviations, and n1 and n2 are the corresponding sample sizes.

With some modification to the denominator, the formula is similar to the one that we used to compare the mean GPA of sociology students with that of all the university's students. In the earlier example, we subtracted the population mean from the sample mean in the numerator. In this instance, we subtract one sample mean from the other. It does not matter which mean is subtracted from which; the difference shows up only in the sign of the t value. A positive t value implies that we look at the right side of the t-distribution curve, and a negative t value implies that we look at the left side of the t-distribution curve. Because we have two samples, the denominator is the standard error of the difference between the two means. Let's set up the hypothesis.

Step 1: Formulation of Hypothesis
• Null hypothesis (H0): The average GPA of sociology students is not significantly lower than the average GPA of engineering students.
• Alternate hypothesis (H1): The average GPA of sociology students is significantly lower than the average GPA of engineering students.

Step 2: Statistical Test: Because we are testing whether the difference in the means of two independent samples is statistically significant, a two-sample t-test is appropriate.

Step 3: Significance Level: Because we are trying to prove that the mean GPA of sociology students is significantly lower than the mean GPA of engineering students, we know the direction of the test. Therefore, we will use a one-tailed test at the conventional significance level of α = 0.05.

Step 4: Calculating the Degrees of Freedom: There are two samples. The number of values free to vary is n1 − 1 for the first sample and n2 − 1 for the second. Therefore, the degrees of freedom (df) for both samples are (n1 − 1) + (n2 − 1) = n1 + n2 − 2.

df = 26 + 37 − 2 = 63 − 2 = 61

Step 5: Calculating the t Value: The formula for the two-sample t is:

t = (x̄1 − x̄2) / s(x̄1−x̄2),

where x̄1 is the mean GPA for the sample of sociology students, x̄2 is the mean GPA for the sample of engineering students, and s(x̄1−x̄2) is the standard error of the difference between the two means.

Let's first calculate the standard error of the difference between the two means for the denominator of the t formula. Its formula is:

s(x̄1−x̄2) = √[((n1 × sx1² + n2 × sx2²)/(n1 + n2 − 2)) × ((n1 + n2)/(n1 × n2))]
= √[((26 × 0.5² + 37 × 0.6²)/(26 + 37 − 2)) × ((26 + 37)/(26 × 37))]
= √[((26 × 0.25 + 37 × 0.36)/61) × (63/962)]
= √[((6.5 + 13.3)/61) × (63/962)]
= √[(19.8/61) × (63/962)]
= √(0.32 × 0.065)
= √0.0208
= 0.1442

Now let's plug the values of the numerator and the denominator into the formula for the t value:

t = (x̄1 − x̄2) / s(x̄1−x̄2)
= (2.7 − 3.2) / 0.1442
= −0.5 / 0.1442
= −3.47


Step 6: Decision:
• The tabulated (critical) value of t in Appendix Table 3A that corresponds to more than 30 degrees of freedom (our df being 61) under α = 0.05 for a one-tailed test is 1.64. Looking at the left side of the t-distribution, it is −1.64.
• Our calculated value of −3.47 exceeds the critical value of −1.64 in absolute terms; therefore, we reject the null hypothesis. Statistically, the mean GPA of sociology students is significantly lower than the mean GPA of engineering students.
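As a cross-check, here is a minimal sketch of the two-sample calculation in Python with SciPy, following the pooled standard-error formula above; the small difference from the text's 0.1442 and −3.47 comes from the text's intermediate rounding:

    from math import sqrt
    from scipy import stats

    x1, s1, n1 = 2.7, 0.5, 26                 # sociology
    x2, s2, n2 = 3.2, 0.6, 37                 # engineering
    se = sqrt(((n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)) * ((n1 + n2) / (n1 * n2)))
    t = (x1 - x2) / se
    t_crit = stats.t.ppf(0.05, n1 + n2 - 2)   # one-tailed critical value, left side
    print(round(se, 4), round(t, 2))          # about 0.1459 and -3.43
    print("reject H0:", t < t_crit)           # True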

Exercises for Practice

Q1. The average faculty salary across the nation is $101,000. A consortium of 10 universities wants to confirm whether the average faculty salary of its 10 universities is significantly higher than the national average. Test the null hypothesis, given that the average faculty salary of the consortium of 10 universities is $105,000, with a standard deviation of $5,700.

Q2. In 2001, life expectancy for men in Canada was 77.0 years. A demographer calculated that the life expectancy of a sample of 26 First Nations men on a reserve was 70.0 years, with a standard deviation of 10 years. Using the t-test, test your null hypothesis at α = 0.05 that there is no statistically significant difference in the life expectancy of the sample of First Nations men and Canadian men as a whole.

Q3. A sample of 10 boys and 12 girls was given an achievement test. The boys' average score was 84, with a standard deviation of 2; the girls' average score was 82, with a standard deviation of 3. Test at the α = 0.05 level to determine whether there are statistically significant gender differences in the achievement scores.

CHAPTER FOURTEEN

MEASURING THE RELATIONSHIP BETWEEN TWO INTERVAL/RATIO-LEVEL VARIABLES

Learning Objectives

This chapter focuses on the techniques for finding a relationship between two interval-level variables. It discusses one of the most important statistical measures, the correlation coefficient. Specifically, you will learn:
• the correlation coefficient (Pearson's r);
• the interpretation of r;
• the coefficient of determination (R²) and the variation explained in a dependent variable by an independent variable;
• the meaning of linearity; and
• an ingenious utility of correlation.

Introduction

When we say that there is a relationship between two variables, we imply that the dependent variable changes when there is a change in the independent variable. For example, if education is an independent variable and income is the dependent variable, we would be interested to know how much a person's income changes with a change in his or her education. The relationship between an independent variable and the dependent variable can be positive or negative. If the relationship between income and education is positive, income will increase for each extra year of education a person has, and if the relationship is negative, income will decrease for each extra year of education. Apart from knowing the positive or negative direction of the relationship, we would be interested to know the actual amount of change in the dependent variable for a unit change in the independent variable (i.e., to find out how much a person's income increases or decreases for each extra year of education). The correlation coefficient, or Pearson's r, is used to measure the relationship between two interval-level variables.

Coefficient of Correlation (Pearson's r)

The correlation coefficient, also referred to as the coefficient of correlation, is represented by the lowercase r. It varies between −1 and +1. The value of +1 reflects a perfect positive relationship between two variables; it means that when the independent variable x increases by one unit, the dependent variable y also increases by a constant set of units. The value of −1 reflects a perfect negative relationship between two variables; it means that when the variable x increases by one unit, the variable y decreases by a constant set of units, or vice versa. The value of 0 for the correlation coefficient indicates that there is no linear relationship between the two variables.

The coefficient of correlation (r) predicts the amount of change in the dependent variable when the independent variable changes. For example, if the value of the correlation coefficient (r) between the standardized12 variables education and income is 0.8, income will increase by 0.8 units for every 1-unit change in education. If a unit of education is equal to one year of education and a unit of income is equal to $1,000, then one extra year of education will result in an extra 0.8 thousand, or $800, of income. You will learn in the chapter on regression that predicting a value of a dependent variable for a corresponding value of an independent variable is slightly more involved. At this point, you can roughly interpret that, with a correlation of 0.8 between standardized units of education and income, if a person with 10 years of education earns $20,000, a person with 11 years of education will earn $20,000 + $1,000 × 0.8 = $20,000 + $800 = $20,800. Similarly, a person with 12 years of education will have 2 more units of education than the person with 10 years of education; hence, that person's income would be $20,000 + 2 × $800 = $20,000 + $1,600 = $21,600. In short, r measures the change in the value of one variable as the value of another variable changes.

12 To standardize your variables, you can calculate z-scores for each value of the dependent variable (income) and each value of the independent variable (education), and then calculate the correlation coefficient between these z-scores.


Calculation of Coefficient of Correlation

You learned in Chapter 5 that the variance of a sample is the sum of the squared deviations divided by the total number of deviations. You already know that the formula for the variance is:

s² = Σ(x − x̄)²/n

The variance provides a measure of the spread of the values of a variable x around its mean, x̄. Similarly, when you have two variables, say x and y, the combined spread between the values of these two variables is called the covariance and is calculated with the following formula:

cov(x, y) = Σ(x − x̄)(y − ȳ)/n

Apart from indicating whether the two variables are positively or negatively correlated, the coefficient of correlation also indicates the degree to which the dependent variable tends to change for each unit of change in the independent variable. The coefficient of correlation is derived as a ratio of the covariance of the two variables (x and y) to the product of their standard deviations. Its definitional formula is:

r = Σ[(x − x̄)(y − ȳ)] / √[Σ(x − x̄)² × Σ(y − ȳ)²]

This definitional formula can be rewritten as a computational formula to make the computing of r less cumbersome. The computational formula for r is:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}

If x is an independent variable and y is a dependent variable, you need to calculate the following seven terms to compute r:

1. the sum of the product of x and y = Σxy
2. the sum of x = Σx
3. the sum of y = Σy
4. the sum of x² = Σx²


5. the sum of y² = Σy²
6. the square of the sum of x = (Σx)²
7. the square of the sum of y = (Σy)²

It is fairly simple to calculate these terms from the values of the variables x and y. Let's say that we want to calculate the coefficient of correlation between the average number of kilometres walked per day (x) for a month and the number of kilograms of weight reduced (y) by the four persons given in Table 14-1, to test the following hypothesis:

Step 1: Formulation of Hypothesis
• Null hypothesis (H0): There is no correlation between the number of kilometres walked per day (x) for a month and the number of kilograms of weight reduced (y).
• Alternate hypothesis (H1): There is a correlation between the number of kilometres walked per day (x) for a month and the number of kilograms of weight reduced (y).

Step 2: Calculating the Value of the Coefficient of Correlation (r): Table 14-1 gives hypothetical data. The values of the x and y variables are kept small for the convenience of calculation. Let's calculate the above-mentioned seven terms required by the formula of the coefficient of correlation:

Table 14-1 Number of Kilometres Walked (x) per Day in a Month and Number of Kilograms Reduced (y)

Person    x        y        x²        y²        xy
Tom       2        3        4         9         6
John      3        2        9         4         6
Terry     4        3        16        9         12
Monica    5        4        25        16        20
          Σx = 14  Σy = 12  Σx² = 54  Σy² = 38  Σxy = 44

The following values, calculated in Table 14-1, are needed for the formula of the coefficient of correlation r:


1. the sum of the product of x and y = Σxy = 44
2. the sum of x = Σx = 14
3. the sum of y = Σy = 12
4. the sum of x² = Σx² = 54
5. the sum of y² = Σy² = 38
6. the square of the sum of x = (Σx)² = (14)² = 196
7. the square of the sum of y = (Σy)² = (12)² = 144
8. the number of subjects n = 4

Next, plug these values into the following formula for the coefficient of correlation:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
r = (4 × 44 − 14 × 12) / √[(4 × 54 − 14²) × (4 × 38 − 12²)]
r = (176 − 168) / √[(216 − 196) × (152 − 144)]
r = 8 / √(20 × 8)
r = 8 / √160
r = 8 / 12.65
r = 0.63

Step 3: Decision about the Strength of the Coefficient of Correlation Using a Rule of Thumb

One way to interpret r with the help of a simple rule of thumb is to use the following ranges:


Value of r               Strength of Correlation
0.0                      No linear relationship
Between 0.1 and 0.3      Weak
Between 0.4 and 0.6      Moderate
Between 0.7 and 0.9      Strong
1.0                      Perfect

Based on this rule of thumb, we can say from the coefficient of correlation of 0.63 that the relationship between the number of kilometres walked and the number of kilograms of weight reduced is moderate. But we cannot conclusively reject or fail to reject the null hypothesis. We need to take one more step to make a decision about the level of significance of the correlation coefficient between these two variables.

Step 4: Applying the t-Test to the Value of the Correlation Coefficient

The alternative to the rule of thumb is to calculate a measure that can be evaluated against a distribution. That measure is the t-test. We can calculate the value of t from the value of r using the following formula, where n is the number of subjects or observations and r is the coefficient of correlation:

t = r × √[(n − 2)/(1 − r²)]
= 0.63 × √[(4 − 2)/(1 − 0.63²)]
= 0.63 × √[2/(1 − 0.4)]
= 0.63 × √(2/0.6)
= 0.63 × √3.33
= 0.63 × 1.82
= 1.15


The next step is to find the tabulated (critical) value of t in Appendix Table 3A to compare with the calculated value of t (i.e., 1.15).

Step 5: The Degrees of Freedom: The degrees of freedom (df) equal the sample size minus the number of parameters. The sample size is 4 and there are two variables (walking and weight reduction); thus the df is n − 2, which is 4 − 2 = 2.

Step 6: Decision: The critical value of t for 2 degrees of freedom in the t-distribution in Appendix Table 3A for a two-tailed test at α = 0.05 is 4.30. Our calculated value of 1.15 is much smaller than the critical value. Based on this hypothetical data, our conclusion is that although the correlation between the number of kilometres walked and the number of kilograms of weight reduced seems to be moderate, it is not statistically significant. Hence, we fail to reject the null hypothesis that there is no statistically significant correlation between the number of kilometres walked and the number of kilograms of weight reduced. In other words, we are unable to prove from our limited hypothetical data that there is a statistically significant relationship between the two variables. The low value of t is due to the very small sample size (4) in our example. In the next section, we will discuss how the same value of r (0.63) would have been statistically significant if it had been derived from a larger sample.
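Readers who want to verify the arithmetic can run the following minimal Python sketch; it applies the computational formula for r and then the t-test of its significance (scipy.stats.pearsonr would return the same r directly):

    from math import sqrt
    from scipy import stats

    x = [2, 3, 4, 5]                          # kilometres walked per day
    y = [3, 2, 3, 4]                          # kilograms of weight reduced
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x)**2) *
               (n * sum(b * b for b in y) - sum(y)**2))
    r = num / den                             # 0.63
    t = r * sqrt((n - 2) / (1 - r**2))        # about 1.15
    t_crit = stats.t.ppf(0.975, n - 2)        # two-tailed at alpha = 0.05: 4.30
    print(round(r, 2), round(t, 2), round(t_crit, 2))
    print(round(r**2, 2))                     # 0.4 (the text's 39.69% uses the rounded r = 0.63)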

The Significance of Sample Size

Using the rule of thumb, we concluded that the 0.63 value of the correlation coefficient (r) indicates a moderate correlation, yet the t-test concluded that the value of r is not statistically significant. In the t-test formula, the sample size (n) appears in the numerator. This means that the larger the sample size, the larger the value of the t statistic. For example, if the value of the correlation coefficient (r) were 0.63 and the sample size were 30 rather than 4, the same formula would give a value of t equal to 4.3, which would be statistically significant because it is higher than the critical value of 2.05 for 28 degrees of freedom, as given in Appendix Table 3A.

t = r × √[(n − 2)/(1 − r²)]
= 0.63 × √[(30 − 2)/(1 − 0.63²)]
= 0.63 × √[28/(1 − 0.4)]
= 0.63 × √(28/0.6)
= 0.63 × √46.7
= 0.63 × 6.8
= 4.3

It implies that the size of the calculated value of t for a given value of the correlation coefficient (r) depends on the sample size. The sample size of 4 is too small for an interpretation of the t-test; the sample size in this hypothetical example was kept small to make the calculations easier.

Summary: Interpretation of the Correlation Coefficient

• The correlation coefficient varies between −1 and +1.
• A positive correlation means that as the scores of one variable increase, the scores of the other variable also increase.
• A negative correlation means that as the scores of one variable increase, the scores of the other variable decrease.
• The value of +1 indicates a perfect positive correlation; if the scores of one variable increase by one unit, the scores of the other variable also increase by a constant set of units.
• The value of −1 indicates a perfect negative correlation; if the scores of one variable increase by one unit, the scores of the other variable decrease by a constant set of units.


• The value of 0 indicates no linear relationship between the two variables under consideration.

There are two other important statistical concepts related to the coefficient of correlation: the variation explained in the dependent variable by the independent variable, and the assumption of linearity.

The Coefficient of Determination (R-Squared)

We obtain the coefficient of determination, or R-squared, by squaring the value of the coefficient of correlation. The R² is a very useful measure. It indicates the amount of variation explained in the dependent variable by an independent variable. In our example, the value of r is 0.63 and its square is 0.63² = 0.63 × 0.63 = 0.3969. We can conclude that in our hypothetical example 39.69% of the variation in the values of the dependent variable (the number of kilograms of weight reduced) can be explained by the variation in the independent variable (the number of kilometres walked). You will learn more about R² in Chapter 17 on regression.

Linearity

In a linear relationship, any change in the independent variable produces a constant amount of change in the dependent variable. The graph of a perfect linear relationship is a straight line. In a perfectly linear relationship, the value of the coefficient of correlation (r) is 1. Let's illustrate this with an example. In Table 14-2, the relationship between the number of hours worked and the wages earned is linear, and its graph in Figure 14-1 is a straight line. This relationship is linear because a unit change in the independent variable (the number of hours worked) causes a constant change (i.e., $5.00 per hour) in the dependent variable (wages earned).


Table 14-2 Wages Earned by Number of Hours Worked

Number of Hours Worked    Wages Earned ($)
1                         5
2                         10
3                         15
4                         20
5                         25
6                         30
7                         35
8                         40
9                         45
10                        50

Figure 14-1 Wages Earned by Number of Hours Worked (a straight-line graph of Dollars Earned against Number of Hours Worked)

Not all relationships are perfectly linear. For example, in Figure 14-2, the graph of death rates by age looks like a tub-shaped curve because death rates are higher in infancy, decline from early childhood to early adulthood, and then increase steeply after age 50. The line of this graph makes a curve, and this type of relationship is called curvilinear.


Figure 14-2 Death Rates by Age, Canada, 2005

An Ingenious Utility of Correlation

In the documentary The Joy of Stats, Hans Rosling13 describes how scientists at Google's headquarters in California are working on software that translates one language into another using the statistical correlation between words and phrases. Rosling says that when the computer is learning to translate, it is actually learning to find correlations between words and phrases. Scientists at Google fed a large amount of text data into the system and discovered that certain words and phrases of one language correlate very often with words and phrases of another language. Based on these correlations, Google's website allows translation between 57 languages purely statistically, using correlated multilingual text. It means that in this context, when you are translating from Chinese to English, you don't have to know Chinese. What you have to know is statistics and computer science. It does not end here. Google is now working on statistical voice-recognition software that uses correlations to enable instant conversation between two people who don't speak a common language. One person can talk in one language, and the other person can hear it in his or her own language and answer back in real time. The translation will be done for both of them simultaneously by a technology that uses statistical correlations.

13 Rosling, Hans. "Automatic Translation" (6/6), from The Joy of Stats. YouTube, February 2, 2011. https://www.youtube.com/watch?v=AEac-jP5Eho


The examples above make you realize just how useful the knowledge of statistics is in the modern world.

Exercises for Practice

Q1. The following table gives the unemployment rates (x) and the poverty rates (y) per 1,000 population for 10 years:

Year    Unemployment Rate (x)    Poverty Rate (y)
1997    9                        14
1998    8                        14
1999    8                        11
2000    7                        12
2001    7                        11
2002    8                        10
2003    8                        11
2004    7                        10
2005    7                        9
2006    6                        8

a. Calculate the coefficient of correlation (r) using the following template.
b. Calculate the value of t and determine if r is significant at α = 0.05.
c. Calculate the coefficient of determination (R²) and find out how much variation in the dependent variable (y) is explained by the independent variable (x).

Use the following table for a hint:

Year    x    y     x²    y²    xy
1997    9    14
1998    8    14
1999    8    11
2000    7    12
2001    7    11
2002    8    10
2003    8    11
2004    7    10
2005    7    9
2006    6    8
        Σx = Σy =  Σx² = Σy² = Σxy =

CHAPTER FIFTEEN

POWER OF A STATISTICAL TEST

Learning Objectives

In this chapter, you will learn about the power of a test. Specifically, you will learn about:
• the power of a test and how it is calculated;
• the factors that affect the power of a test;
• parametric and non-parametric tests and power; and
• type 1 error, type 2 error, and power.

Introduction

We have learned that a type 1 error is the rejection of a null hypothesis when it should be retained. In other words, a type 1 error is the incorrect rejection of a true null hypothesis. The possibility of a type 1 error increases if we set the level of significance (α), the tail area of the curve, too wide. The wider the tail area, the more flexible α is. For example, an α equal to 0.1 is less stringent than an α equal to 0.05 because the α equal to 0.1 leaves a wider area in the tail.

A type 2 error is a failure to reject a null hypothesis when it should be rejected. In other words, a type 2 error is incorrectly accepting a false null hypothesis. The possibility of a type 2 error increases when we set α too far away from the mean, leaving the tail area under the curve too small. An α equal to 0.01 is more stringent than an α equal to 0.05; it makes the null hypothesis more difficult to reject. When α is too stringent, say 0.0001, it will be very difficult to obtain a calculated value of a statistic (z or t) larger than the critical value, and the chances of failing to reject a null hypothesis will be very high. In other words, the chances of incorrectly accepting a false null hypothesis (the probability of a type 2 error) will be high when α is too stringent. Researchers use the concept of the power of a test to strike a balance between the type 1 error and the type 2 error because they don't want the rejection of a null hypothesis to be either too hard or too easy.

Power of a Test

The power of a test is the probability of rejecting a null hypothesis when it is actually false. In other words, the power of a test is the probability of rejecting a null hypothesis when it should be rejected. Basically, it is the probability of making the right decision. The power quantifies the chance that the null hypothesis will be rejected when it is actually false. In this sense, statistical power is inversely related to the probability of making a type 2 error (β). Symbolically, power is defined as:

Power = 1 − the probability of a type 2 error
Power = 1 − β

It means that the higher the probability of making a type 2 error (β), the lower the power.

Calculating the Power of a Test

Let's say that a school board allows an average class size of 20 students in its 9 schools, but it suspects that the actual average class size in these 9 schools is 22. Suppose that the standard deviation based on past data is 6. Based on this information, we can set up our hypothesis as follows:

Null hypothesis: H0: μ0 = 20
Alternate hypothesis: H1: μ1 = 22
n = 9; σ = 6,

where μ0 is the average class size under the null hypothesis, μ1 is the average class size under the alternate hypothesis, n is the sample size, and σ is the standard deviation.

Let's say that we use α = 0.01 as the decision criterion to reject the null hypothesis. The tabulated value of the z statistic at the 0.01 level of significance is 2.33 (Appendix Table 1B). We will reject the null hypothesis if the calculated value of z is greater than 2.33. Let's find the value of the sample mean (x̄) for μ0 = 20, σ = 6, and n = 9 schools:

z = (x̄ − μ0)/(σ/√n)
2.33 = (x̄ − 20)/(6/√9) = (x̄ − 20)/(6/3) = (x̄ − 20)/2
x̄ = 2.33 × 2 + 20
x̄ = 4.66 + 20
x̄ = 24.66

This is the cut-off value of the sample mean under the null hypothesis when μ0 = 20. Now let's calculate the value of z for the sample mean (x̄) of 24.66 and μ1 = 22:

z = (x̄ − μ1)/(σ/√n)
= (24.66 − 22)/(6/√9)
= 2.66/(6/3)
= 2.66/2
= 1.33

If you look in Appendix Table 1B, the area under the curve beyond the z value of 1.33 is 0.0918. This means the probability of correctly rejecting H0 is 9.18%. This probability is 1 − β, or the power of the test. Consequently, the probability of making a type 2 error (β) is 1 − the power: 1 − 0.0918 = 0.9082, or 90.82%.
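The two-step calculation can be wrapped in a small function. The following is a minimal sketch in Python with SciPy, assuming a one-tailed alternative with μ1 > μ0; the tiny gap from the text's 9.18% comes from using the exact critical value rather than the rounded 2.33:

    from math import sqrt
    from scipy import stats

    def power(mu0, mu1, n, sigma, alpha):
        """Power of a one-tailed z-test of H0: mu = mu0 against H1: mu = mu1 > mu0."""
        z_crit = stats.norm.ppf(1 - alpha)        # about 2.33 for alpha = 0.01
        x_cut = mu0 + z_crit * sigma / sqrt(n)    # cut-off sample mean under H0
        z = (x_cut - mu1) / (sigma / sqrt(n))     # re-express the cut-off under H1
        return stats.norm.sf(z)                   # area beyond z, i.e. 1 - beta

    print(round(power(20, 22, 9, 6, 0.01), 4))    # about 0.092, as in the text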

Factors Affecting the Power of a Test

You may increase the power of a test by:
1. increasing the sample size;
2. increasing the significance level (α);
3. increasing the effect size, which is the difference between the values under the null (H0) and alternate (H1) hypotheses; or
4. decreasing the variability (σ).

The Effect of Sample Size on Power

The power varies with the sample size. Let's examine the effect of sample size on the power by increasing the sample size from 9 to 16 in the above example.

Null hypothesis: H0: μ0 = 20
Alternate hypothesis: H1: μ1 = 22
n = 16; σ = 6,

where μ0 is the average class size under the null hypothesis, μ1 is the average class size under the alternate hypothesis, n is the sample size, which has been increased to 16, and σ is the standard deviation. We use α = 0.01 as the decision criterion to reject the null hypothesis. The tabulated value of the z statistic at the 0.01 level of significance is 2.33 (Appendix Table 1B). Let's find the value of the sample mean (x̄) for μ0 = 20:

z = (x̄ − μ0)/(σ/√n)
2.33 = (x̄ − 20)/(6/√16) = (x̄ − 20)/(6/4) = (x̄ − 20)/1.5
x̄ = 2.33 × 1.5 + 20
x̄ = 3.495 + 20
x̄ = 23.495 ≈ 23.5

Now let's calculate z for the mean (x̄) of 23.5 and μ1 of 22 for the alternative hypothesis:

z = (x̄ − μ1)/(σ/√n)
= (23.5 − 22)/(6/√16)
= 1.5/(6/4)
= 1.5/1.5
= 1.0


If you look in Appendix Table 1B, the area under the curve beyond the z value of 1.0 is 0.1587. This probability is 1 − β, or the power of the test. The power has increased from 9.18% to 15.87% with the increase in the sample size from 9 to 16. We can conclude that the power of a test is a function of the sample size (n); we can increase the power of a test by increasing the sample size.

The Effect of an Increase in the Significance Level (α) on Power

The higher the likelihood of making a type 1 error (rejecting a true H0), the lower the likelihood of making a type 2 error (accepting a false H0). The probability of making a type 2 error is β, and the power is 1 − β. When the level of significance (α) is higher, β is lower and, hence, the power is higher. In other words, there is a direct relationship between the level of significance (α) and the power. Let's see the change in the power with an increase of α from 0.01 to 0.05.

In the above example with α = 0.01 (one-tailed test), μ0 = 20, μ1 = 22, n = 16, and σ = 6, the power is 15.87%. If we increase α to 0.05 (one-tailed test), the critical value of z is 1.645 (see Appendix Table 1B), and the value of the sample mean (x̄) for μ0 = 20 is:

z = (x̄ − μ0)/(σ/√n)
1.645 = (x̄ − 20)/(6/√16) = (x̄ − 20)/(6/4) = (x̄ − 20)/1.5
x̄ = 1.645 × 1.5 + 20
x̄ = 2.5 + 20
x̄ = 22.5

Now, we calculate z for the mean (x̄) of 22.5 and μ1 of 22 for the alternative hypothesis:

z = (x̄ − μ1)/(σ/√n)
= (22.5 − 22)/(6/√16)
= 0.5/(6/4)
= 0.5/1.5
= 0.33


If you look in Appendix Table 1B, the area under the curve beyond the z value of 0.33 is 0.3707. This means that the probability of correctly rejecting H0, the power of the test, is 37.07%. In other words, the power increases from 15.87% to 37.07% with an increase in the level of significance (α) from 0.01 to 0.05. This shows that an increase in the level of significance (α) results in an increase in the power of a test.
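Both effects can be reproduced with the power function sketched earlier in this chapter; the small gaps from the text's 15.87% and 37.07% again come from rounding the critical z:

    print(round(power(20, 22, 16, 6, 0.01), 4))   # about 0.16: larger n raises power
    print(round(power(20, 22, 16, 6, 0.05), 4))   # about 0.38: larger alpha raises power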

The Effect of the Directional Nature of the Alternate Hypothesis (H1)

We learned in Chapter 10 that if the direction of the test is known, we use a one-tailed test, and if the direction of the test is not known, we use a two-tailed test. The direction of the test (whether it is one-tailed or two-tailed) is related to the power of the test. We have seen in the above section that the power increases with an increase in α. The critical value of z decreases with an increase in α (Table 15.1). Therefore, it follows that the power increases as the critical value of z decreases.

As shown in Table 15.1, the critical value of z is lower for a one-tailed test than for a two-tailed test at any given level of α. Therefore, the one-tailed test is more powerful than the two-tailed test because a calculated value of z that may not be significant for a two-tailed test may be significant for a one-tailed test. For example, a calculated value of z of 1.65 will be significant at α = 0.05 for a one-tailed test but not for a two-tailed test (see Table 15.1). In this sense, an alternate hypothesis (H1) with a one-tailed test will be more powerful than one with a two-tailed test as long as the value of z is in the expected direction.

Table 15.1 The Critical Value of z to Reject H0 at a Given α Level

α        One-Tailed Test (z)    Two-Tailed Test (z)
0.005    2.58                   ±2.81
0.01     2.33                   ±2.58
0.025    1.96                   ±2.24
0.05     1.65                   ±1.96


Parametric and Non-Parametric Tests and Power

Tests that assume normally distributed populations with the same variance are called parametric tests (e.g., the z-test and the t-test are parametric tests). Some tests, such as Spearman's rho (ρs) and Kendall's tau-b (τb), do not assume a normally distributed population or the same variance; they are called non-parametric tests. Parametric tests are more powerful for the same sample size because, in comparison with non-parametric tests, they make use of the maximum information when the criteria of normality and equal variance are met. Let's say that we have data on the scores of five students, and these scores are 100, 50, 40, 35, and 15. If we were to rank them to calculate Spearman's rho (ρs), we would rank them 1, 2, 3, 4, and 5. We would lose all the information about the differences in the magnitude of the scores because the ranking expresses the difference between the scores of 100 and 50 and the difference between the scores of 40 and 35 as the same. But a parametric test such as the t-test uses the raw scores, preserving the differences in the magnitude of the scores. This greater sensitivity to the magnitude of the scores not only makes parametric tests more accurate but also lets us arrive at probability values when the basic assumptions are met. For example, we make a probability statement in a parametric test when we use the level of significance of 0.05 or 0.01. One must keep in mind that parametric tests are more powerful when the underlying assumptions are met. When the assumptions of normality or equal population variance are not met, a non-parametric test may be equally powerful. In many cases, the nature of the data may prohibit the use of a parametric test. For example, we are forced to use less powerful non-parametric tests for nominal- or ordinal-level data or when the data do not fulfill the assumptions of a parametric test.

Type 1 Error, Type 2 Error, and Power

Researchers use the concept of the power of a test to strike a balance between type 1 and type 2 errors. The testing of a null hypothesis (H0) entails two mutually exclusive possibilities: either the H0 is true, that is, there is no significant difference in the means of the two samples (x̄1 = x̄2), or it is false, that is, there is a significant difference in the means of the two samples (x̄1 ≠ x̄2).

The type 1 error occurs when we reject a true H0. There are two observations worth noting: (i) if the H0 is false, the probability of a type 1 error is zero; (ii) only if the H0 is true is there a possibility of the type 1 error. This probability of the type 1 error is α.

The type 2 error occurs when we accept a false H0. Here also, there are two observations worth noting: (i) if the H0 is true, the probability of a type 2 error is zero; (ii) only if the H0 is false is there a possibility of the type 2 error. This probability of the type 2 error is β. It is clear then that the concept of power (1 − β), which is defined with the help of the type 2 error, is applicable only when the H0 is false. Table 15-2 summarizes the above discussion regarding the probabilities related to the acceptance or rejection of the H0.

Table 15-2 Decision on H0 and Type of Error

Decision    H0 Is True           H0 Is False
Accept      Correct (1 − α)      Type 2 Error (β)
Reject      Type 1 Error (α)     Correct (1 − β)

Power is the probability of rejecting a false null hypothesis (i.e., 1 − β). The bottom-right cell of Table 15-2 represents power.

Exercises for Practice

Q1. Use the example problem in the section "Calculating the Power of a Test" and calculate the power for the following data:
H0: μ0 = 10
H1: μ1 = 11
n = 4; σ = 3
α = 0.05 (one-tailed)

Q2. Now, increase the sample size to 9 and find out what the effect of the change in the sample size is on the power. Your new data will be:
H0: μ0 = 10
H1: μ1 = 11
n = 9; σ = 3
α = 0.05 (one-tailed)

Q3. Now, change α from 0.05 to 0.01 and find out what the effect of the change in α is on the power. Your new data will be:
H0: μ0 = 10
H1: μ1 = 11
n = 4; σ = 3
α = 0.01 (one-tailed)
Hint: See the example problem in the section "The Effect of an Increase in the Significance Level (α) on Power."

Q4. Why does the concept of power not apply when the null hypothesis is true?

Q5. Explain why a type 1 error cannot be made when the null hypothesis is false.

CHAPTER SIXTEEN

ANALYSIS OF VARIANCE (ANOVA)

Learning Objectives

In this chapter, we will discuss the one-way analysis of variance (ANOVA). It is used to compare two or more groups. Specifically, you will learn to:
• test a hypothesis with one-way analysis of variance;
• calculate the sum of squares within groups (SSW), the sum of squares between groups (SSB), and the total sum of squares (SST);
• construct an ANOVA table; and
• calculate and interpret the F-ratio.

Introduction

In the previous chapters, we have used statistical techniques to find an association or a statistically significant difference between a sample and its population or between two samples. But what if we have to compare more than two groups? For example, what if we have to compare the religiosity of four religious groups, say, Protestants, Catholics, Muslims, and Jews? We can use the t-test to find out how significantly different these groups are from each other. But we would need to perform a series of t-tests: between Protestants and Catholics; Protestants and Muslims; Protestants and Jews; Catholics and Muslims; Catholics and Jews; and Muslims and Jews. In this case, we would need to perform six tests. It becomes cumbersome to conduct so many tests when there are more than two groups. The analysis of variance (ANOVA) circumvents this problem.


Analysis of Variance (ANOVA)

ANOVA provides a single statistical test to find differences between two or more samples. When we compare two or more groups, there are two sources of variance:

1. The first type of variance is due to differences between the values of the groups, which is called between groups variation.
2. The second type of variance is due to differences in the values within each group, which is called within group variation.

Let's say that you're comparing two photographs. The first photo shows people coming out of a subway station in London, England, and the second photo shows people coming out of a subway station in New Delhi, India. You'll be able to tell easily which photo is from India and which photo is from England because the differences between the two photographs are quite striking. But when you look at the photos individually, you see that, yes, there are differences within the group of people in the photo of the New Delhi commuters because people are dressed differently. Some women are in saris, some are in jeans and pants; some men are in traditional Indian dress while others are in pants and shirts. But there are more similarities within each group of people than differences. Similarly, although the photo from London has differences within its group of commuters, there are more similarities than differences. The two photographs, however, are quite different from each other. In other words, there is a higher variance between the photographs than within the photographs. You'll soon find that you can calculate the variation between samples of groups and within samples of groups. A statistic called the F-ratio is calculated from these two types of variation to find out whether the ratio is statistically significant.

One-Way Analysis of Variance

When we compare two or more groups, the aim of the analysis of variance (ANOVA) is to find out whether there is more variation (difference) between groups than within groups. Let's say that we want to find out from the data in Table 16-1 whether there are significant differences in the reduction of blood pressure between three groups using three different strategies: medication, diet, and exercise, respectively.


Step 1: Setting Up the Hypothesis:
• Null hypothesis (H0): There are no significant differences in the reduction of blood pressure among the three different types of treatment.
• Alternate hypothesis (H1): There are significant differences in the reduction of blood pressure among the three different types of treatment.

Table 16-1 Reduction in Blood Pressure by Treatment

Observation       Medication Group 1 (x1)   Diet Group 2 (x2)   Exercise Group 3 (x3)
1                 2                         4                   4
2                 4                         4                   5
3                 3                         3                   6
4                 4                         5                   4
5                 2                         4                   6
Total (XT)        15                        20                  25
Group Mean (x̄i)   3                         4                   5

Step 2: Calculation of the Means:

Group mean x̄i = Σxi/n,

where xi is an observation in a group and n is the number of observations in that group. n = 5.

Mean for Group 1: x̄1 = 15 ÷ 5 = 3
Mean for Group 2: x̄2 = 20 ÷ 5 = 4
Mean for Group 3: x̄3 = 25 ÷ 5 = 5

Grand mean x̄T = ΣXT/N = (15 + 20 + 25)/15 = 60/15 = 4,

where ΣXT is the sum of all the observations in the three groups and N is the total number of observations in the three groups.


Step 3: Calculation of the Sum of Squares: For ANOVA, we need to calculate three types of variations: the total variation, the variation within groups, and the variation between groups. These three variations are represented by the total sum of squares (SST), the sum of squares within groups (SSW), and the sum of squares between groups (SSB):

The Total Sum of Squares (SST)

First, we calculate the total variation in all the values of the three groups. To accomplish this, we subtract the grand mean from each value of the three groups under consideration and then calculate the sum of squared deviations, called the total sum of squares (SST):

SST = Σ(xi − x̄T)²,

where xi represents the observations of all the groups and x̄T is the grand mean calculated from all the values of the three groups. We subtract the grand mean from each observation, square the deviations, and then add all the squared deviations. The grand mean (x̄T) is 4; we subtract 4 from each value of the three groups and square the deviations to calculate the total sum of squares (SST):

SST = (2 − 4)² + (4 − 4)² + (3 − 4)² + (4 − 4)² + (2 − 4)²
    + (4 − 4)² + (4 − 4)² + (3 − 4)² + (5 − 4)² + (4 − 4)²
    + (4 − 4)² + (5 − 4)² + (6 − 4)² + (4 − 4)² + (6 − 4)²
= (−2)² + (0)² + (−1)² + (0)² + (−2)² + (0)² + (0)² + (−1)² + (1)² + (0)² + (0)² + (1)² + (2)² + (0)² + (2)²
= 4 + 0 + 1 + 0 + 4 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 4 + 0 + 4
= 9 + 2 + 9
SST = 20


The Sum of Squares within Groups (SSW)

Next, we calculate the variation within each group. To accomplish this, we subtract the respective group mean from each value of the group and calculate the sum of squared deviations, called the sum of squares within groups (SSW):

SSW = Σ(x1i − x̄1)² + Σ(x2i − x̄2)² + Σ(x3i − x̄3)²

In other words, this formula says to first subtract the mean of the first group from each value of the first group and square the deviations; next, subtract the mean of the second group from each value of the second group and square the deviations; next, subtract the mean of the third group from each value of the third group and square the deviations; finally, sum up these squared deviations. The sum of these squared deviations is called the sum of squares within groups. The result of the operations is as follows:

SSW = (2 − 3)² + (4 − 3)² + (3 − 3)² + (4 − 3)² + (2 − 3)²
    + (4 − 4)² + (4 − 4)² + (3 − 4)² + (5 − 4)² + (4 − 4)²
    + (4 − 5)² + (5 − 5)² + (6 − 5)² + (4 − 5)² + (6 − 5)²
= (−1)² + (1)² + (0)² + (1)² + (−1)² + (0)² + (0)² + (−1)² + (1)² + (0)² + (−1)² + (0)² + (1)² + (−1)² + (1)²
= 1 + 1 + 0 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 1
= 4 + 2 + 4
SSW = 10

The Sum of Squares between Groups (SSB)

The total sum of squares is equal to the sum of squares within groups plus the sum of squares between groups:

SST = SSW + SSB;

therefore, the sum of squares between groups is:

SSB = SST − SSW = 20 − 10 = 10

Step 4: Construction of an ANOVA Table: The next step is to create a table using the between groups and within groups sums of squares to calculate the F-statistic.

Table 16-2 ANOVA Table

Source of Variation   Degrees of Freedom     Sum of Squares   Mean Square      F-Ratio
Between Groups        k − 1 = 3 − 1 = 2      10               10 ÷ 2 = 5.0     5.0 ÷ 0.83 = 6.02
Within Groups         N − k = 15 − 3 = 12    10               10 ÷ 12 = 0.83

Here k is the number of groups; there are three groups. N represents the total number of observations; each group has five observations, so the total for the three groups is 15.

Note: The between groups sum of squares (SSB) is also called the treatment or factor, and the within groups sum of squares (SSW) is also called the error.

Step 5: Decision: We look at Appendix Table 4A to find the critical value of the F-ratio under the between-groups degrees of freedom (df = 2) and against the within-groups degrees of freedom (df = 12) in the vertical column. This tabulated (critical) value of the F-ratio is 3.89. Our calculated value of the F-ratio (6.02) exceeds the critical value of the F-ratio (3.89); therefore, we reject the null hypothesis. We conclude that there are statistically significant differences in the reduction of blood pressure among the three groups using medication, diet, and exercise, respectively.
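The whole procedure can be checked in a few lines. Here is a minimal sketch in Python with SciPy; the manual sums of squares reproduce Table 16-2, and scipy.stats.f_oneway serves as an independent cross-check (it reports F = 6.0, while the text's 6.02 reflects rounding the within-groups mean square to 0.83):

    from scipy import stats

    medication = [2, 4, 3, 4, 2]
    diet       = [4, 4, 3, 5, 4]
    exercise   = [4, 5, 6, 4, 6]
    groups = [medication, diet, exercise]

    values = [v for g in groups for v in g]
    N, k = len(values), len(groups)
    grand_mean = sum(values) / N
    sst = sum((v - grand_mean)**2 for v in values)                # 20
    ssw = sum((v - sum(g)/len(g))**2 for g in groups for v in g)  # 10
    ssb = sst - ssw                                               # 10
    f_ratio = (ssb / (k - 1)) / (ssw / (N - k))
    print(sst, ssw, ssb, round(f_ratio, 2))                       # 20.0 10.0 10.0 6.0

    f, p = stats.f_oneway(medication, diet, exercise)             # cross-check
    print(round(f, 2), round(p, 4))                               # 6.0, p below 0.05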


A Simpler Method

There is a simpler method for calculating the sums of squares. In this method, you don't have to take squared deviations. You just square the values of the observations of the groups, as shown in Table 16-3, and use these squared values to calculate the sums of squares (see below).

Table 16-3 Reduction in Blood Pressure in Groups Using Medication (Group 1), Diet (Group 2), and Exercise (Group 3)

N       x1    x2    x3    x1²   x2²   x3²
1       2     4     4     4     16    16
2       4     4     5     16    16    25
3       3     3     6     9     9     36
4       4     5     4     16    25    16
5       2     4     6     4     16    36
Total   15    20    25    49    82    129
x̄i      3     4     5

Total Sum of Squares (SST)

SST = ΣΣx² − Nx̄T²,

where
ΣΣx² = the sum of the totals of the squared values of the observations = 49 + 82 + 129 = 260
N = the total number of observations = 15
x̄T = the grand mean = (15 + 20 + 25) ÷ 15 = 60 ÷ 15 = 4
x̄T² = the square of the grand mean = 4² = 16
Nx̄T² = the total number of observations × the square of the grand mean = 15 × 16 = 240

SST = 260 − 240 = 20

Sum of Squares within Groups (SSW)

SSW = ΣΣx² − Σn(x̄i)²,

where
ΣΣx² = the sum of the totals of the squared values of the observations = 49 + 82 + 129 = 260
n = the number of observations in the respective group = 5
Σn(x̄i)² = the sum of the number of observations of each group multiplied by the square of its group mean:
= (5 × 3²) + (5 × 4²) + (5 × 5²)
= (5 × 9) + (5 × 16) + (5 × 25)
= 45 + 80 + 125
= 250

SSW = ΣΣx² − Σn(x̄i)² = 260 − 250 = 10

Sum of Squares between Groups (SSB)

SSB = SST − SSW = 20 − 10 = 10

Next, you create an ANOVA table similar to Table 16-2 and proceed with the decision making as shown in the previous example.

Limitations of ANOVA

There are two limitations of the analysis of variance (ANOVA). First, it assumes that all the groups have equal variance. You will remember from Chapter 5 that the variance is the sum of squared deviations divided by the total number of deviations (s² = Σ(x − x̄)²/n). In other words, ANOVA assumes that the values of each group have a similar distribution around their mean, which may not be the case for some groups. Second, the analysis of variance only informs us that the groups differ from each other. It does not indicate how much difference there is between any two groups. To find this out, we would have to perform more cumbersome tests, called post-hoc tests, which are beyond the scope of this book. You will learn these types of tests in higher-level statistics courses.


Exercise for Practice

Q1. Shilpa is thinking of buying exercise equipment. She is interested to know whether there is a difference in calories burned per minute among the three equipment choices. She collected the following data on three groups of women using three different types of equipment.

Calories Burned per Minute in Three Groups of Women Using a Stationary Bike, an Elliptical Machine, and a Treadmill

N            Group 1 x1    Group 2 x2    Group 3 x3
1            4             4             3
2            5             5             4
3            6             4             3
4            5             3             2
Total        20            16            12
Group Mean   5             4             3

a. Complete the following templates to calculate the total sum of squares (SST), the sum of squares between groups (SSB), and the sum of squares within groups (SSW):

Grand mean x̄T = ΣxT/N =

SST = Σ(xi − x̄T)²
= ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )²
SST =

SSW = Σ(x1i − x̄1)² + Σ(x2i − x̄2)² + Σ(x3i − x̄3)²
= ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )² + ( )²
SSW =

SSB = SST − SSW

b. Now, complete the following ANOVA table to find the F-ratio:

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
SSB                   k − 1 =
SSW                   N − k =
SST

c. Test H0 at α = 0.05.

Q2. Now use the simpler version to find SST, SSW, and SSB.

Observation   Stationary Bike (Group 1) x1²   Elliptical Machine (Group 2) x2²   Treadmill (Group 3) x3²
1
2
3
4
Σx²           102                             66                                 38

SST = ΣΣx² − Nx̄T²,
where
ΣΣx² = the sum of the totals of the squared values of the observations
N = the total number of observations
x̄T = the grand mean
SST =

Sum of Squares within Groups (SSW)
SSW = ΣΣx² − Σn(x̄i)²,
where
ΣΣx² = the sum of the totals of the squared values of the observations
Σn(x̄i)² = the sum of the number of observations of each group multiplied by the square of its group mean
SSW =

SSB = SST − SSW =

CHAPTER SEVENTEEN

REGRESSION ANALYSIS: A PREDICTION AND FORECASTING TECHNIQUE

Learning Objectives

This chapter is an introduction to regression analysis. If you study more advanced statistics in the future, you will learn that regression analysis is a technique that helps us move from bivariate analysis to multivariate analysis. In this chapter, you will be introduced to:
• the regression equation;
• the use of regression for forecasting and predicting;
• the variance explained by an independent variable in the dependent variable;
• the interpretation of R-squared (R²); and
• the relationship between the regression coefficient (b) and the correlation coefficient (r).

Introduction

Correlation is used to find a relationship between two variables. Regression analysis is used to predict the value of a dependent variable from the known values of an independent variable, or of several independent variables, correlated with the dependent variable. It is also used to assess the effect or influence of each independent variable, or the combined effect of several independent variables, on the dependent variable. At this stage, you should keep in mind that the dependent variable is the variable that a researcher intends to study with the help of the other variables that are correlated with it. These other variables are called independent variables. For example, you may want to study income, which will be your dependent variable, with the help of some variables that are related to income. These variables could be the age of respondents, because older employees tend to earn more; the education of respondents, because persons with higher education tend to earn more than those with lower education; and the years of experience in a job, because the longer the experience, usually the higher the salary. These three variables will be your independent variables.

The predicting or forecasting of the future is a major use of regression analysis. For example, because there is a correlation between the age of drivers and auto accidents, auto insurance companies use the ages of their clients to predict the probability of accidents by the age of drivers. Based on these predictions, auto insurance companies set higher rates for younger drivers. In this case, age is used as an independent variable to predict the dependent variable, the accident rate. Economists call independent variables predictors or predictor variables. We don't say that the age of the driver causes the accident, but we do say that the age of the driver is correlated with the auto accident rate. The higher the correlation between an independent variable and a dependent variable, the better the prediction. In other words, a high correlation between a dependent variable and an independent variable is necessary for a good prediction of the dependent variable.

Another important use of regression analysis is that we can determine the contribution of each independent variable to the prediction of a dependent variable. To accomplish this, we determine the percentage of variation explained by each independent variable in the total variation of the dependent variable. This concept will become clearer in a moment when we perform a regression analysis. You will also learn that we can calculate a statistic called the coefficient of determination (R-squared) from a regression analysis. The higher the value of R-squared, the better the prediction.

The term regression was introduced by the 19th-century statistician Francis Galton.14 He observed that the heights of sons tend to regress toward the heights of fathers. He found that if the fathers were very tall, their sons tended to be tall but shorter than their fathers. He also observed that the sons of short fathers tended to be taller than their fathers but shorter than the average height of all the fathers. For example, the mean height of the sons whose fathers were 63 inches tall was 66.5 inches, and the mean height of the sons whose fathers were 73 inches tall was 72 inches. Galton concluded from his observations that the heights of the sons tend to regress toward the mean height of their fathers. He called his discovery a regression to the mean.

14 Galton, Francis. Natural Inheritance. London: MacMillan & Co., 1889.


The Regression Equation and Its Application for Forecasting

We know from our hypothetical data on walking and losing weight in Chapter 14 that there is a relationship between the independent variable x, the number of kilometres walked per day for a month, and the dependent variable y, the number of kilograms of weight lost. The relationship between these two variables can be described by the following equation, called the regression equation:

y = a + bx

This equation is also called the equation of the goodness of fit. There are two unknown coefficients in this equation: a and b. These coefficients are also called constants or regression coefficients. If we can find the values of the a and b coefficients in the above equation, we can estimate the value of y for any given value of x.

Regression Coefficients

The coefficient a is called the intercept, and the coefficient b is called the slope of the regression line. If we plot the values of the dependent variable y against the independent variable x, the intercept is located where the regression line crosses the y-axis. For example, in Figure 17-1, the regression line intercepts the y-axis at 2. The slope of the line is the angle that the regression line makes with the x-axis.

Figure 17-1 Regression Line (a straight line plotted against the x- and y-axes, crossing the y-axis at 2)


As mentioned, if we can find the values of the a and b coefficients of a regression equation, we can estimate the value of y for any value of x. Let's again use the same hypothetical data on walking and weight reduction for four people. Assuming that the independent variable x is the number of kilometres walked every day for a month and the dependent variable y is the number of kilograms of weight reduced in a month, Table 17-1 gives the values of x and y, the deviations of x and y from their respective means, the product of these deviations, and the squared deviations of x from its mean (x̄).

Table 17-1 Number of Kilometres Walked (x) per Day in a Month and Number of Kilograms of Weight Reduced (y)

Person     x         y       (x − x̄)   (y − ȳ)   (x − x̄)(y − ȳ)   (x − x̄)²
Tom        2         3       −1.5      0         0.0              2.25
John       3         2       −0.5      −1        0.5              0.25
Terry      4         3       0.5       0         0.0              0.25
Monica     5         4       1.5       1         1.5              2.25
Sum (Σ)    14        12                          2.0              5.0
Mean       x̄ = 3.5   ȳ = 3

The following formula can be used to calculate the slope or regression coefficient b:

b = Σ[(x − x̄)(y − ȳ)] / Σ(x − x̄)²
b = 2/5
b = 0.4

Once we have determined the value of the coefficient b, finding the value of the coefficient a is easy:

ȳ = a + bx̄
a = ȳ − bx̄
a = 3 − (0.4 × 3.5)
a = 3 − 1.4
a = 1.6

Once we know the values of the regression coefficients a and b, we can use the regression equation to predict a value of the dependent variable (y) for any value of the independent variable (x).

Chapter Seventeen

208

walked 6 kilometres every day for a month, we can predict the number of kilograms of weight he would lose by using the above regression equation. y = a + bx, where x is the independent variable, the number of kilometres walked, and y is the dependent variable, the number of kilograms of weight lost. y = 1.6 + (0.4 × 6) y = 1.6 + 2.4 y = 4.0 David is likely to lose 4 kilograms of weight. That is how the regression can be used for predicting or forecasting.
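If you would like to check this arithmetic with software, here is a minimal Python sketch (mine, not the book's; the variable names are illustrative) that reproduces the coefficients and David's prediction:

    # Fit y = a + bx by the least-squares formulas used above.
    xs = [2, 3, 4, 5]   # kilometres walked per day (Tom, John, Terry, Monica)
    ys = [3, 2, 3, 4]   # kilograms of weight lost

    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n

    # Slope b = sum of cross-deviations / sum of squared deviations of x
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar   # intercept

    print(a, b)        # 1.6 0.4
    print(a + b * 6)   # 4.0 -- predicted weight loss for David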

Variance Explained: The Second Major Use of Regression

The second major use of regression analysis is that we can find out the amount of variation in a dependent variable (y) explained by an independent variable (x). To accomplish this, we use the regression equation to calculate the predicted value (ŷ) corresponding to each actual value of the dependent variable y. Then we find the difference between the predicted values and the actual values, which is helpful in calculating a statistic called R-squared. R-squared indicates how closely the predicted values track the actual values of a dependent variable, and it can be used to estimate the variance explained by an independent variable in the total variance of a dependent variable. We can use the above-mentioned regression equation to calculate R-squared.

Calculation of the Coefficient of Determination (R²)

Let's use the values of x in Table 17-1 and the regression coefficients a and b from the above regression equation to calculate the predicted value for y of each person:

ŷ = a + bx,

where ŷ is the predicted value corresponding to an actual value of y, a = 1.6, b = 0.4, and x = the number of kilometres walked.

The predicted value of ŷ for each person is:


Tom: ŷ = a + bx = 1.6 + 0.4 × 2 = 1.6 + 0.8 = 2.4
John: ŷ = a + bx = 1.6 + 0.4 × 3 = 1.6 + 1.2 = 2.8
Terry: ŷ = a + bx = 1.6 + 0.4 × 4 = 1.6 + 1.6 = 3.2
Monica: ŷ = a + bx = 1.6 + 0.4 × 5 = 1.6 + 2.0 = 3.6

R-squared is built from two sums of squares: the sum of squared deviations of the actual values of y from their predicted values (the residual, or unexplained, variation) and the sum of squared deviations of the actual values of y from their mean (the total variation). To find the values of these two sums of squares, plug the predicted values (ŷ) into the following table:

Table 17-2 Number of Kilometres Walked (x), Actual Number of Kilograms Reduced (y), and Predicted Number of Kilograms Reduced (ŷ)

            x      y      ŷ      y – ŷ    (y – ŷ)²   (y – ȳ)   (y – ȳ)²
Tom         2      3     2.4      0.6       0.36        0          0
John        3      2     2.8     –0.8       0.64       –1          1
Terry       4      3     3.2     –0.2       0.04        0          0
Monica      5      4     3.6      0.4       0.16        1          1
Sum (Σ)    14     12                        1.20                   2
Mean      3.5      3

R-squared gives the proportion of the variance in the dependent variable (y) explained by the independent variable (x). Its formula is as follows:

R² = 1 – (Sum of squared deviations of the actual values of y from their predicted values ÷ Sum of squared deviations of the actual values of y from their mean)

R² = 1 – [Σ(y – ŷ)² ÷ Σ(y – ȳ)²]

R² = 1 – (1.20 ÷ 2) = 1 – 0.60

R² = 0.40

(The ratio 1.20 ÷ 2 = 0.60 is the proportion of the variance left unexplained by the regression; subtracting it from 1 gives the proportion explained.)

In simple linear regression, the square root of R² is equivalent to the correlation coefficient; therefore, when the correlation between the two variables is linear, you can convert R² into the correlation coefficient r as follows:


r = √R² = √0.40 ≈ 0.632
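A minimal Python sketch (mine, not the book's) confirms both numbers from the residuals in Table 17-2:

    xs = [2, 3, 4, 5]
    ys = [3, 2, 3, 4]
    a, b = 1.6, 0.4                      # coefficients found above

    y_hat = [a + b * x for x in xs]      # predicted values
    y_bar = sum(ys) / len(ys)

    # Residual and total sums of squares, as in Table 17-2
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # = 1.20
    sst = sum((y - y_bar) ** 2 for y in ys)               # = 2.0

    r_squared = 1 - sse / sst
    print(r_squared)          # 0.4
    print(r_squared ** 0.5)   # ~0.632, the correlation coefficient r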

Interpretation of R-Squared

R-squared is also called the coefficient of determination. It indicates the proportion of the total variance in the dependent variable y explained by an independent variable x. In our hypothetical example, we can say that 0.40 × 100 = 40% of the variation in the kilograms of weight reduced is explained by the number of kilometres walked every day for a month.

The Correlation Coefficient (r) and the Regression Coefficient (b)

The third utility of regression analysis is that the regression coefficient b can be used as a measure of correlation between an independent variable and a dependent variable. In simple linear regression, the regression coefficient b and Pearson's correlation coefficient r are the same if both variables (x and y) are standardized first. The equivalence can be proven with a slightly complicated argument showing that r² = bx × by, where bx is the regression coefficient when x is assumed to be the independent variable and by is the regression coefficient when y is assumed to be the independent variable. We are going to take a much simpler approach: we will show that if the variance of x and the variance of y are the same, then r = b.

If x is an independent variable and y is a dependent variable, the regression coefficient b is the covariance between x and y divided by the variance of x, and it can be written as follows:

b = Σ[(x – x̄)(y – ȳ)] ÷ Σ(x – x̄)² = Covariance of x and y ÷ Variance of x

If x is an independent variable and y is a dependent variable, the correlation coefficient r is the covariance of x and y divided by the product of the standard deviations of x and y, and it can be written as follows:

r = Σ[(x – x̄)(y – ȳ)] ÷ √[Σ(x – x̄)² × Σ(y – ȳ)²] = Covariance of x and y ÷ Product of the standard deviations of x and y

If the variance of the independent variable x and that of the dependent variable y were the same, we could substitute x for y in the denominator of the above equation:

r = Σ[(x – x̄)(y – ȳ)] ÷ √[Σ(x – x̄)² × Σ(x – x̄)²]

r = Σ[(x – x̄)(y – ȳ)] ÷ Σ(x – x̄)²

which is the same formula as

b = Σ[(x – x̄)(y – ȳ)] ÷ Σ(x – x̄)²

You can see that the regression coefficient b is similar to the correlation coefficient r; therefore, it can be used as a measure of correlation between an independent variable and a dependent variable. This is the third utility of regression.

The purpose of this chapter was to provide you with a basic understanding of regression analysis. It will help you to understand higher-level multiple regression. With this basic knowledge, if you have to carry out a regression analysis for your research using a computer, you'll be able to interpret the computer output with greater understanding and ease. I have seen many researchers struggle to interpret regression analyses because they lacked this basic knowledge.
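To see the equivalence numerically, here is a minimal Python sketch (mine, not the book's): standardizing x and y forces their variances to be equal, and the slope of the standardized regression then equals Pearson's r.

    import statistics as st

    xs = [2, 3, 4, 5]
    ys = [3, 2, 3, 4]

    def standardize(values):
        # Subtract the mean and divide by the (population) standard deviation.
        mean, sd = st.mean(values), st.pstdev(values)
        return [(v - mean) / sd for v in values]

    zx, zy = standardize(xs), standardize(ys)

    # Slope of the regression of zy on zx (both means are 0 after standardizing)
    b_std = sum(x * y for x, y in zip(zx, zy)) / sum(x * x for x in zx)
    print(round(b_std, 3))   # 0.632 -- equal to the correlation coefficient r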

Exercises for Practice

Q1. From the following data on the family size (x) and the family room size (y):
i. Draw a regression line to show that the relationship between the family size (x) and the family room size (y) is perfectly linear.
ii. Use the following template and calculate the regression coefficients a and b.

            x      y      (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)   (x – x̄)²
A           1     10
B           2     20
C           3     30
D           4     40
E           5     50
Sum (Σ)    15    150
Mean      x̄ =    ȳ =


Q2. Using the regression equation (y = a + bx), predict the values of ŷ:
i. x = 10; a = 5; b = 3
ii. x = 5; a = 3; b = 2
iii. x = 15; a = 10; b = 12
iv. x = 3.2; a = 2.5; b = 0.3
v. x = 2; a = 0.2; b = 1.8

Q3. If the variation explained by the number of years of education in the salary of new graduates is 36%, what is the correlation coefficient (r) between the number of years of education and the salary of new graduates?

SOLUTIONS TO EXERCISES FOR PRACTICE

Chapter One

1. 40 – 30 – (–5) + 3 = 40 – 30 + 5 + 3 = 48 – 30 = 18
2. (50 – 30) + 2 × 5 = 20 + 10 = 30
3. 5 × (10 – 8) – 3 = 5 × 2 – 3 = 10 – 3 = 7
4. 27 ÷ 3² × 5 = 27 ÷ 9 × 5 = 15
5. 5 × 3 – 9 ÷ 3 = 15 – 3 = 12
6. 9 ÷ 3 – 5 × 3 = 3 – 15 = –12
7. √25 = 5
8. 5² + 4³ = 25 + 64 = 89
9. 5⁴ – 5² = 625 – 25 = 600
10. Find the common log of 100 (Hint: 10 raised to what power equals 100?) = 2
11. Find the natural log of 7.389 (Hint: 2.718 raised to what power ≈ 7.389?) = 2
12. Use the values of x and y in the following table and fill in the values for the question marks (?) in the table.

x      y      (x – y)      (x – y)²              xy
              Deviation    Square of Deviation   Product of x and y
5      3         2             4                 15
4      2         2             4                  8
3      6        –3             9                 18
Σx = 12    Σy = 11

(a) Σ(x – y)² = 17
(b) Σxy = 41
(c) ΣxΣy = 12 × 11 = 132

Chapter Two

Q1. What is the level of measurement (nominal, ordinal, interval, or ratio) of the following variables?
1. ordinal
2. nominal
3. nominal
4. ratio
5. ratio
6. interval
7. ratio
8. nominal
9. ordinal
10. nominal

Q2. What is the difference between an interval-level and a ratio-level variable?
The interval-level variable does not have an absolute 0, whereas the ratio-level variable has an absolute 0.

Q3. Is the following variable a continuous or a discrete variable?
1. continuous
2. discrete
3. continuous
4. continuous
5. discrete
6. discrete

Chapter Three

Exercise 1: Look at the graph below and answer Q1 to Q3:
Q1. 50
Q2. India
Q3. Niger and Japan; 38.0

Exercise 2: Look at the following vertical bar graph and answer Q4 to Q6:
Q4. B
Q5. B
Q6. C

Exercise 3: Look at the following pie chart and answer Q7 to Q10:
Q7. C
Q8. A
Q9. D
Q10. C

Chapter Four

Q1. Mode: $60
Median: ($60 + $70) ÷ 2 = $130 ÷ 2 = $65
Mean: $390 ÷ 6 = $65

Q2. A.
Age      Frequency   Cumulative Frequency
45–49        6              15
40–44        3               9
35–39        3               6
30–34        2               3
25–29        1               1

B. The largest number of frequencies is in the interval 45–49; therefore, the interval 45–49 is the modal interval.
C. There are 15 frequencies in total in the distribution. The eighth frequency divides the distribution into two equal halves. If we cumulate frequencies from the bottom (the lowest interval) to the top (the highest interval), we find that the eighth frequency lies in the 40–44 interval; therefore, the 40–44 interval contains the median.


Q3. The following table gives the selling prices of nine houses.

Price per House    Number of Houses
$200,000                  1
$300,000                  4
$400,000                  3
$1,900,000                1
Total                     9

A. ($200,000 × 1 + $300,000 × 4 + $400,000 × 3 + $1,900,000 × 1) ÷ 9 = ($200,000 + $1,200,000 + $1,200,000 + $1,900,000) ÷ 9 = $4,500,000 ÷ 9 = $500,000. The mean is $500,000.
B. The distribution of the prices of the nine houses is as follows: $200,000, $300,000, $300,000, $300,000, $300,000, $400,000, $400,000, $400,000, $1,900,000. There are nine values; the fifth value, in the middle of the distribution, is the median, and the fifth value is $300,000; therefore, the median is $300,000.
C. Because the distribution is skewed, distorted by one very high value ($1,900,000), the median is a better representative of the central tendency.

Chapter Five

Exercise 1:
Q1. 7
Q2. 4
Q3. 10
Q4. 1.6
Q5. 1.26

Exercise 2:
Q6. 15.5

Q7 to Q9 Table: Number of Hours Spent Watching Television

Hours    Midpoint (x)   Frequency (f)   f × x   (x – x̄)   (x – x̄)²   (x – x̄)² × f
5–9           7              1             7      –8.5      72.25        72.25
10–14        12              4            48      –3.5      12.25        49.00
15–19        17              2            34       1.5       2.25         4.50
20–24        22              3            66       6.5      42.25       126.75
Total        58             10           155                            252.50

Q10. 25.25
Q11. 5.02
Q12. 0.32

Chapter Six

Q1. a
Q2. a
Q3. d
Q4. a
Q5. c
Q6. c
Q7. b
Q8. c
Q9. a
Q10. c

Chapter Seven

Q1. 2
Q2. 3
Q3. 3
Q4. Homeless persons are hard to reach because they have no fixed address, and there will be no sampling frame to work with; therefore, I will choose the snowball technique.
Q5. Because I want each division, branch, and unit to be adequately represented, stratified random sampling will be the most appropriate.
Q6. Since proportional representation from each instrument type is required, stratified random sampling will be appropriate.
Q7. When a researcher thinks that the number of certain groups in the population is so small that he or she may miss adequate representation of a certain group, the use of quota sampling is appropriate. Risk: Because quota sampling is a non-probability sampling technique, the risk is that the sample may not represent the population.
Q8. Because you have a complete sampling frame and your only concern is representativeness, simple random sampling will serve the purpose.
Q9. 3
Q10. 2

Chapter Eight

Q1. 2
Q2. 1
Q3. 4
Q4. 4
Q5. 2
Q6. 2
Q7. 2
Q8. 1
Q9. 4
Q10. 1

Chapter Nine

Q1. What is the area between the mean and the z values given below?
f. 0.4896
g. 0.3907
h. 0.4985
i. 0.2549
j. 0.3413
Hint: Consult Appendix Table 1A. Find the area under the curve for the z value.

Q2. What is the percentile (i.e., the percentage) of students scoring below the z-scores of the following five students?
John: 0.9406 (94.06%)
Shilpa: 0.9962 (99.62%)
Anwar: 0.5 – 0.2995 = 0.2005 (20.05%)
Mary: 0.9719 (97.19%)
Jesse: 0.5 – 0.4957 = 0.0043 (0.43%)
Hint: Consult Appendix Table 1A. Find the area under the curve for the z value.

Q3. Now find the percentages of students who scored above the z-scores of the following five students.
John: 1 – 0.9406 = 0.0594 (5.94%)
Shilpa: 1 – 0.9962 = 0.0038 (0.38%)
Anwar: 1 – 0.2005 = 0.7995 (79.95%)
Mary: 1 – 0.9719 = 0.0281 (2.81%)
Jesse: 1 – 0.0043 = 0.9957 (99.57%)

Q4. z = 1.78. The area under the curve between the mean and the z-score is 0.4625 (consult Appendix Table 1A).
Percentile rank = 0.5 + 0.4625 = 0.9625 = 96.25%

Q5. z = 3.0. Area under the curve = 0.4987.
Percentile rank = 0.5 + 0.4987 = 99.87%

Q6. z = 3.0. Area under the curve = 0.9987.
Percentile rank = 99.87%
Based on their percentile ranks, Einstein and Hawking are equally intelligent.

Q7. For the score of 140, z = –1.33; for the score of 160, z = 1.33.
From Appendix Table 1A:
Area under the curve up to z = –1.33 is 0.5 – 0.4082 = 0.0918.
Area under the curve up to z = 1.33 is 0.5 + 0.4082 = 0.9082.
The area under the curve between z = –1.33 and z = 1.33 is therefore 0.9082 – 0.0918 = 0.8164, or 81.64%.
The probability is 0.8164 that a student will score between 140 and 160. In other words, there is an 81.64% chance that a student will score between 140 and 160.

Q8. Between 135.2 and 154.8.
Hint: Half of 0.95 is 0.475. The z-score corresponding to an area of 0.475 in Appendix Table 1A is 1.96. To find the value X, plug the values into this formula: X = z × σ + μ
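If you prefer to check these table lookups with software, here is a minimal Python sketch; it assumes scipy is installed, and the μ and σ in the last step are my inference from the printed Q8 answer, not values stated here:

    from scipy.stats import norm

    # Area between the mean (z = 0) and z = 1.78, as in Table 1A (Q4):
    print(norm.cdf(1.78) - 0.5)               # ~0.4625

    # Percentile rank of z = 1.78 (Q4):
    print(norm.cdf(1.78))                     # ~0.9625

    # Area between z = -1.33 and z = 1.33 (Q7):
    print(norm.cdf(1.33) - norm.cdf(-1.33))   # ~0.8165

    # Middle 95% cutoffs (Q8): X = z * sigma + mu with z = 1.96.
    # mu = 145 and sigma = 5 are assumed; they are consistent with the
    # printed answer of 135.2 and 154.8.
    z = norm.ppf(0.975)
    mu, sigma = 145, 5
    print(mu - z * sigma, mu + z * sigma)     # ~135.2, ~154.8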

Chapter Ten

Q1. a
Q2. c
Q3. d
Q4. b
Q5. In Appendix Table 1A, the area between the mean and the z value of 1.81 is 0.4649, and the p-value (i.e., the area in the tail of the curve beyond the z value) would be 0.5 – 0.4649 = 0.0351. Or, in Appendix Table 1B, the area beyond the z value of 1.81 is 0.0351, which is the p-value.
Q6. a. df = n – 1; therefore, 50 – 1 = 49
b. df = (r – 1)(c – 1) = (5 – 1) × (3 – 1) = 8
Q7. c
Q8. 1 – 0.10 = 0.90. The level of confidence is 90%.

Chapter Eleven

Q1. d
Q2. c
Q3. c
Q4. d
Q5. c
Q6. a. H0: The number of rooms in a house is not dependent on the number of persons in the family.
b. H1: The number of rooms in a house is dependent on the number of persons in the family.
c. Calculated χ² = 8.75; the tabulated (critical) value of χ² at the α = 0.05 level with df = 4 is 9.49. The calculated value does not exceed the tabulated value. We fail to reject H0 that the number of rooms is not dependent on the number of persons in the family. In other words, we accept that house size and family size are not statistically significantly related.
Q7. a. Calculated χ² = 0.52.
b. H0: There is no relationship between religion and marital status.
c. H1: There is a statistically significant relationship between religion and marital status.
d. The tabulated (critical) value of χ² with df = 2 at the α = 0.05 level is 5.99; the calculated value of chi-square does not exceed the tabulated value. We fail to reject H0 that there is no relationship between religion and marital status. In other words, we accept that there is no statistically significant relationship between religion and marital status.
e. Cramer's V = 0.07; little if any association between religion and marital status.
Q8. λ = 0.15. The prediction of party affiliation is improved by 15% by knowing the religion of a person.
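If you want to check the chi-square critical values without the appendix table, a minimal sketch follows (it assumes scipy is installed; chi2.ppf is the inverse of the chi-square cumulative distribution):

    from scipy.stats import chi2

    alpha = 0.05
    print(chi2.ppf(1 - alpha, df=4))   # ~9.49, the Q6 critical value
    print(chi2.ppf(1 - alpha, df=2))   # ~5.99, the Q7 critical value

    # Decision rule: reject H0 only if the calculated statistic exceeds
    # the critical value.
    print(8.75 > chi2.ppf(1 - alpha, df=4))   # False -> fail to reject H0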

Chapter Twelve

Q1. d
Q2. c
Q3. a
Q4. b
Q5. c
Q6. a. γ = 0.54
b. Null hypothesis (H0): Party affiliation does not influence support for the legalization of marijuana.
c. Alternate hypothesis (H1): Party affiliation influences support for the legalization of marijuana.
d. Our value of gamma (γ) is 0.54; therefore, the association between the two variables is positive and strong. It means that with an increase in liberal affiliation, support for the legalization of marijuana will increase. There will be 54% fewer errors when estimating support for the legalization of marijuana by knowing a person's party affiliation.
Q7. a. ρ = 0.85
b. Our prediction errors are reduced by 85% by knowing the ranking of universities in a previous year.
Q8. a. Somers' D = 0.33
b. The value of Somers' D is 0.33; therefore, the association between the two variables is positive and moderately strong.
c. Tau-b = 0.32
d. The value of tau-b is also 0.32; therefore, the association between the two variables is positive and moderately strong.
Q9. The calculated value of z = 1.14 does not exceed the tabulated (critical) value of 1.96 at α = 0.05 (two-tailed). We fail to reject H0. Though the value of tau-b shows an association between the frequency of church attendance and support for euthanasia, the association is not statistically significant.


Chapter Thirteen

Q1. Calculated t = 2.1
H0: The average salary of the faculty of the consortium of 10 universities is not significantly higher than faculty salaries across Canada.
H1: The average salary of the faculty of the consortium of 10 universities is significantly higher than faculty salaries across Canada.
Decision: one-tailed test; α = 0.05; df = n – 1 = 10 – 1 = 9. Tabulated (critical) t = 1.83.
The calculated value of t (2.1) exceeds the tabulated value of t (1.83); therefore, we reject H0. The average salary of the faculty of the consortium of 10 universities is statistically significantly higher than the average faculty salary across the nation.

Q2. H0: There is no statistically significant difference between the life expectancy of the sample of First Nations men and that of Canadian men.
H1: There is a statistically significant difference between the life expectancy of the sample of First Nations men and that of Canadian men.
Calculated t = –3.5; tabulated t (two-tailed) with df = 25, α = 0.05: –2.06.
Decision: The calculated value of t exceeds the tabulated value of t; we reject the null hypothesis that there is no statistically significant difference between the life expectancy of the sample of First Nations men and that of Canadian men. We conclude that there is a statistically significant difference between the two.


Q3. t = (x̄1 – x̄2) ÷ S(x̄1 – x̄2)

S(x̄1 – x̄2) = √{[(n1S1² + n2S2²) ÷ (n1 + n2 – 2)] × [(n1 + n2) ÷ (n1n2)]}
S(x̄1 – x̄2) = √{[(10 × 2² + 12 × 3²) ÷ (10 + 12 – 2)] × [(10 + 12) ÷ (10 × 12)]}
S(x̄1 – x̄2) = √(148 ÷ 20 × 22 ÷ 120) = √1.36 = 1.16

t = (84 – 82) ÷ 1.16 = 2 ÷ 1.16 = 1.7

Tabulated t (two-tailed), α = 0.05, df = 10 + 12 – 2 = 20: t = 2.09
Decision: The calculated value of t (1.7) does not exceed the tabulated value of t (2.09); we fail to reject H0 that there is no significant difference in the scores of boys and girls. In other words, we accept that there is no statistically significant difference in the scores of boys and girls.
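For readers who want to reproduce this arithmetic, here is a minimal Python sketch (mine, not the book's; which mean belongs to which group is my assumption for illustration):

    import math

    n1, mean1, s1 = 10, 84, 2   # first group: size, mean score, SD
    n2, mean2, s2 = 12, 82, 3   # second group: size, mean score, SD

    # Pooled standard error of the difference between means,
    # matching the formula used above (n * S^2 in the numerator):
    se = math.sqrt((n1 * s1 ** 2 + n2 * s2 ** 2) / (n1 + n2 - 2)
                   * (n1 + n2) / (n1 * n2))
    t = (mean1 - mean2) / se
    print(round(se, 2), round(t, 1))   # 1.16 1.7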

Chapter Fourteen

Q1. a. r = 0.74
b. t = 3.1; the calculated value of t exceeds the tabulated value of t (two-tailed, α = 0.05, df = 10 – 2 = 8: t = 2.31); therefore, we reject H0. There is a statistically significant correlation between unemployment rates and poverty rates.
c. R² = 0.548; the variation explained by x in y is 54.8%.
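A minimal Python sketch (mine, not the book's) reproduces parts (b) and (c); it uses the standard t test for a correlation coefficient, t = r√(n − 2) ÷ √(1 − r²), with n = 10 implied by df = 8:

    import math

    r, n = 0.74, 10
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    print(round(t, 1))        # 3.1
    print(round(r ** 2, 3))   # 0.548 -- the variance explained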

Chapter Fifteen

Q1. 15.87%
Q2. The power of the test is 27.46%. The power has increased from 15.87% to 27.46% with an increase in the sample size from 4 to 9.
Q3. The power of the test has decreased from 15.87% to 4.75% with a decrease in α from 0.05 to 0.01.
Q4. The power of a test is the probability of rejecting a false null hypothesis. When a null hypothesis is true, there is no false null hypothesis to reject.
Q5. A type 1 error is the rejection of a true null hypothesis. When a null hypothesis is false, if you accept it, you make a type 2 error, and if you reject it, you make no error. Therefore, when a null hypothesis is false, the possibility of making a type 1 error does not exist.

Chapter Sixteen

Q1. a. SSB = 8; SSW = 6; SST = 14
b. F-ratio = 6.0
c. The tabulated F-ratio, df (2, 9), at α = 0.05 is 4.26. The calculated F-ratio exceeds the tabulated (critical) F-ratio; we reject H0 that there are no significant differences in calories burned by the three types of exercise equipment. In other words, we accept H1 that there are significant differences in calories burned by the three types of exercise equipment.

Q2. a. SSB = 8; SSW = 6; SST = 14
b. F-ratio = 6.0
c. The tabulated F-ratio, df (3, 8), at α = 0.05 is 4.26. The calculated F-ratio exceeds the tabulated (critical) F-ratio; we reject H0 that there are no significant differences in calories burned by the three types of exercise equipment. In other words, we accept H1 that there are significant differences in calories burned by the three types of exercise equipment.
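A minimal sketch (assuming scipy is installed, and assuming 3 groups with 12 observations in total, which is what df = (2, 9) implies) reproduces the Q1 F-ratio and its critical value:

    from scipy.stats import f

    ssb, ssw = 8, 6                 # between- and within-groups sums of squares
    df_between, df_within = 2, 9

    # F = mean square between / mean square within
    f_ratio = (ssb / df_between) / (ssw / df_within)
    print(f_ratio)                             # 6.0
    print(f.ppf(0.95, df_between, df_within))  # ~4.26, the critical value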

Chapter Seventeen

Q1. (i) Because the points lie on a straight line, the relationship is linear.
[Figure: "Family Size and Family Room Size" — family room size (0 to 60) plotted against family size (1 to 5); the points lie on a straight line.]
(ii) a = 0; b = 10

Q2.
i. 35
ii. 13
iii. 190
iv. 3.46
v. 3.8

Q3. If the variation explained by the number of years of education in the salary of new graduates is 36%, then the value of R² is 0.36; therefore, the coefficient of correlation is r = √0.36 = 0.6.

APPENDICES

Table 1A. Standard Normal Z-Table: Area between Mean (0) and z

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990

Table 1B. Standard Normal Z-Table: Area beyond z

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
2.9   0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010

Table 2A. Right-Tail Area for the Chi-Square Distribution

df\area     0.1        0.05       0.025       0.01       0.005
1          2.70554    3.84146    5.02389     6.6349     7.87944
2          4.60517    5.99146    7.37776     9.2103    10.59663
3          6.25139    7.81473    9.34840    11.3449    12.83816
4          7.77944    9.48773   11.14329    13.2767    14.86026
5          9.23636   11.07050   12.83250    15.0863    16.74960
6         10.64464   12.59159   14.44938    16.8119    18.54758
7         12.01704   14.06714   16.01276    18.4753    20.27774
8         13.36157   15.50731   17.53455    20.0902    21.95495
9         14.68366   16.91898   19.02277    21.6660    23.58935
10        15.98718   18.30704   20.48318    23.2093    25.18818
11        17.27501   19.67514   21.92005    24.7250    26.75685
12        18.54935   21.02607   23.33666    26.2170    28.29952
13        19.81193   22.36203   24.73560    27.6883    29.81947
14        21.06414   23.68479   26.11895    29.1412    31.31935
15        22.30713   24.99579   27.48839    30.5779    32.80132
16        23.54183   26.29623   28.84535    31.9999    34.26719
17        24.76904   27.58711   30.19101    33.4087    35.71847
18        25.98942   28.86930   31.52638    34.8053    37.15645
19        27.20357   30.14353   32.85233    36.1909    38.58226
20        28.41198   31.41043   34.16961    37.5662    39.99685
21        29.61509   32.67057   35.47888    38.9322    41.40106

Table 3A. t-Table Right-Tail Probabilities

One-tailed    0.05    0.025    0.01    0.005
Two-tailed    0.1     0.05     0.025   0.01
df
1             6.31   12.71    31.82   63.66
2             2.92    4.30     6.96    9.92
3             2.35    3.18     4.54    5.84
4             2.13    2.78     3.75    4.60
5             2.02    2.57     3.36    4.03
6             1.94    2.45     3.14    3.71
7             1.89    2.36     3.00    3.50
8             1.86    2.31     2.90    3.36
9             1.83    2.26     2.82    3.25
10            1.81    2.23     2.76    3.17
11            1.80    2.20     2.72    3.11
12            1.78    2.18     2.68    3.05
13            1.77    2.16     2.65    3.01
14            1.76    2.14     2.62    2.98
15            1.75    2.13     2.60    2.95
16            1.75    2.12     2.58    2.92
17            1.74    2.11     2.57    2.90
18            1.73    2.10     2.55    2.88
19            1.73    2.09     2.54    2.86
20            1.72    2.09     2.53    2.85
21            1.72    2.08     2.52    2.83
22            1.72    2.07     2.51    2.82
23            1.71    2.07     2.50    2.81
24            1.71    2.06     2.49    2.80
25            1.71    2.06     2.49    2.79
26            1.71    2.06     2.48    2.78
27            1.70    2.05     2.47    2.77
28            1.70    2.05     2.47    2.76
29            1.70    2.05     2.46    2.76
30            1.70    2.04     2.46    2.75
inf           1.64    1.96     2.33    2.58

Table 4A. F Distribution Table for Alpha = 0.05
[Table values not recoverable; only the caption survives.]

Table 5A. Random Number Table
Random Numbers 1–200 Generated by Excel

48   17   14  132   21    9  161   54   72    4  196  169  126
160   48    7   39   32  195  134  136   99  108  191  149  116
82  135  135   15  123  104  118   96   61    0  117   68   61
153   74   38  112    6  173  158   22  168  103    8  114  137
156   74   18    5   89   99  117    4   86  113   47  111  199
172  111  119   26   61   12  191   69  188  144   30   48  150
77  102   17   13  178  200   28   70  143  135   40   34   56
98  118  172   43   27  162   32   46   59  185  142   64   41
6   60  106  172   35  145   77   13  155   27  143   24   31
116  128   51   57   25   18  164   78  129   50   85  159   13
21   63  138   50  160   34  132   60  200  113
38   95  108   29   30   68  129   25  160   33  155  154
180   40   49   32   64    5  198   99  156   16  195  116
179  164   22   68   42  157   11  171  173   19  196  140
53  178  174   18   11  159  116  181  196   57   42   79
158    6   53   28   39  152   83  139  136  198   96  154

REFERENCES

Becker, Howard S., ed. 1964. The Other Side: Perspectives on Deviance. New York: The Free Press.
CDC. 2005. "Death Rates per 1,000 People by Age Group." Mortality Data for US Population. Available online.
Galton, Francis. 1889. Natural Inheritance. London: MacMillan & Co.
Haan, Michael. 2008 and 2013. An Introduction to Statistics for Canadian Social Scientists. Don Mills, ON: Oxford University Press.
Hacking, Ian. 1975. The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference. London, UK: Cambridge University Press. Quoted in Haan, Michael. 2013. An Introduction to Statistics for Canadian Social Scientists. Don Mills, ON: Oxford University Press.
Huff, Darrell, and Irving Geis. 1954. How to Lie with Statistics. New York: W.W. Norton.
Pearson, Karl. 1894. "Contributions to the Mathematical Theory of Evolution – I. On the Dissection of Asymmetrical Frequency-Curves." Philosophical Transactions CLXXXV, p. 80. Quoted in Haan, Michael.
Rosling, Hans. 2011. "Automatic Translation" (6/6), from The Joy of Stats. YouTube, February 2, 2011. https://www.youtube.com/watch?v=AEac-jP5Eho.
Runyon, Richard P., and Audrey Haber. 1980. Fundamentals of Behavioural Statistics, 4th ed. Don Mills, ON: Addison-Wesley Publishing Company.
Sharma, Raghubar D. 2012. Poverty in Canada. Don Mills, ON: Oxford University Press, p. 135.
Statistics Canada. 31 July 2006 (modified). Table 2a: Complete Life Table, Canada, 2000 to 2002, Male. Available online.
—. 27 June 2012 (modified). CANSIM, Table 111-008. Available online.
Student. 1908. "The Probable Error of a Mean." Biometrika 6: 1–25.
"United States Personal Income per Capita 1929–2008: Inflation Adjusted (2008$)." Demographia, 2008. http://www.demographia.com/dbpc1929.pdf.
Wells, H.G. 1903. Mankind in the Making. London, UK: Chapman and Hall, p. 204.

STATISTICAL PROCEDURE AND TESTS BY APPROPRIATE LEVEL OF MEASUREMENT

Tables and Graphs
- Nominal and Ordinal: Frequency Distribution Table; Bar Graph; and Pie Chart.
- Interval/Ratio: Grouped Data Frequency Table; Bar Graph; and Line Graph.

Measures of Central Tendency
- Nominal: Mode.
- Ordinal: Mode; and Median.
- Interval/Ratio: Mode; Median (if the data have outliers); and Mean (if the data have no outliers).

Measure of Variability
- Ordinal: Range.
- Interval/Ratio: Range; Average Deviation; Variance; and Standard Deviation.

Tests of Significance
- Nominal: Chi-Square; Cramer's V; Phi (φ); and Lambda (λ).
- Ordinal: Kruskal's gamma (γ); Spearman's rho (ρ); Somers' D; and Kendall's tau-b (τb).
- Interval/Ratio: z-test; t-test; Correlation Coefficient (r); ANOVA; and Regression.

INDEX

analysis of variance (ANOVA): total sum of squares; sum of squares within groups; sum of squares between groups; ANOVA table; limitations, 194, 197, 198, 199, 201
average: arithmetic, 43
Bell Curve, 89
bimodality and multimodality, 40
central limit theorem, 88
central tendency: measures of, 37
chi-square test: uses; requirements; calculations; expected frequencies, 122, 126, 127, 128, 129, 130, 131, 132, 133, 135, 137, 239
coefficient of correlation: of determination; significance of sample size; interpretation of, 147, 173, 174, 175, 177, 179, 180, 182, 208, 210, 239
coefficient of variation: practical use, 62, 63
confidence intervals: calculating, 92, 93
Cramer's V, 136
cross-tabulation: univariate; bivariate, 109
cumulative frequency: cumulative percentage, 22
decimals and fractions, 3
degrees of freedom, 116
deviation: mean deviation, 8, 9, 55, 56
exponents, 6
frequency table: frequency distribution, 20
Galton, Francis, 205
gamma. See Kruskal's gamma
Gaussian Curve, 89
Gosset, William Sealy, 164
graphs and charts: vertical bar graph; horizontal bar graph, 19, 23, 24
grouped data, 20
Hacking, Ian, 11
hypothesis testing: assumptions; independence; normality, 112, 113
Kendall's tau-b, 154, 155, 156, 239
Kruskal's gamma, 144, 145
lambda, 137, 138, 139
level of confidence, 115
levels of measurement: nominal; ordinal; interval, 12, 13, 14, 16
linearity, 180, 181
logarithm, 7
mean: grouped data, 43, 44, 45, 49, 50, 59, 239; weighted, 46
measures of central tendency: mode; mean; median, 37
median: grouped data, 41, 42, 43, 46, 49, 239
misuse of graphics, 28: eliminating zero from the y-axis, 29, 30, 31
non-probability samples: convenience; snowball; quota, 83, 84
normal curve. See Bell Curve; Gaussian Curve
normal distribution, 97
odds: odds ratio, 72, 73
order of operations: BEDMAS (brackets, exponents, division, multiplication, addition, and subtraction), 2, 3
one-tailed and two-tailed tests, 114
parametric and non-parametric tests: and power, 191
phi: interpretation, 133, 136, 239
pie chart: and double counting, 26, 27, 38, 239
power of a test: calculating; sample size; significance level, 185, 186, 188, 189, 191, 192
probability: statement; rules of probability; theoretical; empirical; independent and dependent events; mutually exclusive events, 4, 11, 65, 66, 67, 68, 69, 70, 74, 104
probability samples: random; systematic; stratified, 79, 80, 81, 82
p-values: calculating p-values, 115
randomness: concept of, 66
range, 54
regression: equation; coefficients; interpretation, 204, 205, 206, 208, 209, 210, 211
sample size, 82, 83, 187
samples: non-probability sampling (convenience; snowball), 83, 84; probability samples (random; systematic; stratified or hierarchical), 79, 80, 81, 82
sampling: a sample; sampling frame; sampling error; non-sampling error, 77, 78, 79
skewness, 46, 47, 48, 49, 50
Somers' D, 150, 151, 154, 156, 239
Spearman's rho, 147, 148, 149, 150, 239
standard deviation: grouped data, 58
standard error: of the sample mean, 91
standardized scores: z-score, 99
steps for testing a hypothesis, 118, 119, 120
symbols used in statistics: sigma, 7, 8, 9
symmetrical distribution, 47
tails of the curve, 105
t-distribution, 164
test of significance, 111
truncation, 6
t-test: standard error; comparing two samples, 165, 168, 169, 170, 177, 179
type 1 error, 117
type 2 error, 118
utility of correlation, 182
variability: measures of, 53–62
variable types: discrete; continuous, 15
variance: calculating, 57, 61, 239
Wells, H.G., 1