Table of contents:
Cover
Data Science for Business and Decision Making
Copyright
Dedication
Epigraph
1
Introduction to Data Analysis and Decision Making
Introduction: Hierarchy Between Data, Information, and Knowledge
Overview of the Book
Final Remarks
2
Types of Variables and Measurement and Accuracy Scales
Introduction
Types of Variables
Nonmetric or Qualitative Variables
Metric or Quantitative Variables
Types of Variables x Scales of Measurement
Nonmetric Variables-Nominal Scale
Nonmetric Variables-Ordinal Scale
Quantitative Variable-Interval Scale
Quantitative Variable-Ratio Scale
Types of Variables x Number of Categories and Scales of Accuracy
Dichotomous or Binary Variable (Dummy)
Polychotomous Variable
Discrete Quantitative Variable
Continuous Quantitative Variable
Final Remarks
Exercises
Part II: Descriptive Statistics
3
Univariate Descriptive Statistics
Introduction
Frequency Distribution Table
Frequency Distribution Table for Qualitative Variables
Frequency Distribution Table for Discrete Data
Frequency Distribution Table for Continuous Data Grouped into Classes
Graphical Representation of the Results
Graphical Representation for Qualitative Variables
Bar Chart
Pie Chart
Pareto Chart
Graphical Representation for Quantitative Variables
Line Graph
Scatter Plot
Histogram
Stem-and-Leaf Plot
Boxplot or Box-and-Whisker Diagram
The Most Common Summary-Measures in Univariate Descriptive Statistics
Measures of Position or Location
Measures of Central Tendency
Arithmetic Mean
Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data
Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data
Case 3: Arithmetic Mean of Grouped Discrete Data
Case 4: Arithmetic Mean of Continuous Data Grouped into Classes
Median
Case 1: Median of Ungrouped Discrete and Continuous Data
Case 2: Median of Grouped Discrete Data
Case 3: Median of Continuous Data Grouped into Classes
Mode
Case 1: Mode of Ungrouped Data
Case 2: Mode of Grouped Qualitative or Discrete Data
Case 3: Mode of Continuous Data Grouped into Classes
Quantiles
Quartiles
Deciles
Percentiles
Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data
Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data
Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes
Identifying the Existence of Univariate Outliers
Measures of Dispersion or Variability
Range
Average Deviation
Case 1: Average Deviation of Ungrouped Discrete and Continuous Data
Case 2: Average Deviation of Grouped Discrete Data
Case 3: Average Deviation of Continuous Data Grouped into Classes
Variance
Case 1: Variance of Ungrouped Discrete and Continuous Data
Case 2: Variance of Grouped Discrete Data
Case 3: Variance of Continuous Data Grouped into Classes
Standard Deviation
Standard Error
Coefficient of Variation
Measures of Shape
Measures of Skewness
Pearson's First Coefficient of Skewness
Pearson's Second Coefficient of Skewness
Bowley's Coefficient of Skewness
Fisher's Coefficient of Skewness
Coefficient of Skewness on Stata
Measures of Kurtosis
Coefficient of Kurtosis
Fisher's Coefficient of Kurtosis
Coefficient of Kurtosis on Stata
A Practical Example in Excel
A Practical Example on SPSS
Frequencies Option
Descriptives Option
Explore Option
A Practical Example on Stata
Univariate Frequency Distribution Tables on Stata
Summary of Univariate Descriptive Statistics on Stata
Calculating Percentiles on Stata
Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots
Histogram
Stem-and-Leaf
Boxplot
Final Remarks
Exercises
4
Bivariate Descriptive Statistics
Introduction
Association Between Two Qualitative Variables
Joint Frequency Distribution Tables
Measures of Association
Chi-Square Statistic
Other Measures of Association Based on Chi-Square
Spearman's Coefficient
Correlation Between Two Quantitative Variables
Joint Frequency Distribution Tables
Graphical Representation Through a Scatter Plot
Measures of Correlation
Covariance
Pearson's Correlation Coefficient
Final Remarks
Exercises
Part III: Probabilistic Statistics
5
Introduction to Probability
Introduction
Terminology and Concepts
Random Experiment
Sample Space
Events
Unions, Intersections, and Complements
Independent Events
Mutually Exclusive Events
Definition of Probability
Basic Probability Rules
Probability Variation Field
Probability of the Sample Space
Probability of an Empty Set
Probability Addition Rule
Probability of a Complementary Event
Probability Multiplication Rule for Independent Events
Conditional Probability
Probability Multiplication Rule
Bayes' Theorem
Combinatorial Analysis
Arrangements
Combinations
Permutations
Final Remarks
Exercises
6
Random Variables and Probability Distributions
Introduction
Random Variables
Discrete Random Variable
Expected Value of a Discrete Random Variable
Variance of a Discrete Random Variable
Cumulative Distribution Function of a Discrete Random Variable
Continuous Random Variable
Expected Value of a Continuous Random Variable
Variance of a Continuous Random Variable
Cumulative Distribution Function of a Continuous Random Variable
Probability Distributions for Discrete Random Variables
Discrete Uniform Distribution
Bernoulli Distribution
Binomial Distribution
Relationship Between the Binomial and the Bernoulli Distributions
Geometric Distribution
Negative Binomial Distribution
Relationship Between the Negative Binomial and the Binomial Distributions
Relationship Between the Negative Binomial and the Geometric Distributions
Hypergeometric Distribution
Approximation of the Hypergeometric Distribution by the Binomial
Poisson Distribution
Approximation of the Binomial by the Poisson Distribution
Probability Distributions for Continuous Random Variables
Uniform Distribution
Normal Distribution
Approximation of the Binomial by the Normal Distribution
Approximation of the Poisson by the Normal Distribution
Exponential Distribution
Relationship Between the Poisson and the Exponential Distribution
Gamma Distribution
Special Cases of the Gamma Distribution
Relationship Between the Poisson and the Gamma Distribution
Chi-Square Distribution
Student's t Distribution
Snedecor's F Distribution
Relationship Between Student's t and Snedecor's F Distribution
Final Remarks
Exercises
Part IV: Statistical Inference
7
Sampling
Introduction
Probability or Random Sampling
Simple Random Sampling
Simple Random Sampling Without Replacement
Simple Random Sampling With Replacement
Systematic Sampling
Stratified Sampling
Cluster Sampling
Nonprobability or Nonrandom Sampling
Convenience Sampling
Judgmental or Purposive Sampling
Quota Sampling
Geometric Propagation or Snowball Sampling
Sample Size
Size of a Simple Random Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of the Systematic Sample
Size of the Stratified Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of a Cluster Sample
Size of a One-Stage Cluster Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of a Two-Stage Cluster Sample
Final Remarks
Exercises
8
Estimation
Introduction
Point and Interval Estimation
Point Estimation
Interval Estimation
Point Estimation Methods
Method of Moments
Ordinary Least Squares
Maximum Likelihood Estimation
Interval Estimation or Confidence Intervals
Confidence Interval for the Population Mean (μ)
Known Population Variance (σ²)
Unknown Population Variance (σ²)
Confidence Interval for Proportions
Confidence Interval for the Population Variance
Final Remarks
Exercises
9
Hypotheses Tests
Introduction
Parametric Tests
Univariate Tests for Normality
Kolmogorov-Smirnov Test
Shapiro-Wilk Test
Shapiro-Francia Test
Solving Tests for Normality by Using SPSS Software
Solving Tests for Normality by Using Stata
Kolmogorov-Smirnov Test on the Stata Software
Shapiro-Wilk Test on the Stata Software
Shapiro-Francia Test on the Stata Software
Tests for the Homogeneity of Variances
Bartlett's χ² Test
Cochran's C Test
Hartley's Fmax Test
Levene's F-Test
Solving Levene's Test by Using SPSS Software
Solving Levene's Test by Using the Stata Software
Hypotheses Tests Regarding a Population Mean (μ) From One Random Sample
Z Test When the Population Standard Deviation (σ) Is Known and the Distribution Is Normal
Student's t-Test When the Population Standard Deviation (σ) Is Not Known
Solving Student's t-Test for a Single Sample by Using SPSS Software
Solving Student's t-Test for a Single Sample by Using Stata Software
Student's t-Test to Compare Two Population Means From Two Independent Random Samples
Case 1: σ₁² ≠ σ₂²
Case 2: σ₁² = σ₂²
Solving Student's t-Test From Two Independent Samples by Using SPSS Software
Solving Student's t-Test From Two Independent Samples by Using Stata Software
Student's t-Test to Compare Two Population Means From Two Paired Random Samples
Solving Student's t-Test From Two Paired Samples by Using SPSS Software
Solving Student's t-Test From Two Paired Samples by Using Stata Software
ANOVA to Compare the Means of More Than Two Populations
One-Way ANOVA
Solving the One-Way ANOVA Test by Using SPSS Software
Solving the One-Way ANOVA Test by Using Stata Software
Factorial ANOVA
Two-Way ANOVA
Solving the Two-Way ANOVA Test by Using SPSS Software
Solving the Two-Way ANOVA Test by Using Stata Software
ANOVA With More Than Two Factors
Final Remarks
Exercises
10
Nonparametric Tests
Introduction
Tests for One Sample
Binomial Test
Solving the Binomial Test Using SPSS Software
Solving the Binomial Test Using Stata Software
Chi-Square Test (χ²) for One Sample
Solving the χ² Test for One Sample Using SPSS Software
Solving the χ² Test for One Sample Using Stata Software
Sign Test for One Sample
Solving the Sign Test for One Sample Using SPSS Software
Solving the Sign Test for One Sample Using Stata Software
Tests for Two Paired Samples
McNemar Test
Solving the McNemar Test Using SPSS Software
Solving the McNemar Test Using Stata Software
Sign Test for Two Paired Samples
Solving the Sign Test for Two Paired Samples Using SPSS Software
Solving the Sign Test for Two Paired Samples Using Stata Software
Wilcoxon Test
Solving the Wilcoxon Test Using SPSS Software
Solving the Wilcoxon Test Using Stata Software
Tests for Two Independent Samples
Chi-Square Test (χ²) for Two Independent Samples
Solving the χ² Statistic Using SPSS Software
Solving the χ² Statistic by Using Stata Software
Mann-Whitney U Test
Solving the Mann-Whitney Test Using SPSS Software
Solving the Mann-Whitney Test Using Stata Software
Tests for k Paired Samples
Cochran's Q Test
Solving Cochran's Q Test by Using SPSS Software
Solution of Cochran's Q Test on Stata Software
Friedman's Test
Solving Friedman's Test by Using SPSS Software
Solving Friedman's Test by Using Stata Software
Tests for k Independent Samples
The χ² Test for k Independent Samples
Solving the χ² Test for k Independent Samples on SPSS
Solving the χ² Test for k Independent Samples on Stata
Kruskal-Wallis Test
Solving the Kruskal-Wallis Test by Using SPSS Software
Solving the Kruskal-Wallis Test by Using Stata
Final Remarks
Exercises
Part V: Multivariate Exploratory Data Analysis
11
Cluster Analysis
Introduction
Cluster Analysis
Defining Distance or Similarity Measures in Cluster Analysis
Distance (Dissimilarity) Measures Between Observations for Metric Variables
Similarity Measures Between Observations for Binary Variables
Agglomeration Schedules in Cluster Analysis
Hierarchical Agglomeration Schedules
Notation
A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules
Nearest-Neighbor or Single-Linkage Method
Furthest-Neighbor or Complete-Linkage Method
Between-Groups or Average-Linkage Method
Nonhierarchical K-Means Agglomeration Schedule
Notation
A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule
Cluster Analysis with Hierarchical and Nonhierarchical Agglomeration Schedules in SPSS
Elaborating Hierarchical Agglomeration Schedules in SPSS
Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS
Cluster Analysis With Hierarchical and Nonhierarchical Agglomeration Schedules in Stata
Elaborating Hierarchical Agglomeration Schedules in Stata
Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata
Final Remarks
Exercises
Appendix
Detecting Multivariate Outliers
12
Principal Component Factor Analysis
Introduction
Principal Component Factor Analysis
Pearson's Linear Correlation and the Concept of Factor
Overall Adequacy of the Factor Analysis: Kaiser-Meyer-Olkin Statistic and Bartlett's Test of Sphericity
Defining the Principal Component Factors: Determining the Eigenvalues and Eigenvectors of Correlation Matrix ρ and Calcula ...
Factor Loadings and Communalities
Factor Rotation
A Practical Example of the Principal Component Factor Analysis
Principal Component Factor Analysis in SPSS
Principal Component Factor Analysis in Stata
Final Remarks
Exercises
Appendix: Cronbach's Alpha
Brief Presentation
Determining Cronbach's Alpha Algebraically
Determining Cronbach's Alpha in SPSS
Determining Cronbach's Alpha in Stata
Part VI: Generalized Linear Models
13
Simple and Multiple Regression Models
Introduction
Linear Regression Models
Estimation of the Linear Regression Model by Ordinary Least Squares
Explanatory Power of the Regression Model: Coefficient of Determination R²
General Statistical Significance of the Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions
Estimation of Multiple Linear Regression Models
Dummy Variables in Regression Models
Presuppositions of Regression Models Estimated by OLS
Normality of Residuals
The Multicollinearity Problem
Causes of Multicollinearity
Consequences of Multicollinearity
Application of Multicollinearity Examples in Excel
Multicollinearity Diagnostics
Possible Solutions for the Multicollinearity Problem
The Problem of Heteroskedasticity
Causes of Heteroskedasticity
Consequences of Heteroskedasticity
Heteroskedasticity Diagnostics: Breusch-Pagan/Cook-Weisberg Test
Weighted Least Squares Method: A Possible Solution
Huber-White Method for Robust Standard Errors
The Autocorrelation of Residuals Problem
Causes of the Autocorrelation of Residuals
Consequences of the Autocorrelation of Residuals
Autocorrelation of Residuals Diagnostic: The Durbin-Watson Test
Autocorrelation of Residuals Diagnostic: The Breusch-Godfrey Test
Possible Solutions for the Autocorrelation of Residuals Problem
Detection of Specification Problems: Linktest and RESET Test
Nonlinear Regression Models
The Box-Cox Transformation: The General Regression Model
Estimation of Regression Models in Stata
Estimation of Regression Models in SPSS
Final Remarks
Exercises
Appendix: Quantile Regression Models
A Brief Introduction
Example: Quantile Regression Model in Stata
14
Binary and Multinomial Logistic Regression Models
Introduction
The Binary Logistic Regression Model
Estimation of the Binary Logistic Regression Model by Maximum Likelihood
General Statistical Significance of the Binary Logistic Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Binary Logistic Regression Model
Cutoff, Sensitivity Analysis, Overall Model Efficiency, Sensitivity, and Specificity
The Multinomial Logistic Regression Model
Estimation of the Multinomial Logistic Regression Model by Maximum Likelihood
General Statistical Significance of the Multinomial Logistic Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Multinomial Logistic Regression Model
Estimation of Binary and Multinomial Logistic Regression Models in Stata
Binary Logistic Regression in Stata
Multinomial Logistic Regression in Stata
Estimation of Binary and Multinomial Logistic Regression Models in SPSS
Binary Logistic Regression in SPSS
Multinomial Logistic Regression in SPSS
Final Remarks
Exercises
Appendix: Probit Regression Models
A Brief Introduction
Example: Probit Regression Model in Stata
15
Regression Models for Count Data: Poisson and Negative Binomial
Introduction
The Poisson Regression Model
Estimation of the Poisson Regression Model by Maximum Likelihood
General Statistical Significance of the Poisson Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Poisson Regression Model
Test to Verify Overdispersion in Poisson Regression Models
The Negative Binomial Regression Model
Estimation of the Negative Binomial Regression Model by Maximum Likelihood
General Statistical Significance of the Negative Binomial Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Negative Binomial Regression Model
Estimating Regression Models for Count Data in Stata
Poisson Regression Model in Stata
Negative Binomial Regression Model in Stata
Regression Model Estimation for Count Data in SPSS
Poisson Regression Model in SPSS
Negative Binomial Regression Model in SPSS
Final Remarks
Exercises
Appendix: Zero-Inflated Regression Models
Brief Introduction
Example: Zero-Inflated Poisson Regression Model in Stata
Example: Zero-Inflated Negative Binomial Regression Model in Stata
Part VII: Optimization Models and Simulation
16
Introduction to Optimization Models: General Formulations and Business Modeling
Introduction to Optimization Models
Introduction to Linear Programming Models
Mathematical Formulation of a General Linear Programming Model
Linear Programming Model in the Standard and Canonical Forms
Linear Programming Model in the Standard Form
Linear Programming Model in the Canonical Form
Transformations Into the Standard or Canonical Form
Assumptions of the Linear Programming Model
Proportionality
Additivity
Divisibility and Non-negativity
Certainty
Modeling Business Problems Using Linear Programming
Production Mix Problem
Blending or Mixing Problem
Diet Problem
Capital Budget Problems
Portfolio Selection Problem
Model 1: Maximization of an Investment Portfolio's Expected Return
Model 2: Investment Portfolio Risk Minimization
Production and Inventory Problem
Aggregated Planning Problem
Final Remarks
Exercises
17
Solution of Linear Programming Problems
Introduction
Graphical Solution of a Linear Programming Problem
Linear Programming Maximization Problem with a Single Optimal Solution
Linear Programming Minimization Problem With a Single Optimal Solution
Special Cases
Multiple Optimal Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Degenerate Optimal Solution
Analytical Solution of a Linear Programming Problem in Which m < n
The Simplex Method
Logic of the Simplex Method
Analytical Solution of the Simplex method for Maximization Problems
Tabular Form of the Simplex Method for Maximization Problems
The Simplex Method for Minimization Problems
Special Cases of the Simplex Method
Multiple Optimal Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Degenerate Optimal Solution
Solution by Using a Computer
Solver in Excel
Solution of the Examples found in Section 16.6 of Chapter 16 using Solver in Excel
Solution of Example 16.3 of Chapter 16 (Production Mix Problem at the Venix Toys)
Solution of Example 16.4 of Chapter 16 (Production Mix Problem at Naturelat Dairy)
Solution of Example 16.5 of Chapter 16 (Mix Problem of Oil-South Refinery)
Solution of Example 16.6 of Chapter 16 (Diet Problem)
Solution of Example 16.7 of Chapter 16 (Farmer's Problem)
Solution of Example 16.8 of Chapter 16 (Portfolio Selection-Maximization of the Expected Return)
Solution of Example 16.9 of Chapter 16 (Portfolio Selection-Minimization of the Portfolio's Mean Absolute Deviation)
Solution of Example 16.10 of Chapter 16 (Production and Inventory Problem of FenixandFurniture)
Solution of Example 16.11 of Chapter 16 (Problem of Lifestyle Natural Juices Manufacturer)
Solver Error Messages for Unlimited and Infeasible Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Result Analysis by Using the Solver Answer and Limits Reports
Answer Report
Limits Report
Sensitivity Analysis
Alteration in one of the Objective Function Coefficients (Graphical Solution)
Alteration in One of the Constants on the Right-Hand Side of the Constraint and Concept of Shadow Price (Graphica ...
Reduced Cost
Sensitivity Analysis With Solver in Excel
Special Case: Multiple Optimal Solutions
Special Case: Degenerate Optimal Solution
Exercises
18
Network Programming
Introduction
Terminology of Graphs and Networks
Classic Transportation Problem
Mathematical Formulation of the Classic Transportation Problem
Balancing the Transportation Problem When the Total Supply Capacity Is Not Equal to the Total Demand Consumed
Case 1: Total Supply Is Greater than Total Demand
Case 2: Total Supply Capacity Is Lower than Total Demand Consumed
Solution of the Classic Transportation Problem
The Transportation Algorithm
Solution of the Transportation Problem Using Excel Solver
Transhipment Problem
Mathematical Formulation of the Transhipment Problem
Solution of the Transhipment Problem Using Excel Solver
Job Assignment Problem
Mathematical Formulation of the Job Assignment Problem
Solution of the Job Assignment Problem Using Excel Solver
Shortest Path Problem
Mathematical Formulation of the Shortest Path Problem
Solution of the Shortest Path Problem Using Excel Solver
Maximum Flow Problem
Mathematical Formulation of the Maximum Flow Problem
Solution of the Maximum Flow Problem Using Excel Solver
Exercises
19
Integer Programming
Introduction
Mathematical Formulation of a General Model for Integer Programming and/or Binary and Linear Relaxation
The Knapsack Problem
Modeling of the Knapsack Problem
Solution of the Knapsack Problem Using Excel Solver
The Capital Budgeting Problem as a Model of Binary Programming
Solution of the Capital Budgeting Problem as a Model of Binary Programming Using Excel Solver
The Traveling Salesman Problem
Modeling of the Traveling Salesman Problem
Solution of the Traveling Salesman Problem Using Excel Solver
The Facility Location Problem
Modeling of the Facility Location Problem
Solution of the Facility Location Problem Using Excel Solver
The Staff Scheduling Problem
Solution of the Staff Scheduling Problem Using Excel Solver
Exercises
20
Simulation and Risk Analysis
Introduction to Simulation
The Monte Carlo Method
Monte Carlo Simulation in Excel
Generation of Random Numbers and Probability Distributions in Excel
Practical Examples
Case 1: Consumption of Red Wine
Case 2: Profit x Loss Forecast
Final Remarks
Exercises
Part VIII: Other Topics
21
Design and Analysis of Experiments
Introduction
Steps in the Design of Experiments
The Four Principles of Experimental Design
Types of Experimental Design
Completely Randomized Design (CRD)
Randomized Block Design (RBD)
Factorial Design (FD)
One-Way Analysis of Variance
Factorial ANOVA
Final Remarks
Exercises
22
Statistical Process Control
Introduction
Estimating the Process Mean and Variability
Control Charts for Variables
Control Charts for X and R
Control Charts for X
Control Charts for R
Control Charts for X and S
Control Charts for Attributes
P Chart (Defective Fraction)
np Chart (Number of Defective Products)
C Chart (Total Number of Defects per Unit)
U Chart (Average Number of Defects per Unit)
Process Capability
Cp Index
Cpk Index
Cpm and Cpmk Indexes
Final Remarks
Exercises
23
Data Mining and Multilevel Modeling
Introduction to Data Mining
Multilevel Modeling
Nested Data Structures
Hierarchical Linear Models
Two-Level Hierarchical Linear Models With Clustered Data (HLM2)
Three-Level Hierarchical Linear Models With Repeated Measures (HLM3)
Estimation of Hierarchical Linear Models in Stata
Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in Stata
Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in Stata
Estimation of Hierarchical Linear Models in SPSS
Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in SPSS
Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in SPSS
Final Remarks
Exercises
Appendix
Hierarchical Nonlinear Models
Answers
Answer Keys: Exercises: Chapter 2
Answer Keys: Exercises: Chapter 3
Answer Keys: Exercises: Chapter 4
Answer Keys: Exercises: Chapter 5
Answer Keys: Exercises: Chapter 6
Answer Keys: Exercises: Chapter 7
Answer Keys: Exercises: Chapter 8
Answer Keys: Exercises: Chapter 9
Answer Keys: Exercises: Chapter 10
Answer Keys: Exercises: Chapter 11
Answer Keys: Exercises: Chapter 12
Answer Keys: Exercises: Chapter 13
Answer Keys: Exercises: Chapter 14
Answer Keys: Exercises: Chapter 15
Answer Keys: Exercises: Chapter 16
Answer Keys: Exercises: Chapter 17
Answer Keys: Exercises: Chapter 18
Answer Keys: Exercises: Chapter 19
Answer Keys: Exercises: Chapter 20
Answer Keys: Exercises: Chapter 21
Answer Keys: Exercises: Chapter 22
Answer Keys: Exercises: Chapter 23
Appendices
References
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
Y
Z

Data Science for Business and Decision Making

Luiz Paulo Fávero, School of Economics, Business and Accounting, University of São Paulo, São Paulo, SP, Brazil

Patrícia Belfiore, Center of Engineering, Modeling and Applied Social Sciences, Management Engineering, Federal University of ABC, São Bernardo do Campo, SP, Brazil

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2019 Elsevier Inc. All rights reserved

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN 978-0-12-811216-8

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Candice Janco
Acquisition Editor: Scott Bentley
Editorial Project Manager: Susan Ikeda
Production Project Manager: Purushothaman Vijayaraj
Cover Designer: Miles Hitchen
Typeset by SPi Global, India

Dedication

We dedicate this book to Ovídio and Leonor, Antonio and Ana Vera for the unconditional effort dedicated to our education and development.

We dedicate this book to Gabriela and Luiz Felipe who are the reason for our existence.

Epigraph

When a human awakens to a great dream and throws the full force of his soul over it, all the universe conspires in your favor.
Johann Wolfgang von Goethe

Chapter 1

Introduction to Data Analysis and Decision Making

Everything in us is mortal, except the gifts of the spirit and of intelligence.
Ovid

1.1 INTRODUCTION: HIERARCHY BETWEEN DATA, INFORMATION, AND KNOWLEDGE

In academic and business environments, the growing use of research techniques and modern software packages, together with a better understanding among researchers and managers in the most varied fields of knowledge of the importance of statistics and data modeling for defining objectives and substantiating research hypotheses based on underlying theories, has been producing papers that are more consistent and rigorous from a methodological and scientific standpoint. Nevertheless, as the well-known Austrian philosopher Ludwig Joseph Johann Wittgenstein, later naturalized as a British citizen, used to say, methodological rigor alone, combined with authors who keep researching more of the same topic, can generate a deep lack of oxygen in the academic world. Besides the availability of data, adequate software packages, and an adequate underlying theory, it is essential for researchers to use their intuition and experience when defining their objectives and constructing their hypotheses, even when deciding to study the behavior of new and sometimes unimaginable variables in their models. This, believe it or not, may also generate interesting and innovative information for the decision-making process!

The basic principle of this book is to explain, at every turn, the hierarchy between data, information, and knowledge in this new scenario we live in. Whenever treated and analyzed, data are transformed into information. Knowledge, in turn, is generated at the moment in which such information is recognized and applied to the decision-making process. Analogously, the reverse hierarchy can also be applied, since knowledge, whenever disseminated or explained, becomes information that, when broken up, has the capacity to generate a dataset. Fig. 1.1 shows this logic.

1.2 OVERVIEW OF THE BOOK

The book is divided into 23 chapters, which are structured into eight major parts, as follows:

Part I: Foundations of Business Data Analysis
• Chapter 1: Introduction to Data Analysis and Decision Making.
• Chapter 2: Types of Variables and Measurement and Accuracy Scales.

Part II: Descriptive Statistics
• Chapter 3: Univariate Descriptive Statistics.
• Chapter 4: Bivariate Descriptive Statistics.

Part III: Probabilistic Statistics
• Chapter 5: Introduction to Probability.
• Chapter 6: Random Variables and Probability Distributions.

Part IV: Statistical Inference
• Chapter 7: Sampling.
• Chapter 8: Estimation.
• Chapter 9: Hypotheses Tests.
• Chapter 10: Nonparametric Tests.

Part V: Multivariate Exploratory Data Analysis
• Chapter 11: Cluster Analysis.
• Chapter 12: Principal Component Factor Analysis.

Part VI: Generalized Linear Models
• Chapter 13: Simple and Multiple Regression Models.
• Chapter 14: Binary and Multinomial Logistic Regression Models.
• Chapter 15: Regression Models for Count Data: Poisson and Negative Binomial.

Part VII: Optimization Models and Simulation
• Chapter 16: Introduction to Optimization Models: General Formulations and Business Modeling.
• Chapter 17: Solution of Linear Programming Problems.
• Chapter 18: Network Programming.
• Chapter 19: Integer Programming.
• Chapter 20: Simulation and Risk Analysis.

Part VIII: Other Topics
• Chapter 21: Design and Analysis of Experiments.
• Chapter 22: Statistical Process Control.
• Chapter 23: Data Mining and Multilevel Modeling.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00001-X © 2019 Elsevier Inc. All rights reserved.

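FIG. 1.1 Hierarchy between data, information, and knowledge. [Diagram: data become information through treatment and analysis, and information becomes knowledge through decision making; in the reverse direction, knowledge becomes information through diffusion, and information becomes data through dismemberment.]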

Each chapter follows the same didactic structure, which we believe favors learning. First, the concepts regarding each topic are introduced, always followed by the algebraic solution, often in Excel, of practical exercises based on datasets developed primarily with an educational focus. Next, in some cases, the same exercises are solved in Stata Statistical Software® and IBM SPSS Statistics Software®. We believe that this logic facilitates the study and understanding of the correct use of each technique and of the analysis of the results. Moreover, the practical application of the models in Excel, Stata, and SPSS also benefits researchers, because the results can be compared, at every turn, with the ones already estimated or calculated algebraically in the previous sections of each chapter, in addition to providing an opportunity to use these important software packages. At the end of each chapter, additional exercises are proposed, whose answers, presented through the outputs generated, are available at the end of the book. The datasets used are available at www.elsevier.com.

1.3

FINAL REMARKS

All the benefits and potential of the techniques discussed here will be felt by researchers and managers as the procedures are practiced repeatedly. As there are several methods, we must be very careful when defining the technique, since choosing the best alternatives for treating the data fundamentally depends on this moment of practice and exercises. The adequate use of the techniques presented in this book by professors, students, and business managers may more powerfully underpin the research’s initial perception, which can support the decision-making process. Generating knowledge from a phenomenon depends on a well-structured research plan, with the definition of the variables to be collected, the dimensions of the sample, the development of the dataset, and choosing the technique that will be used, which is extremely important.


Thus, we believe that this book is meant for researchers who, for different reasons, are specifically interested in data science and decision making, as well as for those who want to deepen their knowledge by using Excel, SPSS, and Stata software packages. This book is recommended to undergraduate and graduate students in the fields of Business Administration, Engineering, Economics, Accounting, Actuarial Science, Statistics, Psychology, Medicine and Health, and to students in other fields related to Human, Exact and Biomedical Sciences. It is also meant for students taking extension, lato sensu postgraduation and MBA courses, as well as for company employees, consultants, and other researchers that have as their main objectives to treat and analyze data, aiming at preparing data models, generating information, and improving knowledge through decision-making processes. To all the researchers and managers that use this book, we hope that adequate and ever more interesting research questions may arise, that analyses may be developed, and that reliable, robust, and useful models for decision-making processes may be constructed. We also hope that the interpretation of outputs may become friendlier and that the use of Excel, SPSS, and Stata may result in important and valuable fruits for new researches and projects. We would like to thank everyone who contributed and made this book become a reality. We would also like to sincerely thank the professionals at Montvero Consulting and Training Ltd., at the International Business Machines Corporation (Armonk, New York), at StataCorp LP (College Station, Texas), at Elsevier Publishing House, especially Andre Gerhard Wolff, J. Scott Bentley, and Susan E. Ikeda. Lastly, but not less important, we would like to thank the professors, students, and employees of the Economics, Business Administration and Accounting College of the University of Sao Paulo (FEA/ USP) and of the Federal University of the ABC (UFABC). 
Now it is time for you to get started! We would like to emphasize that any contributions, criticisms, and suggestions will always be welcome, so that, later on, they may be incorporated into this book and make it better.

Luiz Paulo Fávero
Patrícia Belfiore

Chapter 2

Types of Variables and Measurement and Accuracy Scales

And God said: π, i, 0, and 1, and the Universe was created.
Leonhard Euler

2.1

INTRODUCTION

A variable is a characteristic of the population (or sample) being studied, and it is possible to measure, count, or categorize it. The type of variable collected is crucial in the calculation of descriptive statistics and in the graphical representation of results, as well as in the selection of the statistical methods that will be used to analyze the data. According to Freund (2006), statistical data are the raw materials of statistical research, always appearing in cases of measurement or record of observations. This chapter discusses the existing types of variables (metric or quantitative and nonmetric or qualitative), as well as their respective scales of measurement (nominal and ordinal for qualitative variables, and interval and ratio for quantitative variables). Classifying the types of variables based on the number of categories and scales of accuracy is also discussed (binary and polychotomous for qualitative variables and discrete and continuous for quantitative variables).

2.2

TYPES OF VARIABLES

Variables can be classified as nonmetric, also known as qualitative or categorical, or metric, also known as quantitative (Fig. 2.1). Nonmetric or qualitative variables represent the characteristics of an individual, object, or element that cannot be measured or quantified. The answers are given in categories. In contrast, metric or quantitative variables represent the characteristics of an individual, object, or element that result from a count (a finite set of values) or from a measurement (an infinite set of values).

2.2.1

Nonmetric or Qualitative Variables

As we will study in Chapter 3, the characteristics of nonmetric or qualitative variables can be represented through frequency distribution tables or graphically, without having to calculate measures of position, dispersion, and shape. The only exception is the mode, a measure that provides the variable’s most frequent value and that can also be applied to nonmetric variables. Imagine that a questionnaire will be used to collect data on family income from a sample of consumers, based on certain salary ranges. Table 2.1 shows the variable categories. Note that both variables are qualitative, since the data are represented by ranges. However, it is very common for researchers to classify them incorrectly, mainly when the variable has numerical values in the data. In this case, it is only possible to calculate the frequencies, and not summary measures such as the mean and standard deviation. The frequencies obtained for each income range can be seen in Table 2.2. A common error found in papers that use qualitative variables represented by numbers is the calculation of the sample mean, or of any other summary measure. In doing so, the researcher calculates the mean of the limits of each range, assuming that this value corresponds to the real mean of the consumers found in that range. However, since the data distribution is not necessarily linear or symmetrical around the mean, this hypothesis is often violated.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00002-1 © 2019 Elsevier Inc. All rights reserved.


FIG. 2.1 Types of variables.

TABLE 2.1 Family Income Ranges × Social Class

Class   Minimum Wage Salaries (MWS)   Family Income (\$)
A       Above 20 MWS                  Above \$ 15,760.00
B       From 10 to 20 MWS             From \$ 7880.00 to \$ 15,760.00
C       From 4 to 10 MWS              From \$ 3152.00 to \$ 7880.00
D       From 2 to 4 MWS               From \$ 1576.00 to \$ 3152.00
E       Up to 2 MWS                   Up to \$ 1576.00

TABLE 2.2 Frequencies × Family Income Ranges

Frequencies   Family Income (\$)
10%           Above \$ 15,760.00
18%           From \$ 7880.00 to \$ 15,760.00
24%           From \$ 3152.00 to \$ 7880.00
36%           From \$ 1576.00 to \$ 3152.00
12%           Up to \$ 1576.00

To calculate summary measures such as the mean and standard deviation, the variable being studied must necessarily be quantitative.
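The point about income ranges can be illustrated with a minimal Python sketch. It takes the relative frequencies from Table 2.2, reports the modal range (the only summary measure the text allows here), and suggests why a "mean" built from range midpoints is fragile: it depends on whatever upper limit one invents for the open-ended top range. The names `frequencies` and `midpoint_mean`, the lower limit of zero, and the two assumed top limits are illustrative choices and do not come from the book.

```python
# Relative frequencies from Table 2.2, keyed by income range (lower, upper).
# None marks the open-ended upper limit of the top range.
frequencies = {
    (15760.00, None): 0.10,       # above 15,760.00
    (7880.00, 15760.00): 0.18,
    (3152.00, 7880.00): 0.24,
    (1576.00, 3152.00): 0.36,
    (0.00, 1576.00): 0.12,        # up to 1,576.00 (lower limit 0 is assumed)
}

# The mode is the category with the highest frequency.
modal_range = max(frequencies, key=frequencies.get)


def midpoint_mean(freqs, assumed_top_limit):
    """Frequency-weighted mean of range midpoints (the questionable shortcut).

    assumed_top_limit is an invented upper bound for the open-ended range.
    """
    total = 0.0
    for (lower, upper), share in freqs.items():
        if upper is None:
            upper = assumed_top_limit
        total += share * (lower + upper) / 2
    return total


print("Modal range:", modal_range)
# Two equally arbitrary guesses for the top limit give different "means".
print(midpoint_mean(frequencies, assumed_top_limit=20000.00))
print(midpoint_mean(frequencies, assumed_top_limit=50000.00))
```

Only the frequencies and the mode are stable here; the midpoint-based figure changes with an assumption that the data cannot confirm.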

2.2.2

Metric or Quantitative Variables

Quantitative variables can be represented graphically (line charts, scatter plots, histograms, stem-and-leaf plots, and boxplots), through measures of position or location (mean, median, mode, quartiles, deciles, and percentiles), through measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), or through measures of shape, such as skewness and kurtosis, as we will study in Chapter 3. These variables can be discrete or continuous. Discrete variables can take on a finite set of values that frequently come from a count, such as the number of children in a family (0, 1, 2…). Conversely, continuous variables take on values that lie in an interval of real numbers, such as an individual’s weight or income. Imagine a dataset with 20 people’s names, ages, weights, and heights, as shown in Table 2.3. The data are available in the file VarQuanti.sav. To classify the variables on SPSS (Fig. 2.2), let’s click on Variable View. Note that the variable Name is qualitative (a string), and it is measured on a nominal scale (column Measure). On the other hand, the variables Age, Weight, and Height are quantitative (Numeric), and they are measured on a metric scale (Scale). The variable scales of measurement will be studied in more detail in Section 2.3.


TABLE 2.3 Dataset With Information on 20 People

Name        Age (Years)   Weight (kg)   Height (m)
Mariana     48            62            1.60
Roberta     41            56            1.62
Luiz        54            84            1.76
Leonardo    30            82            1.90
Felipe      35            76            1.85
Marcelo     60            98            1.78
Melissa     28            54            1.68
Sandro      50            70            1.72
Armando     40            75            1.68
Heloisa     24            50            1.59
Julia       44            65            1.62
Paulo       39            83            1.75
Manoel      22            68            1.78
Ana Paula   31            56            1.66
Amelia      45            60            1.64
Horacio     62            88            1.77
Pedro       24            80            1.92
Joao        28            75            1.80
Marcos      49            92            1.76
Celso       54            66            1.68

FIG. 2.2 Classification of the variables.
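For readers working outside SPSS, the classification shown in Variable View can be approximated with a short Python sketch. It uses the first four rows of Table 2.3 and a hypothetical helper, `describe_column`, that returns a type and a measure for each column. The rule it applies (text is nominal, numbers are scale) is a simplifying assumption, not something SPSS or the book prescribes.

```python
# First four rows of Table 2.3 (name, age in years, weight in kg, height in m).
rows = [
    ("Mariana", 48, 62, 1.60),
    ("Roberta", 41, 56, 1.62),
    ("Luiz", 54, 84, 1.76),
    ("Leonardo", 30, 82, 1.90),
]
columns = ["Name", "Age", "Weight", "Height"]


def describe_column(values):
    """Return (type, measure) for a column, using a simple heuristic.

    Assumption: text columns are treated as nominal and numeric columns as
    scale; an ordinal variable coded with numbers cannot be detected this way
    and would have to be declared by the researcher.
    """
    if all(isinstance(v, str) for v in values):
        return ("String", "Nominal")
    if all(isinstance(v, (int, float)) for v in values):
        return ("Numeric", "Scale")
    raise ValueError("mixed column: clean the data before classifying")


variable_view = {
    name: describe_column([row[i] for row in rows])
    for i, name in enumerate(columns)
}

for name, (var_type, measure) in variable_view.items():
    print(f"{name:<8}{var_type:<10}{measure}")
```

The heuristic reproduces the classification described for this dataset, but the measure of a numerically coded qualitative variable still has to be set by hand.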

2.3

TYPES OF VARIABLES × SCALES OF MEASUREMENT

Variables can also be classified according to the level or scale of measurement. Measurement is the process of assigning numbers or labels to objects, people, states, or events, in accordance with specific rules, to represent the quantities or qualities of the attributes. Rule is a guide, a method, or a command that tells the researcher how to measure the attribute. Scale is a set of symbols or numbers, based on a rule, and it applies to individuals or to their behaviors or attitudes. An individual’s position in the scale is based on whether this individual has the attribute that the scale must measure. There are several taxonomies found in the existing literature to classify the scales of measurement of all types of variables (Stevens, 1946; Hoaglin et al., 1983). We will use Stevens classification because it is simple, it is widely used, and because its nomenclature is used in statistical software. According to Stevens (1946), the scales of measurement of nonmetric, categorical or qualitative variables can be classified as nominal and ordinal, while the metric or quantitative variables are classified as interval and ratio scales (or proportional), as shown in Fig. 2.3.


FIG. 2.3 Types of variables × scales of measurement.

2.3.1

Nonmetric Variables—Nominal Scale

The nominal scale classifies the units into classes or categories regarding the characteristic represented, without establishing any magnitude or order relationship. It is called nominal because the categories are only differentiated by their names. We can assign numerical labels to the variable categories, but arithmetic operations such as addition, subtraction, multiplication, and division over these numbers are not allowed. The nominal scale only allows some basic counting operations. For instance, we can count the number of elements in each class or apply hypotheses tests regarding the distribution of the population units in the classes. Thus, most of the usual statistics, such as the mean and standard deviation, do not make any sense for nominal scale qualitative variables. As examples of nonmetric variables on nominal scales, we can mention professions, religion, color, marital status, geographic location, or country of origin. Imagine a nonmetric variable related to the country of origin of 10 large multinational companies. To represent the categories of the variable Country of origin, we can use numbers, assigning value 1 to the United States, 2 to the Netherlands, 3 to China, 4 to the United Kingdom, and 5 to Brazil, as shown in Table 2.4. In this case, the numbers are only labels or tags to help identify and classify objects. This scale of measurement is known as a nominal scale, that is, the numbers are arbitrarily assigned to the object categories, without any kind of order. To represent the behavior of nominal data, we can use descriptive statistics such as frequency distribution tables, bar or pie charts, or the calculation of the mode (Chapter 3). Now, we will discuss how to define labels for qualitative variables on a nominal scale by using the SPSS software (Statistical Package for the Social Sciences). After that, we will be able to construct absolute and relative frequency tables and charts.
Before generating the dataset, let’s define the characteristics of the variables being studied in Variable View (visualization of variables). In order to do that, click on the respective spreadsheet that is available in the lower left side of the Data Editor, or click twice on the column var.

TABLE 2.4 Companies and Country of Origin

Company              Country of Origin
Exxon Mobil          1
JP Morgan Chase      1
General Electric     1
Royal Dutch Shell    2
ICBC                 3
HSBC Holdings        4
PetroChina           3
Berkshire Hathaway   1
Wells Fargo          1
Petrobras            5


The first variable, called Company, is a string, that is, its data are inserted as characters or letters. It was established that the maximum number of characters of this variable would be 18. In the column Measure, the scale of measurement of the variable Company is defined as nominal. The second variable, called Country, is numerical, since its data are inserted as numbers. However, the numbers are only used to categorize or label the objects, so the scale of measurement of this variable is also nominal (Fig. 2.4). To insert the data from Table 2.4, we go back to Data View. The information must be typed as shown in Fig. 2.5 (the columns represent the variables and the rows represent the observations or individuals). Since the variable Country is represented by numbers, it is necessary to assign labels to each variable category, as shown in Table 2.5. In order to do that, we must click on Data → Define Variable Properties… and select the variable Country, according to Figs. 2.6 and 2.7. Since the nominal scale of measurement of the variable Country has already been defined in the column Measure in Variable View, we can see that it already appears correctly in Fig. 2.8. The labels for each category must be defined at this moment, as can also be seen in the same figure. The dataset is then displayed with the assigned label names, as shown in Fig. 2.9. By clicking on Value Labels, located on the toolbar, it is possible to alternate between the numerical values of the nominal or ordinal variable and their respective labels. Having structured the dataset, it is possible to construct absolute and relative frequency tables and charts on SPSS.

FIG. 2.4 Defining the variable characteristics in Variable View.

FIG. 2.5 Inserting the data found in Table 2.4 into Data View.


TABLE 2.5 Categories Assigned to the Countries

Categories   Country
1            United States
2            The Netherlands
3            China
4            The United Kingdom
5            Brazil

FIG. 2.6 Defining labels for each nominal variable category.

The descriptive statistics to represent the behavior of a single qualitative variable and of two qualitative variables will be studied in Chapters 3 and 4, respectively.
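The same labelling and counting steps can be sketched in Python, as a rough counterpart to the SPSS procedure rather than a replacement for it. The sketch maps the numeric codes of Table 2.4 to the labels of Table 2.5 and then builds absolute and relative frequencies and the mode. The variable names are illustrative.

```python
from collections import Counter

# Country codes from Table 2.4, in the order the companies are listed.
country_codes = [1, 1, 1, 2, 3, 4, 3, 1, 1, 5]

# Value labels from Table 2.5.
value_labels = {
    1: "United States",
    2: "The Netherlands",
    3: "China",
    4: "The United Kingdom",
    5: "Brazil",
}

# Replace each code by its label, as the Value Labels toggle does on screen.
countries = [value_labels[code] for code in country_codes]

absolute = Counter(countries)
n = len(countries)
relative = {country: count / n for country, count in absolute.items()}

# The mode is the only summary measure that makes sense for nominal data.
mode_country, mode_count = absolute.most_common(1)[0]

for country, count in absolute.most_common():
    print(f"{country:<20}{count:>3}{relative[country]:>8.0%}")
print("Mode:", mode_country)
```

Averaging the codes would produce a number, but it would describe nothing: the codes are labels, so only the counts and the mode are interpretable.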

2.3.2

Nonmetric Variables—Ordinal Scale

A nonmetric variable on an ordinal scale classifies the units into classes or categories regarding the characteristic being represented, establishing an order between the units of the different categories. An ordinal scale is a scale on which data are shown in order, determining a relative position of the classes according to one direction. Any set of values can be assigned to the variable categories, as long as the order between them is respected. As in the nominal scale, arithmetic operations (sums, subtractions, multiplications, and divisions) between these values do not make any sense. Thus, the application of the usual descriptive statistics is limited in the same way as for nominal variables. Since the scale numbers are only meant to classify the units, the descriptive statistics that can be used for ordinal data are frequency distribution tables, charts (including bar and pie charts), and the mode, as we will study in Chapter 3.


FIG. 2.7 Selecting the nominal variable Country.

Examples of ordinal variables include consumers’ opinion and satisfaction scales, educational level, social class, age, etc. Imagine a nonmetric variable called Classification that measures a group of consumers’ preference regarding a certain wine brand. The definition of labels for each ordinal variable category can be found in Table 2.6. Value 1 is assigned to the worst classification, value 2 to the second worst, and so on, until value 5, which is the best classification, as shown in this table. Instead of using scales from 1 to 5, we could have assigned any other numerical scale, as long as the order of classification had been respected. Thus, the numerical values do not represent a score of the product’s quality, they are only meant to classify it. So, the difference between these values does not represent the difference of the attribute analyzed. These scales of measurement are known as ordinal scales. Fig. 2.10 shows the characteristics of the variables being studied in Variable View on SPSS. The variable Customer is a string (its data are inserted as characters or letters) with a nominal scale of measurement. On the other hand, the variable Classification is numerical (numerical values were assigned to represent the variable categories) with an ordinal scale of measurement. The procedure for defining labels for qualitative variables on an ordinal scale is the same as the one already presented for nominal variables.


FIG. 2.8 Defining the labels for the variable Country.

FIG. 2.9 Dataset with labels.


TABLE 2.6 Consumers’ Classification of a Certain Wine Brand

Value   Label
1       Very bad
2       Bad
3       Average
4       Good
5       Very good

FIG. 2.10 Defining the variable characteristics in Variable View.
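The claim that any order-preserving set of values could replace the scale from 1 to 5 can be checked with a small Python sketch. It compares the coding of Table 2.6 with a hypothetical alternative coding and two invented groups of answers, and it suggests why differences and means of ordinal codes should not be interpreted: the ranking of categories is unchanged, while the comparison of group means flips.

```python
# Labels from Table 2.6, from worst to best.
labels = ["Very bad", "Bad", "Average", "Good", "Very good"]

# Two codings that both respect the order of the categories.
coding_a = dict(zip(labels, [1, 2, 3, 4, 5]))
coding_b = dict(zip(labels, [1, 2, 3, 4, 10]))  # hypothetical alternative

# Hypothetical answers from two small groups of consumers.
group_1 = ["Bad", "Very good"]
group_2 = ["Good", "Good"]


def mean_code(answers, coding):
    """Arithmetic mean of the numeric codes (not meaningful for ordinal data)."""
    return sum(coding[a] for a in answers) / len(answers)


# Both codings rank the categories identically...
same_order = sorted(labels, key=coding_a.get) == sorted(labels, key=coding_b.get)

# ...but the comparison of group means depends on the coding chosen.
group_1_wins_a = mean_code(group_1, coding_a) > mean_code(group_2, coding_a)
group_1_wins_b = mean_code(group_1, coding_b) > mean_code(group_2, coding_b)

print(same_order, group_1_wins_a, group_1_wins_b)
```

Because both codings are equally legitimate for an ordinal variable, a conclusion that changes with the coding is an artifact of the numbers, not a property of the consumers.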

2.3.3

Quantitative Variable—Interval Scale

According to Stevens’ classification (1946), metric or quantitative variables have data on an interval or ratio scale. Besides ordering the units based on the characteristic being measured, the interval scale has a constant unit of measure. The origin or point zero of this scale of measurement is arbitrary, and it does not express an absence of quantity. A classic example of an interval scale is temperature measured in Celsius (°C) or in Fahrenheit (°F). The choice of temperature zero is arbitrary, and equal temperature differences are determined by identifying equal expansion volumes in the liquid inside the thermometer. Hence, the interval scale allows us to infer differences between the units to be measured. However, we cannot state that a value in a specific interval of the scale is a multiple of another one. For instance, assume that two objects are measured at 15°C and 30°C, respectively. Measuring the temperature allows us to determine how much hotter one object is than the other. However, we cannot state that the object at 30°C is twice as hot as the one at 15°C. The interval scale is invariant under positive linear transformations, so an interval scale can be transformed into another through a positive linear transformation. Transforming degrees Celsius into degrees Fahrenheit is an example of such a linear transformation. Most descriptive statistics can be applied to variable data on an interval scale, except statistics based on the ratio scale, such as the coefficient of variation.
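The temperature example can be verified numerically. The sketch below converts the two readings from the text, plus a third hypothetical reading of 45°C, into Fahrenheit with the usual formula F = 9C/5 + 32. It then compares ratios of values, which change with the scale, against ratios of differences, which do not.

```python
def celsius_to_fahrenheit(celsius):
    """Positive linear transformation between two interval scales."""
    return celsius * 9 / 5 + 32


# The two objects from the text, plus a third hypothetical reading.
c1, c2, c3 = 15.0, 30.0, 45.0
f1, f2, f3 = (celsius_to_fahrenheit(c) for c in (c1, c2, c3))

# Ratios of values are not preserved, so "twice as hot" has no meaning.
ratio_celsius = c2 / c1         # 2.0
ratio_fahrenheit = f2 / f1      # 86 / 59, about 1.46

# Ratios of differences are preserved, so differences can be compared.
diff_ratio_celsius = (c3 - c2) / (c2 - c1)
diff_ratio_fahrenheit = (f3 - f2) / (f2 - f1)

print(f1, f2, f3)
print(ratio_celsius, round(ratio_fahrenheit, 2))
print(diff_ratio_celsius, diff_ratio_fahrenheit)
```

The object at 30°C is "twice" the other only in Celsius. In Fahrenheit the same two objects read 86 and 59, which is why ratio statements are reserved for ratio scales.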

2.3.4

Quantitative Variable—Ratio Scale

Analogous to the interval scale, the ratio scale orders the units based on the characteristic measured and has a constant unit of measure. On the other hand, the origin (or point zero) is unique and value zero expresses the absence of quantity. Therefore, it is possible to know if a value in a specific interval of the scale is a multiple of another. Equal ratios between values of the scale correspond to equal ratios between the units measured. Thus, ratio scales do not vary under positive proportion transformations. For example, if a unit is 1 m high and the other 3 m, we can say that the latter is three times higher than the former. Among the scales of measurement, the ratio scale is the most complete, because it allows us to use all arithmetic operations. In addition to this, all the descriptive statistics can be applied to the data of a variable expressed on a ratio scale. Examples of variables whose data can be on the ratio scale include income, age, how many units of a certain product were manufactured, and distance traveled.
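A short Python sketch can illustrate the invariance of ratio scales under a change of unit. It uses the 1 m and 3 m units from the text together with two hypothetical heights, converts metres to centimetres, and checks that both the ratio between values and the coefficient of variation are unchanged. The coefficient of variation is computed here as the sample standard deviation divided by the mean, which is an assumption about the definition.

```python
import math
import statistics

# Heights in metres (the two units from the text plus hypothetical values).
heights_m = [1.0, 3.0, 1.5, 2.5]

# A positive proportional transformation: metres to centimetres.
heights_cm = [h * 100 for h in heights_m]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# The ratio between the 3 m unit and the 1 m unit survives the change of unit.
ratio_m = heights_m[1] / heights_m[0]
ratio_cm = heights_cm[1] / heights_cm[0]

cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)

print(ratio_m, ratio_cm)
print(math.isclose(cv_m, cv_cm))
```

Applying the same check to a Celsius to Fahrenheit conversion would fail, because adding a constant shifts the mean without changing the standard deviation.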

2.4

TYPES OF VARIABLES × NUMBER OF CATEGORIES AND SCALES OF ACCURACY

Qualitative or categorical variables can also be classified based on the number of categories: (a) dichotomous or binary (dummies), when they only take on two categories; (b) polychotomous, when they take on more than two categories. On the other hand, metric or quantitative variables can also be classified based on the scale of accuracy: discrete or continuous. This classification can be seen in Fig. 2.11.

2.4.1

Dichotomous or Binary Variable (Dummy)

A dichotomous or binary variable (dummy) can only take on two categories, and the values 0 or 1 are assigned to these categories. Value 1 is assigned when the characteristic of interest is present in the variable and value 0 otherwise. As examples, we have: smokers (1) and nonsmokers (0), a developed country (1) and an underdeveloped country (0), vaccinated patients (1) and nonvaccinated patients (0). Multivariate dependence techniques have as their main objective to specify a model that can explain and predict the behavior of one or more dependent variables through one or more explanatory variables. Many of these techniques, including simple and multiple regression analysis, binary and multinomial logistic regression, regression for count data, and multilevel modeling, among others, can easily and coherently be applied with the use of nonmetric explanatory variables, as long as they are transformed into binary variables that represent the categories of the original qualitative variable. In this regard, a qualitative variable with n categories can be represented by (n − 1) binary variables. For instance, imagine a variable called Evaluation, expressed by the categories good, average, or bad. Thus, two binary variables may be necessary to represent the original variable, depending on the researcher’s objectives, as shown in Table 2.7. Further details about the definition of dummy variables in confirmatory models will be discussed in Chapter 13, including the presentation of the operations necessary to generate them on software such as Stata.
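The coding of Table 2.7 can be reproduced with a minimal Python sketch. The helper `make_dummies` is hypothetical and is not taken from Stata or any other package. It treats one category as the reference (all dummies equal to 0) and creates one dummy for each remaining category, giving n − 1 dummies for n categories. The sample observations are invented.

```python
def make_dummies(values, categories, reference):
    """Encode a qualitative variable with n categories as n - 1 dummies.

    The reference category gets 0 in every dummy; each remaining category
    gets its own dummy (D1, D2, ...) in the order given by `categories`.
    """
    if reference not in categories:
        raise ValueError("reference must be one of the categories")
    coded = [c for c in categories if c != reference]
    names = [f"D{i}" for i in range(1, len(coded) + 1)]
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        rows.append({name: int(value == cat) for name, cat in zip(names, coded)})
    return rows


categories = ["good", "average", "bad"]
sample = ["good", "average", "bad", "average"]  # hypothetical observations

for value, dummies in zip(sample, make_dummies(sample, categories, "good")):
    print(f"{value:<8}", dummies)
```

With good as the reference category, D1 flags average and D2 flags bad, matching Table 2.7. Choosing a different reference changes the coding but not the information carried by the dummies.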

2.4.2

Polychotomous Variable

A qualitative variable can take on more than two categories and, in this case, it is called polychotomous. As examples, we can mention social classes (lower, middle, and upper) and educational levels (elementary school, high school, college, and graduate school).

2.4.3

Discrete Quantitative Variable

As described in Section 2.2.2, discrete quantitative variables can take on a finite set of values that frequently come from a count, such as, for example, the number of children in a family (0, 1, 2…), the number of senators elected, or the number of cars manufactured in a certain factory.

2.4.4

Continuous Quantitative Variable

Continuous quantitative variables, on the other hand, are those whose possible values are in an interval with real numbers and result from a metric measurement, as, for example, weight, height, or an individual’s salary (Bussab and Morettin, 2011).
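The four categories of Section 2.4 can be summarized in a rough Python heuristic. The function `classify` is a hypothetical illustration: it infers the class from how the values are stored, which is only a starting point, since a qualitative variable stored as numeric codes would be misclassified and must be declared by the researcher.

```python
def classify(values):
    """Classify a variable by number of categories or by scale of accuracy.

    Heuristic assumptions: text values are qualitative, whole numbers are
    discrete, and any non-integer number marks the variable as continuous.
    """
    distinct = set(values)
    if all(isinstance(v, str) for v in values):
        if len(distinct) < 2:
            raise ValueError("a variable needs at least two categories")
        return "dichotomous" if len(distinct) == 2 else "polychotomous"
    if all(isinstance(v, int) for v in values):
        return "discrete"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    raise ValueError("mixed text and numbers: clean the data first")


# Hypothetical samples based on the examples in the text.
examples = {
    "smoker": ["yes", "no", "no", "yes"],
    "educational level": ["elementary school", "high school", "college"],
    "number of children": [0, 1, 2, 2],
    "weight (kg)": [62.5, 56.0, 84.3],
}

for name, values in examples.items():
    print(f"{name:<20}{classify(values)}")
```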

FIG. 2.11 Qualitative variables × Number of categories and Quantitative variables × Scales of accuracy.

TABLE 2.7 Defining Binary Variables (Dummies) for the Variable Evaluation

             Binary Variables (Dummies)
Evaluation   D1    D2
Good         0     0
Average      1     0
Bad          0     1

2.5

FINAL REMARKS

Whenever treated and analyzed through several different statistical techniques, data are transformed into information and can support the decision-making process. These data can be metric (quantitative) or nonmetric (categorical or qualitative). Metric data represent the characteristics of an individual, object, or element that result from a count or measurement (patients’ weight, age, interest rate, among other examples). In the case of nonmetric data, these characteristics cannot be measured or quantified (answers such as yes or no, educational levels, among others). According to Stevens (1946), the scales of measurement of nonmetric, categorical or qualitative variables can be classified as nominal and ordinal, while the metric or quantitative variables are classified on interval and ratio scales (or proportional). A lot of data can be collected in a metric as well as in a nonmetric way. Assume that we wish to assess the quality of a certain product. In order to do that, scores from 1 to 10 regarding certain attributes can be assigned, and a Likert scale can be defined from information that has already been established. In general, and whenever possible, questions must be defined in a quantitative way, so that the researcher does not lose information from the data. For Fávero et al. (2009), generating the questionnaire and defining the variable scales of measurement will depend on several aspects, including the research objectives, the modeling to be adopted to achieve such objectives, the average time to apply the questionnaire, and how it will be collected. A dataset can present variables on metric and on nonmetric scales; it does not need to restrict itself to only one type of scale. This combination can lead to interesting research and, jointly with suitable modeling, it can generate information aimed at assisting the decision-making process.
The type of variable collected is crucial in the calculation of descriptive statistics and in the graphical representation of results, as well as in the selection of the statistical methods that will be used to analyze the data.
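As a compact recap, the correspondence between scales of measurement and descriptive statistics discussed in this chapter can be written as a lookup table. The Python sketch below paraphrases Sections 2.3.1 to 2.3.4; the list of statistics is deliberately short and illustrative, and the names `ALLOWED` and `is_allowed` are hypothetical.

```python
# Descriptive statistics that this chapter associates with each scale.
# The entries paraphrase the text and are not an exhaustive list.
ALLOWED = {
    "nominal": {"frequency table", "bar chart", "pie chart", "mode"},
    "ordinal": {"frequency table", "bar chart", "pie chart", "mode"},
    "interval": {"frequency table", "mode", "mean", "standard deviation"},
    "ratio": {"frequency table", "mode", "mean", "standard deviation",
              "coefficient of variation"},
}


def is_allowed(scale, statistic):
    """Return True if the chapter permits the statistic on the given scale."""
    return statistic in ALLOWED[scale]


for scale in ALLOWED:
    print(f"{scale:<10}mean: {is_allowed(scale, 'mean')!s:<7}"
          f"coefficient of variation: {is_allowed(scale, 'coefficient of variation')}")
```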

2.6

EXERCISES

1) What is the difference between qualitative and quantitative variables?
2) What are scales of measurement and what are the main types of scales? What are the differences between them?
3) What is the difference between discrete and continuous variables?
4) Classify the variables below according to the following scales: nominal, ordinal, binary, discrete, or continuous.
a. A company’s revenue.
b. A performance rank: good, average, and bad.
c. Time to process a part.
d. Number of cars sold.
e. Distance traveled in km.
f. Municipalities in the Greater Sao Paulo.
g. Family income ranges.
h. A student’s grades: A, B, C, D, O, or R.
i. Hours worked.
j. Region: North, Northeast, Center-West, South, and Southeast.
k. Location: Sao Paulo or Seoul.
l. Size of the organization: small, medium, and large.
m. Number of bedrooms.
n. Classification of risk: high, average, speculative, substantial, in moratorium.
o. Married: yes or no.
5) A researcher wishes to study the impact of physical aptitude on the improvement of productivity in an organization. How would you describe the binary variables to be included in this model, so that the variable physical aptitude could be represented? The possible variable categories are: (a) active and healthy; (b) acceptable (could be better); (c) not good enough; (d) sedentary.

Chapter 3

Univariate Descriptive Statistics

Mathematics is the alphabet with which God has written the Universe.
Galileo Galilei

3.1

INTRODUCTION

Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data behavior. The analysis is based on the dataset being studied (sample), without drawing any conclusions or inferences from the population. Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable.

Univariate descriptive statistics considers the following topics: (a) the frequency in which a set of data occurs through frequency distribution tables; (b) the representation of the variable’s distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).

The four main goals of this chapter are: (1) to introduce the most common concepts related to the tables, charts, and summary measures in univariate descriptive statistics, (2) to present its applications in real examples, (3) to construct tables, charts, and summary measures using Excel and the statistical software SPSS and Stata, and (4) to discuss the results achieved.

As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable is essential when calculating descriptive statistics and in the graphical representation of the results. Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable. Fig. 3.1 summarizes the following information:

a) The descriptive statistics used to represent the behavior of one qualitative variable’s data are frequency distribution tables and graphs/charts.
b) The frequency distribution table for a qualitative variable represents the frequency in which each variable category occurs.
c) The graphical representation of qualitative variables can be illustrated by bar charts (horizontal and vertical), pie charts, and by a Pareto chart.
d) For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency in which each possible value of a discrete variable occurs, or to represent the frequency of the data of continuous variables grouped into classes.
e) Line graphs, dot or dispersion plots, histograms, stem-and-leaf plots, and boxplots (box-and-whisker diagrams) are normally used as the graphical representation of quantitative variables.
f) Measures of position or location can be divided into measures of central tendency (mean, mode, and median) and quantiles (quartiles, deciles, and percentiles).
g) The most common measures of dispersion or variability are range, average deviation, variance, standard deviation, standard error, and coefficient of variation.
h) The measures of shape include measures of skewness and kurtosis.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00003-3 © 2019 Elsevier Inc. All rights reserved.


Variable type
  Qualitative
    Tables: frequency distribution
    Charts: bar (horizontal or vertical), pie, Pareto
  Quantitative
    Tables: frequency distribution
    Graphs: line, scatter, histogram, stem-and-leaf, boxplot
    Summary measures
      Position or location
        Central tendency: mean, mode*, median
        Quantiles: quartiles, deciles, percentiles
      Dispersion or variability: range, average deviation, variance, standard deviation, standard error, coefficient of variation
      Shape: skewness, kurtosis

FIG. 3.1 A brief summary of univariate descriptive statistics. *The mode, which provides the most frequent value of the variable, is the only summary measure that can also be used for qualitative variables.

3.2 FREQUENCY DISTRIBUTION TABLE

Frequency distribution tables show how often the values of a qualitative or quantitative variable occur in a dataset. For qualitative variables, the table shows the frequency with which each variable category occurs. For discrete quantitative variables, the frequency of occurrence is calculated for each discrete value of the variable. Continuous data, on the other hand, are first grouped into classes, and the frequency is then calculated for each class. A frequency distribution table contains the following calculations:
a) Absolute frequency (Fi): number of times each value i appears in the sample.
b) Relative frequency (Fri): absolute frequency expressed as a percentage of the total number of observations.
c) Cumulative frequency (Fac): sum of the absolute frequencies of all values equal to or less than the value being analyzed.
d) Relative cumulative frequency (Frac): cumulative frequency expressed as a percentage of the total (the sum of all relative frequencies of values equal to or less than the value being considered).
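As a cross-check on these four definitions, the calculations can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the Excel, SPSS, or Stata workflow used in this chapter: the function name frequency_table and the dictionary layout are assumptions made here, and the counts are the donor numbers given in Example 3.1 (Table 3.E.1), with the Rh-negative blood types written as A-, B-, AB-, and O-.

```python
def frequency_table(counts):
    """Build rows (value, Fi, Fri %, Fac, Frac %) from {value: absolute frequency}.

    Hypothetical helper for illustration; rows keep the order of the dict.
    """
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for value, fi in counts.items():
        cumulative += fi  # Fac: running sum of the absolute frequencies
        rows.append((
            value,
            fi,
            round(100 * fi / total, 2),          # Fri (%)
            cumulative,
            round(100 * cumulative / total, 2),  # Frac (%)
        ))
    return rows


# Donor counts from Table 3.E.1 (Example 3.1)
donors = {"A+": 15, "A-": 2, "B+": 6, "B-": 1,
          "AB+": 1, "AB-": 1, "O+": 32, "O-": 2}

table = frequency_table(donors)
print(f"{'Type':<5}{'Fi':>4}{'Fri(%)':>9}{'Fac':>5}{'Frac(%)':>9}")
for value, fi, fri, fac, frac in table:
    print(f"{value:<5}{fi:>4}{fri:>9.2f}{fac:>5}{frac:>9.2f}")
```

The same helper can be applied to discrete quantitative data, provided the dictionary lists the numeric values in ascending order.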

3.2.1 Frequency Distribution Table for Qualitative Variables

Through a practical example, we will build the frequency distribution table using the calculations of the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed. Example 3.1 Saint August Hospital provides 3000 blood transfusions to hospitalized patients every month. In order for the hospital to be able to maintain its stocks, 60 blood donations a day are necessary. Table 3.E.1 shows the total number of donors for each blood type on a certain day. Build the frequency distribution table for this problem.

TABLE 3.E.1 Total Number of Donors of Each Blood Type

Blood Type    Donors
A+            15
A−            2
B+            6
B−            1
AB+           1
AB−           1
O+            32
O−            2

Solution The complete frequency distribution table for Example 3.1 is shown in Table 3.E.2:

TABLE 3.E.2 Frequency Distribution of Example 3.1

Blood Type    Fi    Fri (%)    Fac    Frac (%)
A+            15    25         15     25
A−            2     3.33       17     28.33
B+            6     10         23     38.33
B−            1     1.67       24     40
AB+           1     1.67       25     41.67
AB−           1     1.67       26     43.33
O+            32    53.33      58     96.67
O−            2     3.33       60     100
Sum           60    100

3.2.2 Frequency Distribution Table for Discrete Data

Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable. Different from qualitative variables, instead of the possible categories we must have the possible numeric values. To facilitate understanding, the data must be presented in ascending order. Example 3.2 A Japanese restaurant is defining the new layout for its tables and, in order to do that, it collected information on the number of people who have lunch and dinner at each table throughout one week. Table 3.E.3 shows the first 40 pieces of data collected. Build the frequency distribution table for these data.

TABLE 3.E.3 Number of People per Table

2    5    4    7    4    1    6    2    2    5
4    12   8    6    4    5    2    8    2    6
4    7    2    5    6    4    1    5    10   2
2    10   6    4    3    4    6    3    8    4


Solution In the next table, each row of the first column represents a possible numeric value of the variable being analyzed. The data are sorted in ascending order. The complete frequency distribution table for Example 3.2 is shown below.

TABLE 3.E.4 Frequency Distribution for Example 3.2

Number of People    Fi    Fri (%)    Fac    Frac (%)
1                   2     5          2      5
2                   8     20         10     25
3                   2     5          12     30
4                   9     22.5       21     52.5
5                   5     12.5       26     65
6                   6     15         32     80
7                   2     5          34     85
8                   3     7.5        37     92.5
10                  2     5          39     97.5
12                  1     2.5        40     100
Sum                 40    100

3.2.3 Frequency Distribution Table for Continuous Data Grouped into Classes

As described in Chapter 2, continuous quantitative variables are those whose possible values lie in an interval of real numbers. It therefore makes little sense to calculate the frequency of each possible value, since values rarely repeat. It is better to group the data into classes or ranges. The choice of the interval between the classes is arbitrary. However, if the number of classes is too small, a lot of information can be lost. On the other hand, if the number of classes is too large, the table no longer summarizes the information (Bussab and Morettin, 2011). The interval between the classes does not need to be constant, but in order to keep things simple, we will assume the same interval for all classes. The following steps must be taken to build a frequency distribution table for continuous data:
Step 1: Sort the data in ascending order.
Step 2: Determine the number of classes (k), using one of the options:
a) Sturges’ rule: $k = 1 + 3.3 \cdot \log(n)$
b) The expression $k = \sqrt{n}$
where n is the sample size and log is the base-10 logarithm. The value of k must be an integer.
Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value - minimum value) divided by the number of classes: $h = A/k$. The value of h is rounded up to the next integer.
Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class. The lower limit of the first class corresponds to the minimum value of the sample. To determine the upper limit of each class, we add the value of h to the lower limit of that class. The lower limit of each new class corresponds to the upper limit of the previous class.


Example 3.3 Consider the data in Table 3.E.5 regarding the grades of 30 students enrolled in the subject Financial Market. Elaborate a frequency distribution table for this problem.

TABLE 3.E.5 Grades of 30 Students Enrolled in the Subject Financial Market

4.2    3.9    5.7    6.5    4.6    6.3    8.0    4.4    5.0    5.5
6.0    4.5    5.0    7.2    6.4    7.2    5.0    6.8    4.7    3.5
6.0    7.4    8.8    3.8    5.5    5.0    6.6    7.1    5.3    4.7

Note: To determine the number of classes, use Sturges’ rule.

Solution Let’s apply the four steps to build the frequency distribution table of Example 3.3, whose variables are continuous: Step 1: Let’s sort the data in ascending order, as shown in Table 3.E.6.

TABLE 3.E.6 Data From Table 3.E.5 Sorted in Ascending Order

3.5    3.8    3.9    4.2    4.4    4.5    4.6    4.7    4.7    5
5      5      5      5.3    5.5    5.5    5.7    6      6      6.3
6.4    6.5    6.6    6.8    7.1    7.2    7.2    7.4    8      8.8

Step 2: Let’s determine the number of classes (k) by using Sturges’ rule:
$k = 1 + 3.3 \cdot \log(30) = 5.87 \approx 6$
Step 3: The interval between the classes (h) is given by:
$h = \frac{A}{k} = \frac{8.8 - 3.5}{6} = 0.88 \approx 1$
Step 4: Finally, let’s build the frequency distribution table for each class. The lower limit of the first class corresponds to the minimum grade, 3.5. Adding the interval between the classes (1) to this value gives the upper limit of the first class, 4.5. The second class starts from this value, and so on, until the last class is defined. We use the notation ├ to indicate that the lower limit is included in the class and the upper limit is not. The complete frequency distribution table for Example 3.3 is presented in Table 3.E.7.

TABLE 3.E.7 Frequency Distribution for Example 3.3

Class        Fi    Fri (%)    Fac    Frac (%)
3.5 ├ 4.5    5     16.67      5      16.67
4.5 ├ 5.5    9     30         14     46.67
5.5 ├ 6.5    7     23.33      21     70
6.5 ├ 7.5    7     23.33      28     93.33
7.5 ├ 8.5    1     3.33       29     96.67
8.5 ├ 9.5    1     3.33       30     100
Sum          30    100
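The four steps of Example 3.3 can also be retraced with a short Python sketch. It assumes the base-10 logarithm in Sturges' rule (which reproduces k = 5.87) and rounds h up to the next integer; the variable names are chosen here only for illustration, and the grades are those of Table 3.E.5.

```python
import math

# Grades from Table 3.E.5 (Example 3.3)
grades = [4.2, 3.9, 5.7, 6.5, 4.6, 6.3, 8.0, 4.4, 5.0, 5.5,
          6.0, 4.5, 5.0, 7.2, 6.4, 7.2, 5.0, 6.8, 4.7, 3.5,
          6.0, 7.4, 8.8, 3.8, 5.5, 5.0, 6.6, 7.1, 5.3, 4.7]

# Step 1: sort the data in ascending order
data = sorted(grades)
n = len(data)

# Step 2: number of classes by Sturges' rule (5.87 rounds to 6)
k = round(1 + 3.3 * math.log10(n))

# Step 3: class interval = range / k, rounded up (0.88 becomes 1)
sample_range = data[-1] - data[0]
h = math.ceil(sample_range / k)

# Step 4: count the observations in each class [lower, upper)
rows = []
cumulative = 0
for i in range(k):
    lower = data[0] + i * h
    upper = lower + h
    fi = sum(1 for g in data if lower <= g < upper)
    cumulative += fi
    rows.append((lower, upper, fi, round(100 * fi / n, 2),
                 cumulative, round(100 * cumulative / n, 2)))

for lower, upper, fi, fri, fac, frac in rows:
    print(f"{lower:.1f} |- {upper:.1f}  {fi:>2}  {fri:>6.2f}  {fac:>2}  {frac:>6.2f}")
```

For other datasets, it is worth checking that the last class actually reaches the maximum value, since the rounding applied to k and h may leave it short.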

3.3 GRAPHICAL REPRESENTATION OF THE RESULTS

The behavior of qualitative and quantitative variable data can also be represented graphically. Charts represent numeric data in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret these data quickly and objectively. In Section 3.3.1, the main graphical representations for qualitative variables are illustrated: bar charts (horizontal and vertical), pie charts, and the Pareto chart. Quantitative variables are usually represented by line graphs, scatter (dot) plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2. Bar charts (horizontal and vertical), pie charts, the Pareto chart, line graphs, scatter plots, and histograms will be generated in Excel. The boxplots and histograms will be constructed by using SPSS and Stata.

To build a chart in Excel, the variables’ data and names must first be standardized, codified, and selected in a spreadsheet. The next step consists in clicking on the Insert tab and, in the Charts group, selecting the type of chart we are interested in using (Columns, Lines, Pie, Bar, Area, Scatter, or Other Charts). The chart will be generated automatically on the screen, and it can be personalized according to the preferences of the researcher. Excel offers a variety of chart styles, layouts, and formats. To use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab.

On the Layout tab, for example, there are many resources available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the desired position); Data Table (shows the data table below the chart, with or without legend codes); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines), among others. The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.

3.3.1 Graphical Representation for Qualitative Variables

3.3.1.1 Bar Chart

This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends. As its name indicates, this chart uses bars to represent the absolute or relative frequencies of each possible category (or numeric value) of a qualitative (or quantitative) variable. In vertical bar charts, each variable category is shown on the X-axis as a bar with constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis. Let’s now build horizontal and vertical bar charts from a practical example.

Example 3.4 A bank ran a satisfaction survey with 120 customers to measure how agile its services were (excellent, good, satisfactory, or poor). The absolute frequencies for each category are presented in Table 3.E.8. Construct vertical and horizontal bar charts for this problem.

TABLE 3.E.8 Frequencies of Occurrences per Category

Satisfaction    Absolute Frequency
Excellent       58
Good            18
Satisfactory    32
Poor            12

Solution Let’s build the vertical and horizontal bar charts of Example 3.4 in Excel.

FIG. 3.2 Vertical bar chart for Example 3.4 (X-axis: Satisfaction; Y-axis: Absolute frequency).

FIG. 3.3 Horizontal bar chart for Example 3.4 (Y-axis: Satisfaction; X-axis: Absolute frequency).

First, the data in Table 3.E.8 must be standardized, codified, and selected in a spreadsheet. After that, we click on the Insert tab and, in the Charts group, select the option Columns. The chart is automatically generated on the screen. Next, to personalize the chart, we click on it and select the following icons on the Layout tab: (a) Axis Titles: select the title for the horizontal axis (Satisfaction) and for the vertical axis (Frequency); (b) Legend: to hide the legend, click on None; (c) Data Labels: after clicking on More Data Label Options, select the option Value in Label Contains (or select the option Outside End). Fig. 3.2 shows the vertical bar chart of Example 3.4 generated in Excel. Based on Fig. 3.2, we can see that the categories of the variable being analyzed are presented on the X-axis by bars with the same width, and their respective heights indicate the frequencies on the Y-axis. To construct the horizontal bar chart, we must select the option Bar instead of Columns. The other steps follow the same logic. Fig. 3.3 represents the frequency data from Table 3.E.8 through a horizontal bar chart constructed in Excel.

The horizontal bar chart in Fig. 3.3 represents the categories of the variable on the Y-axis and their respective frequencies on the X-axis. For each variable category, we draw a bar with a length that corresponds to its frequency. This chart therefore only describes the behavior of each category of the original variable and supports investigations into the type of distribution. It does not allow us to calculate position, dispersion, skewness, or kurtosis measures, since the variable being studied is qualitative.

3.3.1.2 Pie Chart

Another way to represent qualitative data, in terms of relative frequencies (percentages), is the pie chart. The chart corresponds to a circle with an arbitrary radius (the whole) divided into sectors, or slices, of different sizes (parts of the whole). This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build the pie chart from a practical example.

Example 3.5 An election poll was carried out in the city of Sao Paulo to check voters’ preferences concerning the political parties running in the next elections for Mayor. The percentage of voters per political party can be seen in Table 3.E.9. Construct a pie chart for Example 3.5.

TABLE 3.E.9 Percentage of Voters per Political Party

Political Party    Percentage
PMDB               18
PSDB               22
PDT                12.5
PT                 24.5
PC do B            8
PV                 5
Others             10

Solution Let’s build the pie chart for Example 3.5 in Excel. The steps are similar to the ones in Example 3.4. However, we now have to select the option Pie in the Charts group, on the Insert tab. Fig. 3.4 presents the pie chart obtained in Excel for the data shown in Table 3.E.9.

FIG. 3.4 Pie chart of Example 3.5 (one slice per political party, labeled with the percentages of Table 3.E.9).

3.3.1.3 Pareto Chart

The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that action can be taken to reduce or eliminate them. The Pareto chart combines bars and a line graph. The bars represent the absolute frequencies of occurrence of the problems, and the line represents the relative cumulative frequencies. The problems are sorted in descending order of priority. Let’s now illustrate a practical example with a Pareto chart.


Example 3.6 A manufacturer of credit and magnetic cards has as its main objective to reduce the number of defective cards. The quality inspector classified a sample of 1000 cards that were collected during one week of production, according to the types of defects found, as shown in Table 3.E.10. Construct a Pareto chart for this problem.

TABLE 3.E.10 Frequencies of the Occurrence of Each Defect

Type of Defect        Absolute Frequency (Fi)
Damaged/Bent          71
Perforated            28
Illegible printing    12
Wrong characters      20
Wrong numbers         44
Others                6
Total                 181

Solution The first step in generating a Pareto chart is to sort the defects in order of priority (from the highest to the lowest frequency). The bar chart represents the absolute frequency of each defect. To construct the line graph, it is necessary to calculate the relative cumulative frequency (%) up to the defect analyzed. Table 3.E.11 shows the absolute frequency for each type of defect, in descending order, and the relative cumulative frequency (%).

TABLE 3.E.11 Absolute Frequency for Each Defect and the Relative Cumulative Frequency (%)

Type of Defect        Number of Defects    Cumulative %
Damaged/Bent          71                   39.23
Wrong numbers         44                   63.54
Perforated            28                   79.01
Wrong characters      20                   90.06
Illegible printing    12                   96.69
Others                6                    100
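The ordering and the cumulative percentages behind a Pareto chart can be reproduced with a few lines of Python. The sketch below is only an illustration of the calculation, not of the Excel procedure; the function name pareto_table is an assumption made here, and the input is the unsorted data of Table 3.E.10.

```python
def pareto_table(frequencies):
    """Sort categories by descending frequency and add the cumulative percentage.

    Hypothetical helper for illustration; returns (category, Fi, cumulative %).
    """
    total = sum(frequencies.values())
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, fi in ordered:
        running += fi  # cumulative absolute frequency up to this defect
        rows.append((category, fi, round(100 * running / total, 2)))
    return rows


# Absolute frequencies from Table 3.E.10 (Example 3.6)
defects = {
    "Damaged/Bent": 71,
    "Perforated": 28,
    "Illegible printing": 12,
    "Wrong characters": 20,
    "Wrong numbers": 44,
    "Others": 6,
}

rows = pareto_table(defects)
for category, fi, cumulative_pct in rows:
    print(f"{category:<20}{fi:>4}{cumulative_pct:>8.2f}")
```

In this example the category Others happens to have the lowest frequency, so a plain descending sort places it last; with other data it may need to be moved to the end by hand.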

Let’s now build a Pareto chart for Example 3.6 in Excel, using the data in Table 3.E.11. First, the data in Table 3.E.11 must be standardized, codified, and selected in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns (and the clustered column subtype). Note that the chart is automatically generated on the screen. However, the absolute frequency data as well as the relative cumulative frequency data are presented as columns. To change the chart type of the cumulative percentage series, we must right-click on any bar of that series and select the option Change Series Chart Type, followed by a line graph with markers. The resulting chart is a Pareto chart. To personalize the Pareto chart, we must use the following icons on the Layout tab: (a) Axis Titles: for the bar chart, we selected the title for the horizontal axis (Type of defect) and for the vertical axis (Frequency); for the line graph, we called the vertical axis Percentage; (b) Legend: to hide the legend, we must click on None; (c) Data Table: let’s select the option Show Data Table with Legend Keys; (d) Axes: the major unit of the vertical axes of both charts is set to 20, and the maximum value of the vertical axis of the line graph is set to 100. Fig. 3.5 shows the chart constructed in Excel that corresponds to the Pareto chart for Example 3.6.


FIG. 3.5 The Pareto chart for Example 3.6. Legend: A, Damaged/Bent; B, Wrong numbers; C, Perforated; D, Wrong characters; E, Illegible printing; F, Others.

3.3.2 Graphical Representation for Quantitative Variables

3.3.2.1 Line Graph

In a line graph, each point is located by its coordinates on the horizontal axis (X) and on the vertical axis (Y), and consecutive points are connected by straight lines. Despite using two axes, line graphs will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable’s data, which is usually continuous, at regular intervals. The numeric variable values are represented on the Y-axis, and the X-axis only shows the observations spaced uniformly. Let’s now illustrate a practical example of a line graph.

Example 3.7 Cheap & Easy is a supermarket that registered the percentage of losses it had in the last 12 months (Table 3.E.12). After having done that, it will adopt new prevention measures. Build a line graph for Example 3.7.

TABLE 3.E.12 Percentage of Losses in the Last 12 Months

Month        Losses (%)
January      0.42
February     0.38
March        0.12
April        0.34
May          0.22
June         0.15
July         0.18
August       0.31
September    0.47
October      0.24
November     0.42
December     0.09


Solution To build the line graph for Example 3.7 in Excel, in the Charts group, on the Insert tab, we must select the option Lines. The other steps follow the same logic of the previous examples. The complete chart can be seen in Fig. 3.6.

FIG. 3.6 Line graph for Example 3.7.

3.3.2.2 Scatter Plot

A scatter plot is very similar to a line graph. The biggest difference between them is in the way the data are plotted on the horizontal axis. As in a line graph, the points are located by their coordinates on the X-axis and on the Y-axis. However, they are not connected by straight lines. The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable’s data, similar to the line graph, but in general at irregular intervals. As in a line graph, the numeric variable values are represented on the Y-axis, and the X-axis only represents the data behavior throughout time. In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis). The numeric values of one variable will be represented on the Y-axis and those of the other on the X-axis.

Example 3.8 Papermisto is the supplier of three types of raw materials for the production of paper: cellulose, mechanical pulp, and trimmings. In order to maintain its quality standards, the factory carries out a rigorous inspection of its products during each production phase. At irregular intervals, an operator must verify the esthetic and dimensional characteristics of the selected product with specialized instruments. For instance, in the cellulose storage phase, the product must be piled up in bales of approximately 250 kg each. Table 3.E.13 shows the weight of the bales collected in the last 5 hours, at irregular intervals varying between 20 and 45 minutes. Construct a scatter plot for Example 3.8.

TABLE 3.E.13 Evolution of the Weight of the Bales Throughout Time

Time (min)    Weight (kg)
30            250
50            255
85            252
106           248
138           250
178           249
198           252
222           251
252           250
297           245


Solution To build the scatter plot for Example 3.8 in Excel, in the Charts group, on the Insert tab, we must select the option Scatter. The other steps follow the same logic of the previous examples. The scatter plot can be seen in Fig. 3.7.

FIG. 3.7 Scatter plot for Example 3.8 (X-axis: Time (min); Y-axis: Weight (kg)).

3.3.2.3 Histogram

A histogram is a vertical bar chart that represents the frequency distribution of one quantitative variable (discrete or continuous). The values of the variable being studied are presented on the X-axis: the base of each bar, with a constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order. The height of the bars on the Y-axis, in turn, represents the frequency (absolute, relative, or cumulative) of the respective variable values.

A histogram is very similar to a Pareto chart, and it is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories represented on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest). A histogram represents the frequency distribution of a quantitative variable, whose values represented on the X-axis are sorted in ascending order.

Therefore, the first step in building a histogram is to build the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for each class of continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency. The data must be sorted in ascending order. The histogram is then constructed from this table. The first column of the frequency distribution table, which contains the numeric values or the classes of the variable being studied, is presented on the X-axis, and the column of absolute frequency (or relative frequency, cumulative frequency, or relative cumulative frequency) is presented on the Y-axis.

Many statistical software packages generate the histogram automatically from the original values of the quantitative variable being studied, without the frequencies having to be calculated beforehand. Even though Excel has the option of building a histogram from its analysis tools, we will show how to build it from the column chart, due to its simplicity.

Example 3.9 In order to improve its services, a national bank is hiring new managers to serve its corporate clients. Table 3.E.14 shows the number of companies dealt with daily in one of its main branches in the capital. Build a histogram from these data using Excel.

TABLE 3.E.14 Number of Companies Dealt With Daily

13    11    13    10    11    12    8     12    9     10
12    10    8     11    9     11    14    11    10    9


Solution The first step is building the frequency distribution table (Table 3.E.15). From the data in Table 3.E.15, we can build a histogram of absolute frequency, relative frequency, cumulative frequency, or relative cumulative frequency using Excel. The histogram generated here will be the absolute frequency one. Thus, we must standardize, codify, and select the first two columns of Table 3.E.15 (except the last row, Sum) in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns. Let’s click on the chart so that it can be personalized. On the Layout tab, we select the following icons: (a) Axis Titles: select the title for the horizontal axis (Number of companies) and for the vertical axis (Absolute frequency); (b) Legend: to hide the legend, we must click on None. The histogram generated in Excel can be seen in Fig. 3.8.

TABLE 3.E.15 Frequency Distribution for Example 3.9

Number of Companies    Fi    Fri (%)    Fac    Frac (%)
8                      2     10         2      10
9                      3     15         5      25
10                     4     20         9      45
11                     5     25         14     70
12                     3     15         17     85
13                     2     10         19     95
14                     1     5          20     100
Sum                    20    100

FIG. 3.8 Histogram of absolute frequencies elaborated in Excel for Example 3.9 (X-axis: Number of companies; Y-axis: Absolute frequency).

As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, using the data in Table 3.E.14), without the frequencies having to be calculated. Moreover, these packages have the option of plotting the normal curve. Fig. 3.9 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.14. We will see in detail in Sections 3.6 and 3.7 how it can be constructed using SPSS and Stata, respectively. Note that the values of the discrete variable are presented in the middle of the base of each bar. For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market. These data were sorted in ascending order, as presented in Table 3.E.6. Fig. 3.10 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.5 or Table 3.E.6.

FIG. 3.9 Histogram constructed using SPSS for Example 3.9 (discrete data). X-axis: Number_of_companies; Y-axis: Frequency.

FIG. 3.10 Histogram generated using SPSS for Example 3.3 (continuous data). X-axis: Grades; Y-axis: Frequency.

Note that the data were grouped using a class interval of h = 0.5, unlike Example 3.3, which used h = 1. The classes’ lower limits are represented on the left side of the base of each bar, and the upper limits (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).
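The class counts behind a histogram with h = 0.5 can be checked with the Python sketch below. It is not how SPSS builds the chart: the starting point (3.5) and the class width are simply hard-coded here from the description above, and the grades are converted to integer tenths so that class boundaries are compared exactly.

```python
# Grades from Table 3.E.5 (Example 3.3)
grades = [4.2, 3.9, 5.7, 6.5, 4.6, 6.3, 8.0, 4.4, 5.0, 5.5,
          6.0, 4.5, 5.0, 7.2, 6.4, 7.2, 5.0, 6.8, 4.7, 3.5,
          6.0, 7.4, 8.8, 3.8, 5.5, 5.0, 6.6, 7.1, 5.3, 4.7]

WIDTH_TENTHS = 5    # class interval h = 0.5, in tenths of a grade point
START_TENTHS = 35   # lower limit of the first class, 3.5 (assumed from the text)

# Work in integer tenths to avoid floating-point trouble at class limits
tenths = [round(g * 10) for g in grades]
n_classes = (max(tenths) - START_TENTHS) // WIDTH_TENTHS + 1

counts = [0] * n_classes
for t in tenths:
    counts[(t - START_TENTHS) // WIDTH_TENTHS] += 1  # lower limit included

# Text rendering: one row per class, one '#' per observation
for i, count in enumerate(counts):
    lower = (START_TENTHS + i * WIDTH_TENTHS) / 10
    upper = lower + WIDTH_TENTHS / 10
    print(f"{lower:.1f} |- {upper:.1f}  {'#' * count} ({count})")
```

The first class, 3.5 ├ 4.0, receives the three values mentioned above, and the counts of all classes add up to the 30 grades.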

3.3.2.4 Stem-and-Leaf Plot

Both bar charts and histograms represent the shape of the variable’s frequency distribution. The stem-and-leaf plot is an alternative for representing the frequency distributions of discrete and continuous quantitative variables with few observations, with the advantage of preserving the original value of each observation (all of the information in the data remains visible).

In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem is located on the left of the vertical line and represents the observation’s first digit(s); the leaf is located on the right of the vertical line and represents the observation’s last digit(s). Choosing the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf, is arbitrary. The stems usually contain the most significant digits, and the leaves the least significant. The stems are listed in a single column, one value per line. For each stem on the left-hand side of the vertical line, the respective leaves are shown on the right-hand side across several columns. Stems as well as leaves must be sorted in ascending order. When there are too many leaves per stem, the same stem can be split across more than one line. The number of lines is arbitrary, just like the interval or the number of classes in a frequency distribution.

To build a stem-and-leaf plot, we can follow this sequence of steps:
Step 1: Sort the data in ascending order, to make the visualization of the data easier.
Step 2: Define the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf.
Step 3: Build the stems, represented in a single column on the left of the vertical line. Their different values are listed line by line, in ascending order. When the number of leaves per stem is very high, we can define two or more lines for the same stem.
Step 4: Place the leaves that correspond to the respective stems on the right-hand side of the vertical line, across several columns (in ascending order).

Example 3.10 A small company collected its employees’ ages, as shown in Table 3.E.16. Build a stem-and-leaf plot.

TABLE 3.E.16 Employees’ Ages
44  60  22  49  31  58  42  63  33  37
54  55  40  71  55  62  35  45  59  54
50  51  24  31  40  73  28  35  75  48

Solution To construct the stem-and-leaf plot, let’s apply the four steps described: Step 1 First, we must sort the data in ascending order, as shown in Table 3.E.17.

TABLE 3.E.17 Employees’ Ages in Ascending Order
22  24  28  31  31  33  35  35  37  40
40  42  44  45  48  49  50  51  54  54
55  55  58  59  60  62  63  71  73  75

Step 2 The next step is to define the number of initial digits of each observation that will form the stem. The complementary digits will form the leaf. In this example, all of the observations have two digits, so the stems correspond to the tens and the leaves to the units.
Step 3 The following step is to build the stems. Based on Table 3.E.17, we can see that there are observations that begin with the tens 2, 3, 4, 5, 6, and 7 (stems). The stem with the highest frequency is 5 (8 observations), and all of its leaves can be represented in a single line. Therefore, we will have a single line per stem. The stems are presented in a single column on the left of the vertical line, in ascending order, as shown in Fig. 3.11.


FIG. 3.11 Building the stems for Example 3.10.
2 |
3 |
4 |
5 |
6 |
7 |

Step 4 Finally, let’s place the leaves that correspond to each stem on the right-hand side of the vertical line. The leaves are represented in ascending order, one per column. For example, stem 2 contains leaves 2, 4, and 8. Stem 5 contains leaves 0, 1, 4, 4, 5, 5, 8, and 9, represented throughout 8 columns. If this stem were divided into two lines, the first line would have leaves 0 to 4, and the second line leaves 5 to 9. Fig. 3.12 illustrates the stem-and-leaf plot for Example 3.10.

FIG. 3.12 Stem-and-Leaf plot for Example 3.10.
2 | 2 4 8
3 | 1 1 3 5 5 7
4 | 0 0 2 4 5 8 9
5 | 0 1 4 4 5 5 8 9
6 | 0 2 3
7 | 1 3 5
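The four steps of Example 3.10 can also be sketched in code. The snippet below is only an illustration in Python (the book itself works with Excel, SPSS, and Stata); the function name `stem_and_leaf` is hypothetical, and it assumes two-digit integers, so that the tens form the stem and the units form the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):            # Step 1: sort in ascending order
        stem, leaf = divmod(v, 10)      # Step 2: tens -> stem, units -> leaf
        leaves[stem].append(leaf)
    # Steps 3 and 4: one line per stem, leaves already in ascending order
    return [f"{stem} | {' '.join(map(str, leaves[stem]))}"
            for stem in sorted(leaves)]

# Employees' ages from Table 3.E.16
ages = [44, 60, 22, 49, 31, 58, 42, 63, 33, 37,
        54, 55, 40, 71, 55, 62, 35, 45, 59, 54,
        50, 51, 24, 31, 40, 73, 28, 35, 75, 48]

plot = stem_and_leaf(ages)
print("\n".join(plot))
```

The printed lines should correspond to the rows of Fig. 3.12, with the original value of every observation still recoverable from its stem and leaf.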

Example 3.11 The average temperature, in Celsius, registered in the last 40 days in the city of Porto Alegre can be found in Table 3.E.18. Elaborate the stem-and-leaf plot for Example 3.11.

TABLE 3.E.18 Average Temperature in Celsius
 8.5  13.7  12.9   9.4  11.7  19.2  12.8   9.7  19.5  11.5
15.5  16.0  20.4  17.4  18.0  14.4  14.8  13.0  16.6  20.2
17.9  17.7  16.9  15.2  18.5  17.8  16.2  16.4  18.2  16.9
18.7  19.6  13.2  17.2  20.5  14.1  16.1  15.9  18.8  15.7

Solution Once again, let’s apply the four steps to construct the stem-and-leaf plot, but now we have to consider continuous variables. Step 1 First, let’s sort the data in ascending order, as shown in Table 3.E.19.

TABLE 3.E.19 Average Temperature in Ascending Order
 8.5   9.4   9.7  11.5  11.7  12.8  12.9  13.0  13.2  13.7
14.1  14.4  14.8  15.2  15.5  15.7  15.9  16.0  16.1  16.2
16.4  16.6  16.9  16.9  17.2  17.4  17.7  17.8  17.9  18.0
18.2  18.5  18.7  18.8  19.2  19.5  19.6  20.2  20.4  20.5


Step 2 In this example, the leaves correspond to the last digit. The remaining digits (to the left) correspond to the stems.
Steps 3 and 4 The stems vary from 8 to 20. The stem with the highest frequency is 16 (7 observations), and its leaves can be represented in a single line. For each stem, we place the respective leaves. Fig. 3.13 shows the stem-and-leaf plot for Example 3.11.

FIG. 3.13 Stem-and-Leaf Plot for Example 3.11.
 8 | 5
 9 | 4 7
10 |
11 | 5 7
12 | 8 9
13 | 0 2 7
14 | 1 4 8
15 | 2 5 7 9
16 | 0 1 2 4 6 9 9
17 | 2 4 7 8 9
18 | 0 2 5 7 8
19 | 2 5 6
20 | 2 4 5
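For data with one decimal place, as in Example 3.11, the same idea applies with the integer part as the stem and the decimal digit as the leaf. The following Python sketch is an illustration only; the function name `stem_and_leaf_decimal` is hypothetical, and keeping stems without observations (such as 10) as empty lines is a presentation choice assumed here.

```python
def stem_and_leaf_decimal(values):
    """Stem = integer part, leaf = first decimal digit; empty stems are kept."""
    tenths = sorted(round(v * 10) for v in values)    # work in integer tenths
    stems = range(tenths[0] // 10, tenths[-1] // 10 + 1)
    rows = {s: [] for s in stems}
    for t in tenths:
        rows[t // 10].append(t % 10)
    return [f"{s:>2} | {' '.join(map(str, rows[s]))}".rstrip() for s in stems]

# Average temperatures from Table 3.E.18
temperatures = [8.5, 13.7, 12.9, 9.4, 11.7, 19.2, 12.8, 9.7, 19.5, 11.5,
                15.5, 16.0, 20.4, 17.4, 18.0, 14.4, 14.8, 13.0, 16.6, 20.2,
                17.9, 17.7, 16.9, 15.2, 18.5, 17.8, 16.2, 16.4, 18.2, 16.9,
                18.7, 19.6, 13.2, 17.2, 20.5, 14.1, 16.1, 15.9, 18.8, 15.7]

lines = stem_and_leaf_decimal(temperatures)
print("\n".join(lines))
```

Converting to integer tenths before splitting avoids floating-point surprises when separating the stem from the leaf.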

3.3.2.5 Boxplot or Box-and-Whisker Diagram
The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: minimum value, first quartile (Q1), second quartile (Q2) or median (Md), third quartile (Q3), and maximum value. In a sorted sample, the median corresponds to the central position, and the quartiles correspond to the subdivision of the sample into four equal parts, each one containing 25% of the data. Thus, the first quartile (Q1) is the value below which the first 25% of the data (organized in ascending order) are located. The second quartile corresponds to the median (50% of the sorted data are located below it and the remaining 50% above it), and the third quartile (Q3) is the value below which 75% of the observations are located. The dispersion measure resulting from these location measures is called interquartile range (IQR) or interquartile interval (IQI) and corresponds to the difference between Q3 and Q1. This plot allows us to assess the data symmetry and distribution. It also gives us a visual perspective of whether or not there are discrepant data (univariate outliers), since these data lie above the upper limit or below the lower limit. A representation of the diagram can be seen in Fig. 3.14.

FIG. 3.14 Boxplot.


Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.

3.4 THE MOST COMMON SUMMARY-MEASURES IN UNIVARIATE DESCRIPTIVE STATISTICS Information found in a dataset can be summarized through suitable numerical measures, called summary measures. In univariate descriptive statistics, the most common summary measures have as their main objective to represent the behavior of the variable being studied through its central and noncentral values, its dispersions, or the way its values are distributed around the mean. The summary measures that will be studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as, skewness and kurtosis. These measures are calculated for metric or quantitative variables. The only exception is the mode, which is a measure of central tendency that provides the most frequent value of a certain variable, so, it can also be calculated for nonmetric or qualitative variables.

3.4.1 Measures of Position or Location

These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied. The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).

3.4.1.1 Measures of Central Tendency
The most common measures of central tendency are the arithmetic mean, the median, and the mode.

3.4.1.1.1 Arithmetic Mean
The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter $\mu$, or a representative measure of a sample with n elements, represented by $\bar{X}$.

3.4.1.1.1.1 Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data
Simple arithmetic mean, or simply mean, or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X ($\bar{X}$) is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

where n is the total number of observations in the dataset and $X_i$, for i = 1, …, n, represents each one of variable X’s values.

Example 3.12 Calculate the simple arithmetic mean of the data in Table 3.E.20, regarding the grades of the graduate students enrolled in the subject Quantitative Methods.

TABLE 3.E.20 Students’ Grades
5.7  6.5  6.9  8.3  8.0  4.2  6.3  7.4  5.8  6.9


Solution The mean is simply calculated as the sum of all the values in Table 3.E.20 divided by the total number of observations:

$$\bar{X} = \frac{5.7 + 6.5 + \cdots + 6.9}{10} = 6.6$$

The AVERAGE function in Excel calculates the simple arithmetic mean of the set of values selected. Let’s assume that the data in Table 3.E.20 are available from cell A1 to cell A10. To calculate the mean, we just need to insert the expression =AVERAGE(A1:A10). Another way to calculate the mean using Excel, as well as other descriptive measures, such as the median, mode, variance, standard deviation, standard error, skewness, and kurtosis, which will also be studied in this chapter, is by using the Analysis ToolPak add-in (Section 3.5).
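As a cross-check of Expression (3.1) outside a spreadsheet, the calculation can be sketched in Python. This is an illustration only; the variable names are arbitrary, and `statistics.fmean` from the standard library is shown as an equivalent shortcut.

```python
from statistics import fmean

# Students' grades from Table 3.E.20
grades = [5.7, 6.5, 6.9, 8.3, 8.0, 4.2, 6.3, 7.4, 5.8, 6.9]

# Expression (3.1): sum of the values divided by the number of observations
mean = sum(grades) / len(grades)

print(round(mean, 2))            # direct application of the formula
print(round(fmean(grades), 2))   # same result via the standard library
```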

3.4.1.1.1.2 Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data
When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning different weights ($p_i$) to each value i of variable X, we use the weighted arithmetic mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i} \tag{3.2}$$

If the weight is expressed in percentages (relative weight, rw), Expression (3.2) becomes:

$$\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i \tag{3.3}$$

Example 3.13 At Vanessa’s school, the annual average of each subject is calculated based on the grades obtained throughout all four quarters, with their respective weights being: 1, 2, 3, and 4. Table 3.E.21 shows Vanessa’s grades in mathematics in each quarter. Calculate her annual average in the subject.

TABLE 3.E.21 Vanessa’s Grades in Mathematics
Period         Grade   Weight
1st Quarter    4.5     1
2nd Quarter    7.0     2
3rd Quarter    5.5     3
4th Quarter    6.5     4

Solution The annual average is calculated by using the weighted arithmetic mean criterion. Applying Expression (3.2) to the data in Table 3.E.21, we have:

$$\bar{X} = \frac{4.5 \times 1 + 7.0 \times 2 + 5.5 \times 3 + 6.5 \times 4}{1 + 2 + 3 + 4} = 6.1$$
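Expression (3.2) translates almost literally into code. The Python sketch below is illustrative; `weighted_mean` is a hypothetical helper name, and the input checks are an added assumption, not part of the original formula.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean, Expression (3.2)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * p for x, p in zip(values, weights)) / total_weight

# Vanessa's quarterly grades and weights from Table 3.E.21
grades = [4.5, 7.0, 5.5, 6.5]
weights = [1, 2, 3, 4]

annual_average = weighted_mean(grades, weights)
print(round(annual_average, 2))
```

With equal weights, the function reduces to the simple arithmetic mean of Expression (3.1).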


Example 3.14 There are five stocks in a certain investment portfolio. Table 3.E.22 shows the average yield of each stock in the previous month, as well as the respective percentage invested. Determine the portfolio’s average yield.

TABLE 3.E.22 Yield of Each Stock and Percentage Invested
Stock                Yield (%)   % Investment
Bank of Brazil ON    1.05        10
Bradesco PN          0.56        25
Eletrobras PNB       0.08        15
Gerdau PN            0.24        20
Vale PN              0.75        30

Solution The portfolio’s average yield (%) corresponds to the sum of the products between each stock’s average yield (%) and the respective percentage invested, and, using Expression (3.3), we have:

$$\bar{X} = 1.05 \times 0.10 + 0.56 \times 0.25 + 0.08 \times 0.15 + 0.24 \times 0.20 + 0.75 \times 0.30 = 0.53\%$$
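When the weights are given as percentages that add up to 100, Expression (3.3) needs no denominator. A minimal Python sketch of the portfolio calculation follows; the variable names are arbitrary, and the check that the shares sum to 100 is an assumption added for safety.

```python
# Yields (%) and percentages invested from Table 3.E.22
yields_pct = [1.05, 0.56, 0.08, 0.24, 0.75]
shares_pct = [10, 25, 15, 20, 30]

if sum(shares_pct) != 100:
    raise ValueError("relative weights must add up to 100%")

# Expression (3.3): relative weights already sum to 1, so no division is needed
relative_weights = [share / 100 for share in shares_pct]
portfolio_yield = sum(y * rw for y, rw in zip(yields_pct, relative_weights))

print(round(portfolio_yield, 2))
```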

3.4.1.1.1.3 Case 3: Arithmetic Mean of Grouped Discrete Data
When the discrete values of $X_i$ repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we have to use the same criterion as for the weighted mean. However, the weight for each $X_i$ will be represented by absolute frequencies ($F_i$) and, instead of n observations with n different values, we will have n observations with m different values (grouped data):

$$\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n} \tag{3.4}$$

If the frequency of the data is expressed in terms of the percentage relative to the absolute frequency (relative frequency, Fr), Expression (3.4) becomes:

$$\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i \tag{3.5}$$

Example 3.15 A satisfaction survey with 120 participants evaluated the performance of a health insurance company through grades that vary between 1 and 10. The survey’s results can be seen in Table 3.E.23. Calculate the arithmetic mean for Example 3.15.

TABLE 3.E.23 Absolute Frequency Table
Grades   Number of Participants
1        9
2        12
3        15
4        18
5        24
6        26
7        5
8        7
9        3
10       1

Solution The arithmetic mean of Example 3.15 is calculated from Expression (3.4):

$$\bar{X} = \frac{1 \times 9 + 2 \times 12 + \cdots + 9 \times 3 + 10 \times 1}{120} = 4.62$$
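For grouped discrete data, a frequency table maps naturally onto a dictionary of value and absolute frequency pairs. The following Python sketch of Expression (3.4) is illustrative only, and the variable names are arbitrary.

```python
# Grade -> number of participants, from Table 3.E.23
frequencies = {1: 9, 2: 12, 3: 15, 4: 18, 5: 24,
               6: 26, 7: 5, 8: 7, 9: 3, 10: 1}

n = sum(frequencies.values())                          # total observations
weighted_sum = sum(x * f for x, f in frequencies.items())
mean = weighted_sum / n                                # Expression (3.4)

print(n, weighted_sum, round(mean, 2))
```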

3.4.1.1.1.4 Case 4: Arithmetic Mean of Continuous Data Grouped into Classes
To calculate the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, $X_i$ represents each i value of variable X. For continuous data grouped into classes, each class does not have a single value defined, but a set of values. In order for the arithmetic mean to be calculated in this case, we assume that $X_i$ is the middle or central point of class i (i = 1, …, k), so Expressions (3.4) and (3.5) are rewritten in terms of the number of classes (k):

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} \tag{3.6}$$

$$\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i \tag{3.7}$$

Example 3.16 Table 3.E.24 shows the classes of salaries paid to the employees of a certain company and their respective absolute and relative frequencies. Calculate the average salary.

TABLE 3.E.24 Classes of Salaries (US\$ 1000.00) and Their Respective Absolute and Relative Frequencies
Classes    Fi      Fri (%)
1 ├ 3      240     17.14
3 ├ 5      480     34.29
5 ├ 7      320     22.86
7 ├ 9      150     10.71
9 ├ 11     130     9.29
11 ├ 13    80      5.71
Sum        1400    100


Solution Considering $X_i$ the central point of class i and applying Expression (3.6), we have:

$$\bar{X} = \frac{2 \times 240 + 4 \times 480 + 6 \times 320 + 8 \times 150 + 10 \times 130 + 12 \times 80}{1{,}400} = 5.557$$

or using Expression (3.7):

$$\bar{X} = 2 \times 0.1714 + 4 \times 0.3429 + \cdots + 10 \times 0.0929 + 12 \times 0.0571 = 5.557$$

Therefore, the average salary is US\$ 5,557.14.
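The class-midpoint assumption behind Expression (3.6) can be made explicit in code. The Python sketch below is an illustration; representing each class as a (lower limit, upper limit) pair is a choice made here for convenience.

```python
# ((lower limit, upper limit), absolute frequency), from Table 3.E.24
salary_classes = [((1, 3), 240), ((3, 5), 480), ((5, 7), 320),
                  ((7, 9), 150), ((9, 11), 130), ((11, 13), 80)]

n = sum(freq for _, freq in salary_classes)

# Each class is represented by its midpoint, as assumed in Expression (3.6)
mean = sum((lower + upper) / 2 * freq
           for (lower, upper), freq in salary_classes) / n

print(n, round(mean, 3))   # mean is expressed in US$ 1000.00
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation of the mean of the ungrouped salaries.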

3.4.1.1.2 Median
The median (Md) is a measure of location. It locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts, so 50% of the elements are less than or equal to the median, and the other 50% are greater than or equal to the median.

3.4.1.1.2.1 Case 1: Median of Ungrouped Discrete and Continuous Data
The median of variable X (discrete or continuous) can be calculated as follows:

$$Md(X) = \begin{cases} \dfrac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2}, & \text{if } n \text{ is an even number} \\[2ex] X_{\frac{n+1}{2}}, & \text{if } n \text{ is an odd number} \end{cases} \tag{3.8}$$

where n is the total number of observations and $X_1 \le \ldots \le X_n$, considering that $X_1$ is the smallest observation or the value of the first element, and that $X_n$ is the highest observation or the value of the last element.

Example 3.17 Table 3.E.25 shows the monthly production of treadmills of a company in a given year. Calculate the median.

TABLE 3.E.25 Monthly Production of Treadmills in a Given Year
Month    Production (units)
Jan.     210
Feb.     180
Mar.     203
April    195
May      208
June     230
July     185
Aug.     190
Sept.    200
Oct.     182
Nov.     205
Dec.     196


Solution To calculate the median, the observations are sorted in ascending order. Therefore, we have the order of the observations and their respective positions:

Value      180   182   185   190   195   196   200   203   205   208    210    230
Position   1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th   11th   12th

The median will be the mean between the sixth and the seventh elements, since n is an even number, that is:

$$Md = \frac{X_{\frac{12}{2}} + X_{\frac{12}{2}+1}}{2} = \frac{196 + 200}{2} = 198$$

Excel calculates the median of a set of data through the MEDIAN function. Note that the median does not consider the order of magnitude of the original variable’s values. If, for instance, the highest value were 400 instead of 230, the median would be exactly the same, whereas the mean would be much higher. The median is also known as the 2nd quartile (Q2), 50th percentile (P50), or 5th decile (D5). These definitions will be studied in more detail in the following sections.
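Expression (3.8) and the remark about extreme values can be illustrated together. In the Python sketch below, the function name `median` is a hypothetical helper written to mirror the even and odd cases of the expression; the standard library’s `statistics.median` should give the same result.

```python
def median(values):
    """Median of ungrouped data, following Expression (3.8)."""
    x = sorted(values)
    n = len(x)
    mid = n // 2
    if n % 2 == 0:
        # 1-based positions n/2 and n/2 + 1 are indices mid - 1 and mid
        return (x[mid - 1] + x[mid]) / 2
    return x[mid]  # 1-based position (n + 1)/2

# Monthly treadmill production from Table 3.E.25
production = [210, 180, 203, 195, 208, 230, 185, 190, 200, 182, 205, 196]
md = median(production)

# Replace the highest value (230) by 400: the median is unchanged, the mean is not
with_outlier = [400 if v == 230 else v for v in production]
md_outlier = median(with_outlier)
mean_before = sum(production) / len(production)
mean_after = sum(with_outlier) / len(with_outlier)

print(md, md_outlier, round(mean_before, 2), round(mean_after, 2))
```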

3.4.1.1.2.2 Case 2: Median of Grouped Discrete Data Here, the calculation of the median is similar to the previous case. However, the data are grouped in a frequency distribution table. Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. We can see in the cumulative frequency column the group that has this position and, consequently, its corresponding value in the first column (median). If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we directly obtain their corresponding value in the first column (median). If each position corresponds to a distinct group, the median will be the average between the corresponding values defined in the first column. Example 3.18 Table 3.E.26 shows the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies. Calculate the median.

TABLE 3.E.26 Frequency Distribution
Number of Bedrooms   Fi    Fac
1                    6     6
2                    13    19
3                    20    39
4                    15    54
5                    7     61
6                    6     67
7                    3     70
Sum                  70


Solution Since n is an even number, the median will be the average of the values that occupy positions n/2 and (n/2) + 1, that is:

$$Md = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} = \frac{X_{35} + X_{36}}{2}$$

Based on Table 3.E.26, we can see that the third group contains all the elements between positions 20 and 39 (including 35 and 36), whose corresponding value is 3. Therefore, the median is:

$$Md = \frac{3 + 3}{2} = 3$$
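The lookup in the cumulative frequency column can be sketched in Python as follows. This is an illustration; `value_at` and `grouped_median` are hypothetical helper names, and the search relies on `bisect_left` to find the first group whose cumulative frequency reaches the requested position.

```python
from bisect import bisect_left
from itertools import accumulate

def value_at(position, values, cum_freq):
    """Value that occupies a 1-based position in the sorted data."""
    return values[bisect_left(cum_freq, position)]

def grouped_median(values, freqs):
    """Median of grouped discrete data via the cumulative frequency column."""
    cum_freq = list(accumulate(freqs))
    n = cum_freq[-1]
    if n % 2 == 1:
        return value_at((n + 1) // 2, values, cum_freq)
    lower = value_at(n // 2, values, cum_freq)
    upper = value_at(n // 2 + 1, values, cum_freq)
    return (lower + upper) / 2

# Number of bedrooms and absolute frequencies from Table 3.E.26
bedrooms = [1, 2, 3, 4, 5, 6, 7]
freqs = [6, 13, 20, 15, 7, 6, 3]

md = grouped_median(bedrooms, freqs)
print(md)
```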

3.4.1.1.2.3 Case 3: Median of Continuous Data Grouped into Classes
For continuous variables grouped into classes, in which the data are presented in a frequency distribution table, we apply the following steps to calculate the median:
Step 1: Calculate the position of the median, not taking into consideration if n is an even or an odd number, through the following expression:

$$Pos(Md) = n/2 \tag{3.9}$$

Step 2: Identify the class that contains the median (median class) from the cumulative frequency column.
Step 3: Calculate the median using the following expression:

$$Md = LI_{Md} + \frac{\left(\dfrac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md} \tag{3.10}$$

where:
$LI_{Md}$ = lower limit of the median class;
$F_{Md}$ = absolute frequency of the median class;
$F_{ac(Md-1)}$ = cumulative frequency from the previous class to the median class;
$A_{Md}$ = range of the median class;
n = total number of observations.
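Steps 1 to 3 can be sketched in Python as below, applied here to the salary classes of Table 3.E.24. The function name `median_from_classes` and its argument layout are assumptions made for illustration, not part of the original text.

```python
def median_from_classes(lower_limits, widths, freqs):
    """Median of data grouped into classes, Expressions (3.9) and (3.10)."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the frequency table is empty")
    position = n / 2                        # Step 1: Expression (3.9)
    cum_prev = 0                            # cumulative frequency before the class
    for lower, width, freq in zip(lower_limits, widths, freqs):
        if cum_prev + freq >= position:     # Step 2: median class found
            # Step 3: Expression (3.10)
            return lower + (position - cum_prev) / freq * width
        cum_prev += freq
    raise ValueError("median class not found")

# Salary classes (US$ 1000.00) from Table 3.E.24
lower_limits = [1, 3, 5, 7, 9, 11]
widths = [2, 2, 2, 2, 2, 2]
freqs = [240, 480, 320, 150, 130, 80]

md = median_from_classes(lower_limits, widths, freqs)
print(round(md, 4))
```

The interpolation assumes that the observations are spread evenly within the median class.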

Example 3.19 Consider the data in Example 3.16 regarding the classes of salaries paid to the employees of a company and their respective absolute and cumulative frequencies (Table 3.E.27). Calculate the median.

TABLE 3.E.27 Classes of Salaries (US\$ 1000.00) and Their Respective Absolute and Cumulative Frequencies
Classes    Fi      Fac
1 ├ 3      240     240
3 ├ 5      480     720
5 ├ 7      320     1040
7 ├ 9      150     1190
9 ├ 11     130     1320
11 ├ 13    80      1400
Sum        1400


Solution In the case of continuous data grouped into classes, let’s apply the following steps to calculate the median:
Step 1: First, we calculate the position of the median:

$$Pos(Md) = \frac{n}{2} = \frac{1400}{2} = 700$$

Step 2: Through the cumulative frequency column, we can see that the median is in the second class (3 ├ 5).
Step 3: Calculating the median:

$$Md = LI_{Md} + \frac{\left(\dfrac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md}$$

where: $LI_{Md} = 3$, $F_{Md} = 480$, $F_{ac(Md-1)} = 240$, $A_{Md} = 2$, $n = 1400$.
Therefore, we have:

$$Md = 3 + \frac{(700 - 240)}{480} \times 2 = 4.9167$$

that is, US\$ 4,916.67.

3.4.1.1.3 Mode

The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies. 3.4.1.1.3.1 Case 1: Mode of Ungrouped Data Consider a set of observations X1, X2, …, Xn of a certain variable. The mode is the value that appears with the highest frequency. Excel gives us the mode of a set of data through the MODE function. Example 3.20 The production of carrots in a certain company is divided into five phases, including the post-harvest handling phase. Table 3.E.28 shows the average time the processing (in seconds) takes in this phase for 20 observations. Calculate the mode.

TABLE 3.E.28 Processing Time in the Post-Harvest Handling Phase in Seconds
45.0  44.5  44.0  45.0  46.5  46.0  45.8  44.8  45.0  46.2
44.5  45.0  45.4  44.9  45.7  46.2  44.7  45.6  46.3  44.9

Solution The mode is 45.0, which is the most frequent value in the dataset (Table 3.E.28). This value could be determined directly in Excel by using the MODE function.
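The same count can be reproduced with a frequency tally. The Python sketch below is illustrative; it returns a list because, in general, more than one value may share the highest frequency, a situation that does not occur in this example.

```python
from collections import Counter

# Processing times (seconds) from Table 3.E.28
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

counts = Counter(times)                     # value -> number of occurrences
highest = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest)

print(modes, highest)
```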

3.4.1.1.3.2 Case 2: Mode of Grouped Qualitative or Discrete Data For discrete qualitative or quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table. It is the value with the highest absolute frequency. Example 3.21 A TV station interviewed 500 viewers trying to analyze their preferences in terms of interest categories. The result of the survey can be seen in Table 3.E.29. Calculate the mode.


TABLE 3.E.29 Viewers’ Preferences in Terms of Interest Categories
Interest Categories   Fi
Movies                71
Soap Operas           46
News                  90
Comedy                98
Sports                120
Concerts              35
Variety               40
Sum                   500

Solution Based on Table 3.E.29, we can see that the mode corresponds to the category Sports (the highest absolute frequency). This example illustrates that the mode is the only measure of position that can also be used for qualitative variables.
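Since the mode only requires frequencies, it works unchanged for category labels. A minimal Python sketch with the survey data follows; the dictionary layout is an assumption made for illustration.

```python
# Interest category -> absolute frequency, from Table 3.E.29
preferences = {"Movies": 71, "Soap Operas": 46, "News": 90, "Comedy": 98,
               "Sports": 120, "Concerts": 35, "Variety": 40}

total = sum(preferences.values())
mode = max(preferences, key=preferences.get)   # category with the highest Fi

print(mode, preferences[mode], total)
```

A mean or a median of these labels would be meaningless, which is why only the mode applies here.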

3.4.1.1.3.3 Case 3: Mode of Continuous Data Grouped into Classes
For continuous data grouped into classes, there are several procedures to calculate the mode, such as Czuber’s and King’s methods. Czuber’s method has the following phases:
Step 1: Identify the class that has the mode (modal class), which is the one with the highest absolute frequency.
Step 2: Calculate the mode (Mo):

$$Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \times A_{Mo} \tag{3.11}$$

where:
$LI_{Mo}$ = lower limit of the modal class;
$F_{Mo}$ = absolute frequency of the modal class;
$F_{Mo-1}$ = absolute frequency from the previous class to the modal class;
$F_{Mo+1}$ = absolute frequency from the posterior class to the modal class;
$A_{Mo}$ = range of the modal class.

Example 3.22 A set of continuous data with 200 observations is grouped into classes with their respective absolute frequencies, as shown in Table 3.E.30. Determine the mode using Czuber’s method.

TABLE 3.E.30 Continuous Data Grouped into Classes and Their Respective Frequencies
Class      Fi
01 ├ 10    21
10 ├ 20    36
20 ├ 30    58
30 ├ 40    24
40 ├ 50    19
Sum        200


Solution Considering continuous data grouped into classes, we can use Czuber’s method to calculate the mode:
Step 1: Based on Table 3.E.30, we can see that the modal class is the third one (20 ├ 30), since it has the highest absolute frequency.
Step 2: Calculating the mode (Mo):

$$Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \times A_{Mo}$$

where: $LI_{Mo} = 20$, $F_{Mo} = 58$, $F_{Mo-1} = 36$, $F_{Mo+1} = 24$, $A_{Mo} = 10$.
Therefore, we have:

$$Mo = 20 + \frac{58 - 36}{2 \times 58 - (36 + 24)} \times 10 = 23.9$$

On the other hand, King’s method consists of the following phases:
Step 1: Identify the modal class (the one with the highest absolute frequency).
Step 2: Calculate the mode (Mo) using the following expression:

$$Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \times A_{Mo} \tag{3.12}$$

where:
$LI_{Mo}$ = lower limit of the modal class;
$F_{Mo-1}$ = absolute frequency from the previous class to the modal class;
$F_{Mo+1}$ = absolute frequency from the posterior class to the modal class;
$A_{Mo}$ = range of the modal class.

Example 3.23 Once again, consider the data from the previous example. Use King’s method to determine the mode.
Solution In Example 3.22, we saw that: $LI_{Mo} = 20$, $F_{Mo+1} = 24$, $F_{Mo-1} = 36$, $A_{Mo} = 10$.
Applying Expression (3.12):

$$Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \times A_{Mo} = 20 + \frac{24}{36 + 24} \times 10 = 24$$
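Expressions (3.11) and (3.12) differ only in how the neighboring frequencies enter the interpolation, which a side-by-side sketch makes visible. The Python function names `czuber_mode` and `king_mode` below are hypothetical, and the inputs are the modal-class quantities identified in Example 3.22.

```python
def czuber_mode(lower, width, f_prev, f_mode, f_next):
    """Czuber's method, Expression (3.11)."""
    return lower + (f_mode - f_prev) / (2 * f_mode - (f_prev + f_next)) * width

def king_mode(lower, width, f_prev, f_next):
    """King's method, Expression (3.12); ignores the modal class frequency."""
    return lower + f_next / (f_prev + f_next) * width

# Modal class 20 |- 30 from Table 3.E.30 and its neighbors
mo_czuber = czuber_mode(lower=20, width=10, f_prev=36, f_mode=58, f_next=24)
mo_king = king_mode(lower=20, width=10, f_prev=36, f_next=24)

print(round(mo_czuber, 1), round(mo_king, 1))
```

Both estimates fall inside the modal class, but they need not coincide, as the two examples show.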

3.4.1.2 Quantiles
According to Bussab and Morettin (2011), using only measures of central tendency may not be suitable to represent a set of data, since they are also affected by extreme values. Moreover, with these measures alone, the researcher cannot have a clear idea of the data dispersion and symmetry. As an alternative, we can use quantiles, such as quartiles, deciles, and percentiles. The 2nd quartile (Q2), 5th decile (D5), or 50th percentile (P50) correspond to the median; therefore, they are measures of central tendency.

3.4.1.2.1 Quartiles
Quartiles (Qi, i = 1, 2, 3) are measures of position that divide a set of data, sorted in ascending order, into four parts of equal size.

Min. | Q1 | Md = Q2 | Q3 | Max.


Thus, the 1st Quartile (Q1 or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1. The 2nd Quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less than Q2 and the other 50% are greater than Q2. The 3rd Quartile (Q3 or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.

3.4.1.2.2 Deciles
Deciles (Di, i = 1, 2, ..., 9) are measures of position that divide a set of data, sorted in ascending order, into 10 equal parts.

Min. | D1 | D2 | D3 | D4 | D5 = Md | D6 | D7 | D8 | D9 | Max.

Therefore, the 1st decile (D1 or 10th percentile) indicates that 10% of the data are less than D1 or that 90% of the data are greater than D1. The 2nd decile (D2 or 20th percentile) indicates that 20% of the data are less than D2 or that 80% of the data are greater than D2. And so on, until the 9th decile (D9 or 90th percentile), indicating that 90% of the data are less than D9 or that 10% of the data are greater than D9.

3.4.1.2.3 Percentiles
Percentiles (Pi, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts. Hence, the 1st percentile (P1) indicates that 1% of the data is less than P1 or that 99% of the data are greater than P1. The 2nd percentile (P2) indicates that 2% of the data are less than P2 or that 98% of the data are greater than P2. And so on, until the 99th percentile (P99), which indicates that 99% of the data are less than P99 or that 1% of the data is greater than P99.

3.4.1.2.3.1 Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data
If the position of the quartile, decile, or percentile we are interested in is an integer or is exactly between two positions, calculating the respective quartile, decile, or percentile becomes easier. However, this does not always happen (imagine a sample with 33 elements for which the objective is to calculate the 67th percentile). Many methods have been proposed for this kind of calculation; they lead to close, but not identical, results. We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:
Step 1: Sort the observations in ascending order.
Step 2: Determine the position of the quartile, decile, or percentile, of order i, we are interested in:

$$\text{Quartile} \rightarrow Pos(Q_i) = \left[\frac{n}{4} \times i\right] + \frac{1}{2}, \quad i = 1, 2, 3 \tag{3.13}$$

$$\text{Decile} \rightarrow Pos(D_i) = \left[\frac{n}{10} \times i\right] + \frac{1}{2}, \quad i = 1, 2, \ldots, 9 \tag{3.14}$$

$$\text{Percentile} \rightarrow Pos(P_i) = \left[\frac{n}{100} \times i\right] + \frac{1}{2}, \quad i = 1, 2, \ldots, 99 \tag{3.15}$$

Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position. Assume that Pos(Q1) = 3.75, that is, the value of Q1 is between the 3rd and 4th positions (75% closer to the 4th position, and 25% to the 3rd position). Therefore, Q1 will be the sum of the value that corresponds to the 3rd position multiplied by 0.25, with the value that corresponds to the 4th position multiplied by 0.75.


Example 3.24 Consider the data in Example 3.20 regarding the average carrot processing time in the post-harvest handling phase, as specified in Table 3.E.28. Determine Q1 (1st quartile), Q3 (3rd quartile), D2 (2nd decile), and P64 (64th percentile).
Solution For ungrouped continuous data, we must apply the following steps to determine the quartiles, deciles, and percentiles we are interested in:
Step 1: Sort the observations in ascending order.

Position   1st    2nd    3rd    4th    5th    6th    7th    8th    9th    10th
Value      44.0   44.5   44.5   44.7   44.8   44.9   44.9   45.0   45.0   45.0
Position   11th   12th   13th   14th   15th   16th   17th   18th   19th   20th
Value      45.0   45.4   45.6   45.7   45.8   46.0   46.2   46.2   46.3   46.5

Step 2: Calculation of the positions of Q1, Q3, D2, and P64:
a) $Pos(Q_1) = \left[\frac{20}{4} \times 1\right] + \frac{1}{2} = 5.5$
b) $Pos(Q_3) = \left[\frac{20}{4} \times 3\right] + \frac{1}{2} = 15.5$
c) $Pos(D_2) = \left[\frac{20}{10} \times 2\right] + \frac{1}{2} = 4.5$
d) $Pos(P_{64}) = \left[\frac{20}{100} \times 64\right] + \frac{1}{2} = 13.3$
Step 3: Calculating Q1, Q3, D2, and P64:
a) Pos(Q1) = 5.5 means that its corresponding value is halfway between positions 5 and 6, that is, Q1 is simply the average of the values that correspond to both positions:

$$Q_1 = \frac{44.8 + 44.9}{2} = 44.85$$

b) Pos(Q3) = 15.5 means that the value we are interested in is halfway between positions 15 and 16, so Q3 can be calculated as follows:

$$Q_3 = \frac{45.8 + 46.0}{2} = 45.9$$

c) Pos(D2) = 4.5 means that the value we are interested in is halfway between positions 4 and 5, so D2 can be calculated as follows:

$$D_2 = \frac{44.7 + 44.8}{2} = 44.75$$

d) Pos(P64) = 13.3 means that the value we are interested in is 70% closer to position 13 and 30% closer to position 14, so P64 can be calculated as follows:

$$P_{64} = (0.70 \times 45.6) + (0.30 \times 45.7) = 45.63$$

Interpretation Q1 = 44.85 indicates that, in 25% of the observations (the first 5 observations listed in Step 1), the carrot processing time in the post-harvest handling phase is less than 44.85 seconds, or that in 75% of the observations (the remaining 15 observations), the processing time is greater than 44.85. Q3 = 45.9 indicates that, in 75% of the observations (15 of them), the processing time is less than 45.9 seconds, or that in 5 observations, the processing time is greater than 45.9. D2 = 44.75 indicates that, in 20% of the observations (4 of them), the processing time is less than 44.75 seconds, or that in 80% of the observations (16 of them), the processing time is greater than 44.75. P64 = 45.63 indicates that, in 64% of the observations (12.8 of them), the processing time is less than 45.63 seconds, or that in 36% of the observations (7.2 of them) the processing time is greater than 45.63.

Excel calculates the quartile of order i (i = 0, 1, 2, 3, 4) through the QUARTILE function. As arguments of the function, we must define the array or set of data for which we want to calculate the respective quartile (it does not need to be in ascending order), in addition to the quartile we are interested in (minimum value = 0; 1st quartile = 1; 2nd quartile = 2; 3rd quartile = 3; maximum value = 4). The k-th percentile (k between 0 and 1) can also be calculated in Excel through the PERCENTILE function. As arguments of the function, we must define the array we are interested in, in addition to the value of k (for example, in the case of P64, k = 0.64).
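The position-and-interpolation method of Expressions (3.13) to (3.15) can be sketched in Python as follows. The function names are hypothetical, and clamping the position to the range 1 to n (needed, for instance, for very low or very high percentiles in small samples) is an assumption added here, not something stated in the text.

```python
import math

def value_at_position(sorted_values, position):
    """Linear interpolation at a 1-based, possibly fractional position."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    return ((1 - fraction) * sorted_values[lower - 1]
            + fraction * sorted_values[lower])

def quantile(values, i, parts):
    """Quantile of order i; parts = 4 (quartile), 10 (decile), 100 (percentile)."""
    x = sorted(values)                              # Step 1
    n = len(x)
    position = n / parts * i + 0.5                  # Step 2: (3.13)-(3.15)
    position = min(max(position, 1), n)             # assumption: clamp to [1, n]
    return value_at_position(x, position)           # Step 3

# Processing times (seconds) from Table 3.E.28
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

q1 = quantile(times, 1, 4)
q3 = quantile(times, 3, 4)
d2 = quantile(times, 2, 10)
p64 = quantile(times, 64, 100)
print(round(q1, 2), round(q3, 2), round(d2, 2), round(p64, 2))
```

Other software may use different position rules, so small differences from these values are to be expected.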


The calculation of quartiles, deciles, and percentiles using SPSS and Stata statistical software will be demonstrated in Sections 3.6 and 3.7, respectively. SPSS and Stata software use two methods to calculate quartiles, deciles, or percentiles. One of them is called Tukey’s Hinges and it is the method used in this book. The other method is related to the Weighted Average, whose calculations are more complex. Excel, on the other hand, implements another algorithm that gets similar results.

3.4.1.2.3.2 Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data

Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case, except that the data are grouped in a frequency distribution table. In this table, the data must be sorted in ascending order, with their respective absolute and cumulative frequencies. First, we determine the position of the quartile, decile, or percentile of order i through Expressions (3.13), (3.14), and (3.15), respectively. Then, from the cumulative frequency column, we identify the group(s) that contain(s) this position. If the position is an integer, its corresponding value is read directly from the first column. If the position is fractional (for example, 2.5) and the 2nd and 3rd positions are in the same group, the value is also read directly. On the other hand, if the position is fractional (for example, 4.25) and positions 4 and 5 are in different groups, we add the value that corresponds to the 4th position multiplied by 0.75 to the value that corresponds to the 5th position multiplied by 0.25 (as in Case 1).

Example 3.25
Consider the data in Example 3.18 regarding the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies (Table 3.E.26). Calculate Q1, D4, and P96.

Solution
Let’s calculate the positions of Q1, D4, and P96 through Expressions (3.13), (3.14), and (3.15), respectively, and their corresponding values:

a) $\text{Pos}(Q_1) = \left[\frac{70}{4} \times 1\right] + \frac{1}{2} = 18$
Based on Table 3.E.26, position 18 is in the second group (2 bedrooms), so $Q_1 = 2$.

b) $\text{Pos}(D_4) = \left[\frac{70}{10} \times 4\right] + \frac{1}{2} = 28.5$
Through the cumulative frequency column, we can see that positions 28 and 29 are in the third group (3 bedrooms), so $D_4 = 3$.

c) $\text{Pos}(P_{96}) = \left[\frac{70}{100} \times 96\right] + \frac{1}{2} = 67.7$
That is, P96 is 70% closer to position 68 and 30% closer to position 67. Through the cumulative frequency column, we can see that position 68 is in the seventh group (7 bedrooms) and position 67 is in the sixth group (6 bedrooms), so P96 can be calculated as follows:

$$P_{96} = (0.70 \times 7) + (0.30 \times 6) = 6.7$$

Interpretation
Q1 = 2 indicates that 25% of the real estate properties have less than 2 bedrooms, or that 75% of the real estate properties have more than 2 bedrooms.
D4 = 3 indicates that 40% of the real estate properties have less than 3 bedrooms, or that 60% of the real estate properties have more than 3 bedrooms.
P96 = 6.7 indicates that 96% of the real estate properties have less than 6.7 bedrooms, or that 4% of the real estate properties have more than 6.7 bedrooms.
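The lookup-and-interpolate procedure of Case 2 can be sketched in Python. This is only a minimal illustration under stated assumptions, not the routine implemented in SPSS, Stata, or Excel: the function names are hypothetical, and, because Table 3.E.26 is not reproduced here, the sketch is applied to the goals frequency table of Example 3.29 (Table 3.E.33) purely for illustration.

```python
from fractions import Fraction

def quantile_position(n, order, parts):
    """Position of a quantile: (n / parts) * order + 1/2 (parts = 4, 10, or 100)."""
    return Fraction(n, parts) * order + Fraction(1, 2)

def value_at(position, values, freqs):
    """Value that occupies a 1-based integer position in the sorted grouped data."""
    cumulative = 0
    for value, freq in zip(values, freqs):
        cumulative += freq
        if position <= cumulative:
            return value
    raise ValueError("position exceeds the number of observations")

def grouped_quantile(values, freqs, order, parts):
    """Quantile of grouped discrete data, interpolating between neighbors."""
    n = sum(freqs)
    pos = quantile_position(n, order, parts)
    if pos <= 1:                     # assumption: clamp to the smallest value
        return float(values[0])
    if pos >= n:                     # assumption: clamp to the largest value
        return float(values[-1])
    lower = int(pos)                 # integer part of the position
    weight = pos - lower             # fractional part, e.g. 7/10 for 67.7
    low_value = value_at(lower, values, freqs)
    high_value = value_at(lower + 1, values, freqs)
    return float((1 - weight) * low_value + weight * high_value)

# Positions of Example 3.25 (n = 70): 18, 28.5, and 67.7
positions = [quantile_position(70, 1, 4),
             quantile_position(70, 4, 10),
             quantile_position(70, 96, 100)]

# Illustration with the goals table of Example 3.29 (n = 30)
goals = [0, 1, 2, 3, 4, 5, 6]
freqs = [5, 8, 6, 4, 4, 2, 1]
q1 = grouped_quantile(goals, freqs, 1, 4)
d4 = grouped_quantile(goals, freqs, 4, 10)
p96 = grouped_quantile(goals, freqs, 96, 100)
```

Exact fractions are used for the position so that the integer and fractional parts are not distorted by floating-point rounding.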

3.4.1.2.3.3 Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes

For continuous data grouped into classes and represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:

Step 1: Calculate the position of the quartile, decile, or percentile of order i through the following expressions:

$$\text{Quartile: } \text{Pos}(Q_i) = \frac{n}{4} \times i, \quad i = 1, 2, 3 \tag{3.16}$$

$$\text{Decile: } \text{Pos}(D_i) = \frac{n}{10} \times i, \quad i = 1, 2, \ldots, 9 \tag{3.17}$$

$$\text{Percentile: } \text{Pos}(P_i) = \frac{n}{100} \times i, \quad i = 1, 2, \ldots, 99 \tag{3.18}$$


Step 2: Identify the class that contains the quartile, decile, or percentile of order i (quartile class, decile class, or percentile class) from the cumulative frequency column.

Step 3: Calculate the quartile, decile, or percentile of order i through the following expressions:

$$\text{Quartile: } Q_i = LL_{Q_i} + \left(\frac{\text{Pos}(Q_i) - F_{\text{cum}(Q_i - 1)}}{F_{Q_i}}\right) \times R_{Q_i}, \quad i = 1, 2, 3 \tag{3.19}$$

where:
$LL_{Q_i}$ = lower limit of the quartile class;
$F_{\text{cum}(Q_i - 1)}$ = cumulative frequency of the class before the quartile class;
$F_{Q_i}$ = absolute frequency of the quartile class;
$R_{Q_i}$ = range of the quartile class.

$$\text{Decile: } D_i = LL_{D_i} + \left(\frac{\text{Pos}(D_i) - F_{\text{cum}(D_i - 1)}}{F_{D_i}}\right) \times R_{D_i}, \quad i = 1, 2, \ldots, 9 \tag{3.20}$$

where:
$LL_{D_i}$ = lower limit of the decile class;
$F_{\text{cum}(D_i - 1)}$ = cumulative frequency of the class before the decile class;
$F_{D_i}$ = absolute frequency of the decile class;
$R_{D_i}$ = range of the decile class.

$$\text{Percentile: } P_i = LL_{P_i} + \left(\frac{\text{Pos}(P_i) - F_{\text{cum}(P_i - 1)}}{F_{P_i}}\right) \times R_{P_i}, \quad i = 1, 2, \ldots, 99 \tag{3.21}$$

where:
$LL_{P_i}$ = lower limit of the percentile class;
$F_{\text{cum}(P_i - 1)}$ = cumulative frequency of the class before the percentile class;
$F_{P_i}$ = absolute frequency of the percentile class;
$R_{P_i}$ = range of the percentile class.

Example 3.26 A survey on the health conditions of 250 patients collected information about their weight. The data are grouped into classes, as shown in Table 3.E.31. Calculate the first quartile, the seventh decile, and the 60th percentile.

TABLE 3.E.31 Absolute and Cumulative Frequency Distribution of Patients’ Weight Grouped into Classes

Class         Fi     Fac
50 ├ 60       18      18
60 ├ 70       28      46
70 ├ 80       49      95
80 ├ 90       66     161
90 ├ 100      40     201
100 ├ 110     33     234
110 ├ 120     16     250
Sum          250


Solution
Let’s apply the three steps to calculate Q1, D7, and P60:

Step 1: Let’s calculate the position of the first quartile, the seventh decile, and the 60th percentile through Expressions (3.16), (3.17), and (3.18), respectively:

$$\text{1st Quartile: } \text{Pos}(Q_1) = \frac{250}{4} \times 1 = 62.5$$

$$\text{7th Decile: } \text{Pos}(D_7) = \frac{250}{10} \times 7 = 175$$

$$\text{60th Percentile: } \text{Pos}(P_{60}) = \frac{250}{100} \times 60 = 150$$

Step 2: Let’s identify the class that contains Q1, D7, and P60 from the cumulative frequency column in Table 3.E.31:
Q1 is in the 3rd class (70 ├ 80)
D7 is in the 5th class (90 ├ 100)
P60 is in the 4th class (80 ├ 90)

Step 3: Let’s calculate Q1, D7, and P60 from Expressions (3.19), (3.20), and (3.21), respectively:

$$Q_1 = LL_{Q_1} + \left(\frac{\text{Pos}(Q_1) - F_{\text{cum}(Q_1 - 1)}}{F_{Q_1}}\right) \times R_{Q_1} = 70 + \left(\frac{62.5 - 46}{49}\right) \times 10 = 73.37$$

$$D_7 = LL_{D_7} + \left(\frac{\text{Pos}(D_7) - F_{\text{cum}(D_7 - 1)}}{F_{D_7}}\right) \times R_{D_7} = 90 + \left(\frac{175 - 161}{40}\right) \times 10 = 93.5$$

$$P_{60} = LL_{P_{60}} + \left(\frac{\text{Pos}(P_{60}) - F_{\text{cum}(P_{60} - 1)}}{F_{P_{60}}}\right) \times R_{P_{60}} = 80 + \left(\frac{150 - 95}{66}\right) \times 10 = 88.33$$

Interpretation
Q1 = 73.37 indicates that 25% of the patients weigh less than 73.37 kg, or that 75% of the patients weigh more than 73.37 kg.
D7 = 93.5 indicates that 70% of the patients weigh less than 93.5 kg, or that 30% of the patients weigh more than 93.5 kg.
P60 = 88.33 indicates that 60% of the patients weigh less than 88.33 kg, or that 40% of the patients weigh more than 88.33 kg.
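The three steps of Case 3 can be sketched as a short Python function and checked against Example 3.26. This is a minimal sketch assuming classes of equal width that are closed on the left; the function name and argument layout are hypothetical choices made for illustration.

```python
def grouped_class_quantile(lower_limits, width, freqs, order, parts):
    """Quantile of continuous data grouped into classes, following Steps 1 to 3.

    order/parts select the quantile: (1, 4) is Q1, (7, 10) is D7, (60, 100) is P60.
    """
    n = sum(freqs)
    position = n * order / parts                      # Step 1: Expressions (3.16)-(3.18)
    cumulative_before = 0
    for lower, freq in zip(lower_limits, freqs):
        if position <= cumulative_before + freq:      # Step 2: class containing the position
            # Step 3: linear interpolation inside the class, Expressions (3.19)-(3.21)
            return lower + (position - cumulative_before) / freq * width
        cumulative_before += freq
    raise ValueError("position exceeds the number of observations")

# Table 3.E.31: patients' weight (kg), classes of width 10
lower_limits = [50, 60, 70, 80, 90, 100, 110]
freqs = [18, 28, 49, 66, 40, 33, 16]

q1 = grouped_class_quantile(lower_limits, 10, freqs, 1, 4)
d7 = grouped_class_quantile(lower_limits, 10, freqs, 7, 10)
p60 = grouped_class_quantile(lower_limits, 10, freqs, 60, 100)
print(round(q1, 2), round(d7, 2), round(p60, 2))      # 73.37 93.5 88.33
```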

3.4.1.3 Identifying the Existence of Univariate Outliers

A dataset can contain observations that are extremely distant from most other observations or that are inconsistent with them. These observations are called outliers, or atypical, discrepant, abnormal, or extreme values. Before deciding what to do with outliers, we must understand what caused them, since in many cases the cause determines the most suitable treatment. The main causes are measurement errors, execution/implementation errors, and variability inherent to the population. There are many outlier identification methods: boxplots, discordance models, Dixon’s test, Grubbs’ test, and Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (the BACON algorithm: Blocked Adaptive Computationally Efficient Outlier Nominators).

With boxplots (whose construction was studied in Section 3.3.2.5), outliers are identified from the IQR (interquartile range), which corresponds to the difference between the third and first quartiles:

$$IQR = Q_3 - Q_1 \tag{3.22}$$

Note that the IQR is the length of the box. Any value located more than 1.5·IQR below Q1 or above Q3 is considered a mild outlier and is represented by a circle. Such values may even be accepted in the population, but with some suspicion. Thus, the value X° of a variable is considered a mild outlier when:

$$X^{\circ} < Q_1 - 1.5 \times IQR \tag{3.23}$$

$$X^{\circ} > Q_3 + 1.5 \times IQR \tag{3.24}$$


FIG. 3.15 Boxplot with the identification of outliers.

Likewise, any value located more than 3·IQR below Q1 or above Q3 is considered an extreme outlier and is represented by an asterisk. Thus, the value X* of a variable is considered an extreme outlier when:

$$X^{*} < Q_1 - 3 \times IQR \tag{3.25}$$

$$X^{*} > Q_3 + 3 \times IQR \tag{3.26}$$

Fig. 3.15 illustrates the boxplot with the identification of outliers.

Example 3.27
Consider the sorted data in Example 3.24 regarding the average carrot processing time in the post-harvest handling phase:

44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5

where Q1 = 44.85, Q2 = 45, Q3 = 45.9, mean = 45.3, and mode = 45. Check whether there are mild and extreme outliers.

Solution
To verify whether there is a possible outlier, we must calculate:

$$Q_1 - 1.5 \times (Q_3 - Q_1) = 44.85 - 1.5 \times (45.9 - 44.85) = 43.275$$

$$Q_3 + 1.5 \times (Q_3 - Q_1) = 45.9 + 1.5 \times (45.9 - 44.85) = 47.475$$

Since there is no value in the distribution outside this interval, we conclude that there are no mild outliers. Consequently, there is no need to calculate the interval for extreme outliers, because it is wider and contains the interval for mild outliers.

If only one outlier is identified in a certain variable, the researcher can treat it through one of the existing procedures, for example, by eliminating this observation completely. On the other hand, if there is more than one outlier for one or more variables individually, eliminating all of these observations can reduce the sample size significantly. To avoid this problem, it is very common to replace the atypical values of a variable with the mean of that variable, thereby removing the outliers without discarding the observations (Fávero et al., 2009). The authors mention other procedures for dealing with outliers, such as replacing them with values from a regression, or winsorization, which, in an orderly way, replaces an equal number of extreme observations on each side of the distribution with the nearest non-extreme values.

Fávero et al. (2009) also highlight the importance of dealing with outliers when the researcher is interested in investigating the behavior of a certain variable without the influence of observations with atypical values. On the other hand, if the main goal is to analyze the behavior of these atypical observations or to define subgroups through discrepancy criteria, eliminating these observations or replacing their values may not be the best solution.
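The boxplot rule of Expressions (3.23) to (3.26) can be sketched in Python using the data of Example 3.27. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the quartiles are obtained with the position rule n·p + 1/2 and linear interpolation between neighbors, which reproduces Q1 = 44.85 and Q3 = 45.9 for this sample but is not guaranteed to match every software package.

```python
def quantile_by_position(sorted_data, fraction):
    """Quantile from the position n * fraction + 1/2 (valid for interior positions)."""
    position = len(sorted_data) * fraction + 0.5
    lower = int(position)                  # 1-based rank just below the position
    weight = position - lower              # fractional part of the position
    low_value = sorted_data[lower - 1]
    if weight == 0:
        return low_value
    return (1 - weight) * low_value + weight * sorted_data[lower]

def classify_outliers(sorted_data):
    """Return the mild fences plus the lists of mild and extreme outliers."""
    q1 = quantile_by_position(sorted_data, 0.25)
    q3 = quantile_by_position(sorted_data, 0.75)
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr
    extreme = [x for x in sorted_data if x < extreme_low or x > extreme_high]
    mild = [x for x in sorted_data
            if (x < mild_low or x > mild_high) and x not in extreme]
    return (mild_low, mild_high), mild, extreme

times = [44.0, 44.5, 44.5, 44.7, 44.8, 44.9, 44.9, 45.0, 45.0, 45.0,
         45.0, 45.4, 45.6, 45.7, 45.8, 46.0, 46.2, 46.2, 46.3, 46.5]

fences, mild, extreme = classify_outliers(times)
print(fences, mild, extreme)   # fences close to (43.275, 47.475); no outliers
```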

3.4.2 Measures of Dispersion or Variability

To study the behavior of a set of data, we use measures of central tendency, measures of dispersion, in addition to the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data. In order to characterize the dispersion or variability of the data, measures of dispersion are necessary. The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and the coefficient of variation (CV).

3.4.2.1 Range

The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and lowest values of the set of data:

$$R = X_{\max} - X_{\min} \tag{3.27}$$

3.4.2.2 Average Deviation

Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it is represented by $(X_i - \mu)$, and for sample data, by $(X_i - \bar{X})$. The modulus or absolute deviation ignores the sign and is denoted by $|X_i - \mu|$ or $|X_i - \bar{X}|$. Average deviation, or absolute average deviation, represents the arithmetic mean of absolute deviations.

3.4.2.2.1 Case 1: Average Deviation of Ungrouped Discrete and Continuous Data

The average deviation (D) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

$$D = \frac{\sum_{i=1}^{N} |X_i - \mu|}{N} \quad \text{(for the population)} \tag{3.28}$$

$$D = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \quad \text{(for samples)} \tag{3.29}$$

Example 3.28 Table 3.E.32 shows the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the average deviation.

TABLE 3.E.32 Distances Traveled (km)

12.4  22.6  18.9  9.7  14.5  22.5  26.3  17.7  31.2  20.4

Solution
For the data in Table 3.E.32, we have $\bar{X} = 19.62$. Applying Expression (3.29), we get the average deviation:

$$D = \frac{|12.4 - 19.62| + |22.6 - 19.62| + \cdots + |20.4 - 19.62|}{10} = 4.98$$

The average deviation can be directly calculated in Excel using the AVEDEV function.
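As a cross-check of Expression (3.29), the same calculation can be sketched in a few lines of Python. The function name is a hypothetical choice; the data are those of Table 3.E.32.

```python
def average_deviation(data):
    """Mean of the absolute deviations around the arithmetic mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]

mean_distance = sum(distances) / len(distances)
d = average_deviation(distances)
print(round(mean_distance, 2), round(d, 2))   # 19.62 4.98
```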

3.4.2.2.2 Case 2: Average Deviation of Grouped Discrete Data

For grouped data, presented in a frequency distribution table for m groups, the calculation of the average deviation is:

$$D = \frac{\sum_{i=1}^{m} |X_i - \mu| \cdot F_i}{N} \quad \text{(for the population)} \tag{3.30}$$

$$D = \frac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n} \quad \text{(for samples)} \tag{3.31}$$

bearing in mind that $\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.

Example 3.29 Table 3.E.33 shows the number of goals scored by the D.C. soccer team in their last 30 games, with their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.33 Frequency Distribution of Example 3.29

Number of Goals     Fi
0                    5
1                    8
2                    6
3                    4
4                    4
5                    2
6                    1
Sum                 30

Solution
The mean is $\bar{X} = \dfrac{0 \times 5 + 1 \times 8 + \cdots + 6 \times 1}{30} = 2.133$. The average deviation can be determined from the calculations presented in Table 3.E.34:

TABLE 3.E.34 Calculations of the Average Deviation for Example 3.29

Number of Goals     Fi     $|X_i - \bar{X}|$     $|X_i - \bar{X}| \cdot F_i$
0                    5     2.133                 10.667
1                    8     1.133                  9.067
2                    6     0.133                  0.800
3                    4     0.867                  3.467
4                    4     1.867                  7.467
5                    2     2.867                  5.733
6                    1     3.867                  3.867
Sum                 30                           41.067

Therefore, $D = \dfrac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n} = \dfrac{41.067}{30} = 1.369$.


3.4.2.2.3 Case 3: Average Deviation of Continuous Data Grouped into Classes

For continuous data grouped into classes, the calculation of the average deviation is:

$$D = \frac{\sum_{i=1}^{k} |X_i - \mu| \cdot F_i}{N} \quad \text{(for the population)} \tag{3.32}$$

$$D = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n} \quad \text{(for samples)} \tag{3.33}$$

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, $X_i$ represents the middle or central point of each class i, where $\bar{X} = \dfrac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$, as presented in Expression (3.6).

Example 3.30
In order to determine its variation due to genetic factors, a survey with 100 newborn babies collected information about their weight. Table 3.E.35 shows the data grouped into classes and their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.35 Newborn Babies’ Weight (in kg) Grouped into Classes

Class          Fi
2.0 ├ 2.5      10
2.5 ├ 3.0      24
3.0 ├ 3.5      31
3.5 ├ 4.0      22
4.0 ├ 4.5      13
Sum           100

Solution
First, we must calculate $\bar{X}$:

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} = \frac{2.25 \times 10 + 2.75 \times 24 + 3.25 \times 31 + 3.75 \times 22 + 4.25 \times 13}{100} = 3.270$$

The average deviation can be determined from the calculations presented in Table 3.E.36:

TABLE 3.E.36 Calculations of the Average Deviation for Example 3.30

Class          Fi     $X_i$     $|X_i - \bar{X}|$     $|X_i - \bar{X}| \cdot F_i$
2.0 ├ 2.5      10     2.25      1.02                  10.20
2.5 ├ 3.0      24     2.75      0.52                  12.48
3.0 ├ 3.5      31     3.25      0.02                   0.62
3.5 ├ 4.0      22     3.75      0.48                  10.56
4.0 ├ 4.5      13     4.25      0.98                  12.74
Sum           100                                     46.6

Therefore, $D = \dfrac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n} = \dfrac{46.6}{100} = 0.466$.
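Expressions (3.31) and (3.33) differ only in what plays the role of $X_i$ (the group value or the class midpoint), so a single frequency-weighted function covers both. The following Python sketch, with a hypothetical function name, reproduces the results of Examples 3.29 and 3.30.

```python
def grouped_average_deviation(values, freqs):
    """Frequency-weighted average deviation; values are group values or class midpoints."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n

# Example 3.29: goals scored (grouped discrete data)
goals = [0, 1, 2, 3, 4, 5, 6]
goal_freqs = [5, 8, 6, 4, 4, 2, 1]
d_goals = grouped_average_deviation(goals, goal_freqs)

# Example 3.30: newborn weights (class midpoints of width-0.5 classes)
midpoints = [2.25, 2.75, 3.25, 3.75, 4.25]
weight_freqs = [10, 24, 31, 22, 13]
d_weights = grouped_average_deviation(midpoints, weight_freqs)

print(round(d_goals, 3), round(d_weights, 3))   # 1.369 0.466
```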


3.4.2.3 Variance Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.

3.4.2.3.1 Case 1: Variance of Ungrouped Discrete and Continuous Data

Instead of considering the mean of absolute deviations, as discussed in the previous section, it is more common to calculate the mean of squared deviations. This measure is known as variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \quad \text{(for the population)} \tag{3.34}$$

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1} \quad \text{(for samples)} \tag{3.35}$$

The relationship between the sample variance ($S^2$) and the population variance ($\sigma^2$) is given by:

$$S^2 = \frac{N}{n - 1} \cdot \sigma^2 \tag{3.36}$$

Example 3.31
Consider the data in Example 3.28 regarding the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the variance.

Solution
We saw in Example 3.28 that $\bar{X} = 19.62$. Applying Expression (3.35), we have:

$$S^2 = \frac{(12.4 - 19.62)^2 + (22.6 - 19.62)^2 + \cdots + (20.4 - 19.62)^2}{9} = 41.94$$

The sample variance can be directly calculated in Excel using the VAR.S function. To calculate the population variance, we must use the VAR.P function.
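For readers working outside Excel, the result of Example 3.31 can be checked with Python’s standard `statistics` module, whose `variance` and `pvariance` functions divide by n − 1 and n, respectively. The sketch below also evaluates the shortcut form on the right-hand side of Expression (3.35) and the relationship of Expression (3.36), reading N there as the size of the same set of observations (an assumption made for this check).

```python
import statistics

distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
n = len(distances)

sample_var = statistics.variance(distances)        # divides by n - 1
population_var = statistics.pvariance(distances)   # divides by n

# Shortcut form: (sum of squares - (sum)^2 / n) / (n - 1)
shortcut = (sum(x * x for x in distances) - sum(distances) ** 2 / n) / (n - 1)

# Expression (3.36) applied to the same n observations
rescaled = n / (n - 1) * population_var

print(round(sample_var, 2))   # 41.94
```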

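3.4.2.3.2 Case 2: Variance of Grouped Discrete Data

For grouped data, represented in a frequency distribution table by m groups, the variance can be calculated as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{m} (X_i - \mu)^2 \cdot F_i}{N} = \frac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{N}}{N} \quad \text{(for the population)} \tag{3.37}$$

$$S^2 = \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{m} X_i \cdot F_i\right)^2}{n}}{n - 1} \quad \text{(for samples)} \tag{3.38}$$

where $\bar{X} = \dfrac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.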


Example 3.32 Consider the data in Example 3.29 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the variance. Solution As calculated in Example 3.29, the mean is X ¼ 2:133. The variance can be determined from the calculations presented in Table 3.E.37:

TABLE 3.E.37 Calculations of the Variance

Number of Goals     Fi     $(X_i - \bar{X})^2$     $(X_i - \bar{X})^2 \cdot F_i$
0                    5      4.551                  22.756
1                    8      1.284                  10.276
2                    6      0.018                   0.107
3                    4      0.751                   3.004
4                    4      3.484                  13.938
5                    2      8.218                  16.436
6                    1     14.951                  14.951
Sum                 30                             81.467

Therefore, $S^2 = \dfrac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \dfrac{81.467}{29} = 2.809$.

3.4.2.3.3 Case 3: Variance of Continuous Data Grouped into Classes

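For continuous data grouped into classes, we calculate the variance as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{k} (X_i - \mu)^2 \cdot F_i}{N} = \frac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{N}}{N} \quad \text{(for the population)} \tag{3.39}$$

$$S^2 = \frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \dfrac{\left(\sum_{i=1}^{k} X_i \cdot F_i\right)^2}{n}}{n - 1} \quad \text{(for samples)} \tag{3.40}$$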

Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.

Example 3.33 Consider the data in Example 3.30 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the variance. Solution As calculated in Example 3.30, we have X ¼ 3:270.


The variance can be determined from the calculations presented in Table 3.E.38:

TABLE 3.E.38 Calculations of the Variance for Example 3.33

Class          Fi     $X_i$     $(X_i - \bar{X})^2$     $(X_i - \bar{X})^2 \cdot F_i$
2.0 ├ 2.5      10     2.25      1.0404                  10.404
2.5 ├ 3.0      24     2.75      0.2704                   6.4896
3.0 ├ 3.5      31     3.25      0.0004                   0.0124
3.5 ├ 4.0      22     3.75      0.2304                   5.0688
4.0 ├ 4.5      13     4.25      0.9604                  12.4852
Sum           100                                       34.46

Therefore, $S^2 = \dfrac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \dfrac{34.46}{99} = 0.348$.

3.4.2.4 Standard Deviation

Since the variance considers the mean of squared deviations, it is expressed in the squared units of the variable, so its value tends to be large and difficult to interpret. To solve this problem, we calculate the square root of the variance, which returns to the original units. This measure is known as the standard deviation. It is calculated as follows:

$$\sigma = \sqrt{\sigma^2} \quad \text{(for the population)} \tag{3.41}$$

$$S = \sqrt{S^2} \quad \text{(for samples)} \tag{3.42}$$

Example 3.34
Once again, consider the data in Examples 3.28 or 3.31 regarding the distances traveled (in km) by the vehicle. Calculate the standard deviation.

Solution
We have $\bar{X} = 19.62$. The standard deviation is the square root of the variance, which has already been calculated in Example 3.31:

$$S = \sqrt{\frac{(12.4 - 19.62)^2 + (22.6 - 19.62)^2 + \cdots + (20.4 - 19.62)^2}{9}} = \sqrt{41.94} = 6.476$$

The standard deviation of a sample can be directly calculated in Excel using the STDEV.S function. To calculate the standard deviation of the population, we use the STDEV.P function.

Example 3.35
Consider the data in Examples 3.29 or 3.32 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the standard deviation.

Solution
The mean is $\bar{X} = 2.133$. The standard deviation is the square root of the variance, so it can be determined from the calculations of the variance already carried out in Example 3.32, as demonstrated in Table 3.E.37:

Therefore, $S = \sqrt{\dfrac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1}} = \sqrt{\dfrac{81.467}{29}} = \sqrt{2.809} = 1.676$.


Example 3.36
Consider the data in Examples 3.30 or 3.33 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the standard deviation.

Solution
We have $\bar{X} = 3.270$. The standard deviation is the square root of the variance, so it can be determined from the calculations of the variance already carried out in Example 3.33, as demonstrated in Table 3.E.38:

Therefore, $S = \sqrt{\dfrac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1}} = \sqrt{\dfrac{34.46}{99}} = \sqrt{0.348} = 0.59$.
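The grouped formulas (3.38) and (3.40), followed by the square root of Expression (3.42), can be sketched in Python as follows. The function name is hypothetical; the inputs are the frequency tables of Examples 3.29 and 3.30, and the outputs can be compared with Examples 3.32, 3.33, 3.35, and 3.36.

```python
import math

def grouped_sample_variance(values, freqs):
    """Sample variance of grouped data; values are group values or class midpoints."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / (n - 1)

# Goals scored (Examples 3.32 and 3.35)
var_goals = grouped_sample_variance([0, 1, 2, 3, 4, 5, 6], [5, 8, 6, 4, 4, 2, 1])
std_goals = math.sqrt(var_goals)

# Newborn weights, class midpoints (Examples 3.33 and 3.36)
var_weights = grouped_sample_variance([2.25, 2.75, 3.25, 3.75, 4.25],
                                      [10, 24, 31, 22, 13])
std_weights = math.sqrt(var_weights)

print(round(var_goals, 3), round(std_goals, 3))       # 2.809 1.676
print(round(var_weights, 3), round(std_weights, 2))   # 0.348 0.59
```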

3.4.2.5 Standard Error

The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} \quad \text{(for the population)} \tag{3.43}$$

$$S_{\bar{X}} = \frac{S}{\sqrt{n}} \quad \text{(for samples)} \tag{3.44}$$

The higher the number of measurements, the better the determination of the average value will be (higher accuracy), due to the compensation of random errors. Example 3.37 One of the phases in the preparation of concrete is mixing it in a concrete mixer. Tables 3.E.39 and 3.E.40 show the concrete mixing times (in seconds), considering a sample with 10 and 30 elements, respectively. Calculate the standard error for both cases and interpret the results.

TABLE 3.E.39 Concrete Mixing Time for a Sample With 10 Elements

124  111  132  142  108  127  133  144  148  105

TABLE 3.E.40 Concrete Mixing Time for a Sample With 30 Elements

125  102  135  126  132  129  156  112  108  134
126  104  143  140  138  129  119  114  107  121
124  112  148  145  130  125  120  127  106  148

Solution
First, let’s calculate the standard deviation for both samples:

$$S_1 = \sqrt{\frac{(124 - 127.4)^2 + (111 - 127.4)^2 + \cdots + (105 - 127.4)^2}{9}} = 15.364$$

$$S_2 = \sqrt{\frac{(125 - 126.167)^2 + (102 - 126.167)^2 + \cdots + (148 - 126.167)^2}{29}} = 14.227$$

To calculate the standard error, we must apply Expression (3.44):

$$S_{\bar{X}_1} = \frac{S_1}{\sqrt{n_1}} = \frac{15.364}{\sqrt{10}} = 4.858$$

$$S_{\bar{X}_2} = \frac{S_2}{\sqrt{n_2}} = \frac{14.227}{\sqrt{30}} = 2.598$$

Despite the small difference between the standard deviations, the standard error of the first sample is almost double that of the second sample. Therefore, the higher the number of measurements, the higher the accuracy.
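The comparison in Example 3.37 can be reproduced with a short Python sketch based on the standard `statistics` module. The function name `standard_error` is a hypothetical helper that implements Expression (3.44).

```python
import math
import statistics

sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]

def standard_error(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

se_10 = standard_error(sample_10)
se_30 = standard_error(sample_30)
print(round(se_10, 3), round(se_30, 3))   # 4.858 2.598
```

Because the two standard deviations are similar, most of the reduction in the standard error comes from the factor $\sqrt{n}$ in the denominator.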

3.4.2.6 Coefficient of Variation

The coefficient of variation (CV) is a relative measure of dispersion that provides the variation of the data in relation to the mean. The smaller the value, the more homogeneous the data will be, that is, the smaller the dispersion around the mean will be. It can be calculated as follows:

$$CV = \frac{\sigma}{\mu} \times 100\ (\%) \quad \text{(for the population)} \tag{3.45}$$

$$CV = \frac{S}{\bar{X}} \times 100\ (\%) \quad \text{(for samples)} \tag{3.46}$$

A CV can be considered low, indicating a set of data that is reasonably homogeneous, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.

Example 3.38
Calculate the coefficient of variation for both samples of the previous example.

Solution
Applying Expression (3.46), we have:

$$CV_1 = \frac{S_1}{\bar{X}_1} \times 100 = \frac{15.364}{127.4} \times 100 = 12.06\%$$

$$CV_2 = \frac{S_2}{\bar{X}_2} \times 100 = \frac{14.227}{126.167} \times 100 = 11.28\%$$

These results confirm the homogeneity of the data of the variable being studied for both samples. We conclude, therefore, that the mean is a good measure to represent the data. Let’s now study the measures of skewness and kurtosis.
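Expression (3.46) and the 30% rule of thumb mentioned above can be sketched in Python for the two concrete-mixing samples of Example 3.37. The function names are hypothetical, and the 30% threshold is passed as a parameter because, as noted, this standard varies according to the application.

```python
import statistics

sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

def is_homogeneous(sample, threshold=30.0):
    """Rule of thumb from the text: a CV below the threshold suggests homogeneous data."""
    return coefficient_of_variation(sample) < threshold

cv_10 = coefficient_of_variation(sample_10)
cv_30 = coefficient_of_variation(sample_30)
print(round(cv_10, 2), round(cv_30, 2))   # 12.06 11.28
```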

3.4.3 Measures of Shape

Measures of asymmetry (skewness) and kurtosis characterize the shape of the distribution of the population elements sampled around the mean (Maroco, 2014).

3.4.3.1 Measures of Skewness

Measures of skewness describe the shape of a frequency distribution curve. For a symmetrical curve or frequency distribution, the mean, the mode, and the median are the same. For an asymmetrical curve, the mean gets farther away from the mode, and the median is located in an intermediary position. Fig. 3.16 shows a symmetrical distribution.

On the other hand, if the frequency distribution is more concentrated on the left side, that is, the tail on the right is longer than the tail on the left, we will have a positively skewed distribution (skewed to the right), as shown in Fig. 3.17. In this case, the mean is greater than the median, and the latter is greater than the mode ($Mo < Md < \bar{X}$).

Conversely, if the frequency distribution is more concentrated on the right side, that is, the tail on the left is longer than the tail on the right, we will have a negatively skewed distribution (skewed to the left), as shown in Fig. 3.18. In this case, the mean is less than the median, and the latter is less than the mode ($\bar{X} < Md < Mo$).


FIG. 3.16 Symmetrical distribution.

FIG. 3.17 Skewness to the right or positive skewness.

FIG. 3.18 Skewness to the left or negative skewness.

3.4.3.1.1 Pearson’s First Coefficient of Skewness

Pearson’s first coefficient of skewness ($Sk_1$) is a measure of skewness given by the difference between the mean and the mode, weighted by one measure of dispersion (the standard deviation):

$$Sk_1 = \frac{\mu - Mo}{\sigma} \quad \text{(for the population)} \tag{3.47}$$

$$Sk_1 = \frac{\bar{X} - Mo}{S} \quad \text{(for samples)} \tag{3.48}$$

which has the following interpretation:
If $Sk_1 = 0$, the distribution is symmetrical;
If $Sk_1 > 0$, the distribution is positively skewed (to the right);
If $Sk_1 < 0$, the distribution is negatively skewed (to the left).

Example 3.39
From one set of data, we obtained the following measures: $\bar{X} = 34.7$, Mo = 31.5, Md = 33.2, and S = 12.4. Determine the type of skewness and calculate Pearson’s first coefficient of skewness.


Solution
Since $Mo < Md < \bar{X}$, we have a positively skewed distribution (to the right). Applying Expression (3.48), we can determine Pearson’s first coefficient of skewness:

$$Sk_1 = \frac{\bar{X} - Mo}{S} = \frac{34.7 - 31.5}{12.4} = 0.258$$

The classification of the distribution as positively skewed is also confirmed by the value $Sk_1 > 0$.

3.4.3.1.2 Pearson’s Second Coefficient of Skewness

To avoid using the mode to calculate the skewness, we can adopt the empirical relationship between the mean, the median, and the mode, $\bar{X} - Mo = 3 \cdot (\bar{X} - Md)$, which leads to Pearson’s second coefficient of skewness ($Sk_2$):

$$Sk_2 = \frac{3 \cdot (\mu - Md)}{\sigma} \quad \text{(for the population)} \tag{3.49}$$

$$Sk_2 = \frac{3 \cdot (\bar{X} - Md)}{S} \quad \text{(for samples)} \tag{3.50}$$

In the same way, we have:
If $Sk_2 = 0$, the distribution is symmetrical;
If $Sk_2 > 0$, the distribution is positively skewed (to the right);
If $Sk_2 < 0$, the distribution is negatively skewed (to the left).

Pearson’s first and second coefficients of skewness allow us to compare two or more distributions and to evaluate which one is more asymmetrical. The modulus of the coefficient indicates the intensity of the skewness, that is, the higher the absolute value of Pearson’s coefficient of skewness, the more asymmetrical the curve is. Thus:
If $0 < |Sk| < 0.15$, the skewness is weak;
If $0.15 \le |Sk| \le 1$, the skewness is moderate;
If $|Sk| > 1$, the skewness is strong.

Example 3.40
From the data in Example 3.39, calculate Pearson’s second coefficient of skewness.

Solution
Applying Expression (3.50), we have:

$$Sk_2 = \frac{3 \cdot (\bar{X} - Md)}{S} = \frac{3 \cdot (34.7 - 33.2)}{12.4} = 0.363$$

Analogously, since $Sk_2 > 0$, we confirm that the distribution is positively skewed.
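Both Pearson coefficients and the intensity scale above can be sketched in Python with the summary measures of Example 3.39. All function names are hypothetical helpers introduced for illustration.

```python
def pearson_first(mean, mode, std):
    """Pearson's first coefficient of skewness, Expression (3.48)."""
    return (mean - mode) / std

def pearson_second(mean, median, std):
    """Pearson's second coefficient of skewness, Expression (3.50)."""
    return 3 * (mean - median) / std

def skewness_intensity(sk):
    """Intensity scale from the text, based on the absolute value of the coefficient."""
    size = abs(sk)
    if size == 0:
        return "symmetrical"
    if size < 0.15:
        return "weak"
    if size <= 1:
        return "moderate"
    return "strong"

sk1 = pearson_first(mean=34.7, mode=31.5, std=12.4)
sk2 = pearson_second(mean=34.7, median=33.2, std=12.4)
print(round(sk1, 3), skewness_intensity(sk1))   # 0.258 moderate
print(round(sk2, 3), skewness_intensity(sk2))   # 0.363 moderate
```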

3.4.3.1.3 Bowley’s Coefficient of Skewness

Another measure of skewness is Bowley’s coefficient of skewness ($Sk_B$), also known as the quartile coefficient of skewness, calculated from quantiles such as the first and third quartiles, in addition to the median:

$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1} \tag{3.51}$$

In the same way, we have:
If $Sk_B = 0$, the distribution is symmetrical;
If $Sk_B > 0$, the distribution is positively skewed (to the right);
If $Sk_B < 0$, the distribution is negatively skewed (to the left).


Example 3.41
Calculate Bowley’s coefficient of skewness for the following dataset, which has already been sorted in ascending order:

Position   1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th  12th
Value       24   25   29   31   36   40   44   45   48    50    54    56

Solution
We have Q1 = 30, Md = 42, and Q3 = 49. Therefore, we can determine Bowley’s coefficient of skewness:

$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1} = \frac{49 + 30 - 2 \cdot (42)}{49 - 30} = -0.263$$

Since SkB < 0, we conclude that the distribution is negatively skewed (to the left).
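Expression (3.51) is simple enough to check in one line of Python. The sketch below takes the quartiles and the median of Example 3.41 as given inputs rather than recomputing them; the function name is hypothetical.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's (quartile) coefficient of skewness, Expression (3.51)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

sk_b = bowley_skewness(q1=30, median=42, q3=49)
print(round(sk_b, 3))   # -0.263
```

Because only quartiles enter the formula, the coefficient always lies between −1 and 1 and is not affected by the values in the tails beyond Q1 and Q3.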

3.4.3.1.4 Fisher’s Coefficient of Skewness

The last measure of skewness we will study is known as Fisher’s coefficient of skewness ($g_1$), calculated from the third moment around the mean ($M_3$), as presented in Maroco (2014):

$$g_1 = \frac{n^2 \cdot M_3}{(n - 1) \cdot (n - 2) \cdot S^3} \tag{3.52}$$

where:

$$M_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n} \tag{3.53}$$

which is interpreted the same way as the other coefficients of skewness, that is:
If $g_1 = 0$, the distribution is symmetrical;
If $g_1 > 0$, the distribution is positively skewed (to the right);
If $g_1 < 0$, the distribution is negatively skewed (to the left).

Fisher’s coefficient of skewness can be calculated in Excel using the SKEW function (see Example 3.42) or using the Analysis ToolPak supplement (Section 3.5). Its calculation through SPSS software will be presented in Section 3.6.

3.4.3.1.5 Coefficient of Skewness on Stata

The coefficient of skewness on Stata is calculated from the second and third moments around the mean, as presented by Cox (2010):

$$Sk = \frac{M_3}{M_2^{3/2}} \tag{3.54}$$

where:

$$M_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \tag{3.55}$$


which is interpreted the same way as the other coefficients of skewness, that is:
If $Sk = 0$, the distribution is symmetrical;
If $Sk > 0$, the distribution is positively skewed (to the right);
If $Sk < 0$, the distribution is negatively skewed (to the left).

3.4.3.2 Measures of Kurtosis

In addition to measures of skewness, measures of kurtosis can also be used to characterize the shape of the distribution of the variable being studied. Kurtosis can be defined as the flatness level of a frequency distribution (height of the peak of the curve) in relation to a theoretical distribution that usually corresponds to the normal distribution. When the shape of the distribution is neither very flat nor very elongated (peaked), similar to a normal curve, it is called mesokurtic, as we can see in Fig. 3.19. In contrast, when the distribution shows a frequency curve that is flatter than a normal curve, it is called platykurtic, as shown in Fig. 3.20. Finally, when the distribution presents a frequency curve that is more elongated (more peaked) than a normal curve, it is called leptokurtic, as shown in Fig. 3.21.

3.4.3.2.1 Coefficient of Kurtosis

One of the most common coefficients to measure the flatness level or kurtosis of a distribution is the percentile coefficient of kurtosis, or simply coefficient of kurtosis (k). It is calculated from the interquartile range, in addition to the 10th and 90th percentiles:

$$k = \frac{Q_3 - Q_1}{2 \cdot (P_{90} - P_{10})} \tag{3.56}$$

which has the following interpretation:
If k = 0.263, we say that the curve is mesokurtic;
If k > 0.263, we say that the curve is platykurtic;
If k < 0.263, we say that the curve is leptokurtic.
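The benchmark 0.263 is the value that Expression (3.56) takes for a normal distribution, which can be checked numerically. The sketch below uses `NormalDist.inv_cdf` from Python’s standard `statistics` module to obtain the theoretical quartiles and percentiles of the standard normal curve; the function name `percentile_kurtosis` is a hypothetical helper.

```python
from statistics import NormalDist

def percentile_kurtosis(q1, q3, p10, p90):
    """Percentile coefficient of kurtosis, Expression (3.56)."""
    return (q3 - q1) / (2 * (p90 - p10))

std_normal = NormalDist()   # mean 0, standard deviation 1
k_normal = percentile_kurtosis(q1=std_normal.inv_cdf(0.25),
                               q3=std_normal.inv_cdf(0.75),
                               p10=std_normal.inv_cdf(0.10),
                               p90=std_normal.inv_cdf(0.90))
print(round(k_normal, 3))   # 0.263
```

Since the coefficient is a ratio of two distances between quantiles, it does not depend on the mean or the standard deviation of the normal curve used in the check.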

FIG. 3.19 Mesokurtic curve.

FIG. 3.20 Platykurtic curve.


FIG. 3.21 Leptokurtic curve.

3.4.3.2.2 Fisher’s Coefficient of Kurtosis

Another very common measure to determine the flatness level or kurtosis of a distribution is Fisher’s coefficient of kurtosis ($g_2$). It is calculated using the fourth moment around the mean ($M_4$), as presented in Maroco (2014):

$$g_2 = \frac{n^2 \cdot (n + 1) \cdot M_4}{(n - 1) \cdot (n - 2) \cdot (n - 3) \cdot S^4} - 3 \cdot \frac{(n - 1)^2}{(n - 2) \cdot (n - 3)} \tag{3.57}$$

where:

$$M_4 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n} \tag{3.58}$$

which has the following interpretation:
If $g_2 = 0$, the curve has a normal distribution (mesokurtic);
If $g_2 < 0$, the curve is very flat (platykurtic);
If $g_2 > 0$, the curve is very elongated (leptokurtic).

Many statistical software packages, among them SPSS, use Fisher’s coefficient of kurtosis to calculate the flatness level or kurtosis (Section 3.6). In Excel, the KURT function calculates Fisher’s coefficient of kurtosis (Example 3.42), and it can be calculated through the Analysis ToolPak supplement as well (Section 3.5).

3.4.3.2.3 Coefficient of Kurtosis on Stata

The coefficient of kurtosis on Stata is calculated from the second and fourth moments around the mean, as presented by Bock (1975) and Cox (2010):

$$k_S = \frac{M_4}{M_2^2} \tag{3.59}$$

which has the following interpretation:
If $k_S = 3$, the curve has a normal distribution (mesokurtic);
If $k_S < 3$, the curve is very flat (platykurtic);
If $k_S > 3$, the curve is very elongated (leptokurtic).


Example 3.42 Table 3.E.41 shows the prices of stock Y throughout a month, resulting in a sample with 20 periods (i.e., business days). Calculate: a) Fisher’s coefficient of skewness (g1); b) The coefficient of skewness used on Stata; c) Fisher’s coefficient of kurtosis (g2); d) The coefficient of kurtosis used on Stata;

TABLE 3.E.41 Prices of Stock Y Throughout the Month

18.7  18.3  18.4  18.7  18.8  18.8  19.1  18.9  19.1  19.9
18.5  18.5  18.1  17.9  18.2  18.3  18.1  18.8  17.5  16.9

Solution

The mean and the standard deviation of the data in Table 3.E.41 are $\bar{X} = 18.475$ and $S = 0.6324$, respectively. We have:

a) Fisher’s coefficient of skewness g1: It is calculated using the third moment about the mean (M3):

$$M_3 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^3}{n} = \frac{(18.7 - 18.475)^3 + \cdots + (16.9 - 18.475)^3}{20} = -0.0788$$

Therefore, we have:

$$g_1 = \frac{n^2 M_3}{(n-1)(n-2) S^3} = \frac{(20)^2 \times (-0.079)}{19 \times 18 \times (0.63)^3} = -0.3647$$

Since g1 < 0, we can conclude that the frequency curve is more concentrated on the right side and has a longer tail to the left, that is, the distribution is asymmetrical to the left or negative.

Excel calculates Fisher’s coefficient of skewness (g1) through the SKEW function. File Stock_Market.xls shows the data from Table 3.E.41 in cells A1:A20. Thus, to calculate it, we just need to insert the expression =SKEW(A1:A20).

b) The coefficient of skewness used on Stata: It is calculated from the second and third moments about the mean:

$$M_2 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}{n} = \frac{(18.7 - 18.475)^2 + \cdots + (16.9 - 18.475)^2}{20} = 0.3799$$

$$M_3 = -0.0788$$

It is calculated as follows:

$$S_k = \frac{M_3}{M_2^{3/2}} = -0.3367,$$

which is interpreted the same way as Fisher’s coefficient of skewness.

c) Fisher’s coefficient of kurtosis g2: It is calculated using the fourth moment about the mean (M4):

$$M_4 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^4}{n} = \frac{(18.7 - 18.475)^4 + \cdots + (16.9 - 18.475)^4}{20} = 0.5857$$

Therefore, we calculate g2 as follows:

$$g_2 = \frac{n^2 (n+1) M_4}{(n-1)(n-2)(n-3) S^4} - 3 \cdot \frac{(n-1)^2}{(n-2)(n-3)}$$

$$g_2 = \frac{(20)^2 \times 21 \times 0.5857}{19 \times 18 \times 17 \times (0.6324)^4} - 3 \cdot \frac{(19)^2}{18 \times 17} = 1.7529$$

Thus, since g2 > 0, we can conclude that the curve is elongated or leptokurtic.


PART

II Descriptive Statistics

The KURT function in Excel calculates Fisher’s coefficient of kurtosis (g2). To calculate it from the file Stock_Market.xls, we must insert the expression =KURT(A1:A20).

d) Coefficient of kurtosis on Stata: It is calculated from the second and fourth moments about the mean, M2 = 0.3799 and M4 = 0.5857, as already calculated. Thus:

$$k_S = \frac{M_4}{M_2^2} = \frac{0.5857}{(0.3799)^2} = 4.0586$$

Since kS > 3, the curve is elongated or leptokurtic.

In the next three sections, we will discuss how to construct tables, charts, graphs, and summary measures in Excel and in the statistical software packages SPSS and Stata, using the data in Example 3.42.
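The four coefficients of Example 3.42 can also be checked programmatically. The following is a minimal Python sketch, assuming only the standard library; the function names (central_moment, fisher_skewness, and so on) are illustrative and do not come from Excel, SPSS, or Stata. It applies Expressions (3.57) to (3.59) and the skewness formulas used in the solution directly to the data in Table 3.E.41.

```python
from math import sqrt

# Prices of stock Y from Table 3.E.41 (20 business days).
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def central_moment(data, k):
    """k-th moment about the mean, dividing by n (M2, M3, M4 in the text)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def sample_std(data):
    """Sample standard deviation S, dividing by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


def fisher_skewness(data):
    """Fisher's coefficient of skewness g1."""
    n, s = len(data), sample_std(data)
    return n ** 2 * central_moment(data, 3) / ((n - 1) * (n - 2) * s ** 3)


def stata_skewness(data):
    """Skewness as M3 / M2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def fisher_kurtosis(data):
    """Fisher's coefficient of kurtosis g2, Expression (3.57)."""
    n, s = len(data), sample_std(data)
    first = (n ** 2 * (n + 1) * central_moment(data, 4)
             / ((n - 1) * (n - 2) * (n - 3) * s ** 4))
    return first - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


def stata_kurtosis(data):
    """Kurtosis as M4 / M2^2, Expression (3.59)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


print(round(fisher_skewness(prices), 4))  # g1
print(round(stata_skewness(prices), 4))   # Sk
print(round(fisher_kurtosis(prices), 4))  # g2
print(round(stata_kurtosis(prices), 4))   # kS
```

Because the sketch works with unrounded intermediate values, it reproduces the final results of the example (g1 = -0.3647, Sk = -0.3367, g2 = 1.7529, and kS = 4.0586) rather than what one would get by plugging the rounded moments into the formulas.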

3.5 A PRACTICAL EXAMPLE IN EXCEL

Section 3.3.1 showed the graphical representation of qualitative variables through bar charts (horizontal and vertical), pie charts, and the Pareto chart, and demonstrated how each of these charts can be obtained using Excel. Section 3.3.2, in turn, showed the graphical representation of quantitative variables through line graphs, scatter plots, and histograms, among others, and presented how most of them can be obtained using Excel.

Section 3.4 presented the main summary measures, including measures of central tendency (mean, mode, and median), quantiles (quartiles, deciles, and percentiles), measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), and measures of shape, such as skewness and kurtosis. We also presented how they can be calculated using Excel functions, whenever such functions are available.

This section discusses how to obtain descriptive statistics (such as the mean, standard error, median, mode, standard deviation, variance, kurtosis, and skewness, among others) through the Analysis ToolPak add-in in Excel. In order to do that, let’s consider the problem presented in Example 3.42, whose data are available in Excel in the file Stock_Market.xls, in cells A1:A20, as shown in Fig. 3.22.

To load the Analysis ToolPak add-in in Excel, we must first click on the File tab and on Options, as shown in Fig. 3.23. The Excel Options dialog box will then open, as shown in Fig. 3.24. From this box, we must select the option Add-ins. In Add-ins, we must select the option Analysis ToolPak and click on Go. Then, the Add-ins dialog box will appear, as shown in Fig. 3.25. Among the add-ins available, we must select the option Analysis ToolPak and click on OK.

FIG. 3.22 Dataset in Excel—Price of Stock Y.


FIG. 3.23 File tab, focusing more on Options.

The option Data Analysis then becomes available on the Data tab, inside the Analysis group, as shown in Fig. 3.26. Fig. 3.27 shows the Data Analysis dialog box. Note that several analysis tools are available. Let’s select the option Descriptive Statistics and click on OK.

From the Descriptive Statistics dialog box (Fig. 3.28), we must select the Input Range (A1:A20) and, as Output options, let’s select Summary statistics. The results can be presented in a new worksheet or in a new workbook. Finally, let’s click on OK.

The descriptive statistics generated can be seen in Fig. 3.29 and include measures of central tendency (mean, mode, and median), measures of dispersion or variability (variance, standard deviation, and standard error), and measures of shape (skewness and kurtosis). The range can be calculated from the difference between the sample’s maximum and minimum values. As mentioned in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by Excel (using the SKEW function or the Descriptive Statistics dialog box in Fig. 3.28) corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis calculated (using the KURT function or the same dialog box) corresponds to Fisher’s coefficient of kurtosis (g2).
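For readers who want to cross-check the Excel summary outside the spreadsheet, the sketch below computes the same kind of measures with Python’s standard statistics module. The function name describe and the dictionary keys are illustrative choices, not Excel’s labels, and the data are typed in from Table 3.E.41 rather than read from Stock_Market.xls.

```python
import statistics
from math import sqrt

# Prices of stock Y from Table 3.E.41 (cells A1:A20 in the Excel file).
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def describe(data):
    """Summary measures for one quantitative variable (sample formulas)."""
    n = len(data)
    sd = statistics.stdev(data)            # sample standard deviation (n - 1)
    return {
        "mean": statistics.mean(data),
        "standard_error": sd / sqrt(n),    # standard error of the mean
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "standard_deviation": sd,
        "sample_variance": statistics.variance(data),
        "range": max(data) - min(data),
        "minimum": min(data),
        "maximum": max(data),
        "count": n,
    }


for name, value in describe(prices).items():
    print(f"{name:>20}: {value:.4f}")
```

The range is obtained here exactly as described above, as the difference between the maximum and minimum values of the sample.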

FIG. 3.24 Excel Options dialog box.

FIG. 3.25 Add-ins dialog box.

FIG. 3.26 Availability of the Data Analysis command, from the Data tab.

FIG. 3.27 Data Analysis dialog box.

FIG. 3.28 Descriptive Statistics dialog box.

FIG. 3.29 Descriptive statistics in Excel.

3.6 A PRACTICAL EXAMPLE ON SPSS

Using a practical example, this section presents how to obtain the main univariate descriptive statistics studied in this chapter by using IBM SPSS Statistics Software. These include frequency distribution tables, charts (histograms, stem-and-leaf plots, boxplots, bar charts, and pie charts), measures of central tendency (mean, mode, and median), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of shape (skewness and kurtosis). The use of the images in this section has been authorized by the International Business Machines Corporation©.

The data presented in Example 3.42 are the input basis on SPSS and are available in the file Stock_Market.sav, as shown in Fig. 3.30. To obtain such descriptive statistics, we must click on Analyze → Descriptive Statistics. After that, three options can be used: Frequencies, Descriptives, and Explore.

FIG. 3.30 Dataset on SPSS: Price of Stock Y.

3.6.1 Frequencies Option

This option can be used for qualitative and quantitative variables, and it provides frequency distribution tables, as well as measures of central tendency (mean, median, and mode), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of skewness and kurtosis. The Frequencies option also plots bar charts, pie charts, or histograms (with or without a normal curve). To use it, on the toolbar, click on Analyze → Descriptive Statistics and select Frequencies..., as shown in Fig. 3.31.


FIG. 3.31 Descriptive statistics on SPSS—Frequencies Option.

FIG. 3.32 Frequencies dialog box: selecting the variable and showing the frequency table.

The Frequencies dialog box will then open. The variable being studied (Stock price, called Price) must be selected in Variable(s), and the Display frequency tables option must be activated so that the frequency distribution table can be shown (Fig. 3.32).

The next step consists of clicking on Statistics... to select the summary measures that interest us (Fig. 3.33). Among the quantiles, let’s select the option Quartiles (which calculates the first and third quartiles, in addition to the median). To get the percentile of order i (i = 1, 2, ..., 99), we must select the option Percentile(s) and add the order desired. In this case, we chose to calculate the percentiles of order 10 and 60. The measures of central tendency that we have to select are the mean, median, and mode. As measures of dispersion, let’s select Std. deviation (standard deviation), Variance, Range, and S.E. mean (standard error). Finally, let’s select both measures of shape of a distribution: Skewness and Kurtosis. To go back to the Frequencies dialog box, we must click on Continue.

FIG. 3.33 Frequencies: Statistics dialog box.

Next, let’s click on Charts... and select the chart that interests us. As options, we have Bar charts, Pie charts, or Histograms. Let’s select the last one, with the option of plotting a normal curve (Fig. 3.34). Bar or pie charts can be shown in terms of absolute frequencies (Frequencies) or relative frequencies (Percentages). To go back to the Frequencies dialog box once again, we must click on Continue. Finally, click on OK.

Fig. 3.35 shows the calculations of the summary measures selected in Fig. 3.33. As studied in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by SPSS corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis corresponds to Fisher’s coefficient of kurtosis (g2). Also in Fig. 3.35, note that the percentiles of order 25, 50, and 75, which correspond to the first quartile, median, and third quartile, respectively, were calculated automatically. The method used to calculate the percentiles was the Weighted Average.

The frequency distribution table can be seen in Fig. 3.36. The first column represents the absolute frequency of each element (Fi), the second and third columns represent the relative frequency of each element (Fri, in %), and the last column represents the relative cumulative frequency (Frac, in %). Also in Fig. 3.36, we can see that no value occurs more than three times. Since we have a continuous quantitative variable with 20 observations and few repetitions, constructing bar or pie charts would not give the researcher any additional information, that is, it would not allow a good visualization of how the stock prices behave in terms of bins. Hence, we chose to construct a histogram with previously defined bins.
The histogram generated using SPSS with the option of plotting a normal curve can be seen in Fig. 3.37.
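The structure of the frequency distribution table (absolute, relative, and relative cumulative frequencies) can be sketched in a few lines of Python. This is only an illustration built from the data in Table 3.E.41, not a reproduction of the SPSS output; the function name frequency_table is a hypothetical helper.

```python
from collections import Counter

# Prices of stock Y from Table 3.E.41.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def frequency_table(data):
    """Rows of (value, Fi, Fri in %, Frac in %), sorted by value."""
    n = len(data)
    counts = Counter(data)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value,
                     counts[value],              # absolute frequency
                     100 * counts[value] / n,    # relative frequency (%)
                     100 * cumulative / n))      # cumulative relative frequency (%)
    return rows


print(f"{'Price':>6} {'Fi':>3} {'Fri (%)':>8} {'Frac (%)':>9}")
for value, fi, fri, frac in frequency_table(prices):
    print(f"{value:>6.1f} {fi:>3} {fri:>8.1f} {frac:>9.1f}")
```

With so many distinct values relative to the sample size, the table has almost as many rows as observations, which is why a histogram with predefined bins is the more informative display here.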

FIG. 3.34 Frequencies: Charts dialog box.

FIG. 3.35 Summary measures obtained from Frequencies: Statistics.

FIG. 3.36 Frequency distribution.

FIG. 3.37 Histogram with a normal curve obtained from Frequencies: Charts.
[Histogram of Price (horizontal axis) versus Frequency (vertical axis); Mean = 18.47, Std. Dev. = .632, N = 20]

3.6.2 Descriptives Option

Unlike Frequencies..., which also offers the frequency distribution table, besides bar charts, pie charts, or histograms (with or without a normal curve), Descriptives... only makes summary measures available (therefore, it is recommended for quantitative variables). Nevertheless, measures of central tendency such as the median and mode are not made available; nor are quantiles, such as quartiles and percentiles. To use it, let’s click on Analyze → Descriptive Statistics and select Descriptives..., as shown in Fig. 3.38.

The Descriptives dialog box will then open. The variable being studied must be selected in Variable(s), as shown in Fig. 3.39. Let’s click on Options... and select the summary measures that interest us (Fig. 3.40). Note that the same summary measures as in Frequencies... were selected, except for the median, the mode, the quartiles, and the percentiles, which are not available, as already mentioned. Let’s click on Continue to go back to the Descriptives dialog box. Finally, click on OK. The results are available in Fig. 3.41.

FIG. 3.38 Descriptive statistics on SPSS: Descriptives Option.

FIG. 3.39 Descriptives dialog box: selecting the variable.

3.6.3 Explore Option

Like Descriptives..., Explore... does not provide the frequency distribution table. Regarding the types of chart, unlike Frequencies..., which offers bar charts, pie charts, and histograms, Explore... provides stem-and-leaf plots and boxplots, in addition to histograms. However, it does not have the option of plotting a normal curve. Regarding summary measures, Explore... provides measures of central tendency, such as the mean and median (there is no option for the mode); quantiles, such as percentiles (of order 5, 10, 25, 50, 75, 90, and 95); measures of dispersion, such as the range, variance, and standard deviation, among others (it does not calculate the standard error); and measures of skewness and kurtosis.


FIG. 3.40 Descriptives: Options dialog box.

FIG. 3.41 Summary measures obtained from Descriptive: Options.

Therefore, this command is the best one to generate descriptive statistics for quantitative variables. From Analyze → Descriptive Statistics, select Explore..., as shown in Fig. 3.42. The Explore dialog box will then open. The variable being studied must be selected from the list of dependent variables (Dependent List), as shown in Fig. 3.43. Next, we must click on Statistics... to open the Explore: Statistics box, and select the options Descriptives, Outliers, and Percentiles, as shown in Fig. 3.44. Let’s click on Continue to go back to the Explore box. Next, we must click on Plots... to open the Explore: Plots box and select the charts that interest us, as shown in Fig. 3.45. In this case, we have to select Boxplots: Factor levels together (the resulting boxplots will be together in the same chart), Stem-and-leaf, and Histogram (note that there is no option for plotting the normal curve). Once again, we must click on Continue to go back to the Explore dialog box. Finally, click on OK.

FIG. 3.42 Descriptive statistics on SPSS: Explore Option.

FIG. 3.43 Explore dialog box: selecting the variable.

The results are illustrated next. Fig. 3.46 shows the results obtained from Explore: Statistics, with the Descriptives option. Fig. 3.47 shows the results obtained from Explore: Statistics, with the Percentiles option. The percentiles of order 5, 10, 25 (Q1), 50 (median), 75 (Q3), 90, and 95 were calculated using two methods: the Weighted Average and Tukey’s Hinges. The latter corresponds to the method proposed in this chapter (Section 3.4.1.2, Case 1). Thus, applying the expressions in Section 3.4.1.2 to this example, we get the same results seen in Fig. 3.47 for P25, P50, and P75 under Tukey’s Hinges method. Coincidentally, in this example, the value of P75 was the same for both methods, but they are usually different.

Fig. 3.48 shows the results obtained from Explore: Statistics, with the Outliers option. The extreme values of the distribution are presented here (the highest five and the lowest five), with their respective positions in the dataset. The charts constructed from the options selected in Explore: Plots (histograms, stem-and-leaf plots, and boxplots) are presented in Figs. 3.49, 3.50, and 3.51, respectively.
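The difference between the two percentile methods can be made concrete with a short sketch. It rests on two assumptions that are not spelled out in this section: that the Weighted Average method interpolates linearly at position p(n + 1)/100 in the sorted sample, and that Tukey’s hinges are the medians of the lower and upper halves of the sorted data. The function names are illustrative.

```python
from statistics import median

# Prices of stock Y from Table 3.E.41.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def weighted_average_percentile(data, p):
    """Percentile of order p, interpolating at position p(n + 1)/100 (assumed)."""
    xs = sorted(data)
    n = len(xs)
    position = p * (n + 1) / 100
    lower = int(position)              # 1-based rank at or below the position
    fraction = position - lower
    if lower < 1:
        return xs[0]
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


def tukey_hinges(data):
    """Lower hinge, median, and upper hinge (medians of the two halves)."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2                # the median joins both halves when n is odd
    return median(xs[:half]), median(xs), median(xs[n - half:])


q1, md, q3 = tukey_hinges(prices)
print("Tukey's hinges:  ", round(q1, 3), round(md, 3), round(q3, 3))
print("Weighted average:",
      [round(weighted_average_percentile(prices, p), 3) for p in (25, 50, 75)])
```

Under these assumptions the hinges are 18.15, 18.5, and 18.8, the values used later for the boxplot, while the weighted-average first quartile is 18.125; the third quartile is 18.8 under both rules, which matches the coincidence noted above.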

FIG. 3.44 Explore: Statistics dialog box.

FIG. 3.45 Explore: Plots dialog box.

FIG. 3.46 Results obtained from the Descriptives option.

FIG. 3.47 Results obtained from the Percentiles option.

FIG. 3.48 Results obtained from the Outliers option.

FIG. 3.49 Histogram constructed from the Explore: Plots dialog box.
[Histogram of Price (horizontal axis) versus Frequency (vertical axis); Mean = 18.48, Std. Dev. = .632, N = 20]

FIG. 3.50 Stem-and-leaf chart generated from the Explore: Plots dialog box.

Price Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     1.00 Extremes    (=<16.9)
     2.00       17 .  59
     6.00       18 .  112334
     8.00       18 .  55778889
     2.00       19 .  11
     1.00 Extremes    (>=19.9)

 Stem width:  1.0
 Each leaf:   1 case(s)

FIG. 3.51 Boxplot generated from the Explore: Plots dialog box.
[Boxplot of Price, vertical axis from 16.0 to 20.0; observations 10 and 20 are marked as outliers]

The histogram in Fig. 3.49 is the same as the one generated by Frequencies... (Fig. 3.37), but without the normal curve, since Explore... does not provide this function.

Fig. 3.50 shows that the first two digits of the number (the integers, before the point) form the stem and the decimals correspond to the leaf. Moreover, stem 18 is represented in two lines because it contains several observations.

In Section 3.4.1.3, we learned how to identify an extreme outlier through the expressions X* < Q1 − 3·(Q3 − Q1) and X* > Q3 + 3·(Q3 − Q1). If we consider that Q1 = 18.15 and Q3 = 18.8, we have X* < 16.2 or X* > 20.75. Since there are no observations outside these limits, we conclude that there are no extreme outliers. Repeating the same procedure for mild outliers, that is, applying the expressions X° < Q1 − 1.5·(Q3 − Q1) and X° > Q3 + 1.5·(Q3 − Q1), we can see that there is one observation with a value of less than 17.175 (20th observation), and another one with a value greater than 19.775 (10th observation). These values are therefore considered mild outliers.

The boxplot in Fig. 3.51 shows that observations 10 and 20, with values 19.9 and 16.9, respectively, are mild outliers (represented by circles). Depending on their survey goals, this allows researchers to decide whether to keep them, exclude them (the analysis may be harmed because of the reduction in the sample size), or replace their values with the variable mean. Continuing in Fig. 3.51, the values of Q1, Q2 (Md), and Q3 correspond to 18.15, 18.5, and 18.8, respectively, which are those obtained from Tukey’s Hinges method (Fig. 3.47), considering all of the initial 20 observations. Therefore, the boxplot’s measures of position (Q1, Md, and Q3), unlike the minimum and maximum values represented, are calculated without excluding the outliers.
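The outlier check described above is mechanical enough to express in code. The sketch below takes Q1 = 18.15 and Q3 = 18.8 as given and applies the 1.5 and 3 interquartile-range limits to the data in Table 3.E.41; the function name classify_outliers is illustrative and is not part of SPSS.

```python
# Prices of stock Y from Table 3.E.41, in the order they appear in the dataset.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def classify_outliers(data, q1, q3):
    """Return (mild, extreme) lists of (position, value), positions 1-based."""
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr
    mild, extreme = [], []
    for position, x in enumerate(data, start=1):
        if x < extreme_low or x > extreme_high:
            extreme.append((position, x))
        elif x < mild_low or x > mild_high:
            mild.append((position, x))
    return mild, extreme


mild, extreme = classify_outliers(prices, q1=18.15, q3=18.8)
print("Mild outliers:   ", mild)     # observations 10 and 20
print("Extreme outliers:", extreme)  # none
```

Returning the positions alongside the values makes it easy to match the flagged points to the observation labels shown next to the circles in the boxplot.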

3.7 A PRACTICAL EXAMPLE ON STATA

The same descriptive statistics obtained in the previous section through SPSS software will be calculated in this section through Stata Statistical Software. The results will be compared to those obtained in an algebraic way and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. The data presented in Example 3.42 are the input basis on Stata, and are available in the file Stock_Market.dta.

3.7.1 Univariate Frequency Distribution Tables on Stata

Through the command tabulate, or simply tab, as we will use throughout this book, we can obtain frequency distribution tables for a certain variable. The syntax of the command is:

    tab variable*

where the term variable* should be substituted for the name of the variable considered in the analysis. Fig. 3.52 shows the output obtained using the command tab price. Just as the frequency distribution table obtained through SPSS (Fig. 3.36), Fig. 3.52 provides the absolute, relative, and relative cumulative frequencies for each category of the variable price.

FIG. 3.52 Frequency distribution on Stata using the command tab.


Consider a case with more than one variable being studied in which the objective is to construct univariate frequency distribution tables (one-way tables), that is, one table for each variable being analyzed. In this case, we must use the command tab1, with the following syntax:

    tab1 variables*

where the term variables* should be substituted for the list of variables being considered in the analysis.

3.7.2 Summary of Univariate Descriptive Statistics on Stata

Through the command summarize, or simply sum, as we will use throughout this book, we can obtain summary measures, such as the mean, standard deviation, and minimum and maximum values. The syntax of this command is:

    sum variables*

where the term variables* should be substituted for the list of variables to be considered in the analysis. If no variable is specified, the statistics will be calculated for all of the variables in the dataset. Through the option detail, we can obtain additional statistics, such as the coefficient of skewness, the coefficient of kurtosis, the four lowest and highest values, as well as several percentiles. The syntax of this command is:

    sum variables*, detail

Therefore, for the data in our example, available in the file Stock_Market.dta, first, we must type the following command:

    sum price

obtaining the statistics in Fig. 3.53. To obtain additional descriptive statistics, we must type the following command:

    sum price, detail

Fig. 3.54 shows the generated outputs. As shown in Fig. 3.54, the option detail provides the calculation of the percentiles of order 1, 5, 10, 25, 50, 75, 90, 95, and 99. These results are obtained by Tukey’s Hinges method. We have seen, through Fig. 3.47 on the SPSS software, the results of the percentiles of order 25, 50, and 75 obtained by the same method. Fig. 3.54 also provides the four lowest and highest values of the sample analyzed, as well as the coefficients of skewness and kurtosis. Note that these values coincide with the ones calculated in Sections 3.4.3.1.5 and 3.4.3.2.3, respectively.

FIG. 3.53 Summary measures using the command sum on Stata.

FIG. 3.54 Additional statistics using the option detail.


FIG. 3.55 Results obtained from the command centile on Stata.

3.7.3 Calculating Percentiles on Stata

The previous section discussed how to calculate the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles through Tukey’s Hinges method. On the other hand, by using the command centile, we can specify the percentiles to be calculated. The method used in this case is the Weighted Average. The syntax of this command is:

    centile variables*, centile (numbers*)

where the term variables* should be substituted for the list of variables to be considered in the analysis, and the term numbers* for the list of numbers that represent the order of the percentiles to be reported. Therefore, let’s suppose that we want to calculate the percentiles of order 5, 10, 25, 60, 64, 90, and 95 for the variable price, through the Weighted Average. In order to do that, we must use the following command:

    centile price, centile (5 10 25 60 64 90 95)

The results can be seen in Fig. 3.55. We have seen, through Fig. 3.35, the results of the SPSS software for the percentiles of order 10, 25, 50, 60, and 75 using the same method. Fig. 3.47 on SPSS also provided the calculation of the percentiles of order 5, 10, 25, 50, 75, 90, and 95 through the Weighted Average. The only percentile that had not been specified previously was the one of order 64; the others coincide with the results in Figs. 3.35 and 3.47.
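As a worked illustration of the only new percentile, the derivation below computes the percentile of order 64 by hand. It assumes that the Weighted Average method interpolates linearly at position p(n + 1)/100 of the sorted sample, where X_(k) denotes the k-th smallest observation; this rule is an assumption stated here for the sketch, so the result should be compared with Fig. 3.55 rather than taken as the reported output.

```latex
\begin{align*}
\text{position} &= \frac{64 \times (20 + 1)}{100} = 13.44 \\
X_{(13)} &= 18.7, \qquad X_{(14)} = 18.8 \\
P_{64} &= X_{(13)} + 0.44 \times \left( X_{(14)} - X_{(13)} \right)
        = 18.7 + 0.44 \times 0.1 = 18.744
\end{align*}
```

The integer part of the position selects the 13th and 14th ordered observations, and the fractional part (0.44) gives the weight placed on the distance between them.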

3.7.4 Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots

Stata makes a series of charts available, including bar charts, pie charts, scatter plots, histograms, stem-and-leaf, and boxplots, among others. Next, we will discuss how to obtain histograms, stem-and-leaf plots, and boxplots on Stata, for the data available in the file Stock_Market.dta.

3.7.4.1 Histogram

Histograms on Stata can be obtained for continuous and discrete variables. In the case of continuous variables, to obtain a histogram of absolute frequencies, with the option of plotting a normal curve, we must type the following syntax:

    histogram variable*, normal frequency

or simply:

    hist variable*, norm freq

as we will use throughout this book. As mentioned before, the term variable* must be substituted for the name of the variable being studied. For discrete variables, we must include the term discrete:

    hist variable*, discrete norm freq

FIG. 3.56 Frequency histogram on Stata.
[Histogram of Price (horizontal axis) versus Frequency (vertical axis)]

Going back to the data in Example 3.42, to obtain a frequency histogram, with the option of plotting a normal curve, we must type the following command:

    hist price, norm freq

The obtained output is shown in Fig. 3.56.

3.7.4.2 Stem-and-Leaf

The stem-and-leaf plot on Stata can be obtained using the command stem, followed by the name of the variable being studied. For the data in the file Stock_Market.dta, we just need to type the following command:

    stem price

The obtained output is shown in Fig. 3.57.
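The idea behind a stem-and-leaf plot (integer part as stem, first decimal as leaf) can be sketched in Python as follows. This is an illustrative construction for positive values with one decimal place; it does not attempt to reproduce the exact layout of the Stata or SPSS output, and the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict

# Prices of stock Y from Table 3.E.41.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def stem_and_leaf(data):
    """Return one text line per stem; assumes positive values with one decimal."""
    leaves = defaultdict(list)
    for x in sorted(data):
        tenths = round(x * 10)             # e.g. 18.7 -> 187, avoids float truncation
        stem, leaf = divmod(tenths, 10)    # 187 -> stem 18, leaf 7
        leaves[stem].append(str(leaf))
    return [f"{stem} | {''.join(leaves[stem])}" for stem in sorted(leaves)]


for line in stem_and_leaf(prices):
    print(line)
# 16 | 9
# 17 | 59
# 18 | 11233455778889
# 19 | 119
```

Unlike the software output discussed in this chapter, this sketch keeps all leaves of a stem on a single line and does not set outliers apart.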

FIG. 3.57 Stem-and-Leaf plot on Stata.

3.7.4.3 Boxplot

To obtain the boxplot on the Stata software, we must use the following syntax:

    graph box variables*

where the term variables* should be substituted for the list of variables to be considered in the analysis, and, for each variable, one chart is constructed. For the data in Example 3.42, the command is:

    graph box price

The chart is shown in Fig. 3.58, which corresponds to the same chart as in Fig. 3.51 generated using SPSS.

FIG. 3.58 Boxplot on Stata.
[Boxplot of Price, vertical axis from 17 to 20]

3.8 FINAL REMARKS

In this chapter, we studied descriptive statistics for a single variable (univariate descriptive statistics), in order to acquire a better understanding of the behavior of each variable through tables, charts, graphs, and summary measures, identifying trends, variability, and outliers.

Before we start using descriptive statistics, it is necessary to identify the type of variable we will study. The type of variable is essential for calculating descriptive statistics and for the graphical representation of the results.

The descriptive statistics used to represent the behavior of a qualitative variable’s data are frequency distribution tables and charts. The frequency distribution table for a qualitative variable represents the frequency with which each variable category occurs. Qualitative variables can be represented graphically by bar charts (horizontal and vertical), pie charts, and Pareto charts.

For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency with which each possible value of a discrete variable occurs, or to represent the frequency of continuous variables’ data grouped into classes. Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams) are normally used to graphically represent quantitative variables.

3.9 EXERCISES

1) What statistics can be used (and in which situations) to represent the behavior of a single quantitative or qualitative variable?
2) What are the limitations of only using measures of central tendency in the study of a certain variable?
3) How can we verify the existence of outliers in a certain variable?
4) Describe each one of the measures of dispersion or variability.
5) What is the difference between Pearson’s first and second coefficients used as measures of skewness in a distribution?
6) What is the best chart to check the position, skewness, and discrepancy among the data?
7) In the case of bar charts and scatter plots, what kind of data should be used?
8) What are the most suitable charts to represent qualitative data?


9) Table 3.1 shows the number of vehicles sold by a dealership in the last 30 days. Construct a frequency distribution table for these data.

TABLE 3.1 Number of Vehicles Sold

 7    5    9   11   10    8    9    6    8   10
 8    5    7   11    9   11    6    7   10    9
 8    5    6    8    6    7    6    5   10    8

10) A survey on patients’ health was carried out and information regarding the weight of 50 patients was collected (Table 3.2). Build the frequency distribution table for this problem.

TABLE 3.2 Patients’ Weight

60.4   78.9   65.7   82.1   80.9   92.3   85.7   86.6   90.3   93.2
75.2   77.3   80.4   62.0   90.4   70.4   80.5   75.9   55.0   84.3
81.3   78.3   70.5   85.6   71.9   77.5   76.1   67.7   80.6   78.0
71.6   74.8   92.1   87.7   83.8   93.4   69.3   97.8   81.7   72.2
69.3   80.2   90.0   76.9   54.7   78.4   55.2   75.5   99.3   66.7

11) At an electrical appliances factory, in the door component production phase, the quality inspector verifies the total number of parts rejected per type of defect (lack of alignment, scratches, deformation, discoloration, and oxygenation), as shown in Table 3.3.

TABLE 3.3 Total Number of Parts Rejected per Type of Defect

Type of Defect        Total
Lack of Alignment        98
Scratches                67
Deformation              45
Discoloration            28
Oxygenation              12
Total                   250

We would like you to:
a) Elaborate a frequency distribution table for this problem.
b) Construct a pie chart, in addition to a Pareto chart.

12) To preserve açaí, it is necessary to carry out several procedures, such as whitening, pasteurization, freezing, and dehydration. The files Dehydration.xls, Dehydration.sav, and Dehydration.dta show the processing times (in seconds) in the dehydration phase throughout 100 periods. We would like you to:
a) Calculate the measures of position regarding the arithmetic mean, the median, and the mode.
b) Calculate the first and third quartiles and see if there are any outliers.
c) Calculate the 10th and 90th percentiles.
d) Calculate the 3rd and 6th deciles.
e) Calculate the measures of dispersion (range, average deviation, variance, standard deviation, standard error, and coefficient of variation).
f) Check if the distribution is symmetrical, positively skewed, or negatively skewed.
g) Calculate the coefficient of kurtosis and determine the flatness level of the distribution (mesokurtic, platykurtic, or leptokurtic).
h) Construct a histogram, a stem-and-leaf plot, and a boxplot for the variable being studied.

13) In a certain bank branch, we collected the average service time (in minutes) from a sample with 50 customers regarding three types of services. The data can be found in files Services.xls, Services.sav, and Services.dta. Compare the results of the services based on the following measures:
a) Measures of position (mean, median, and mode).
b) Measures of dispersion (variance, standard deviation, and standard error).
c) First and third quartiles; check if there are any outliers.
d) Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution.
e) For each one of the variables, construct a bar chart, a boxplot, and a histogram.

14) A passenger collected the average travel times (in minutes) of a bus in the district of Vila Mariana, on the Jabaquara route, for 120 days (Table 3.4).

TABLE 3.4 Average Travel Times in 120 Days

Time    Number of Days
30       4
32       7
33      10
35      12
38      18
40      22
42      20
43      15
45       8
50       4

We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.
b) Calculate Q1, Q3, D4, P61, and P84.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of the distribution.
f) Construct a bar chart, a histogram, a stem-and-leaf plot, and a boxplot.

15) In order to improve the quality of its services, a retail company collected the average service time, in seconds, of 250 employees. The data were grouped into classes, with their respective absolute and relative frequencies, as shown in Table 3.5. We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.
b) Calculate Q1, Q3, D2, P13, and P95.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Pearson’s first coefficient of skewness and the coefficient of kurtosis. Classify the symmetry and the flatness level of the distribution.
f) Construct a histogram.


TABLE 3.5 Average Service Time

Class        Fi     Fri (%)
30 ├ 60      11     4.4
60 ├ 90      29     11.6
90 ├ 120     41     16.4
120 ├ 150    82     32.8
150 ├ 180    54     21.6
180 ├ 210    33     13.2
Sum          250    100

16) A financial analyst wants to compare the price of two stocks throughout the previous month. The data are listed in Table 3.6.

TABLE 3.6 Stock Price

Stock A    Stock B
31         25
30         33
24         27
24         34
28         32
22         26
24         26
34         28
24         34
28         28
23         31
30         28
31         34
32         16
26         28
39         29
25         27
42         28
29         33
24         29
22         34
23         33
32         27
29         26


Carry out a comparative analysis of the price of both stocks based on:
a) Measures of position, such as the mean, median, and mode.
b) Measures of dispersion, such as the range, variance, standard deviation, and standard error.
c) The existence of outliers.
d) The symmetry and flatness level of the distribution.
e) A line graph, scatter plot, stem-and-leaf plot, histogram, and boxplot.

17) Aiming to determine the standards of the investments made in hospitals in Sao Paulo (US\$ millions), a state government agency collected data regarding 15 hospitals, as shown in Table 3.7.

TABLE 3.7 Investments in 15 Hospitals in the State of Sao Paulo

Hospital    Investment
A           44
B           12
C           6
D           22
E           60
F           15
G           30
H           200
I           10
J           8
K           4
L           75
M           180
N           50
O           64

We would like you to:
a) Calculate the sample’s arithmetic mean and standard deviation.
b) Eliminate possible outliers.
c) Once again, calculate the sample’s arithmetic mean and standard deviation (without the outliers).
d) What can we say about the standard deviation of the new sample without the outliers?

Chapter 4

Bivariate Descriptive Statistics

Numbers rule the world.
Plato

4.1 INTRODUCTION

The previous chapter discussed descriptive statistics for a single variable (univariate descriptive statistics). This chapter presents the concepts of descriptive statistics involving two variables (bivariate analysis). The main objective of a bivariate analysis is to study the relationships between two variables: associations for qualitative variables and correlations for quantitative variables. These relationships can be studied through joint frequency distributions (contingency tables, also called crossed classification tables or cross tabulations), graphical representations, and summary measures.

The bivariate analysis will be studied in two distinct situations:
a) When the two variables are qualitative;
b) When the two variables are quantitative.

Fig. 4.1 shows the bivariate descriptive statistics that will be studied in this chapter, represented by tables, charts, and summary measures, for each situation:
a) For two qualitative variables, the descriptive statistics used are: (i) joint frequency distribution tables, in this case also called contingency tables or crossed classification tables (cross tabulation); (ii) charts, such as perceptual maps resulting from the correspondence analysis technique (more details can be found in Fávero and Belfiore, 2017); and (iii) measures of association, such as the chi-square statistic (used for nominal and ordinal qualitative variables), the Phi coefficient, the contingency coefficient, and Cramer’s V coefficient (all of them based on chi-square and used for nominal variables), in addition to Spearman’s coefficient (for ordinal qualitative variables).
b) For two quantitative variables, we are going to use joint frequency distribution tables, graphical representations such as the scatter plot, and measures of correlation such as covariance and Pearson’s correlation coefficient.

4.2 ASSOCIATION BETWEEN TWO QUALITATIVE VARIABLES

The main objective is to assess whether there is a relationship between the qualitative or categorical variables studied and how strong the association between them is. This can be done through frequency distribution tables; through summary measures, such as the chi-square statistic (used for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer’s V coefficient (for nominal variables), and Spearman’s coefficient (for ordinal variables); and through graphical representations, such as perceptual maps resulting from the correspondence analysis, as presented in Fávero and Belfiore (2017).

4.2.1 Joint Frequency Distribution Tables

The simplest way to summarize a set of data resulting from two qualitative variables is through a joint frequency distribution table, which in this case is called a contingency table, a crossed classification table (cross tabulation), or even a correspondence table. It jointly shows the absolute or relative frequencies of the categories of variable X, represented in the rows, and of the categories of variable Y, represented in the columns.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00004-5 © 2019 Elsevier Inc. All rights reserved.


FIG. 4.1 Bivariate descriptive statistics depending on the type of variable. [Diagram: for two qualitative variables, tables (contingency tables), charts (perceptual maps), and measures of association (chi-square, Phi coefficient, contingency coefficient, Cramer’s V coefficient, Spearman’s coefficient); for two quantitative variables, tables (frequency distribution), charts (scatter plot), and measures of correlation (covariance, Pearson’s correlation coefficient).]

It is common to add the marginal totals to the contingency table, which correspond to the sums of the rows (categories of variable X) and to the sums of the columns (categories of variable Y). We are going to illustrate this analysis through an example based on Bussab and Morettin (2011).

Example 4.1 A study was carried out with 200 individuals to analyze the joint behavior of variable X (Health insurance agency) and variable Y (Level of satisfaction). The contingency table showing the variables’ joint absolute frequency distribution, in addition to the marginal totals, is shown in Table 4.E.1. These data are available in SPSS format in the file HealthInsurance.sav.

TABLE 4.E.1 Joint Absolute Frequency Distribution of the Variables Being Studied

                 Level of Satisfaction
Agency           Dissatisfied    Neutral    Satisfied    Total
Total Health     40              16         12           68
Live Life        32              24         16           72
Mena Health      24              32         4            60
Total            96              72         32           200

The study can also be carried out based on the relative frequencies, as studied in Chapter 3 for univariate problems. Bussab and Morettin (2011) show three ways to express the proportion of each category:
a) In relation to the general total;
b) In relation to the total of each row;
c) In relation to the total of each column.
The choice among these options depends on the objective of the problem. For example, Table 4.E.2 shows the joint relative frequency distribution of the variables being studied in relation to the general total.


TABLE 4.E.2 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the General Total

                 Level of Satisfaction
Agency           Dissatisfied    Neutral    Satisfied    Total
Total Health     20%             8%         6%           34%
Live Life        16%             12%        8%           36%
Mena Health      12%             16%        2%           30%
Total            48%             36%        16%          100%

First, we are going to analyze the marginal totals of the rows and columns that provide the unidimensional distributions of each variable. The marginal totals of the rows correspond to the sum of the relative frequencies of each category of the variable Agency and the marginal totals of the columns correspond to the sum of each category of the variable Level of satisfaction. Thus, we can conclude that 34% of the individuals are members of Total Health, 36% of Live Life, and 30% of Mena Health. Analogously, we can conclude that 48% of the individuals are dissatisfied with their health insurance agencies, 36% said they were neutral, and only 16% said they were satisfied. Regarding the joint relative frequency distribution of the variables being studied (a contingency table), we can state that 20% of the individuals are members of Total Health and are dissatisfied. The same logic is applied to the other categories of the contingency table. Conversely, Table 4.E.3 shows the joint relative frequency distribution of the variables being studied in relation to the total of each row.

TABLE 4.E.3 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Row

                 Level of Satisfaction
Agency           Dissatisfied    Neutral    Satisfied    Total
Total Health     58.8%           23.5%      17.6%        100%
Live Life        44.4%           33.3%      22.2%        100%
Mena Health      40%             53.3%      6.7%         100%
Total            48%             36%        16%          100%

From Table 4.E.3, we can see that, among the members of Total Health, the ratio of individuals who are dissatisfied is 58.8% (40/68), the ratio of those who are neutral is 23.5% (16/68), and the ratio of those who are satisfied is 17.6% (12/68). The sum of the ratios in the respective row is 100%. The same logic is applied to the other rows.

Finally, Table 4.E.4 shows the joint relative frequency distribution of the variables being studied in relation to the total of each column. Among the dissatisfied individuals, the ratio of members of Total Health is 41.7% (40/96), of members of Live Life, 33.3% (32/96), and of members of Mena Health, 25% (24/96). The sum of the ratios in the respective column is 100%. The same logic is applied to the other columns.

TABLE 4.E.4 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Column

                 Level of Satisfaction
Agency           Dissatisfied    Neutral    Satisfied    Total
Total Health     41.7%           22.2%      37.5%        34%
Live Life        33.3%           33.3%      50%          36%
Mena Health      25%             44.4%      12.5%        30%
Total            100%            100%       100%         100%
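The three kinds of relative frequencies presented in Tables 4.E.2 to 4.E.4 can also be reproduced outside SPSS and Stata. The block below is a minimal Python sketch that takes the absolute frequencies in Table 4.E.1 as input; the names observed, pct_of_total, pct_of_row, and pct_of_col, as well as the helper percent, are our own assumptions for illustration and do not belong to any statistical package.

```python
# Absolute frequencies from Table 4.E.1
# (rows: agencies, columns: levels of satisfaction).
agencies = ["Total Health", "Live Life", "Mena Health"]
levels = ["Dissatisfied", "Neutral", "Satisfied"]
observed = [
    [40, 16, 12],
    [32, 24, 16],
    [24, 32, 4],
]

# Marginal totals of the rows and columns, and the general total.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)


def percent(part, whole):
    """Hypothetical helper: percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# a) In relation to the general total (Table 4.E.2).
pct_of_total = [[percent(o, n) for o in row] for row in observed]

# b) In relation to the total of each row (Table 4.E.3).
pct_of_row = [
    [percent(o, total) for o in row]
    for row, total in zip(observed, row_totals)
]

# c) In relation to the total of each column (Table 4.E.4).
pct_of_col = [
    [percent(o, total) for o, total in zip(row, col_totals)]
    for row in observed
]

for agency, by_total, by_row, by_col in zip(
    agencies, pct_of_total, pct_of_row, pct_of_col
):
    print(agency, by_total, by_row, by_col)
```

Keeping the three percentage tables as separate lists mirrors the three options described by Bussab and Morettin (2011), so that the one that suits the objective of the problem can be chosen.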


Creating Contingency Tables on the SPSS Software

The contingency tables in Example 4.1 will be generated by using SPSS. The use of the images in this chapter has been authorized by the International Business Machines Corporation©.

First, we are going to define the properties of each variable on SPSS. The variables Agency and Level of satisfaction are qualitative but are initially presented as numbers, as shown in the file HealthInsurance_NoLabel.sav. Thus, labels corresponding to each category of both variables must be created, so that:

Labels of the variable Agency:
1 = Total Health
2 = Live Life
3 = Mena Health

Labels of the variable Level of satisfaction, simply called Satisfaction:
1 = Dissatisfied
2 = Neutral
3 = Satisfied

Therefore, we must click on Data → Define Variable Properties… and select the variables that interest us, as seen in Figs. 4.2 and 4.3.

FIG. 4.2 Defining the properties of the variable on SPSS.


FIG. 4.3 Selecting the variables that interest us.

Next, we must click on Continue. Based on Figs. 4.4 and 4.5, note that the variables Agency and Satisfaction were defined as nominal. This definition can also be done in the environment Variable View. The labels must be defined at this moment, as shown in Figs. 4.4 and 4.5. After clicking on OK, the values in the database, initially represented as numbers, are replaced by the respective labels. In the file HealthInsurance.sav, the data have already been labeled.

To create contingency tables (cross tabulation), we are going to click on the menu Analyze → Descriptive Statistics → Crosstabs…, as shown in Fig. 4.6. We are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Next, we must click on Cells…, as shown in Fig. 4.7.

To create contingency tables that represent the joint absolute frequency distribution of the variables observed, the joint relative frequency distribution in relation to the general total, the joint relative frequency distribution in relation to the total of each row, and the joint relative frequency distribution in relation to the total of each column (Tables 4.E.1–4.E.4), we must, from the Crosstabs: Cell Display dialog box (opened after we clicked on Cells…), select the option Observed in Counts and the options Row, Column, and Total in Percentages, as shown in Fig. 4.8. Finally, we are going to click on Continue and OK. The contingency table (cross tabulation) generated by SPSS is shown in Fig. 4.9. Note that the data generated are exactly the same as those presented in Tables 4.E.1–4.E.4.


FIG. 4.4 Defining the labels of variable Agency.

FIG. 4.5 Defining the labels of variable Satisfaction.

FIG. 4.6 Creating contingency tables (cross tabulation) on SPSS.

FIG. 4.7 Creating a contingency table.


FIG. 4.8 Creating contingency tables from the Crosstabs: Cell Display dialog box.

FIG. 4.9 Cross classification table (cross tabulation) generated by SPSS.


Creating Contingency Tables on the Stata Software

In Chapter 3, we learned how to create frequency distribution tables for a single variable on Stata through the command tabulate, or simply tab. In the case of two or more variables, if the objective is to create univariate frequency distribution tables for each variable being analyzed, we must use the command tab1, followed by the list of variables. The same logic must be applied to create joint frequency distribution tables (contingency tables). To create a contingency table on Stata from the absolute frequencies of the variables being observed, we must use the following syntax:

    tabulate variable1* variable2*

or simply:

    tab variable1* variable2*

where the terms variable1* and variable2* must be replaced by the names of the respective variables.

If, in addition to the joint absolute frequency distribution of the variables being observed, we want to obtain the joint relative frequency distribution in relation to the total of each row, to the total of each column, and to the general total, we must use the following syntax:

    tabulate variable1* variable2*, row column cell

or simply:

    tab variable1* variable2*, r co ce

Consider a case with more than two variables being studied, in which the objective is to construct bivariate frequency distribution tables (two-way tables) for all the combinations of variables, two by two. In this case, we must use the command tab2, with the following syntax:

    tab2 variables*

where the term variables* should be replaced by the list of variables being considered in the analysis.

Analogously, to obtain both the joint absolute frequency distribution and the joint relative frequency distributions per row, per column, and per general total, we must use the following syntax:

    tab2 variables*, r co ce

The contingency tables in Example 4.1 will now be generated by using the Stata software. The data are available in the file HealthInsurance.dta. Hence, to obtain the table of joint absolute frequencies, relative frequencies per row, relative frequencies per column, and relative frequencies per general total, the command is:

    tab agency satisfaction, r co ce

The results can be seen in Fig. 4.10 and are similar to those presented in Fig. 4.9 (SPSS).

FIG. 4.10 Contingency table constructed on Stata.

4.2.2 Measures of Association

The main measures that represent the association between two qualitative variables are:
a) The chi-square statistic ($\chi^2$), used for nominal and ordinal qualitative variables;
b) The Phi coefficient, the contingency coefficient, and Cramer’s V coefficient, applied to nominal variables and based on chi-square; and
c) Spearman’s coefficient, used for ordinal variables.

4.2.2.1 Chi-Square Statistic

The chi-square statistic ($\chi^2$) measures the discrepancy between the observed contingency table and the contingency table expected under the hypothesis that there is no association between the variables studied. If the observed frequency distribution is exactly equal to the expected frequency distribution, the chi-square statistic is zero. Therefore, a low value of $\chi^2$ indicates independence between the variables. The $\chi^2$ statistic is given by:

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}} \tag{4.1}$$

where:
$O_{ij}$: number of observations in the ith category of variable X and in the jth category of variable Y;
$E_{ij}$: expected frequency of observations in the ith category of variable X and in the jth category of variable Y;
I: number of categories (rows) of variable X;
J: number of categories (columns) of variable Y.

Example 4.2 Calculate the $\chi^2$ statistic for Example 4.1.

Solution

Table 4.E.5 shows the observed values of the distribution with the respective relative frequencies in relation to the total of each row. The calculation could also be done in relation to the total of each column, arriving at the same result for the $\chi^2$ statistic.

TABLE 4.E.5 Observed Values of Each Category With the Respective Ratios in Relation to the General Total of the Row

                 Level of Satisfaction
Agency           Dissatisfied    Neutral       Satisfied     Total
Total Health     40 (58.8%)      16 (23.5%)    12 (17.6%)    68 (100%)
Live Life        32 (44.4%)      24 (33.3%)    16 (22.2%)    72 (100%)
Mena Health      24 (40%)        32 (53.3%)    4 (6.7%)      60 (100%)
Total            96 (48%)        72 (36%)      32 (16%)      200 (100%)

The data in Table 4.E.5 suggest dependence between the variables. If there were no association between the variables, we would expect, for each of the three health insurance agencies, 48% of the row total in the Dissatisfied column, 36% in the Neutral column, and 16% in the Satisfied column. The calculation of the expected values can be seen in Table 4.E.6. For example, the expected value of the first cell is 0.48 × 68 = 32.64.


TABLE 4.E.6 Expected Values in Table 4.E.5, Assuming the Nonassociation Between the Variables

                 Level of Satisfaction
Agency           Dissatisfied    Neutral       Satisfied     Total
Total Health     32.6 (48%)      24.5 (36%)    10.9 (16%)    68 (100%)
Live Life        34.6 (48%)      25.9 (36%)    11.5 (16%)    72 (100%)
Mena Health      28.8 (48%)      21.6 (36%)    9.6 (16%)     60 (100%)
Total            96 (48%)        72 (36%)      32 (16%)      200 (100%)

To calculate the $\chi^2$ statistic, we must apply expression (4.1) to the data in Tables 4.E.5 and 4.E.6. The calculation of each term $\left(O_{ij} - E_{ij}\right)^2 / E_{ij}$ is shown in Table 4.E.7, jointly with the $\chi^2$ measure resulting from the sum over all the categories.

TABLE 4.E.7 Calculating the $\chi^2$ Statistic

                 Level of Satisfaction
Agency           Dissatisfied    Neutral    Satisfied
Total Health     1.66            2.94       0.12
Live Life        0.19            0.14       1.74
Mena Health      0.80            5.01       3.27
Total            $\chi^2 = 15.861$
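The steps in Tables 4.E.5 to 4.E.7 can be retraced with a short Python sketch. It assumes that the expected frequency of each cell is obtained as (row total × column total) / n, which is equivalent to multiplying the column ratio by the row total, as in 0.48 × 68 = 32.64. The variable names are our own and serve only as an illustration of expression (4.1).

```python
# Observed frequencies from Table 4.E.1
# (rows: agencies, columns: levels of satisfaction).
observed = [
    [40, 16, 12],
    [32, 24, 16],
    [24, 32, 4],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under nonassociation (Table 4.E.6):
# E_ij = (total of row i) * (total of column j) / n
expected = [
    [row_total * col_total / n for col_total in col_totals]
    for row_total in row_totals
]

# Contribution of each cell, (O_ij - E_ij)^2 / E_ij (Table 4.E.7).
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Expression (4.1): sum of the contributions over all the cells.
chi_square = sum(sum(row) for row in contributions)

print([[round(c, 2) for c in row] for row in contributions])
print(round(chi_square, 3))
```

Working with unrounded expected frequencies avoids the small discrepancy that appears when the rounded terms in Table 4.E.7 are added by hand.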

As we are going to study in Chapter 9, which discusses hypotheses tests, the significance level $\alpha$ indicates the probability of rejecting a certain hypothesis when it is true. The P-value, on the other hand, represents the probability associated with the observed sample value, indicating the lowest significance level that would lead to the rejection of the assumed hypothesis. In other words, the P-value works as a decreasing reliability index of a result: the lower its value, the less we can believe in the assumed hypothesis. In the case of the $\chi^2$ statistic, whose test presupposes nonassociation between the variables being studied, most statistical software, including SPSS and Stata, calculates the corresponding P-value. Thus, for a confidence level of 95%, if P-value < 0.05, the hypothesis is rejected and we can state that there is an association between the variables. On the other hand, if P-value > 0.05, the hypothesis of nonassociation is not rejected and we treat the variables as independent. All of these concepts will be studied in more detail in Chapter 9.

Excel calculates the P-value of the $\chi^2$ statistic through the CHITEST or CHISQ.TEST (Excel 2010 and later versions) functions. In order to do that, we just need to select the set of cells corresponding to the observed or real values and the set of cells of the expected values.

Calculating the chi-square statistic on the SPSS software

Analogous to Example 4.1, the chi-square statistic ($\chi^2$) is also calculated on SPSS from the menu Analyze → Descriptive Statistics → Crosstabs…. Once again, we are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Initially, to generate the observed values and the expected values in case of nonassociation between the variables (data in Tables 4.E.5 and 4.E.6), we must click on Cells… and select the options Observed and Expected in Counts, from the Crosstabs: Cell Display dialog box (Fig. 4.11). In the same box, to generate the adjusted standardized residuals, we must select the option Adjusted standardized in Residuals. The results can be seen in Fig. 4.12.

To calculate the $\chi^2$ statistic, in Statistics…, we must select the option Chi-square (Fig. 4.13). Finally, we are going to click on Continue and OK. The result can be seen in Fig. 4.14. Based on Fig. 4.14, we can see that the value of $\chi^2$ is 15.861, the same as the one calculated in Table 4.E.7. We can also observe that the lowest significance level that would lead to the rejection of the nonassociation hypothesis between the variables (P-value) is 0.003. Since 0.003 < 0.05 (for a confidence level of 95%), the null hypothesis is rejected, which allows us to conclude that there is association between the variables.
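The P-value reported by the software can be checked with a short Python sketch. It relies on two facts that are not stated in this section and should be read as assumptions of the sketch: the test uses (I − 1)(J − 1) degrees of freedom, and, for an even number of degrees of freedom, the upper-tail probability of the chi-square distribution has a closed form. The function name chi2_sf_even_df is hypothetical, and the function deliberately refuses odd degrees of freedom, for which this closed form does not apply.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square distribution.

    Valid only for an even, positive number of degrees of freedom, where
    P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles even, positive df")
    half = x / 2
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Example 4.2: 3 x 3 contingency table, chi-square = 15.861.
chi_square = 15.861
rows, cols = 3, 3
df = (rows - 1) * (cols - 1)  # assumption: (I - 1)(J - 1) degrees of freedom

p_value = chi2_sf_even_df(chi_square, df)
print(round(p_value, 3))

# Decision rule for a confidence level of 95%.
reject_nonassociation = p_value < 0.05
```

For tables whose degrees of freedom are odd, such as the 4 × 4 table in Example 4.3, a statistical package remains the practical choice.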


FIG. 4.11 Creating the contingency table with the observed frequencies, the expected frequencies, and the residuals.

FIG. 4.12 Contingency table with the observed values, the expected values, and the residuals, assuming the nonassociation between the variables.


FIG. 4.13 Selecting the $\chi^2$ statistic.

FIG. 4.14 Result of the $\chi^2$ statistic.

Calculating the $\chi^2$ statistic on the Stata software

In Section 4.2.1, we learned how to create contingency tables on Stata through the command tabulate, or simply tab. Besides the observed frequencies, this command also gives us the expected frequencies through the option expected, or simply exp, as well as the $\chi^2$ statistic through the option chi2, or simply ch. For the data in Example 4.1, available in the file HealthInsurance.dta, to obtain the observed and expected frequency distribution tables, jointly with the $\chi^2$ statistic, we are going to use the following command:

    tab agency satisfaction, exp ch

However, the command tab does not allow residuals to be generated in the output. As an alternative, the command tabchi was developed from a tabulation module created by Nicholas J. Cox, allowing the adjusted standardized residuals to be calculated too. In order for this command to be used, we must initially type:

    findit tabchi

and install it from the link tab_chi at http://fmwww.bc.edu/RePEc/bocode/t. After doing this, we can type the following command:

    tabchi agency satisfaction, a

The result is shown in Fig. 4.15 and is similar to those presented in Figs. 4.12 and 4.14 on the SPSS software. Note that, unlike the command tab, which requires the option exp so that the expected frequencies can be generated, the command tabchi gives them to us automatically.

FIG. 4.15 Result of the $\chi^2$ statistic on Stata.

4.2.2.2 Other Measures of Association Based on Chi-Square

The main measures of association based on the chi-square statistic ($\chi^2$) are Phi, Cramer’s V coefficient, and the contingency coefficient (C), all of them applied to nominal qualitative variables. In general, an association or correlation coefficient is a measure that varies between 0 and 1, presenting value 0 when there is no relationship between the variables, and value 1 when they are perfectly related. We are going to see how each one of the coefficients studied in this section behaves in relation to these characteristics.

a) Phi Coefficient

The Phi coefficient is the simplest measure of association for nominal variables based on $\chi^2$, and it can be expressed as follows:

$$\text{Phi} = \sqrt{\frac{\chi^2}{n}} \tag{4.2}$$

where n is the sample size. In order for Phi to vary only between 0 and 1, it is necessary for the contingency table to have a 2 x 2 dimension.

Example 4.3 In order to offer high-quality services and meet their customers’ expectations, Ivanblue, a company in the male fashion industry, is investing in strategies to segment the market. Currently, the company has four stores in Campinas, located in the north, center, south, and east regions of the city, and sells four types of clothes: ties, shirts, polo shirts, and pants. Table 4.E.8 shows the purchase data of 20 customers, namely the type of clothes purchased and the region of the store. Check if there is association between the two variables using the Phi coefficient.


TABLE 4.E.8 Purchase Data of 20 Customers

Customer    Clothes       Region
1           Tie           South
2           Polo shirt    North
3           Shirt         South
4           Pants         North
5           Tie           South
6           Polo shirt    Center
7           Polo shirt    East
8           Tie           South
9           Shirt         South
10          Tie           Center
11          Pants         North
12          Pants         Center
13          Tie           Center
14          Polo shirt    East
15          Pants         Center
16          Tie           Center
17          Pants         South
18          Pants         North
19          Polo shirt    East
20          Shirt         Center

Solution

Using the procedure described in the previous section, the value of the chi-square statistic is $\chi^2 = 18.214$. Therefore:

$$\text{Phi} = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{18.214}{20}} = 0.954$$

Since both variables have four categories, the condition $0 \le \text{Phi} \le 1$ is not guaranteed in this case, making it difficult to interpret how strong the association is.

b) Contingency coefficient

The contingency coefficient (C), also known as Pearson’s contingency coefficient, is another measure of association for nominal variables based on the $\chi^2$ statistic, being represented by the following expression:

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}} \tag{4.3}$$

where n is the sample size. The contingency coefficient (C) has 0 as its lower limit, indicating that there is no relationship between the variables; however, the upper limit of C varies depending on the number of categories, so:

$$0 \le C \le \sqrt{\frac{q - 1}{q}} \tag{4.4}$$


where:

$$q = \min(I, J) \tag{4.5}$$

where I is the number of rows and J is the number of columns in a contingency table. When $C = \sqrt{(q - 1)/q}$, there is a perfect association between the variables; however, this limit never assumes the value 1. Hence, two contingency coefficients can only be compared if both are defined from tables with the same number of rows and columns.

Example 4.4 Calculate the contingency coefficient (C) for the data in Example 4.3.

Solution

We calculate C as follows:

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{18.214}{20 + 18.214}} = 0.690$$

Since the contingency table is 4 × 4 (q = min(4, 4) = 4), the values that C can assume are in the interval:

$$0 \le C \le \sqrt{\frac{3}{4}} \;\Rightarrow\; 0 \le C \le 0.866$$

We can conclude that there is association between the variables.

c) Cramer’s V coefficient

Another measure of association for nominal variables based on the $\chi^2$ statistic is Cramer’s V coefficient, calculated by:

$$V = \sqrt{\frac{\chi^2}{n\,(q - 1)}} \tag{4.6}$$

where q = min(I, J), as presented in expression (4.5). For 2 x 2 contingency tables, expression (4.6) becomes $V = \sqrt{\chi^2 / n}$, which corresponds to the Phi coefficient. Cramer’s V coefficient is an alternative to the Phi coefficient and to the contingency coefficient (C), and its value is always limited to the interval [0, 1], regardless of the number of categories in the rows and columns:

$$0 \le V \le 1 \tag{4.7}$$

Value 0 indicates that the variables do not have any kind of association and value 1 shows that they are perfectly associated. Therefore, Cramer’s V coefficient allows us to compare contingency tables that have different dimensions.

Example 4.5 Calculate Cramer’s V coefficient for the data in Example 4.3.

Solution

$$V = \sqrt{\frac{\chi^2}{n\,(q - 1)}} = \sqrt{\frac{18.214}{20 \times 3}} = 0.551$$

Since $0 \le V \le 1$, the value 0.551 indicates that there is association between the variables; however, it is not considered very strong.

Solution of Examples 4.3, 4.4, and 4.5 (calculation of the Phi, contingency, and Cramer’s V coefficients) by using SPSS

In Section 4.2.1, we discussed how to create labels that correspond to the variable categories from the menu Data → Define Variable Properties…. The same procedure must be applied to the data in Table 4.E.8 (we cannot forget to define the variables as nominal). The file Market_Segmentation.sav gives us these data already tabulated on SPSS.


FIG. 4.16 Selecting the contingency coefficient and Phi and Cramer’s V coefficients.

FIG. 4.17 Results of the contingency coefficient and Phi and Cramer’s V coefficients.

Similar to the calculation of the $\chi^2$ statistic, calculating the Phi, contingency, and Cramer’s V coefficients on SPSS can also be done on the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Clothes in Row(s) and the variable Region in Column(s). In Statistics…, we are going to select the options Contingency coefficient and Phi and Cramer’s V (Fig. 4.16). Note that these coefficients are calculated for nominal variables. The results of the statistics can be seen in Fig. 4.17. For all three coefficients, the P-value of 0.033 (0.033 < 0.05) indicates that there is association between the variables being studied.

Solution of Examples 4.3 and 4.5 (calculation of the Phi and Cramer’s V coefficients) by using Stata

Stata calculates the Phi and Cramer’s V coefficients through the command phi. Hence, they are going to be calculated for the data in Example 4.3, available in the file Market_Segmentation.dta.


FIG. 4.18 Calculating the Phi and Cramer’s V coefficients on Stata.

In order for the phi command to be used, we must initially type:

    findit phi

and install it from the link snp3.pkg at http://www.stata.com/stb/stb3/. After doing this, we can type the following command:

    phi clothes region

The results can be seen in Fig. 4.18. Note that Stata calls the Phi coefficient Cohen’s w and Cramer’s V coefficient Cramer’s phi-prime.
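Examples 4.3 to 4.5 can also be retraced by hand from the raw purchase data. The following Python sketch builds the 4 × 4 contingency table from Table 4.E.8, computes $\chi^2$ through expression (4.1), and then applies expressions (4.2), (4.3), (4.4), and (4.6). It is only an illustration under our own naming (purchases, cell_counts, c_max, and so on) and does not reproduce the internals of SPSS or of the Stata command phi.

```python
import math
from collections import Counter

# (Clothes, Region) for customers 1 to 20, as listed in Table 4.E.8.
purchases = [
    ("Tie", "South"), ("Polo shirt", "North"), ("Shirt", "South"),
    ("Pants", "North"), ("Tie", "South"), ("Polo shirt", "Center"),
    ("Polo shirt", "East"), ("Tie", "South"), ("Shirt", "South"),
    ("Tie", "Center"), ("Pants", "North"), ("Pants", "Center"),
    ("Tie", "Center"), ("Polo shirt", "East"), ("Pants", "Center"),
    ("Tie", "Center"), ("Pants", "South"), ("Pants", "North"),
    ("Polo shirt", "East"), ("Shirt", "Center"),
]

n = len(purchases)
cell_counts = Counter(purchases)                       # observed O_ij
row_totals = Counter(clothes for clothes, _ in purchases)
col_totals = Counter(region for _, region in purchases)

# Expression (4.1); a Counter returns 0 for combinations never observed.
chi_square = 0.0
for clothes, row_total in row_totals.items():
    for region, col_total in col_totals.items():
        expected = row_total * col_total / n
        observed = cell_counts[(clothes, region)]
        chi_square += (observed - expected) ** 2 / expected

q = min(len(row_totals), len(col_totals))              # expression (4.5)

phi = math.sqrt(chi_square / n)                        # expression (4.2)
contingency_c = math.sqrt(chi_square / (n + chi_square))  # expression (4.3)
c_max = math.sqrt((q - 1) / q)                         # upper limit in (4.4)
cramers_v = math.sqrt(chi_square / (n * (q - 1)))      # expression (4.6)

print(round(chi_square, 3), round(phi, 3), round(contingency_c, 3),
      round(c_max, 3), round(cramers_v, 3))
```

Comparing contingency_c with c_max makes the point of expression (4.4) concrete: C must be read against its own upper limit, whereas Cramer’s V can be read directly on the interval [0, 1].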

4.2.2.3 Spearman’s Coefficient

Spearman’s coefficient ($r_{sp}$) is a measure of association between two ordinal qualitative variables. Initially, we must sort the data of variable X and of variable Y in ascending order. After sorting the data, it is possible to create ranks or rankings, denoted by k (k = 1, …, n). Ranks are assigned separately for each variable. Rank 1 is assigned to the smallest value of the variable, rank 2 to the second smallest value, and so on, up to rank n for the highest value. In case of a tie between the values in positions k and k + 1, we must assign rank k + 1/2 to both observations. Spearman’s coefficient can be calculated by using the following expression:

$$r_{sp} = 1 - \frac{6 \sum_{k=1}^{n} d_k^2}{n\,(n^2 - 1)} \tag{4.8}$$

where:
n: number of observations (pairs of values);
$d_k$: difference between the rankings of order k.

Spearman’s coefficient is a measure that varies between −1 and 1. If $r_{sp} = 1$, all the values of $d_k$ are null, indicating that all the rankings are equal for variables X and Y (perfect positive association). The value $r_{sp} = -1$ is found when $\sum_{k=1}^{n} d_k^2 = n\,(n^2 - 1)/3$, its maximum value (there is an inversion in the values of the variable rankings), indicating a perfect negative association. When $r_{sp} = 0$, there is no association between variables X and Y. Fig. 4.19 shows a summary of this interpretation. This interpretation is similar to that of Pearson’s correlation coefficient, which will be studied in Section 4.3.3.2.

FIG. 4.19 Interpretation of Spearman’s coefficient.


Example 4.6 The coordinator of the Business Administration course is analyzing if there is any kind of association between the grades of 10 students in two different subjects: Simulation and Finance. The data regarding this problem are presented in Table 4.E.9. Calculate Spearman’s coefficient.

TABLE 4.E.9 Grades in the Subjects Simulation and Finance of the 10 Students Being Analyzed

Student   Simulation   Finance
1         4.7          6.6
2         6.3          5.1
3         7.5          6.9
4         5.0          7.1
5         4.4          3.5
6         3.7          4.6
7         8.5          6.8
8         8.2          7.5
9         3.5          4.2
10        4.0          3.3

Solution To calculate Spearman’s coefficient, first, we are going to assign rankings to each category of each variable depending on their respective values, as shown in Table 4.E.10.

TABLE 4.E.10 Ranks in the Subjects Simulation and Finance of the 10 Students

Student   Rank in Simulation   Rank in Finance   d_k   d_k^2
1         5                    6                 -1    1
2         7                    5                  2    4
3         8                    8                  0    0
4         6                    9                 -3    9
5         4                    2                  2    4
6         2                    4                 -2    4
7         10                   7                  3    9
8         9                    10                -1    1
9         1                    3                 -2    4
10        3                    1                  2    4
Sum                                                    40

Here d_k is the rank in Simulation minus the rank in Finance; only its square enters expression (4.8).


Applying expression (4.8), we have:

$$r_{sp} = 1 - \frac{6 \sum_{k=1}^{n} d_k^2}{n\,(n^2 - 1)} = 1 - \frac{6 \times 40}{10 \times 99} = 0.7576$$
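The calculation in Example 4.6 can be reproduced with a short Python sketch. The function names `average_ranks` and `spearman` are illustrative choices, not part of SPSS or Stata. The code ranks each variable, giving tied values the mean of the ranks they occupy, and then applies expression (4.8) directly.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient computed with expression (4.8)."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Grades from Table 4.E.9 (students 1 to 10).
simulation = [4.7, 6.3, 7.5, 5.0, 4.4, 3.7, 8.5, 8.2, 3.5, 4.0]
finance = [6.6, 5.1, 6.9, 7.1, 3.5, 4.6, 6.8, 7.5, 4.2, 3.3]

print(round(spearman(simulation, finance), 4))  # 0.7576
```

The data in this example contain no ties. When ties are present, expression (4.8) is generally regarded as an approximation, so results may differ slightly from those reported by statistical software.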

The value 0.758 indicates a strong positive association between the variables.

Calculating Spearman’s coefficient using SPSS software

File Grades.sav shows the data from Example 4.6 (rankings in Table 4.E.9) tabulated in an ordinal scale (defined in the environment Variable View). Similar to the calculation of the χ² statistic and the Phi, contingency, and Cramer’s V coefficients, Spearman’s coefficient can also be generated by SPSS from the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Simulation in Row(s) and the variable Finance in Column(s). In Statistics…, we are going to select the option Correlations (Fig. 4.20). We are going to click on Continue and then, finally, on OK. The result of Spearman’s coefficient is shown in Fig. 4.21. The P-value 0.011 < 0.05 leads us to reject the hypothesis of nonassociation between the variables at the 5% significance level, indicating that there is a correlation between the grades in Simulation and Finance.

Spearman’s coefficient can also be calculated in the menu Analyze → Correlate → Bivariate…. We must select the variables that interest us, in addition to Spearman’s coefficient, as shown in Fig. 4.22. We are going to click on OK, resulting in Fig. 4.23.

FIG. 4.20 Calculating Spearman’s coefficient from the Crosstabs: Statistics dialog box.

FIG. 4.21 Result of Spearman’s coefficient from the Crosstabs: Statistics dialog box.


Calculating Spearman’s coefficient by using Stata software

In Stata, Spearman’s coefficient is calculated using the command spearman. Therefore, for the data in Example 4.6, available in the file Grades.dta, we must type the following command:

spearman simulation finance

The results can be seen in Fig. 4.24.

FIG. 4.22 Calculating Spearman’s coefficient from the Bivariate Correlations dialog box.

FIG. 4.23 Result of Spearman’s coefficient from the Bivariate Correlations dialog box.

FIG. 4.24 Result of Spearman’s coefficient on Stata.

4.3

CORRELATION BETWEEN TWO QUANTITATIVE VARIABLES

In this section, the main objective is to assess whether there is a relationship between the quantitative variables being studied and how strong that relationship is. This can be done through frequency distribution tables, graphical representations such as scatter plots, and measures of correlation such as the covariance and Pearson’s correlation coefficient.

4.3.1

Joint Frequency Distribution Tables

The same procedure presented for qualitative variables can be used to represent the joint distribution of quantitative variables and to analyze the possible relationships between them. As in the study of univariate descriptive statistics, continuous data that do not repeat themselves with a certain frequency can be grouped into class intervals.

4.3.2

Graphical Representation Through a Scatter Plot

The correlation between two quantitative variables can be represented graphically through a scatter plot, which shows the values of variables X and Y in a Cartesian plane. Therefore, a scatter plot allows us to assess:
a) Whether there is any relationship between the variables being studied or not;
b) The type of relationship between the two variables, that is, the direction in which variable Y increases or decreases depending on changes in X;
c) The level of relationship between the variables;
d) The nature of the relationship (linear, exponential, among others).

Fig. 4.25 shows a scatter plot in which the relationship between variables X and Y is strong positive linear, that is, variations in Y are directly proportional to variations in X. The level of relationship between the variables is strong and the nature is linear. If all the points are contained in a straight line, we have a case in which the relationship is perfect linear, as shown in Fig. 4.26. Figs. 4.27 and 4.28, on the other hand, show scatter plots in which the relationship between variables X and Y is strong negative linear and perfect negative linear, respectively.

FIG. 4.25 Strong positive linear relationship.

FIG. 4.26 Perfect positive linear relationship.


FIG. 4.27 Strong negative linear relationship.

FIG. 4.28 Perfect negative linear relationship.

FIG. 4.29 There is no relationship between variables X and Y.

Finally, we may have a case in which there is no relationship between variables X and Y, as shown in Fig. 4.29.

Constructing a scatter plot on SPSS

Example 4.7
Let us open file Income_Education.sav on SPSS. The objective is to analyze the correlation between the variables Family Income and Years of Education through a scatter plot. In order to do that, we are going to click on Graphs → Legacy Dialogs → Scatter/Dot… (Fig. 4.30). In the window Scatter/Dot in Fig. 4.31, we are going to select the type of chart (Simple Scatter). Clicking on Define, the Simple Scatterplot dialog box will open, as shown in Fig. 4.32. We are going to select the variable FamilyIncome in the Y-axis and the variable YearsofEducation in the X-axis. Next, we are going to click on OK. The scatter plot created is shown in Fig. 4.33.

Based on Fig. 4.33, we can see a strong positive correlation between the variables Family Income and Years of Education. Therefore, the higher the number of years of education, the higher the family income tends to be, although this does not by itself establish a cause-and-effect relationship.


FIG. 4.30 Constructing a scatter plot on SPSS.

FIG. 4.31 Selecting the type of chart.

The scatter plot can also be created in Excel by selecting the option Scatter.

Constructing a scatter plot on Stata

The data from Example 4.7 are also available on Stata from the file Income_Education.dta. The variables being studied are called income and education. The scatter plot on Stata is created using the command twoway scatter (or simply tw sc) followed by the variables we are interested in. Thus, to analyze the correlation between the variables Family Income and Years of Education through a scatter plot on Stata, we must type the following command:

tw sc income education

The resulting scatter plot is shown in Fig. 4.34.

FIG. 4.32 Simple Scatterplot dialog box.

FIG. 4.33 Scatter plot of the variables Family Income and Years of Education. (Vertical axis: Family income; horizontal axis: Years of education.)


FIG. 4.34 Scatter plot on Stata. (Vertical axis: Family income; horizontal axis: Years of education.)

4.3.3

Measures of Correlation

The main measures of correlation, used for quantitative variables, are the covariance and Pearson’s correlation coefficient.

4.3.3.1 Covariance

Covariance measures the joint variation between two quantitative variables X and Y, and it is calculated by using the following expression:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right) \cdot \left(Y_i - \bar{Y}\right)}{n - 1} \tag{4.9}$$

where:
$X_i$: ith value of X;
$Y_i$: ith value of Y;
$\bar{X}$: mean of the values of $X_i$;
$\bar{Y}$: mean of the values of $Y_i$;
n: sample size.

One of the limitations of the covariance is that its value depends on the units in which X and Y are measured, so its magnitude is hard to interpret on its own; with small samples, it may also be a poor estimate. Pearson’s correlation coefficient, which is dimensionless, is an alternative for this problem.

Example 4.8
Once again, consider the data in Example 4.7 regarding the variables Family Income and Years of Education. The data are also available in Excel in the file Income_Education.xls. Calculate the covariance of the data matrix of both variables.

Solution
Applying expression (4.9), we have:

$$\mathrm{cov}(X, Y) = \frac{(7.6 - 7.08)(1{,}961 - 1{,}856.22) + \cdots + (5.4 - 7.08)(775 - 1{,}856.22)}{95} = \frac{72{,}326.93}{95} = 761.336$$

The covariance can be calculated in Excel by using the COVARIANCE.S (sample) function. In the following section, we are also going to discuss how the covariance can be calculated on SPSS, jointly with Pearson’s correlation coefficient. SPSS considers the same expression presented in this section.
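Expression (4.9) translates almost line by line into code. The sketch below is only an illustration: the function name `sample_covariance` and the five data pairs are hypothetical and are not taken from Income_Education.xls.

```python
def sample_covariance(x, y):
    """Sample covariance as in expression (4.9), with n - 1 in the denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_products / (n - 1)


# Hypothetical illustration data (not from the book's files).
education = [5.0, 6.0, 7.0, 8.0, 9.0]               # years of education
income = [900.0, 1500.0, 1800.0, 2600.0, 3200.0]    # family income

print(sample_covariance(education, income))  # 1425.0

# Re-expressing income in thousands rescales the covariance by the same factor.
income_thousands = [value / 1000 for value in income]
print(sample_covariance(education, income_thousands))  # about 1.425
```

The second call shows why the raw magnitude of a covariance is difficult to compare across datasets: changing the unit of one variable changes the result, even though the relationship between the variables is the same.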


FIG. 4.35 Interpretation of Pearson’s correlation coefficient.

4.3.3.2 Pearson’s Correlation Coefficient

Pearson’s correlation coefficient (r) is a measure that varies between −1 and 1. Through the sign, it is possible to verify the type of linear relationship between the two variables analyzed (the direction in which variable Y increases or decreases depending on how X changes); the closer it is to the extreme values, the stronger the correlation between them. Therefore:
– If r is positive, there is a directly proportional relationship between the variables; if r = 1, we have a perfect positive linear correlation.
– If r is negative, there is an inversely proportional relationship between the variables; if r = −1, we have a perfect negative linear correlation.
– If r is null, there is no linear correlation between the variables.

Fig. 4.35 shows a summary of the interpretation of Pearson’s correlation coefficient.

Pearson’s correlation coefficient (r) can be calculated as the ratio between the covariance of two variables and the product of the standard deviations (S) of each one of them:

$$r = \frac{\mathrm{cov}(X, Y)}{S_X \cdot S_Y} = \frac{\dfrac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right) \cdot \left(Y_i - \bar{Y}\right)}{n - 1}}{S_X \cdot S_Y} \tag{4.10}$$

Since $S_X = \sqrt{\dfrac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}{n - 1}}$ and $S_Y = \sqrt{\dfrac{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2}{n - 1}}$, as we studied in Chapter 3, expression (4.10) becomes:

$$r = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right) \cdot \left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2}} \tag{4.11}$$

In Chapter 12, we are going to use Pearson’s correlation coefficient extensively when studying factor analysis.

Example 4.9
Once again, open the file Income_Education.xls and calculate Pearson’s correlation coefficient between the two variables.

Solution
Calculating Pearson’s correlation coefficient through expression (4.10) is as follows:

$$r = \frac{\mathrm{cov}(X, Y)}{S_X \cdot S_Y} = \frac{761.336}{970.774 \times 1.009} = 0.777$$

This calculation could also be done by using expression (4.11), in which the factor n − 1 cancels out. The result indicates a strong positive correlation between the variables Family Income and Years of Education.
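To see that expressions (4.10) and (4.11) give the same value, both can be coded side by side. This is a minimal sketch: the function names and the five data pairs are hypothetical, and only the last line uses figures from Example 4.9.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_4_10(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (s_x * s_y)


def pearson_4_11(x, y):
    """Same coefficient after cancelling the n - 1 terms."""
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return numerator / denominator


# Hypothetical illustration data (not from the book's files).
education = [5.0, 6.0, 7.0, 8.0, 9.0]
income = [900.0, 1500.0, 1800.0, 2600.0, 3200.0]

print(pearson_4_10(education, income))
print(pearson_4_11(education, income))

# Arithmetic of Example 4.9, using the covariance and standard deviations given there.
print(round(761.336 / (970.774 * 1.009), 3))  # 0.777
```

Both functions return the same coefficient up to floating-point rounding, which is what the algebra leading from (4.10) to (4.11) implies.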


FIG. 4.36 Bivariate Correlations dialog box.

Excel also calculates Pearson’s correlation coefficient through the PEARSON function.

Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson’s correlation coefficient) on SPSS

Once again, open the file Income_Education.sav. To calculate the covariance and Pearson’s correlation coefficient on SPSS, we are going to click on Analyze → Correlate → Bivariate…. The Bivariate Correlations window will open. We are going to select the variables Family Income and Years of Education, in addition to Pearson’s correlation coefficient, as shown in Fig. 4.36. In Options…, we must select the option Cross-product deviations and covariances, according to Fig. 4.37. We are going to click on Continue and then on OK. The results of the statistics are presented in Fig. 4.38.

FIG. 4.37 Selecting the covariance statistic.


FIG. 4.38 Results of the covariance and of Pearson’s correlation coefficient on SPSS.

FIG. 4.39 Calculating Pearson’s correlation coefficient on Stata.

FIG. 4.40 Calculating the covariance on Stata.

Analogous to Spearman’s coefficient, Pearson’s correlation coefficient can also be generated on SPSS from the menu Analyze → Descriptive Statistics → Crosstabs… (option Correlations in the Statistics… button).

Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson’s correlation coefficient) on Stata

To calculate Pearson’s correlation coefficient on Stata, we must use the command correlate, or simply corr, followed by the list of variables we are interested in. The result is the correlation matrix between the respective variables. Once again, open the file Income_Education.dta. Thus, for the data in this file, we can type the following command:

corr income education

The result can be seen in Fig. 4.39. To calculate the covariance, we must use the option covariance, or only cov, at the end of the command correlate (or simply corr). Thus, to generate Fig. 4.40, we must type the following command:

corr income education, cov

4.4

FINAL REMARKS

This chapter presented the main concepts of descriptive statistics with greater focus on the study of the relationship between two variables (bivariate analysis). We studied the relationships between two qualitative variables (associations) and between two quantitative variables (correlations). For each situation, several measures, tables, and charts were presented, which allow us to have a better understanding of the data behavior. Fig. 4.1 summarizes this information.


The construction and interpretation of frequency distributions, graphical representations, in addition to summary measures (measures of position or location and measures of dispersion or variability), allow the researcher to have a better understanding and visualization of the data behavior for two variables simultaneously. More advanced techniques can be applied in the future to the same set of data, so that researchers can go deeper in their studies on bivariate analysis, aiming at improving the quality of the decision making process.

4.5

EXERCISES

1) Which descriptive statistics can be used (and in which situations) to represent the behavior of two qualitative variables simultaneously?
2) And to represent the behavior of two quantitative variables?
3) In what situations should we use contingency tables?
4) What are the differences between the chi-square statistic (χ²), the Phi coefficient, the contingency coefficient (C), Cramer’s V coefficient, and Spearman’s coefficient?
5) What are the main summary measures to represent the data behavior between two quantitative variables? Describe each one of them.
6) Aiming at identifying the behavior of customers who are in default regarding their payments, a survey with information on the age and level of default of the respondents was carried out. The objective is to determine if there is an association between the variables. Based on the files Default.sav and Default.dta, we would like you to:
a) Create the joint frequency distribution tables for the variables age_group and default (absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each line, relative frequencies in relation to the total of each column, and the expected frequencies).
b) Determine the percentage of individuals who are between 31 and 40 years of age.
c) Determine the percentage of individuals who are heavily indebted.
d) Determine the percentage of respondents who are 20 years old or younger and do not have debts.
e) Determine, among the individuals who are older than 60, the percentage of those who are a little indebted.
f) Determine, among the individuals who are relatively indebted, the percentage of those who are between 41 and 50 years old.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer’s V coefficients, confirming whether there is an association between the variables or not.
7) The files Motivation_Companies.sav and Motivation_Companies.dta show a database with the variables Company and Level of Motivation (Motivation), obtained through a survey carried out with 250 employees (50 respondents for each one of the 5 companies surveyed), aiming at assessing the employees’ level of motivation in relation to the companies, considered to be large firms. Hence, we would like you to:
a) Create the contingency tables of absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each line, relative frequencies in relation to the total of each column, and the expected frequencies.
b) Calculate the percentage of respondents who are very demotivated.
c) Calculate the percentage of respondents who are from Company A and are very demotivated.
d) Calculate the percentage of motivated respondents in Company D.
e) Calculate the percentage of little motivated respondents in Company C.
f) Among the respondents who are very motivated, determine the percentage of those who work for Company B.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer’s V coefficients, confirming whether there is an association between the variables or not.
8) The files Students_Evaluation.sav and Students_Evaluation.dta show the grades, from 0 to 10, of 100 students from a public university in relation to the following subjects: Operational Research, Statistics, Operations Management, and Finance. Check whether there is a correlation between the following pairs of variables, constructing the scatter plot and calculating Pearson’s correlation coefficient:
a) Operational Research and Statistics;
b) Operations Management and Finance;
c) Operational Research and Operations Management.


9) The files Brazilian_Supermarkets.sav and Brazilian_Supermarkets.dta show revenue data and the number of stores of the 20 largest Brazilian supermarket chains in a given year (source: ABRAS - Brazilian Association of Supermarkets). We would like you to:
a) Create the scatter plot for the variables revenue x number of stores.
b) Calculate Pearson’s correlation coefficient between the two variables.
c) Exclude the four largest supermarket chains in terms of revenue, as well as the chain AM/PM Food and Beverages Ltd., and once again create the scatter plot.
d) Once again, calculate Pearson’s correlation coefficient between the two variables being studied.

Chapter 5

Introduction to Probability

Do you want to sell sugar water for the rest of your life, or do you want to come with me and change the world?
Steve Jobs

5.1

INTRODUCTION

In the previous part of this book, we studied descriptive statistics, which describes and summarizes the main characteristics observed in a dataset through frequency distribution tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data. Probabilistic statistics, on the other hand, uses probability theory to explain how often uncertain events happen, in order to estimate or predict the occurrence of future events. For example, when rolling dice, we do not know for sure which value will appear, so probability can be used to indicate how likely a certain event is.

According to Bruni (2011), the history of probability presumably started with the cave men, who needed to understand nature’s uncertain phenomena better. In the 17th century, probability theory appeared to explain uncertain events. The study of probability evolved to help plan moves or develop strategies meant for gambling. Currently, it is also applied to the study of statistical inference, in order to generalize results from a sample to the population.

This chapter has as its main objective to present the concepts and terminology related to probability theory, as well as their practical application.

5.2

TERMINOLOGY AND CONCEPTS

5.2.1

Random Experiment

An experiment consists of any observation or measurement process. A random experiment is one whose result cannot be predicted with certainty, even when the process is repeated several times under the same conditions. Flipping a coin and rolling dice are examples of random experiments.

5.2.2

Sample Space

Sample space S consists of all the possible results of an experiment. For example, when flipping a coin, we can get heads (H) or tails (T). Therefore, $S = \{H, T\}$. On the other hand, when rolling a die, the sample space is represented by $S = \{1, 2, 3, 4, 5, 6\}$.

5.2.3

Events

An event is any subset of a sample space. For example, event A only contains the even occurrences of rolling a die. Therefore, $A = \{2, 4, 6\}$.

5.2.4

Unions, Intersections, and Complements

Two or more events can form unions, intersections, and complements. The union of two events A and B, represented by $A \cup B$, results in a new event containing all the elements of A, B, or both, and can be illustrated according to Fig. 5.1.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00005-7 © 2019 Elsevier Inc. All rights reserved.

PART

III Probabilistic Statistics

FIG. 5.1 Union of two events ($A \cup B$).

The intersection of two events A and B, represented by $A \cap B$, results in a new event containing all the elements that are simultaneously in A and B, and can be illustrated according to Fig. 5.2. The complement of an event A, represented by $A^c$, is the event that contains all the points of S that are not in A, as shown in Fig. 5.3.
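These three operations map directly onto Python’s built-in set type. In the sketch below, S and A are the die events used earlier in this section, while B is a hypothetical event chosen only for illustration.

```python
S = {1, 2, 3, 4, 5, 6}   # sample space of rolling a die
A = {2, 4, 6}            # even numbers
B = {4, 5, 6}            # hypothetical event: a number greater than 3

union = A | B            # elements in A, in B, or in both
intersection = A & B     # elements in A and B simultaneously
complement_A = S - A     # elements of S that are not in A

print(sorted(union))         # [2, 4, 5, 6]
print(sorted(intersection))  # [4, 6]
print(sorted(complement_A))  # [1, 3, 5]
```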

5.2.5

Independent Events

Two events A and B are independent when the probability of B happening is not conditional on event A happening. The concept of conditional probability will be discussed in Section 5.5.

5.2.6

Mutually Exclusive Events

Mutually exclusive events are those that do not have any elements in common, so they cannot happen simultaneously. Fig. 5.4 illustrates two events A and B that are mutually exclusive.

FIG. 5.2 Intersection of two events ($A \cap B$).

FIG. 5.3 Complement of event A.

FIG. 5.4 Events A and B that are mutually exclusive.

5.3

DEFINITION OF PROBABILITY

The probability of a certain event A happening in sample space S is given by the ratio between the number of cases favorable to the event ($n_A$) and the total number of possible cases (n):

$$P(A) = \frac{n_A}{n} = \frac{\text{number of cases favorable to event } A}{\text{total number of possible cases}} \tag{5.1}$$

Example 5.1
When rolling a die, what is the probability of getting an even number?

Solution
The sample space is given by $S = \{1, 2, 3, 4, 5, 6\}$. The event we are interested in is A = {even numbers on a die}, so $A = \{2, 4, 6\}$. Therefore, the probability of A happening is:

$$P(A) = \frac{3}{6} = \frac{1}{2}$$
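Expression (5.1) amounts to counting, as the following sketch shows. The helper name `classical_probability` is an illustrative choice, and `fractions.Fraction` is used so that the result stays an exact ratio rather than a rounded decimal.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """Favorable cases divided by possible cases, assuming equally likely outcomes."""
    favorable = len(event & sample_space)
    return Fraction(favorable, len(sample_space))


sample_space = {1, 2, 3, 4, 5, 6}
even = {x for x in sample_space if x % 2 == 0}

print(classical_probability(even, sample_space))  # 1/2
```

The function is only valid when every outcome in the sample space is equally likely, which is the assumption behind expression (5.1).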

Example 5.2
A gravity-pick machine contains three white balls, two red balls, four yellow balls, and two black balls. What is the probability of a red ball being drawn?

Solution
Given a total of 11 balls and considering A = {the ball is red}, the probability is:

$$P(A) = \frac{\text{number of red balls}}{\text{total number of balls}} = \frac{2}{11}$$

5.4

BASIC PROBABILITY RULES

5.4.1

Probability Variation Field

The probability of an event A happening is a number between 0 and 1:

$$0 \le P(A) \le 1 \tag{5.2}$$

5.4.2

Probability of the Sample Space

Sample space S has probability equal to 1:

$$P(S) = 1 \tag{5.3}$$

5.4.3

Probability of an Empty Set

The probability of an empty set ($\varnothing$) occurring is null:

$$P(\varnothing) = 0 \tag{5.4}$$

5.4.4

Probability Addition Rule

The probability of event A, event B, or both happening can be calculated as follows:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{5.5}$$

If events A and B are mutually exclusive, that is, $A \cap B = \varnothing$, the probability of one of them happening is equal to the sum of the individual probabilities:

$$P(A \cup B) = P(A) + P(B) \tag{5.6}$$

Expression (5.6) can be extended to n events ($A_1, A_2, \ldots, A_n$) that are mutually exclusive:

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n) \tag{5.7}$$

5.4.5

Probability of a Complementary Event

If $A^c$ is A’s complementary event, then:

$$P(A^c) = 1 - P(A) \tag{5.8}$$

5.4.6

Probability Multiplication Rule for Independent Events

If A and B are two independent events, the probability of them happening together is equal to the product of their individual probabilities:

$$P(A \cap B) = P(A) \cdot P(B) \tag{5.9}$$

Expression (5.9) can be extended to n independent events ($A_1, A_2, \ldots, A_n$):

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2) \cdot \ldots \cdot P(A_n) \tag{5.10}$$

Example 5.3
A gravity-pick machine contains balls with numbers 1 through 60 that have the same probability of being drawn. We would like you to:
a) Define the sample space.
b) Calculate the probability of a ball with an odd number on it being drawn.
c) Calculate the probability of a ball with a multiple of 5 on it being drawn.
d) Calculate the probability of a ball with an odd number or with a multiple of 5 on it being drawn.
e) Calculate the probability of a ball with a multiple of 7 or a multiple of 10 on it being drawn.
f) Calculate the probability of a ball that does not have a multiple of 5 on it being drawn.
g) One ball is drawn randomly and put back into the gravity-pick machine. A new ball will be drawn. Calculate the probability of the first ball having an even number on it and the second one a number greater than 40.

Solution
a) $S = \{1, 2, 3, \ldots, 60\}$.
b) $A = \{1, 3, 5, \ldots, 59\}$, $P(A) = \frac{30}{60} = \frac{1}{2}$.
c) $A = \{5, 10, 15, \ldots, 60\}$, $P(A) = \frac{12}{60} = \frac{1}{5}$.
d) Let $A = \{1, 3, 5, \ldots, 59\}$ and $B = \{5, 10, 15, \ldots, 60\}$. Since A and B are not mutually exclusive events, because they have common elements (5, 15, 25, 35, 45, 55), we apply Expression (5.5):
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{1}{2} + \frac{1}{5} - \frac{6}{60} = \frac{3}{5}$$
e) In this case, $A = \{7, 14, 21, 28, 35, 42, 49, 56\}$ and $B = \{10, 20, 30, 40, 50, 60\}$. Since the events are mutually exclusive ($A \cap B = \varnothing$), we apply Expression (5.6):
$$P(A \cup B) = P(A) + P(B) = \frac{8}{60} + \frac{6}{60} = \frac{7}{30}$$
f) In this case, A = {multiples of 5} and $A^c$ = {numbers that are not multiples of 5}. Therefore, the probability of complementary event $A^c$ happening is:
$$P(A^c) = 1 - P(A) = 1 - \frac{1}{5} = \frac{4}{5}$$
g) Since the events are independent, we apply Expression (5.9):
$$P(A \cap B) = P(A) \cdot P(B) = \frac{1}{2} \cdot \frac{20}{60} = \frac{1}{6}$$
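The answers to Example 5.3 can be checked by enumerating the 60 balls. The sketch below is one possible way to do it; the helper `prob` and the variable names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

S = set(range(1, 61))  # balls numbered 1 through 60


def prob(event):
    """Classical probability of an event within S (equally likely balls)."""
    return Fraction(len(event & S), len(S))


odd = {x for x in S if x % 2 == 1}
even = S - odd
mult5 = {x for x in S if x % 5 == 0}
mult7 = {x for x in S if x % 7 == 0}
mult10 = {x for x in S if x % 10 == 0}
greater_than_40 = {x for x in S if x > 40}

p_b = prob(odd)                                     # 1/2
p_c = prob(mult5)                                   # 1/5
p_d = prob(odd) + prob(mult5) - prob(odd & mult5)   # addition rule (5.5): 3/5
p_e = prob(mult7) + prob(mult10)                    # disjoint events, rule (5.6): 7/30
p_f = 1 - prob(mult5)                               # complement rule (5.8): 4/5
p_g = prob(even) * prob(greater_than_40)            # independent draws, rule (5.9): 1/6

print(p_b, p_c, p_d, p_e, p_f, p_g)  # 1/2 1/5 3/5 7/30 4/5 1/6
```

Counting the union directly, `prob(odd | mult5)`, gives the same 3/5 as the addition rule, and `mult7 & mult10` is empty, which confirms that the events in item e) are mutually exclusive.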

5.5

CONDITIONAL PROBABILITY

When events are not independent, we must use the concept of conditional probability. Considering two events A and B, the probability of A happening, given that B has already happened, is called the conditional probability of A given B, and is represented by $P(A \mid B)$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{5.11}$$

An event A is considered independent of B if:

$$P(A \mid B) = P(A) \tag{5.12}$$
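Expressions (5.11) and (5.12) can be explored by enumeration on a single die roll. In this sketch the events A, B, and C are hypothetical choices made only to show one independent pair and one dependent pair; the function names are illustrative.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space of one die roll


def prob(event):
    return Fraction(len(event & S), len(S))


def conditional(a, b):
    """P(a | b) as in expression (5.11); b must have nonzero probability."""
    return prob(a & b) / prob(b)


B = {2, 4, 6}   # an even number
A = {5, 6}      # hypothetical event: a number greater than 4
C = {4, 5, 6}   # hypothetical event: a number greater than 3

print(conditional(A, B), prob(A))  # 1/3 1/3  -> A is independent of B
print(conditional(C, B), prob(C))  # 2/3 1/2  -> C is not independent of B
```

Knowing that the roll is even leaves the probability of A unchanged, so condition (5.12) holds for A, while the same information raises the probability of C.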

Example 5.4
A die is rolled. What is the probability of getting number 4, given that the number drawn was an even number?

Solution
In this case, A = {number 4} and B = {an even number}. Applying Expression (5.11), we have:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}$$

5.5.1

Probability Multiplication Rule

From the definition of conditional probability, the multiplication rule allows the researcher to calculate the probability of the simultaneous occurrence of two events A and B as the probability of one of them multiplied by the conditional probability of the other, given that the first event has occurred:

$$P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B) \tag{5.13}$$

The multiplication rule can be extended to three events A, B, and C:

$$P(A \cap B \cap C) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B) \tag{5.14}$$

This is only one of the six ways in which Expression (5.14) can be written.

Example 5.5
A gravity-pick machine contains eight white balls, six red balls, and four black balls. Initially, we draw a ball that is not put back into the gravity-pick machine. A new ball will be drawn. What is the probability of both balls being red?

Solution
Differently from the previous example, which calculated the conditional probability of a single event, the objective in this case is to calculate the probability of two events occurring simultaneously. The events are also not independent, since the first ball is not put back into the gravity-pick machine. If event A = {the first ball is red} and B = {the second ball is red}, to calculate $P(A \cap B)$, we must apply Expression (5.13):

$$P(A \cap B) = P(A) \cdot P(B \mid A) = \frac{6}{18} \cdot \frac{5}{17} = \frac{5}{51}$$
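The result of Example 5.5 can be confirmed in two ways: by applying Expression (5.13) and by listing every ordered pair of distinct balls. The sketch below does both; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

balls = ["white"] * 8 + ["red"] * 6 + ["black"] * 4  # 18 balls in total

# Multiplication rule (5.13): P(first red) * P(second red | first red).
p_rule = Fraction(6, 18) * Fraction(5, 17)

# Enumeration: all ordered draws of two different balls (18 * 17 = 306 pairs).
pairs = list(permutations(range(len(balls)), 2))
both_red = sum(1 for i, j in pairs if balls[i] == "red" and balls[j] == "red")
p_enumerated = Fraction(both_red, len(pairs))

print(p_rule, p_enumerated)  # 5/51 5/51
```

Using index pairs from `permutations` guarantees that the same ball is never drawn twice, which is what drawing without replacement means here.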

Example 5.6
A company will give a car to one of its customers (who are located in different regions of Brazil). Table 5.E.1 shows the data regarding these customers, in terms of gender and city. Determine:
a) What is the probability of a male customer being drawn?
b) What is the probability of a female customer being drawn?
c) What is the probability of a customer from Curitiba being drawn?
d) What is the probability of a customer from Sao Paulo being drawn, given that it is a male customer?
e) What is the probability of a female customer being drawn, given that it is a customer from Aracaju?
f) What is the probability of a female customer from Salvador being drawn?

TABLE 5.E.1 Absolute Frequency Distribution According to Gender and City

City             Male   Female   Total
Goiania          12     14       26
Aracaju          8      12       20
Salvador         16     15       31
Curitiba         24     22       46
Sao Paulo        35     25       60
Belo Horizonte   10     12       22
Total            105    100      205
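As a hedged illustration, the counts in Table 5.E.1 can be stored in a dictionary and the probabilities asked for in Example 5.6 obtained as exact fractions. The dictionary layout and variable names are assumptions made for this sketch.

```python
from fractions import Fraction

# Absolute frequencies from Table 5.E.1.
counts = {
    "Goiania": {"male": 12, "female": 14},
    "Aracaju": {"male": 8, "female": 12},
    "Salvador": {"male": 16, "female": 15},
    "Curitiba": {"male": 24, "female": 22},
    "Sao Paulo": {"male": 35, "female": 25},
    "Belo Horizonte": {"male": 10, "female": 12},
}

total = sum(sum(row.values()) for row in counts.values())   # 205 customers
males = sum(row["male"] for row in counts.values())         # 105 male customers

p_male = Fraction(males, total)
p_female = 1 - p_male
p_curitiba = Fraction(sum(counts["Curitiba"].values()), total)
# Conditional probabilities restrict the count to the conditioning group.
p_sao_paulo_given_male = Fraction(counts["Sao Paulo"]["male"], males)
p_female_given_aracaju = Fraction(counts["Aracaju"]["female"], sum(counts["Aracaju"].values()))
# Joint probability: one cell of the table over the grand total.
p_female_and_salvador = Fraction(counts["Salvador"]["female"], total)

print(p_male, p_female, p_curitiba)  # 21/41 20/41 46/205
print(p_sao_paulo_given_male, p_female_given_aracaju, p_female_and_salvador)  # 1/3 3/5 3/41
```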

Solution
a) The probability of the customer being a man is 105/205 = 21/41.
b) The probability of the customer being a woman is 100/205 = 20/41.
c) The probability of the customer being from Curitiba is 46/205.
d) Considering that A = {Sao Paulo} and B = {male}, $P(A \mid B)$ is calculated according to Expression (5.11):
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{35/205}{105/205} = \frac{1}{3}$$
e) Considering that A = {female} and B = {Aracaju}, $P(A \mid B)$ is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{12/205}{20/205} = \frac{3}{5}$$
f) If A = {Salvador} and B = {female}, $P(A \cap B)$ is calculated according to Expression (5.13):
$$P(A \cap B) = P(A) \cdot P(B \mid A) = \frac{31}{205} \cdot \frac{15}{31} = \frac{3}{41}$$

5.6

BAYES’ THEOREM

Imagine that the probability of a certain event was calculated. However, new information was added to the process, so the probability must be recalculated. The probability calculated initially is called the a priori probability; the probability with the recently added information is called the a posteriori probability. The calculation of the a posteriori probability is based on Bayes’ Theorem and is described here.

Consider $B_1, B_2, \ldots, B_n$ mutually exclusive events, with $P(B_1) + P(B_2) + \cdots + P(B_n) = 1$. A, on the other hand, is any given event that will happen jointly with or as a consequence of one of the $B_i$ events (i = 1, 2, …, n). The probability of event $B_i$ happening, given that event A has already happened, is calculated as follows:

$$P(B_i \mid A) = \frac{P(B_i \cap A)}{P(A)} = \frac{P(B_i) \cdot P(A \mid B_i)}{P(B_1) \cdot P(A \mid B_1) + P(B_2) \cdot P(A \mid B_2) + \cdots + P(B_n) \cdot P(A \mid B_n)} \tag{5.15}$$

where:
$P(B_i)$ is the a priori probability;
$P(B_i \mid A)$ is the a posteriori probability (probability of $B_i$ after A has happened).

Example 5.7 Consider three identical gravity-pick machines U1, U2, and U3. Gravity-pick machine U1 contains two balls, one is yellow and the other is red. Gravity-pick machine U2, on the other hand, contains three blue balls, while machine U3 contains two red balls and a yellow one. We select one of the gravity-pick machines at random and draw one ball. We can see that the ball chosen is yellow. What is the probability of gravity-pick machine U1 having been chosen?


Solution Let's define the following events:
B1 = choosing gravity-pick machine U1;
B2 = choosing gravity-pick machine U2;
B3 = choosing gravity-pick machine U3;
A = choosing the yellow ball.
The objective is to calculate P(B1 | A), knowing that:
P(B1) = 1/3, P(A | B1) = 1/2
P(B2) = 1/3, P(A | B2) = 0
P(B3) = 1/3, P(A | B3) = 1/3
Therefore, we have:

$$P(B_1 \mid A) = \frac{P(B_1 \cap A)}{P(A)} = \frac{P(B_1) \cdot P(A \mid B_1)}{P(B_1) \cdot P(A \mid B_1) + P(B_2) \cdot P(A \mid B_2) + P(B_3) \cdot P(A \mid B_3)}$$

$$P(B_1 \mid A) = \frac{\frac{1}{3} \cdot \frac{1}{2}}{\frac{1}{3} \cdot \frac{1}{2} + \frac{1}{3} \cdot 0 + \frac{1}{3} \cdot \frac{1}{3}} = \frac{3}{5}$$
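The calculation in Example 5.7 can be checked with a few lines of Python. This is only a minimal sketch: the function name `posterior` and the use of exact fractions are our own illustrative choices, not part of the original text.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Apply Expression (5.15): return P(B_i | A) for every event B_i.

    priors[i] is P(B_i) and likelihoods[i] is P(A | B_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B_i) * P(A | B_i)
    total = sum(joint)                                    # P(A), the denominator
    return [j / total for j in joint]

# Example 5.7: three machines chosen with equal probability
priors = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
# Probability of drawing a yellow ball from U1, U2, and U3
likelihoods = [Fraction(1, 2), Fraction(0), Fraction(1, 3)]

post = posterior(priors, likelihoods)
print(post[0])  # 3/5
```

The same call also returns the a posteriori probabilities of U2 and U3 (0 and 2/5), which add up to 1 together with P(B1 | A).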

5.7

COMBINATORIAL ANALYSIS

Combinatorial analysis is a set of procedures that calculates the number of different groups that can be formed by selecting a finite number of elements from a set. Arrangements, combinations, and permutations are the three main types of configurations and are applicable to the probability. The probability of an event is, therefore, the ratio between the number of results of the event we are interested in and the total number of results in the sample space (total number of arrangements, combinations, or permutations).

5.7.1

Arrangements

An arrangement calculates the number of possible configurations with distinct elements from a certain set. Bruni (2011) defines arrangement as the study of the number of ways in which a researcher can organize a sample of objects, which was removed from a larger population, and in which the alteration of the order of the organized objects is relevant. Given n different objects, if the objective is to select p of these objects (n and p are integers, n ≥ p), the number of arrangements or possible ways of doing this is represented by An,p and calculated as follows:

$$A_{n,p} = \frac{n!}{(n-p)!} \tag{5.16}$$

Example 5.8 Consider a set with three elements A = {1, 2, 3}. If these elements were taken 2 by 2, how many arrangements would be possible? What is the probability of element 3 being in the second position?
Solution From Expression (5.16), we have:

$$A_{3,2} = \frac{3!}{(3-2)!} = \frac{3 \times 2 \times 1}{1} = 6$$

These arrangements are (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), and (3, 2). In an arrangement, the order in which the elements are organized is relevant. For example, (1, 2) ≠ (2, 1). After defining all the arrangements, it is easy to calculate the probability. Since we have two arrangements in which element 3 is in the second position, given that the total number of arrangements is 6, the probability is 2/6 = 1/3.
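For small sets, the arrangements can also be enumerated directly. The sketch below uses Python's standard library (`itertools.permutations` lists ordered selections and `math.perm` evaluates Expression (5.16)); the variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations
from math import perm

elements = [1, 2, 3]
p = 2

# Every ordered selection of p distinct elements is one arrangement
arrangements = list(permutations(elements, p))
count = len(arrangements)

# Arrangements in which element 3 occupies the second position
favorable = [a for a in arrangements if a[1] == 3]
prob = Fraction(len(favorable), count)

print(count, perm(len(elements), p))  # 6 6
print(prob)                           # 1/3
```

Enumeration is practical only for small n; for larger sets, the count from Expression (5.16) is usually all that is needed.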


Example 5.9 Calculate the number of ways in which it is possible to park six vehicles in three parking spaces. What is the probability of vehicle 1 being in the first parking space?
Solution Through Expression (5.16), we have:

$$A_{6,3} = \frac{6!}{(6-3)!} = \frac{6 \times 5 \times 4 \times 3!}{3!} = 120$$

From the 120 possible arrangements, in 20 of them vehicle 1 is in the first position: (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 2, 6), (1, 3, 2), (1, 3, 4), (1, 3, 5), (1, 3, 6), (1, 4, 2), (1, 4, 3), (1, 4, 5), (1, 4, 6), (1, 5, 2), (1, 5, 3), (1, 5, 4), (1, 5, 6), (1, 6, 2), (1, 6, 3), (1, 6, 4), (1, 6, 5). Therefore, the probability is 20/120 = 1/6.

5.7.2

Combinations

Combinations are a special case of arrangements in which the order in which the elements are organized does not matter. Given n different objects, the number of ways or combinations in which to organize p of these objects is represented by Cn,p (a combination of n elements taken p by p), and calculated as follows:

$$C_{n,p} = \binom{n}{p} = \frac{n!}{p!\,(n-p)!} \tag{5.17}$$

Example 5.10 How many different ways can we form groups of four students in a class with 20 students?
Solution Since the order of the elements in the group is not relevant, we must apply Expression (5.17):

$$C_{20,4} = \binom{20}{4} = \frac{20!}{4!\,(20-4)!} = \frac{20 \times 19 \times 18 \times 17 \times 16!}{24 \times 16!} = 4{,}845$$

Thus, 4,845 different groups can be formed.

Example 5.11 Marcelo, Felipe, Luiz Paulo, Rodrigo, and Ricardo went to an amusement park to have fun. The ride they chose to go on next only has three seats, so only three of them will be chosen randomly. What is the probability of Felipe and Luiz Paulo being on that ride?
Solution The total number of combinations is:

$$C_{5,3} = \binom{5}{3} = \frac{5!}{3!\,2!} = \frac{5 \times 4 \times 3!}{3! \times 2} = 10$$

The 10 possibilities are:
Group 1: Marcelo, Felipe, and Luiz Paulo
Group 2: Marcelo, Felipe, and Rodrigo
Group 3: Marcelo, Felipe, and Ricardo
Group 4: Marcelo, Luiz Paulo, and Rodrigo
Group 5: Marcelo, Luiz Paulo, and Ricardo
Group 6: Marcelo, Rodrigo, and Ricardo
Group 7: Felipe, Luiz Paulo, and Rodrigo
Group 8: Felipe, Luiz Paulo, and Ricardo
Group 9: Felipe, Rodrigo, and Ricardo
Group 10: Luiz Paulo, Rodrigo, and Ricardo
Felipe and Luiz Paulo are together in Groups 1, 7, and 8. Therefore, the probability is 3/10.

5.7.3

Permutations

Permutation is an arrangement in which all the elements in the set are selected. Therefore, it is the number of ways in which n elements can be grouped, changing their order. The number of possible permutations is represented by Pn and can be calculated as follows:

$$P_n = n! \tag{5.18}$$

Example 5.12 Consider a set with three elements, A = {1, 2, 3}. What is the total number of permutations possible?
Solution P3 = 3! = 3 × 2 × 1 = 6. They are (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), and (3, 2, 1).

Example 5.13 A certain factory manufactures six different products. How many different ways can the production sequence occur?
Solution To determine the number of possible production sequences, we just need to apply Expression (5.18): P6 = 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720

5.8

FINAL REMARKS

This chapter discussed the concepts and terminology of probability theory, as well as their practical application. Probability theory is used to assess the possibility of uncertain events happening. It originated from attempts to understand uncertain natural phenomena, evolved into planning how to gamble, and is currently applied to the study of statistical inference.

5.9

EXERCISES

1) Two soccer teams will play overtime until the Golden Goal is scored. Define the sample space.
2) What is the difference between mutually exclusive events and independent events?
3) In a deck of cards with 52 cards, determine:
   a. The probability of a card of hearts being drawn;
   b. The probability of a queen being drawn;
   c. The probability of a face card (jack, queen, or king) being drawn;
   d. The probability of any card, but not a face card, being drawn.
4) A production batch contains 240 parts and 12 of them are defective. One part is drawn randomly. What is the probability of this part being defective?
5) A number between 1 and 30 is chosen randomly. We would like you to:
   a. Define the sample space.
   b. What is the probability of this number being divisible by 3?
   c. What is the probability of this number being a multiple of 5?
   d. What is the probability of this number being divisible by 3 or a multiple of 5?
   e. What is the probability of this number being even, given that it is a multiple of 5?
   f. What is the probability of this number being a multiple of 5, given that it is divisible by 3?
   g. What is the probability of this number not being divisible by 3?
   h. Assuming that two numbers are chosen randomly, what is the probability of the first number being a multiple of 5 and the second one an odd number?
6) Two dice are rolled simultaneously. Determine:
   a. The sample space.
   b. What is the probability of both numbers being even?
   c. What is the probability of the sum of the numbers being 10?
   d. What is the probability of the multiplication of the numbers being 6?
   e. What is the probability of the sum of the numbers being 10 or 6?
   f. What is the probability of the number drawn in the first die being an odd number or of the number drawn in the second die being a multiple of 3?
   g. What is the probability of the number drawn in the first die being an even number or of the number drawn in the second die being a multiple of 4?
7) What is the difference between arrangements, combinations, and permutations?

Chapter 6

Random Variables and Probability Distributions

What we call chance can only be the unknown cause of a known effect.
Voltaire

6.1

INTRODUCTION

In Chapters 3 and 4, we discussed several statistics to describe the behavior of quantitative and qualitative data, including sample frequency distributions. In this chapter, we are going to study population probability distributions (for quantitative variables). The frequency distribution of a sample is an estimate of the corresponding population probability distribution. When the sample size is large, the sample frequency distribution approximately follows the population probability distribution (Martins and Domingues, 2011). According to the authors, descriptive statistics is essential for empirical research, as well as for solving several practical problems. However, when the main goal is to study a population’s variables, the probability distribution is more suitable. This chapter discusses the concept of discrete and continuous random variables, the main probability distributions for each type of random variable, and also the calculation of the expected value and the variance of each probability distribution. For discrete random variables, the most common probability distributions are the discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson. On the other hand, for continuous random variables, we are going to study the uniform, normal, exponential, gamma, chi-square (χ²), Student’s t, and Snedecor’s F distributions.

6.2

RANDOM VARIABLES

As studied in the previous chapter, the set of all possible results of a random experiment is called the sample space. To describe a random experiment, it is convenient to associate numerical values with the elements of the sample space. A random variable can be characterized as a variable that presents a single value for each element, and this value is determined randomly. Assume that e is a random experiment and S is the sample space associated with this experiment. A function X that associates a real number X(s) with each element s ∈ S is called a random variable. Random variables can be discrete or continuous.

6.2.1

Discrete Random Variable

A discrete random variable can only take on a countable number of distinct values, usually counts. In that typical case, it does not assume decimal or noninteger values. As examples of discrete random variables, we can mention the number of children in a family, the number of employees in a company, or the number of vehicles produced in a certain factory.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00006-9 © 2019 Elsevier Inc. All rights reserved.

6.2.1.1 Expected Value of a Discrete Random Variable

Let X be a discrete random variable that can take on the values {x1, x2, …, xn} with the respective probabilities {p(x1), p(x2), …, p(xn)}. The function {xi, p(xi), i = 1, 2, …, n} is called the probability function of random variable X and associates, to each value xi, its probability of occurrence:

$$p(x_i) = P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n \tag{6.1}$$

so that p(xi) ≥ 0 for every xi and $\sum_{i=1}^{n} p(x_i) = 1$.

The expected or average value of X is given by the expression:

$$E(X) = \sum_{i=1}^{n} x_i \cdot P(X = x_i) = \sum_{i=1}^{n} x_i \cdot p_i \tag{6.2}$$

Expression (6.2) is similar to the one used for the mean in Chapter 3, in which instead of probabilities pi we had relative frequencies Fri. The difference between pi and Fri is that the former corresponds to the values from an assumed theoretical model and the latter to the variable values observed. Since pi and Fri have the same interpretation, all of the measures and charts presented in Chapter 3, based on the distribution of Fri, have a corresponding one in the distribution of a random variable. The same interpretation is valid for other measures of position and variability, such as, the median and the standard deviation (Bussab and Morettin, 2011).

6.2.1.2 Variance of a Discrete Random Variable

The variance of a discrete random variable X is a weighted mean of the squared distances between the values that X can take on and X’s expected value, where the weights are the probabilities of the possible values of X. If X assumes the values {x1, x2, …, xn} with the respective probabilities {p1, p2, …, pn}, then its variance is given by:

$$Var(X) = \sigma^2(X) = E\left[(X - E(X))^2\right] = \sum_{i=1}^{n} \left[x_i - E(X)\right]^2 \cdot p_i \tag{6.3}$$

In some cases, it is convenient to use the standard deviation of a random variable as a measure of variability. The standard deviation of X is the square root of the variance:

$$\sigma(X) = \sqrt{Var(X)} \tag{6.4}$$

Example 6.1 Assume that the monthly real estate sales for a certain real estate agent follow the probability distribution seen in Table 6.E.1. Determine the expected value of monthly sales, as well as its variance.

TABLE 6.E.1 Monthly Real Estate Sales and Their Respective Probabilities

xi (sales)    0       1       2       3
p(xi)         2/10    4/10    3/10    1/10

Solution The expected value of monthly sales is:

$$E(X) = 0 \times 0.20 + 1 \times 0.40 + 2 \times 0.30 + 3 \times 0.10 = 1.3$$

The variance can be calculated as:

$$Var(X) = (0 - 1.3)^2 \times 0.2 + (1 - 1.3)^2 \times 0.4 + (2 - 1.3)^2 \times 0.3 + (3 - 1.3)^2 \times 0.1 = 0.81$$
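Expressions (6.2) to (6.4) translate almost literally into code. The sketch below recomputes Example 6.1; representing the probability function as a Python dictionary and the function names `expected_value` and `variance` are our own illustrative choices.

```python
from fractions import Fraction
from math import sqrt

def expected_value(dist):
    """Expression (6.2): sum of x_i * p_i over the probability function."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Expression (6.3): probability-weighted squared distance from E(X)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Table 6.E.1: monthly sales (x_i) mapped to their probabilities p(x_i)
sales = {0: Fraction(2, 10), 1: Fraction(4, 10), 2: Fraction(3, 10), 3: Fraction(1, 10)}

mean = expected_value(sales)
var = variance(sales)
std = sqrt(var)  # Expression (6.4)

print(mean, var)  # 13/10 81/100
```

Exact fractions avoid rounding noise; with floating-point probabilities the same functions work, but the results should then be compared with a tolerance.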

6.2.1.3 Cumulative Distribution Function of a Discrete Random Variable

The cumulative distribution function (c.d.f.) of a random variable X, denoted by F(x), corresponds to the sum of the probabilities of the values xi that are less than or equal to x:

$$F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i) \tag{6.5}$$

The following properties are valid for the cumulative distribution function of a discrete random variable:

$$0 \le F(x) \le 1 \tag{6.6}$$

$$\lim_{x \to \infty} F(x) = 1 \tag{6.7}$$

$$\lim_{x \to -\infty} F(x) = 0 \tag{6.8}$$

$$a < b \rightarrow F(a) \le F(b) \tag{6.9}$$

Example 6.2 For the data in Example 6.1, calculate F(0.5), F(1), F(2.5), F(3), F(4), and F(−0.5).
Solution
a) F(0.5) = P(X ≤ 0.5) = 2/10
b) F(1) = P(X ≤ 1) = 2/10 + 4/10 = 6/10
c) F(2.5) = P(X ≤ 2.5) = 2/10 + 4/10 + 3/10 = 9/10
d) F(3) = P(X ≤ 3) = 2/10 + 4/10 + 3/10 + 1/10 = 1
e) F(4) = P(X ≤ 4) = 1
f) F(−0.5) = P(X ≤ −0.5) = 0
In short, the cumulative distribution function of random variable X in Example 6.1 is given by:

$$F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 2/10 & \text{if } 0 \le x < 1, \\ 6/10 & \text{if } 1 \le x < 2, \\ 9/10 & \text{if } 2 \le x < 3, \\ 1 & \text{if } x \ge 3 \end{cases}$$
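The step function above can be evaluated programmatically by summing the probabilities of all values not exceeding x, as in Expression (6.5). The following is a minimal sketch; the function name `cdf` and the dictionary representation are assumptions made for illustration.

```python
from fractions import Fraction

def cdf(dist, x):
    """Expression (6.5): F(x) = P(X <= x), summing p(x_i) over all x_i <= x."""
    return sum((p for xi, p in dist.items() if xi <= x), Fraction(0))

# Probability function from Table 6.E.1
sales = {0: Fraction(2, 10), 1: Fraction(4, 10), 2: Fraction(3, 10), 3: Fraction(1, 10)}

for x in (0.5, 1, 2.5, 3, 4, -0.5):
    print(x, cdf(sales, x))
```

The printed values match items a) to f) of Example 6.2 (Python shows the fractions in lowest terms, for example 1/5 for 2/10).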

6.2.2

Continuous Random Variable

A continuous random variable can take on any value in an interval of real numbers. As examples of continuous random variables, we can mention a family’s income, the revenue of a company, or the height of a certain child. A continuous random variable X is associated with a function f(x), called the probability density function (p.d.f.) of X, which meets the following conditions:

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1, \quad f(x) \ge 0 \tag{6.10}$$

For any a and b such that −∞ < a < b < +∞, the probability of random variable X taking on values within this interval is:

$$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \tag{6.11}$$

which can be graphically represented as shown in Fig. 6.1.

FIG. 6.1 Probability of X assuming values within the interval [a, b].


6.2.2.1 Expected Value of a Continuous Random Variable

The mathematical expected or average value of a continuous random variable X with a probability density function f(x) is given by the expression:

$$E(X) = \int_{-\infty}^{+\infty} x\,f(x)\,dx \tag{6.12}$$

6.2.2.2 Variance of a Continuous Random Variable

The variance of a continuous random variable X with a probability density function f(x) is calculated as:

$$Var(X) = E(X^2) - [E(X)]^2 = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx \tag{6.13}$$

Example 6.3 The probability density function of a continuous random variable X is given by:

$$f(x) = \begin{cases} 2x, & 0 < x < 1 \\ 0, & \text{for any other values} \end{cases}$$

Calculate E(X) and Var(X).
Solution

$$E(X) = \int_{0}^{1} (x \cdot 2x)\,dx = \int_{0}^{1} 2x^2\,dx = \frac{2}{3}$$

$$E(X^2) = \int_{0}^{1} (x^2 \cdot 2x)\,dx = \int_{0}^{1} 2x^3\,dx = \frac{1}{2}$$

$$Var(X) = E(X^2) - [E(X)]^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}$$

6.2.2.3 Cumulative Distribution Function of a Continuous Random Variable

As in the discrete case, we can calculate the probabilities associated with a continuous random variable X from a cumulative distribution function. The cumulative distribution function F(x) of a continuous random variable X with probability density function f(x) is defined by:

$$F(x) = P(X \le x), \quad -\infty < x < \infty \tag{6.14}$$

Expression (6.14) is similar to the one presented for the discrete case, in Expression (6.5). The difference is that, for continuous variables, the cumulative distribution function is a continuous function, without jumps. From (6.11) we have:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt \tag{6.15}$$

As in the discrete case, the following properties for the cumulative distribution function of a continuous random variable are valid:

$$0 \le F(x) \le 1 \tag{6.16}$$

$$\lim_{x \to \infty} F(x) = 1 \tag{6.17}$$

$$\lim_{x \to -\infty} F(x) = 0 \tag{6.18}$$

$$a < b \rightarrow F(a) \le F(b) \tag{6.19}$$

Example 6.4 Once again, let us consider the probability density function in Example 6.3:

$$f(x) = \begin{cases} 2x, & 0 < x < 1 \\ 0, & \text{for any other values} \end{cases}$$

Calculate the cumulative distribution function of X.
Solution

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt = \begin{cases} 0 & \text{if } x \le 0 \\ \int_{0}^{x} 2t\,dt = x^2 & \text{if } 0 < x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$$

6.3

PROBABILITY DISTRIBUTIONS FOR DISCRETE RANDOM VARIABLES

For discrete random variables, the most common probability distributions are the discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson.

6.3.1

Discrete Uniform Distribution

It is the simplest discrete probability distribution and receives the name uniform because all of the possible values of the random variable have the same probability of occurrence. A discrete random variable X that takes on the values x1, x2, …, xn has a discrete uniform distribution with parameter n, denoted by X ~ Ud{x1, x2, …, xn}, if its probability function is given by:

$$P(X = x_i) = p(x_i) = \frac{1}{n}, \quad i = 1, 2, \ldots, n \tag{6.20}$$

which may be graphically represented as shown in Fig. 6.2.

FIG. 6.2 Discrete uniform distribution.

The mathematical expected value of X is given by:

$$E(X) = \frac{1}{n} \cdot \sum_{i=1}^{n} x_i \tag{6.21}$$


The variance of X is calculated as:

$$Var(X) = \frac{1}{n} \cdot \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] \tag{6.22}$$

And the cumulative distribution function (c.d.f.) is:

$$F(x) = P(X \le x) = \sum_{x_i \le x} \frac{1}{n} = \frac{n(x)}{n} \tag{6.23}$$

where n(x) is the number of xi ≤ x, as shown in Fig. 6.3.

FIG. 6.3 Cumulative distribution function.

Example 6.5 A perfectly balanced die is thrown and random variable X represents the value on the face that is facing up. Determine the distribution of X, in addition to X’s expected value and variance.
Solution The distribution of X is shown in Table 6.E.2.

TABLE 6.E.2 Distribution of X

X       1      2      3      4      5      6      Sum
f(x)    1/6    1/6    1/6    1/6    1/6    1/6    1

Therefore, we have:

$$E(X) = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$

$$Var(X) = \frac{1}{6}\left[\left(1 + 2^2 + \cdots + 6^2\right) - \frac{(21)^2}{6}\right] = \frac{35}{12} = 2.917$$
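As a check on Example 6.5, Expressions (6.21) to (6.23) can be coded directly. This is a sketch only; the function names below are hypothetical and exact fractions are used so that 35/12 is recovered without rounding.

```python
from fractions import Fraction

def uniform_mean(values):
    """Expression (6.21): arithmetic mean of the n equally likely values."""
    return Fraction(sum(values), len(values))

def uniform_variance(values):
    """Expression (6.22): (1/n) * [sum of squares - (sum)^2 / n]."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return Fraction(1, n) * (total_sq - Fraction(total * total, n))

def uniform_cdf(values, x):
    """Expression (6.23): n(x) / n, where n(x) counts the values <= x."""
    return Fraction(sum(1 for v in values if v <= x), len(values))

faces = [1, 2, 3, 4, 5, 6]
print(uniform_mean(faces))        # 7/2
print(uniform_variance(faces))    # 35/12
print(uniform_cdf(faces, 4))      # 2/3
```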

6.3.2

Bernoulli Distribution

The Bernoulli trial is a random experiment that only offers two possible results, conventionally called success or failure. As an example of a Bernoulli trial, we can mention tossing a coin, whose only possible results are heads and tails. For a certain Bernoulli trial, we will consider the random variable X that takes on the value 1 in case of success, and 0 in case of failure. The probability of success is represented by p and the probability of failure by (1 − p) or q. The Bernoulli distribution, therefore, provides the probability of success or failure of variable X when carrying out a single experiment. Therefore, we can say that variable X follows a Bernoulli distribution with parameter p, denoted by X ~ Bern(p), if its probability function is given by:

$$P(X = x) = p(x) = \begin{cases} q = 1 - p, & \text{if } x = 0 \\ p, & \text{if } x = 1 \end{cases} \tag{6.24}$$

which can also be represented in the following way:

$$P(X = x) = p(x) = p^x \cdot (1 - p)^{1 - x}, \quad x = 0, 1 \tag{6.25}$$

The probability function of random variable X is represented in Fig. 6.4.

FIG. 6.4 Probability function of the Bernoulli distribution.

It is easy to see that the expected value of X is:

$$E(X) = p \tag{6.26}$$

with X’s variance being:

$$Var(X) = p \cdot (1 - p) \tag{6.27}$$

Bernoulli’s cumulative distribution function (c.d.f.) is given by:

$$F(x) = P(X \le x) = \begin{cases} 0, & \text{if } x < 0 \\ 1 - p, & \text{if } 0 \le x < 1 \\ 1, & \text{if } x \ge 1 \end{cases} \tag{6.28}$$

which can be represented by Fig. 6.5.

FIG. 6.5 Bernoulli’s distribution c.d.f.

It is important to mention that we are going to use this knowledge of the Bernoulli distribution when discussing binary logistic regression models (Chapter 14).

Example 6.6 The Interclub Indoor Soccer Cup final match is going to be between teams A and B. Random variable X represents the team that will win the Cup. We know that the probability of team A winning is 0.60. Determine the distribution of X, in addition to X’s expected value and variance.


Solution Random variable X can only take on two values:

$$X = \begin{cases} 1, & \text{if team A wins} \\ 0, & \text{if team B wins} \end{cases}$$

Since it is a single game, variable X follows a Bernoulli distribution with parameter p = 0.60, denoted by X ~ Bern(0.6), so:

$$P(X = x) = p(x) = \begin{cases} q = 0.4, & \text{if } x = 0 \text{ (team B)} \\ p = 0.6, & \text{if } x = 1 \text{ (team A)} \end{cases}$$

We have:

$$E(X) = p = 0.6$$

$$Var(X) = p \cdot (1 - p) = 0.6 \times 0.4 = 0.24$$

6.3.3

Binomial Distribution

A binomial experiment consists of n independent repetitions of a Bernoulli trial with probability of success p, a probability that remains constant in all repetitions. Discrete random variable X of a binomial model corresponds to the number of successes (k) in the n repetitions of the experiment. Therefore, X follows a binomial distribution with parameters n and p, denoted by X ~ b(n, p), if its probability distribution function is given by:

$$f(k) = P(X = k) = \binom{n}{k} \cdot p^k \cdot (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n \tag{6.29}$$

where $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$

The mean of X is given by:

$$E(X) = n \cdot p \tag{6.30}$$

On the other hand, the variance of X can be expressed as:

$$Var(X) = n \cdot p \cdot (1 - p) \tag{6.31}$$

Note that the mean and variance of the binomial distribution are equal to the mean and variance of the Bernoulli distribution, multiplied by n, the number of repetitions of the Bernoulli trial. Fig. 6.6 shows the probability function of the binomial distribution for n = 10 and p taking the values 0.3, 0.5, and 0.7. From Fig. 6.6, we can see that, for p = 0.5, the probability function is symmetrical around the mean. If p < 0.5, the distribution is positively asymmetrical, with a higher frequency of smaller values of k and a longer tail to the right. If p > 0.5, the distribution is negatively asymmetrical, with a higher frequency of larger values of k and a longer tail to the left. It is important to mention that we are going to use this knowledge of the binomial distribution when studying multinomial logistic regression models (Chapter 14).

FIG. 6.6 Probability function of the binomial distribution for n = 10.


6.3.3.1 Relationship Between the Binomial and the Bernoulli Distributions

A binomial distribution with parameter n = 1 is equivalent to a Bernoulli distribution:

$$X \sim b(1, p) \equiv X \sim Bern(p)$$

Example 6.7 A certain part is produced in a production line. The probability of the part not having defects is 99%. If 30 parts are produced, what is the probability of at least 28 of them being in good condition? Also determine the random variable’s mean and variance.
Solution We have:
X = random variable that represents the number of successes (parts in good condition) in the 30 repetitions
p = 0.99 = probability of the part being in good condition
q = 0.01 = probability of the part being defective
n = 30 repetitions
k = number of successes
The probability of at least 28 parts not being defective is given by:

$$P(X \ge 28) = P(X = 28) + P(X = 29) + P(X = 30)$$

$$P(X = 28) = \frac{30!}{28!\,2!} \times \left(\frac{99}{100}\right)^{28} \times \left(\frac{1}{100}\right)^{2} = 0.0328$$

$$P(X = 29) = \frac{30!}{29!\,1!} \times \left(\frac{99}{100}\right)^{29} \times \left(\frac{1}{100}\right)^{1} = 0.224$$

$$P(X = 30) = \frac{30!}{30!\,0!} \times \left(\frac{99}{100}\right)^{30} \times \left(\frac{1}{100}\right)^{0} = 0.7397$$

$$P(X \ge 28) = 0.0328 + 0.224 + 0.7397 = 0.997$$

The mean of X is expressed as:

$$E(X) = n \cdot p = 30 \times 0.99 = 29.7$$

And the variance of X is:

$$Var(X) = n \cdot p \cdot (1 - p) = 30 \times 0.99 \times 0.01 = 0.297$$
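Expression (6.29) is straightforward to evaluate in code. The sketch below recomputes Example 6.7 and also illustrates that b(1, p) coincides with Bern(p); the function name `binom_pmf` is an illustrative assumption.

```python
from math import comb

def binom_pmf(k, n, p):
    """Expression (6.29): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example 6.7: 30 parts, each free of defects with probability 0.99
n, p = 30, 0.99
p_at_least_28 = sum(binom_pmf(k, n, p) for k in range(28, n + 1))
mean = n * p              # Expression (6.30)
var = n * p * (1 - p)     # Expression (6.31)

print(round(p_at_least_28, 3), round(mean, 1), round(var, 3))

# With n = 1 the binomial reduces to the Bernoulli distribution
bern_success = binom_pmf(1, 1, 0.6)
bern_failure = binom_pmf(0, 1, 0.6)
```

Summing the probability function over k = 0, 1, …, n should return 1 (up to floating-point error), which is a convenient sanity check for any implementation.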

6.3.4

Geometric Distribution

The geometric distribution, like the binomial, considers successive independent Bernoulli trials, all of them with probability of success p. However, instead of using a fixed number of trials, the experiment is carried out until the first success is obtained. The geometric distribution has two distinct parameterizations, described here. The first parameterization considers successive independent Bernoulli trials, with probability of success p in each trial, until a success occurs. In this case, we cannot include zero as a possible result, so the support is the set {1, 2, 3, …}. For example, we can consider how many times we tossed a coin until we got the first head, or the number of parts manufactured until a defective one was produced, among others. The second parameterization of the geometric distribution counts the number of failures or unsuccessful attempts before the first success. Since here it is possible to obtain success in the first Bernoulli trial, we include zero as a possible result, so the support is the set {0, 1, 2, 3, …}. Let X be the random variable that represents the number of trials until the first success. Variable X has a geometric distribution with parameter p, denoted by X ~ Geo(p), if its probability function is given by:

$$f(x) = P(X = x) = p \cdot (1 - p)^{x - 1}, \quad x = 1, 2, 3, \ldots \tag{6.32}$$

For the second case, let us consider Y the random variable that represents the number of failures or unsuccessful attempts before the first success. Variable Y has a geometric distribution with parameter p, denoted by Y ~ Geo(p), if its probability function is given by:

$$f(y) = P(Y = y) = p \cdot (1 - p)^{y}, \quad y = 0, 1, 2, \ldots \tag{6.33}$$


FIG. 6.7 Probability function of variable X with parameter p = 0.4.

In both cases, the sequence of probabilities is a geometric progression. The probability function of variable X is graphically represented in Fig. 6.7, for p = 0.4. The calculation of X’s expected value and variance is:

$$E(X) = \frac{1}{p} \tag{6.34}$$

$$Var(X) = \frac{1 - p}{p^2} \tag{6.35}$$

In a similar way, for variable Y, we have:

$$E(Y) = \frac{1 - p}{p} \tag{6.36}$$

$$Var(Y) = \frac{1 - p}{p^2} \tag{6.37}$$

The geometric distribution is the only discrete distribution that has the memoryless property (in the case of continuous distributions, we will see that the exponential distribution also has this property). This means that if an experiment is repeated until the first success, then, given that the first success has not happened yet, the conditional distribution of the number of additional trials does not depend on the number of failures that have occurred until then. Thus, for any two positive integers s and t, if X is greater than s, then the probability of X being greater than s + t is equal to the unconditional probability of X being greater than t:

$$P(X > s + t \mid X > s) = P(X > t) \tag{6.38}$$
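The memoryless property in Expression (6.38) can be verified numerically. The sketch below uses the first parameterization (Expression 6.32) and the fact that P(X > x) = (1 − p)^x, since X exceeds x exactly when the first x trials are all failures. The function names and the values p = 0.4, s = 3, and t = 5 are arbitrary illustrative choices.

```python
def geo_pmf(x, p):
    """Expression (6.32): first success occurs exactly on trial x (x = 1, 2, ...)."""
    return p * (1 - p) ** (x - 1)

def geo_survival(x, p):
    """P(X > x): the first x trials are all failures."""
    return (1 - p) ** x

p, s, t = 0.4, 3, 5

# Left-hand side of (6.38): P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = geo_survival(s + t, p) / geo_survival(s, p)
# Right-hand side of (6.38): P(X > t)
unconditional = geo_survival(t, p)

print(conditional, unconditional)

# The survival function agrees with the tail sum of the probability function
tail_sum = 1 - sum(geo_pmf(x, p) for x in range(1, t + 1))
```

Both printed values agree up to floating-point rounding, whatever values of s and t are chosen.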

Example 6.8 A company manufactures a certain electronic component and, at the end of the process, each component is tested, one by one. Assume that the probability of one electronic component being defective is 0.05. Determine the probability of the first defect being found in the eighth component tested. Also determine the random variable’s expected value and variance.
Solution We have:
X = random variable that represents the number of electronic components tested until the first defect is found;
p = 0.05 = probability of the component being defective;
q = 0.95 = probability of the component being in good condition.
The probability of the first defect being found in the eighth component tested is given by:

$$P(X = 8) = 0.05 \times (1 - 0.05)^{8 - 1} = 0.035$$

The mean of X is expressed as:

$$E(X) = \frac{1}{p} = 20$$

And the variance of X is:

$$Var(X) = \frac{1 - p}{p^2} = \frac{0.95}{0.0025} = 380$$

6.3.5

Negative Binomial Distribution

The negative binomial distribution, also known as the Pascal distribution, considers successive independent Bernoulli trials (with a constant probability of success in all the trials) that are carried out until a prefixed number of successes (k) is achieved, that is, the experiment continues until k successes are achieved. Let X be the random variable that represents the number of attempts carried out (Bernoulli trials) until the k-th success is reached. Variable X has a negative binomial distribution, denoted by X ~ nb(k, p), if its probability function is given by:

$$f(x) = P(X = x) = \binom{x - 1}{k - 1} \cdot p^k \cdot (1 - p)^{x - k}, \quad x = k, k + 1, \ldots \tag{6.39}$$

The graphical representation of a negative binomial distribution with parameters k = 2 and p = 0.4 can be found in Fig. 6.8. The expected value of X is:

$$E(X) = \frac{k}{p} \tag{6.40}$$

and the variance is:

$$Var(X) = \frac{k \cdot (1 - p)}{p^2} \tag{6.41}$$

6.3.5.1 Relationship Between the Negative Binomial and the Binomial Distributions The negative binomial distribution is related to the binomial distribution. In the binomial, we must set the sample size (number of Bernoulli trials) and observe the number of successes (random variable). In the negative binomial, we must set the number of successes (k) and observe the number of Bernoulli trials necessary to obtain k successes.

6.3.5.2 Relationship Between the Negative Binomial and the Geometric Distributions

The negative binomial distribution with parameter k = 1 is equivalent to the geometric distribution:

$$X \sim nb(1, p) \equiv X \sim Geo(p)$$

Alternatively, a negative binomial variable can be regarded as a sum of geometric variables.

FIG. 6.8 Probability function of variable X with parameters k = 2 and p = 0.4.


It is important to mention that we are going to use this knowledge of the negative binomial distribution when studying the regression models for count data (Chapter 15).

Example 6.9 Assume that a student gets three out of every five questions right. Let X be the number of attempts until the twelfth correct answer. Determine the probability of the student having to answer 20 questions in order to get 12 right.
Solution We have: k = 12, p = 3/5 = 0.6, q = 2/5 = 0.4
X = number of attempts until the twelfth correct answer, that is, X ~ nb(12; 0.6). Therefore:

$$f(20) = P(X = 20) = \binom{20 - 1}{12 - 1} \times 0.6^{12} \times 0.4^{20 - 12} = 0.1078 = 10.78\%$$
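Expression (6.39) can be evaluated as in the sketch below, which recomputes Example 6.9 and checks that k = 1 reproduces the geometric probability function. The function name `nb_pmf` is a hypothetical choice made for illustration.

```python
from math import comb

def nb_pmf(x, k, p):
    """Expression (6.39): the k-th success occurs exactly on trial x."""
    if x < k:
        return 0.0  # at least k trials are needed to reach k successes
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)

# Example 6.9: twelfth correct answer on the twentieth question, p = 0.6
prob_20 = nb_pmf(20, 12, 0.6)
print(round(prob_20, 4))  # 0.1078

# With k = 1 the expression reduces to the geometric case p * (1 - p)^(x - 1)
geometric_case = nb_pmf(8, 1, 0.05)
print(round(geometric_case, 3))  # 0.035
```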

6.3.6

Hypergeometric Distribution

The hypergeometric distribution is also related to a Bernoulli trial. However, unlike binomial sampling, in which the probability of success is constant, in the hypergeometric distribution the sampling is without replacement: as the elements are removed from the population to form the sample, the population size diminishes, making the probability of success vary. The hypergeometric distribution describes the number of successes in a sample with n elements, drawn from a finite population without replacement. For example, let us consider a population with N elements, of which M have a certain attribute. The hypergeometric distribution describes the probability of exactly k elements having such attribute (k successes and n − k failures), in a sample with n distinct elements randomly drawn from the population without replacement. Let X be a random variable that represents the number of successes obtained among the n elements drawn from the population. Variable X follows a hypergeometric distribution with parameters N, M, n, denoted by X ~ Hip(N, M, n), if its probability function is given by:

$$f(k) = P(X = k) = \frac{\binom{M}{k} \cdot \binom{N - M}{n - k}}{\binom{N}{n}}, \quad 0 \le k \le \min(M, n) \tag{6.42}$$

The graphical representation of a hypergeometric distribution with parameters N = 200, M = 50, and n = 30 can be found in Fig. 6.9.

FIG. 6.9 Probability function of variable X with parameters N = 200, M = 50, and n = 30.

The mean of X can be calculated as:

$$E(X) = \frac{n \cdot M}{N} \tag{6.43}$$


with variance:

$$Var(X) = \frac{n \cdot M}{N} \cdot \frac{(N - M) \cdot (N - n)}{N \cdot (N - 1)} \tag{6.44}$$

6.3.6.1 Approximation of the Hypergeometric Distribution by the Binomial

Let X be a random variable that follows a hypergeometric distribution with parameters N, M, and n, denoted by X ~ Hip(N, M, n). If the population is large when compared to the sample size, the hypergeometric distribution can be approximated by a binomial distribution with parameters n and p = M/N (probability of success in a single trial):

$$X \sim Hip(N, M, n) \approx X \sim b(n, p), \quad \text{with } p = M/N$$

Example 6.10
A gravity-pick machine contains 15 balls, 5 of which are red. Seven balls are chosen randomly, without replacement. Determine:
a) The probability of exactly two red balls being drawn.
b) The probability of at least two red balls being drawn.
c) The expected number of red balls drawn.
d) The variance of the number of red balls drawn.

Solution
Let X be the random variable that represents the number of red balls drawn. We have N = 15, M = 5, and n = 7.

a) $$P(X = 2) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}} = \frac{\binom{5}{2}\binom{10}{5}}{\binom{15}{7}} = 39.16\%$$

b) $$P(X \ge 2) = 1 - P(X < 2) = 1 - [P(X = 0) + P(X = 1)] = 1 - \frac{\binom{5}{0}\binom{10}{7}}{\binom{15}{7}} - \frac{\binom{5}{1}\binom{10}{6}}{\binom{15}{7}} = 81.82\%$$

c) $$E(X) = \frac{n \cdot M}{N} = \frac{7 \cdot 5}{15} = 2.33$$

d) $$Var(X) = \frac{n \cdot M}{N} \cdot \frac{(N-M) \cdot (N-n)}{N \cdot (N-1)} = \frac{7 \times 5}{15} \cdot \frac{10 \times 8}{15 \times 14} = 0.8889$$
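The calculations in Example 6.10 can be checked numerically. The following is a minimal Python sketch, assuming only the standard library; the function name `hypergeom_pmf` is hypothetical and chosen here for illustration, and the formula is Expression (6.42).

```python
from math import comb


def hypergeom_pmf(k: int, N: int, M: int, n: int) -> float:
    """P(X = k) for X ~ Hip(N, M, n), following Expression (6.42)."""
    # Outside the feasible range the probability is zero.
    if k < 0 or k > min(M, n) or n - k > N - M:
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


# Data from Example 6.10: 15 balls, 5 red, 7 drawn without replacement.
N, M, n = 15, 5, 7

p_exactly_2 = hypergeom_pmf(2, N, M, n)
p_at_least_2 = 1 - hypergeom_pmf(0, N, M, n) - hypergeom_pmf(1, N, M, n)
mean = n * M / N                                                 # Expression (6.43)
variance = (n * M / N) * ((N - M) * (N - n)) / (N * (N - 1))     # Expression (6.44)

print(f"P(X = 2)  = {p_exactly_2:.4f}")
print(f"P(X >= 2) = {p_at_least_2:.4f}")
print(f"E(X)      = {mean:.2f}")
print(f"Var(X)    = {variance:.4f}")
```

Summing `hypergeom_pmf(k, N, M, n)` over all k from 0 to min(M, n) should return 1, which is a quick sanity check on the implementation.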

6.3.7 Poisson Distribution

The Poisson distribution is used to model the occurrence of rare events, with a very low probability of success (p → 0), in a certain interval of time or space. Unlike the binomial model, which provides the probability of the number of successes in a discrete interval (n repetitions of an experiment), the Poisson model provides the probability of the number of successes in a certain continuous interval (time, area, among other possibilities). As examples of variables that follow a Poisson distribution, we can mention the number of customers that arrive in a line per unit of time, the number of defects per unit of time, the number of accidents per unit of area, among others. Note that the measurement units (time and area, in these situations) are continuous, but the random variable (number of occurrences) is discrete.

The Poisson distribution presents the following hypotheses:
(i) Events defined in nonoverlapping intervals are independent;
(ii) In intervals with the same length, the probabilities that the same number of successes will occur are equal;
(iii) In very small intervals, the probability that more than one success will occur is insignificant;
(iv) In very small intervals, the probability of one success is proportional to the length of the interval.

Let us consider a discrete random variable X that represents the number of successes (k) in a certain unit of time, unit of area, among other possibilities. Random variable X, with parameter λ ≥ 0, follows a Poisson distribution, denoted by X ~ Poisson(λ), if its probability function is given by:

$$f(k) = P(X = k) = \frac{e^{-\lambda} \cdot \lambda^{k}}{k!}, \quad k = 0, 1, 2, \ldots \tag{6.45}$$


FIG. 6.10 Poisson probability function.

where:
e: base of the Napierian (or natural) logarithm, with e ≈ 2.718282;
λ: estimated average rate of occurrence of the event of interest for a certain exposure (time interval, area, among other examples).

Fig. 6.10 shows the Poisson probability function for λ = 1, 3, and 6. In the Poisson distribution, the mean is equal to the variance:

$$E(X) = Var(X) = \lambda \tag{6.46}$$

It is important to mention that we are going to use all knowledge on the Poisson distribution when studying the regression models for count data (Chapter 15).

6.3.7.1 Approximation of the Binomial by the Poisson Distribution

Let X be a random variable that follows a binomial distribution with parameters n and p, denoted by X ~ b(n, p). When the number of repetitions of a random experiment is very high (n → ∞) and the probability of success is very low (p → 0), such that n·p = λ remains constant, the binomial distribution approaches the Poisson distribution:

$$X \sim b(n, p) \approx X \sim Poisson(\lambda), \quad \text{with } \lambda = n \cdot p$$

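Example 6.11
Assume that the number of customers that arrive at a bank follows a Poisson distribution. We verified that, on average, 12 customers arrive at the bank per minute. Calculate: (a) the probability of 10 customers arriving in the next minute; (b) the probability of 40 customers arriving in the next 5 minutes; (c) X’s mean and variance.

Solution
We have λ = 12 customers per minute.

a) $$P(X = 10) = \frac{e^{-12} \cdot 12^{10}}{10!} = 0.1048$$

b) Over 5 minutes, the number of arrivals follows a Poisson distribution with λ = 5 × 12 = 60, so:

$$P(X = 40) = \frac{e^{-60} \cdot 60^{40}}{40!} \approx 0.0014$$

Note that 40 arrivals in 5 minutes is not the same event as 8 arrivals in a single minute, whose probability is $e^{-12} \cdot 12^{8}/8! = 0.0655$.

c) E(X) = Var(X) = λ = 12 (per minute)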
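The Poisson probability function (6.45) and the binomial approximation of Section 6.3.7.1 can be sketched in a few lines of Python. This is an illustrative sketch using only the standard library; the function names `poisson_pmf` and `binomial_pmf` are hypothetical, and the binomial case (n = 300, p = 0.01) is chosen because n is large and p is small.

```python
from math import comb, exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), Expression (6.45)."""
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ b(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Arrivals at a rate of 12 per minute (Example 6.11).
p_10_in_one_minute = poisson_pmf(10, 12)
# Over 5 minutes the rate scales to 5 * 12 = 60.
p_40_in_five_minutes = poisson_pmf(40, 5 * 12)

# Binomial with large n and small p versus Poisson with lam = n * p.
n, p = 300, 0.01
exact = binomial_pmf(0, n, p)
approx = poisson_pmf(0, n * p)

print(f"P(10 arrivals in 1 minute)  = {p_10_in_one_minute:.4f}")
print(f"P(40 arrivals in 5 minutes) = {p_40_in_five_minutes:.4f}")
print(f"Binomial P(X = 0) = {exact:.4f}, Poisson approximation = {approx:.4f}")
```

The two values printed on the last line should differ by less than 0.001, which illustrates why the approximation is considered adequate for this combination of n and p.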

Example 6.12
A certain part is produced in a production line. The probability of the part being defective is 0.01. If 300 parts are produced, what is the probability of none of them being defective?

Solution
This example is characterized by a binomial distribution. Since the number of repetitions is high and the probability of success is low, the binomial distribution can be approximated by a Poisson distribution with parameter λ = n·p = 300 × 0.01 = 3, such that:

$$P(X = 0) = \frac{e^{-3} \cdot 3^{0}}{0!} = 0.05$$

Table 6.1 presents a summary of the discrete distributions studied in this section, including each random variable’s probability function, the distribution parameters, and X’s expected value and variance.

TABLE 6.1 Models for Discrete Variables

| Distribution | Probability Function P(X) | Parameters | E(X) | Var(X) |
|---|---|---|---|---|
| Discrete uniform | $\frac{1}{n}$ | n | $\frac{1}{n}\sum_{i=1}^{n} x_i$ | $\frac{1}{n}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$ |
| Bernoulli | $p^{x}(1-p)^{1-x}$, x = 0, 1 | p | p | p·(1 − p) |
| Binomial | $\binom{n}{k} p^{k}(1-p)^{n-k}$, k = 0, 1, …, n | n, p | n·p | n·p·(1 − p) |
| Geometric | $P(X) = p(1-p)^{x-1}$, x = 1, 2, 3, …; $P(Y) = p(1-p)^{y}$, y = 0, 1, 2, … | p | $E(X) = \frac{1}{p}$; $E(Y) = \frac{1-p}{p}$ | $Var(X) = \frac{1-p}{p^2}$; $Var(Y) = \frac{1-p}{p^2}$ |
| Negative binomial | $\binom{x-1}{k-1} p^{k}(1-p)^{x-k}$, x = k, k + 1, … | k, p | $\frac{k}{p}$ | $\frac{k(1-p)}{p^2}$ |
| Hypergeometric | $\frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$, 0 ≤ k ≤ min(M, n) | N, M, n | $\frac{n \cdot M}{N}$ | $\frac{n \cdot M}{N} \cdot \frac{(N-M)(N-n)}{N(N-1)}$ |
| Poisson | $\frac{e^{-\lambda}\lambda^{k}}{k!}$, k = 0, 1, 2, … | λ | λ | λ |

6.4 PROBABILITY DISTRIBUTIONS FOR CONTINUOUS RANDOM VARIABLES

For continuous random variables, we are going to study the uniform, normal, exponential, gamma, chi-square (χ²), Student’s t, and Snedecor’s F distributions.

6.4.1 Uniform Distribution

The uniform model is the simplest model for continuous random variables. It is used to model the occurrence of events whose probability is constant in intervals with the same range. A random variable X follows a uniform distribution in the interval [a, b], denoted by X ~ U[a, b], if its probability density function is given by:

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b \\ 0, & \text{otherwise} \end{cases} \tag{6.47}$$

which can be graphically represented as seen in Fig. 6.11.

FIG. 6.11 Uniform distribution in the interval [a, b].

The expected value of X is calculated by the expression:

$$E(X) = \int_{a}^{b} x \cdot \frac{1}{b-a}\,dx = \frac{a+b}{2} \tag{6.48}$$

And the variance of X is:

$$Var(X) = E(X^2) - [E(X)]^2 = \frac{(b-a)^2}{12} \tag{6.49}$$

On the other hand, the cumulative distribution function of the uniform distribution is given by:

$$F(x) = P(X \le x) = \int_{a}^{x} f(x)\,dx = \begin{cases} 0, & \text{if } x < a \\ \dfrac{x-a}{b-a}, & \text{if } a \le x < b \\ 1, & \text{if } x \ge b \end{cases} \tag{6.50}$$

Example 6.13
Random variable X represents the time a bank’s ATMs are used (in minutes), and it follows a uniform distribution in the interval [1, 5]. Determine:
a) P(X < 2)
b) P(X > 3)
c) P(3 < X < 4)
d) E(X)
e) Var(X)

Solution
a) P(X < 2) = F(2) = (2 − 1)/(5 − 1) = 1/4
b) P(X > 3) = 1 − P(X < 3) = 1 − F(3) = 1 − (3 − 1)/(5 − 1) = 1/2
c) P(3 < X < 4) = F(4) − F(3) = (4 − 1)/(5 − 1) − (3 − 1)/(5 − 1) = 1/4
d) E(X) = (1 + 5)/2 = 3
e) Var(X) = (5 − 1)²/12 = 4/3
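As an illustration, the piecewise cumulative distribution function of the uniform distribution translates directly into code. The sketch below is a minimal Python version; `uniform_cdf` is a hypothetical helper name, and the interval [1, 5] is the one used in Example 6.13.

```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """F(x) = P(X <= x) for X ~ U[a, b]."""
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


a, b = 1, 5

p_less_than_2 = uniform_cdf(2, a, b)
p_greater_than_3 = 1 - uniform_cdf(3, a, b)
p_between_3_and_4 = uniform_cdf(4, a, b) - uniform_cdf(3, a, b)
mean = (a + b) / 2
variance = (b - a) ** 2 / 12

print(p_less_than_2, p_greater_than_3, p_between_3_and_4, mean, round(variance, 4))
```

Because the density is constant, every probability reduces to the length of the interval of interest divided by b − a.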

6.4.2 Normal Distribution

The normal distribution, also known as the Gaussian distribution, is the most widely used and important probability distribution, because it allows us to model a myriad of natural phenomena, studies of human behavior, and industrial processes, among others, and because it allows us to use approximations to calculate the probabilities of many random variables. A random variable X, with mean μ ∈ ℝ and standard deviation σ > 0, follows a normal or Gaussian distribution, denoted by X ~ N(μ, σ²), if its probability density function is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty \le x \le +\infty \tag{6.51}$$

whose graphical representation is shown in Fig. 6.12.


FIG. 6.12 Normal distribution.

FIG. 6.13 Area under the normal curve.

Fig. 6.13 shows the area under the normal curve based on the number of standard deviations. From Fig. 6.13, we can see that the curve has the shape of a bell and is symmetrical around parameter μ, and that the smaller parameter σ is, the more concentrated the curve is around μ. Therefore, in a normal distribution, the mean of X is:

$$E(X) = \mu \tag{6.52}$$

And the variance of X is:

$$Var(X) = \sigma^2 \tag{6.53}$$

In order to obtain the standard normal distribution (or reduced normal distribution), the original variable X is transformed into a new random variable Z, with mean 0 (μ = 0) and variance 1 (σ² = 1):

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \tag{6.54}$$

Score Z represents the number of standard deviations that separates a value of random variable X from the mean. This kind of transformation, known as Z-scores, is broadly used to standardize variables, because it does not change the shape of the original variable’s normal distribution, and it generates a new variable with mean 0 and variance 1. Therefore, when many variables with different orders of magnitude are being used in a certain type of modeling, the Z-scores standardization process will make all the new standardized variables have the same distribution, with equal orders of magnitude (Fávero et al., 2009). The probability density function of random variable Z is reduced to:

$$f(z) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{z^2}{2}}, \quad -\infty \le z \le +\infty \tag{6.55}$$

whose graphical representation is shown in Fig. 6.14.

FIG. 6.14 Standard normal distribution.

The cumulative distribution function F(x_c) of a normal random variable X is obtained by integrating Expression (6.51) from −∞ to x_c, that is:

$$F(x_c) = P(X \le x_c) = \int_{-\infty}^{x_c} f(x)\,dx \tag{6.56}$$

Integral (6.56) corresponds to the area under f(x) from −∞ to x_c, as shown in Fig. 6.15. In the specific case of the standard normal distribution, the cumulative distribution function is:

$$F(z_c) = P(Z \le z_c) = \int_{-\infty}^{z_c} f(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_c} e^{-\frac{z^2}{2}}\,dz \tag{6.57}$$

For a random variable Z with a standard normal distribution, let us suppose that the main goal now is to calculate P(Z > z_c). So, we have:

$$P(Z > z_c) = \int_{z_c}^{\infty} f(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{z_c}^{\infty} e^{-\frac{z^2}{2}}\,dz \tag{6.58}$$

Fig. 6.16 represents this probability. Table E in the Appendix shows the value of P(Z > z_c), that is, the cumulative probability from z_c to +∞ (the gray area under the normal curve).

FIG. 6.15 Cumulative normal distribution.

FIG. 6.16 Graphical representation of P(Z > z_c) for a standardized normal random variable.

6.4.2.1 Approximation of the Binomial by the Normal Distribution

Let X be a random variable that has a binomial distribution with parameters n and p, denoted by X ~ b(n, p). As the average number of successes and the average number of failures tend to infinity (n·p → ∞ and n·(1 − p) → ∞), the binomial distribution approaches a normal one with mean μ = n·p and variance σ² = n·p·(1 − p):

$$X \sim b(n, p) \approx X \sim N(\mu, \sigma^2), \quad \text{with } \mu = n \cdot p \text{ and } \sigma^2 = n \cdot p \cdot (1-p)$$

Some authors admit that the approximation of the binomial by the normal distribution is good when n·p > 5 and n·(1 − p) > 5, or when n·p·(1 − p) ≥ 3. A better and more conservative rule requires n·p > 10 and n·(1 − p) > 10. However, since a discrete distribution is being approximated by a continuous one, we recommend greater accuracy through a continuity correction, which consists in, for example, transforming P(X = x) into the interval P(x − 0.5 < X < x + 0.5).
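To see the continuity correction at work, the sketch below compares an exact binomial probability with its normal approximation. It is an illustration only: the values n = 50, p = 0.4, and x = 20 are hypothetical (chosen so that n·p = 20 and n·(1 − p) = 30 both exceed 10), and `statistics.NormalDist` from the Python standard library stands in for the table lookup.

```python
from math import comb, sqrt
from statistics import NormalDist

# Hypothetical binomial setting satisfying n*p > 10 and n*(1 - p) > 10.
n, p, x = 50, 0.4, 20

# Exact binomial probability P(X = x).
exact = comb(n, x) * p**x * (1 - p) ** (n - x)

# Normal approximation with mean n*p and variance n*p*(1 - p).
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

# Continuity correction: P(X = x) becomes P(x - 0.5 < X < x + 0.5).
approx = normal.cdf(x + 0.5) - normal.cdf(x - 0.5)

print(f"Exact binomial       : {exact:.4f}")
print(f"Normal approximation : {approx:.4f}")
```

Without the correction, the continuous approximation would assign probability zero to the single point X = x, which is why the half-unit interval is needed.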

6.4.2.2 Approximation of the Poisson by the Normal Distribution

Analogous to the binomial distribution, the Poisson distribution can also be approximated by a normal one. Let X be a random variable that follows a Poisson distribution with parameter λ, denoted by X ~ Poisson(λ). As λ → ∞, the Poisson distribution approaches a normal one with mean μ = λ and variance σ² = λ:

$$X \sim Poisson(\lambda) \approx X \sim N(\mu, \sigma^2), \quad \text{with } \mu = \lambda \text{ and } \sigma^2 = \lambda$$

In general, we admit that the approximation of the Poisson distribution by the normal distribution is good when λ > 10. Once again, we recommend using the continuity correction x − 0.5 and x + 0.5.

Example 6.14
We know that the average thickness of the hose storage units produced in a factory (X) follows a normal distribution with a mean of 3 mm and a standard deviation of 0.4 mm. Determine:
a) P(X > 4.1)
b) P(X > 3)
c) P(X ≤ 3)
d) P(X ≤ 3.5)
e) P(X < 2.3)
f) P(2 ≤ X ≤ 3.8)

Solution
The probabilities will be calculated based on Table E in the Appendix, which provides the value of P(Z > z_c):

a) $$P(X > 4.1) = P\left(Z > \frac{4.1 - 3}{0.4}\right) = P(Z > 2.75) = 0.0030$$

b) $$P(X > 3) = P\left(Z > \frac{3 - 3}{0.4}\right) = P(Z > 0) = 0.5$$

c) $$P(X \le 3) = P(Z \le 0) = 0.5$$

d) $$P(X \le 3.5) = P\left(Z \le \frac{3.5 - 3}{0.4}\right) = P(Z \le 1.25) = 1 - P(Z > 1.25) = 1 - 0.1056 = 0.8944$$

e) $$P(X < 2.3) = P\left(Z < \frac{2.3 - 3}{0.4}\right) = P(Z < -1.75) = P(Z > 1.75) = 0.04$$

f) $$P(2 \le X \le 3.8) = P\left(\frac{2 - 3}{0.4} \le Z \le \frac{3.8 - 3}{0.4}\right) = P(-2.5 \le Z \le 2) = P(Z \le 2) - P(Z < -2.5) = [1 - P(Z > 2)] - P(Z > 2.5) = [1 - 0.0228] - 0.0062 = 0.971$$
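The table lookups in Example 6.14 can be reproduced in software. The following sketch assumes Python 3.8 or later and uses `statistics.NormalDist` from the standard library in place of Table E; the variable names are illustrative.

```python
from statistics import NormalDist

# Thickness X ~ N(mu = 3 mm, sigma = 0.4 mm), as in Example 6.14.
thickness = NormalDist(mu=3, sigma=0.4)

p_a = 1 - thickness.cdf(4.1)                    # P(X > 4.1)
p_b = 1 - thickness.cdf(3)                      # P(X > 3)
p_d = thickness.cdf(3.5)                        # P(X <= 3.5)
p_e = thickness.cdf(2.3)                        # P(X < 2.3)
p_f = thickness.cdf(3.8) - thickness.cdf(2)     # P(2 <= X <= 3.8)

# The same result through the Z-score transformation (6.54).
z = (3.5 - 3) / 0.4
p_d_via_z = NormalDist().cdf(z)

for label, value in [("a", p_a), ("b", p_b), ("d", p_d), ("e", p_e), ("f", p_f)]:
    print(f"{label}) {value:.4f}")
```

Small differences in the last decimal place relative to the worked solution are expected, since the table rounds z to two decimals and the probabilities to four.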

6.4.3 Exponential Distribution

Another important distribution, which has applications in system reliability and in queueing theory, is the exponential distribution. Its main characteristic is the memoryless property, that is, the future lifetime (t) of a certain object has the same distribution, regardless of its past lifetime (s), for any s, t > 0, as shown in Expression (6.38), repeated below:

$$P(X > s + t \mid X > s) = P(X > t)$$

A continuous random variable X has an exponential distribution with parameter λ > 0, denoted by X ~ exp(λ), if its probability density function is given by:

$$f(x) = \begin{cases} \lambda \cdot e^{-\lambda x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases} \tag{6.59}$$

Fig. 6.17 represents the probability density function of the exponential distribution for parameters λ = 0.5, λ = 1, and λ = 2.

FIG. 6.17 Exponential distribution for λ = 0.5, λ = 1, and λ = 2.

We can see that the exponential distribution is positively skewed (skewed to the right), with a higher frequency for smaller values of x and a longer tail to the right. The density function assumes value λ when x = 0, and tends to zero as x → ∞. The higher the value of λ, the more quickly the function tends to zero. In the exponential distribution, the mean of X is:

$$E(X) = \frac{1}{\lambda} \tag{6.60}$$

and the variance of X is:

$$Var(X) = \frac{1}{\lambda^2} \tag{6.61}$$

And the cumulative distribution function F(x) is given by:

$$F(x) = P(X \le x) = \int_{0}^{x} f(x)\,dx = \begin{cases} 1 - e^{-\lambda x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases} \tag{6.62}$$


From (6.62) we can conclude that:

$$P(X > x) = e^{-\lambda x} \tag{6.63}$$

In system reliability, random variable X represents the lifetime, that is, the time during which a component or system remains operational, outside the interval for repairs and above a specified limit (yield, pressure, among other examples). On the other hand, parameter λ represents the failure rate, that is, the number of components or systems that failed in a preestablished time interval:

$$\lambda = \frac{\text{number of failures}}{\text{operation time}} \tag{6.64}$$

The main measures of reliability are: (a) mean time to failure (MTTF) and (b) mean time between failures (MTBF). Mathematically, MTTF and MTBF are equal to the mean of the exponential distribution and represent the mean lifetime. Thus, the failure rate can also be calculated as:

$$\lambda = \frac{1}{\text{MTTF (or MTBF)}} \tag{6.65}$$

In queueing theory, random variable X represents the waiting time until the next arrival (the time between two customers’ arrivals), whose mean is 1/λ. On the other hand, parameter λ represents the mean arrival rate, that is, the expected number of arrivals per unit of time.
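The relationships above between the failure rate, the survival probability (6.63), and the memoryless property can be illustrated with a short Python sketch. All numbers here are hypothetical (4 failures observed in 200 hours of operation), and the helper names `exp_cdf` and `exp_survival` are chosen only for this example.

```python
from math import exp


def exp_cdf(x: float, lam: float) -> float:
    """F(x) = P(X <= x) for X ~ exp(lam), Expression (6.62)."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


def exp_survival(x: float, lam: float) -> float:
    """P(X > x) for X ~ exp(lam), Expression (6.63)."""
    return exp(-lam * x) if x >= 0 else 1.0


# Hypothetical reliability data: 4 failures in 200 hours of operation.
failure_rate = 4 / 200            # Expression (6.64): lambda = 0.02 per hour
mttf = 1 / failure_rate           # mean lifetime, here 50 hours

p_fail_within_40h = exp_cdf(40, failure_rate)
p_survive_100h = exp_survival(100, failure_rate)

# Memoryless property: P(X > s + t | X > s) equals P(X > t).
s, t = 30, 40
conditional = exp_survival(s + t, failure_rate) / exp_survival(s, failure_rate)

print(f"MTTF = {mttf:.0f} hours")
print(f"P(X <= 40)  = {p_fail_within_40h:.4f}")
print(f"P(X > 100)  = {p_survive_100h:.4f}")
print(f"P(X > 70 | X > 30) = {conditional:.4f}, P(X > 40) = {exp_survival(t, failure_rate):.4f}")
```

The last line prints two equal values: having already survived 30 hours does not change the probability of lasting 40 more.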

6.4.3.1 Relationship Between the Poisson and the Exponential Distribution

If the number of occurrences in a counting process follows a Poisson(λ) distribution, then the random variables “time until the first occurrence” and “time between any successive occurrences” of that process have an exp(λ) distribution.

Example 6.15
The life span of an electronic component follows an exponential distribution with a mean lifetime of 120 hours. Determine:
a) The probability of a component failing in the first 100 hours of use;
b) The probability of a component lasting more than 150 hours.

Solution
We have λ = 1/120 and X ~ exp(1/120). Therefore:

a) $$P(X \le 100) = \int_{0}^{100} \frac{1}{120} \cdot e^{-\frac{x}{120}}\,dx = \left[-e^{-\frac{x}{120}}\right]_{0}^{100} = -e^{-\frac{100}{120}} + 1 = 0.5654$$

b) $$P(X > 150) = \int_{150}^{\infty} \frac{1}{120} \cdot e^{-\frac{x}{120}}\,dx = \left[-e^{-\frac{x}{120}}\right]_{150}^{\infty} = e^{-\frac{150}{120}} = 0.2865$$

6.4.4 Gamma Distribution

The gamma distribution is one of the most general distributions: others, such as the Erlang, exponential, and chi-square (χ²) distributions, are particular cases of it. Like the exponential distribution, it is widely used in system reliability. The gamma distribution also has applications in physical phenomena, meteorological processes, insurance risk theory, and economic theory. A continuous random variable X has a gamma distribution with parameters α > 0 and λ > 0, denoted by X ~ Gamma(α, λ), if its probability density function is given by:

$$f(x) = \begin{cases} \dfrac{\lambda^{\alpha}}{\Gamma(\alpha)} \cdot x^{\alpha-1} \cdot e^{-\lambda x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases} \tag{6.66}$$

where Γ(α) is the gamma function, given by:

$$\Gamma(\alpha) = \int_{0}^{\infty} e^{-x} \cdot x^{\alpha-1}\,dx, \quad \alpha > 0 \tag{6.67}$$

The gamma probability density function for some values of α and λ is represented in Fig. 6.18.

FIG. 6.18 Density function of x for some values of α and λ. (Source: Navidi, W., 2012. Probabilidade e estatística para ciências exatas. Bookman, Porto Alegre.)

We can see that the gamma distribution is positively skewed (skewed to the right), with a higher frequency for smaller values of x and a longer tail to the right. However, as α tends to infinity, the distribution becomes symmetrical. We can also observe that when α = 1, the gamma distribution is equal to the exponential. Moreover, the greater the value of λ, the more quickly the density function tends to zero. With the density written as in (6.66), the expected value of X can be calculated as:

$$E(X) = \frac{\alpha}{\lambda} \tag{6.68}$$

On the other hand, the variance of X is given by:

$$Var(X) = \frac{\alpha}{\lambda^2} \tag{6.69}$$

For α = 1, these expressions reduce to the mean 1/λ and the variance 1/λ² of the exponential distribution.

The cumulative distribution function is:

$$F(x) = P(X \le x) = \int_{0}^{x} f(x)\,dx = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_{0}^{x} x^{\alpha-1} \cdot e^{-\lambda x}\,dx \tag{6.70}$$

6.4.4.1 Special Cases of the Gamma Distribution

A gamma distribution whose parameter α is a positive integer is called an Erlang distribution, such that:

If α is a positive integer ⇒ X ~ Gamma(α, λ) ≡ X ~ Erlang(α, λ)

As mentioned before, a gamma distribution with parameter α = 1 is called an exponential distribution:

If α = 1 ⇒ X ~ Gamma(α, λ) ≡ X ~ exp(λ)

Likewise, a gamma distribution with parameters α = n/2 and λ = 1/2 is called a chi-square distribution with n degrees of freedom:

If α = n/2, λ = 1/2 ⇒ X ~ Gamma(n/2, 1/2) ≡ X ~ χ² with ν = n degrees of freedom

6.4.4.2 Relationship Between the Poisson and the Gamma Distribution In the Poisson distribution, we try to determine the number of occurrences of a certain event within a fixed period. On the other hand, the gamma distribution determines the time necessary to obtain a specified number of occurrences of the event.
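The link between the two distributions can be made concrete for integer α (the Erlang case): the waiting time until the k-th occurrence is at most t exactly when at least k occurrences happen in [0, t]. The Python sketch below checks this identity numerically under hypothetical values (λ = 2 occurrences per minute, k = 3, t = 2 minutes); the helper names and the Simpson's-rule integrator are assumptions made for this illustration, not part of the text.

```python
from math import exp, factorial, gamma


def erlang_cdf_via_poisson(t: float, k: int, lam: float) -> float:
    """P(T_k <= t) = P(at least k Poisson occurrences in [0, t])."""
    if t <= 0:
        return 0.0
    return 1 - sum(exp(-lam * t) * (lam * t) ** i / factorial(i) for i in range(k))


def gamma_pdf(x: float, alpha: float, lam: float) -> float:
    """Density (6.66); this sketch assumes alpha >= 1 so that f(0) is finite."""
    if x < 0:
        return 0.0
    return lam**alpha * x ** (alpha - 1) * exp(-lam * x) / gamma(alpha)


def simpson(f, a: float, b: float, n: int = 1000) -> float:
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


lam, k, t = 2.0, 3, 2.0   # hypothetical rate, number of occurrences, and time

via_poisson = erlang_cdf_via_poisson(t, k, lam)
via_integration = simpson(lambda x: gamma_pdf(x, k, lam), 0.0, t)   # Expression (6.70)

print(f"Poisson sum     : {via_poisson:.6f}")
print(f"Gamma integral  : {via_integration:.6f}")
```

Both routes should print the same probability up to numerical integration error, which supports reading the gamma distribution as the waiting-time counterpart of the Poisson count.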

6.4.5 Chi-Square Distribution

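A continuous random variable X has a chi-square distribution with ν degrees of freedom, denoted by X ~ χ²_ν, if its probability density function is given by:

$$f(x) = \begin{cases} \dfrac{1}{2^{\nu/2} \cdot \Gamma(\nu/2)} \cdot x^{\nu/2 - 1} \cdot e^{-x/2}, & x > 0 \\ 0, & x < 0 \end{cases} \tag{6.71}$$

FIG. 6.19 χ² distribution for different values of ν.

As summarized in Table 6.2, the expected value of X is ν and its variance is 2·ν. To calculate P(X > x_c), we have:

$$P(X > x_c) = \int_{x_c}^{\infty} f(x)\,dx \tag{6.76}$$

which can be represented by Fig. 6.20.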


FIG. 6.20 Graphical representation of P(X > x_c) for a random variable with a χ² distribution.

The χ² distribution has several applications in statistical inference. Due to its importance, the χ² distribution can be found in Table D in the Appendix, for different values of parameter ν. This table provides the critical values of x_c such that P(X > x_c) = α. In other words, we can obtain the probabilities and the cumulative distribution function for different values of x of random variable X.

Example 6.16
Assume that random variable X follows a chi-square distribution (χ²) with 13 degrees of freedom. Determine:
a) P(X > 5)
b) The x value such that P(X ≤ x) = 0.95
c) The x value such that P(X > x) = 0.95

Solution
Through the χ² distribution table (Table D in the Appendix), for ν = 13, we have:
a) P(X > 5) = 97.5%
b) 22.362
c) 5.892
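When Table D is not at hand, the upper-tail probability (6.76) can be approximated by integrating the density (6.71) numerically. The sketch below is one possible way to do this in Python with the standard library only; the Simpson's-rule helper and the function names are assumptions made for illustration, and the approach as written is intended for ν > 2, where the density is finite at x = 0.

```python
from math import exp, gamma


def chi2_pdf(x: float, nu: float) -> float:
    """Chi-square density (6.71); returns 0 at x <= 0 (adequate for nu > 2)."""
    if x <= 0:
        return 0.0
    return x ** (nu / 2 - 1) * exp(-x / 2) / (2 ** (nu / 2) * gamma(nu / 2))


def simpson(f, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def chi2_upper_tail(xc: float, nu: float) -> float:
    """P(X > xc), computed as 1 minus the integral of the density on [0, xc]."""
    return 1 - simpson(lambda x: chi2_pdf(x, nu), 0.0, xc)


nu = 13   # degrees of freedom used in Example 6.16
for xc in (5, 22.362, 5.892):
    print(f"P(X > {xc}) = {chi2_upper_tail(xc, nu):.4f}")
```

The three printed probabilities should be close to 0.975, 0.05, and 0.95, respectively, matching the table-based answers of Example 6.16 up to rounding.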

6.4.6 Student’s t Distribution

Student’s t distribution was developed by William Sealy Gosset, and it is one of the main probability distributions, with several applications in statistical inference. Let us assume a random variable Z that has a normal distribution with mean 0 and standard deviation 1, and a random variable X with a chi-square distribution with ν degrees of freedom, such that Z and X are independent. Continuous random variable T is then defined as:

$$T = \frac{Z}{\sqrt{X/\nu}} \tag{6.77}$$

We can say that variable T follows Student’s t distribution with ν degrees of freedom, denoted by T ~ t_ν, if its probability density function is given by:

$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \cdot \sqrt{\pi\nu}} \cdot \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < t < \infty$$

As summarized in Table 6.2, E(T) = 0 and Var(T) = ν/(ν − 2), the latter for ν > 2. To calculate P(T > t_c), we have:

$$P(T > t_c) = \int_{t_c}^{\infty} f(t)\,dt \tag{6.82}$$

as shown in Fig. 6.22.

FIG. 6.22 Graphical representation of Student’s t distribution.

Just like the normal and chi-square (χ²) distributions, Student’s t distribution has several applications in statistical inference, so there is a table for obtaining the probabilities, based on different values of parameter ν (Table B in the Appendix). This table provides the critical values of t_c such that P(T > t_c) = α. In other words, we can obtain the probabilities and the cumulative distribution function for different values of t of random variable T. We are going to use Student’s t distribution when studying simple and multiple regression models (Chapter 13).


Example 6.17
Assume that random variable T follows Student’s t distribution with 7 degrees of freedom. Determine:
a) P(T > 3.5)
b) P(T < 3)
c) P(T < −0.711)
d) The t value such that P(T ≤ t) = 0.95
e) The t value such that P(T > t) = 0.10

Solution
a) 0.5%
b) 99%
c) 25%
d) 1.895
e) 1.415
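The construction of T in Expression (6.77) can also be explored by simulation. The sketch below is a rough Monte Carlo illustration in Python; it relies on the standard result, assumed here rather than taken from the text, that the sum of the squares of ν independent standard normal variables follows a chi-square distribution with ν degrees of freedom. The seed, the sample size, and the name `draw_t` are arbitrary choices.

```python
import random
from math import sqrt


def draw_t(nu: int, rng: random.Random) -> float:
    """One draw of T = Z / sqrt(X / nu), with Z ~ N(0, 1) and X ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    # Assumed standard result: a sum of nu squared N(0, 1) draws is chi-square(nu).
    x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / sqrt(x / nu)


rng = random.Random(7)        # fixed seed so that the run is reproducible
nu = 7                        # degrees of freedom, as in Example 6.17
draws = [draw_t(nu, rng) for _ in range(50_000)]

sample_mean = sum(draws) / len(draws)
sample_variance = sum((t - sample_mean) ** 2 for t in draws) / (len(draws) - 1)
upper_tail = sum(t > 1.895 for t in draws) / len(draws)

print(f"Sample mean      : {sample_mean:.3f}  (theory: 0)")
print(f"Sample variance  : {sample_variance:.3f}  (theory: {nu / (nu - 2):.3f})")
print(f"P(T > 1.895)     : {upper_tail:.3f}  (table: 0.05)")
```

Being a simulation, the printed figures only approximate the theoretical values, and they move closer as the number of draws grows.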

6.4.7 Snedecor’s F Distribution

Snedecor’s F distribution, also known as Fisher’s distribution, is frequently used in tests associated with the analysis of variance (ANOVA), to compare the means of more than two populations. Let us consider continuous random variables Y1 and Y2, such that:
- Y1 and Y2 are independent;
- Y1 follows a chi-square distribution with ν1 degrees of freedom, denoted by Y1 ~ χ²_ν1;
- Y2 follows a chi-square distribution with ν2 degrees of freedom, denoted by Y2 ~ χ²_ν2.

We are going to define a new continuous random variable X such that:

$$X = \frac{Y_1/\nu_1}{Y_2/\nu_2} \tag{6.83}$$

So, we say that X has a Snedecor’s F distribution with ν1 and ν2 degrees of freedom, denoted by X ~ F_{ν1, ν2}, if its probability density function is given by:

$$f(x) = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right) \cdot \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \cdot x^{(\nu_1/2)-1}}{\Gamma\left(\frac{\nu_1}{2}\right) \cdot \Gamma\left(\frac{\nu_2}{2}\right) \cdot \left[\left(\frac{\nu_1}{\nu_2}\right) \cdot x + 1\right]^{(\nu_1+\nu_2)/2}}, \quad x > 0 \tag{6.84}$$

where

$$\Gamma(\alpha) = \int_{0}^{\infty} e^{-x} \cdot x^{\alpha-1}\,dx$$

Fig. 6.23 shows the behavior of Snedecor’s F distribution probability density function, for different values of ν1 and ν2. We can see that Snedecor’s F distribution is positively skewed (skewed to the right), with a higher frequency for smaller values of x and a longer tail to the right. However, as ν1 and ν2 tend to infinity, the distribution becomes symmetrical. The expected value of X is calculated as:

$$E(X) = \frac{\nu_2}{\nu_2 - 2}, \quad \text{for } \nu_2 > 2 \tag{6.85}$$

On the other hand, the variance of X is given by:

$$Var(X) = \frac{2 \cdot \nu_2^2 \cdot (\nu_1 + \nu_2 - 2)}{\nu_1 \cdot (\nu_2 - 4) \cdot (\nu_2 - 2)^2}, \quad \text{for } \nu_2 > 4 \tag{6.86}$$

FIG. 6.23 Probability density function for F4,12 and F30,30.

FIG. 6.24 Critical values of Snedecor’s F distribution.

Just like the normal, χ², and Student’s t distributions, Snedecor’s F distribution has several applications in statistical inference, and there is a table from which we can obtain the probabilities and the cumulative distribution function, based on different values of parameters ν1 and ν2 (Table A in the Appendix). This table provides the critical values of F_c such that P(X > F_c) = α (Fig. 6.24). We are going to use Snedecor’s F distribution when studying simple and multiple regression models (Chapter 13).

6.4.7.1 Relationship Between Student’s t and Snedecor’s F Distribution

Let us consider a random variable T with Student’s t distribution with ν degrees of freedom. Then the square of variable T follows Snedecor’s F distribution with ν1 = 1 and ν2 = ν degrees of freedom, as shown by Fávero et al. (2009). Thus:

If T ~ t_ν, then T² ~ F_{1, ν2}, with ν2 = ν

Example 6.18
Assume that random variable X follows Snedecor’s F distribution with ν1 = 6 degrees of freedom in the numerator and ν2 = 12 degrees of freedom in the denominator, that is, X ~ F6,12. Determine:
a) P(X > 3)
b) F6,12 with α = 10%
c) The x value such that P(X ≤ x) = 0.975

Solution
Through Snedecor’s F distribution table (Table A in the Appendix), for ν1 = 6 and ν2 = 12, we have:
a) P(X > 3) = 5%
b) 2.33
c) 3.73
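Expressions (6.85) and (6.86) are easy to mistype by hand, so a small helper can be convenient. The Python sketch below encodes them and also illustrates one consequence of the relationship T² ~ F with ν1 = 1: since E(T) = 0, the variance of T equals E(T²), which must match the mean of the corresponding F distribution. The function names are hypothetical, and ν1 = 6, ν2 = 12 are the degrees of freedom of Example 6.18.

```python
def f_mean(nu2: float) -> float:
    """E(X) for X ~ F(nu1, nu2), Expression (6.85); requires nu2 > 2."""
    if nu2 <= 2:
        raise ValueError("the mean of the F distribution requires nu2 > 2")
    return nu2 / (nu2 - 2)


def f_variance(nu1: float, nu2: float) -> float:
    """Var(X) for X ~ F(nu1, nu2), Expression (6.86); requires nu2 > 4."""
    if nu2 <= 4:
        raise ValueError("the variance of the F distribution requires nu2 > 4")
    return 2 * nu2**2 * (nu1 + nu2 - 2) / (nu1 * (nu2 - 4) * (nu2 - 2) ** 2)


def t_variance(nu: float) -> float:
    """Var(T) for T ~ t(nu); requires nu > 2."""
    if nu <= 2:
        raise ValueError("the variance of Student's t requires nu > 2")
    return nu / (nu - 2)


nu1, nu2 = 6, 12
mean_f = f_mean(nu2)
variance_f = f_variance(nu1, nu2)

print(f"E(X)   for F(6, 12) = {mean_f:.4f}")
print(f"Var(X) for F(6, 12) = {variance_f:.4f}")
# Consistency check: Var(T) = E(T^2) = E(F with nu1 = 1 and nu2 = nu).
print(f"Var(T) for t(12) = {t_variance(12):.4f}, E(X) for F(1, 12) = {f_mean(12):.4f}")
```

The guards raise an error when the moment does not exist, which mirrors the conditions ν2 > 2 and ν2 > 4 attached to the formulas.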


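Table 6.2 shows a summary of the continuous distributions studied in this section, including each random variable’s probability density function, the distribution parameters, and X’s expected value and variance.

TABLE 6.2 Models for Continuous Variables

| Distribution | Probability Density Function | Parameters | E(X) | Var(X) |
|---|---|---|---|---|
| Uniform | $\frac{1}{b-a}$, a ≤ x ≤ b | a, b | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Normal | $\frac{1}{\sigma\sqrt{2\pi}} \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, −∞ ≤ x ≤ +∞ | μ, σ | μ | σ² |
| Exponential | $\lambda \cdot e^{-\lambda x}$, x ≥ 0 | λ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
| Gamma | $\frac{\lambda^{\alpha}}{\Gamma(\alpha)} \cdot x^{\alpha-1} \cdot e^{-\lambda x}$, x ≥ 0 | α, λ | $\frac{\alpha}{\lambda}$ | $\frac{\alpha}{\lambda^2}$ |
| Chi-square (χ²) | $\frac{1}{2^{\nu/2} \cdot \Gamma(\nu/2)} \cdot x^{\nu/2-1} \cdot e^{-x/2}$, x > 0 | ν | ν | 2·ν |
| Student’s t | $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \cdot \sqrt{\pi\nu}} \cdot \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$, −∞ < t < ∞ | ν | E(T) = 0 | $Var(T) = \frac{\nu}{\nu-2}$ |
| Snedecor’s F | $\frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right) \cdot \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \cdot x^{(\nu_1/2)-1}}{\Gamma\left(\frac{\nu_1}{2}\right) \cdot \Gamma\left(\frac{\nu_2}{2}\right) \cdot \left[\left(\frac{\nu_1}{\nu_2}\right) \cdot x+1\right]^{(\nu_1+\nu_2)/2}}$, x > 0 | ν1, ν2 | $\frac{\nu_2}{\nu_2-2}$ | $\frac{2 \cdot \nu_2^2 \cdot (\nu_1+\nu_2-2)}{\nu_1 \cdot (\nu_2-4) \cdot (\nu_2-2)^2}$ |

6.5 FINAL REMARKS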

This chapter discussed the main probability distributions used in statistical inference, including the distributions for discrete random variables (discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson) and for continuous random variables (uniform, normal, exponential, gamma, chi-square (χ²), Student’s t, and Snedecor’s F). When characterizing probability distributions, it is extremely important to use measures that indicate the most relevant aspects of the distribution, such as measures of position (mean, median, and mode), measures of dispersion (variance and standard deviation), and measures of skewness and kurtosis. Understanding the concepts related to probability and to probability distributions helps the researcher in the study of topics related to statistical inference, including parametric and nonparametric hypothesis tests, multivariate analysis through exploratory techniques, and estimation of regression models.

6.6 EXERCISES

1) In a shoe production line, the probability of a defective item being produced is 2%. For a batch with 150 items, determine the probability of a maximum of two items being defective. Also determine the mean and the variance.
2) The probability of a student solving a certain problem is 12%. If 10 students are selected randomly, what is the probability of exactly one of them being successful?
3) A telemarketing salesman sells one product every 8 customers he contacts. The salesman prepares a list of customers. Determine the probability of the first product being sold in the fifth call, in addition to the expected sales value and the respective variance.
4) The probability of a player scoring a penalty is 95%. Determine the probability of the player having to take a penalty kick 33 times to score 30 goals, besides the mean of penalty kicks.
5) Assume that, in a certain hospital, 3 patients undergo stomach surgery daily, following a Poisson distribution. Calculate the probability of 28 patients undergoing surgery next week (7 business days).
6) Assume that a certain random variable X follows a normal distribution with μ = 8 and σ² = 36. Determine the following probabilities:
a) P(X ≤ 12)
b) P(X < 5)
c) P(X > 2)
d) P(6 < X ≤ 11)
7) Consider random variable Z with a standardized normal distribution. Determine critical value z_c such that P(Z > z_c) = 80%.
8) When tossing 40 balanced coins, determine the following probabilities:
a) Of getting exactly 22 heads.
b) Of getting more than 25 heads.
Solve this exercise by approximating the distribution through a normal distribution.
9) The time until a certain electronic device fails follows an exponential distribution with a failure rate per hour of 0.028. Determine the probability of a device chosen randomly remaining operational for:
a) 120 hours;
b) 60 hours.
10) A certain type of device follows an exponential distribution with a mean lifetime of 180 hours. Determine:
a) The probability of the device lasting more than 220 hours;
b) The probability of the device lasting a maximum of 150 hours.
11) The arrival of patients in a lab follows an exponential distribution with an average rate of 1.8 clients per minute. Determine:
a) The probability of the next client’s arrival taking more than 30 seconds;
b) The probability of the next client’s arrival taking a maximum of 1.5 minutes.
12) The time between clients’ arrivals in a restaurant follows an exponential distribution with a mean of 3 minutes. Determine:
a) The probability of more than 3 clients arriving in 6 minutes;
b) The probability of the time until the fourth client arrives being less than 10 minutes.
13) A random variable X has a chi-square distribution with ν = 12 degrees of freedom. What is critical value x_c such that P(X > x_c) = 90%?
14) Now, assume that X follows a chi-square distribution with ν = 16 degrees of freedom. Determine:
a) P(X > 25)
b) P(X ≤ 32)
c) P(25 < X ≤ 32)
d) The x value such that P(X ≤ x) = 0.975
e) The x value such that P(X > x) = 0.975
15) A random variable T follows Student’s t distribution with ν = 20 degrees of freedom. Determine:
a) Critical value t_c such that P(−t_c < T < t_c) = 95%
b) E(T)
c) Var(T)
16) Now, assume that T follows Student’s t distribution with ν = 14 degrees of freedom. Determine:
a) P(T > 3)
b) P(T ≤ 2)
c) P(1.5 < T ≤ 2)
d) The t value such that P(T ≤ t) = 0.90
e) The t value such that P(T > t) = 0.025
17) Consider a random variable X that follows Snedecor’s F distribution with ν1 = 4 and ν2 = 16 degrees of freedom, that is, X ~ F4,16. Determine:
a) P(X > 3)
b) F4,16 with α = 2.5%
c) The x value such that P(X ≤ x) = 0.99
d) E(X)
e) Var(X)

Chapter 7

Sampling

Our reason becomes obscure when we consider that the countless fixed stars that shine in the sky do not have any other purpose besides illuminating worlds in which weeping and pain rule, and, in the best case scenario, only unpleasantness exists; at least, judging by the sample we know.
Arthur Schopenhauer

7.1 INTRODUCTION

As discussed in the Introduction of this book, a population is the set of all the individuals, objects, or elements to be studied, which have one or more characteristics in common. A census is the study of data related to all the elements of the population. According to Bruni (2011), populations can be finite or infinite. Finite populations have a limited size, allowing their elements to be counted; infinite populations, on the other hand, have an unlimited size, not allowing us to count their elements. As examples of finite populations, we can mention the number of employees in a certain company, the number of members in a club, the number of products manufactured during a certain period, etc. When the number of elements in a population is too high, even though they can be counted, we assume that the population is infinite. Examples of populations considered infinite are the number of inhabitants in the world, the number of residences in Rio de Janeiro, the number of points on a straight line, etc.

Therefore, there are situations in which a study with all the elements in a population is impossible or unwanted. Hence, the alternative is to extract a subset from the population under analysis, which is called a sample. The sample must be representative of the population being studied, hence the importance of this chapter. From the information gathered in the sample and using suitable statistical procedures, the results obtained can be used to generalize, infer, or draw conclusions regarding the population (statistical inference). For Fávero et al. (2009) and Bussab and Morettin (2011), it is rarely possible to obtain the exact distribution of a variable, due to the high costs, the time needed, and the difficulties in collecting the data. Hence, the alternative is to select part of the elements in the population (sample) and, after that, infer the properties for the whole (population).
Essentially, there are two types of sampling: (1) probability or random sampling, and (2) nonprobability or nonrandom sampling. In random sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same. In nonrandom sampling, on the other hand, the probability of some or all the elements of the population being in the sample is unknown. Fig. 7.1 shows the main random and nonrandom sampling techniques.

FIG. 7.1 Main sampling techniques.
Random sampling: simple, systematic, stratified, cluster.
Nonrandom sampling: convenience, judgmental, quota, snowball.

Fávero et al. (2009) show the advantages and disadvantages of random and nonrandom techniques. Regarding random sampling techniques, the main advantages are: a) the selection criteria of the elements are rigorously defined, not allowing the researcher's or the interviewer's subjectivity to interfere in the selection of the elements; b) the possibility to mathematically determine the sample size based on the accuracy and on the confidence level desired for the results. On the other hand, the main disadvantages are: a) difficulty in obtaining current and complete listings or regions of the population; b) geographically speaking, a random selection can generate a highly dispersed sample, increasing the costs, the time needed for the study, and the difficulty in collecting the data.
As regards nonrandom sampling techniques, the advantages are lower costs, less time to carry out the study, and less need for human resources. As disadvantages, we can mention: a) there are units in the population that cannot be chosen; b) a personal bias may occur; c) we do not know with what level of confidence the conclusions arrived at can be inferred for the population. These techniques do not use a random method to select the elements of the sample, so there is no guarantee that the sample selected is a good representative of the population (Fávero et al., 2009).
Choosing the sampling technique must consider the goals of the survey, the acceptable error in the results, accessibility to the elements of the population, the desired representativeness, the time needed, and the availability of financial and human resources.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00007-0 © 2019 Elsevier Inc. All rights reserved.

7.2 PROBABILITY OR RANDOM SAMPLING

In this type of sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same, and all of the samples selected are equally probable. In this section, we will study the main probability or random sampling techniques: (a) simple random sampling, (b) systematic sampling, (c) stratified sampling, and (d) cluster sampling.

7.2.1 Simple Random Sampling

According to Bolfarine and Bussab (2005), simple random sampling (SRS) is the simplest and most important method for selecting a sample. Consider a population or universe (U) with N elements: $U = \{1, 2, \ldots, N\}$. According to Bolfarine and Bussab (2005), planning and selecting the sample include the following steps:
(a) Using a random procedure (for example, a table of random numbers or a gravity-pick machine), we draw an element from population U, each element having the same probability of being drawn;
(b) We repeat the previous process until a sample with n observations is generated (the calculation of the size of a simple random sample will be studied in Section 7.4);
(c) When the value drawn is removed from U before the next draw, we have the SRS without replacement process. If drawing a unit more than once is allowed, we have the SRS with replacement process.
According to Bolfarine and Bussab (2005), from a practical point of view, an SRS without replacement is much more interesting, because it satisfies the intuitive principle that we do not gain more information when the same unit appears more than once in the sample. On the other hand, an SRS with replacement has mathematical and statistical advantages, such as the independence between the units drawn. Let's now study each of them.

7.2.1.1 Simple Random Sampling Without Replacement

According to Bolfarine and Bussab (2005), an SRS without replacement works as follows:
(a) All of the elements in the population are numbered from 1 to N: $U = \{1, 2, \ldots, N\}$;
(b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population;
(c) We draw the following element, with the previous value being removed from the population;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).

In this type of sampling, there are $C_{N,n} = \binom{N}{n} = \frac{N!}{n!(N-n)!}$ possible samples of n elements that can be obtained from the population, and each sample has the same probability of being selected, $1/\binom{N}{n}$.

Example 7.1: Simple Random Sampling without Replacement
Table 7.E.1 shows the weight (kg) of 30 parts. Draw, without any replacements, a random sample of size n = 5. How many different samples of size n can be obtained from the population? What is the probability of a sample being selected?

TABLE 7.E.1 Weight (kg) of 30 parts
6.4  6.2  7.0  6.8  7.2  6.4  6.5  7.1  6.8  6.9  7.0  7.1  6.6  6.8  6.7
6.3  6.6  7.2  7.0  6.9  6.8  6.7  6.5  7.2  6.8  6.9  7.0  6.7  6.9  6.8

Solution All 30 parts were numbered from 1 to 30, as shown in Table 7.E.2.

TABLE 7.E.2 Numbers given to the parts
Part          1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Weight (kg)  6.4  6.2  7.0  6.8  7.2  6.4  6.5  7.1  6.8  6.9  7.0  7.1  6.6  6.8  6.7

Part         16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
Weight (kg)  6.3  6.6  7.2  7.0  6.9  6.8  6.7  6.5  7.2  6.8  6.9  7.0  6.7  6.9  6.8

Through a random procedure (as, for example, the RANDBETWEEN function in Excel), the following numbers were selected: 02 03 14 24 28

The parts associated with these numbers form the random sample selected. There are $\binom{30}{5} = \frac{30 \times 29 \times 28 \times 27 \times 26}{5!} = 142{,}506$ different samples. The probability of a certain sample being selected is 1/142,506.
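The procedure of Example 7.1 can also be sketched in Python instead of Excel. This is only an illustration: the seed value and the variable names are assumptions of the sketch, and the labels drawn will generally differ from the ones listed in the example.

```python
import math
import random

# Weights (kg) of the 30 parts from Table 7.E.1, in numbering order 1..30.
weights = [6.4, 6.2, 7.0, 6.8, 7.2, 6.4, 6.5, 7.1, 6.8, 6.9,
           7.0, 7.1, 6.6, 6.8, 6.7, 6.3, 6.6, 7.2, 7.0, 6.9,
           6.8, 6.7, 6.5, 7.2, 6.8, 6.9, 7.0, 6.7, 6.9, 6.8]

N, n = len(weights), 5

rng = random.Random(2019)  # arbitrary seed, used only for reproducibility

# random.sample draws without replacement, so no label can appear twice.
labels = sorted(rng.sample(range(1, N + 1), n))
sample = [weights[i - 1] for i in labels]

n_samples = math.comb(N, n)  # number of distinct samples of size n
prob = 1 / n_samples         # probability of any particular sample

print(labels, sample)
print(n_samples, prob)
```

Here `math.comb(30, 5)` evaluates the binomial coefficient directly, which avoids computing the large factorials by hand.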

7.2.1.2 Simple Random Sampling With Replacement

According to Bolfarine and Bussab (2005), an SRS with replacement works as follows:
(a) All of the elements in the population are numbered from 1 to N: $U = \{1, 2, \ldots, N\}$;
(b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population;
(c) We put this unit back into the population and draw the following value;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).

In this type of sampling, there are $N^n$ possible samples of n elements that can be obtained from the population, and each sample has the same probability $1/N^n$ of being selected.


Example 7.2: Simple Random Sampling with Replacement
Redo Example 7.1 considering a simple random sampling with replacement.
Solution
The 30 parts were numbered from 1 to 30. Through a random procedure (for example, we can use the RANDBETWEEN function in Excel), we drew the first part for the sample (12). This part is put back and the second element is drawn (33). The procedure is repeated until five parts have been drawn: 12 33 02 25 33
The parts associated with these numbers form the random sample selected. There are $30^5 = 24{,}300{,}000$ different samples. The probability of a certain sample being selected is 1/24,300,000.
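A minimal Python sketch of drawing with replacement follows, assuming the parts are labeled 1 to 30 as in Example 7.1. The seed and the variable names are illustrative choices, so the labels produced are not those of the example.

```python
import random

N, n = 30, 5

rng = random.Random(7)  # arbitrary seed, used only for reproducibility

# random.choices draws with replacement: every draw is independent of the
# previous ones, so the same label may appear more than once.
labels = rng.choices(range(1, N + 1), k=n)

n_samples = N ** n    # number of ordered samples of size n
prob = 1 / n_samples  # probability of any particular ordered sample

print(labels)
print(n_samples, prob)
```

Because each unit is returned to the population before the next draw, the count of possible samples is $N^n$ rather than a binomial coefficient.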

7.2.2 Systematic Sampling

According to Costa Neto (2002), when the elements of the population are sorted and periodically removed, we have a systematic sampling. Hence, for example, in a production line, we can remove one element for every 50 items produced. As advantages of systematic sampling, in comparison to simple random sampling, we can mention that it is carried out in a much faster and cheaper way, besides being less susceptible to errors made by the interviewer during the survey. The main disadvantage is the possibility of having variation cycles, especially if these cycles coincide with the period at which the elements are removed for the sample. For example, let's suppose that one part is inspected for every 60 parts produced by a certain machine; however, this machine has a recurring flaw, so that one in every 20 parts produced is defective. Since 60 is a multiple of 20, the inspected parts always fall at the same point of the defect cycle, and the sample would contain either only defective parts or none at all.
Assuming that the elements of the population are sorted from 1 to N and that we already know the sample size (n), systematic sampling works as follows:
(a) We must determine the sampling interval (k), obtained by the quotient of the population size and the sample size:
$$k = \frac{N}{n}$$
This value must be rounded to the closest integer.
(b) In this phase, we introduce an element of randomness, choosing the starting unit. The first element chosen, $X_1$, can be any element between 1 and k;
(c) After choosing the first element, every k-th element after it is removed from the population. The process is repeated until it reaches the sample size (n):
$$X_1,\; X_1 + k,\; X_1 + 2k,\; \ldots,\; X_1 + (n-1)k$$

Example 7.3: Systematic Sampling
Imagine a population with N = 500 sorted elements. We wish to remove a sample with n = 20 elements from this population. Use the systematic sampling procedure.
Solution
(a) The sampling interval (k) is:
$$k = \frac{N}{n} = \frac{500}{20} = 25$$
(b) The first element chosen (X) can be any element between 1 and 25; suppose that X = 5;
(c) Since the first element of the sample is X = 5, the second element will be X = 5 + 25 = 30, the third element X = 5 + 50 = 55, and so on, so the last element of the sample will be X = 5 + 19 × 25 = 480:
A = {5, 30, 55, 80, 105, 130, 155, 180, 205, 230, 255, 280, 305, 330, 355, 380, 405, 430, 455, 480}
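The three steps of systematic sampling can be sketched as a short Python function. The function name and its parameters are hypothetical, introduced only for this illustration; fixing `start=5` reproduces the set A of Example 7.3.

```python
import random


def systematic_sample(N, n, start=None, rng=random):
    """Return the labels X1, X1 + k, ..., X1 + (n - 1)k of a systematic sample."""
    # Step (a): sampling interval, rounded to the closest integer.
    # Note: Python's round() sends exact halves to the nearest even integer.
    k = round(N / n)
    # Step (b): random starting unit between 1 and k, unless one is given.
    if start is None:
        start = rng.randint(1, k)
    # Step (c): take every k-th element after the starting unit.
    return [start + j * k for j in range(n)]


A = systematic_sample(500, 20, start=5)
print(A)
```

When N/n is not an integer, rounding k may push the last label beyond N, so in practice the interval or the sample size may need adjusting.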

7.2.3 Stratified Sampling

In this type of sampling, a heterogeneous population is stratified or divided into subpopulations or homogeneous strata, and, in each stratum, a sample is drawn. Hence, initially, we define the number of strata and, by doing that, we obtain the size of each stratum. For each stratum, we specify how many elements will be drawn from the subpopulation, and this can be a uniform or proportional allocation. According to Costa Neto (2002), uniform stratified sampling, in which we draw an equal number of elements from each stratum, is recommended when the strata are approximately the same size. In proportional stratified sampling, on the other hand, the number of elements drawn from each stratum is proportional to the number of elements in the stratum.
According to Freund (2006), if the elements selected in each stratum are simple random samples, the global process (stratification followed by random sampling) is called (simple) stratified random sampling. According to Freund (2006), stratified sampling works as follows:
(a) A population of size N is divided into k strata of sizes $N_1, N_2, \ldots, N_k$;
(b) For each stratum, a random sample of size $n_i$ (i = 1, 2, …, k) is selected, resulting in k subsamples of sizes $n_1, n_2, \ldots, n_k$.
In uniform stratified sampling, we have:
$$n_1 = n_2 = \ldots = n_k \tag{7.1}$$
where the sample size obtained from each stratum is:
$$n_i = \frac{n}{k}, \text{ for } i = 1, 2, \ldots, k \tag{7.2}$$
where $n = n_1 + n_2 + \ldots + n_k$.
In proportional stratified sampling, on the other hand, we have:
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \ldots = \frac{n_k}{N_k} \tag{7.3}$$
In proportional sampling, the sample size obtained from each stratum can be obtained according to the following expression:
$$n_i = \frac{N_i}{N} \times n, \text{ for } i = 1, 2, \ldots, k \tag{7.4}$$

As examples of stratified sampling, we can mention the stratification of a city into neighborhoods, of a population by gender or age group, of customers by social class, or of students by school. The calculation of the size of a stratified sample will be studied in Section 7.4.3.

Example 7.4: Stratified Sampling
Consider a club that has N = 5000 members. The population can be divided by age group, aiming at identifying the main activities practiced by each group: from 0 to 4 years of age; from 5 to 11; from 12 to 17; from 18 to 25; from 26 to 36; from 37 to 50; from 51 to 65; and over 65 years of age. We have $N_1 = 330$, $N_2 = 350$, $N_3 = 400$, $N_4 = 520$, $N_5 = 650$, $N_6 = 1030$, $N_7 = 980$, $N_8 = 740$. We would like to draw a stratified sample of size n = 80 from the population. What should be the size of the sample drawn from each stratum in the case of uniform sampling and of proportional sampling?
Solution
For uniform sampling, $n_i = n/k = 80/8 = 10$. Therefore, $n_1 = \ldots = n_8 = 10$.
For proportional sampling, we calculate $n_i = \frac{N_i}{N} \times n$, for i = 1, 2, …, 8, rounding each result up:
$n_1 = \frac{N_1}{N} \times n = \frac{330}{5{,}000} \times 80 = 5.3 \cong 6$
$n_2 = \frac{N_2}{N} \times n = \frac{350}{5{,}000} \times 80 = 5.6 \cong 6$
$n_3 = \frac{N_3}{N} \times n = \frac{400}{5{,}000} \times 80 = 6.4 \cong 7$
$n_4 = \frac{N_4}{N} \times n = \frac{520}{5{,}000} \times 80 = 8.3 \cong 9$
$n_5 = \frac{N_5}{N} \times n = \frac{650}{5{,}000} \times 80 = 10.4 \cong 11$
$n_6 = \frac{N_6}{N} \times n = \frac{1{,}030}{5{,}000} \times 80 = 16.5 \cong 17$
$n_7 = \frac{N_7}{N} \times n = \frac{980}{5{,}000} \times 80 = 15.7 \cong 16$
$n_8 = \frac{N_8}{N} \times n = \frac{740}{5{,}000} \times 80 = 11.8 \cong 12$
Because every value is rounded up, the stratum samples add up to 84 elements rather than 80.
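The two allocation rules of expressions (7.2) and (7.4) can be checked with a short Python sketch. The function names are hypothetical, and rounding every stratum size up is an assumption that follows the convention used in Example 7.4.

```python
import math


def uniform_allocation(n, k):
    """Expression (7.2): the same number of elements from each of k strata."""
    return [math.ceil(n / k)] * k


def proportional_allocation(strata_sizes, n):
    """Expression (7.4): n_i = (N_i / N) * n, rounded up to an integer."""
    N = sum(strata_sizes)
    # -(-a // b) is the integer ceiling of a / b; it avoids float rounding.
    return [-(-Ni * n // N) for Ni in strata_sizes]


sizes = [330, 350, 400, 520, 650, 1030, 980, 740]  # N_1..N_8 of Example 7.4

uniform = uniform_allocation(80, len(sizes))
proportional = proportional_allocation(sizes, 80)

print(uniform)
print(proportional, sum(proportional))
```

Rounding up guarantees at least the planned precision in every stratum, at the cost of a total slightly above the target n.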

7.2.4 Cluster Sampling

In cluster sampling, the total population must be subdivided into groups of elementary units, called clusters. The sampling is done from the groups and not from the individuals in the population. Hence, we must randomly draw a sufficient number of clusters and the objects from these will form the sample. This type of sampling is called one-stage cluster sampling.


According to Bolfarine and Bussab (2005), one of the inconveniences of cluster sampling is the fact that elements in the same cluster tend to have similar characteristics. The authors show that the more similar the elements in the cluster are, the less efficient the procedure is. Each cluster must be a good representative of the population, that is, it must be heterogeneous, containing all kinds of participants. In this respect, it is the opposite of stratified sampling, in which each stratum should be homogeneous. According to Martins and Domingues (2011), cluster sampling is a simple random sampling in which the sample units are the clusters; however, it is less expensive.
When we draw elements within the clusters selected, we have a two-stage cluster sampling: in the first stage, we draw the clusters and, in the second, we draw the elements. The number of elements to be drawn depends on the variability in the cluster. The higher the variability, the more elements must be drawn. On the other hand, when the units in the cluster are very similar, it is neither advisable nor necessary to draw all the elements, because they will bring the same kind of information (Bolfarine and Bussab, 2005). Cluster sampling can be generalized to several stages.
The main advantages that justify the wide use of cluster sampling are: a) many populations are already grouped into natural or geographic subgroups, facilitating its application; b) it allows a substantial reduction in the costs to obtain the sample, without compromising its accuracy. In short, it is fast, cheap, and efficient. Another disadvantage is that clusters are rarely the same size, making it difficult to control the range of the sample. To overcome this problem, certain statistical techniques must be used. As examples of clusters, we can mention the production in a factory divided into assembly lines, company employees divided by area, students in a municipality divided by schools, or the population in a municipality divided into districts.
Consider the following notation for cluster sampling:
N: population size;
M: number of clusters into which the population was divided;
$N_i$: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
$n_i$: size of cluster i of the sample (i = 1, 2, ..., m), where $n_i = N_i$;
$b_i$: number of elements drawn from cluster i of the sample (i = 1, 2, ..., m), where $b_i < n_i$.

In short, one-stage cluster sampling adopts the following procedure:
(a) The population is divided into M clusters ($C_1, \ldots, C_M$) with sizes that are not necessarily the same;
(b) According to a sample plan, usually SRS, we draw m clusters (m < M);
(c) All the elements of each cluster drawn constitute the global sample ($n_i = N_i$ and $\sum_{i=1}^{m} n_i = n$).

The calculation of the number of clusters (m) will be studied in Section 7.4.4.

On the other hand, two-stage cluster sampling works as follows:
(a) The population is divided into M clusters ($C_1, \ldots, C_M$) with sizes that are not necessarily the same;
(b) We must draw m clusters in the first stage, according to some kind of sample plan, usually SRS;
(c) From each cluster i drawn, of size $n_i$, we draw $b_i$ elements in the second stage, according to the same or to another sample plan ($b_i < n_i$ and $n = \sum_{i=1}^{m} b_i$).

Example 7.5: One-Stage Cluster Sampling
Consider a population with N = 20 elements, U = {1, 2, …, 20}. The population is divided into 7 clusters: $C_1$ = {1, 2}, $C_2$ = {3, 4, 5}, $C_3$ = {6, 7, 8}, $C_4$ = {9, 10, 11}, $C_5$ = {12, 13, 14}, $C_6$ = {15, 16}, $C_7$ = {17, 18, 19, 20}. The sample plan adopted says that we should draw three clusters (m = 3) by simple random sampling without replacement. Assuming that clusters $C_1$, $C_3$, and $C_4$ were drawn, determine the sample size, as well as the elements that will constitute the one-stage cluster sample.
Solution
In one-stage cluster sampling, all the elements of each cluster drawn constitute the sample, so M = {$C_1$, $C_3$, $C_4$} = {(1, 2), (6, 7, 8), (9, 10, 11)}. Therefore, $n_1 = 2$, $n_2 = 3$, and $n_3 = 3$, and $n = \sum_{i=1}^{3} n_i = 8$.


Example 7.6: Two-Stage Cluster Sampling
Example 7.5 will be extended to the case of two-stage cluster sampling. Thus, from the clusters drawn in the first stage, the sample plan adopted tells us to draw a single element with equal probability from each cluster ($b_i = 1$, i = 1, 2, 3, and $n = \sum_{i=1}^{m} b_i = 3$), which results in the following:
Stage 1: M = {$C_1$, $C_3$, $C_4$} = {(1, 2), (6, 7, 8), (9, 10, 11)}
Stage 2: M = {1, 8, 10}
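Examples 7.5 and 7.6 can be mirrored in a few lines of Python. This is a sketch under stated assumptions: the helper names `one_stage` and `two_stage`, the dictionary layout, and the seed are all illustrative, and the second-stage draw is random, so it will not necessarily return {1, 8, 10}.

```python
import random

# Clusters of Example 7.5 (population U = {1, ..., 20}).
clusters = {
    "C1": [1, 2], "C2": [3, 4, 5], "C3": [6, 7, 8], "C4": [9, 10, 11],
    "C5": [12, 13, 14], "C6": [15, 16], "C7": [17, 18, 19, 20],
}


def one_stage(clusters, drawn):
    """One-stage plan: every element of each drawn cluster enters the sample."""
    return [x for name in drawn for x in clusters[name]]


def two_stage(clusters, drawn, b, rng):
    """Two-stage plan: SRS without replacement of b elements per drawn cluster."""
    return [x for name in drawn for x in rng.sample(clusters[name], b)]


rng = random.Random(0)  # arbitrary seed, used only for reproducibility

# Stage 1 drawn at random (SRS without replacement of m = 3 clusters).
random_draw = sorted(rng.sample(sorted(clusters), 3))

# Stage 1 as assumed in Example 7.5, to reproduce the sample of the text.
example_draw = ["C1", "C3", "C4"]
sample_one = one_stage(clusters, example_draw)
sample_two = two_stage(clusters, example_draw, 1, rng)

print(random_draw)
print(sample_one, len(sample_one))
print(sample_two, len(sample_two))
```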

7.3 NONPROBABILITY OR NONRANDOM SAMPLING

In nonprobability sampling methods, samples are obtained in a nonrandom way, that is, the probability of some or all elements of the population belonging to the sample is unknown. Thus, it is not possible to estimate the sample error, nor to generalize the results of the sample to the population, since there is no guarantee that the former is representative of the latter. For Costa Neto (2002), this type of sampling is often used because of its simplicity or because it is impossible to obtain a probability sample, which would be preferable. Therefore, we must be careful when deciding to use this type of sampling, since it is subjective, based on the researcher's criteria and judgment, and sample variability cannot be established with accuracy. In this section, we will study the main nonprobability or nonrandom sampling techniques: (a) convenience sampling, (b) judgmental or purposive sampling, (c) quota sampling, (d) geometric propagation or snowball sampling.

7.3.1 Convenience Sampling

Convenience sampling is used when participation is voluntary or the sample elements are chosen due to convenience or simplicity, such as friends, neighbors, or students. The advantage this method offers is that it allows the researcher to obtain information in a quick and cheap way. However, the sampling process does not guarantee that the sample is representative of the population. It should only be employed in extreme situations and in special cases that justify its use.

Example 7.7: Convenience Sampling
A researcher wishes to study customer behavior in relation to a certain brand and, in order to do that, he develops a sampling plan. The collection of data is done through interviews with friends, neighbors, and workmates. This represents convenience sampling, since the respondents were chosen because they were easy to reach; hence, there is no guarantee that the sample is representative of the population. It is important to highlight that, if the population is very heterogeneous, the results of the sample cannot be generalized to the population.

7.3.2 Judgmental or Purposive Sampling

In judgmental or purposive sampling, the sample is chosen according to an expert's opinion or previous judgment. It is a risky method due to possible mistakes made by the researcher in his prejudgment. Using this type of sampling requires knowledge of the population and of the elements selected.

Example 7.8: Judgmental or Purposive Sampling
A survey is trying to identify the reasons why a group of employees of a certain company went on strike. In order to do that, the researcher interviews the main leaders of the trade union and of political movements, as well as employees who are not involved in such movements. Because the sample is small and was not selected at random, it is not representative of the population, and the results cannot be generalized to it.

7.3.3 Quota Sampling

Quota sampling presents greater rigor when compared to other nonrandom sampling methods. For Martins and Domingues (2011), it is one of the most used sampling methods in market surveys and election polls. Quota sampling is a variation of judgmental sampling. Initially, we set the quotas based on a certain criterion. Within the quotas, the selection of the sample items depends on the interviewer's judgment. Quota sampling can also be considered a nonprobability version of stratified sampling. Quota sampling consists of three steps:
(a) We select the control variables or the population's characteristics considered relevant for the study in question;
(b) We determine the percentage of the population (%) for each one of the relevant variable categories;
(c) We establish the size of the quotas (number of people to be interviewed that have the characteristics needed) for each interviewer, so that the sample can have the same proportions as the population.
The main advantages of quota sampling are the low costs, speed, and the convenience or ease with which the interviewer can select elements. However, since the selection of elements is not random, there are no guarantees that the sample will be representative of the population. Hence, it is not possible to generalize the results of the survey to the population.

Example 7.9: Quota Sampling
We would like to carry out municipal election polls regarding a certain municipality with 14,244 voters. The main objective of the survey is to identify how people intend to vote based on their gender and age group. Table 7.E.3 shows the absolute frequencies for each pair of variable categories analyzed. Apply quota sampling, considering that the sample size is 200 voters and that there are two interviewers.

TABLE 7.E.3 Absolute Frequencies for Each Pair of Categories
Age Group         Male    Female     Total
16 and 17           50        48        98
from 18 to 24     1097      1063      2160
from 25 to 44     3409      3411      6820
from 45 to 69     2269      2207      4476
> 69               359       331       690
Total             7184      7060    14,244

Solution (a) The variables that are relevant for the study are gender and age; (b) The percentage of the population (%) for each pair of categories of analyzed variables is shown in Table 7.E.4.

TABLE 7.E.4 Percentage of the Population for Each Pair of Categories
Age Group          Male     Female      Total
16 and 17         0.35%      0.34%      0.69%
from 18 to 24     7.70%      7.46%     15.16%
from 25 to 44    23.93%     23.95%     47.88%
from 45 to 69    15.93%     15.49%     31.42%
> 69              2.52%      2.32%      4.84%
% of the Total   50.44%     49.56%    100.00%

(c) If we multiply each cell in Table 7.E.4 by the sample size (200), we get the dimensions of the quotas that compose the global sample, as shown in Table 7.E.5.


TABLE 7.E.5 Dimensions of the Quotas
Age Group        Male    Female    Total
16 and 17           1         1        2
from 18 to 24      16        15       31
from 25 to 44      48        48       96
from 45 to 69      32        31       63
> 69                5         5       10
Total             102       100      202

Considering that there are two interviewers, the quota for each one will be:

TABLE 7.E.6 Dimensions of the Quotas per Interviewer
Age Group        Male    Female    Total
16 and 17           1         1        2
from 18 to 24       8         8       16
from 25 to 44      24        24       48
from 45 to 69      16        16       32
> 69                3         3        6
Total              52        52      104

Note: The data in Tables 7.E.5 and 7.E.6 were rounded up, resulting in a total number of 202 voters in Table 7.E.5 and 104 voters in Table 7.E.6.
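Steps (b) and (c) of Example 7.9 can be sketched in Python as follows. The dictionary layout and variable names are assumptions of this illustration. The sketch reproduces the percentages of Table 7.E.4 and leaves the quotas unrounded, since the final integer quotas depend on the rounding convention adopted.

```python
# Absolute frequencies (male, female) by age group, from Table 7.E.3.
counts = {
    "16 and 17": (50, 48),
    "18 to 24": (1097, 1063),
    "25 to 44": (3409, 3411),
    "45 to 69": (2269, 2207),
    "> 69": (359, 331),
}

n = 200  # sample size
total = sum(m + f for m, f in counts.values())

# Step (b): percentage of the population in each gender/age cell.
percent = {g: (round(100 * m / total, 2), round(100 * f / total, 2))
           for g, (m, f) in counts.items()}

# Step (c): unrounded quota per cell; integer quotas follow after rounding.
raw_quota = {g: (n * m / total, n * f / total)
             for g, (m, f) in counts.items()}

for g in counts:
    print(g, percent[g], raw_quota[g])
```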

7.3.4 Geometric Propagation or Snowball Sampling

Geometric propagation or snowball sampling is widely used when the elements of the population are rare, difficult to access, or unknown. In this method, we must identify one or more individuals from the target population, and these will identify other individuals who are in the same population. The process is repeated until the objective proposed is achieved, that is, the point of saturation. The point of saturation is reached when the last respondents do not add new relevant information to the research and merely repeat the content of previous interviews. As advantages, we can mention: a) it allows the researcher to find the desired characteristic in the population; b) it is easy to apply, because the recruiting is done through referrals from other people who are in the population; c) it has a low cost, because less planning and fewer people are needed; and d) it is efficient for reaching populations that are difficult to access.

Example 7.10: Snowball Sampling
A company is recruiting professionals with a specific profile. The group hired initially recommends other professionals with the same profile. The process is repeated until the number of employees needed is hired. Therefore, we have an example of snowball sampling.

7.4 SAMPLE SIZE

According to Cabral (2006), there are six decisive factors when calculating the sample size:
1) Characteristics of the population, such as variance ($\sigma^2$) and dimension (N);
2) Sampling distribution of the estimator used;
3) The accuracy and reliability required in the results, which makes it necessary to specify the estimation error (B), that is, the maximum difference that the researcher accepts between the population parameter and the estimate obtained from the sample;
4) The costs: the larger the sample size, the higher the costs;
5) Costs vs. sample error: must we select a larger sample to reduce the sample error, or must we reduce the sample size in order to minimize the resources and efforts necessary, thus ensuring better control of the interviewers, a higher response rate, and a more precise processing of the information?
6) The statistical techniques that will be used: some statistical techniques demand larger samples than others.
The sample selected must be representative of the population. Based on Ferrão et al. (2001), Bolfarine and Bussab (2005), and Martins and Domingues (2011), this section discusses how to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).
In the case of nonrandom samples, either we set the sample size based on the available budget or we adopt a certain dimension that has already been used successfully in previous studies with the same characteristics. A third alternative would be to calculate the size of a random sample and use that dimension as a reference.

7.4.1 Size of a Simple Random Sample

This section discusses how to calculate the size of a simple random sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between $\mu$ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |\mu - \bar{X}|$. On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between $p$ (proportion of the population) and $\hat{p}$ (proportion of the sample), that is, $B \geq |p - \hat{p}|$.

7.4.1.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the size of a simple random sample, where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{\sigma^2}{B^2 / z_\alpha^2} \tag{7.5}$$
where:
$\sigma^2$: population variance;
B: maximum estimation error;
$z_\alpha$: abscissa (coordinate) of the standard normal distribution, at the significance level $\alpha$.
According to Bolfarine and Bussab (2005), to determine the sample size it is necessary to set the maximum estimation error (B) and the significance level $\alpha$ (translated by the value of $z_\alpha$), and to have some previous knowledge of the population variance ($\sigma^2$). The first two are set by the researcher, while the third demands more work. When we do not know $\sigma^2$, its value must be replaced by a reasonable initial estimate. In many cases, a pilot sample can provide sufficient information about the population. In other cases, sample surveys done previously on the population can also provide satisfactory initial estimates for $\sigma^2$. Finally, some authors suggest the use of an approximate value for the standard deviation, given by $\sigma \cong \text{range}/4$.

7.4.1.2 Sample Size to Estimate the Mean of a Finite Population

If the variable chosen is quantitative and the population finite, the size of a simple random sample, where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{N \cdot \sigma^2}{(N-1) \cdot \frac{B^2}{z_\alpha^2} + \sigma^2} \tag{7.6}$$
where:
N: size of the population;
$\sigma^2$: population variance;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level $\alpha$.
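As a consistency check that is not part of the original text, expressions (7.5) and (7.6) can be compared directly. With $\sigma^2$ the population variance, B the maximum estimation error, and $z_\alpha$ the standard normal coordinate, dividing the numerator and the denominator of (7.6) by N suggests that the finite-population size approaches the infinite-population size as N grows:

```latex
n = \frac{N \sigma^2}{(N-1)\dfrac{B^2}{z_\alpha^2} + \sigma^2}
  = \frac{\sigma^2}{\dfrac{N-1}{N}\cdot\dfrac{B^2}{z_\alpha^2} + \dfrac{\sigma^2}{N}}
  \;\xrightarrow[N \to \infty]{}\;
  \frac{\sigma^2}{B^2 / z_\alpha^2}
```

For any finite N, the denominator of (7.6) divided by N is larger than $B^2/z_\alpha^2$ whenever $\sigma^2 > B^2/z_\alpha^2$, so in that case the finite-population formula returns a smaller sample size than (7.5).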


7.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the size of a simple random sample, where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{p \cdot q}{B^2 / z_\alpha^2} \tag{7.7}$$
where:
p: proportion of the population that contains the characteristic desired;
q = 1 − p;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level $\alpha$.
In practice, we do not know the value of p and we must, therefore, find its estimate ($\hat{p}$). If, however, this value is also unknown, we must assume that $\hat{p} = 0.50$, hence obtaining a conservative size, that is, larger than what is necessary to ensure the accuracy required.

7.4.1.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the size of a simple random sample, where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{N \cdot p \cdot q}{(N-1) \cdot \frac{B^2}{z_\alpha^2} + p \cdot q} \tag{7.8}$$
where:
N: size of the population;
p: proportion of the population that contains the characteristic desired;
q = 1 − p;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level $\alpha$.

Example 7.11: Calculating the Size of a Simple Random Sample
Consider the population of residents in a condominium (N = 540). We would like to estimate the average age of these residents. Based on previous surveys, we can obtain an estimate for $\sigma^2$ of 463.32. Assume that a simple random sample will be drawn from the population. Assuming that the difference between the sample mean and the real population mean is 4 years, at the most, with a confidence level of 95%, determine the sample size to be collected.
Solution
The value of $z_\alpha$ for $\alpha$ = 5% (a bilateral test) is 1.96. From expression (7.6), the sample size is:
$$n = \frac{N \cdot \sigma^2}{(N-1) \cdot \frac{B^2}{z_\alpha^2} + \sigma^2} = \frac{540 \times 463.32}{539 \times \frac{4^2}{1.96^2} + 463.32} = 92.38 \cong 93$$
Therefore, if we collect a simple random sample of at least 93 residents from the population, we can infer, with a confidence level of 95%, that the sample mean ($\bar{X}$) will differ 4 years, at the most, from the real population mean ($\mu$).
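The calculation of Example 7.11 can be reproduced with a small Python function covering expressions (7.5) and (7.6). The function name and its signature are hypothetical, and `statistics.NormalDist` is used here only to show where the value 1.96 comes from.

```python
import math
from statistics import NormalDist


def sample_size_mean(sigma2, B, z, N=None):
    """Unrounded SRS size for a mean: (7.5) if N is None, otherwise (7.6)."""
    base = B ** 2 / z ** 2
    if N is None:
        return sigma2 / base                       # infinite population
    return N * sigma2 / ((N - 1) * base + sigma2)  # finite population


# Two-sided coordinate of the standard normal for alpha = 5% (about 1.96).
z_exact = NormalDist().inv_cdf(1 - 0.05 / 2)

# Example 7.11, using the rounded value z = 1.96 as in the text.
n_raw = sample_size_mean(sigma2=463.32, B=4, z=1.96, N=540)
n = math.ceil(n_raw)  # round up so the required precision is guaranteed

print(round(z_exact, 4), round(n_raw, 2), n)
```

Leaving `N` unset gives the infinite-population size of expression (7.5) for the same variance and error.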


Example 7.12: Calculating the Size of a Simple Random Sample

We would like to estimate the proportion of voters who are dissatisfied with a certain politician's administration. Neither the real proportion nor an estimate of it is known. Assuming that a simple random sample will be drawn from an infinite population, with a sampling error of 2% and a significance level of 5%, determine the sample size.

Solution

Since we know neither the real value of p nor its estimate, let's assume that $\hat{p} = 0.50$. Applying Expression (7.7) to estimate the proportion of an infinite population, we have:

$$n = \frac{p \cdot q}{B^2 / z_\alpha^2} = \frac{0.5 \times 0.5}{0.02^2 / 1.96^2} = 2{,}401$$

Therefore, by randomly interviewing 2401 voters, we can infer the real proportion of voters who are dissatisfied, with a maximum estimation error of 2%, and a confidence level of 95%.
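A minimal Python sketch of the two proportion formulas (infinite and finite population) follows. The helper names `srs_size_proportion` and `round_up`, and the population size of 10,000 used in the second call, are hypothetical and serve only as an illustration.

```python
import math


def srs_size_proportion(p, B, z, N=None):
    """Simple random sample size for a proportion.

    N=None  -> infinite population: p*q / (B^2 / z^2)
    N given -> finite population:   N*p*q / ((N - 1) * B^2 / z^2 + p*q)
    """
    q = 1 - p
    if N is None:
        return p * q / (B**2 / z**2)
    return N * p * q / ((N - 1) * B**2 / z**2 + p * q)


def round_up(n, ndigits=6):
    """Round up to an integer, ignoring floating-point noise beyond ndigits."""
    return math.ceil(round(n, ndigits))


# Example 7.12: p unknown, so p = 0.5 is assumed; B = 2%, z = 1.96
n_infinite = srs_size_proportion(p=0.5, B=0.02, z=1.96)

# Same settings with a hypothetical finite population of 10,000 voters
n_finite = srs_size_proportion(p=0.5, B=0.02, z=1.96, N=10_000)

print(round_up(n_infinite), round_up(n_finite))
```

Choosing p = 0.5 maximizes p·q, so it gives the most conservative (largest) sample size when no estimate is available.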

7.4.2

Size of the Systematic Sample

In systematic sampling, we use the same expressions as in simple random sampling (as studied in Section 7.4.1), according to the type of variable (quantitative or qualitative) and population (infinite or finite).

7.4.3

Size of the Stratified Sample

This section discusses how to calculate the size of a stratified sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and $\bar{X}$ (sample mean), that is, $B \ge |\mu - \bar{X}|$. On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between p (proportion of the population) and $\hat{p}$ (proportion of the sample), that is, $B \ge |p - \hat{p}|$.

Let's use the following notation to calculate the size of the stratified sample:
k: number of strata;
$N_i$: size of stratum i, i = 1, 2, ..., k;
$N = N_1 + N_2 + \dots + N_k$ (population size);
$W_i = N_i/N$ (weight or proportion of stratum i, with $\sum_{i=1}^{k} W_i = 1$);
$\mu_i$: population mean of stratum i;
$\sigma_i^2$: population variance of stratum i;
$n_i$: number of elements randomly selected from stratum i;
$n = n_1 + n_2 + \dots + n_k$ (sample size);
$\bar{X}_i$: sample mean of stratum i;
$S_i^2$: sample variance of stratum i;
$p_i$: proportion of elements that have the characteristic desired in stratum i;
$q_i = 1 - p_i$.

7.4.3.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the size of the stratified sample, where $P(|\bar{X} - \mu| \le B) = 1 - \alpha$, can be calculated as:

$$n = \frac{\sum_{i=1}^{k} W_i \cdot \sigma_i^2}{B^2 / z_\alpha^2} \tag{7.9}$$

where:
$W_i = N_i/N$ (weight or proportion of stratum i, where $\sum_{i=1}^{k} W_i = 1$);
$\sigma_i^2$: population variance of stratum i;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

7.4.3.2 Sample Size to Estimate the Mean of a Finite Population

If the variable chosen is quantitative and the population finite, the size of the stratified sample, where $P(|\bar{X} - \mu| \le B) = 1 - \alpha$, can be calculated as:

$$n = \frac{\sum_{i=1}^{k} N_i^2 \cdot \sigma_i^2 / W_i}{N^2 \cdot \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot \sigma_i^2} \tag{7.10}$$

where:
$N_i$: size of stratum i, i = 1, 2, ..., k;
$\sigma_i^2$: population variance of stratum i;
$W_i = N_i/N$ (weight or proportion of stratum i, where $\sum_{i=1}^{k} W_i = 1$);
N: size of the population;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

7.4.3.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the size of the stratified sample, where $P(|\hat{p} - p| \le B) = 1 - \alpha$, can be calculated as:

$$n = \frac{\sum_{i=1}^{k} W_i \cdot p_i \cdot q_i}{B^2 / z_\alpha^2} \tag{7.11}$$

where:
$W_i = N_i/N$ (weight or proportion of stratum i, where $\sum_{i=1}^{k} W_i = 1$);
$p_i$: proportion of elements that have the characteristic desired in stratum i;
$q_i = 1 - p_i$;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

7.4.3.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the size of the stratified sample, where $P(|\hat{p} - p| \le B) = 1 - \alpha$, can be calculated as:

$$n = \frac{\sum_{i=1}^{k} N_i^2 \cdot p_i \cdot q_i / W_i}{N^2 \cdot \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot p_i \cdot q_i} \tag{7.12}$$

where:
$N_i$: size of stratum i, i = 1, 2, ..., k;
$p_i$: proportion of elements that have the characteristic desired in stratum i;
$q_i = 1 - p_i$;
$W_i = N_i/N$ (weight or proportion of stratum i, where $\sum_{i=1}^{k} W_i = 1$);
N: size of the population;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

Example 7.13 Calculating the Size of a Stratified Sample A university has 11,886 students enrolled in 14 undergraduate courses, divided into three major areas: Exact Sciences, Human Sciences, and Biological Sciences. Table 7.E.7 shows the number of students enrolled per area. A survey will be carried out in order to estimate the average time students spend studying per week (in hours). Based on pilot samples, we obtain the following estimates for the variances in the areas of Exact, Human, and Biological Sciences: 124.36, 153.22, and 99.87, respectively. The samples selected must be proportional to the number of students per area. Determine the sample size, considering an estimation error of 0.8, and a confidence level of 95%.

TABLE 7.E.7 Number of students enrolled per area

| Area | Number of students enrolled |
|---|---|
| Exact Sciences | 5285 |
| Human Sciences | 3877 |
| Biological Sciences | 2724 |
| Total | 11,886 |

Solution

From the data, we have: k = 3, $N_1$ = 5,285, $N_2$ = 3,877, $N_3$ = 2,724, N = 11,886, B = 0.8, and:

$$W_1 = \frac{5{,}285}{11{,}886} = 0.44, \quad W_2 = \frac{3{,}877}{11{,}886} = 0.33, \quad W_3 = \frac{2{,}724}{11{,}886} = 0.23$$

For α = 5%, we have $z_\alpha$ = 1.96. Based on the pilot sample, we must use the estimates for $\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$. The sample size is calculated from Expression (7.10):

$$n = \frac{\sum_{i=1}^{k} N_i^2 \cdot \sigma_i^2 / W_i}{N^2 \cdot \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot \sigma_i^2}$$

$$n = \frac{\dfrac{5{,}285^2 \times 124.36}{0.44} + \dfrac{3{,}877^2 \times 153.22}{0.33} + \dfrac{2{,}724^2 \times 99.87}{0.23}}{11{,}886^2 \times \dfrac{0.8^2}{1.96^2} + (5{,}285 \times 124.36 + 3{,}877 \times 153.22 + 2{,}724 \times 99.87)} = 722.52 \cong 723$$

The weights are displayed rounded to two decimal places; the numerical results here and in the allocation that follows use the unrounded weights $W_i = N_i/N$.

Since the sampling is proportional, we can obtain the size of each stratum by using the expression $n_i = W_i \times n$ (i = 1, 2, 3):

$n_1 = W_1 \times n = 0.44 \times 723 = 321.48 \cong 322$
$n_2 = W_2 \times n = 0.33 \times 723 = 235.83 \cong 236$
$n_3 = W_3 \times n = 0.23 \times 723 = 165.70 \cong 166$

Thus, to carry out the survey, we must select 322 students from the area of Exact Sciences, 236 from the area of Human Sciences, and 166 from Biological Sciences. From the sample selected, we can infer, with a 95% confidence level, that the difference between the sample mean and the real population mean will be a maximum of 0.8 hours.


Example 7.14 Calculating the Size of a Stratified Sample

Consider the same population from the previous example; however, the objective now is to estimate the proportion of students who work, for each area. Based on a pilot sample, we have the following estimates per area: $\hat{p}_1 = 0.3$ (Exact Sciences), $\hat{p}_2 = 0.6$ (Human Sciences), and $\hat{p}_3 = 0.4$ (Biological Sciences). The type of sampling used in this case is uniform. Determine the sample size, considering an estimation error of 3%, and a 90% confidence level.

Solution

Since we do not know the real value of p for each area, we can use its estimate. For a 90% confidence level, we have $z_\alpha$ = 1.645. Applying Expression (7.12) from the stratified sampling to estimate the proportion of a finite population, we have:

$$n = \frac{\sum_{i=1}^{k} N_i^2 \cdot p_i \cdot q_i / W_i}{N^2 \cdot \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot p_i \cdot q_i}$$

$$n = \frac{5{,}285^2 \times 0.3 \times 0.7 / 0.44 + 3{,}877^2 \times 0.6 \times 0.4 / 0.33 + 2{,}724^2 \times 0.4 \times 0.6 / 0.23}{11{,}886^2 \times \dfrac{0.03^2}{1.645^2} + 5{,}285 \times 0.3 \times 0.7 + 3{,}877 \times 0.6 \times 0.4 + 2{,}724 \times 0.4 \times 0.6}$$

$$n = 644.54 \cong 645$$

As in the previous example, the weights are displayed rounded to two decimal places, while the result uses the unrounded weights $W_i = N_i/N$.

Since the sampling is uniform, we have $n_1 = n_2 = n_3 = 215$. Therefore, to carry out the survey, we must randomly select 215 students from each area. From the sample selected, we can infer, with a 90% confidence level, that the difference between the sample proportion and the real population proportion will be a maximum of 3%.
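The same kind of sketch works for Example 7.14. Again, `stratified_size_proportion_finite` is a name chosen here for illustration, and the uniform allocation simply splits the total equally among the strata.

```python
import math


def stratified_size_proportion_finite(sizes, props, B, z):
    """Stratified sample size for a proportion, finite population, Expression (7.12)."""
    N = sum(sizes)
    pq = [p * (1 - p) for p in props]   # p_i * q_i for each stratum
    weights = [Ni / N for Ni in sizes]  # W_i = N_i / N, unrounded
    numerator = sum(Ni**2 * v / Wi for Ni, v, Wi in zip(sizes, pq, weights))
    denominator = N**2 * B**2 / z**2 + sum(Ni * v for Ni, v in zip(sizes, pq))
    return numerator / denominator


# Data from Example 7.14
sizes = [5285, 3877, 2724]
props = [0.3, 0.6, 0.4]

n = stratified_size_proportion_finite(sizes, props, B=0.03, z=1.645)
n_total = math.ceil(n)

# Uniform allocation: the same number of students from every stratum
n_per_stratum = math.ceil(n_total / len(sizes))

print(round(n, 2), n_total, n_per_stratum)
```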

7.4.4

Size of a Cluster Sample

This section discusses how to calculate the size of a one-stage and a two-stage cluster sample. Let's consider the following notation to calculate the size of a cluster sample:
N: population size;
M: number of clusters into which the population was divided;
$N_i$: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
$n_i$: size of cluster i from the sample drawn in the first stage (i = 1, 2, ..., m), where $n_i = N_i$;
$b_i$: size of cluster i from the sample drawn in the second stage (i = 1, 2, ..., m), where $b_i < n_i$;
$\bar{N} = N/M$ (average size of the population clusters);
$\bar{n} = n/m$ (average size of the sample clusters);
$X_{ij}$: j-th observation in cluster i;
$\sigma_{dc}^2$: population variance within the clusters;
$\sigma_{ec}^2$: population variance between clusters;
$\sigma_i^2$: population variance in cluster i;
$\mu_i$: population mean in cluster i;
$\sigma_c^2 = \sigma_{dc}^2 + \sigma_{ec}^2$ (total population variance).

According to Bolfarine and Bussab (2005), the calculation of $\sigma_{dc}^2$ and $\sigma_{ec}^2$ is given by:

$$\sigma_{dc}^2 = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} (X_{ij} - \mu_i)^2}{N} = \frac{1}{M} \sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot \sigma_i^2 \tag{7.13}$$

$$\sigma_{ec}^2 = \frac{1}{N} \sum_{i=1}^{M} N_i \cdot (\mu_i - \mu)^2 = \frac{1}{M} \sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot (\mu_i - \mu)^2 \tag{7.14}$$

Assuming that all the clusters are the same size, the previous expressions can be summarized as follows:

$$\sigma_{dc}^2 = \frac{1}{M} \sum_{i=1}^{M} \sigma_i^2 \tag{7.15}$$

$$\sigma_{ec}^2 = \frac{1}{M} \sum_{i=1}^{M} (\mu_i - \mu)^2 \tag{7.16}$$

7.4.4.1 Size of a One-Stage Cluster Sample

This section discusses how to calculate the size of a one-stage cluster sample to estimate the mean (a quantitative variable) of finite and infinite populations, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and $\bar{X}$ (sample mean), that is, $B \ge |\mu - \bar{X}|$.

7.4.4.1.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the number of clusters drawn in the first stage (m), where $P(|\bar{X} - \mu| \le B) = 1 - \alpha$, can be calculated as follows:

$$m = \frac{\sigma_c^2}{B^2 / z_\alpha^2} \tag{7.17}$$

where:
$\sigma_c^2 = \sigma_{dc}^2 + \sigma_{ec}^2$, according to Expressions (7.13)–(7.16);
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

If the clusters are the same size, Bolfarine and Bussab (2005) demonstrate that:

$$m = \frac{\sigma_e^2}{B^2 / z_\alpha^2} \tag{7.18}$$

According to the authors, generally, $\sigma_c^2$ is unknown and has to be estimated from pilot samples or obtained from previous sample surveys.

7.4.4.1.2 Sample Size to Estimate the Mean of a Finite Population

If the variable chosen is quantitative and the population finite, the number of clusters drawn in the first stage (m), where $P(|\bar{X} - \mu| \le B) = 1 - \alpha$, can be calculated as follows:

$$m = \frac{M \cdot \sigma_c^2}{M \cdot \dfrac{B^2 \cdot \bar{N}^2}{z_\alpha^2} + \sigma_c^2} \tag{7.19}$$

where:
M: number of clusters into which the population was divided;
$\sigma_c^2 = \sigma_{dc}^2 + \sigma_{ec}^2$, according to Expressions (7.13)–(7.16);
B: maximum estimation error;
$\bar{N} = N/M$ (average size of the population clusters);
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

7.4.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the number of clusters drawn in the first stage (m), where $P(|\hat{p} - p| \le B) = 1 - \alpha$, can be calculated as follows:

$$m = \frac{\dfrac{1}{M} \cdot \sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} \cdot p_i \cdot q_i}{B^2 / z_\alpha^2} \tag{7.20}$$

where:
M: number of clusters into which the population was divided;
$N_i$: size of cluster i (i = 1, 2, ..., M);
$\bar{N} = N/M$ (average size of the population clusters);
$p_i$: proportion of elements that have the characteristic desired in cluster i;
$q_i = 1 - p_i$;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

7.4.4.1.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the number of clusters drawn in the first stage (m), where $P(|\hat{p} - p| \le B) = 1 - \alpha$, can be calculated as follows:

$$m = \frac{\sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} \cdot p_i \cdot q_i}{M \cdot \dfrac{B^2 \cdot \bar{N}^2}{z_\alpha^2} + \dfrac{1}{M} \cdot \sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} \cdot p_i \cdot q_i} \tag{7.21}$$

where:
M: number of clusters into which the population was divided;
$N_i$: size of cluster i (i = 1, 2, ..., M);
$\bar{N} = N/M$ (average size of the population clusters);
$p_i$: proportion of elements that have the characteristic desired in cluster i;
$q_i = 1 - p_i$;
B: maximum estimation error;
$z_\alpha$: coordinate of the standard normal distribution, at the significance level α.

7.4.4.2 Size of a Two-Stage Cluster Sample

In this case, we assume that all the clusters are the same size. Based on Bolfarine and Bussab (2005), let's consider the following linear cost function:

$$C = c_1 \cdot n + c_2 \cdot b \tag{7.22}$$

where:
$c_1$: observation cost of one unit from the first stage;
$c_2$: observation cost of one unit from the second stage;
n: sample size in the first stage;
b: sample size in the second stage.

The optimal size for b that minimizes the linear cost function is given by:

$$b^{*} = \frac{\sigma_{dc}}{\sigma_{ec}} \cdot \sqrt{\frac{c_1}{c_2}} \tag{7.23}$$
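Expression (7.23) is straightforward to evaluate. The sketch below uses made-up standard deviations and unit costs purely for illustration; the function name `optimal_second_stage_size` and all numeric inputs are hypothetical and do not come from the book.

```python
import math


def optimal_second_stage_size(sigma_dc, sigma_ec, c1, c2):
    """Optimal second-stage size b* = (sigma_dc / sigma_ec) * sqrt(c1 / c2).

    sigma_dc and sigma_ec are standard deviations, that is, the square roots
    of the within-cluster and between-cluster variances.
    """
    return (sigma_dc / sigma_ec) * math.sqrt(c1 / c2)


# Hypothetical inputs: within-cluster s.d. 6.0, between-cluster s.d. 1.5,
# first-stage unit cost 16, second-stage unit cost 1
b_star = optimal_second_stage_size(sigma_dc=6.0, sigma_ec=1.5, c1=16.0, c2=1.0)

print(b_star)
```

The formula shows the trade-off directly: b* grows when the within-cluster variability is large relative to the between-cluster variability, or when first-stage units are expensive relative to second-stage units.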


Example 7.15: Calculating the Size of a Cluster Sample

Consider the members of a certain club in Sao Paulo (N = 4,500). We would like to estimate the average evaluation score (0 to 10) given by these members regarding the main features of the club. The population is divided into 10 groups of 450 elements each, based on their membership number. The estimate of the mean and of the population variance per group, based on previous surveys, can be seen in Table 7.E.8. Assuming that the cluster sampling is based on a single stage, determine the number of clusters that must be drawn, considering B = 2% and α = 1%.

TABLE 7.E.8 Mean and population variance per group

| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\mu_i$ | 7.4 | 6.6 | 8.1 | 7.0 | 6.7 | 7.3 | 8.1 | 7.5 | 6.2 | 6.9 |
| $\sigma_i^2$ | 22.5 | 36.7 | 29.6 | 33.1 | 40.8 | 51.7 | 39.7 | 30.6 | 40.5 | 42.7 |

Solution

From the data given to us, we have: N = 4,500, M = 10, $\bar{N}$ = 4,500/10 = 450, B = 0.02, and $z_\alpha$ = 2.575. Since all the clusters are the same size, the calculation of $\sigma_{dc}^2$ and $\sigma_{ec}^2$ is given by:

$$\sigma_{dc}^2 = \frac{1}{M} \sum_{i=1}^{M} \sigma_i^2 = \frac{22.5 + 36.7 + \dots + 42.7}{10} = 36.79$$

$$\sigma_{ec}^2 = \frac{1}{M} \sum_{i=1}^{M} (\mu_i - \mu)^2 = \frac{(7.4 - 7.18)^2 + \dots + (6.9 - 7.18)^2}{10} = 0.35$$

Therefore, $\sigma_c^2 = \sigma_{dc}^2 + \sigma_{ec}^2 = 36.79 + 0.35 = 37.14$.

The number of clusters to be drawn in one stage, for a finite population, is given by Expression (7.19):

$$m = \frac{M \cdot \sigma_c^2}{M \cdot \dfrac{B^2 \cdot \bar{N}^2}{z_\alpha^2} + \sigma_c^2} = \frac{10 \times 37.14}{10 \times \dfrac{0.02^2 \times 450^2}{2.575^2} + 37.14} = 2.33 \cong 3$$

Therefore, the population of N = 4,500 members is divided into M = 10 clusters with the same size ($N_i$ = 450, i = 1, ..., 10). From the total number of clusters, we must randomly draw m = 3 clusters. In one-stage cluster sampling, all the elements of each cluster drawn constitute the global sample (n = 450 × 3 = 1,350).
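The steps of Example 7.15 (within-cluster variance, between-cluster variance, and number of clusters) can be checked with a short Python sketch. The function name `one_stage_clusters_finite` is an assumption for this illustration, and the code relies on the equal-size simplification of Expressions (7.15) and (7.16).

```python
import math


def one_stage_clusters_finite(M, N_bar, sigma2_c, B, z):
    """Number of clusters for a finite population, Expression (7.19)."""
    return M * sigma2_c / (M * B**2 * N_bar**2 / z**2 + sigma2_c)


# Table 7.E.8: mean and variance per cluster (all clusters have 450 members)
means = [7.4, 6.6, 8.1, 7.0, 6.7, 7.3, 8.1, 7.5, 6.2, 6.9]
variances = [22.5, 36.7, 29.6, 33.1, 40.8, 51.7, 39.7, 30.6, 40.5, 42.7]

M = len(means)
grand_mean = sum(means) / M

sigma2_dc = sum(variances) / M                             # (7.15), within clusters
sigma2_ec = sum((mu - grand_mean) ** 2 for mu in means) / M  # (7.16), between clusters
sigma2_c = sigma2_dc + sigma2_ec

m = one_stage_clusters_finite(M=M, N_bar=450, sigma2_c=sigma2_c, B=0.02, z=2.575)
m_required = math.ceil(m)

print(round(sigma2_dc, 2), round(sigma2_ec, 2), round(m, 2), m_required)
```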

From the sample selected, we can infer, with a 99% confidence level, that the difference between the sample mean and the real population mean will be 2%, at the most. Table 7.1 shows a summary of the expressions used to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).

7.5

FINAL REMARKS

It is rarely possible to obtain the exact distribution of a variable by examining all the elements of the population, due to the high costs, the time needed, and the difficulties in collecting the data. Therefore, the alternative is to select part of the elements of the population (a sample) and, after that, infer the properties of the whole (the population). Since the sample must be representative of the population, choosing the sampling technique is essential in this process.

Sampling techniques can be classified in two major groups: probability or random sampling and nonprobability or nonrandom sampling. Among the main random sampling techniques, we can highlight simple random sampling (with and without replacement), systematic, stratified, and cluster sampling. The main nonrandom sampling techniques are convenience, judgmental or purposive, quota, and snowball sampling. Each one of these techniques has advantages and disadvantages, and choosing the best technique must take the characteristics of each study into consideration.

This chapter also discussed how to calculate the sample size for the mean and the proportion of finite and infinite populations, for each type of random sampling. In the case of nonrandom samples, the researcher must either establish the sample size based on the available budget or adopt a size that has already been used successfully in previous studies with similar characteristics. Another alternative would be to calculate the size of a random sample and use it as a reference.


TABLE 7.1 Expressions to Calculate the Size of Random Samples

| Type of Random Sample | Estimating the Mean (Infinite Population) | Estimating the Mean (Finite Population) | Estimating the Proportion (Infinite Population) | Estimating the Proportion (Finite Population) |
|---|---|---|---|---|
| Simple | $n = \dfrac{\sigma^2}{B^2/z_\alpha^2}$ | $n = \dfrac{N \cdot \sigma^2}{(N-1) \cdot \dfrac{B^2}{z_\alpha^2} + \sigma^2}$ | $n = \dfrac{p \cdot q}{B^2/z_\alpha^2}$ | $n = \dfrac{N \cdot p \cdot q}{(N-1) \cdot \dfrac{B^2}{z_\alpha^2} + p \cdot q}$ |
| Systematic | $n = \dfrac{\sigma^2}{B^2/z_\alpha^2}$ | $n = \dfrac{N \cdot \sigma^2}{(N-1) \cdot \dfrac{B^2}{z_\alpha^2} + \sigma^2}$ | $n = \dfrac{p \cdot q}{B^2/z_\alpha^2}$ | $n = \dfrac{N \cdot p \cdot q}{(N-1) \cdot \dfrac{B^2}{z_\alpha^2} + p \cdot q}$ |
| Stratified | $n = \dfrac{\sum_{i=1}^{k} W_i \cdot \sigma_i^2}{B^2/z_\alpha^2}$ | $n = \dfrac{\sum_{i=1}^{k} N_i^2 \cdot \sigma_i^2 / W_i}{N^2 \cdot \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot \sigma_i^2}$ | $n = \dfrac{\sum_{i=1}^{k} W_i \cdot p_i \cdot q_i}{B^2/z_\alpha^2}$ | $n = \dfrac{\sum_{i=1}^{k} N_i^2 \cdot p_i \cdot q_i / W_i}{N^2 \cdot \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot p_i \cdot q_i}$ |
| One-stage Cluster | $m = \dfrac{\sigma_c^2}{B^2/z_\alpha^2}$ | $m = \dfrac{M \cdot \sigma_c^2}{M \cdot \dfrac{B^2 \cdot \bar{N}^2}{z_\alpha^2} + \sigma_c^2}$ | $m = \dfrac{\dfrac{1}{M} \cdot \sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} \cdot p_i \cdot q_i}{B^2/z_\alpha^2}$ | $m = \dfrac{\sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} \cdot p_i \cdot q_i}{M \cdot \dfrac{B^2 \cdot \bar{N}^2}{z_\alpha^2} + \dfrac{1}{M} \cdot \sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} \cdot p_i \cdot q_i}$ |

7.6

EXERCISES

1) Why is sampling important?
2) What are the differences between random and nonrandom sampling techniques? In what cases must they be used?
3) What is the difference between stratified and cluster sampling?
4) What are the advantages and limitations of each sampling technique?
5) What type of sampling is used in the EuroMillions Lottery?
6) To verify if a part meets certain quality specification demands, from every batch with 150 parts produced, we randomly pick a unit and inspect all the quality characteristics. What type of sampling should be used in this case?
7) Assume that the population of the city of Youngstown (OH) is divided by educational level. Thus, for each level, a percentage of the population will be interviewed. What type of sampling should be used in this case?
8) In a production line, one batch with 1500 parts is produced every hour. From each batch, we randomly pick a sample with 125 units. In each sample unit, we inspect all the quality characteristics to check whether the part is defective or not. What type of sampling should be used in this case?
9) The population of the city of Sao Paulo is divided into 96 districts. From this total, 24 districts will be randomly drawn and, for each one of them, a small sample of the population will be interviewed in a public opinion survey. What type of sampling should be used in this case?
10) We would like to estimate the illiteracy rate in a municipality with 4000 inhabitants who are 15 or over 15 years of age. Based on previous surveys, we can estimate that $\hat{p} = 0.24$. A random sample will be drawn from the population. Assuming a maximum estimation error of 5%, and a 95% confidence level, what should the sample size be?
11) The population of a certain municipality with 120,000 inhabitants is divided into five regions (North, South, Center, East, and West). The table shows the number of inhabitants per region. A random sample will be collected in each region in order to estimate the average age of its inhabitants. The samples selected must be proportional to the number of inhabitants per region. Based on pilot samples, we obtain the following estimates for the variances in the five regions: 44.5 (North), 59.3 (South), 82.4 (Center), 66.2 (East), and 69.5 (West). Determine the sample size, considering an estimation error of 0.6 and a 99% confidence level.

| Region | Inhabitants |
|---|---|
| North | 14,060 |
| South | 19,477 |
| Center | 36,564 |
| East | 26,424 |
| West | 23,475 |

12) Consider a municipality with 120,000 inhabitants. We would like to estimate the percentage of the population that lives in urban and rural areas. The sampling plan used divides the municipality into 85 districts of different sizes. From all the districts, we would like to select some and, for each district chosen, all the inhabitants will be selected. The file Districts.xls shows the size of each district, as well as the estimated percentage of the urban and rural population. Determine the total number of districts to be drawn assuming a maximum estimation error of 10% and a 90% confidence level.

Chapter 8

Estimation

A comprehensive study of nature is the most fruitful source of mathematical discoveries.
Joseph Fourier

8.1

INTRODUCTION

As previously described, the main objective of statistical inference is to draw conclusions about the population based on data obtained from the sample. The sample must be representative of the population. One of the most important goals of statistical inference is the estimation of population parameters, which is the subject of this chapter. For Bussab and Morettin (2011), a parameter can be defined as a function of a set of population values; a statistic as a function of a set of sample values; and an estimate as the value assumed by the estimator (the statistic used to estimate the parameter) in a certain sample.

Parameters can be estimated through a single point (point estimation) or through an interval of values (interval estimation). The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation. The main interval estimation methods, or confidence intervals (CI), are the CI for the population mean when the variance is known, the CI for the population mean when the variance is unknown, the CI for the population variance, and the CI for the proportion.

8.2

POINT AND INTERVAL ESTIMATION

Population parameters can therefore be estimated through a single point or through an interval of values. As examples of population parameter estimators (point and interval), we can mention the mean, the variance, and the proportion.

8.2.1

Point Estimation

Point estimation is used when we want to estimate a single value of the population parameter we are interested in. The population parameter estimate is calculated from a sample. Hence, the sample mean ($\bar{x}$) is a point estimate of the real population mean (μ). Analogously, the sample variance ($S^2$) is a point estimate of the population variance ($\sigma^2$), as the sample proportion ($\hat{p}$) is a point estimate of the population proportion (p).

Example 8.1: Point Estimation

Consider a luxury condominium with 702 lots. We would like to estimate the average size of the lots, their variance, as well as the proportion of lots for sale. In order to do that, a random sample with 60 lots is collected, revealing an average size of 1750 m² per lot, a variance of 420 m², and a proportion of 8% of the lots for sale. Thus:
(a) $\bar{x}$ = 1750 is a point estimate of the real population mean (μ);
(b) $S^2$ = 420 is a point estimate of the real population variance ($\sigma^2$); and
(c) $\hat{p}$ = 0.08 is a point estimate of the real population proportion (p).

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00008-2 © 2019 Elsevier Inc. All rights reserved.

8.2.2

Interval Estimation

Interval estimation is used when we are interested in finding an interval of possible values in which the estimated parameter is located, with a certain confidence level (1 − α), where α is the significance level.

Example 8.2: Interval Estimation

Consider the information in Example 8.1. However, instead of using a point estimate of the population parameter, let's use an interval estimate:
(a) The [1700–1800] interval contains the average size of the 702 condominium lots, with 99% confidence;
(b) With 95% confidence, the [400–440] interval contains the population variance of the size of the lots;
(c) The [6%–10%] interval contains the proportion of lots for sale in the condominium, with 90% confidence.

8.3

POINT ESTIMATION METHODS

The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation.

8.3.1

Method of Moments

In the method of moments, the population parameters are estimated from sample estimators such as the sample mean and the sample variance. Consider a random variable X with the probability density function (p.d.f.) f(x). Assume that $X_1, X_2, \dots, X_n$ is a random sample of size n drawn from population X. For k = 1, 2, ..., the k-th population moment of distribution f(x) is:

$$E(X^k) \tag{8.1}$$

For the same random sample $X_1, X_2, \dots, X_n$ and k = 1, 2, ..., the k-th sample moment of distribution f(x) is:

$$M_k = \frac{\sum_{i=1}^{n} X_i^k}{n} \tag{8.2}$$

The estimation procedure of the method of moments is described as follows. Assume that X is a random variable with a p.d.f. $f(x, \theta_1, \dots, \theta_m)$, in which $\theta_1, \dots, \theta_m$ are population parameters whose values are unknown. A random sample $X_1, X_2, \dots, X_n$ is drawn from population X.

The moment estimators $\hat{\theta}_1, \dots, \hat{\theta}_m$ are obtained by matching the first m sample moments to the corresponding m population moments and by solving the resulting equations for $\theta_1, \dots, \theta_m$.

Thus, the first population moment is:

$$E(X) = \mu \tag{8.3}$$

And the first sample moment is:

$$M_1 = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{8.4}$$

By matching the population and the sample moments, we have:

$$\hat{\mu} = \bar{X}$$

Therefore, the sample mean is the moment estimator of the population mean. Table 8.1 shows how to calculate E(X) and Var(X) for different probability distributions, as also studied in Chapter 6.


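TABLE 8.1 Calculating E(X) and Var(X) for Different Probability Distributions

| Distribution | E(X) | Var(X) | M |
|---|---|---|---|
| Normal [$X \sim N(\mu, \sigma^2)$] | $\mu$ | $\sigma^2$ | 2 |
| Binomial [$X \sim b(n, p)$] | $np$ | $np(1 - p)$ | 1 |
| Poisson [$X \sim \text{Poisson}(\lambda)$] | $\lambda$ | $\lambda$ | 1 |
| Uniform [$X \sim U(a, b)$] | $(a + b)/2$ | $(b - a)^2/12$ | 2 |
| Exponential [$X \sim \exp(\lambda)$] | $1/\lambda$ | $1/\lambda^2$ | 1 |
| Gamma [$X \sim \text{Gamma}(\alpha, \lambda)$] | $\alpha\lambda$ | $\alpha\lambda^2$ | 2 |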

Example 8.3: Method of Moments

Assume that a certain random variable X follows an exponential distribution with parameter λ. A random sample of 10 units is drawn from the population, whose data can be seen in Table 8.E.1. Calculate the moment estimate of λ.

TABLE 8.E.1 Data Obtained From the Sample

| 5.4 | 9.8 | 6.3 | 7.9 | 9.2 | 10.7 | 12.5 | 15.0 | 13.9 | 17.2 |
|---|---|---|---|---|---|---|---|---|---|

Solution

Matching the first population moment to the first sample moment, we have $E(X) = \bar{X}$. For an exponential distribution, since $E(X) = 1/\lambda$, we have $1/\lambda = \bar{X}$. Therefore, the moment estimator of λ is given by $\hat{\lambda} = 1/\bar{X}$. For the data in Example 8.3, since $\bar{X} = 10.79$, the moment estimate of λ is:

$$\hat{\lambda} = \frac{1}{\bar{X}} = \frac{1}{10.79} = 0.093$$
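The moment-matching step of Example 8.3 can be written as a minimal Python sketch. The helper `sample_moment` is a name introduced here for illustration; it computes the k-th sample moment of Expression (8.2).

```python
def sample_moment(xs, k):
    """k-th sample moment: M_k = (sum of x_i^k) / n, Expression (8.2)."""
    return sum(x**k for x in xs) / len(xs)


# Data from Table 8.E.1
sample = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]

m1 = sample_moment(sample, 1)  # first sample moment = sample mean

# For the exponential distribution E(X) = 1/lambda, so matching
# E(X) to the sample mean gives lambda_hat = 1 / mean.
lambda_hat = 1 / m1

print(round(m1, 2), round(lambda_hat, 3))
```

For a distribution with two unknown parameters, the same idea would use both `sample_moment(sample, 1)` and `sample_moment(sample, 2)` and solve the two resulting equations.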

8.3.2

Ordinary Least Squares

A model of a simple linear regression is given by the following expression:

$$Y_i = \alpha + \beta \cdot X_i + \mu_i, \quad i = 1, 2, \dots, n \tag{8.5}$$

where:
$Y_i$ is the i-th observed value of the dependent variable;
α is the linear coefficient of the straight line or constant;
β is the angular coefficient of the straight line (slope);
$X_i$ is the i-th observed value of the explanatory variable;
$\mu_i$ is the random error term of the linear relationship between Y and X.

Since parameters α and β of the regression model are unknown, we would like to estimate them by using the regression line:

$$\hat{Y}_i = a + b \cdot X_i \tag{8.6}$$

where:
$\hat{Y}_i$ is the i-th value estimated or predicted by the model;
a and b are the estimates of parameters α and β of the regression model;
$X_i$ is the i-th observed value of the explanatory variable.

However, the observed values $Y_i$ are not always equal to the values $\hat{Y}_i$ estimated by the regression model. The difference between the observed value and the estimated value for the i-th observation is the error term $\mu_i$:

$$\mu_i = Y_i - \hat{Y}_i \tag{8.7}$$

Thus, the ordinary least squares method is used to determine the straight line that best fits the points of a diagram, that is, the method consists in estimating α and β so that the sum of squares of the residuals is the smallest possible:

$$\min \sum_{i=1}^{n} \mu_i^2 = \sum_{i=1}^{n} (Y_i - a - b \cdot X_i)^2$$

The calculation of the estimators is given by:

$$b = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} Y_i X_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2} \tag{8.8}$$

$$a = \bar{Y} - b \cdot \bar{X} \tag{8.9}$$

In Chapter 13, we will study the estimation of a linear regression model by ordinary least squares in more detail.
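As a small illustration of the closed-form least squares estimators (slope from the ratio of the cross-product sum to the sum of squares of X, intercept from the two sample means), consider the Python sketch below. The function name `ols_simple` and the data are hypothetical; the points were chosen to lie exactly on the line Y = 1 + 2X so that the result is easy to verify by hand.

```python
def ols_simple(xs, ys):
    """Return (a, b) for the fitted line Y_hat = a + b*X.

    b = (sum(Y_i*X_i) - n*Xbar*Ybar) / (sum(X_i^2) - n*Xbar^2)
    a = Ybar - b*Xbar
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar**2
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = ols_simple(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b)
```

With real, noisy data the residuals would not be zero, but their sum of squares would still be the smallest achievable by any straight line.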

8.3.3

Maximum Likelihood Estimation

Maximum likelihood estimation is one of the procedures used to estimate the parameters of a model from the probability distribution of the variable that represents the phenomenon being studied. These parameters are chosen in order to maximize the likelihood function, which is the objective function of a certain linear programming problem (Fávero, 2015).

Consider a random variable X with a probability density function f(x, θ), in which vector $\theta = \theta_1, \theta_2, \dots, \theta_k$ is unknown. A random sample $X_1, X_2, \dots, X_n$ of size n is drawn from population X; consider $x_1, x_2, \dots, x_n$ the values effectively observed. The likelihood function L associated to X is a joint probability density function given by the product of the densities of each of the observations:

$$L(\theta; x_1, x_2, \dots, x_n) = f(x_1, \theta) \times f(x_2, \theta) \times \cdots \times f(x_n, \theta) = \prod_{i=1}^{n} f(x_i, \theta) \tag{8.10}$$

The maximum likelihood estimator is the vector $\hat{\theta}$ that maximizes the likelihood function.
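To make the idea concrete, the sketch below maximizes the likelihood of an exponential model, with density $f(x, \lambda) = \lambda e^{-\lambda x}$, over a coarse grid and compares the result with the closed-form maximizer $n / \sum x_i$. Applying maximum likelihood to the data of Table 8.E.1, the grid limits, and the function name are choices made for this illustration only; the book does not work this example.

```python
import math


def exp_log_likelihood(lam, xs):
    """Log of the product of exponential densities: n*log(lam) - lam*sum(xs)."""
    return len(xs) * math.log(lam) - lam * sum(xs)


# Data reused from Table 8.E.1 (an assumption of this illustration)
sample = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]

# Coarse grid search over lambda in (0, 1] with step 0.0001
grid = [i / 10_000 for i in range(1, 10_001)]
lambda_grid = max(grid, key=lambda lam: exp_log_likelihood(lam, sample))

# Closed-form maximizer, obtained by setting the derivative of the
# log-likelihood to zero: n/lam - sum(xs) = 0
lambda_closed = len(sample) / sum(sample)

print(lambda_grid, round(lambda_closed, 4))
```

Working with the logarithm of the likelihood turns the product into a sum, which is numerically safer and has the same maximizer. For the exponential distribution, the maximum likelihood estimate coincides with the method-of-moments estimate $1/\bar{X}$.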

8.4

INTERVAL ESTIMATION OR CONFIDENCE INTERVALS

In Section 8.3, the population parameters that interested us were estimated through a single value (point estimation). The main limitation of point estimation is that when a parameter is estimated through a single point, all the data information is summarized through this numeric value. As an alternative, we can use interval estimation. Thus, instead of estimating the population parameter through a single point, an interval of likely estimates is given to us. Therefore, we define an interval of values that will contain the true population parameter, with a certain confidence level (1 − α), where α is the significance level.

Consider $\hat{\theta}$ an estimator of population parameter θ. An interval estimate for θ is obtained through the interval $]\hat{\theta} - k;\ \hat{\theta} + k[$, so that $P(\theta - k < \hat{\theta} < \theta + k) = 1 - \alpha$, which is equivalent to $P(\hat{\theta} - k < \theta < \hat{\theta} + k) = 1 - \alpha$.

8.4.1

Confidence Interval for the Population Mean (m)

Estimating the population mean from a sample is applied to two cases: when the population variance (s2) is known or unknown.


FIG. 8.1 Standard normal distribution.

8.4.1.1 Known Population Variance ($\sigma^2$)

Let X be a random variable with a normal distribution, mean μ, and known variance $\sigma^2$, that is, $X \sim N(\mu, \sigma^2)$. Therefore, we have:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) \tag{8.11}$$

that is, variable Z has a standard normal distribution. Consider that the probability of variable Z assuming values between $-z_c$ and $z_c$ is 1 − α, so, the critical values of $-z_c$ and $z_c$ are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. NR and CR mean nonrejection region and critical region of the distribution, respectively. Therefore, we have:

$$P(-z_c < Z < z_c) = 1 - \alpha \tag{8.12}$$

or:

$$P\left(-z_c < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < z_c\right) = 1 - \alpha \tag{8.13}$$

Thus, the confidence interval for μ is:

$$P\left(\bar{X} - z_c \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_c \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{8.14}$$

Example 8.4: CI for the Population Mean When the Variance Is Known

We would like to estimate the average processing time of a certain part, with a 95% confidence interval. We know that σ = 1.2. In order to do that, a random sample with 400 parts was collected, obtaining a sample mean of $\bar{X}$ = 5.4. Therefore, construct a 95% confidence interval for the true population mean.

Solution

We have σ = 1.2, n = 400, $\bar{X}$ = 5.4, and CI = 95% (α = 5%). The critical values of $-z_c$ and $z_c$ for α = 5% can be obtained from Table E in the Appendix (Fig. 8.2).

FIG. 8.2 Critical values of $-z_c$ and $z_c$.

Applying Expression (8.14):

$$P\left(5.4 - 1.96 \frac{1.2}{\sqrt{400}} < \mu < 5.4 + 1.96 \frac{1.2}{\sqrt{400}}\right) = 95\%$$

that is:

$$P(5.28 < \mu < 5.52) = 95\%$$

Therefore, the [5.28; 5.52] interval contains the average population value with 95% confidence.
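The interval of Example 8.4 can be reproduced with Python's standard library, which provides the standard normal quantile through `statistics.NormalDist().inv_cdf`. The function name `ci_mean_known_variance` is an assumption for this sketch. The computed critical value is about 1.95996 rather than the tabulated 1.96, which does not change the rounded limits.

```python
from math import sqrt
from statistics import NormalDist


def ci_mean_known_variance(x_bar, sigma, n, confidence=0.95):
    """Two-sided CI for the mean with known sigma: x_bar +/- z_c * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z_c = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value
    half_width = z_c * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Example 8.4: sigma = 1.2, n = 400, sample mean 5.4, 95% confidence
low, high = ci_mean_known_variance(x_bar=5.4, sigma=1.2, n=400, confidence=0.95)

print(round(low, 2), round(high, 2))
```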

8.4.1.2 Unknown Population Variance ($\sigma^2$)

Let X be a random variable with a normal distribution, mean μ, and unknown variance $\sigma^2$, that is, $X \sim N(\mu, \sigma^2)$. Since the variance is unknown, it is necessary to use an estimator ($S^2$) instead of $\sigma^2$, which results in another random variable:

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1} \tag{8.15}$$

that is, variable T follows Student's t-distribution with n − 1 degrees of freedom. Consider that the probability of variable T assuming values between $-t_c$ and $t_c$ is 1 − α, so, the critical values of $-t_c$ and $t_c$ are obtained from Student's t-distribution table (Table B in the Appendix), as shown in Fig. 8.3. Therefore, we have:

$$P(-t_c < T < t_c) = 1 - \alpha \tag{8.16}$$

or:

$$P\left(-t_c < \frac{\bar{X} - \mu}{S / \sqrt{n}} < t_c\right) = 1 - \alpha \tag{8.17}$$

Therefore, the confidence interval for μ is:

$$P\left(\bar{X} - t_c \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_c \frac{S}{\sqrt{n}}\right) = 1 - \alpha \tag{8.18}$$

Example 8.5: CI for the Population Mean When the Variance Is Unknown
We would like to estimate the average weight of a given population, with a 95% confidence interval. The random variable analyzed has a normal distribution with mean $\mu$ and unknown variance $\sigma^2$. We pick a sample of 25 individuals from the population and calculate the sample mean ($\bar{X} = 78$) and the sample variance ($S^2 = 36$). Determine the interval that contains the average weight of the population.

Solution
Since the variance is unknown, we use estimator $S^2$, which results in variable T that follows Student's t-distribution. The critical values $-t_c$ and $t_c$, obtained from Table B in the Appendix for a significance level of $\alpha = 5\%$ and 24 degrees of freedom, can be seen in Fig. 8.4. Applying Expression (8.18):

$$P\left(78 - 2.064 \frac{6}{\sqrt{25}} < \mu < 78 + 2.064 \frac{6}{\sqrt{25}}\right) = 95\%$$

that is:

$$P(75.5 < \mu < 80.5) = 95\%$$

Therefore, the interval [75.5; 80.5] contains the average population weight with 95% confidence.

FIG. 8.3 Student's t-distribution.

FIG. 8.4 Critical values of Student's t-distribution.
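As a quick check of Example 8.5, the following Python sketch applies Expression (8.18). The function name `t_confidence_interval` is hypothetical. Because Python's standard library has no Student's t quantile function, the critical value 2.064 is passed in as read from Table B, exactly as in the example.

```python
from math import sqrt


def t_confidence_interval(sample_mean, sample_variance, n, t_c):
    """Confidence interval for the mean when sigma is unknown (Expression 8.18).

    t_c is the critical value of Student's t with n - 1 degrees of freedom,
    supplied by the caller (here, taken from a printed table).
    """
    margin = t_c * sqrt(sample_variance) / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Data from Example 8.5: mean 78, S^2 = 36, n = 25, t_c = 2.064 (24 d.f., alpha = 5%)
lower, upper = t_confidence_interval(78, 36, 25, 2.064)
print(f"{lower:.1f} < mu < {upper:.1f}")
```

Rounded to one decimal, the limits match the interval [75.5; 80.5] of the example.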

8.4.2 Confidence Interval for Proportions

Consider X a random variable that represents whether or not a characteristic of interest is present in the population. Thus, X follows a binomial distribution with parameter p, in which p represents the probability of an element in the population presenting the characteristic we are interested in: $X \sim b(1, p)$, with mean $\mu = p$ and variance $\sigma^2 = p(1 - p)$. A random sample $X_1, X_2, \dots, X_n$ of size n is drawn from the population. Let k be the number of sample elements with the characteristic we are interested in. The estimator of the population proportion p ($\hat{p}$) is given by:

$$\hat{p} = \frac{k}{n} \tag{8.19}$$

If n is large, we can consider that the sample proportion $\hat{p}$ approximately follows a normal distribution with mean p and variance $p(1 - p)/n$:

$$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right) \tag{8.20}$$

We consider the variable $Z = \dfrac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \sim N(0, 1)$. Since n is large, we can replace p with its estimate $\hat{p}$ in the denominator:

$$Z = \frac{\hat{p} - p}{\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}} \sim N(0, 1) \tag{8.21}$$

Consider that the probability of variable Z assuming values between $-z_c$ and $z_c$ is $1 - \alpha$, so the critical values $-z_c$ and $z_c$ are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. Thus, we have:

$$P(-z_c < Z < z_c) = 1 - \alpha \tag{8.22}$$

or:

$$P\left(-z_c < \frac{\hat{p} - p}{\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}} < z_c\right) = 1 - \alpha \tag{8.23}$$

Therefore, the confidence interval for p is:

$$P\left(\hat{p} - z_c \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + z_c \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = 1 - \alpha \tag{8.24}$$

Example 8.6: CI for Proportions
A factory found that, in a batch of 1,000 parts, 230 parts were defective. Construct a 95% confidence interval for the true proportion of defective products.

Solution
We have $n = 1{,}000$, $\hat{p} = \dfrac{k}{n} = \dfrac{230}{1{,}000} = 0.23$, and $z_c = 1.96$. Therefore, Expression (8.24) can be written as:

$$P\left(0.23 - 1.96 \sqrt{\frac{0.23 \times 0.77}{1{,}000}} < p < 0.23 + 1.96 \sqrt{\frac{0.23 \times 0.77}{1{,}000}}\right) = 95\%$$

$$P(0.204 < p < 0.256) = 95\%$$

Thus, the interval [20.4%; 25.6%] contains the true proportion of defective products with 95% confidence.
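Expression (8.24) translates directly into code. The sketch below uses only Python's standard library. The function name `proportion_confidence_interval` is hypothetical, and the critical value comes from `statistics.NormalDist` instead of Table E.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(k, n, confidence=0.95):
    """Large-sample confidence interval for a proportion (Expression 8.24)."""
    p_hat = k / n                                   # Expression (8.19)
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_c * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Data from Example 8.6: 230 defective parts in a batch of 1,000
lower, upper = proportion_confidence_interval(230, 1000)
print(f"{lower:.3f} < p < {upper:.3f}")
```

Rounded to three decimals, the result agrees with the interval [0.204; 0.256] found in the example.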

8.4.3 Confidence Interval for the Population Variance

Let $X_i$ be a random variable with a normal distribution, mean $\mu$, and variance $\sigma^2$, that is, $X_i \sim N(\mu, \sigma^2)$. An estimator for $\sigma^2$ is the sample variance $S^2$. Thus, we consider that random variable Q has a chi-square distribution with $n - 1$ degrees of freedom:

$$Q = \frac{(n - 1) \cdot S^2}{\sigma^2} \sim \chi^2_{n-1} \tag{8.25}$$

Consider that the probability of variable Q assuming values between $\chi^2_{low}$ and $\chi^2_{upp}$ is $1 - \alpha$, so the critical values $\chi^2_{low}$ and $\chi^2_{upp}$ are obtained from the chi-square distribution table (Table D in the Appendix), as shown in Fig. 8.5. Therefore, we have:

$$P\left(\chi^2_{low} < \chi^2_{n-1} < \chi^2_{upp}\right) = 1 - \alpha \tag{8.26}$$

or:

$$P\left(\chi^2_{low} < \frac{(n - 1) \cdot S^2}{\sigma^2} < \chi^2_{upp}\right) = 1 - \alpha \tag{8.27}$$

FIG. 8.5 Chi-square distribution.

Therefore, the confidence interval for $\sigma^2$ is:

$$P\left(\frac{(n - 1) \cdot S^2}{\chi^2_{upp}} < \sigma^2 < \frac{(n - 1) \cdot S^2}{\chi^2_{low}}\right) = 1 - \alpha \tag{8.28}$$

Example 8.7: CI for the Population Variance
Consider the population of Business Administration students at a public university, whose variable of interest is the students' ages. A sample of 101 students was drawn from the normal population and provided $S^2 = 18.22$. Construct a 90% confidence interval for the population variance.

Solution
From the $\chi^2$ distribution table (Table D in the Appendix), for 100 degrees of freedom, we have:

$$\chi^2_{low} = 77.929 \qquad \chi^2_{upp} = 124.342$$

Therefore, Expression (8.28) can be written as follows:

$$P\left(\frac{100 \times 18.22}{124.342} < \sigma^2 < \frac{100 \times 18.22}{77.929}\right) = 90\%$$

$$P\left(14.65 < \sigma^2 < 23.38\right) = 90\%$$

Thus, the interval [14.65; 23.38] contains the true population variance with 90% confidence.
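The arithmetic of Example 8.7 can be checked with the short Python sketch below. The function name `variance_confidence_interval` is hypothetical. The two chi-square critical values are passed in as read from Table D, since the standard library offers no chi-square quantile function.

```python
def variance_confidence_interval(sample_variance, n, chi2_low, chi2_upp):
    """Confidence interval for the population variance (Expression 8.28).

    chi2_low and chi2_upp are the lower and upper critical values of the
    chi-square distribution with n - 1 degrees of freedom, supplied by the caller.
    """
    numerator = (n - 1) * sample_variance
    # The upper critical value gives the lower limit, and vice versa
    return numerator / chi2_upp, numerator / chi2_low


# Data from Example 8.7: S^2 = 18.22, n = 101, 90% confidence, 100 d.f.
lower, upper = variance_confidence_interval(18.22, 101, 77.929, 124.342)
print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
```

Note that the limits swap roles relative to the critical values: dividing by the larger $\chi^2_{upp}$ yields the smaller limit of the interval.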

8.5 FINAL REMARKS

Statistical inference is divided into three main parts: sampling, estimation of population parameters, and hypotheses tests. This chapter discussed estimation methods, which can be point or interval methods. Among the main point estimation methods, we can highlight the method of moments, ordinary least squares, and maximum likelihood estimation. Among the main interval estimation methods, we studied the confidence interval (CI) for the population mean (when the variance is known and when it is unknown), the CI for proportions, and the CI for the population variance.

8.6 EXERCISES

1) We would like to estimate the average age of a population that follows a normal distribution and has a standard deviation $\sigma = 18$. To do so, a sample of 120 individuals was drawn from the population and the mean obtained was 51 years old. Construct a 90% confidence interval for the true population mean.
2) We would like to estimate the average income of a certain population with a normal distribution and an unknown variance. A sample of 36 individuals was drawn from the population, presenting a mean of $\bar{X} = 5{,}400$ and a standard deviation $S = 200$. Construct a 95% confidence interval for the population mean.
3) We would like to estimate the illiteracy rate of a certain municipality. A sample of 500 inhabitants was drawn from the population, presenting an illiteracy rate of 24%. Construct a 95% confidence interval for the proportion of illiterate individuals in the municipality.
4) We would like to estimate the variability of the average time in rendering services to customers in a bank branch. A sample of 61 customers was drawn from the population with a normal distribution and it gave us $S^2 = 8$. Construct a 95% confidence interval for the population variance.

Chapter 9

Hypotheses Tests

We must conduct research and then accept the results. If they don't stand up to experimentation, Buddha's own words must be rejected.
Tenzin Gyatso, 14th Dalai Lama

9.1 INTRODUCTION

As discussed previously, one of the problems to be solved by statistical inference is hypotheses testing. A statistical hypothesis is an assumption about a certain population parameter, such as the mean, the standard deviation, or the correlation coefficient. A hypothesis test is a procedure to decide the veracity or falsehood of a certain hypothesis. For a statistical hypothesis to be validated or rejected with certainty, it would be necessary to examine the entire population, which in practice is not viable. As an alternative, we draw a random sample from the population we are interested in. Since the decision is made based on the sample, errors may occur (rejecting a hypothesis when it is true or not rejecting a hypothesis when it is false), as we will study later on.

The procedures and concepts necessary to construct a hypothesis test will be presented. Let's consider X a variable associated with a population and $\theta$ a certain parameter of this population. We must define the hypothesis to be tested about parameter $\theta$ of this population, which is called the null hypothesis:

$$H_0: \theta = \theta_0 \tag{9.1}$$

Let's also define the alternative hypothesis ($H_1$), in case $H_0$ is rejected, which can be characterized as follows:

$$H_1: \theta \neq \theta_0 \tag{9.2}$$

and the test is called a bilateral test (or two-tailed test). The significance level of a test ($\alpha$) represents the probability of rejecting the null hypothesis when it is true (it is one of the two errors that may occur, as we will see later). The critical region (CR) or rejection region (RR) of a bilateral test is represented by two tails of the same size, in the left and right extremities of the distribution curve, and each one of them corresponds to half of the significance level $\alpha$, as shown in Fig. 9.1.

Another way to define the alternative hypothesis ($H_1$) would be:

$$H_1: \theta < \theta_0 \tag{9.3}$$

and the test is called a unilateral test to the left (or left-tailed test). In this case, the critical region is in the left tail of the distribution and corresponds to significance level $\alpha$, as shown in Fig. 9.2. Or the alternative hypothesis could be:

FIG. 9.1 Critical region (CR) of a bilateral test, also emphasizing the nonrejection region (NR) of the null hypothesis.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00009-4 © 2019 Elsevier Inc. All rights reserved.

$$H_1: \theta > \theta_0 \tag{9.4}$$

and the test is called a unilateral test to the right (or right-tailed test). In this case, the critical region is in the right tail of the distribution and corresponds to significance level $\alpha$, as shown in Fig. 9.3.

Thus, if the main objective is to check whether a parameter is significantly higher or lower than a certain value, we have to use a unilateral test. On the other hand, if the objective is to check whether a parameter is different from a certain value, we have to use a bilateral test.

After defining the null hypothesis to be tested, we use a random sample collected from the population to decide whether or not to reject it. Since the decision is made based on the sample, two types of errors may happen:

Type I error: rejecting the null hypothesis when it is true. The probability of this type of error is represented by $\alpha$:

$$P(\text{type I error}) = P(\text{rejecting } H_0 \mid H_0 \text{ is true}) = \alpha \tag{9.5}$$

Type II error: not rejecting the null hypothesis when it is false. The probability of this type of error is represented by $\beta$:

$$P(\text{type II error}) = P(\text{not rejecting } H_0 \mid H_0 \text{ is false}) = \beta \tag{9.6}$$

Table 9.1 shows the types of errors that may happen in a hypothesis test. The procedure for constructing a hypothesis test includes the following steps:

Step 1: Choosing the most suitable statistical test, depending on the researcher's intention.
Step 2: Presenting the test's null hypothesis $H_0$ and its alternative hypothesis $H_1$.
Step 3: Setting the significance level $\alpha$.
Step 4: Calculating the observed value of the statistic based on the sample obtained from the population.
Step 5: Determining the test's critical region based on the value of $\alpha$ set in Step 3.
Step 6: Decision: if the value of the statistic lies in the critical region, reject $H_0$. Otherwise, do not reject $H_0$.

According to Fávero et al. (2009), most statistical software packages, among them SPSS and Stata, calculate the P-value, which corresponds to the probability associated with the value of the statistic calculated from the sample. The P-value indicates the lowest significance level that would lead to the rejection of the null hypothesis. Thus, we reject $H_0$ if $P \le \alpha$.

FIG. 9.2 Critical region (CR) of a left-tailed test, also emphasizing the nonrejection region of the null hypothesis (NR).

FIG. 9.3 Critical region (CR) of a right-tailed test.

TABLE 9.1 Types of Errors

| Decision | H0 Is True | H0 Is False |
|---|---|---|
| Not rejecting H0 | Correct decision (1 - α) | Type II error (β) |
| Rejecting H0 | Type I error (α) | Correct decision (1 - β) |
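The meaning of $\alpha$ as the probability of a Type I error can be illustrated with a small simulation. The sketch below is not part of the original text: it repeatedly draws samples from a normal population for which $H_0$ is true, applies a two-tailed Z test with known $\sigma$, and records how often $H_0$ is wrongly rejected. All parameter values (`mu0`, `sigma`, `n`, `trials`, `seed`) and the function name are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=50.0, sigma=10.0, n=30, alpha=0.05, trials=4000, seed=1):
    """Estimate P(rejecting H0 | H0 is true) for a two-tailed Z test by simulation."""
    rng = random.Random(seed)
    z_c = NormalDist().inv_cdf(1 - alpha / 2)    # critical value, alpha/2 in each tail
    rejections = 0
    for _ in range(trials):
        # The sample is generated with mean mu0, so H0: mu = mu0 is true
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z_cal = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z_cal) > z_c:                     # statistic falls in the critical region
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"Empirical Type I error rate: {rate:.3f}")
```

With a few thousand repetitions, the empirical rejection rate should fluctuate around the chosen significance level of 5%.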


If we use the P-value instead of the statistic's critical value, Steps 5 and 6 of the construction of the hypotheses tests will be:

Step 5: Determine the P-value that corresponds to the probability associated with the value of the statistic calculated in Step 4.
Step 6: Decision: if the P-value is less than the significance level $\alpha$ established in Step 3, reject $H_0$. Otherwise, do not reject $H_0$.
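The P-value version of Steps 4 to 6 can be sketched in Python for the simplest case, a two-tailed Z test for the mean with known $\sigma$. This specific test, the function name `two_tailed_z_test`, and the hypothesized value $\mu_0 = 5.3$ are assumptions introduced only for illustration. The remaining numbers are borrowed from Example 8.4.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z test of H0: mu = mu0 against H1: mu != mu0 (sigma known)."""
    # Step 4: observed value of the statistic
    z_cal = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 5: P-value, the probability of a value at least as extreme in either tail
    p_value = 2 * (1 - NormalDist().cdf(abs(z_cal)))
    # Step 6: decision
    reject_h0 = p_value <= alpha
    return z_cal, p_value, reject_h0


# Hypothetical test of H0: mu = 5.3 with sample mean 5.4, sigma = 1.2, n = 400
z_cal, p_value, reject_h0 = two_tailed_z_test(5.4, 5.3, 1.2, 400)
print(f"z = {z_cal:.3f}, P-value = {p_value:.4f}, reject H0: {reject_h0}")
```

In this hypothetical case the P-value is above 5%, so $H_0$ is not rejected at $\alpha = 5\%$, although it would be rejected at $\alpha = 10\%$.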

9.2 PARAMETRIC TESTS

Hypotheses tests are divided into parametric and nonparametric tests. In this chapter, we will study parametric tests. Nonparametric tests will be studied in the next chapter.

Parametric tests involve population parameters. A parameter is any numerical measure or quantitative characteristic that describes a population. Parameters are fixed values, usually unknown, and represented by Greek characters, such as the population mean ($\mu$), the population standard deviation ($\sigma$), and the population variance ($\sigma^2$), among others. When hypotheses are formulated about population parameters, the hypothesis test is called parametric. In nonparametric tests, hypotheses are formulated about qualitative characteristics of the population.

Therefore, parametric methods are applied to quantitative data and require strong assumptions in order to be valid, including:

(i) The observations must be independent;
(ii) The sample must be drawn from populations with a certain distribution, usually normal;
(iii) The populations must have equal variances for the comparison tests of two paired population means or k population means ($k \ge 3$);
(iv) The variables being studied must be measured on an interval or a ratio scale, so that arithmetic operations can be applied to their values.

We will study the main parametric tests, including tests for normality, homogeneity of variance tests, Student's t-test and its applications, in addition to the analysis of variance (ANOVA) and its extensions. All of them will be solved analytically and also through the statistical software packages SPSS and Stata.

To verify the univariate normality of the data, the most common tests are Kolmogorov-Smirnov and Shapiro-Wilk. To compare the variance homogeneity between populations, we have Bartlett's $\chi^2$ (1937), Cochran's C (1947a,b), Hartley's $F_{max}$ (1950), and Levene's F (1960) tests.

We will describe Student's t-test for three situations: to test hypotheses about the population mean, to compare two independent means, and to compare two paired means. ANOVA is an extension of Student's t-test and is used to compare the means of more than two populations. In this chapter, one-factor ANOVA, two-factor ANOVA, and its extension to more than two factors will be described.

9.3 UNIVARIATE TESTS FOR NORMALITY

Among all univariate tests for normality, the most common are Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia.

9.3.1 Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test (K-S) is an adherence test, that is, it compares the cumulative frequency distribution of a set of sample values (observed values) to a theoretical distribution. The main goal is to test whether the sample values come from a population with a supposed theoretical or expected distribution, in this case, the normal distribution. The statistic is given by the point with the biggest difference (in absolute value) between the two distributions.

To use the K-S test, the population mean and standard deviation must be known. For small samples, the test loses power, so it should be used with large samples ($n \ge 30$).

The K-S test assumes the following hypotheses:

H0: the sample comes from a population with distribution $N(\mu, \sigma)$
H1: the sample does not come from a population with distribution $N(\mu, \sigma)$

As specified in Fávero et al. (2009), let $F_{exp}(X)$ be the expected (normal) distribution function of cumulative relative frequencies of variable X, where $F_{exp}(X) \sim N(\mu, \sigma)$, and $F_{obs}(X)$ the observed cumulative relative frequency distribution of variable X. The objective is to test whether $F_{obs}(X) = F_{exp}(X)$, in contrast with the alternative that $F_{obs}(X) \neq F_{exp}(X)$. The statistic can be calculated through the following expression:

$$D_{cal} = \max\left\{ \left|F_{exp}(X_i) - F_{obs}(X_i)\right| ; \left|F_{exp}(X_i) - F_{obs}(X_{i-1})\right| \right\}, \text{ for } i = 1, \dots, n \tag{9.7}$$

where:
$F_{exp}(X_i)$: expected cumulative relative frequency in category i;
$F_{obs}(X_i)$: observed cumulative relative frequency in category i;
$F_{obs}(X_{i-1})$: observed cumulative relative frequency in category $i - 1$.

The critical values of the Kolmogorov-Smirnov statistic ($D_c$) are shown in Table G in the Appendix. This table provides the critical values of $D_c$ considering that $P(D_{cal} > D_c) = \alpha$ (for a right-tailed test). In order for the null hypothesis $H_0$ to be rejected, the value of the $D_{cal}$ statistic must be in the critical region, that is, $D_{cal} > D_c$. Otherwise, we do not reject $H_0$.

The P-value (the probability associated with the value of the $D_{cal}$ statistic calculated from the sample) can also be seen in Table G. In this case, we reject $H_0$ if $P \le \alpha$.

Example 9.1: Using the Kolmogorov-Smirnov Test
Table 9.E.1 shows the data on a company's monthly production of farming equipment in the last 36 months. Check whether the data in Table 9.E.1 come from a population that follows a normal distribution, considering that $\alpha = 5\%$.

TABLE 9.E.1 Production of Farming Equipment in the Last 36 Months

52  50  44  50  42  30  36  34  48  40  55  40
30  36  40  42  55  44  38  42  40  38  52  44
52  34  38  44  48  36  36  55  50  34  44  42

Solution
Step 1: Since the objective is to verify whether the data in Table 9.E.1 come from a population with a normal distribution, the most suitable test is Kolmogorov-Smirnov (K-S).
Step 2: The K-S test hypotheses for this example are:
H0: the production of farming equipment in the population follows distribution $N(\mu, \sigma)$
H1: the production of farming equipment in the population does not follow distribution $N(\mu, \sigma)$
Step 3: The significance level to be considered is 5%.
Step 4: All the steps necessary to calculate $D_{cal}$ from Expression (9.7) are specified in Table 9.E.2.

TABLE 9.E.2 Calculating the Kolmogorov-Smirnov Statistic

| Xi | Fabs (a) | Fac (b) | Fracobs (c) | Zi (d) | Fracexp (e) | \|Fexp(Xi) - Fobs(Xi)\| | \|Fexp(Xi) - Fobs(Xi-1)\| |
|---|---|---|---|---|---|---|---|
| 30 | 2 | 2 | 0.056 | -1.7801 | 0.0375 | 0.018 | 0.036 |
| 34 | 3 | 5 | 0.139 | -1.2168 | 0.1118 | 0.027 | 0.056 |
| 36 | 4 | 9 | 0.250 | -0.9351 | 0.1743 | 0.076 | 0.035 |
| 38 | 3 | 12 | 0.333 | -0.6534 | 0.2567 | 0.077 | 0.007 |
| 40 | 4 | 16 | 0.444 | -0.3717 | 0.3551 | 0.089 | 0.022 |
| 42 | 4 | 20 | 0.556 | -0.0900 | 0.4641 | 0.092 | 0.020 |
| 44 | 5 | 25 | 0.694 | 0.1917 | 0.5760 | 0.118 | 0.020 |
| 48 | 2 | 27 | 0.750 | 0.7551 | 0.7749 | 0.025 | 0.081 |
| 50 | 3 | 30 | 0.833 | 1.0368 | 0.8501 | 0.017 | 0.100 |
| 52 | 3 | 33 | 0.917 | 1.3185 | 0.9064 | 0.010 | 0.073 |
| 55 | 3 | 36 | 1 | 1.7410 | 0.9592 | 0.041 | 0.043 |

(a) Absolute frequency.
(b) Cumulative (absolute) frequency.
(c) Observed cumulative relative frequency of Xi.
(d) Standardized Xi values according to the expression $Z_i = \dfrac{X_i - \bar{X}}{S}$.
(e) Expected cumulative relative frequency of Xi, which corresponds to the probability obtained in Table E in the Appendix (standard normal distribution table) from the value of Zi.

Therefore, the observed value of the K-S statistic based on the sample is $D_{cal} = 0.118$.
Step 5: According to Table G in the Appendix, for $n = 36$ and $\alpha = 5\%$, the critical value of the Kolmogorov-Smirnov statistic is $D_c = 0.23$.
Step 6: Decision: since the value calculated is not in the critical region ($D_{cal} < D_c$), the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that the sample is drawn from a population that follows a normal distribution.

If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:
Step 5: According to Table G in the Appendix, for a sample size $n = 36$, the probability associated with $D_{cal} = 0.118$ has a lower bound of $P = 0.20$.
Step 6: Decision: since $P > 0.05$, we do not reject $H_0$.
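The construction of Table 9.E.2 can be sketched in Python using only the standard library. The function name `ks_statistic` is hypothetical. As in the table, the sample mean and the sample standard deviation are used as the parameters of the expected normal distribution, and Expression (9.7) is evaluated at each distinct value of X.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev

# Data from Table 9.E.1
production = [52, 50, 44, 50, 42, 30, 36, 34, 48, 40, 55, 40,
              30, 36, 40, 42, 55, 44, 38, 42, 40, 38, 52, 44,
              52, 34, 38, 44, 48, 36, 36, 55, 50, 34, 44, 42]


def ks_statistic(values):
    """D_cal of Expression (9.7), evaluated at each distinct value of X."""
    n = len(values)
    expected = NormalDist(mean(values), stdev(values))
    counts = Counter(values)
    d_cal = 0.0
    cumulative = 0
    for x in sorted(counts):
        f_exp = expected.cdf(x)           # expected cumulative relative frequency
        f_obs_prev = cumulative / n       # observed cumulative frequency at X(i-1)
        cumulative += counts[x]
        f_obs = cumulative / n            # observed cumulative frequency at X(i)
        d_cal = max(d_cal, abs(f_exp - f_obs), abs(f_exp - f_obs_prev))
    return d_cal


d_cal = ks_statistic(production)
print(f"D_cal = {d_cal:.3f}")
```

The largest difference occurs at X = 44, in agreement with the value $D_{cal} = 0.118$ shown in Table 9.E.2.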

9.3.2 Shapiro-Wilk Test

The Shapiro-Wilk test (S-W) is based on Shapiro and Wilk (1965) and can be applied to samples with $4 \le n \le 2000$ observations. It is an alternative to the Kolmogorov-Smirnov test for normality (K-S) in the case of small samples ($n < 30$).

Analogous to the K-S test, the S-W test for normality assumes the following hypotheses:

H0: the sample comes from a population with distribution $N(\mu, \sigma)$
H1: the sample does not come from a population with distribution $N(\mu, \sigma)$

The calculation of the Shapiro-Wilk statistic ($W_{cal}$) is given by:

$$W_{cal} = \frac{b^2}{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}, \text{ for } i = 1, \dots, n \tag{9.8}$$

$$b = \sum_{i=1}^{n/2} a_{i,n} \left(X_{(n-i+1)} - X_{(i)}\right) \tag{9.9}$$

where:
$X_{(i)}$ are the sample order statistics, that is, the i-th ordered observation, so $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$;
$\bar{X}$ is the mean of X;
$a_{i,n}$ are constants generated from the means, variances, and covariances of the order statistics of a random sample of size n from a normal distribution. Their values can be seen in Table H2 in the Appendix.

Small values of $W_{cal}$ indicate that the distribution of the variable being studied is not normal. The critical values of the Shapiro-Wilk statistic ($W_c$) are shown in Table H1 in the Appendix. Unlike most tables, this table provides the critical values of $W_c$ considering that $P(W_{cal} < W_c) = \alpha$ (for a left-tailed test). In order for the null hypothesis $H_0$ to be rejected, the value of the $W_{cal}$ statistic must be in the critical region, that is, $W_{cal} < W_c$. Otherwise, we do not reject $H_0$.

The P-value (the probability associated with the value of the $W_{cal}$ statistic calculated from the sample) can also be seen in Table H1. In this case, we reject $H_0$ if $P \le \alpha$.


Example 9.2: Using the Shapiro-Wilk Test
Table 9.E.3 shows the data on an aerospace company's monthly production of aircraft in the last 24 months. Check whether the data in Table 9.E.3 come from a population with a normal distribution, considering that $\alpha = 1\%$.

TABLE 9.E.3 Production of Aircraft in the Last 24 Months

28  32  46  24  22  18  20  34  30  24  31  29
15  19  23  25  28  30  32  36  39  16  23  36

Solution
Step 1: For a normality test in which $n < 30$, the most recommended test is the Shapiro-Wilk (S-W).
Step 2: The S-W test hypotheses for this example are:
H0: the production of aircraft in the population follows normal distribution $N(\mu, \sigma)$
H1: the production of aircraft in the population does not follow normal distribution $N(\mu, \sigma)$
Step 3: The significance level to be considered is 1%.
Step 4: The S-W statistic for the data in Table 9.E.3 is calculated according to Expressions (9.8) and (9.9). First of all, to calculate b, we must sort the data in Table 9.E.3 in ascending order, as shown in Table 9.E.4. All the steps necessary to calculate b, from Expression (9.9), are specified in Table 9.E.5. The values of $a_{i,n}$ were obtained from Table H2 in the Appendix.

TABLE 9.E.4 Values From Table 9.E.3 Sorted in Ascending Order

15  16  18  19  20  22  23  23  24  24  25  28
28  29  30  30  31  32  32  34  36  36  39  46

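TABLE 9.E.5 Procedure to Calculate b

| i | n - i + 1 | a(i,n) | X(n-i+1) | X(i) | a(i,n) (X(n-i+1) - X(i)) |
|---|---|---|---|---|---|
| 1 | 24 | 0.4493 | 46 | 15 | 13.9283 |
| 2 | 23 | 0.3098 | 39 | 16 | 7.1254 |
| 3 | 22 | 0.2554 | 36 | 18 | 4.5972 |
| 4 | 21 | 0.2145 | 36 | 19 | 3.6465 |
| 5 | 20 | 0.1807 | 34 | 20 | 2.5298 |
| 6 | 19 | 0.1512 | 32 | 22 | 1.5120 |
| 7 | 18 | 0.1245 | 32 | 23 | 1.1205 |
| 8 | 17 | 0.0997 | 31 | 23 | 0.7976 |
| 9 | 16 | 0.0764 | 30 | 24 | 0.4584 |
| 10 | 15 | 0.0539 | 30 | 24 | 0.3234 |
| 11 | 14 | 0.0321 | 29 | 25 | 0.1284 |
| 12 | 13 | 0.0107 | 28 | 28 | 0.0000 |
| | | | | | b = 36.1675 |

We have:

$$\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 = (28 - 27.5)^2 + \cdots + (36 - 27.5)^2 = 1338$$

Therefore:

$$W_{cal} = \frac{b^2}{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2} = \frac{(36.1675)^2}{1338} = 0.978$$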

Step 5: According to Table H1 in the Appendix, for $n = 24$ and $\alpha = 1\%$, the critical value of the Shapiro-Wilk statistic is $W_c = 0.884$.
Step 6: Decision: the null hypothesis is not rejected, since $W_{cal} > W_c$ (Table H1 provides the critical values of $W_c$ considering that $P(W_{cal} < W_c) = \alpha$), which allows us to conclude, with a 99% confidence level, that the sample is drawn from a population with a normal distribution.

If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:
Step 5: According to Table H1 in the Appendix, for a sample size $n = 24$, the probability associated with $W_{cal} = 0.978$ is between 0.50 and 0.90 (a probability of 0.90 is associated with $W_{cal} = 0.981$).
Step 6: Decision: since $P > 0.01$, we do not reject $H_0$.
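The calculation of Example 9.2 can be retraced with the Python sketch below. The function name `shapiro_wilk_statistic` is hypothetical. The coefficients $a_{i,n}$ for $n = 24$ are copied from Table 9.E.5 and are not computed by the code.

```python
# Data from Table 9.E.3
aircraft = [28, 32, 46, 24, 22, 18, 20, 34, 30, 24, 31, 29,
            15, 19, 23, 25, 28, 30, 32, 36, 39, 16, 23, 36]

# Coefficients a(i,n) for n = 24, as listed in Table 9.E.5
A_24 = [0.4493, 0.3098, 0.2554, 0.2145, 0.1807, 0.1512,
        0.1245, 0.0997, 0.0764, 0.0539, 0.0321, 0.0107]


def shapiro_wilk_statistic(values, coefficients):
    """Return (b, sum of squared deviations, W_cal) per Expressions (9.8) and (9.9)."""
    ordered = sorted(values)
    n = len(ordered)
    # Expression (9.9): pair the i-th smallest with the i-th largest observation
    b = sum(a * (ordered[n - 1 - i] - ordered[i])
            for i, a in enumerate(coefficients))
    x_bar = sum(ordered) / n
    sum_sq = sum((x - x_bar) ** 2 for x in ordered)
    return b, sum_sq, b ** 2 / sum_sq             # Expression (9.8)


b, sum_sq, w_cal = shapiro_wilk_statistic(aircraft, A_24)
print(f"b = {b:.4f}, sum of squares = {sum_sq:.0f}, W_cal = {w_cal:.3f}")
```

The sketch returns the intermediate quantities as well, so that each one can be compared with the corresponding step of the example.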

9.3.3 Shapiro-Francia Test

This test is based on Shapiro and Francia (1972). According to Sarkadi (1975), the Shapiro-Wilk (S-W) and Shapiro-Francia (S-F) tests have the same format, differing only in how the coefficients are defined. Moreover, calculating the S-F test is much simpler, and it can be considered a simplified version of the S-W test. Despite its simplicity, it is as robust as the Shapiro-Wilk test, making it a substitute for the S-W.

The Shapiro-Francia test can be applied to samples with $5 \le n \le 5000$ observations, and it is similar to the Shapiro-Wilk test for large samples.

Analogous to the S-W test, the S-F test assumes the following hypotheses:

H0: the sample comes from a population with distribution $N(\mu, \sigma)$
H1: the sample does not come from a population with distribution $N(\mu, \sigma)$

The calculation of the Shapiro-Francia statistic ($W'_{cal}$) is given by:

$$W'_{cal} = \frac{\left[\sum_{i=1}^{n} m_i \cdot X_{(i)}\right]^2}{\sum_{i=1}^{n} m_i^2 \cdot \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}, \text{ for } i = 1, \dots, n \tag{9.10}$$

where:
$X_{(i)}$ are the sample order statistics, that is, the ith ordered observation, so $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$;
$m_i$ is the approximate expected value of the ith observation (Z-score). The values of $m_i$ are estimated by:

$$m_i = \Phi^{-1}\left(\frac{i}{n + 1}\right) \tag{9.11}$$

where $\Phi^{-1}$ is the inverse of the cumulative distribution function of a standard normal distribution (mean zero and standard deviation 1). These values can be obtained from Table E in the Appendix.

Small values of $W'_{cal}$ indicate that the distribution of the variable being studied is not normal. The critical values of the Shapiro-Francia statistic ($W'_c$) are shown in Table H1 in the Appendix. Unlike most tables, this table provides the critical values of $W'_c$ considering that $P(W'_{cal} < W'_c) = \alpha$ (for a left-tailed test). In order for the null hypothesis $H_0$ to be rejected, the value of the $W'_{cal}$ statistic must be in the critical region, that is, $W'_{cal} < W'_c$. Otherwise, we do not reject $H_0$.

The P-value (the probability associated with the $W'_{cal}$ statistic calculated from the sample) can also be seen in Table H1. In this case, we reject $H_0$ if $P \le \alpha$.

Example 9.3: Using the Shapiro-Francia Test
Table 9.E.6 shows all the data regarding a company's daily production of bicycles in the last 60 months. Check whether the data come from a population with a normal distribution, considering $\alpha = 5\%$.

Solution
Step 1: The normality of the data can be verified through the Shapiro-Francia test.
Step 2: The S-F test hypotheses for this example are:
H0: the production of bicycles in the population follows normal distribution $N(\mu, \sigma)$
H1: the production of bicycles in the population does not follow normal distribution $N(\mu, \sigma)$
Step 3: The significance level to be considered is 5%.


TABLE 9.E.6 Production of Bicycles in the Last 60 Months

85  70  74  49  67  88  80  91  57  63
66  60  72  81  73  80  55  54  93  77
80  64  60  63  67  54  59  78  73  84
91  57  59  64  68  67  70  76  78  75
80  81  70  77  65  63  59  60  61  74
76  81  79  78  60  68  76  71  72  84

Step 4: The procedure to calculate the S-F statistic for the data in Table 9.E.6 is shown in Table 9.E.7.

TABLE 9.E.7 Procedure to Calculate the Shapiro-Francia Statistic

| i | X(i) | i/(n + 1) | mi | mi X(i) | mi^2 | (Xi - X̄)^2 |
|---|---|---|---|---|---|---|
| 1 | 49 | 0.0164 | -2.1347 | -104.5995 | 4.5569 | 481.8025 |
| 2 | 54 | 0.0328 | -1.8413 | -99.4316 | 3.3905 | 287.3025 |
| 3 | 54 | 0.0492 | -1.6529 | -89.2541 | 2.7319 | 287.3025 |
| 4 | 55 | 0.0656 | -1.5096 | -83.0276 | 2.2789 | 254.4025 |
| 5 | 57 | 0.0820 | -1.3920 | -79.3417 | 1.9376 | 194.6025 |
| 6 | 57 | 0.0984 | -1.2909 | -73.5841 | 1.6665 | 194.6025 |
| 7 | 59 | 0.1148 | -1.2016 | -70.8960 | 1.4439 | 142.8025 |
| 8 | 59 | 0.1311 | -1.1210 | -66.1380 | 1.2566 | 142.8025 |
| … | … | … | … | … | … | … |
| 60 | 93 | 0.9836 | 2.1347 | 198.5256 | 4.5569 | 486.2025 |
| Sum | | | | 574.6704 | 53.1904 | 6278.8500 |

Therefore, $W'_{cal} = (574.6704)^2 / (53.1904 \times 6278.8500) = 0.989$.

Step 5: According to Table H1 in the Appendix, for $n = 60$ and $\alpha = 5\%$, the critical value of the Shapiro-Francia statistic is $W'_c = 0.9625$.
Step 6: Decision: the null hypothesis is not rejected because $W'_{cal} > W'_c$ (Table H1 provides the critical values of $W'_c$ considering that $P(W'_{cal} < W'_c) = \alpha$), which allows us to conclude, with a 95% confidence level, that the sample is drawn from a population that follows a normal distribution.

If we used the P-value instead of the statistic's critical value, Steps 5 and 6 would be:
Step 5: According to Table H1 in the Appendix, for a sample size $n = 60$, the probability associated with $W'_{cal} = 0.989$ is greater than 0.10 (P-value).
Step 6: Decision: since $P > 0.05$, we do not reject $H_0$.
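Expressions (9.10) and (9.11) are simple enough to evaluate directly. The Python sketch below is only an illustration, and the function name `shapiro_francia_statistic` is hypothetical. It obtains the values of $m_i$ from `statistics.NormalDist().inv_cdf` instead of Table E and applies the formula to the data in Table 9.E.6.

```python
from statistics import NormalDist

# Data from Table 9.E.6
bicycles = [85, 70, 74, 49, 67, 88, 80, 91, 57, 63,
            66, 60, 72, 81, 73, 80, 55, 54, 93, 77,
            80, 64, 60, 63, 67, 54, 59, 78, 73, 84,
            91, 57, 59, 64, 68, 67, 70, 76, 78, 75,
            80, 81, 70, 77, 65, 63, 59, 60, 61, 74,
            76, 81, 79, 78, 60, 68, 76, 71, 72, 84]


def shapiro_francia_statistic(values):
    """W'_cal of Expression (9.10), with m_i given by Expression (9.11)."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()                     # mean 0, standard deviation 1
    m = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    x_bar = sum(ordered) / n
    numerator = sum(m_i * x_i for m_i, x_i in zip(m, ordered)) ** 2
    sum_m_sq = sum(m_i ** 2 for m_i in m)
    sum_sq_dev = sum((x_i - x_bar) ** 2 for x_i in ordered)
    return numerator / (sum_m_sq * sum_sq_dev)


w_cal = shapiro_francia_statistic(bicycles)
print(f"W'_cal = {w_cal:.3f}")
```

Small differences in the last decimals relative to Table 9.E.7 may arise because the table works with rounded values of $m_i$.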

9.3.4 Solving Tests for Normality by Using SPSS Software

The Kolmogorov-Smirnov and Shapiro-Wilk tests for normality can be solved by using IBM SPSS Statistics Software. The Shapiro-Francia test, on the other hand, will be carried out in the Stata software, as we will see in the next section. Based on the procedure that will be described, SPSS shows the results of the K-S and the S-W tests for the sample selected. The use of the images in this section has been authorized by the International Business Machines Corporation©.

Let's consider the data presented in Example 9.1 that are available in the file Production_FarmingEquipment.sav. Let's open the file and select Analyze → Descriptive Statistics → Explore …, as shown in Fig. 9.4. From the Explore dialog box, we must select the variable we are interested in on the Dependent List, as shown in Fig. 9.5. Let's click on Plots … (the Explore: Plots dialog box will open) and select the option Normality plots with tests (Fig. 9.6). Finally, let's click on Continue and on OK.


FIG. 9.4 Procedure for elaborating a univariate normality test on SPSS for Example 9.1.

FIG. 9.5 Selecting the variable of interest.

The results of the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality for the data in Example 9.1 are shown in Fig. 9.7. According to Fig. 9.7, the result of the K-S statistic was 0.118, similar to the value calculated in Example 9.1. Since the sample has more than 30 elements, we should only use the K-S test to verify the normality of the data (the S-W test was applied to Example 9.2). Nevertheless, SPSS also makes the result of the S-W statistic available for the sample selected.


FIG. 9.6 Selecting the normality test on SPSS.

FIG. 9.7 Results of the tests for normality for Example 9.1 on SPSS.

FIG. 9.8 Results of the tests for normality for Example 9.2 on SPSS.

As presented in the introduction of this chapter, SPSS calculates the P-value that corresponds to the lowest significance level that would lead to the rejection of the null hypothesis. For the K-S and S-W tests, the P-value corresponds to the lowest value of P from which $D_{cal} > D_c$ and $W_{cal} < W_c$. As shown in Fig. 9.7, the value of P for the K-S test was 0.200 (this probability can also be obtained from Table G in the Appendix, as shown in Example 9.1). Since $P > 0.05$, we do not reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the data distribution is normal. The S-W test also allows us to conclude that the data follow a normal distribution.

Applying the same procedure to verify the normality of the data in Example 9.2 (the data are available in the file Production_Aircraft.sav), we get the results shown in Fig. 9.8. Analogous to Example 9.2, the result of the S-W test was 0.978. The K-S test was not applied to this example due to the sample size ($n < 30$). The P-value of the S-W test is 0.857 (in Example 9.2, we saw that this probability would be between 0.50 and 0.90 and closer to 0.90) and, since $P > 0.01$, the null hypothesis is not rejected, which allows us to conclude that the data distribution in the population follows a normal distribution. We will use this test when estimating regression models in Chapter 13. For this example, we can also conclude from the K-S test that the data follow a normal distribution.

9.3.5 Solving Tests for Normality by Using Stata

The Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia tests for normality can be solved by using Stata Statistical Software. The Kolmogorov-Smirnov test will be applied to Example 9.1, the Shapiro-Wilk test to Example 9.2, and the Shapiro-Francia test to Example 9.3. The use of the images in this section has been authorized by StataCorp LP©.

9.3.5.1 Kolmogorov-Smirnov Test on the Stata Software

The data presented in Example 9.1 are available in the file Production_FarmingEquipment.dta. Let's open this file and verify that the name of the variable being studied is production. To run the Kolmogorov-Smirnov test on Stata, we must specify the mean and the standard deviation of the variable of interest in the test syntax, so the command summarize, or simply sum, must be typed first, followed by the respective variable:

    sum production

The output is shown in Fig. 9.9. We can see that the mean is 42.63889 and the standard deviation is 7.099911. The Kolmogorov-Smirnov test is given by the following command:

    ksmirnov production = normal((production-42.63889)/7.099911)

The result of the test can be seen in Fig. 9.10. We can see that the value of the statistic is similar to the one calculated in Example 9.1 and by SPSS software. Since P > 0.05, we conclude that the data distribution is normal.
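The quantity that the command above compares, the largest gap between the empirical distribution of the sample and a normal distribution with the sample mean and standard deviation, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ks_statistic_normal` and the five-value sample are assumptions made for the example (the data of Example 9.1 are not reproduced in this section), and the sketch does not attempt to reproduce how Stata or SPSS obtain the P-value.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(sample):
    """Largest distance between the empirical CDF of the sample and a
    normal CDF fitted with the sample mean and standard deviation."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so both
        # sides of the jump must be compared with the fitted CDF.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical values used only to exercise the function;
# they are NOT the observations of Example 9.1.
illustrative_sample = [1, 2, 3, 4, 5]
d_cal = ks_statistic_normal(illustrative_sample)
print(f"D_cal = {d_cal:.4f}")
```

The resulting value would then be compared with the critical value from Table G, exactly as done by hand in Example 9.1.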

9.3.5.2 Shapiro-Wilk Test on the Stata Software

The data presented in Example 9.2 are available in the file Production_Aircraft.dta. To run the Shapiro-Wilk test on Stata, the syntax of the command is:

    swilk variables*

where the term variables* should be replaced by the list of variables being considered. For the data in Example 9.2, we have a single variable called production, so the command to be typed is:

    swilk production

FIG. 9.9 Descriptive statistics of the variable production.

FIG. 9.10 Results of the Kolmogorov-Smirnov test on Stata.


FIG. 9.11 Results of the Shapiro-Wilk test for Example 9.2 on Stata.

FIG. 9.12 Results of the Shapiro-Francia test for Example 9.3 on Stata.

The result of the Shapiro-Wilk test can be seen in Fig. 9.11. Since P > 0.05, we can conclude that the sample comes from a population with a normal distribution.

9.3.5.3 Shapiro-Francia Test on the Stata Software

The data presented in Example 9.3 are available in the file Production_Bicycles.dta. To run the Shapiro-Francia test on Stata, the syntax of the command is:

    sfrancia variables*

where the term variables* should be replaced by the list of variables being considered. For the data in Example 9.3, we have a single variable called production, so the command to be typed is:

    sfrancia production

The result of the Shapiro-Francia test can be seen in Fig. 9.12. We can see that the value is similar to the one calculated in Example 9.3 ($W' = 0.989$). Since P > 0.05, we conclude that the sample comes from a population with a normal distribution. We will use this test when estimating regression models in Chapter 13.

9.4 TESTS FOR THE HOMOGENEITY OF VARIANCES

One of the conditions to apply a parametric test to compare k population means is that the population variances, estimated from k representative samples, be homogeneous or equal. The most common tests to verify variance homogeneity are Bartlett's χ² (1937), Cochran's C (1947a,b), Hartley's Fmax (1950), and Levene's F (1960) tests. In the null hypothesis of variance homogeneity tests, the variances of the k populations are homogeneous. In the alternative hypothesis, at least one population variance is different from the others. That is:

$$H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$$
$$H_1: \exists\, i, j: \sigma_i^2 \neq \sigma_j^2 \quad (i, j = 1, \dots, k) \tag{9.12}$$

9.4.1 Bartlett's χ² Test

The original test proposed to verify variance homogeneity among groups is Bartlett's χ² test (1937). This test is very sensitive to deviations from normality, and Levene's test is an alternative in this case. Bartlett's statistic is calculated from q:

$$q = (N - k) \cdot \ln\left(S_p^2\right) - \sum_{i=1}^{k} (n_i - 1) \cdot \ln\left(S_i^2\right) \tag{9.13}$$

where:

$n_i$, i = 1, …, k, is the size of each sample i and $\sum_{i=1}^{k} n_i = N$;
$S_i^2$, i = 1, …, k, is the variance in each sample i;

and

$$S_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) \cdot S_i^2}{N - k} \tag{9.14}$$

A correction factor c is applied to the q statistic, with the following expression:

$$c = 1 + \frac{1}{3 \cdot (k - 1)} \cdot \left( \sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{N - k} \right) \tag{9.15}$$

so that Bartlett's statistic ($B_{cal}$) approximately follows a chi-square distribution with k − 1 degrees of freedom:

$$B_{cal} = \frac{q}{c} \sim \chi^2_{k-1} \tag{9.16}$$

From the previous expressions, we can see that the higher the difference between the variances, the higher the value of B. On the other hand, if all the sample variances are equal, its value will be zero. To confirm whether the null hypothesis of variance homogeneity will be rejected or not, the value calculated must be compared to the statistic's critical value ($\chi^2_c$), which is available in Table D in the Appendix.

This table provides the critical values of $\chi^2_c$ considering that $P(\chi^2_{cal} > \chi^2_c) = \alpha$ (for a right-tailed test). Therefore, we reject the null hypothesis if $B_{cal} > \chi^2_c$. On the other hand, if $B_{cal} \leq \chi^2_c$, we do not reject H0.

The P-value (the probability associated to the $\chi^2_{cal}$ statistic) can also be obtained from Table D. In this case, we reject H0 if $P \leq \alpha$.

Example 9.4: Applying Bartlett's χ² Test

A chain of supermarkets wishes to study the number of customers it serves every day in order to make strategic operational decisions. Table 9.E.8 shows the data of three stores over two weeks. Check if the variances between the groups are homogeneous. Consider α = 5%.

TABLE 9.E.8 Number of Customers Served Per Day and Per Store

                      Store 1      Store 2      Store 3
Day 1                 620          710          924
Day 2                 630          780          695
Day 3                 610          810          854
Day 4                 650          755          802
Day 5                 585          699          931
Day 6                 590          680          924
Day 7                 630          710          847
Day 8                 644          850          800
Day 9                 595          844          769
Day 10                603          730          863
Day 11                570          645          901
Day 12                605          688          888
Day 13                622          718          757
Day 14                578          702          712
Standard deviation    24.4059      62.2466      78.9144
Variance              595.6484     3874.6429    6227.4780


Solution

If we apply the Kolmogorov-Smirnov or the Shapiro-Wilk test for normality to the data in Table 9.E.8, we will verify that their distribution shows adherence to normality, with a 5% significance level, so Bartlett's χ² test can be applied to compare the homogeneity of the variances between the groups.

Step 1: Since the main goal is to compare the equality of the variances between the groups, we can use Bartlett's χ² test.

Step 2: Bartlett's χ² test hypotheses for this example are:
H0: the population variances of all three groups are homogeneous
H1: the population variance of at least one group is different from the others

Step 3: The significance level to be considered is 5%.

Step 4: The complete calculation of Bartlett's χ² statistic is shown. First, we calculate the value of $S_p^2$, according to Expression (9.14):

$$S_p^2 = \frac{13 \cdot (595.65 + 3874.64 + 6227.48)}{42 - 3} = 3565.92$$

Thus, we can calculate q through Expression (9.13):

$$q = 39 \cdot \ln(3565.92) - 13 \cdot \left[\ln(595.65) + \ln(3874.64) + \ln(6227.48)\right] = 14.936$$

The correction factor c for the q statistic is calculated from Expression (9.15):

$$c = 1 + \frac{1}{3 \cdot (3 - 1)} \cdot \left( 3 \cdot \frac{1}{13} - \frac{1}{42 - 3} \right) = 1 + \frac{1}{6} \cdot \frac{8}{39} = 1.0342$$

Finally, we calculate $B_{cal}$:

$$B_{cal} = \frac{q}{c} = \frac{14.936}{1.0342} = 14.44$$

Step 5: According to Table D in the Appendix, for ν = 3 − 1 = 2 degrees of freedom and α = 5%, the critical value of Bartlett's χ² test is $\chi^2_c = 5.991$.

Step 6: Decision: since the value calculated lies in the critical region ($B_{cal} > \chi^2_c$), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of at least one group is different from the others.

If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:

Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated to $\chi^2_{cal} = 14.44$ is less than 0.005 (a probability of 0.005 is associated to $\chi^2_{cal} = 10.597$).

Step 6: Decision: since P < 0.05, we reject H0.
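The calculation in Example 9.4 can be checked with a short Python sketch that evaluates Expressions (9.13) to (9.16) directly on the observations of Table 9.E.8. The function name `bartlett_statistic` and the dictionary layout are choices made for this illustration only; the critical value 5.991 is the Table D figure quoted in the example, not something the code derives.

```python
import math
from statistics import variance

# Daily customers served per store, copied from Table 9.E.8.
stores = {
    "Store 1": [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578],
    "Store 2": [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702],
    "Store 3": [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712],
}

def bartlett_statistic(groups):
    """Return (q, c, B_cal) following Expressions (9.13) to (9.16)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances S_i^2
    # Pooled variance, Expression (9.14).
    pooled = sum((n - 1) * s2 for n, s2 in zip(sizes, variances)) / (n_total - k)
    # q statistic, Expression (9.13).
    q = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(s2) for n, s2 in zip(sizes, variances)
    )
    # Correction factor, Expression (9.15).
    c = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    return q, c, q / c

q, c, b_cal = bartlett_statistic(list(stores.values()))
CHI2_CRITICAL = 5.991  # Table D value quoted in the text (2 d.f., alpha = 5%)

print(f"q = {q:.3f}, c = {c:.4f}, B_cal = {b_cal:.3f}")
print("reject H0" if b_cal > CHI2_CRITICAL else "do not reject H0")
```

Evaluated this way, q comes out at roughly 14.94, c at roughly 1.034, and the corrected statistic at roughly 14.44, well above the critical value, so the decision to reject H0 is unchanged.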

9.4.2 Cochran's C Test

Cochran's C test (1947a,b) compares the group with the highest variance in relation to the others. The test requires the data to have a normal distribution. Cochran's C statistic is given by:

$$C_{cal} = \frac{S_{max}^2}{\sum_{i=1}^{k} S_i^2} \tag{9.17}$$

where:

$S_{max}^2$ is the highest sample variance;
$S_i^2$ is the variance in sample i, i = 1, …, k.

According to Expression (9.17), if all the variances are equal, the value of the $C_{cal}$ statistic is 1/k. The higher the difference of $S_{max}^2$ in relation to the other variances, the closer the value of $C_{cal}$ gets to 1. To confirm whether the null hypothesis will be rejected or not, the value calculated must be compared to the critical value of Cochran's statistic ($C_c$), which is available in Table M in the Appendix.


The values of $C_c$ vary depending on the number of groups (k), the number of degrees of freedom $\nu = \max(n_i - 1)$, and the value of α. Table M provides the critical values of $C_c$ considering that $P(C_{cal} > C_c) = \alpha$ (for a right-tailed test). Thus, we reject H0 if $C_{cal} > C_c$. Otherwise, we do not reject H0.

Example 9.5: Applying Cochran's C Test

Use Cochran's C test for the data in Example 9.4. The main objective here is to compare the group with the highest variability in relation to the others.

Solution

Step 1: Since the objective is to compare the group with the highest variance (group 3; see Table 9.E.8) in relation to the others, Cochran's C test is the most recommended.

Step 2: Cochran's C test hypotheses for this example are:
H0: the population variance of group 3 is equal to the others
H1: the population variance of group 3 is different from the others

Step 3: The significance level to be considered is 5%.

Step 4: From Table 9.E.8, we can see that $S_{max}^2 = 6227.48$. Therefore, the calculation of Cochran's C statistic is given by:

$$C_{cal} = \frac{S_{max}^2}{\sum_{i=1}^{k} S_i^2} = \frac{6227.48}{595.65 + 3874.64 + 6227.48} = 0.582$$

Step 5: According to Table M in the Appendix, for k = 3, ν = 13, and α = 5%, the critical value of Cochran's C statistic is $C_c = 0.575$.

Step 6: Decision: since the value calculated lies in the critical region ($C_{cal} > C_c$), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of group 3 is different from the others.
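As a quick check of Step 4, Expression (9.17) can be evaluated in Python from the variances reported at the bottom of Table 9.E.8. The helper name `cochran_c` is illustrative, and the critical value is simply the Table M figure quoted above.

```python
# Sample variances per store, as reported in Table 9.E.8.
variances = {"Store 1": 595.6484, "Store 2": 3874.6429, "Store 3": 6227.4780}

def cochran_c(sample_variances):
    """Cochran's C: largest sample variance over the sum of all variances."""
    values = list(sample_variances)
    return max(values) / sum(values)

c_cal = cochran_c(variances.values())
C_CRITICAL = 0.575  # Table M value quoted in the text (k = 3, nu = 13, alpha = 5%)

print(f"C_cal = {c_cal:.3f}")
print("reject H0" if c_cal > C_CRITICAL else "do not reject H0")
```

With three equal variances the function would return 1/3, which matches the lower bound 1/k mentioned for the statistic.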

9.4.3 Hartley's Fmax Test

The statistic of Hartley's Fmax test (1950) is the ratio between the variance of the group with the highest variance ($S_{max}^2$) and the variance of the group with the lowest variance ($S_{min}^2$):

$$F_{max,cal} = \frac{S_{max}^2}{S_{min}^2} \tag{9.18}$$

The test assumes that the number of observations per group is equal ($n_1 = n_2 = \dots = n_k = n$). If all the variances are equal, the value of Fmax will be 1. The higher the difference between $S_{max}^2$ and $S_{min}^2$, the higher the value of Fmax. To confirm whether the null hypothesis of variance homogeneity will be rejected or not, the value calculated must be compared to the statistic's critical value ($F_{max,c}$), which is available in Table N in the Appendix. The critical values vary depending on the number of groups (k), the number of degrees of freedom ν = n − 1, and the value of α, and this table provides the critical values of $F_{max,c}$ considering that $P(F_{max,cal} > F_{max,c}) = \alpha$ (for a right-tailed test). Therefore, we reject the null hypothesis H0 of variance homogeneity if $F_{max,cal} > F_{max,c}$. Otherwise, we do not reject H0.

The P-value (the probability associated to the $F_{max,cal}$ statistic) can also be obtained from Table N in the Appendix. In this case, we reject H0 if $P \leq \alpha$.

Example 9.6: Applying Hartley's Fmax Test

Use Hartley's Fmax test for the data in Example 9.4. The goal here is to compare the group with the highest variability to the group with the lowest variability.

Solution

Step 1: Since the main objective is to compare the group with the highest variance (group 3; see Table 9.E.8) to the group with the lowest variance (group 1), Hartley's Fmax test is the most recommended.

Step 2: Hartley's Fmax test hypotheses for this example are:
H0: the population variance of group 3 is the same as group 1


H1: the population variance of group 3 is different from group 1

Step 3: The significance level to be considered is 5%.

Step 4: From Table 9.E.8, we can see that $S_{min}^2 = 595.65$ and $S_{max}^2 = 6227.48$. Therefore, the calculation of Hartley's Fmax statistic is given by:

$$F_{max,cal} = \frac{S_{max}^2}{S_{min}^2} = \frac{6227.48}{595.65} = 10.45$$

Step 5: According to Table N in the Appendix, for k = 3, ν = 13, and α = 5%, the critical value of the test is $F_{max,c} = 3.953$.

Step 6: Decision: since the value calculated lies in the critical region ($F_{max,cal} > F_{max,c}$), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of group 3 is different from the population variance of group 1.

If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:

Step 5: According to Table N in the Appendix, the probability associated to $F_{max,cal} = 10.45$, for k = 3 and ν = 13, is less than 0.01.

Step 6: Decision: since P < 0.05, we reject H0.
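The ratio in Expression (9.18) is equally simple to reproduce. The sketch below again starts from the variances in Table 9.E.8; the function name `hartley_fmax` is an illustrative choice, and the critical value is the Table N figure quoted in the example.

```python
# Sample variances per store, as reported in Table 9.E.8.
variances = [595.6484, 3874.6429, 6227.4780]

def hartley_fmax(sample_variances):
    """Hartley's Fmax: largest sample variance over the smallest one.

    The test itself assumes the same number of observations per group.
    """
    return max(sample_variances) / min(sample_variances)

f_max_cal = hartley_fmax(variances)
F_MAX_CRITICAL = 3.953  # Table N value quoted in the text (k = 3, nu = 13, alpha = 5%)

print(f"Fmax_cal = {f_max_cal:.3f}")
print("reject H0" if f_max_cal > F_MAX_CRITICAL else "do not reject H0")
```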

9.4.4 Levene's F-Test

The advantage of Levene's F-test in relation to other homogeneity of variance tests is that it is less sensitive to deviations from normality, which makes it a more robust test. Levene's statistic is given by Expression (9.19) and, under H0, it approximately follows an F-distribution with $\nu_1 = k - 1$ and $\nu_2 = N - k$ degrees of freedom, for a significance level α:

$$F_{cal} = \frac{(N - k)}{(k - 1)} \cdot \frac{\sum_{i=1}^{k} n_i \cdot \left(\bar{Z}_i - \bar{Z}\right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Z_{ij} - \bar{Z}_i\right)^2} \overset{H_0}{\sim} F_{k-1,\,N-k,\,\alpha} \tag{9.19}$$

where:

$n_i$ is the dimension of each one of the k samples (i = 1, …, k);
N is the dimension of the global sample ($N = n_1 + n_2 + \dots + n_k$);
$Z_{ij} = |X_{ij} - \bar{X}_i|$, i = 1, …, k and j = 1, …, $n_i$;
$X_{ij}$ is observation j in sample i;
$\bar{X}_i$ is the mean of sample i;
$\bar{Z}_i$ is the mean of $Z_{ij}$ in sample i;
$\bar{Z}$ is the mean of $Z_{ij}$ in the global sample.

An extension of Levene's test can be found in Brown and Forsythe (1974).

From the F-distribution table (Table A in the Appendix), we can determine the critical values of Levene's statistic ($F_c = F_{k-1,\,N-k,\,\alpha}$). Table A provides the critical values of $F_c$ considering that $P(F_{cal} > F_c) = \alpha$ (right-tailed table). In order for the null hypothesis H0 to be rejected, the value of the statistic must be in the critical region, that is, $F_{cal} > F_c$. If $F_{cal} \leq F_c$, we do not reject H0.

The P-value (the probability associated to the $F_{cal}$ statistic) can also be obtained from Table A. In this case, we reject H0 if $P \leq \alpha$.

Example 9.7: Applying Levene's Test

Apply Levene's test to the data in Example 9.4.

Solution

Step 1: Levene's test can be applied to check variance homogeneity between the groups, and it is more robust than the other tests.


Step 2: Levene's test hypotheses for this example are:
H0: the population variances of all three groups are homogeneous
H1: the population variance of at least one group is different from the others

Step 3: The significance level to be considered is 5%.

Step 4: The calculation of the $F_{cal}$ statistic, according to Expression (9.19), is shown in Table 9.E.9.

TABLE 9.E.9 Calculating the Fcal Statistic

Group 1: $\bar{X}_1 = 609.429$, $\bar{Z}_1 = 20$

i    X1j    Z1j = |X1j − X̄1|    Z1j − Z̄1    (Z1j − Z̄1)²
1    620    10.571               −9.429       88.898
1    630    20.571               0.571        0.327
1    610    0.571                −19.429      377.469
1    650    40.571               20.571       423.184
1    585    24.429               4.429        19.612
1    590    19.429               −0.571       0.327
1    630    20.571               0.571        0.327
1    644    34.571               14.571       212.327
1    595    14.429               −5.571       31.041
1    603    6.429                −13.571      184.184
1    570    39.429               19.429       377.469
1    605    4.429                −15.571      242.469
1    622    12.571               −7.429       55.184
1    578    31.429               11.429       130.612
                                              Sum = 2143.429

Group 2: $\bar{X}_2 = 737.214$, $\bar{Z}_2 = 50.418$

i    X2j    Z2j = |X2j − X̄2|    Z2j − Z̄2    (Z2j − Z̄2)²
2    710    27.214               −23.204      538.429
2    780    42.786               −7.633       58.257
2    810    72.786               22.367       500.298
2    755    17.786               −32.633      1064.890
2    699    38.214               −12.204      148.940
2    680    57.214               6.796        46.185
2    710    27.214               −23.204      538.429
2    850    112.786              62.367       3889.686
2    844    106.786              56.367       3177.278
2    730    7.214                −43.204      1866.593
2    645    92.214               41.796       1746.899
2    688    49.214               −1.204       1.450
2    718    19.214               −31.204      973.695
2    702    35.214               −15.204      231.164
                                              Sum = 14,782.192

Group 3: $\bar{X}_3 = 833.36$, $\bar{Z}_3 = 66.449$

i    X3j    Z3j = |X3j − X̄3|    Z3j − Z̄3    (Z3j − Z̄3)²
3    924    90.643               24.194       585.344
3    695    138.357              71.908       5170.784
3    854    20.643               −45.806      2098.201
3    802    31.357               −35.092      1231.437
3    931    97.643               31.194       973.058
3    924    90.643               24.194       585.344
3    847    13.643               −52.806      2788.487
3    800    33.357               −33.092      1095.070
3    769    64.357               −2.092       4.376
3    863    29.643               −36.806      1354.691
3    901    67.643               1.194        1.425
3    888    54.643               −11.806      139.385
3    757    76.357               9.908        98.172
3    712    121.357              54.908       3014.906
                                              Sum = 19,140.678

Therefore, the calculation of $F_{cal}$ is carried out as follows:

$$F_{cal} = \frac{(42 - 3)}{(3 - 1)} \cdot \frac{14 \cdot (20 - 45.62)^2 + 14 \cdot (50.418 - 45.62)^2 + 14 \cdot (66.449 - 45.62)^2}{2143.429 + 14{,}782.192 + 19{,}140.678} = 8.427$$

Step 5: According to Table A in the Appendix, for $\nu_1 = 2$, $\nu_2 = 39$, and α = 5%, the critical value of the test is $F_c = 3.24$.

Step 6: Decision: since the value calculated lies in the critical region ($F_{cal} > F_c$), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of at least one group is different from the others.

If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:

Step 5: According to Table A in the Appendix, for $\nu_1 = 2$ and $\nu_2 = 39$, the probability associated to $F_{cal} = 8.427$ is less than 0.01 (P-value).

Step 6: Decision: since P < 0.05, we reject H0.
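The construction of Table 9.E.9 and the final ratio can be reproduced with a short Python sketch of Expression (9.19), using absolute deviations from each group mean. The function name `levene_f` is an assumption made for this illustration, and the critical value 3.24 is the Table A figure quoted in Step 5.

```python
from statistics import mean

# Daily customers served per store, copied from Table 9.E.8.
stores = [
    [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578],
    [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702],
    [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712],
]

def levene_f(groups):
    """Levene's F statistic, Expression (9.19), based on group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |X_ij - mean of group i|
    z = []
    for g in groups:
        group_mean = mean(g)
        z.append([abs(x - group_mean) for x in g])
    z_bar_i = [mean(zi) for zi in z]              # mean of Z_ij in each group
    z_bar = mean([v for zi in z for v in zi])     # mean of Z_ij over all groups
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bar_i) for v in zi)
    return (n_total - k) / (k - 1) * between / within

f_cal = levene_f(stores)
F_CRITICAL = 3.24  # Table A value quoted in the text (2 and 39 d.f., alpha = 5%)

print(f"F_cal = {f_cal:.3f}")
print("reject H0" if f_cal > F_CRITICAL else "do not reject H0")
```

Replacing the group mean with the group median inside the loop would give the variant discussed by Brown and Forsythe (1974); that change is not made here so that the output stays comparable with Table 9.E.9.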

9.4.5 Solving Levene's Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©.

To test the variance homogeneity between the groups, SPSS uses Levene's test. The data presented in Example 9.4 are available in the file CustomerServices_Store.sav. In order to run the test, we must click on Analyze → Descriptive Statistics → Explore …, as shown in Fig. 9.13. Let's include the variable Customer_services in the list of dependent variables (Dependent List) and the variable Store in the factor list (Factor List), as shown in Fig. 9.14. Next, we must click on Plots … and select the option Untransformed in Spread vs Level with Levene Test, as shown in Fig. 9.15. Finally, let's click on Continue and on OK.

The result of Levene's test can also be obtained through the ANOVA test, by clicking on Analyze → Compare Means → One-Way ANOVA …. In Options …, we must select the option Homogeneity of variance test (Fig. 9.16).


FIG. 9.13 Procedure for elaborating Levene’s test on SPSS.

FIG. 9.14 Selecting the variables to elaborate Levene’s test on SPSS.


FIG. 9.15 Continuation of the procedure to elaborate Levene’s test on SPSS.

FIG. 9.16 Results of Levene’s test for Example 9.4 on SPSS.

The value of Levene’s statistic is 8.427, exactly the same as the one calculated previously. Since the significance level observed is 0.001, a value lower than 0.05, the test shows the rejection of the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population variances are not homogeneous.

9.4.6 Solving Levene's Test by Using the Stata Software

The use of the images in this section has been authorized by StataCorp LP©.

Levene's statistical test for equality of variances is calculated on Stata by using the command robvar (robust test for equality of variances), which has the following syntax:

    robvar variable*, by(groups*)

in which the term variable* should be replaced by the quantitative variable studied and the term groups* by the categorical variable that identifies the groups. Let's open the file CustomerServices_Store.dta, which contains the data of Example 9.7. The three groups are represented by the variable store and the number of customers served by the variable services. Therefore, the command to be typed is:

    robvar services, by(store)

The result of the test can be seen in Fig. 9.17. We can verify that the value of the statistic (8.427) is similar to the one calculated in Example 9.7 and to the one generated on SPSS, as is the probability associated to the statistic (0.001). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the variances are not homogeneous.

FIG. 9.17 Results of Levene's test for Example 9.7 on Stata.

9.5 HYPOTHESES TESTS REGARDING A POPULATION MEAN (μ) FROM ONE RANDOM SAMPLE

The main goal is to test if a population mean assumes a certain value or not.

9.5.1 Z Test When the Population Standard Deviation (σ) Is Known and the Distribution Is Normal

This test is applied when a random sample of size n is obtained from a population with a normal distribution, whose mean (μ) is unknown and whose standard deviation (σ) is known. If the distribution of the population is not known, it is necessary to work with large samples (n > 30), because the central limit theorem guarantees that, as the sample size grows, the sampling distribution of the mean gets closer and closer to a normal distribution.

For a bilateral test, the hypotheses are:
H0: the sample comes from a population with a certain mean ($\mu = \mu_0$)
H1: the sample does not come from a population with that mean ($\mu \neq \mu_0$)

The test statistic is based on the sample mean ($\bar{X}$). In order for the sample mean to be compared to the value in the table, it must be standardized, so:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma_{\bar{X}}} \sim N(0, 1), \quad \text{where } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \tag{9.20}$$

The critical values of the $z_c$ statistic are shown in Table E in the Appendix. This table provides the critical values of $z_c$ considering that $P(Z_{cal} > z_c) = \alpha$ (for a right-tailed test). For a bilateral test, we must consider $P(Z_{cal} > z_c) = \alpha/2$, since $P(Z_{cal} < -z_c) + P(Z_{cal} > z_c) = \alpha$. The null hypothesis H0 of a bilateral test is rejected if the value of the $Z_{cal}$ statistic lies in the critical region, that is, if $Z_{cal} < -z_c$ or $Z_{cal} > z_c$. Otherwise, we do not reject H0.

The unilateral probabilities associated to the $Z_{cal}$ statistic ($P_1$) can also be obtained from Table E. For a unilateral test, we consider that $P = P_1$. For a bilateral test, this probability must be doubled ($P = 2P_1$). Therefore, for both tests, we reject H0 if $P \leq \alpha$.

Example 9.8: Applying the z Test to One Sample

A cereal manufacturer states that the average quantity of food fiber in each portion of its product is at least 4.2 g, with a standard deviation of 1 g. A health care agency wishes to verify if this statement is true and collects a random sample of 42 portions, in which the average quantity of food fiber is 3.9 g. With a significance level equal to 5%, is there evidence to reject the manufacturer's statement?


Solution

Step 1: The suitable test for a population mean with a known σ, considering a single sample of size n > 30 (normal distribution), is the z test.

Step 2: For this example, the z test hypotheses are:
H0: μ ≥ 4.2 g (information provided by the supplier)
H1: μ < 4.2 g
which corresponds to a left-tailed test.

Step 3: The significance level to be considered is 5%.

Step 4: The calculation of the $Z_{cal}$ statistic, according to Expression (9.20), is:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{3.9 - 4.2}{1/\sqrt{42}} = -1.94$$

Step 5: According to Table E in the Appendix, for a left-tailed test with α = 5%, the critical value of the test is $z_c = -1.645$.

Step 6: Decision: since the value calculated lies in the critical region ($z_{cal} < -1.645$), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the manufacturer's average quantity of food fiber is less than 4.2 g.

If, instead of comparing the value calculated to the critical value of the standard normal distribution, we use the calculation of the P-value, Steps 5 and 6 will be:

Step 5: According to Table E in the Appendix, for a left-tailed test, the probability associated to $z_{cal} = -1.94$ is 0.0262 (P-value).

Step 6: Decision: since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the manufacturer's average quantity of food fiber is less than 4.2 g.
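The steps of Example 9.8 can be reproduced with Python's standard library, which provides the standard normal distribution through `statistics.NormalDist`. The function name `z_statistic` is an illustrative choice. Because the code keeps the unrounded statistic (about −1.944) rather than −1.94, the left-tail probability it returns is slightly below the tabulated 0.0262.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Standardized sample mean, Expression (9.20)."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

# Values from Example 9.8.
z_cal = z_statistic(sample_mean=3.9, mu0=4.2, sigma=1.0, n=42)

alpha = 0.05
standard_normal = NormalDist()            # mean 0, standard deviation 1
z_crit = standard_normal.inv_cdf(alpha)   # left-tailed critical value
p_value = standard_normal.cdf(z_cal)      # P(Z <= z_cal) for a left-tailed test

print(f"z_cal = {z_cal:.3f}, z_c = {z_crit:.3f}, P = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

For a bilateral test, the same unilateral probability would be doubled before being compared with α, as described above.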

9.5.2 Student's t-Test When the Population Standard Deviation (σ) Is Not Known

Student's t-test for one sample is applied when we do not know the population standard deviation (σ), so its value is estimated from the sample standard deviation (S). However, when σ is replaced by S in Expression (9.20), the distribution of the statistic is no longer normal; it becomes a Student's t-distribution with n − 1 degrees of freedom.

Analogous to the z test, Student's t-test for one sample assumes the following hypotheses for a bilateral test:
H0: $\mu = \mu_0$
H1: $\mu \neq \mu_0$

And the calculation of the statistic becomes:

$$T_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \tag{9.21}$$

The value calculated must be compared to the value in Student's t-distribution table (Table B in the Appendix). This table provides the critical values of $t_c$ considering that $P(T_{cal} > t_c) = \alpha$ (for a right-tailed test). For a bilateral test, we have $P(T_{cal} < -t_c) = \alpha/2 = P(T_{cal} > t_c)$, as shown in Fig. 9.18. Therefore, for a bilateral test, the null hypothesis is rejected if $T_{cal} < -t_c$ or $T_{cal} > t_c$. If $-t_c \leq T_{cal} \leq t_c$, we do not reject H0.

FIG. 9.18 Nonrejection region (NR) and critical region (CR) of Student's t-distribution for a bilateral test.


The unilateral probabilities associated to the $T_{cal}$ statistic ($P_1$) can also be obtained from Table B. For a unilateral test, we have $P = P_1$. For a bilateral test, this probability must be doubled ($P = 2P_1$). Therefore, for both tests, we reject H0 if $P \leq \alpha$.

Example 9.9: Applying Student's t-Test to One Sample

The average processing time of a task using a certain machine has been 18 min. New concepts have been implemented in order to reduce the average processing time. Hence, after a certain period of time, a sample with 25 elements was collected, and an average time of 16.808 min was measured, with a standard deviation of 2.733 min. Check whether this result represents an improvement in the average processing time. Consider α = 1%.

Solution

Step 1: The suitable test for a population mean with an unknown σ is Student's t-test.

Step 2: For this example, Student's t-test hypotheses are:
H0: μ = 18
H1: μ < 18
which corresponds to a left-tailed test.

Step 3: The significance level to be considered is 1%.

Step 4: The calculation of the $T_{cal}$ statistic, according to Expression (9.21), is:

$$T_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{16.808 - 18}{2.733/\sqrt{25}} = -2.18$$

Step 5: According to Table B in the Appendix, for a left-tailed test with 24 degrees of freedom and α = 1%, the critical value of the test is $t_c = -2.492$.

Step 6: Decision: since the value calculated is not in the critical region ($T_{cal} > -2.492$), the null hypothesis is not rejected, which allows us to conclude, with a 99% confidence level, that there was no improvement in the average processing time.

If, instead of comparing the value calculated to the critical value of Student's t-distribution, we use the calculation of the P-value, Steps 5 and 6 will be:

Step 5: According to Table B in the Appendix, for a left-tailed test with 24 degrees of freedom, the probability associated to $T_{cal} = -2.18$ is between 0.01 and 0.025 (P-value).

Step 6: Decision: since P > 0.01, we do not reject the null hypothesis.
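Example 9.9 can also be checked in Python. The standard library has no Student's t-distribution, so the sketch below integrates the t density numerically with Simpson's rule to approximate the left-tail probability; this is a convenience chosen for the illustration, not the method used by SPSS or Stata, and the function names `t_pdf`, `t_cdf`, and `t_statistic` are assumptions.

```python
import math

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic, Expression (9.21)."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of t."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

# Values from Example 9.9.
n = 25
t_cal = t_statistic(sample_mean=16.808, mu0=18, s=2.733, n=n)
p_left = t_cdf(t_cal, df=n - 1)        # unilateral (left-tailed) probability

alpha = 0.01
print(f"T_cal = {t_cal:.3f}, P = {p_left:.4f}")
print("reject H0" if p_left <= alpha else "do not reject H0")
```

The approximated probability is about 0.02, inside the interval from 0.01 to 0.025 read from Table B, so H0 is not rejected at α = 1%.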

9.5.3 Solving Student's t-Test for a Single Sample by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©.

If we wish to compare means from a single sample, SPSS makes Student's t-test available. The data in Example 9.9 are available in the file T_test_One_Sample.sav. The procedure to apply the test from Example 9.9 will be described. Initially, let's select Analyze → Compare Means → One-Sample T Test …, as shown in Fig. 9.19. We must select the variable Time and specify the value 18 that will be tested in Test Value, as shown in Fig. 9.20. Now, we must click on Options … to define the desired confidence level (Fig. 9.21). Finally, let's click on Continue and on OK. The results of the test are shown in Fig. 9.22.

This figure shows the result of the t-test (similar to the value calculated in Example 9.9) and the associated probability (P-value) for a bilateral test. For a unilateral test, the associated probability is 0.0195 (we saw in Example 9.9 that this probability would be between 0.01 and 0.025). Since 0.0195 > 0.01, we do not reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there was no improvement in the average processing time.

9.5.4 Solving Student's t-Test for a Single Sample by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©.

Student's t-test is run on Stata by using the command ttest. For one population mean, the test syntax is:

    ttest variable* == #

where the term variable* should be replaced by the name of the variable considered in the analysis and # by the value of the population mean to be tested. The data in Example 9.9 are available in the file T_test_One_Sample.dta. In this case, the variable being analyzed is called time and the goal is to verify if the average processing time is still 18 min, so the command to be typed is:

    ttest time == 18

FIG. 9.19 Procedure for elaborating the t-test from one sample on SPSS.

FIG. 9.20 Selecting the variable and specifying the value to be tested.

The result of the test can be seen in Fig. 9.23. We can see that the calculated value of the statistic (−2.180) is similar to the one calculated in Example 9.9 and to the one generated on SPSS, as is the associated probability for a left-tailed test (0.0196). Since P > 0.01, we do not reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there was no improvement in the processing time.


FIG. 9.21 Options—defining the confidence level.

FIG. 9.22 Results of the t-test for one sample for Example 9.9 on SPSS.

FIG. 9.23 Results of the t-test for one sample for Example 9.9 on Stata.

9.6 STUDENT’S T-TEST TO COMPARE TWO POPULATION MEANS FROM TWO INDEPENDENT RANDOM SAMPLES The t-test for two independent samples is applied to compare the means of two random samples (X1i, i ¼ 1, …, n1; X2j, j ¼ 1, …, n2) obtained from the same population. In this test, the population variance is unknown. For a bilateral test, the null hypothesis of the test states that the population means are the same. If the population means are different, the null hypothesis is rejected, so: H0: m1 ¼ m2 H1: m1 6¼ m2 The calculation of the T statistic depends on the comparison of the population variances between the groups.

FIG. 9.24 Nonrejection region (NR) and critical region (CR) of Student’s t-distribution for a bilateral test.

Case 1: $\sigma_1^2 \neq \sigma_2^2$

Considering that the population variances are different, the calculation of the T statistic is given by:

$$T_{cal} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \tag{9.22}$$

with the following degrees of freedom:

$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2 - 1}} \tag{9.23}$$

Case 2: $\sigma_1^2 = \sigma_2^2$

When the population variances are homogeneous, to calculate the T statistic, the researcher has to use:

$$T_{cal} = \frac{\bar{X}_1 - \bar{X}_2}{S_p \cdot \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \tag{9.24}$$

where:

$$S_p = \sqrt{\frac{(n_1 - 1) \cdot S_1^2 + (n_2 - 1) \cdot S_2^2}{n_1 + n_2 - 2}} \tag{9.25}$$

and $T_{cal}$ follows Student’s t-distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom.

The value calculated must be compared to the value in Student’s t-distribution table (Table B in the Appendix). This table provides the critical values of $t_c$ considering that $P(T_{cal} > t_c) = \alpha$ (for a right-tailed test). For a bilateral test, we have $P(T_{cal} < -t_c) = \alpha/2 = P(T_{cal} > t_c)$, as shown in Fig. 9.24. Therefore, for a bilateral test, if the value of the statistic lies in the critical region, that is, if $T_{cal} < -t_c$ or $T_{cal} > t_c$, the test allows us to reject the null hypothesis. On the other hand, if $-t_c \leq T_{cal} \leq t_c$, we do not reject $H_0$.

The unilateral probabilities associated with the $T_{cal}$ statistic ($P_1$) can also be obtained from Table B. For a unilateral test, we have $P = P_1$. For a bilateral test, this probability must be doubled ($P = 2P_1$). Therefore, for both tests, we reject $H_0$ if $P \leq \alpha$.

Example 9.10: Applying Student’s t-Test to Two Independent Samples
A quality engineer believes that the average time to manufacture a certain plastic product may depend on the raw materials used, which come from two different suppliers. A sample with 30 observations from each supplier is collected for a test and the results are shown in Tables 9.E.10 and 9.E.11. For a significance level $\alpha = 5\%$, check if there is any difference between the means.

Solution
Step 1: The suitable test to compare two population means with an unknown $\sigma$ is Student’s t-test for two independent samples.
Step 2: For this example, Student’s t-test hypotheses are:
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
Step 3: The significance level to be considered is 5%.


TABLE 9.E.10 Manufacturing Time Using Raw Materials From Supplier 1

22.8  23.4  26.2  24.3  22.0  24.8  26.7  25.1  23.1  22.8
25.6  25.1  24.3  24.2  22.8  23.2  24.7  26.5  24.5  23.6
23.9  22.8  25.4  26.7  22.9  23.5  23.8  24.6  26.3  22.7

TABLE 9.E.11 Manufacturing Time Using Raw Materials From Supplier 2

26.8  29.3  28.4  25.6  29.4  27.2  27.6  26.8  25.4  28.6
29.7  27.2  27.9  28.4  26.0  26.8  27.5  28.5  27.3  29.1
29.2  25.7  28.4  28.6  27.9  27.4  26.7  26.8  25.6  26.1

Step 4: For the data in Tables 9.E.10 and 9.E.11, we calculate $\bar{X}_1 = 24.277$, $\bar{X}_2 = 27.530$, $S_1^2 = 1.810$, and $S_2^2 = 1.559$. Considering that the population variances are homogeneous, according to the solution generated on SPSS, let’s use Expressions (9.24) and (9.25) to calculate the $T_{cal}$ statistic, as follows:

$$S_p = \sqrt{\frac{29 \cdot 1.810 + 29 \cdot 1.559}{30 + 30 - 2}} = 1.298$$

$$T_{cal} = \frac{24.277 - 27.530}{1.298 \cdot \sqrt{\dfrac{1}{30} + \dfrac{1}{30}}} = -9.708$$

with $\nu = 30 + 30 - 2 = 58$ degrees of freedom.

Step 5: The critical region of the bilateral test, considering $\nu = 58$ degrees of freedom and $\alpha = 5\%$, can be defined from Student’s t-distribution table (Table B in the Appendix), as shown in Fig. 9.25. For a bilateral test, each one of the tails corresponds to half of significance level $\alpha$.

FIG. 9.25 Critical region of Example 9.10.

Step 6: Decision: since the value calculated lies in the critical region, that is, $T_{cal} < -2.002$, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population means are different.

If, instead of comparing the value calculated to the critical value of Student’s t-distribution, we use the calculation of P-value, Steps 5 and 6 will be:

Step 5: According to Table B in the Appendix, for a right-tailed test with $\nu = 58$ degrees of freedom, probability $P_1$ associated with $|T_{cal}| = 9.708$ is less than 0.0005. For a bilateral test, this probability must be doubled ($P = 2P_1$).
Step 6: Decision: since $P < 0.05$, the null hypothesis is rejected.
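The arithmetic of Example 9.10 can also be checked outside SPSS or Stata. The following is a minimal Python sketch, not part of the original solution, that uses only the standard library. The names pooled_t, welch_t, and T_CRITICAL are illustrative choices, and the critical value 2.002 is copied from Step 6 rather than computed. The sketch evaluates Expressions (9.24) and (9.25) for the homogeneous-variance case and, for comparison, Expressions (9.22) and (9.23) for the case of different variances.

```python
import math
import statistics

# Manufacturing times from Tables 9.E.10 and 9.E.11 (30 observations each).
supplier_1 = [22.8, 23.4, 26.2, 24.3, 22.0, 24.8, 26.7, 25.1, 23.1, 22.8,
              25.6, 25.1, 24.3, 24.2, 22.8, 23.2, 24.7, 26.5, 24.5, 23.6,
              23.9, 22.8, 25.4, 26.7, 22.9, 23.5, 23.8, 24.6, 26.3, 22.7]
supplier_2 = [26.8, 29.3, 28.4, 25.6, 29.4, 27.2, 27.6, 26.8, 25.4, 28.6,
              29.7, 27.2, 27.9, 28.4, 26.0, 26.8, 27.5, 28.5, 27.3, 29.1,
              29.2, 25.7, 28.4, 28.6, 27.9, 27.4, 26.7, 26.8, 25.6, 26.1]


def pooled_t(x1, x2):
    """T statistic and degrees of freedom for homogeneous variances (9.24, 9.25)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)  # divisor n - 1
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = (statistics.mean(x1) - statistics.mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x1, x2):
    """T statistic and degrees of freedom for different variances (9.22, 9.23)."""
    n1, n2 = len(x1), len(x2)
    a = statistics.variance(x1) / n1
    b = statistics.variance(x2) / n2
    t = (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


t_pooled, df_pooled = pooled_t(supplier_1, supplier_2)
t_welch, df_welch = welch_t(supplier_1, supplier_2)

# Two-sided critical value for 58 degrees of freedom and alpha = 5%,
# taken from the text (Table B), not computed here.
T_CRITICAL = 2.002
reject_h0 = abs(t_pooled) > T_CRITICAL

print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.1f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

With these data the pooled statistic is about -9.708 with 58 degrees of freedom; the sign is negative only because supplier 1 has the smaller mean. Since both samples have 30 observations, the unequal-variance statistic takes the same value, and only its degrees of freedom are slightly lower.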

9.6.1 Solving Student’s t-Test From Two Independent Samples by Using SPSS Software

The data in Example 9.10 are available in the file T_test_Two_Independent_Samples.sav. The procedure for solving Student’s t-test to compare two population means from two independent random samples on SPSS is described below. The use of the images in this section has been authorized by the International Business Machines Corporation©. We must click on Analyze → Compare Means → Independent-Samples T Test …, as shown in Fig. 9.26.


Let’s include the variable Time in Test Variable(s) and the variable Supplier in Grouping Variable. Next, let’s click on Define Groups … to define the groups (categories) of the variable Supplier, as shown in Fig. 9.27. If the confidence level desired by the researcher is different from 95%, the button Options … must be selected to change it. Finally, let’s click on OK.

The results of the test are shown in Fig. 9.28. The value of the t statistic for the test is −9.708 and the associated bilateral probability is 0.000 (P < 0.05), which leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that the population means are different. Fig. 9.28 also shows the result of Levene’s test. Since the significance level observed is 0.694, a value greater than 0.05, we can also conclude, with a 95% confidence level, that the variances are homogeneous.

FIG. 9.26 Procedure for elaborating the t-test from two independent samples on SPSS.

FIG. 9.27 Selecting the variables and defining the groups.


FIG. 9.28 Results of the t-test for two independent samples for Example 9.10 on SPSS.

FIG. 9.29 Results of the t-test for two independent samples for Example 9.10 on Stata.

9.6.2 Solving Student’s t-Test From Two Independent Samples by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©. The t-test to compare the means of two independent groups on Stata is elaborated by using the following syntax:

ttest variable*, by(groups*)

where the term variable* must be replaced by the quantitative variable being analyzed, and the term groups* by the categorical variable that defines the groups. The data in Example 9.10 are available in the file T_test_Two_Independent_Samples.dta. The variable supplier shows the groups of suppliers. The values for each group of suppliers are specified in the variable time. Thus, we must type the following command:

ttest time, by(supplier)

The result of the test can be seen in Fig. 9.29. We can see that the calculated value of the statistic (−9.708) matches the one calculated in Example 9.10 and the one generated on SPSS, as does the associated probability for a bilateral test (0.000). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population means are different.

9.7 STUDENT’S T-TEST TO COMPARE TWO POPULATION MEANS FROM TWO PAIRED RANDOM SAMPLES

This test is applied to check whether the means of two paired or related samples, obtained from the same population (before and after) with a normal distribution, are significantly different or not. Besides the normality of the data of each sample, the test requires the homogeneity of the variances between the groups. Unlike the t-test for two independent samples, first, we must calculate the difference between each pair of values in position i ($d_i = X_{before,i} - X_{after,i}$, $i = 1, \ldots, n$) and, after that, test the null hypothesis that the mean of the differences in the population is zero.


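For a bilateral test, we have:

$H_0: \mu_d = 0$, where $\mu_d = \mu_{before} - \mu_{after}$
$H_1: \mu_d \neq 0$

The $T_{cal}$ statistic for the test is given by:

$$T_{cal} = \frac{\bar{d} - \mu_d}{S_d / \sqrt{n}} \sim t_{\nu = n - 1} \tag{9.26}$$

where:

$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n} \tag{9.27}$$

and

$$S_d = \sqrt{\frac{\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}{n - 1}} \tag{9.28}$$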

The value calculated must be compared to the value in Student’s t-distribution table (Table B in the Appendix). This table provides the critical values of $t_c$ considering that $P(T_{cal} > t_c) = \alpha$ (for a right-tailed test). For a bilateral test, we have $P(T_{cal} < -t_c) = \alpha/2 = P(T_{cal} > t_c)$, as shown in Fig. 9.30.

FIG. 9.30 Nonrejection region (NR) and critical region (CR) of Student’s t-distribution for a bilateral test.

Therefore, for a bilateral test, the null hypothesis is rejected if $T_{cal} < -t_c$ or $T_{cal} > t_c$. If $-t_c \leq T_{cal} \leq t_c$, we do not reject $H_0$.

The unilateral probabilities associated with the $T_{cal}$ statistic ($P_1$) can also be obtained from Table B. For a unilateral test, we have $P = P_1$. For a bilateral test, this probability must be doubled ($P = 2P_1$). Therefore, for both tests, we reject $H_0$ if $P \leq \alpha$.

Example 9.11: Applying Student’s t-Test to Two Paired Samples
A group of 10 machine operators, responsible for carrying out a certain task, is trained to perform the same task more efficiently. To verify if there is a reduction in the time taken to perform the task, we measured the time spent by each operator, before and after the training course. Test the hypothesis that the population means of both paired samples are similar, that is, that there is no reduction in time taken to perform the task after the training course. Consider $\alpha = 5\%$.

TABLE 9.E.12 Time Spent Per Operator Before the Training Course

3.2  3.6  3.4  3.8  3.4  3.5  3.7  3.2  3.5  3.9

TABLE 9.E.13 Time Spent Per Operator After the Training Course

3.0  3.3  3.5  3.6  3.4  3.3  3.4  3.0  3.2  3.6

Solution
Step 1: In this case, the most suitable test is Student’s t-test for two paired samples. Since the test requires the normality of the data in each sample and the homogeneity of the variances between the groups, the K-S or S-W tests, as well as Levene’s test, must be applied for such verification. As we will see in the solution of this example on SPSS, all of these assumptions will be validated.
Step 2: For this example, Student’s t-test hypotheses are:
$H_0: \mu_d = 0$
$H_1: \mu_d \neq 0$
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate the $T_{cal}$ statistic, first, we must calculate $d_i$:

TABLE 9.E.14 Calculating $d_i$

$X_{before,i}$   3.2   3.6   3.4   3.8   3.4   3.5   3.7   3.2   3.5   3.9
$X_{after,i}$    3.0   3.3   3.5   3.6   3.4   3.3   3.4   3.0   3.2   3.6
$d_i$            0.2   0.3  -0.1   0.2   0     0.2   0.3   0.2   0.3   0.3

$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n} = \frac{0.2 + 0.3 + \cdots + 0.3}{10} = 0.19$$

$$S_d = \sqrt{\frac{(0.2 - 0.19)^2 + (0.3 - 0.19)^2 + \cdots + (0.3 - 0.19)^2}{9}} = 0.137$$

$$T_{cal} = \frac{\bar{d}}{S_d / \sqrt{n}} = \frac{0.19}{0.137 / \sqrt{10}} = 4.385$$

Step 5: The critical region of the bilateral test can be defined from Student’s t-distribution table (Table B in the Appendix), considering $\nu = 9$ degrees of freedom and $\alpha = 5\%$, as shown in Fig. 9.31. For a bilateral test, each tail corresponds to half of significance level $\alpha$.

FIG. 9.31 Critical region of Example 9.11.

Step 6: Decision: since the value calculated lies in the critical region ($T_{cal} > 2.262$), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that there is a significant difference between the times spent by the operators before and after the training course.

If, instead of comparing the value calculated to the critical value of Student’s t-distribution, we use the calculation of P-value, Steps 5 and 6 will be:

Step 5: According to Table B in the Appendix, for a right-tailed test with $\nu = 9$ degrees of freedom, the $P_1$ probability associated with $T_{cal} = 4.385$ is between 0.0005 and 0.001. For a bilateral test, this probability must be doubled ($P = 2P_1$), so $0.001 < P < 0.002$.
Step 6: Decision: since $P < 0.05$, the null hypothesis is rejected.
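The calculation in Example 9.11 can be reproduced with a few lines of code. The following is a minimal Python sketch using only the standard library; the function name paired_t and the constant T_CRITICAL are illustrative, and the critical value 2.262 is copied from Step 6 rather than computed. It applies Expressions (9.26) to (9.28) to the before and after times of the 10 operators.

```python
import math
import statistics

# Times per operator before and after the training course (Example 9.11).
before = [3.2, 3.6, 3.4, 3.8, 3.4, 3.5, 3.7, 3.2, 3.5, 3.9]
after = [3.0, 3.3, 3.5, 3.6, 3.4, 3.3, 3.4, 3.0, 3.2, 3.6]


def paired_t(x_before, x_after, mu_d=0.0):
    """Return (T statistic, degrees of freedom, mean difference, S_d)."""
    if len(x_before) != len(x_after):
        raise ValueError("paired samples must have the same length")
    d = [b - a for b, a in zip(x_before, x_after)]  # d_i = before_i - after_i
    n = len(d)
    d_bar = statistics.mean(d)   # Expression (9.27)
    s_d = statistics.stdev(d)    # Expression (9.28), divisor n - 1
    t = (d_bar - mu_d) / (s_d / math.sqrt(n))  # Expression (9.26)
    return t, n - 1, d_bar, s_d


t_cal, df, d_bar, s_d = paired_t(before, after)

# Two-sided critical value for 9 degrees of freedom and alpha = 5%,
# taken from the text (Table B), not computed here.
T_CRITICAL = 2.262
reject_h0 = abs(t_cal) > T_CRITICAL

print(f"mean difference = {d_bar:.2f}, S_d = {s_d:.3f}")
print(f"t = {t_cal:.3f}, df = {df}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Note that the third operator took longer after the course, so the corresponding difference is negative; the mean difference of 0.19 depends on keeping that sign.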

9.7.1 Solving Student’s t-Test From Two Paired Samples by Using SPSS Software

First, we must test the normality of the data in each sample, as well as the variance homogeneity between the groups. Using the same procedures described in Sections 9.3.3 and 9.4.5 (the data must be placed in a table the same way as in Section 9.4.5), we obtain Figs. 9.32 and 9.33. Based on Fig. 9.32, we conclude that there is normality of the data for each sample. From Fig. 9.33, we can conclude that the variances between the samples are homogeneous.


The use of the images in this section has been authorized by the International Business Machines Corporation©. To solve Student’s t-test for two paired samples on SPSS, we must open the file T_test_Two_Paired_Samples.sav. Then, we have to click on Analyze → Compare Means → Paired-Samples T Test …, as shown in Fig. 9.34. We must select the variable Before and move it to Variable1 and the variable After to Variable2, as shown in Fig. 9.35. If the desired confidence level is different from 95%, we must click on Options … to change it. Finally, let’s click on OK.

The results of the test are shown in Fig. 9.36. The value of the t statistic is 4.385 and the significance level observed for a bilateral test is 0.002, a value less than 0.05, which leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that there is a significant difference between the times spent by the operators before and after the training course.

FIG. 9.32 Results of the normality tests on SPSS.

FIG. 9.33 Results of Levene’s test on SPSS.

FIG. 9.34 Procedure for elaborating the t-test from two paired samples on SPSS.


FIG. 9.35 Selecting the variables that will be paired.

FIG. 9.36 Results of the t-test for two paired samples.

FIG. 9.37 Results of Student’s t-test for two paired samples for Example 9.11 on Stata.

9.7.2 Solving Student’s t-Test From Two Paired Samples by Using Stata Software

The t-test to compare the means of two paired groups will be solved on Stata for the data in Example 9.11. The use of the images in this section has been authorized by StataCorp LP©. Let’s open the file T_test_Two_Paired_Samples.dta. The paired variables are called before and after. In this case, we must type the following command:

ttest before == after

The result of the test can be seen in Fig. 9.37. We can see that the calculated value of the statistic (4.385) is similar to the one calculated in Example 9.11 and on SPSS, as well as the probability associated to the statistic for a bilateral test (0.0018). Since P < 0.05, we reject the null hypothesis that the times spent by the operators before and after the training course are the same, with a 95% confidence level.

9.8 ANOVA TO COMPARE THE MEANS OF MORE THAN TWO POPULATIONS

ANOVA is a test used to compare the means of three or more populations, through the analysis of sample variances. The test is based on a sample obtained from each population, aiming at determining if the differences between the sample means suggest significant differences between the population means, or if such differences are only a result of the implicit variability of the sample. ANOVA’s assumptions are: (i) The samples must be independent from each other; (ii) The data in the populations must have a normal distribution; (iii) The population variances must be homogeneous.

9.8.1 One-Way ANOVA

One-way ANOVA is an extension of Student’s t-test for two population means, allowing the researcher to compare three or more population means. The null hypothesis of the test states that the population means are the same. If there is at least one group with a mean that is different from the others, the null hypothesis is rejected.

As stated in Fávero et al. (2009), the one-way ANOVA allows the researcher to verify the effect of a qualitative explanatory variable (factor) on a quantitative dependent variable. Each group includes the observations of the dependent variable in one category of the factor.

Assuming that independent samples of size n are obtained from k populations ($k \geq 3$) and that the means of these populations can be represented by $\mu_1, \mu_2, \ldots, \mu_k$, the analysis of variance tests the following hypotheses:

$$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$$
$$H_1: \exists (i, j) \;\; \mu_i \neq \mu_j, \; i \neq j \tag{9.29}$$

According to Maroco (2014), in general, the observations for this type of problem can be represented according to Table 9.2.

TABLE 9.2 Observations of the One-Way ANOVA

Samples or Groups
1 | 2 | … | k
$Y_{11}$ | $Y_{21}$ | … | $Y_{k1}$
$Y_{12}$ | $Y_{22}$ | … | $Y_{k2}$
⋮ | ⋮ | | ⋮
$Y_{1n_1}$ | $Y_{2n_2}$ | … | $Y_{kn_k}$

where $Y_{ij}$ represents observation j of sample or group i ($i = 1, \ldots, k$; $j = 1, \ldots, n_i$) and $n_i$ is the size of sample or group i. The size of the global sample is $N = \sum_{i=1}^{k} n_i$. The first subscript identifies the group, in agreement with the model and the sums of squares that follow. Pestana and Gageiro (2008) present the following model:

$$Y_{ij} = \mu_i + e_{ij} \tag{9.30}$$

$$Y_{ij} = \mu + (\mu_i - \mu) + e_{ij} \tag{9.31}$$

$$Y_{ij} = \mu + \alpha_i + e_{ij} \tag{9.32}$$

where:
$\mu$ is the global mean of the population;
$\mu_i$ is the mean of sample or group i;
$\alpha_i$ is the effect of sample or group i;
$e_{ij}$ is the random error.


Therefore, ANOVA assumes that each group comes from a population with a normal distribution, mean $\mu_i$, and a homogeneous variance, that is, $Y_{ij} \sim N(\mu_i, \sigma^2)$, resulting in the hypothesis that the errors (residuals) have a normal distribution with a mean equal to zero and a constant variance, that is, $e_{ij} \sim N(0, \sigma^2)$, besides being independent (Fávero et al., 2009).

The technique’s hypotheses are tested from the calculation of the group variances, and that is where the name ANOVA comes from. The technique involves the calculation of the variations between the groups ($\bar{Y}_i - \bar{Y}$) and within each group ($Y_{ij} - \bar{Y}_i$). The residual sum of squares within groups (RSS) is calculated by:

$$RSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_i\right)^2 \tag{9.33}$$

The sum of squares between groups, or the sum of squares of the factor (SSF), is given by:

$$SSF = \sum_{i=1}^{k} n_i \cdot \left(\bar{Y}_i - \bar{Y}\right)^2 \tag{9.34}$$

Therefore, the total sum of squares is:

$$TSS = RSS + SSF = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}\right)^2 \tag{9.35}$$

According to Fávero et al. (2009) and Maroco (2014), the ANOVA statistic is given by the ratio between the variance of the factor (SSF divided by $k - 1$ degrees of freedom) and the variance of the residuals (RSS divided by $N - k$ degrees of freedom):

$$F_{cal} = \frac{\dfrac{SSF}{(k - 1)}}{\dfrac{RSS}{(N - k)}} = \frac{MSF}{MSR} \tag{9.36}$$

where:
MSF represents the mean square between groups (estimate of the variance of the factor);
MSR represents the mean square within groups (estimate of the variance of the residuals).

Table 9.3 summarizes the calculations of the one-way ANOVA.

The value of F can be null or positive, but never negative. Therefore, ANOVA uses the F-distribution, which is asymmetrical and skewed to the right. The calculated value ($F_{cal}$) must be compared to the value in the F-distribution table (Table A in the Appendix). This table provides the critical values of $F_c = F_{k-1, N-k, \alpha}$ where $P(F_{cal} > F_c) = \alpha$ (right-tailed test). Therefore, one-way ANOVA’s null hypothesis is rejected if $F_{cal} > F_c$. Otherwise, if $F_{cal} \leq F_c$, we do not reject $H_0$. We will use these concepts when we study the estimation of regression models in Chapter 13.

TABLE 9.3 Calculating the One-Way ANOVA

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F
Between the groups | $SSF = \sum_{i=1}^{k} n_i \left(\bar{Y}_i - \bar{Y}\right)^2$ | $k - 1$ | $MSF = \dfrac{SSF}{(k - 1)}$ | $F = \dfrac{MSF}{MSR}$
Within the groups | $RSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_i\right)^2$ | $N - k$ | $MSR = \dfrac{RSS}{(N - k)}$ |
Total | $TSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}\right)^2$ | $N - 1$ | |

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.


TABLE 9.E.15 Percentage of Sucrose for the Three Suppliers

Supplier 1 ($n_1 = 12$) | Supplier 2 ($n_2 = 10$) | Supplier 3 ($n_3 = 10$)
0.33 | 1.54 | 1.47
0.79 | 1.11 | 1.69
1.24 | 0.97 | 1.55
1.75 | 2.57 | 2.04
0.94 | 2.94 | 2.67
2.42 | 3.44 | 3.07
1.97 | 3.02 | 3.33
0.87 | 3.55 | 4.01
0.33 | 2.04 | 1.52
0.79 | 1.67 | 2.03
1.24 | |
3.12 | |
$\bar{Y}_1 = 1.316$ | $\bar{Y}_2 = 2.285$ | $\bar{Y}_3 = 2.338$
$S_1 = 0.850$ | $S_2 = 0.948$ | $S_3 = 0.886$

Example 9.12: Applying the One-Way ANOVA Test
A sample with 32 products is collected to analyze the quality of the honey supplied by three different suppliers. One of the ways to test the quality of the honey is finding out how much sucrose it contains, which usually varies between 0.25% and 6.5%. Table 9.E.15 shows the percentage of sucrose in the sample collected from each supplier. Check if there are differences in this quality indicator among the three suppliers, considering a 5% significance level.

Solution
Step 1: In this case, the most suitable test is the one-way ANOVA. First, we must verify the assumptions of normality for each group and of variance homogeneity between the groups through the Kolmogorov-Smirnov, Shapiro-Wilk, and Levene tests. Figs. 9.38 and 9.39 show the results obtained by using SPSS software.

FIG. 9.38 Results of the tests for normality on SPSS.

FIG. 9.39 Results of Levene’s test on SPSS.


Since the significance level observed in the tests for normality for each group and in the variance homogeneity test between the groups is greater than 5%, we can conclude that each one of the groups shows data with a normal distribution and that the variances between the groups are homogeneous, with a 95% confidence level. Since the assumptions of the one-way ANOVA were met, the technique can be applied.

Step 2: For this example, ANOVA’s null hypothesis states that there are no differences in the amount of sucrose coming from the three suppliers. If there is at least one supplier with a population mean that is different from the others, the null hypothesis will be rejected. Thus, we have:
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1: \exists (i, j) \;\; \mu_i \neq \mu_j, \; i \neq j$
Step 3: The significance level to be considered is 5%.
Step 4: The calculation of the $F_{cal}$ statistic is specified here. For this example, we know that $k = 3$ groups and the global sample size is $N = 32$. The global sample mean is $\bar{Y} = 1.938$. The sum of squares between groups (SSF) is:

$$SSF = 12 \cdot (1.316 - 1.938)^2 + 10 \cdot (2.285 - 1.938)^2 + 10 \cdot (2.338 - 1.938)^2 = 7.449$$

Therefore, the mean square between groups (MSF) is:

$$MSF = \frac{SSF}{(k - 1)} = \frac{7.449}{2} = 3.725$$

The calculation of the sum of squares within groups (RSS) is shown in Table 9.E.16.

TABLE 9.E.16 Calculation of the Sum of Squares Within Groups (RSS)

Supplier | Sucrose | $Y_{ij} - \bar{Y}_i$ | $\left(Y_{ij} - \bar{Y}_i\right)^2$
1 | 0.33 | -0.986 | 0.972
1 | 0.79 | -0.526 | 0.277
1 | 1.24 | -0.076 | 0.006
1 | 1.75 | 0.434 | 0.189
1 | 0.94 | -0.376 | 0.141
1 | 2.42 | 1.104 | 1.219
1 | 1.97 | 0.654 | 0.428
1 | 0.87 | -0.446 | 0.199
1 | 0.33 | -0.986 | 0.972
1 | 0.79 | -0.526 | 0.277
1 | 1.24 | -0.076 | 0.006
1 | 3.12 | 1.804 | 3.255
2 | 1.54 | -0.745 | 0.555
2 | 1.11 | -1.175 | 1.381
2 | 0.97 | -1.315 | 1.729
2 | 2.57 | 0.285 | 0.081
2 | 2.94 | 0.655 | 0.429
2 | 3.44 | 1.155 | 1.334
2 | 3.02 | 0.735 | 0.540
2 | 3.55 | 1.265 | 1.600
2 | 2.04 | -0.245 | 0.060
2 | 1.67 | -0.615 | 0.378
3 | 1.47 | -0.868 | 0.753
3 | 1.69 | -0.648 | 0.420
3 | 1.55 | -0.788 | 0.621
3 | 2.04 | -0.298 | 0.089
3 | 2.67 | 0.332 | 0.110
3 | 3.07 | 0.732 | 0.536
3 | 3.33 | 0.992 | 0.984
3 | 4.01 | 1.672 | 2.796
3 | 1.52 | -0.818 | 0.669
3 | 2.03 | -0.308 | 0.095
RSS | | | 23.100

Therefore, the mean square within groups is:

$$MSR = \frac{RSS}{(N - k)} = \frac{23.100}{29} = 0.797$$

Thus, the value of the $F_{cal}$ statistic is:

$$F_{cal} = \frac{MSF}{MSR} = \frac{3.725}{0.797} = 4.676$$

Step 5: According to Table A in the Appendix, the critical value of the statistic is $F_c = F_{2, 29, 5\%} = 3.33$.

Step 6: Decision: since the value calculated lies in the critical region ($F_{cal} > F_c$), we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is at least one supplier with a population mean that is different from the others.

If, instead of comparing the value calculated to the critical value of Snedecor’s F-distribution, we use the calculation of P-value, Steps 5 and 6 will be:

Step 5: According to Table A in the Appendix, for $\nu_1 = 2$ degrees of freedom in the numerator and $\nu_2 = 29$ degrees of freedom in the denominator, the probability associated with $F_{cal} = 4.676$ is between 0.01 and 0.025 (P-value).
Step 6: Decision: since $P < 0.05$, the null hypothesis is rejected.
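The sums of squares in Example 9.12 are tedious to compute by hand, so a short script is a convenient cross-check. The following is a minimal Python sketch using only the standard library; the function name one_way_anova, the dictionary keys, and the constant F_CRITICAL are illustrative, and the critical value 3.33 is copied from Step 5 rather than computed. It evaluates Expressions (9.33), (9.34), and (9.36) for the sucrose data in Table 9.E.15.

```python
import statistics

# Percentage of sucrose per supplier (Table 9.E.15).
supplier_1 = [0.33, 0.79, 1.24, 1.75, 0.94, 2.42, 1.97, 0.87, 0.33, 0.79, 1.24, 3.12]
supplier_2 = [1.54, 1.11, 0.97, 2.57, 2.94, 3.44, 3.02, 3.55, 2.04, 1.67]
supplier_3 = [1.47, 1.69, 1.55, 2.04, 2.67, 3.07, 3.33, 4.01, 1.52, 2.03]


def one_way_anova(samples):
    """Return the sums of squares, mean squares, and F statistic for k groups."""
    all_values = [y for sample in samples for y in sample]
    k, n_total = len(samples), len(all_values)
    grand_mean = statistics.mean(all_values)
    # Between-groups sum of squares, Expression (9.34).
    ssf = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in samples)
    # Within-groups (residual) sum of squares, Expression (9.33).
    rss = 0.0
    for s in samples:
        group_mean = statistics.mean(s)
        rss += sum((y - group_mean) ** 2 for y in s)
    msf = ssf / (k - 1)
    msr = rss / (n_total - k)
    return {"SSF": ssf, "RSS": rss, "MSF": msf, "MSR": msr,
            "F": msf / msr, "df": (k - 1, n_total - k)}


result = one_way_anova([supplier_1, supplier_2, supplier_3])

# Critical value F(2, 29) at alpha = 5%, taken from the text (Table A).
F_CRITICAL = 3.33
reject_h0 = result["F"] > F_CRITICAL

print(f"SSF = {result['SSF']:.3f}, RSS = {result['RSS']:.3f}")
print(f"F = {result['F']:.3f} with df = {result['df']}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The function accepts any number of groups with unequal sizes, which is why the group size is taken from each list rather than assumed to be constant.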

9.8.1.1 Solving the One-Way ANOVA Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 9.12 are available in the file One_Way_ANOVA.sav. First of all, let’s click on Analyze → Compare Means → One-Way ANOVA …, as shown in Fig. 9.40.

Let’s include the variable Sucrose in the list of dependent variables (Dependent List) and the variable Supplier in the box Factor, according to Fig. 9.41. After that, we must click on Options … and select the option Homogeneity of variance test (Levene’s test for variance homogeneity). Finally, let’s click on Continue and on OK to obtain the result of Levene’s test, besides the ANOVA table.

Since the One-Way ANOVA procedure does not make the normality test available, it must be obtained by applying the same procedure described in Section 9.3.3. According to Fig. 9.42, we can verify that each one of the groups has data that follow a normal distribution. Moreover, through Fig. 9.43, we can conclude that the variances between the groups are homogeneous.


FIG. 9.40 Procedure for the one-way ANOVA.

FIG. 9.41 Selecting the variables.

From the ANOVA table (Fig. 9.44), we can see that the value of the F-test is 4.676 and the respective P-value is 0.017 (we saw in Example 9.12 that this value would be between 0.01 and 0.025), value less than 0.05. This leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that at least one of the population means is different from the others (there are differences in the percentage of sucrose in the honey of the three suppliers).

9.8.1.2 Solving the One-Way ANOVA Test by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©. The one-way ANOVA on Stata is generated from the following syntax:

anova variabley* factor*

FIG. 9.42 Results of the tests for normality for Example 9.12 on SPSS.

FIG. 9.43 Results of Levene’s test for Example 9.12 on SPSS.

FIG. 9.44 Results of the one-way ANOVA for Example 9.12 on SPSS.

FIG. 9.45 Results of the one-way ANOVA on Stata.

in which the term variabley* should be replaced by the quantitative dependent variable and the term factor* by the qualitative explanatory variable. The data in Example 9.12 are available in the file One_Way_Anova.dta. The quantitative dependent variable is called sucrose and the factor is represented by the variable supplier. Thus, we must type the following command:

anova sucrose supplier

The result of the test can be seen in Fig. 9.45. We can see that the calculated value of the statistic (4.68) is similar to the one calculated in Example 9.12 and also generated on SPSS, as well as the probability associated to the value of the statistic (0.017). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that at least one of the population means is different from the others.

9.8.2 Factorial ANOVA

Factorial ANOVA is an extension of the one-way ANOVA, with the same assumptions, but considering two or more factors. Factorial ANOVA presumes that the quantitative dependent variable is influenced by more than one qualitative explanatory variable (factor). It also tests the possible interactions between the factors, through the resulting effect of the combination of factor A’s level i and factor B’s level j, as discussed by Pestana and Gageiro (2008), Fávero et al. (2009), and Maroco (2014).

For Pestana and Gageiro (2008) and Fávero et al. (2009), the main objective of the factorial ANOVA is to determine whether the means for each factor level are the same (an isolated effect of the factors on the dependent variable), and to verify the interaction between the factors (the joint effect of the factors on the dependent variable). For educational purposes, the factorial ANOVA will be described for the two-way model.

9.8.2.1 Two-Way ANOVA

According to Fávero et al. (2009) and Maroco (2014), the observations of the two-way ANOVA can be represented, in general, as shown in Table 9.4. For each cell, we can see the values of the dependent variable in the factors A and B that are being studied. In this table, $Y_{ijk}$ represents observation k ($k = 1, \ldots, n$) of factor A’s level i ($i = 1, \ldots, a$) and of factor B’s level j ($j = 1, \ldots, b$).

First, in order to check the isolated effects of factors A and B, we must test the following hypotheses (Fávero et al., 2009; Maroco, 2014):

$$H_0^A: \mu_1 = \mu_2 = \ldots = \mu_a$$
$$H_1^A: \exists (i, j) \;\; \mu_i \neq \mu_j, \; i \neq j \;\; (i, j = 1, \ldots, a) \tag{9.37}$$

and

$$H_0^B: \mu_1 = \mu_2 = \ldots = \mu_b$$
$$H_1^B: \exists (i, j) \;\; \mu_i \neq \mu_j, \; i \neq j \;\; (i, j = 1, \ldots, b) \tag{9.38}$$

TABLE 9.4 Observations of the Two-Way ANOVA

Factor A | Factor B: 1 | 2 | … | b
1 | $Y_{111}, Y_{112}, \ldots, Y_{11n}$ | $Y_{121}, Y_{122}, \ldots, Y_{12n}$ | … | $Y_{1b1}, Y_{1b2}, \ldots, Y_{1bn}$
2 | $Y_{211}, Y_{212}, \ldots, Y_{21n}$ | $Y_{221}, Y_{222}, \ldots, Y_{22n}$ | … | $Y_{2b1}, Y_{2b2}, \ldots, Y_{2bn}$
⋮ | ⋮ | ⋮ | | ⋮
a | $Y_{a11}, Y_{a12}, \ldots, Y_{a1n}$ | $Y_{a21}, Y_{a22}, \ldots, Y_{a2n}$ | … | $Y_{ab1}, Y_{ab2}, \ldots, Y_{abn}$

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.


Now, in order to verify the joint effect of the factors on the dependent variable, we must test the following hypotheses (Fávero et al., 2009; Maroco, 2014):

$$H_0: \gamma_{ij} = 0 \text{ for all } (i, j) \text{ (there is no interaction between the factors A and B)}$$
$$H_1: \gamma_{ij} \neq 0 \text{ for at least one } (i, j) \text{ (there is interaction between the factors A and B)} \tag{9.39}$$

The model presented by Pestana and Gageiro (2008) can be described as:

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk} \tag{9.40}$$

where:
$\mu$ is the population’s global mean;
$\alpha_i$ is the effect of factor A’s level i, given by $\mu_i - \mu$;
$\beta_j$ is the effect of factor B’s level j, given by $\mu_j - \mu$;
$\gamma_{ij}$ is the interaction between the factors;
$e_{ijk}$ is the random error that follows a normal distribution with a mean equal to zero and a constant variance.

To standardize the effects of the levels chosen of both factors, we must assume that:

$$\sum_{i=1}^{a} \alpha_i = \sum_{j=1}^{b} \beta_j = \sum_{i=1}^{a} \gamma_{ij} = \sum_{j=1}^{b} \gamma_{ij} = 0 \tag{9.41}$$

Let’s consider $\bar{Y}$, $\bar{Y}_{ij}$, $\bar{Y}_i$, and $\bar{Y}_j$ the general mean of the global sample, the mean per sample, the mean of factor A’s level i, and the mean of factor B’s level j, respectively. We can describe the residual sum of squares (RSS) as:

$$RSS = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \left(Y_{ijk} - \bar{Y}_{ij}\right)^2 \tag{9.42}$$

On the other hand, the sum of squares of factor A ($SSF_A$), the sum of squares of factor B ($SSF_B$), and the sum of squares of the interaction ($SSF_{AB}$) are represented below in Expressions (9.43)–(9.45), respectively:

$$SSF_A = b \cdot n \cdot \sum_{i=1}^{a} \left(\bar{Y}_i - \bar{Y}\right)^2 \tag{9.43}$$

$$SSF_B = a \cdot n \cdot \sum_{j=1}^{b} \left(\bar{Y}_j - \bar{Y}\right)^2 \tag{9.44}$$

$$SSF_{AB} = n \cdot \sum_{i=1}^{a} \sum_{j=1}^{b} \left(\bar{Y}_{ij} - \bar{Y}_i - \bar{Y}_j + \bar{Y}\right)^2 \tag{9.45}$$

Therefore, the total sum of squares can be written as follows:

$$TSS = RSS + SSF_A + SSF_B + SSF_{AB} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \left(Y_{ijk} - \bar{Y}\right)^2 \tag{9.46}$$

Thus, the ANOVA statistic for factor A is given by:

$$F_A = \frac{\dfrac{SSF_A}{(a - 1)}}{\dfrac{RSS}{(n - 1) \cdot ab}} = \frac{MSF_A}{MSR} \tag{9.47}$$

where:
$MSF_A$ is the mean square of factor A;
$MSR$ is the mean square of the errors.


TABLE 9.5 Calculations of the Two-Way ANOVA

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F
Factor A | $SSF_A = b \cdot n \cdot \sum_{i=1}^{a} \left(\bar{Y}_i - \bar{Y}\right)^2$ | $a - 1$ | $MSF_A = \dfrac{SSF_A}{(a - 1)}$ | $F_A = \dfrac{MSF_A}{MSR}$
Factor B | $SSF_B = a \cdot n \cdot \sum_{j=1}^{b} \left(\bar{Y}_j - \bar{Y}\right)^2$ | $b - 1$ | $MSF_B = \dfrac{SSF_B}{(b - 1)}$ | $F_B = \dfrac{MSF_B}{MSR}$
Interaction | $SSF_{AB} = n \cdot \sum_{i=1}^{a} \sum_{j=1}^{b} \left(\bar{Y}_{ij} - \bar{Y}_i - \bar{Y}_j + \bar{Y}\right)^2$ | $(a - 1) \cdot (b - 1)$ | $MSF_{AB} = \dfrac{SSF_{AB}}{(a - 1) \cdot (b - 1)}$ | $F_{AB} = \dfrac{MSF_{AB}}{MSR}$
Error | $RSS = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \left(Y_{ijk} - \bar{Y}_{ij}\right)^2$ | $(n - 1) \cdot ab$ | $MSR = \dfrac{RSS}{(n - 1) \cdot ab}$ |
Total | $TSS = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \left(Y_{ijk} - \bar{Y}\right)^2$ | $N - 1$ | |

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.

On the other hand, the ANOVA statistic for factor B is given by:

$$F_B = \frac{\dfrac{SSF_B}{b - 1}}{\dfrac{RSS}{(n - 1) \cdot ab}} = \frac{MSF_B}{MSR} \tag{9.48}$$

where:
$MSF_B$ is the mean square of factor B.

And the ANOVA statistic for the interaction is represented by:

$$F_{AB} = \frac{\dfrac{SSF_{AB}}{(a - 1) \cdot (b - 1)}}{\dfrac{RSS}{(n - 1) \cdot ab}} = \frac{MSF_{AB}}{MSR} \tag{9.49}$$

where:
$MSF_{AB}$ is the mean square of the interaction.

The calculations of the two-way ANOVA are summarized in Table 9.5.
The calculated values of the statistics ($F_A^{cal}$, $F_B^{cal}$, and $F_{AB}^{cal}$) must be compared to the critical values obtained from the F-distribution table (Table A in the Appendix): $F_A^c = F_{a-1,\,(n-1)ab,\,\alpha}$, $F_B^c = F_{b-1,\,(n-1)ab,\,\alpha}$, and $F_{AB}^c = F_{(a-1)(b-1),\,(n-1)ab,\,\alpha}$. For each statistic, if the value lies in the critical region ($F_A^{cal} > F_A^c$, $F_B^{cal} > F_B^c$, $F_{AB}^{cal} > F_{AB}^c$), we must reject the null hypothesis. Otherwise, we do not reject H0.

Example 9.13: Using the Two-Way ANOVA
A sample of 24 passengers who travel from São Paulo to Campinas in a certain week is collected. The following variables are analyzed: (1) travel time in minutes, (2) the bus company chosen, and (3) the day of the week. The main objective is to verify whether there is a relationship between the travel time and the bus company, between the travel time and the day of the week, and whether the bus company and the day of the week interact. The levels considered in the variable bus company are Company A (1), Company B (2), and Company C (3). The levels of the day of the week are Monday (1), Tuesday (2), Wednesday (3), Thursday (4), Friday (5), Saturday (6), and Sunday (7). The results of the sample are shown in Table 9.E.17 and are also available in the file Two_Way_ANOVA.sav. Test these hypotheses, considering a 5% significance level.


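TABLE 9.E.17 Data From Example 9.13 (Using the Two-Way ANOVA)

| Time (Min) | Company | Day of the Week |
|---|---|---|
| 90 | 2 | 4 |
| 100 | 1 | 5 |
| 72 | 1 | 6 |
| 76 | 3 | 1 |
| 85 | 2 | 2 |
| 95 | 1 | 5 |
| 79 | 3 | 1 |
| 100 | 2 | 4 |
| 70 | 1 | 7 |
| 80 | 3 | 1 |
| 85 | 2 | 3 |
| 90 | 1 | 5 |
| 77 | 2 | 7 |
| 80 | 1 | 2 |
| 85 | 3 | 4 |
| 74 | 2 | 7 |
| 72 | 3 | 6 |
| 92 | 1 | 5 |
| 84 | 2 | 4 |
| 80 | 1 | 3 |
| 79 | 2 | 1 |
| 70 | 3 | 6 |
| 88 | 3 | 5 |
| 84 | 2 | 4 |

9.8.2.1.1 Solving the Two-Way ANOVA Test by Using SPSS Software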

The use of the images in this section has been authorized by the International Business Machines Corporation©.
Step 1: In this case, the most suitable test is the two-way ANOVA. First, we must verify whether the variable Time (metric) is normally distributed (as shown in Fig. 9.46). According to this figure, we can conclude that the variable Time follows a normal distribution, with a 95% confidence level. The hypothesis of variance homogeneity will be verified in Step 4.
Step 2: The null hypothesis H0 of the two-way ANOVA for this example assumes that the population means of each level of the factor Company and of each level of the factor Day_of_the_week are equal, that is, $H_0^A: \mu_1 = \mu_2 = \mu_3$ and $H_0^B: \mu_1 = \mu_2 = \dots = \mu_7$. The null hypothesis H0 also states that there is no interaction between the factor Company and the factor Day_of_the_week, that is, $H_0: \gamma_{ij} = 0$ for $i \neq j$.
Step 3: The significance level to be considered is 5%.


FIG. 9.46 Results of the normality tests on SPSS.

FIG. 9.47 Procedure for elaborating the two-way ANOVA on SPSS.

Step 4: The F statistics in ANOVA for the factor Company, for the factor Day_of_the_week, and for the interaction Company * Day_of_the_week will be obtained through the SPSS software, according to the procedure specified below. In order to do that, let's click on Analyze → General Linear Model → Univariate …, as shown in Fig. 9.47. After that, let's include the variable Time in the box of dependent variables (Dependent Variable) and the variables Company and Day_of_the_week in the box of Fixed Factor(s), as shown in Fig. 9.48. This example is based on the fixed-effects ANOVA, in which the levels of the factors are fixed. If the levels of one of the factors were chosen randomly, that factor would be inserted into the box Random Factor(s), resulting in a mixed-effects ANOVA. The button Model … defines the variance analysis model to be tested. Through the button Contrasts …, we can assess whether a category of one of the factors is significantly different from the other categories of the same factor. Charts can be constructed through the button Plots …, which allow us to visualize the existence or nonexistence of interactions between the factors. The button Post Hoc …, on the other hand, allows us to compare multiple means. Finally, from the button Options …, we can obtain descriptive statistics and the result of Levene's variance homogeneity test, as well as select the appropriate significance level (Fávero et al., 2009; Maroco, 2014).


FIG. 9.48 Selection of the variables to elaborate the two-way ANOVA.

Therefore, since we want to test variance homogeneity, we must select, in Options …, the option Homogeneity tests, as shown in Fig. 9.49. Finally, let's click on Continue and on OK to obtain Levene's variance homogeneity test and the two-way ANOVA table. In Fig. 9.50, we can see that the variances between groups are homogeneous (P = 0.451 > 0.05). Based on Fig. 9.51, we can conclude that there are no significant differences between the travel times of the companies analyzed, that is, the factor Company does not have a significant impact on the variable Time (P = 0.330 > 0.05). On the other hand, we conclude that there are significant differences between the days of the week, that is, the factor Day_of_the_week has a significant effect on the variable Time (P = 0.003 < 0.05). We finally conclude that there is no significant interaction, with a 95% confidence level, between the two factors Company and Day_of_the_week, since P = 0.898 > 0.05.

9.8.2.1.2 Solving the Two-Way ANOVA Test by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©. The command anova on Stata specifies the dependent variable being analyzed, as well as the respective factors. Interactions are specified using the character # between the factors. Thus, the two-way ANOVA is generated through the following syntax:

anova variableY* factorA* factorB* factorA*#factorB*

or simply:

anova variableY* factorA*##factorB*

in which the term variableY* should be replaced by the quantitative dependent variable and the terms factorA* and factorB* by the respective factors. If we type the syntax anova variableY* factorA* factorB*, only the main effect of each factor will be tested, and not the interaction between the factors.


FIG. 9.49 Test of variance homogeneity.

FIG. 9.50 Results of Levene’s test on SPSS.

The data presented in Example 9.13 are available in the file Two_Way_ANOVA.dta. The quantitative dependent variable is called time and the factors correspond to the variables company and day_of_the_week. Thus, we must type the following command:

anova time company##day_of_the_week

The results can be seen in Fig. 9.52 and are similar to those presented on SPSS, which allows us to conclude, with a 95% confidence level, that only the factor day_of_the_week has a significant effect on the variable time (P = 0.003 < 0.05), and that there is no significant interaction between the two factors analyzed (P = 0.898 > 0.05).


FIG. 9.51 Results of the two-way ANOVA for Example 9.13 on SPSS.

FIG. 9.52 Results of the two-way ANOVA for Example 9.13 on Stata.
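As a cross-check on Expressions (9.42)–(9.49) outside SPSS or Stata, the decomposition can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `two_way_anova` and the toy data are hypothetical, and the code assumes a balanced design (the same number n of observations in every combination of levels), which is the setting of the formulas above. The sample in Example 9.13 does not have the same number of observations in every cell, so the sketch does not apply to it directly.

```python
from itertools import product


def _mean(values):
    return sum(values) / len(values)


def two_way_anova(cells, a, b, n):
    """Balanced two-way ANOVA; cells[(i, j)] holds the n observations of cell (i, j)."""
    keys = list(product(range(a), range(b)))
    all_obs = [y for key in keys for y in cells[key]]
    grand = _mean(all_obs)                                   # general mean
    cell_mean = {key: _mean(cells[key]) for key in keys}     # mean per sample
    a_mean = [_mean([y for j in range(b) for y in cells[(i, j)]]) for i in range(a)]
    b_mean = [_mean([y for i in range(a) for y in cells[(i, j)]]) for j in range(b)]

    rss = sum((y - cell_mean[key]) ** 2 for key in keys for y in cells[key])  # (9.42)
    ssfa = b * n * sum((m - grand) ** 2 for m in a_mean)                      # (9.43)
    ssfb = a * n * sum((m - grand) ** 2 for m in b_mean)                      # (9.44)
    ssfab = n * sum(                                                          # (9.45)
        (cell_mean[(i, j)] - a_mean[i] - b_mean[j] + grand) ** 2 for i, j in keys
    )
    tss = sum((y - grand) ** 2 for y in all_obs)                              # (9.46)

    msr = rss / ((n - 1) * a * b)  # mean square of the errors
    return {
        "RSS": rss, "SSFA": ssfa, "SSFB": ssfb, "SSFAB": ssfab, "TSS": tss,
        "FA": (ssfa / (a - 1)) / msr,                # (9.47)
        "FB": (ssfb / (b - 1)) / msr,                # (9.48)
        "FAB": (ssfab / ((a - 1) * (b - 1))) / msr,  # (9.49)
    }


# Hypothetical toy data: a = 2 levels of A, b = 2 levels of B, n = 2 per cell.
toy = {(0, 0): [10, 12], (0, 1): [14, 16], (1, 0): [20, 22], (1, 1): [30, 32]}
result = two_way_anova(toy, a=2, b=2, n=2)
print(result)
```

The calculated F values would still have to be compared to the critical values of the F-distribution, which this sketch does not compute.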

9.8.2.2 ANOVA With More Than Two Factors
The two-way ANOVA can be generalized to three or more factors. According to Maroco (2014), the model becomes very complex, since the multiple interactions can make the effects of the factors difficult to interpret. The generic model with three factors presented by the author is:

$$Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \alpha\beta_{ij} + \alpha\gamma_{ik} + \beta\gamma_{jk} + \alpha\beta\gamma_{ijk} + \varepsilon_{ijkl} \tag{9.50}$$

9.9 FINAL REMARKS

This chapter presented the concepts and objectives of parametric hypotheses tests and the general procedures for constructing each one of them. We studied the main types of tests and the situations in which each one of them must be used. Moreover, the advantages and disadvantages of each test were established, as well as their assumptions. We studied the tests for normality (Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia), variance homogeneity tests (Bartlett's χ², Cochran's C, Hartley's F_max, and Levene's F), Student's t-test for one population mean, for two independent means, and for two paired means, as well as ANOVA and its extensions.


Regardless of the application's main goal, parametric tests can provide good and interesting research results that will be useful in the decision-making process. Whichever modeling software is chosen, each test must always be applied on the basis of the underlying theory, without ever ignoring the researcher's experience and intuition.

9.10 EXERCISES
(1) In what situations should parametric tests be applied and what are the assumptions of these tests?
(2) What are the advantages and disadvantages of parametric tests?
(3) What are the main parametric tests to verify the normality of the data? In what situations must we use each one of them?
(4) What are the main parametric tests to verify the variance homogeneity between groups? In what situations must we use each one of them?
(5) To test a single population mean, we can use the z-test and Student's t-test. In what cases must each one of them be applied?
(6) What are the main mean comparison tests? What are the assumptions of each test?
(7) The monthly aircraft sales data throughout last year can be seen in the table below. Check whether the data are normally distributed. Consider α = 5%.

| Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sept. | Oct. | Nov. | Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | 52 | 50 | 49 | 47 | 50 | 51 | 54 | 39 | 56 | 52 | 55 |

(8) Test the normality of the temperature data listed (α = 5%):

12.5  14.2  13.4  14.6  12.7  10.9  16.5  14.7  11.2  10.9  12.1  12.8
13.8  13.5  13.2  14.1  15.5  16.2  10.8  14.3  12.8  12.4  11.4  16.2
14.3  14.8  14.6  13.7  13.5  10.8  10.4  11.5  11.9  11.3  14.2  11.2
13.4  16.1  13.5  17.5  16.2  15.0  14.2  13.2  12.4  13.4  12.7  11.2

(9) The table shows the final grades of two students in nine subjects. Check whether there is variance homogeneity between the students (α = 5%).

| Student 1 | 6.4 | 5.8 | 6.9 | 5.4 | 7.3 | 8.2 | 6.1 | 5.5 | 6.0 |
|---|---|---|---|---|---|---|---|---|---|
| Student 2 | 6.5 | 7.0 | 7.5 | 6.5 | 8.1 | 9.0 | 7.5 | 6.5 | 6.8 |

(10) A fat-free yogurt manufacturer states that the number of calories in each cup is 60 cal. In order to check whether this information is true, a random sample of 36 cups is collected, and we observe that the average number of calories is 65 cal, with a standard deviation of 3.5. Apply the appropriate test and check whether the manufacturer's statement is true, considering a significance level of 5%.
(11) We would like to compare the average waiting time before being seen by a doctor (in minutes) in two hospitals. In order to do that, we collected a sample of 20 patients from each hospital. The data are available in the tables. Check whether there are differences between the average waiting times in both hospitals. Consider α = 1%.


Hospital 1

72  58  91  88  70  76  98  101  65  73
79  82  80  91  93  88  97  83  71  74

Hospital 2

66  40  55  70  76  61  53  50  47  61
52  48  60  72  57  70  66  55  46  51

(12) Thirty teenagers whose total cholesterol level is higher than what is advisable underwent treatment that consisted of a diet and physical activities. The tables show the levels of LDL cholesterol (mg/dL) before and after the treatment. Check whether the treatment was effective (α = 5%).

Before the treatment

220  212  227  234  204  209  211  245  237  250
208  224  220  218  208  205  227  207  222  213
210  234  240  227  229  224  204  210  215  228

After the treatment

195  180  200  204  180  195  200  210  205  211
175  198  195  200  190  200  222  198  201  194
190  204  230  222  209  198  195  190  201  210

(13) An aerospace company produces civilian and military helicopters at its three factories. The tables show the monthly production of helicopters in the last 12 months at each factory. Check whether there is a difference between the population means. Consider α = 5%.

| Factory 1 | 24 | 26 | 28 | 22 | 31 | 25 | 27 | 28 | 30 | 21 | 20 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Factory 2 | 28 | 26 | 24 | 30 | 24 | 27 | 25 | 29 | 30 | 27 | 26 | 25 |
| Factory 3 | 29 | 25 | 24 | 26 | 20 | 22 | 22 | 27 | 20 | 26 | 24 | 25 |

Chapter 10
Nonparametric Tests

Mathematics has wonderful strength that is capable of making us understand many mysteries of our faith.
Saint Jerome

10.1 INTRODUCTION
As studied in the previous chapter, hypotheses tests are divided into parametric and nonparametric. Applied to quantitative data, parametric tests formulate hypotheses about population parameters, such as the population mean (μ), population standard deviation (σ), population variance (σ²), population proportion (p), etc. Parametric tests require strong assumptions regarding the data distribution. For example, in many cases, we should assume that the samples are collected from populations whose data follow a normal distribution. Or, still, for comparison tests of two independent population means or k population means (k ≥ 3), the population variances must be homogeneous.
Conversely, nonparametric tests can formulate hypotheses about the qualitative characteristics of the population, so they can be applied to qualitative data, in nominal or ordinal scales. Since their assumptions regarding the data distribution are fewer and weaker than those of parametric tests, they are also known as distribution-free tests. Nonparametric tests are an alternative to parametric ones when the assumptions of the latter are violated. Given that they require a smaller number of assumptions, they are simpler and easier to apply, but less powerful when compared to parametric tests.
In short, the main advantages of nonparametric tests are:
(a) They can be applied in a wide variety of situations, because they do not require strict premises concerning the population, as parametric methods do. Notably, nonparametric methods do not require that the populations have a normal distribution.
(b) Unlike parametric methods, nonparametric methods can be applied to qualitative data, in nominal and ordinal scales.
(c) They are easy to apply because they require simpler calculations when compared to parametric methods.
The main disadvantages are:
(a) Quantitative data must be transformed into qualitative data before nonparametric tests can be applied, so we lose a considerable amount of information.
(b) Since nonparametric tests are less efficient than parametric tests, we need greater evidence (a larger sample or one with greater differences) to reject the null hypothesis.
Thus, since parametric tests are more powerful than nonparametric ones, that is, they have a higher probability of rejecting the null hypothesis when it is really false, they must be chosen as long as all the assumptions are confirmed. On the other hand, nonparametric tests are an alternative to parametric ones when those assumptions are violated or in cases in which the variables are qualitative.
Nonparametric tests are classified according to the variables' level of measurement and to the number and type of samples. For a single sample, we will study the binomial, chi-square (χ²), and sign tests. The binomial test is applied to binary variables. The χ² test can be applied to nominal variables as well as to ordinal variables, while the sign test is only applied to ordinal variables.
In the case of two paired samples, the main tests are the McNemar test, the sign test, and the Wilcoxon test. The McNemar test is applied to qualitative variables that assume only two categories (binary), while the sign test and the Wilcoxon test are applied to ordinal variables.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00010-0 © 2019 Elsevier Inc. All rights reserved.


TABLE 10.1 Classification of Nonparametric Statistical Tests

| Dimension | Level of Measurement | Nonparametric Test |
|---|---|---|
| One sample | Binary | Binomial |
| | Nominal or ordinal | χ² |
| | Ordinal | Sign test |
| Two paired samples | Binary | McNemar test |
| | Ordinal | Sign test, Wilcoxon test |
| Two independent samples | Nominal or ordinal | χ² |
| | Ordinal | Mann-Whitney U |
| K paired samples | Binary | Cochran's Q |
| | Ordinal | Friedman's test |
| K independent samples | Nominal or ordinal | χ² |
| | Ordinal | Kruskal-Wallis test |

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro.

Considering two independent samples, we can highlight the χ² test and the Mann-Whitney U test. The χ² test can be applied to nominal or ordinal variables, while the Mann-Whitney U test only considers ordinal variables. For k paired samples (k ≥ 3), we have Cochran's Q test, which considers binary variables, and Friedman's test, which considers ordinal variables. Finally, in the case of more than two independent samples, we will study the χ² test for nominal or ordinal variables and the Kruskal-Wallis test for ordinal variables. Table 10.1 shows this classification. Nonparametric tests in which the variables' level of measurement is ordinal can also be applied to quantitative variables, but in these cases they must only be used when the assumptions of the parametric tests are violated.
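The classification in Table 10.1 can also be read as a simple lookup from the sample design and the level of measurement to the candidate tests. The sketch below is only an illustration of that reading; the dictionary name `NONPARAMETRIC_TESTS`, the function `suggest_tests`, and the lowercase labels are hypothetical choices, not part of the book.

```python
# Table 10.1 as a lookup: (sample design, level of measurement) -> tests.
# Rows labeled "Nominal or ordinal" are entered under both levels.
NONPARAMETRIC_TESTS = {
    ("one sample", "binary"): ["Binomial"],
    ("one sample", "nominal"): ["Chi-square"],
    ("one sample", "ordinal"): ["Chi-square", "Sign test"],
    ("two paired samples", "binary"): ["McNemar test"],
    ("two paired samples", "ordinal"): ["Sign test", "Wilcoxon test"],
    ("two independent samples", "nominal"): ["Chi-square"],
    ("two independent samples", "ordinal"): ["Chi-square", "Mann-Whitney U"],
    ("k paired samples", "binary"): ["Cochran's Q"],
    ("k paired samples", "ordinal"): ["Friedman's test"],
    ("k independent samples", "nominal"): ["Chi-square"],
    ("k independent samples", "ordinal"): ["Chi-square", "Kruskal-Wallis test"],
}


def suggest_tests(design, level):
    """Return the tests listed in Table 10.1, or an empty list if none is listed."""
    return NONPARAMETRIC_TESTS.get((design.lower(), level.lower()), [])


print(suggest_tests("Two paired samples", "ordinal"))
print(suggest_tests("K independent samples", "nominal"))
```

An empty list only means that Table 10.1 lists no test for that combination; it does not mean that no such test exists.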

10.2 TESTS FOR ONE SAMPLE

In this case, a random sample is taken from the population and we test the hypothesis that the sample data have a certain characteristic or distribution. Among the nonparametric statistical tests for a single sample, we can highlight the binomial test, the χ² test, and the sign test. The binomial test is applied to binary data, the χ² test to nominal or ordinal data, while the sign test is applied to ordinal data.

10.2.1 Binomial Test

The binomial test is applied to an independent sample in which the variable that the researcher is interested in (X) is binary (dummy) or dichotomous, that is, it only has two possible outcomes: success or failure. By convention, we call the result X = 1 a success and the result X = 0 a failure. The probability of success for a given observation is represented by p and the probability of failure by q, that is:

$$P[X = 1] = p \quad \text{and} \quad P[X = 0] = q = 1 - p$$

For a bilateral test, we must consider the following hypotheses:

$$H_0: p = p_0$$
$$H_1: p \neq p_0$$

According to Siegel and Castellan (2006), the number of successes (Y), that is, the number of results of type [X = 1] in a sequence of N observations, is:

$$Y = \sum_{i=1}^{N} X_i$$

For the authors, in a sample of size N, the probability of obtaining k objects in one category and N − k objects in the other category is given by:

$$P[Y = k] = \binom{N}{k} \cdot p^k \cdot q^{N-k}, \quad k = 0, 1, \dots, N \tag{10.1}$$

where:
p: probability of success;
q: probability of failure;
and

$$\binom{N}{k} = \frac{N!}{k!\,(N - k)!}$$

Table F1 in the Appendix provides the probability P[Y = k] for several values of N, k, and p. However, when we test hypotheses, we must use the probability of obtaining values that are greater than or equal to the value observed:

$$P(Y \geq k) = \sum_{i=k}^{N} \binom{N}{i} \cdot p^i \cdot q^{N-i} \tag{10.2}$$

Or the probability of obtaining values that are less than or equal to the value observed:

$$P(Y \leq k) = \sum_{i=0}^{k} \binom{N}{i} \cdot p^i \cdot q^{N-i} \tag{10.3}$$

According to Siegel and Castellan (2006), when p = q = ½, instead of calculating the probabilities based on the expressions presented, it is more convenient to use Table F2 in the Appendix. This table provides the unilateral probabilities, under the null hypothesis H0: p = ½, of obtaining values that are as extreme as or more extreme than k, where k is the lowest of the frequencies observed, that is, P(Y ≤ k). Due to the symmetry of the binomial distribution when p = ½, we have P(Y ≥ k) = P(Y ≤ N − k). A unilateral test is used when we predict, in advance, which of the two categories must contain the smallest number of cases. For a bilateral test (when the prediction simply states that both frequencies will differ), we just need to double the values from Table F2 in the Appendix.
This final value obtained is called the P-value, which, according to what was discussed in Chapter 9, corresponds to the probability (unilateral or bilateral) associated to the value observed in the sample. The P-value indicates the lowest significance level that would lead to the rejection of the null hypothesis. Thus, we reject H0 if P ≤ α.
In the case of large samples (N > 25), the sampling distribution of variable Y is approximately normal, so the probability can be calculated by the following statistic:

$$Z_{cal} = \frac{|N \cdot \hat{p} - N \cdot p| - 0.5}{\sqrt{N \cdot p \cdot q}} \tag{10.4}$$

where $\hat{p}$ refers to the sample estimate of the proportion of successes used to test H0. The value of $Z_{cal}$ calculated by using Expression (10.4) must be compared to the critical value of the standard normal distribution (see Table E in the Appendix). This table provides the critical values of $z_c$ where $P(Z_{cal} > z_c) = \alpha$ (for a right-tailed unilateral test). For a bilateral test, we have $P(Z_{cal} < -z_c) = \alpha/2 = P(Z_{cal} > z_c)$. Therefore, for a right-tailed unilateral test, the null hypothesis is rejected if $Z_{cal} > z_c$. For a bilateral test, we reject H0 if $Z_{cal} < -z_c$ or $Z_{cal} > z_c$.

Example 10.1: Applying the Binomial Test to Small Samples
A group of 18 students took an intensive English course and were exposed to two different learning methods. At the end of the course, each student chose his/her favorite teaching method, as shown in Table 10.E.1. We believe there are no differences between both teaching methods. Test the null hypothesis with a significance level of 5%.


TABLE 10.E.1 Frequencies Obtained After Students Made Their Choice

| Events | Method 1 | Method 2 | Total |
|---|---|---|---|
| Frequency | 11 | 7 | 18 |
| Proportion | 0.611 | 0.389 | 1.0 |

Solution
Before we start the general procedure to construct the hypotheses tests, we will explain a few parameters in order to facilitate the understanding. Coding the chosen method as X = 1 (method 1) and X = 0 (method 2), the probability of choosing method 1 is represented by P[X = 1] = p and that of method 2 by P[X = 0] = q. The number of successes (Y = k) corresponds to the total number of type X = 1 results, so k = 11.
Step 1: The most suitable test in this case is the binomial test because the data are categorized into two classes.
Step 2: The null hypothesis states that there are no differences in the probabilities of choosing between both methods:

$$H_0: p = q = \tfrac{1}{2}$$
$$H_1: p \neq q$$

Step 3: The significance level to be considered is 5%.
Step 4: We have N = 18, k = 11, p = ½, and q = ½. Due to the symmetry of the binomial distribution when p = ½, P(Y ≥ k) = P(Y ≤ N − k), that is, P(Y ≥ 11) = P(Y ≤ 7). So, let's calculate P(Y ≤ 7) by using Expression (10.3) and show how this probability can be obtained directly from Table F2 in the Appendix. The probability of a maximum of seven students choosing method 2 is given by:

$$P(Y \leq 7) = P(Y = 0) + P(Y = 1) + \dots + P(Y = 7)$$

$$P(Y = 0) = \frac{18!}{0!\,18!} \cdot \left(\frac{1}{2}\right)^{0} \cdot \left(\frac{1}{2}\right)^{18} = 3.815 \times 10^{-6}$$

$$P(Y = 1) = \frac{18!}{1!\,17!} \cdot \left(\frac{1}{2}\right)^{1} \cdot \left(\frac{1}{2}\right)^{17} = 6.866 \times 10^{-5}$$

$$\vdots$$

$$P(Y = 7) = \frac{18!}{7!\,11!} \cdot \left(\frac{1}{2}\right)^{7} \cdot \left(\frac{1}{2}\right)^{11} = 0.121$$

Therefore:

$$P(Y \leq 7) = 3.815 \times 10^{-6} + \dots + 0.121 = 0.240$$

Since p = ½, the probability P(Y ≤ 7) could be obtained directly from Table F2 in the Appendix. For N = 18 and k = 7 (the lowest frequency observed), the associated unilateral probability is P1 = 0.240. Since it is a bilateral test, this value must be doubled (P = 2P1), so the associated bilateral probability is P = 0.480.
Note: In the general procedure of hypotheses tests, Step 4 corresponds to the calculation of the statistic based on the sample. On the other hand, Step 5 determines the probability associated to the value of the statistic obtained from Step 4. In the case of the binomial test, Step 4 calculates the probability associated to the occurrence in the sample directly.
Step 5: Decision: since the associated probability is greater than α (P = 0.480 > 0.05), we do not reject H0, which allows us to conclude, with a 95% confidence level, that there are no differences in the probabilities of choosing method 1 or 2.
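The calculation in Example 10.1 can be reproduced with a short Python sketch of Expressions (10.1) and (10.3). The helper names `binom_pmf` and `binom_cdf` are illustrative choices; only the standard library is used, and doubling the one-sided probability is assumed to be appropriate here because the distribution is symmetric when p = ½.

```python
from math import comb


def binom_pmf(k, n, p):
    """P[Y = k] for n observations with success probability p, Expression (10.1)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(Y <= k), Expression (10.3)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


N, p0 = 18, 0.5
k_low = 7  # lowest observed frequency (method 2)

p_one_sided = binom_cdf(k_low, N, p0)
p_two_sided = min(1.0, 2 * p_one_sided)  # doubling relies on symmetry at p = 1/2

print(f"P(Y <= 7)   = {p_one_sided:.4f}")
print(f"two-sided P = {p_two_sided:.4f}")
```

Doubling the unrounded one-sided probability gives approximately 0.481, whereas doubling the rounded value 0.240 gives the 0.480 reported in the example; the difference is only rounding.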

Example 10.2: Applying the Binomial Test to Large Samples
Redo the previous example considering the following results:

TABLE 10.E.2 Frequencies Obtained After Students Made Their Choice

| Events | Method 1 | Method 2 | Total |
|---|---|---|---|
| Frequency | 18 | 12 | 30 |
| Proportion | 0.6 | 0.4 | 1.0 |


FIG. 10.1 Critical region of Example 10.2.

Solution
Step 1: Let's apply the binomial test.
Step 2: The null hypothesis states that there are no differences between the probabilities of choosing both methods, that is:

$$H_0: p = q = \tfrac{1}{2}$$
$$H_1: p \neq q$$

Step 3: The significance level to be considered is 5%.
Step 4: Since N > 25, we can consider that the sampling distribution of variable Y is approximately normal, so the probability can be calculated from the Z statistic:

$$Z_{cal} = \frac{|N \cdot \hat{p} - N \cdot p| - 0.5}{\sqrt{N \cdot p \cdot q}} = \frac{|30 \cdot 0.6 - 30 \cdot 0.5| - 0.5}{\sqrt{30 \cdot 0.5 \cdot 0.5}} = 0.913$$

Step 5: The critical region of a standard normal distribution (Table E in the Appendix), for a bilateral test in which α = 5%, is shown in Fig. 10.1. For a bilateral test, each one of the tails corresponds to half of the significance level α.
Step 6: Decision: since the value calculated is not in the critical region, that is, $-1.96 \leq Z_{cal} \leq 1.96$, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there are no differences in the probabilities of choosing between the methods (p = q = ½).
If we used the P-value instead of the critical value of the statistic, Steps 5 and 6 would be:
Step 5: According to Table E in the Appendix, the unilateral probability associated to the statistic $Z_{cal}$ = 0.913 is P1 = 0.1762. For a bilateral test, this probability must be doubled (P-value = 0.3564).
Step 6: Decision: since P > 0.05, we do not reject H0.
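The normal approximation of Expression (10.4) can be sketched in Python as follows. The function names `binomial_z` and `upper_tail` are hypothetical; the tail probability is computed from the complementary error function in the standard library rather than read from Table E, so it may differ slightly from a table-based value, but it leads to the same decision.

```python
from math import erfc, sqrt


def binomial_z(successes, n, p0):
    """Z statistic of Expression (10.4), with the 0.5 continuity correction."""
    q0 = 1 - p0
    return (abs(successes - n * p0) - 0.5) / sqrt(n * p0 * q0)


def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))


z_cal = binomial_z(successes=18, n=30, p0=0.5)  # N * p_hat = 18 choices of method 1
p_one = upper_tail(z_cal)
p_two = 2 * p_one

print(f"Z_cal = {z_cal:.3f}")
print(f"one-sided P = {p_one:.4f}, two-sided P = {p_two:.4f}")
print("reject H0" if abs(z_cal) > 1.96 else "do not reject H0")
```

Here 1.96 is the bilateral critical value for α = 5% used in the example.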

10.2.1.1 Solving the Binomial Test Using SPSS Software
Example 10.1 will be solved using IBM SPSS Statistics Software®. The use of the images in this section has been authorized by the International Business Machines Corporation©. The data are available in the file Binomial_Test.sav. The procedure for solving the binomial test using SPSS is described below. Let's select Analyze → Nonparametric Tests → Legacy Dialogs → Binomial … (Fig. 10.2). First, let's insert the variable Method into the Test Variable List. In Test Proportion, we must define p = 0.50, since the probability of success and failure is the same (Fig. 10.3). Finally, let's click on OK. The results can be seen in Fig. 10.4. The associated probability for a bilateral test is P = 0.481, similar to the value calculated in Example 10.1. Since P > α (0.481 > 0.05), we do not reject H0, which allows us to conclude, with a 95% confidence level, that p = q = ½.

10.2.1.2 Solving the Binomial Test Using Stata Software
Example 10.1 will also be solved using Stata Statistical Software®. The use of the images presented in this section has been authorized by StataCorp LP©. The data are available in the file Binomial_Test.dta. The syntax of the binomial test on Stata is:

bitest variable* = #p

where the term variable* must be replaced by the variable considered in the analysis and #p by the probability of success specified in the null hypothesis.


FIG. 10.2 Procedure for applying the binomial test on SPSS.

FIG. 10.3 Selecting the variable and the proportion for the binomial test.

In Example 10.1, the variable studied is method and, under the null hypothesis, there are no differences in the choice between both methods, so the command to be typed is:

bitest method = 0.5

The result of the binomial test is shown in Fig. 10.5. We can see that the associated probability for a bilateral test is P = 0.481, similar to the value calculated in Example 10.1, and also obtained via SPSS software. Since P > 0.05, we do not reject H0, which allows us to conclude, with a 95% confidence level, that p = q = ½.


FIG. 10.4 Results of the binomial test.

FIG. 10.5 Results of the binomial test for Example 10.1 on Stata.

10.2.2 Chi-Square Test (χ²) for One Sample

The χ² test presented in this section is an extension of the binomial test and is applied to a single sample in which the variable being studied assumes two or more categories. The variables can be nominal or ordinal. The test compares the frequencies observed to the frequencies expected in each category.
The χ² test assumes the following hypotheses:
H0: there is no significant difference between the frequencies observed and the ones expected
H1: there is a significant difference between the frequencies observed and the ones expected
The statistic for the test, analogous to Expression (4.1) in Chapter 4, is given by:

$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \tag{10.5}$$

where:
$O_i$: the number of observations in the ith category;
$E_i$: the expected frequency of observations in the ith category under H0;
k: the number of categories.

The values of $\chi^2_{cal}$ approximately follow a χ² distribution with ν = k − 1 degrees of freedom. The critical values of the chi-square statistic ($\chi^2_c$) can be found in Table D in the Appendix, which provides the critical values of $\chi^2_c$ where $P(\chi^2_{cal} > \chi^2_c) = \alpha$ (for a right-tailed unilateral test). In order for the null hypothesis H0 to be rejected, the value of the $\chi^2_{cal}$ statistic must be in the critical region (CR), that is, $\chi^2_{cal} > \chi^2_c$. Otherwise, we do not reject H0 (Fig. 10.6).
The P-value (the probability associated to the value of the $\chi^2_{cal}$ statistic calculated from the sample) can also be obtained from Table D. In this case, we reject H0 if P ≤ α.

FIG. 10.6 χ² distribution, highlighting the critical region (CR) and the nonrejection region of H0 (NR).


Example 10.3: Applying the χ² Test to One Sample
A candy store would like to find out if the number of chocolate candies sold daily varies depending on the day of the week. In order to do that, a sample was collected throughout 1 week, chosen randomly, and the results can be seen in Table 10.E.3. Test the hypothesis that sales do not depend on the day of the week. Assume that α = 5%.

TABLE 10.E.3 Frequencies Observed Versus Frequencies Expected

| Events | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|---|
| Frequencies observed | 35 | 24 | 27 | 32 | 25 | 36 | 31 |
| Frequencies expected | 30 | 30 | 30 | 30 | 30 | 30 | 30 |

Solution
Step 1: The most suitable test to compare the frequencies observed to the ones expected from one sample with more than two categories is the χ² test for a single sample.
Step 2: Under the null hypothesis, there are no significant differences between the sales observed and the ones expected for each day of the week. Under the alternative hypothesis, there is a difference in at least one day of the week:

$$H_0: O_i = E_i$$
$$H_1: O_i \neq E_i$$

Step 3: The significance level to be considered is 5%.
Step 4: The value of the statistic is given by:

$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(35 - 30)^2}{30} + \frac{(24 - 30)^2}{30} + \dots + \frac{(31 - 30)^2}{30} = 4.533$$

Step 5: The critical region of the χ² test, considering α = 5% and ν = 6 degrees of freedom, is shown in Fig. 10.7.

FIG. 10.7 Critical Region of Example 10.3.

Step 6: Decision: since the value calculated is not in the critical region, that is, $\chi^2_{cal} < 12.592$, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that the number of chocolate candies sold daily does not vary depending on the day of the week.

If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 of the construction of the hypotheses tests will be:

Step 5: According to Table D in the Appendix, for $\nu = 6$ degrees of freedom, the probability associated to the statistic $\chi^2_{cal} = 4.533$ (P-value) is between 0.1 and 0.9.

Step 6: Decision: since P > 0.05, we do not reject the null hypothesis.
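The arithmetic of Example 10.3 can be checked with a short script. What follows is only a minimal sketch in Python, which is not one of the packages used in this chapter; the function names `chi_square_gof` and `chi2_sf_even_df` are hypothetical. The P-value helper relies on a closed form of the $\chi^2$ right-tail probability that holds only for an even number of degrees of freedom, which happens to be the case here ($\nu = 6$).

```python
import math


def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_even_df(x, df):
    """Right-tail probability P(chi2 > x).

    Assumption: uses the closed form that is valid only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form implemented only for even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Data from Table 10.E.3 (Sunday to Saturday)
observed = [35, 24, 27, 32, 25, 36, 31]
expected = [30] * 7

stat = chi_square_gof(observed, expected)
df = len(observed) - 1            # nu = k - 1 = 6
p_value = chi2_sf_even_df(stat, df)
critical = 12.592                 # critical value quoted in the text (alpha = 5%, 6 df)

print(f"chi2 = {stat:.3f}, df = {df}, P-value = {p_value:.3f}")
print("reject H0" if stat > critical else "do not reject H0")
```

The script reaches the same decision as Step 6, and its P-value falls inside the interval read from Table D.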

10.2.2.1 Solving the $\chi^2$ Test for One Sample Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.3 are available in the file Chi-Square_One_Sample.sav. The procedure for applying the $\chi^2$ test on SPSS is described. First, let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → Chi-Square …, as shown in Fig. 10.8. After that, we should insert the variable Day_week into the Test Variable List. The variable being studied has seven categories. The options Get from data and Use specified range (Lower = 1 and Upper = 7) in Expected Range generate the same results. The frequencies expected for the seven categories are exactly the same. Thus, we must select the option All categories equal in Expected Values, as shown in Fig. 10.9. Finally, let’s click on OK to obtain the results of the $\chi^2$ test, as shown in Fig. 10.10. The value of the $\chi^2$ statistic is 4.533, the same value calculated in Example 10.3. Since the P-value = 0.605 > 0.05 (in Example 10.3, we saw that 0.1 < P < 0.9), we do not reject $H_0$, which allows us to conclude, with a 95% confidence level, that the sales do not depend on the day of the week.

FIG. 10.8 Procedure for elaborating the $\chi^2$ test on SPSS.

10.2.2.2 Solving the $\chi^2$ Test for One Sample Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.3 are available in the file Chi-Square_One_Sample.dta. The variable being studied is day_week. The $\chi^2$ test for one sample on Stata can be obtained from the command csgof (chi-square goodness of fit), which allows us to compare the distribution of frequencies observed to the ones expected of a certain categorical variable with more than two categories. In order for this command to be used, first, we must type:

    findit csgof

and install it through the link csgof from http://www.ats.ucla.edu/stat/stata/ado/analysis. After doing this, we can type the following command:

    csgof day_week

The result is shown in Fig. 10.11. Both the value of the statistic and the probability associated to it match the ones calculated in Example 10.3 and obtained on SPSS.

10.2.3 Sign Test for One Sample

The sign test is an alternative to the t-test for a single random sample when the data distribution of the population does not follow a normal distribution. The only assumption required by the sign test is that the distribution of the variable be continuous.

FIG. 10.9 Selecting the variable and the procedure to elaborate the $\chi^2$ test.

FIG. 10.10 Results of the $\chi^2$ test for Example 10.3 on SPSS.

The sign test is based on the population median ($\mu$). The probability of obtaining a sample value that is less than the median and the probability of obtaining a sample value that is greater than the median are the same ($p = ½$). The null hypothesis of the test is that $\mu$ is equal to a certain value specified by the investigator ($\mu_0$). For a bilateral test, we have:

$H_0$: $\mu = \mu_0$
$H_1$: $\mu \ne \mu_0$

The quantitative data are converted into signs, (+) or (−), that is, values greater than the median ($\mu_0$) are represented by (+) and values less than $\mu_0$ by (−). Data with values equal to $\mu_0$ are excluded from the sample. Thus, the sign test is applied to ordinal data and offers little power to the researcher, since this conversion results in a considerable loss of information regarding the original data.

FIG. 10.11 Results of the $\chi^2$ test for Example 10.3 on Stata.

Small samples

Let’s establish that $N$ is the number of positive and negative signs (sample size disregarding any ties) and $k$ is the number of signs that corresponds to the lowest frequency. For small samples ($N \le 25$), we will use the binomial test with $p = ½$ to calculate $P(Y \le k)$. This probability can be obtained directly from Table F2 in the Appendix.

Large samples

When $N > 25$, the binomial distribution is more similar to a normal distribution. The value of $Z$ is given by:

$$Z = \frac{(X \pm 0.5) - N/2}{0.5\sqrt{N}} \sim N(0, 1) \tag{10.6}$$

where $X$ corresponds to the lowest or highest frequency. If $X$ represents the lowest frequency, we must calculate $X + 0.5$. On the other hand, if $X$ represents the highest frequency, we must calculate $X - 0.5$.

Example 10.4: Applying the Sign Test to a Single Sample
We estimate that the median retirement age in a certain Brazilian city is 65. One random sample with 20 retirees was drawn from the population and the results can be seen in Table 10.E.4. Test the null hypothesis that $\mu = 65$, at the significance level of 10%.

TABLE 10.E.4 Retirement Age

59  62  66  37  60  64  66  70  72  61
64  66  68  72  78  93  79  65  67  59

Solution

Step 1: Since the data do not follow a normal distribution, the most suitable test for testing the population median is the sign test.

Step 2: The hypotheses of the test are:

$H_0$: $\mu = 65$
$H_1$: $\mu \ne 65$

Step 3: The significance level to be considered is 10%.

Step 4: Let’s calculate $P(Y \le k)$. To facilitate our understanding, let’s sort the data in Table 10.E.4 in ascending order.

TABLE 10.E.5 Data From Table 10.E.4 Sorted in Ascending Order

37  59  59  60  61  62  64  64  65  66
66  66  67  68  70  72  72  78  79  93

Excluding value 65 (a tie), the number of (−) signs is 8, the number of (+) signs is 11, and $N = 19$. From Table F2 in the Appendix, for $N = 19$, $k = 8$, and $p = ½$, the associated unilateral probability is $P_1 = 0.324$. Since we are using a bilateral test, this value must be doubled, so the associated bilateral probability is 0.648 (P-value).

Step 5: Decision: since $P > \alpha$ (0.648 > 0.10), we do not reject $H_0$, a fact that allows us to conclude, with a 90% confidence level, that $\mu = 65$.
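The probability read from Table F2 can also be computed directly from the binomial distribution. The sketch below, in Python, is only an illustration of the small-sample procedure described above; the function name `sign_test_exact` is hypothetical and the doubling of the unilateral probability follows the convention used in the example.

```python
from math import comb


def sign_test_exact(values, median0):
    """Exact sign test for one sample against a hypothesized median.

    Ties (values equal to median0) are dropped, as in the text.
    Returns (n_plus, n_minus, unilateral P, bilateral P).
    """
    n_plus = sum(1 for v in values if v > median0)
    n_minus = sum(1 for v in values if v < median0)
    n = n_plus + n_minus                     # sample size disregarding ties
    k = min(n_plus, n_minus)                 # lowest frequency
    # P(Y <= k) for Y ~ Binomial(n, 1/2)
    p_one = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_two = min(1.0, 2 * p_one)              # bilateral test: double it
    return n_plus, n_minus, p_one, p_two


# Data from Table 10.E.4
ages = [59, 62, 66, 37, 60, 64, 66, 70, 72, 61,
        64, 66, 68, 72, 78, 93, 79, 65, 67, 59]

n_plus, n_minus, p_one, p_two = sign_test_exact(ages, 65)
print(f"(+) = {n_plus}, (-) = {n_minus}, P1 = {p_one:.3f}, P-value = {p_two:.3f}")
```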

10.2.3.1 Solving the Sign Test for One Sample Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. SPSS makes the sign test available only for two related samples (2 Related Samples). Thus, in order for us to use the test for a single sample, we must generate a new variable with $n$ values (sample size including ties), all of them equal to $\mu_0$. The data in Example 10.4 are available in the file Sign_Test_One_Sample.sav. The procedure for applying the sign test on SPSS is shown. First of all, we must click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.12. After that, we must insert variable 1 (Age_pop) and variable 2 (Age_sample) into Test Pairs. Let’s select the option regarding the sign test (Sign) in Test Type, as shown in Fig. 10.13. Next, let’s click on OK to obtain the results of the sign test, as shown in Figs. 10.14 and 10.15. Fig. 10.14 shows the frequencies of negative and positive signs, the total number of ties, and the total frequency. Fig. 10.15 shows the associated probability for a bilateral test, which matches the value found in Example 10.4. Since P = 0.648 > 0.10, we do not reject the null hypothesis, which allows us to conclude, with a 90% confidence level, that the median retirement age is 65.

FIG. 10.12 Procedure for elaborating the sign test on SPSS.

FIG. 10.13 Selecting the variables and the sign test.

FIG. 10.14 Frequencies observed.

FIG. 10.15 Sign test for Example 10.4 on SPSS.

10.2.3.2 Solving the Sign Test for One Sample Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. Unlike SPSS, Stata makes the sign test for one sample available. On Stata, the sign test for a single sample as well as for two paired samples can be obtained from the command signtest. The syntax of the test for one sample is:

    signtest variable* = #

where the term variable* must be replaced by the variable considered in the analysis and # by the value of the population median to be tested. The data in Example 10.4 are available in the file Sign_Test_One_Sample.dta. The variable analyzed is age and the main objective is to verify if the median retirement age is 65. The command to be typed is:

    signtest age = 65

FIG. 10.16 Results of the sign test for Example 10.4 on Stata.

The result of the test is shown in Fig. 10.16. Analogous to the results presented in Example 10.4 and also generated on SPSS, the number of positive signs is 11, the number of negative signs is 8, and the associated probability for a bilateral test is 0.648. Since P > 0.10, we do not reject the null hypothesis, which allows us to conclude, with a 90% confidence level, that the median retirement age is 65.

10.3 TESTS FOR TWO PAIRED SAMPLES

These tests investigate if two samples are somehow related. The most common examples analyze a situation before and after a certain event. We will study the following tests: the McNemar test for binary variables and the sign and Wilcoxon tests for ordinal variables.

10.3.1 McNemar Test

The McNemar test is applied to assess the significance of changes in two related samples with qualitative or categorical variables that assume only two categories (binary variables). The main goal of the test is to verify if there are any significant changes before and after the occurrence of a certain event. In order to do that, let’s use a 2 × 2 contingency table, as shown in Table 10.2. According to Siegel and Castellan (2006), the + and − signs are used to represent the possible changes in the answers before and after. The frequencies of each occurrence are represented in their respective cells in Table 10.2. For example, if there are changes from the first answer (+) to the second answer (−), the result will be written in the upper right cell, so B represents the total number of observations that presented changes in their behavior from (+) to (−). Analogously, if there are changes from the first answer (−) to the second answer (+), the result will be written in the lower left cell, so C represents the total number of observations that presented changes in their behavior from (−) to (+).

TABLE 10.2 2 × 2 Contingency Table

| Before \ After | + | − |
|---|---|---|
| + | A | B |
| − | C | D |

On the other hand, while A represents the total number of observations that remained with the same answer (+) before and after, D represents the total number of observations with the same answer (−) in both periods. Thus, the total number of individuals that change their answer can be represented by B + C. Under the null hypothesis of the test, changes in each direction are equally likely, that is:

$H_0$: $P(B \to C) = P(C \to B)$
$H_1$: $P(B \to C) \ne P(C \to B)$

According to Siegel and Castellan (2006), the McNemar statistic is calculated based on the chi-square ($\chi^2$) statistic presented in Expression (10.5), with the expected frequency in each of the two change cells equal to $(B + C)/2$, that is:

$$\chi^2_{cal} = \sum_{i=1}^{2} \frac{(O_i - E_i)^2}{E_i} = \frac{\left(B - (B + C)/2\right)^2}{(B + C)/2} + \frac{\left(C - (B + C)/2\right)^2}{(B + C)/2} = \frac{(B - C)^2}{B + C} \sim \chi^2_1 \tag{10.7}$$

According to the same authors, a continuity correction must be used so that the continuous $\chi^2$ distribution better approximates the discrete distribution of the statistic, so:

$$\chi^2_{cal} = \frac{(|B - C| - 1)^2}{B + C} \text{ with 1 degree of freedom} \tag{10.8}$$

The value calculated must be compared to the critical value of the $\chi^2$ distribution (Table D in the Appendix). This table provides the critical values of $\chi^2_c$ such that $P(\chi^2_{cal} > \chi^2_c) = \alpha$ (for a right-tailed unilateral test). If the value of the statistic is in the critical region, that is, if $\chi^2_{cal} > \chi^2_c$, we reject $H_0$. Otherwise, we should not reject $H_0$. The probability associated to the $\chi^2_{cal}$ statistic (P-value) can also be obtained from Table D. In this case, the null hypothesis is rejected if $P \le \alpha$. Otherwise, we do not reject $H_0$.

Example 10.5: Applying the McNemar Test
A bill proposing the end of full retirement pensions for federal civil servants was being analyzed by the Senate. Aiming at verifying if this measure would bring any changes in the number of people taking public exams, an interview with 60 workers was carried out, before and after the reform, so that they could express their preference in working for a private or a public organization. The results can be seen in Table 10.E.6. Test the hypothesis that there were no significant changes in the workers’ answers before and after the social security reform. Assume that $\alpha = 5\%$.

TABLE 10.E.6 Contingency Table

| Before the Reform \ After the Reform | Private | Public |
|---|---|---|
| Private | 22 | 3 |
| Public | 21 | 14 |

Solution

Step 1: McNemar is the most suitable test for evaluating the significance of before and after type changes in two related samples, applied to nominal or categorical variables.

Step 2: Under the null hypothesis, the reform would not be effective in changing people’s preferences towards the private sector. In other words, among the workers who changed their preferences, the probability of them changing their preference from private to public organizations after the reform is the same as the probability of them changing from public to private organizations. That is:

$H_0$: $P(\text{Private} \to \text{Public}) = P(\text{Public} \to \text{Private})$
$H_1$: $P(\text{Private} \to \text{Public}) \ne P(\text{Public} \to \text{Private})$

Step 3: The significance level to be considered is 5%.

Step 4: The value of the statistic, according to Expression (10.7), is:

$$\chi^2_{cal} = \frac{(|B - C|)^2}{B + C} = \frac{(|3 - 21|)^2}{3 + 21} = 13.5 \text{ with } \nu = 1$$

If we use the correction factor, the value of the statistic from Expression (10.8) becomes:

$$\chi^2_{cal} = \frac{(|B - C| - 1)^2}{B + C} = \frac{(|3 - 21| - 1)^2}{3 + 21} = 12.042 \text{ with } \nu = 1$$

Step 5: The value of the critical chi-square ($\chi^2_c$) obtained from Table D in the Appendix, considering $\alpha = 5\%$ and $\nu = 1$ degree of freedom, is 3.841.

Step 6: Decision: since the value calculated is in the critical region, that is, $\chi^2_{cal} > 3.841$, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there were significant changes in the choice of working at a private or a public organization after the social security reform.

If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:

Step 5: According to Table D in the Appendix, for $\nu = 1$ degree of freedom, the probability associated to statistic $\chi^2_{cal} = 12.042$ or 13.5 (P-value) is less than 0.005 (a probability of 0.005 is associated to statistic $\chi^2_{cal} = 7.879$).

Step 6: Decision: since P < 0.05, we must reject $H_0$.
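Expressions (10.7) and (10.8) are simple enough to verify numerically. The following Python sketch is only illustrative; the function name `mcnemar` is hypothetical. For the P-value it uses the fact that a $\chi^2$ variable with 1 degree of freedom is the square of a standard normal variable, so that $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def mcnemar(b, c, correction=False):
    """McNemar statistic from the two discordant cells B and C.

    correction=False follows Expression (10.7); correction=True follows (10.8).
    Returns (statistic, P-value) using a chi-square distribution with 1 df.
    """
    if b + c == 0:
        raise ValueError("the test needs at least one change (B + C > 0)")
    diff = abs(b - c)
    if correction:
        diff -= 1                                  # continuity correction
    stat = diff ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))     # P(chi2 with 1 df > stat)
    return stat, p_value


# Discordant cells from Table 10.E.6: B = 3 (Private -> Public), C = 21 (Public -> Private)
stat, p_value = mcnemar(3, 21)
stat_corr, p_value_corr = mcnemar(3, 21, correction=True)

print(f"without correction: chi2 = {stat:.3f}, P-value = {p_value:.5f}")
print(f"with correction:    chi2 = {stat_corr:.3f}, P-value = {p_value_corr:.5f}")
```

Both versions of the statistic lie well above the critical value 3.841, in agreement with the decision in Step 6.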

10.3.1.1 Solving the McNemar Test Using SPSS Software

Example 10.5 will be solved using SPSS software. The use of the images in this section has been authorized by the International Business Machines Corporation©. The data are available in the file McNemar_Test.sav. The procedure for applying the McNemar test on SPSS is presented. Let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.17. After that, we should insert variable 1 (Before) and variable 2 (After) into Test Pairs. Let’s select the McNemar test option in Test Type, as shown in Fig. 10.18. Finally, we must click on OK to obtain Figs. 10.19 and 10.20. Fig. 10.19 shows the frequencies observed before and after the reform (Contingency Table). The result of the McNemar test is shown in Fig. 10.20. According to Fig. 10.20, the significance level observed in the McNemar test is 0.000, a value lower than 5%, so the null hypothesis is rejected. Hence, we may conclude, with a 95% confidence level, that there was a significant change in choosing to work at a public or a private organization after the social security reform.

10.3.1.2 Solving the McNemar Test Using Stata Software

Example 10.5 will also be solved using Stata software. The use of the images presented in this section has been authorized by Stata Corp LP©. The data are available in the file McNemar_Test.dta. The McNemar test can be calculated on Stata by using the command mcc followed by the paired variables. In our example, the paired variables are called before and after, so the command to be typed is:

    mcc before after

The result of the McNemar test is shown in Fig. 10.21. We can see that the value of the statistic is 13.5, similar to the value calculated by Expression (10.7), without the correction factor. The significance level observed from the test is 0.000, lower than 5%, which allows us to conclude, with a 95% confidence level, that there was a significant change before and after the reform. The result of the McNemar test could have also been obtained by using the command mcci 14 21 3 22.

10.3.2 Sign Test for Two Paired Samples

The sign test can also be applied to two paired samples. In this case, the sign is given by the difference between the pairs, that is, if the difference results in a positive number, each pair of values is replaced by a (+) sign. On the other hand, if the result of the difference is negative, each pair of values is replaced by a (−) sign. In case of a tie, the data will be excluded from the sample.

FIG. 10.17 Procedure for elaborating the McNemar test on SPSS.

FIG. 10.18 Selecting the variables and McNemar test.

FIG. 10.19 Frequencies observed.

FIG. 10.20 McNemar Test for Example 10.5 on SPSS.

FIG. 10.21 Results of the McNemar test for Example 10.5 on Stata.

Analogous to the sign test for a single sample, the sign test presented in this section is also an alternative to the t-test for comparing two related samples when the data distribution is not normal. In this case, the quantitative data are transformed into ordinal data. Thus, the sign test is much less powerful than the t-test, because it only uses the sign of the difference between the pairs as information. Under the null hypothesis, the population median of the differences ($\mu_d$) is zero. Therefore, for a bilateral test, we have:

$H_0$: $\mu_d = 0$
$H_1$: $\mu_d \ne 0$

In other words, we test the hypothesis that there are no differences between both samples (the samples come from populations with the same median and the same continuous distribution), that is, the number of (+) signs is the same as the number of (−) signs. The same procedure presented in Section 10.2.3 for a single sample will be used in order to calculate the sign statistic in the case of two paired samples.

Small samples

We say that $N$ is the number of positive and negative signs (sample size disregarding the ties) and $k$ is the number of signs that corresponds to the lowest frequency. If $N \le 25$, we will use the binomial test with $p = ½$ to calculate $P(Y \le k)$. This probability can be obtained directly from Table F2 in the Appendix.

Large samples

When $N > 25$, the binomial distribution is more similar to a normal distribution, and the value of $Z$ is given by Expression (10.6):

$$Z = \frac{(X \pm 0.5) - N/2}{0.5\sqrt{N}} \sim N(0, 1)$$

where $X$ corresponds to the lowest or highest frequency. If $X$ represents the lowest frequency, we must use $X + 0.5$. On the other hand, if $X$ represents the highest frequency, we must use $X - 0.5$.

Example 10.6: Applying the Sign Test to Two Paired Samples
A group of 30 workers is submitted to a training course aiming at improving their productivity. The result, in terms of the average number of parts produced per hour per employee before and after the training, is shown in Table 10.E.7. Test the null hypothesis that there were no alterations in productivity before and after the training course. Assume that $\alpha = 5\%$.

TABLE 10.E.7 Productivity Before and After the Training Course

| Before | After | Difference Sign |
|---|---|---|
| 36 | 40 | + |
| 39 | 41 | + |
| 27 | 29 | + |
| 41 | 45 | + |
| 40 | 39 | − |
| 44 | 42 | − |
| 38 | 39 | + |
| 42 | 40 | − |
| 40 | 42 | + |
| 43 | 45 | + |
| 37 | 35 | − |
| 41 | 40 | − |
| 38 | 38 | 0 |
| 45 | 43 | − |
| 40 | 40 | 0 |
| 39 | 42 | + |
| 38 | 41 | + |
| 39 | 39 | 0 |
| 41 | 40 | − |
| 36 | 38 | + |
| 38 | 36 | − |
| 40 | 38 | − |
| 36 | 35 | − |
| 40 | 42 | + |
| 40 | 41 | + |
| 38 | 40 | + |
| 37 | 39 | + |
| 40 | 42 | + |
| 38 | 36 | − |
| 40 | 40 | 0 |

Solution

Step 1: Since the data do not follow a normal distribution, the sign test can be an alternative to the t-test for two paired samples.

Step 2: The null hypothesis assumes that there is no difference in productivity before and after the training course, that is:

$H_0$: $\mu_d = 0$
$H_1$: $\mu_d \ne 0$

Step 3: The significance level to be considered is 5%.

Step 4: Since $N > 25$, the binomial distribution is more similar to a normal distribution, and the value of $Z$ is given by:

$$Z = \frac{(X \pm 0.5) - N/2}{0.5 \times \sqrt{N}} = \frac{(11 + 0.5) - 13}{0.5 \times \sqrt{26}} = -0.588$$

Step 5: By using the standard normal distribution table (Table E in the Appendix), we must determine the critical region (CR) for a bilateral test, as shown in Fig. 10.22.

FIG. 10.22 Critical region of Example 10.6.

Step 6: Decision: since the value calculated is not in the critical region, that is, $-1.96 \le Z_{cal} \le 1.96$, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.

If, instead of comparing the value calculated to the critical value of the standard normal distribution, we use the P-value, Steps 5 and 6 will be:

Step 5: According to Table E in the Appendix, the unilateral probability associated to statistic $Z_{cal} = -0.59$ is $P_1 = 0.278$. For a bilateral test, this probability must be doubled (P-value = 0.556).

Step 6: Decision: since P > 0.05, we do not reject the null hypothesis.

10.3.2.1 Solving the Sign Test for Two Paired Samples Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.6 can be found in the file Sign_Test_Two_Paired_Samples.sav. The procedure for applying the sign test to two paired samples on SPSS is shown. We have to click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.23. After that, let’s insert variable 1 (Before) and variable 2 (After) into Test Pairs. Let’s also select the option regarding the sign test (Sign) in Test Type, as shown in Fig. 10.24. Finally, let’s click on OK to obtain the results of the sign test for two paired samples (Figs. 10.25 and 10.26). Fig. 10.25 shows the frequencies of negative and positive signs, the total number of ties, and the total frequency. Fig. 10.26 shows the result of the z test, as well as the associated probability P for a bilateral test, values that match the ones calculated in Example 10.6. Since P = 0.556 > 0.05, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.

FIG. 10.23 Procedure for elaborating the sign test on SPSS.

FIG. 10.24 Selecting the variables and the sign test.

FIG. 10.25 Frequencies observed.

FIG. 10.26 Sign test (two paired samples) for Example 10.6 on SPSS.

10.3.2.2 Solving the Sign Test for Two Paired Samples Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.6 are also available on Stata in the file Sign_Test_Two_Paired_Samples.dta. The paired variables are before and after. As discussed in Section 10.2.3.2 for a single sample, the sign test on Stata is carried out from the command signtest. In the case of two paired samples, we must use the same command. However, it must be followed by the names of the paired variables, with the equal sign between them, since the objective is to test the equality of the respective medians. Thus, the command to be typed for our example is:

    signtest after = before

The result of the test is shown in Fig. 10.27 and includes the number of positive signs (15), the number of negative signs (11), as well as the probability associated to the statistic for a bilateral test (P = 0.557). These values are similar to the ones calculated in Example 10.6 and also generated on SPSS. Since P > 0.05, we do not reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.

10.3.3 Wilcoxon Test

Analogous to the sign test for two paired samples, the Wilcoxon test is an alternative to the t-test when the data distribution does not follow a normal distribution. The Wilcoxon test is an extension of the sign test; however, it is more powerful. Besides the information about the direction of the differences for each pair, the Wilcoxon test considers the magnitude of the difference within the pairs (Fávero et al., 2009). The logical foundations and the method used in the Wilcoxon test are described, based on Siegel and Castellan (2006).

Let’s assume that $d_i$ is the difference between the values for each pair of data. First of all, we have to place all of the $d_i$’s in ascending order according to their absolute value (without considering the sign) and calculate the respective ranks using this order. For example, position 1 is attributed to the lowest $|d_i|$, position 2 to the second lowest, and so on. At the end, we must attribute the sign of the difference $d_i$ to each rank. The sum of all positive ranks is represented by $S_p$ and the sum of all negative ranks by $S_n$.

Occasionally, the values for a certain pair of data are the same ($d_i = 0$). In this case, they are excluded from the sample. It is the same procedure used in the sign test, so the value of $N$ represents the sample size disregarding these ties.

FIG. 10.27 Results of the sign test (two paired samples) for Example 10.6 on Stata.

Another type of tie may happen, in which two or more differences have the same absolute value. In this case, the same rank will be attributed to the ties, which will correspond to the mean of the ranks that would have been attributed if the differences had been different. For example, suppose that three pairs of data yield differences that all have absolute value 1. Rank 2 is attributed to each pair, which corresponds to the average of 1, 2, and 3. The next value in order will receive rank 4, since ranks 1, 2, and 3 have already been used.

The null hypothesis assumes that the median of the differences in the population ($\mu_d$) is zero, that is, the populations do not differ in location. For a bilateral test, we have:

$H_0$: $\mu_d = 0$
$H_1$: $\mu_d \ne 0$

In other words, we must test the hypothesis that there are no differences between both samples (the samples come from populations with the same median and the same continuous distribution), that is, the sum of the positive ranks ($S_p$) is the same as the sum of the negative ranks ($S_n$).

Small samples

If $N \le 15$, Table I in the Appendix shows the unilateral probabilities associated to the several critical values of $S_c$ ($P(S_p > S_c) = \alpha$). For a bilateral test, this value must be doubled. If the probability obtained (P-value) is less than or equal to $\alpha$, we must reject $H_0$.

Large samples

As $N$ grows, the Wilcoxon distribution becomes more similar to a standard normal distribution. Thus, for $N > 15$, we must calculate the value of variable $z$ that, according to Siegel and Castellan (2006), Fávero et al. (2009), and Maroco (2014), is:

$$Z_{cal} = \frac{\min(S_p, S_n) - \dfrac{N(N + 1)}{4}}{\sqrt{\dfrac{N(N + 1)(2N + 1)}{24} - \dfrac{\sum_{j=1}^{g} t_j^3 - \sum_{j=1}^{g} t_j}{48}}} \tag{10.9}$$

where: $\dfrac{\sum_{j=1}^{g} t_j^3 - \sum_{j=1}^{g} t_j}{48}$ is a correction factor whenever there are ties; $g$: the number of groups of tied ranks; $t_j$: the number of tied observations in group $j$.

The value calculated must be compared to the critical value of the standard normal distribution (Table E in the Appendix). This table provides the critical values of $z_c$ such that $P(Z_{cal} > z_c) = \alpha$ (for a right-tailed unilateral test). For a bilateral test, we have $P(Z_{cal} < -z_c) = P(Z_{cal} > z_c) = \alpha/2$. The null hypothesis $H_0$ of a bilateral test is rejected if the value of the $Z_{cal}$ statistic is in the critical region, that is, if $Z_{cal} < -z_c$ or $Z_{cal} > z_c$. Otherwise, we do not reject $H_0$. The unilateral probabilities associated to statistic $Z_{cal}$ ($P_1$) can also be obtained from Table E. For a unilateral test, we consider $P = P_1$. For a bilateral test, this probability must be doubled ($P = 2P_1$). Thus, for both tests, we reject $H_0$ if $P \le \alpha$.

Example 10.7: Applying the Wilcoxon Test
A group of 18 students from the 12th grade took an English proficiency exam, without ever having taken an extracurricular course. The same group of students was submitted to an intensive English course for 6 months and, at the end, they took the proficiency exam again. The results can be seen in Table 10.E.8. Test the hypothesis that there was no improvement before and after the course. Assume that $\alpha = 5\%$.

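TABLE 10.E.8 Students’ Grades Before and After the Intensive Course

| Before | After |
|---|---|
| 56 | 60 |
| 65 | 62 |
| 70 | 74 |
| 78 | 79 |
| 47 | 53 |
| 52 | 59 |
| 64 | 65 |
| 70 | 75 |
| 72 | 75 |
| 78 | 88 |
| 80 | 78 |
| 26 | 26 |
| 55 | 63 |
| 60 | 59 |
| 71 | 71 |
| 66 | 75 |
| 60 | 71 |
| 17 | 24 |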

Solution

Step 1: Since the data do not follow a normal distribution, the Wilcoxon test can be applied, because it is more powerful than the sign test for two paired samples.

Step 2: Under the null hypothesis, there is no difference in the students’ performance before and after the course, that is:

$H_0$: $\mu_d = 0$
$H_1$: $\mu_d \ne 0$

Step 3: The significance level to be considered is 5%.

Step 4: Since $N > 15$, the Wilcoxon distribution is more similar to a normal distribution. In order to calculate the value of $z$, first of all, we have to calculate $d_i$ and the respective ranks, as shown in Table 10.E.9.

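TABLE 10.E.9 Calculation of $d_i$ and the Respective Ranks

| Before | After | $d_i$ | $d_i$’s Rank |
|---|---|---|---|
| 56 | 60 | 4 | 7.5 |
| 65 | 62 | −3 | 5.5 |
| 70 | 74 | 4 | 7.5 |
| 78 | 79 | 1 | 2 |
| 47 | 53 | 6 | 10 |
| 52 | 59 | 7 | 11.5 |
| 64 | 65 | 1 | 2 |
| 70 | 75 | 5 | 9 |
| 72 | 75 | 3 | 5.5 |
| 78 | 88 | 10 | 15 |
| 80 | 78 | −2 | 4 |
| 26 | 26 | 0 | |
| 55 | 63 | 8 | 13 |
| 60 | 59 | −1 | 2 |
| 71 | 71 | 0 | |
| 66 | 75 | 9 | 14 |
| 60 | 71 | 11 | 16 |
| 17 | 24 | 7 | 11.5 |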

Since there are two pairs of data with equal values ($d_i = 0$), they are excluded from the sample, so $N = 16$. The sum of the positive ranks is $S_p = 2 + \cdots + 16 = 124.5$. The sum of the negative ranks is $S_n = 2 + 4 + 5.5 = 11.5$. Thus, we can calculate the value of $z$ by using Expression (10.9):

$$Z_{cal} = \frac{\min(S_p, S_n) - \dfrac{N(N + 1)}{4}}{\sqrt{\dfrac{N(N + 1)(2N + 1)}{24} - \dfrac{\sum_{j=1}^{g} t_j^3 - \sum_{j=1}^{g} t_j}{48}}} = \frac{11.5 - \dfrac{16 \times 17}{4}}{\sqrt{\dfrac{16 \times 17 \times 33}{24} - \dfrac{59 - 11}{48}}} = -2.925$$

Step 5: By using the standard normal distribution table (Table E in the Appendix), we determine the critical region (CR) for the bilateral test, as shown in Fig. 10.28.

FIG. 10.28 Critical region of Example 10.7.

Step 6: Decision: since the value calculated is in the critical region, that is, $Z_{cal} < -1.96$, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that there is a difference in the students’ performance before and after the course.

If, instead of comparing the value calculated to the critical value of the standard normal distribution, we use the P-value, Steps 5 and 6 will be:

Step 5: According to Table E in the Appendix, the unilateral probability associated to statistic $Z_{cal} = -2.925$ is $P_1 = 0.0017$. For a bilateral test, this probability must be doubled (P-value = 0.0034).

Step 6: Decision: since P < 0.05, we must reject the null hypothesis.
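The ranking procedure and Expression (10.9) can be sketched in a few lines of Python. The function name `wilcoxon_signed_rank` is hypothetical, and one assumption should be noted: the sketch identifies groups of tied ranks only among the nonzero absolute differences, which is one possible convention, so its correction term may differ slightly from the one used in the worked example, although the resulting statistic agrees to three decimal places.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Large-sample Wilcoxon test for two paired samples, Expression (10.9).

    Zero differences are excluded; tied absolute differences get mean ranks.
    Returns (N, S_p, S_n, Z, bilateral P).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)

    # Positions (1-based) occupied by each absolute difference once sorted
    positions = {}
    for pos, value in enumerate(sorted(abs(d) for d in diffs), start=1):
        positions.setdefault(value, []).append(pos)
    mean_rank = {value: sum(p) / len(p) for value, p in positions.items()}

    s_pos = sum(mean_rank[abs(d)] for d in diffs if d > 0)
    s_neg = sum(mean_rank[abs(d)] for d in diffs if d < 0)

    # Assumption: tie groups are counted among nonzero differences only
    ties = [len(p) for p in positions.values() if len(p) > 1]
    correction = (sum(t ** 3 for t in ties) - sum(ties)) / 48
    variance = n * (n + 1) * (2 * n + 1) / 24 - correction

    z = (min(s_pos, s_neg) - n * (n + 1) / 4) / math.sqrt(variance)
    p_two = math.erfc(abs(z) / math.sqrt(2))     # 2 * P(Z > |z|)
    return n, s_pos, s_neg, z, p_two


# Data from Table 10.E.8
before = [56, 65, 70, 78, 47, 52, 64, 70, 72, 78, 80, 26, 55, 60, 71, 66, 60, 17]
after = [60, 62, 74, 79, 53, 59, 65, 75, 75, 88, 78, 26, 63, 59, 71, 75, 71, 24]

n, s_pos, s_neg, z, p_two = wilcoxon_signed_rank(before, after)
print(f"N = {n}, Sp = {s_pos}, Sn = {s_neg}, Z = {z:.3f}, P-value = {p_two:.4f}")
```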

10.3.3.1 Solving the Wilcoxon Test Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.7 are available in the file Wilcoxon_Test.sav. The procedure for applying the Wilcoxon test to two paired samples on SPSS is shown below. Let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.29. First of all, let’s insert variable 1 (Before) and variable 2 (After) into Test Pairs. Let’s also select the option related to the Wilcoxon test in Test Type, as shown in Fig. 10.30. Finally, let’s click on OK to obtain the results of the Wilcoxon test for two paired samples (Figs. 10.31 and 10.32). Fig. 10.31 shows the number of negative, positive, and tied ranks, besides the mean and the sum of all positive and negative ranks. Fig. 10.32 shows the result of the z test, besides the associated P probability for a bilateral test, values similar to the ones found in Example 10.7. Since P = 0.003 < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in the students’ performance before and after the course.

FIG. 10.29 Procedure for elaborating the Wilcoxon test on SPSS.

FIG. 10.30 Selecting the variables and Wilcoxon test.

FIG. 10.31 Ranks.

FIG. 10.32 Wilcoxon test for Example 10.7 on SPSS.

10.3.3.2 Solving the Wilcoxon Test Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.7 are available in the file Wilcoxon_Test.dta. The paired variables are called before and after. On Stata, the Wilcoxon test is carried out with the command signrank, followed by the names of the paired variables with an equal sign between them. For our example, we must type the following command:
signrank before = after

FIG. 10.33 Results of the Wilcoxon test for Example 10.7 on Stata.

The result of the test is shown in Fig. 10.33. Since P < 0.05, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in the students’ performance before and after the course.

10.4 TESTS FOR TWO INDEPENDENT SAMPLES

In these tests, we compare two populations represented by their respective samples. Unlike the tests for two paired samples, here it is not necessary for the samples to have the same size. Among the tests for two independent samples, we can highlight the chi-square test (for nominal or ordinal variables) and the Mann-Whitney test (for ordinal variables).

10.4.1 Chi-Square Test (χ²) for Two Independent Samples

In Section 10.2.2, the χ² test was applied to a single sample in which the variable being studied was qualitative (nominal or ordinal). Here the test will be applied to two independent samples, from nominal or ordinal qualitative variables. This test has already been studied in Chapter 4 (Section 4.2.2), in order to verify if there is an association between two qualitative variables, and it will be described once again in this section. The test compares the frequencies observed in each one of the cells of a contingency table to the frequencies expected. The χ² test for two independent samples assumes the following hypotheses:
H0: there is no significant difference between the frequencies observed and the ones expected
H1: there is a significant difference between the frequencies observed and the ones expected
Therefore, the χ² statistic measures the discrepancy between a table with the contingency observed and a table with the contingency expected, starting from the hypothesis that there is no connection between the categories of both variables studied. If the distribution of frequencies observed is exactly the same as the distribution of frequencies expected, the result of the χ² statistic is zero. Thus, a low value of χ² indicates independence between the variables. As already presented in Expression (4.1) in Chapter 4, the χ² statistic for two independent samples is given by:

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (10.10)$$

where:
Oij: the number of observations in the ith category of variable X and in the jth category of variable Y;
Eij: the expected frequency of observations in the ith category of variable X and in the jth category of variable Y;
I: the number of categories (rows) of variable X;
J: the number of categories (columns) of variable Y.

FIG. 10.34 χ² distribution.

The values of χ²cal approximately follow a χ² distribution with ν = (I − 1)(J − 1) degrees of freedom. The critical values of the chi-square statistic (χ²c) can be found in Table D, in the Appendix. This table provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). In order for the null hypothesis H0 to be rejected, the value of the χ²cal statistic must be in the critical region, that is, χ²cal > χ²c. Otherwise, we do not reject H0 (Fig. 10.34).

Example 10.8: Applying the χ² Test to Two Independent Samples
Let’s consider Example 4.1 in Chapter 4 once again, which refers to a study carried out with 200 individuals aiming at analyzing the joint behavior of variable X (Health insurance agency) with variable Y (Level of satisfaction). The contingency table showing the joint distribution of the variables’ absolute frequencies, besides the marginal totals, is presented in Table 10.E.10. Test the hypothesis that there is no association between the categories of both variables, considering α = 5%.

TABLE 10.E.10 Joint Distribution of the Absolute Frequencies of the Variables Being Studied

                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health    40             16        12          68
Live Life       32             24        16          72
Mena Health     24             32        4           60
Total           96             72        32          200

Solution
Step 1: The most suitable test to compare the frequencies observed in each cell of a contingency table to the frequencies expected is the χ² test for two independent samples.
Step 2: The null hypothesis states that there are no connections between the categories of variables Agency and Level of satisfaction, that is, the frequencies observed and expected are the same for each pair of variable categories. The alternative hypothesis states that there are differences in at least one pair of categories:
H0: Oij = Eij
H1: Oij ≠ Eij
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate the statistic, it is necessary to compare the values observed and the values expected. Table 10.E.11 presents the values observed with their respective relative frequencies in relation to the row’s general total. The calculation could also be done in relation to the column’s general total, achieving the same result for the χ² statistic. The data in Table 10.E.11 suggest a dependence between the variables. Supposing that there was no connection between the variables, we would expect, for all three agencies, a proportion of 48% of the row total in the Dissatisfied level, 36% in the Neutral level, and 16% in the Satisfied level. The calculations of the values expected can be found in Table 10.E.12. For example, the calculation of the first cell is 0.48 × 68 = 32.6. In order to calculate the χ² statistic, we must apply Expression (10.10) to the data in Tables 10.E.11 and 10.E.12. The calculation of each term (Oij − Eij)²/Eij is represented in Table 10.E.13, jointly with the resulting χ²cal measure, which is the sum over all categories.
Step 5: The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 5% and ν = (I − 1)(J − 1) = 4 degrees of freedom, is shown in Fig. 10.35.

TABLE 10.E.11 Values Observed in Each Category With Their Respective Proportions in Relation to the Row’s General Total

                Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    40 (58.8%)     16 (23.5%)   12 (17.6%)   68 (100%)
Live Life       32 (44.4%)     24 (33.3%)   16 (22.2%)   72 (100%)
Mena Health     24 (40%)       32 (53.3%)   4 (6.7%)     60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)     200 (100%)

TABLE 10.E.12 Values Expected From Table 10.E.11 Assuming a Nonassociation Between the Variables

                Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    32.6 (48%)     24.5 (36%)   10.9 (16%)   68 (100%)
Live Life       34.6 (48%)     25.9 (36%)   11.5 (16%)   72 (100%)
Mena Health     28.8 (48%)     21.6 (36%)   9.6 (16%)    60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)     200 (100%)

TABLE 10.E.13 Calculation of the χ² Statistic

                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied
Total Health    1.66           2.94      0.12
Live Life       0.19           0.14      1.74
Mena Health     0.80           5.01      3.27
Total           χ²cal = 15.861

FIG. 10.35 Critical region of Example 10.8.

Step 6: Decision: since the value calculated is in the critical region, that is, χ²cal > 9.488, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is an association between the variable categories.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D, in the Appendix, the probability associated to the statistic χ²cal = 15.861, for ν = 4 degrees of freedom, is less than 0.005.
Step 6: Decision: since P < 0.05, we reject H0.
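As a check on Example 10.8, the expected frequencies and the χ² statistic can be recomputed from the observed contingency table. This is a minimal sketch in Python using only the standard library; the variable names are illustrative assumptions. The P-value line relies on a closed form of the χ² upper-tail probability that holds only for 4 degrees of freedom, which is the case here.

```python
import math

# Observed frequencies from Table 10.E.10
# (columns: Dissatisfied, Neutral, Satisfied).
observed = [
    [40, 16, 12],  # Total Health
    [32, 24, 16],  # Live Life
    [24, 32, 4],   # Mena Health
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Expression (10.10): sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 4 only, the upper-tail probability is exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(round(chi2, 3), df, round(p_value, 3))
```

The statistic should come out as 15.861 with 4 degrees of freedom, and the P-value should fall below 0.005, in line with Step 5 above.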

10.4.1.1 Solving the χ² Statistic Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.8 are available in the file HealthInsurance.sav. In order to calculate the χ² statistic for two independent samples, we must click on Analyze → Descriptive Statistics → Crosstabs … Let’s insert variable Agency in Row(s) and variable Satisfaction in Column(s), as shown in Fig. 10.36. In Statistics …, let’s select option Chi-square, as shown in Fig. 10.37. Then, we must finally click on Continue and OK. The result is shown in Fig. 10.38. From Fig. 10.38, we can see that the value of χ² is 15.861, similar to what was calculated in Example 10.8. Since P = 0.003 < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is an association between the variable categories, that is, the frequencies observed differ from the frequencies expected in at least one pair of categories.

10.4.1.2 Solving the χ² Statistic by Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. As presented in Chapter 4, the calculation of the χ² statistic on Stata is done by using the command tabulate, or simply tab, followed by the name of the variables being studied, using option chi2, or simply ch. The syntax of the test is:
tab variable1* variable2*, ch

The data in Example 10.8 are also available in the file HealthCareInsurance.dta. The variables being studied are agency and satisfaction. Thus, we must type the following command:
tab agency satisfaction, ch

The results can be seen in Fig. 10.39 and are similar to the ones presented in Example 10.8 and on SPSS.

FIG. 10.36 Selecting the variables.

FIG. 10.37 Selecting the χ² statistic.

FIG. 10.38 Results of the χ² test for Example 10.8 on SPSS.

FIG. 10.39 Results of the χ² test for Example 10.8 on Stata.

10.4.2 Mann-Whitney U Test

The Mann-Whitney U test is one of the most powerful nonparametric tests. It is applied to quantitative variables or to qualitative variables in an ordinal scale, and it aims at verifying if two nonpaired or independent samples are drawn from the same population. It is an alternative to Student’s t-test when the normality hypothesis is violated or when the sample is small. In addition, it may be considered a nonparametric version of the t-test for two independent samples. Since the original data are transformed into ranks (orders), we lose some information, so the Mann-Whitney U test is not as powerful as the t-test. Unlike the t-test, which verifies the equality of the means of two independent populations with continuous data, the Mann-Whitney U test verifies the equality of the medians. For a bilateral test, the null hypothesis is that the medians of both populations are equal, that is:
H0: μ1 = μ2
H1: μ1 ≠ μ2
The calculation of the Mann-Whitney U statistic is specified below for small and large samples.

Small samples
Method:
(a) Let’s consider N1 the size of the sample with the smallest number of observations and N2 the size of the sample with the largest number of observations. We assume that both samples are independent.
(b) In order to apply the Mann-Whitney U test, we must join both samples into a single combined sample formed by N = N1 + N2 elements, while keeping track of the original sample of each observation. The combined sample must be sorted in ascending order and a rank attributed to each observation: rank 1 to the lowest observation and rank N to the highest observation. If there are ties, we attribute the mean of the corresponding ranks.
(c) After that, we must calculate the sum of the ranks for each sample, that is, calculate R1, which corresponds to the sum of the ranks in the sample with the smallest number of observations, and R2, which corresponds to the sum of the ranks in the sample with the largest number of observations.
(d) Thus, we can calculate quantities U1 and U2 as follows:

$$U_1 = N_1 \cdot N_2 + \frac{N_1 (N_1 + 1)}{2} - R_1 \qquad (10.11)$$

$$U_2 = N_1 \cdot N_2 + \frac{N_2 (N_2 + 1)}{2} - R_2 \qquad (10.12)$$

(e) The Mann-Whitney U statistic is given by:
Ucal = min(U1, U2)

Table J in the Appendix shows the critical values of U such that P(Ucal < Uc) = α (for a left-tailed unilateral test), for values of N2 ≤ 20 and significance levels of 0.05, 0.025, 0.01, and 0.005. In order for the null hypothesis H0 of the left-tailed unilateral test to be rejected, the value of the Ucal statistic must be in the critical region, that is, Ucal < Uc. Otherwise, we do not reject H0. For a bilateral test, we must consider P(Ucal < Uc) = α/2, since P(Ucal < Uc) + P(Ucal > Uc) = α. The unilateral probabilities associated to the Ucal statistic (P1) can also be obtained from Table J. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Thus, we reject H0 if P ≤ α.

Large samples
As the sample size grows (N2 > 20), the Mann-Whitney distribution becomes more similar to a standard normal distribution.

The real value of the Z statistic is given by:

$$Z_{cal} = \frac{U - N_1 \cdot N_2 / 2}{\sqrt{\dfrac{N_1 \cdot N_2}{N (N - 1)} \left( \dfrac{N^3 - N}{12} - \dfrac{\sum_{j=1}^{g} t_j^3 - \sum_{j=1}^{g} t_j}{12} \right)}} \qquad (10.13)$$

where:
$\dfrac{\sum_{j=1}^{g} t_j^3 - \sum_{j=1}^{g} t_j}{12}$ is a correction factor when there are ties;
g: the number of groups with tied ranks;
tj: the number of tied observations in group j.

The value calculated must be compared to the critical value of the standard normal distribution (see Table E in the Appendix). This table provides the critical values of zc where P(Zcal > zc) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Zcal < −zc) = P(Zcal > zc) = α/2. Therefore, for a bilateral test, the null hypothesis is rejected if Zcal < −zc or Zcal > zc. The unilateral probabilities associated to the Zcal statistic (P1) can also be obtained from Table E. For a unilateral test, P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Thus, the null hypothesis is rejected if P ≤ α.

Example 10.9: Applying the Mann-Whitney U Test to Small Samples
Aiming at assessing the quality of two machines, the diameters of the parts produced (in mm) in each one of them are compared, as shown in Table 10.E.14. Use the most suitable test, at a significance level of 5%, to test whether or not both samples come from populations with the same medians.

TABLE 10.E.14 Diameter of Parts Produced in Two Machines

Mach. A   48.50   48.65   48.58   48.55   48.66   48.64   48.50   48.72
Mach. B   48.75   48.64   48.80   48.85   48.78   48.79   49.20

Solution
Step 1: By applying the normality test to both samples, we can see that the data from machine B do not follow a normal distribution. So, the most suitable test to compare the medians of two independent populations is the Mann-Whitney U test.
Step 2: Through the null hypothesis, the median diameters of the parts in both machines are the same, so:
H0: μA = μB
H1: μA ≠ μB
Step 3: The significance level to be considered is 5%.
Step 4: Calculation of the U statistic:
(a) N1 = 7 (sample size from machine B)
N2 = 8 (sample size from machine A)
(b) Combined sample and respective ranks (Table 10.E.15):

TABLE 10.E.15 Combined Data

Data     Machine   Ranks
48.50    A         1.5
48.50    A         1.5
48.55    A         3
48.58    A         4
48.64    A         5.5
48.64    B         5.5
48.65    A         7
48.66    A         8
48.72    A         9
48.75    B         10
48.78    B         11
48.79    B         12
48.80    B         13
48.85    B         14
49.20    B         15

(c) R1 = 80.5 (sum of the ranks from machine B, with the smallest number of observations);
R2 = 39.5 (sum of the ranks from machine A, with the largest number of observations).
(d) Calculation of U1 and U2:

$$U_1 = N_1 \cdot N_2 + \frac{N_1 (N_1 + 1)}{2} - R_1 = 7 \cdot 8 + \frac{7 \cdot 8}{2} - 80.5 = 3.5$$

$$U_2 = N_1 \cdot N_2 + \frac{N_2 (N_2 + 1)}{2} - R_2 = 7 \cdot 8 + \frac{8 \cdot 9}{2} - 39.5 = 52.5$$

(e) Calculation of the Mann-Whitney U statistic:
Ucal = min(U1, U2) = 3.5
Step 5: According to Table J, in the Appendix, for N1 = 7, N2 = 8, and P(Ucal < Uc) = α/2 = 0.025 (bilateral test), the critical value of the Mann-Whitney U statistic is Uc = 10.
Step 6: Decision: since the calculated statistic is in the critical region, that is, Ucal < 10, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the medians of both populations are different.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table J, in the Appendix, the unilateral probability P1 associated to Ucal = 3.5, for N1 = 7 and N2 = 8, is less than 0.005. For a bilateral test, this probability must be doubled (P < 0.01).
Step 6: Decision: since P < 0.05, we must reject H0.
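Steps (a) to (e) of Example 10.9 can be reproduced with a few lines of code. The following is a minimal sketch in Python with no external libraries; the helper average_ranks and the variable names are assumptions made for illustration. The data are those of Table 10.E.15, with machine B as the smaller sample.

```python
# Diameters (in mm) by machine, as listed in Table 10.E.15.
machine_a = [48.50, 48.65, 48.58, 48.55, 48.66, 48.64, 48.50, 48.72]
machine_b = [48.75, 48.64, 48.80, 48.85, 48.78, 48.79, 49.20]


def average_ranks(values):
    """Rank values in ascending order; tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


# Sample 1 is the smaller sample (machine B); sample 2 is the larger (machine A).
sample1, sample2 = machine_b, machine_a
n1, n2 = len(sample1), len(sample2)

# Rank the combined sample, then split the ranks back by origin.
ranks = average_ranks(sample1 + sample2)
r1 = sum(ranks[:n1])
r2 = sum(ranks[n1:])

# Expressions (10.11) and (10.12).
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u_cal = min(u1, u2)

print(r1, r2, u1, u2, u_cal)
```

The output should match R1 = 80.5, R2 = 39.5, U1 = 3.5, U2 = 52.5, and Ucal = 3.5.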

Example 10.10: Applying the Mann-Whitney U Test to Large Samples
As described previously, as the sample size grows (N2 > 20), the Mann-Whitney distribution becomes more similar to a standard normal distribution. Even though the data in Example 10.9 represent a small sample (N2 = 8), what would be the value of z in this case, by using Expression (10.13)? Interpret the result.
Solution

$$Z_{cal} = \frac{U - N_1 \cdot N_2 / 2}{\sqrt{\dfrac{N_1 \cdot N_2}{N (N - 1)} \left( \dfrac{N^3 - N}{12} - \dfrac{\sum_{j=1}^{g} t_j^3 - \sum_{j=1}^{g} t_j}{12} \right)}} = \frac{3.5 - 7 \cdot 8 / 2}{\sqrt{\dfrac{7 \cdot 8}{15 \cdot 14} \left( \dfrac{15^3 - 15}{12} - \dfrac{16 - 4}{12} \right)}} = -2.840$$

The critical value of the zc statistic for a bilateral test, at the significance level of 5%, is 1.96 (see Table E in the Appendix). Since Zcal < −1.96, the null hypothesis would also be rejected by the Z statistic, which allows us to conclude, with a 95% confidence level, that the population medians are different. Instead of comparing the value calculated to the critical value, we could obtain the P-value directly from Table E. Thus, the unilateral probability associated to statistic Zcal = −2.840 is P1 = 0.0023. For a bilateral test, this probability must be doubled (P-value = 0.0046).
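The normal approximation of Expression (10.13) is easy to check numerically. The sketch below, in Python with the standard library only, uses a hypothetical helper mann_whitney_z (not a library function) and takes the tie group sizes as input; for Example 10.10 there are two groups of two tied observations (48.50 and 48.64). The bilateral P-value is computed from the standard normal distribution rather than read from Table E, so it may differ from the table in the last decimal place.

```python
import math


def mann_whitney_z(u, n1, n2, tie_sizes=()):
    """Z statistic of Expression (10.13), with the correction for ties."""
    n = n1 + n2
    tie_term = sum(t**3 - t for t in tie_sizes) / 12
    variance = (n1 * n2) / (n * (n - 1)) * ((n**3 - n) / 12 - tie_term)
    return (u - n1 * n2 / 2) / math.sqrt(variance)


# Values from Examples 10.9 and 10.10: U = 3.5, N1 = 7, N2 = 8.
z = mann_whitney_z(3.5, 7, 8, tie_sizes=(2, 2))

# Bilateral P-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 3), round(p_two_sided, 4))
```

The value of z should agree with −2.840, and the bilateral P-value should be close to the 0.0046 obtained from Table E.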

10.4.2.1 Solving the Mann-Whitney Test Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.9 are available in the file Mann-Whitney_Test.sav. Since group 1 is the one with the smallest number of observations, in Data → Define Variable Properties …, we assign value 1 to group B and value 2 to group A for variable Machine. In order to elaborate the Mann-Whitney test on SPSS, we must click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Independent Samples …, as shown in Fig. 10.40. After that, we should insert the variable Diameter in the box Test Variable List and the variable Machine in Grouping Variable, defining the respective groups. Let’s select the option Mann-Whitney U in Test Type, as shown in Fig. 10.41. Finally, let’s click on OK to obtain Figs. 10.42 and 10.43. Fig. 10.42 shows the mean and the sum of the ranks for each group, while Fig. 10.43 shows the statistic of the test. The results in Fig. 10.42 are similar to the ones calculated in Example 10.9. According to Fig. 10.43, the result of the Mann-Whitney U statistic is 3.50, similar to the value calculated in Example 10.9. The bilateral probability associated to the U statistic is P = 0.002 (we saw in Example 10.9 that this probability is less than 0.01). For the same data in Example 10.9, if we had to calculate the Z statistic and the respective associated bilateral probability, the result would be Zcal = −2.840 and P = 0.005, similar to the values calculated in Example 10.10. For both tests, as the associated bilateral probability is less than 0.05, the null hypothesis is rejected, which allows us to conclude that the medians of both populations are different.

FIG. 10.40 Procedure to elaborate the Mann-Whitney test on SPSS.

FIG. 10.41 Selecting the variables and Mann-Whitney test.

FIG. 10.42 Ranks.

FIG. 10.43 Mann-Whitney test for Example 10.9 on SPSS.

10.4.2.2 Solving the Mann-Whitney Test Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The Mann-Whitney test is elaborated on Stata from the command ranksum (equality test for nonpaired data), by using the following syntax:
ranksum variable*, by (groups*)

FIG. 10.44 Results of the Mann-Whitney test for Examples 10.9 and 10.10 on Stata.

where the term variable* must be replaced by the quantitative variable studied and the term groups* by the categorical variable that represents the groups. Let’s open the file Mann-Whitney_Test.dta that contains the data from Examples 10.9 and 10.10. Both groups are represented by the variable machine and the quality characteristic by the variable diameter. Thus, the command to be typed is:
ranksum diameter, by (machine)

The results obtained are shown in Fig. 10.44. We can see that the calculated value of the statistic (2.840) corresponds to the value calculated in Example 10.10, for large samples, from Expression (10.13). The probability associated to the statistic for a bilateral test is 0.0045. Since P < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population medians are different.

10.5 TESTS FOR K PAIRED SAMPLES

These tests analyze the differences between k (three or more) paired or related samples. According to Siegel and Castellan (2006), the null hypothesis to be tested is that k samples have been drawn from the same population. The main tests for k paired samples are Cochran’s Q test (for binary variables) and Friedman’s test (for ordinal variables).

10.5.1 Cochran’s Q Test

Cochran’s Q test for k paired samples is an extension of the McNemar test for two samples, and it aims to test the hypothesis that the frequencies or proportions of three or more related groups differ significantly from one another. In the same way as in the McNemar test, the data are binary. According to Siegel and Castellan (2006), Cochran’s Q test compares the characteristics of several individuals or characteristics of the same individual observed under different conditions. For example, we can analyze if k items differ significantly for N individuals. Or, we may have only one item to analyze and the objective is to compare the answers of N individuals under k different conditions. Let’s suppose that the study data are organized in one table with N rows and k columns, in which N is the number of cases and k is the number of groups or conditions. Through the null hypothesis of Cochran’s Q test, there are no differences between the frequencies or proportions of success (p) of the k related groups, that is, the proportion of a desired answer (success) is the same in each column. Through the alternative hypothesis, there are differences between at least two groups, so:
H0: p1 = p2 = … = pk
H1: ∃(i,j) pi ≠ pj, i ≠ j

Cochran’s Q statistic is given by:

$$Q_{cal} = \frac{k (k - 1) \sum_{j=1}^{k} (G_j - \bar{G})^2}{k \sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2} = \frac{(k - 1) \left[ k \sum_{j=1}^{k} G_j^2 - \left( \sum_{j=1}^{k} G_j \right)^2 \right]}{k \sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2} \qquad (10.14)$$

which approximately follows a χ² distribution with k − 1 degrees of freedom, where:
Gj: the total number of successes in the jth column;
Ḡ: the mean of the Gj;
Li: the total number of successes in the ith row.
The value calculated must be compared to the critical value of the χ² distribution (Table D in the Appendix). This table provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). If the value of the statistic is in the critical region, that is, if Qcal > χ²c, we must reject H0. Otherwise, we do not reject H0. The probability associated to the calculated value of the statistic (P-value) can also be obtained from Table D. In this case, the null hypothesis is rejected if P ≤ α; otherwise we do not reject H0.

Example 10.11: Applying Cochran’s Q Test
We are interested in assessing 20 consumers’ level of satisfaction regarding three supermarkets, trying to investigate if their clients are satisfied (score 1) or not (score 0) with the quality, variety, and price of their products, for each supermarket. Check the hypothesis that the probability of receiving a good evaluation from clients is the same for all three supermarkets, considering a significance level of 10%. Table 10.E.16 shows the results of the evaluation.

TABLE 10.E.16 Results of the Evaluation for All Three Supermarkets

Consumer   A          B          C          Li          Li²
1          1          1          1          3           9
2          1          0          1          2           4
3          0          1          1          2           4
4          0          0          0          0           0
5          1          1          0          2           4
6          1          1          1          3           9
7          0          0          1          1           1
8          1          0          1          2           4
9          1          1          1          3           9
10         0          0          1          1           1
11         0          0          0          0           0
12         1          1          0          2           4
13         1          0          1          2           4
14         1          1          1          3           9
15         0          1          1          2           4
16         0          1          1          2           4
17         1          1          1          3           9
18         1          1          1          3           9
19         0          0          1          1           1
20         0          0          1          1           1
Total      G1 = 11    G2 = 11    G3 = 16    ΣLi = 38    ΣLi² = 90

FIG. 10.45 Critical region of Example 10.11.

Solution
Step 1: The most suitable test to compare proportions of three or more paired groups is Cochran’s Q test.
Step 2: Through the null hypothesis, the proportion of successes (score 1) is the same for all three supermarkets. Through the alternative hypothesis, the proportion of satisfied clients differs for at least two supermarkets, so:
H0: p1 = p2 = p3
H1: ∃(i,j) pi ≠ pj, i ≠ j
Step 3: The significance level to be considered is 10%.
Step 4: The calculation of Cochran’s Q statistic, from Expression (10.14), is given by:

$$Q_{cal} = \frac{(k - 1) \left[ k \sum_{j=1}^{k} G_j^2 - \left( \sum_{j=1}^{k} G_j \right)^2 \right]}{k \sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2} = \frac{(3 - 1) \left[ 3 \cdot (11^2 + 11^2 + 16^2) - 38^2 \right]}{3 \cdot 38 - 90} = 4.167$$

Step 5: The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 10% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.45.
Step 6: Decision: since the value calculated is not in the critical region, that is, Qcal < 4.605, the null hypothesis is not rejected, which allows us to conclude, with a 90% level of confidence, that the proportion of satisfied clients is equal for all three supermarkets.
If we use the P-value instead of the statistic’s critical value, Steps 5 and 6 will be:
Step 5: According to Table D, in the Appendix, for ν = 2 degrees of freedom, the probability associated to statistic Qcal = 4.167 is greater than 0.10 (P-value > 0.10).
Step 6: Decision: since P > 0.10, we should not reject H0.
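The arithmetic of Example 10.11 can be verified directly from the 0/1 scores in Table 10.E.16. This is a minimal sketch in Python using only the standard library; the variable names are illustrative assumptions. The P-value uses the closed form exp(−x/2) of the χ² upper-tail probability, which is valid only for 2 degrees of freedom, as is the case with k = 3.

```python
import math

# Rows are consumers 1..20; columns are supermarkets A, B, C (Table 10.E.16).
scores = [
    (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 1, 0),
    (1, 1, 1), (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 0, 1),
    (0, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1),
    (0, 1, 1), (1, 1, 1), (1, 1, 1), (0, 0, 1), (0, 0, 1),
]

k = len(scores[0])
col_successes = [sum(col) for col in zip(*scores)]  # G_j
row_successes = [sum(row) for row in scores]        # L_i

sum_l = sum(row_successes)
sum_l_sq = sum(x**2 for x in row_successes)

# Expression (10.14), second form.
q_cal = (k - 1) * (
    k * sum(g**2 for g in col_successes) - sum(col_successes) ** 2
) / (k * sum_l - sum_l_sq)

# For df = k - 1 = 2 only, the chi-square upper tail is exp(-x/2).
p_value = math.exp(-q_cal / 2)

print(col_successes, sum_l, sum_l_sq, round(q_cal, 3), round(p_value, 3))
```

The totals should match G1 = 11, G2 = 11, G3 = 16, ΣLi = 38, and ΣLi² = 90, with Qcal = 4.167 and a P-value above 0.10.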

10.5.1.1 Solving Cochran’s Q Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.11 are available in the file Cochran_Q_Test.sav. The procedure for elaborating Cochran’s Q test on SPSS is shown. First of all, let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → K Related Samples …, as shown in Fig. 10.46. After that, we must insert variables A, B, and C in the box Test Variables, and select option Cochran’s Q in Test Type, as shown in Fig. 10.47. Finally, let’s click on OK to obtain the results of the test. Fig. 10.48 shows the frequencies of each group and Fig. 10.49 shows the result of the statistic. The value of Cochran’s Q statistic is 4.167, similar to the value calculated in Example 10.11. The probability associated to the statistic is 0.125 (we saw in Example 10.11 that P > 0.10). Since P > a, the null hypothesis is not rejected, which allows us to conclude, with a 90% level of confidence, that there are no differences in the proportion of satisfied clients for all three supermarkets.

10.5.1.2 Solution of Cochran’s Q Test on Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data from Example 10.11 are also available in the file Cochran_Q_Test.dta. The command used to elaborate the test is cochran followed by the k paired variables. In our case, the variables that represent the three supermarkets are a, b, and c, respectively. So, the command to be typed is:
cochran a b c

FIG. 10.46 Procedure for elaborating Cochran’s Q test on SPSS.

FIG. 10.47 Selecting the variables and Cochran’s Q test.

FIG. 10.48 Frequencies.

FIG. 10.49 Cochran’s Q test for Example 10.11 on SPSS.

FIG. 10.50 Results of Cochran’s Q test for Example 10.11 on Stata.

The results of Cochran’s Q test on Stata are in Fig. 10.50. We can verify that the result of the statistic and the respective associated probability are similar to the results calculated in Example 10.11, and also generated on SPSS, which allows us to conclude, with a 90% level of confidence, that the proportion of satisfied clients is the same for all three supermarkets.

10.5.2 Friedman’s Test

Friedman’s test is applied to quantitative variables or to qualitative variables in an ordinal scale, and its main objective is to verify if k paired samples are drawn from the same population. It is an extension of the Wilcoxon test for three or more paired samples. It is also an alternative to the analysis of variance when its hypotheses (normality of data and homogeneity of variances) are violated or when the sample size is too small. The data are represented in one table with double entry, with N rows and k columns, in which the rows represent the several individuals or corresponding sets of individuals, and the columns represent the different conditions. Therefore, the null hypothesis of Friedman’s test assumes that the k samples (columns) come from the same population or from populations with the same median (μ). For a bilateral test, we have:
H0: μ1 = μ2 = … = μk
H1: ∃(i,j) μi ≠ μj, i ≠ j
To apply Friedman’s statistic, we must attribute ranks from 1 to k to the elements of each row. For example, rank 1 is attributed to the lowest observation on the row and rank k to the highest observation. If there are ties, we attribute the mean of the corresponding ranks. Friedman’s statistic is given by:

$$F_{cal} = \frac{12}{N \cdot k \cdot (k + 1)} \sum_{j=1}^{k} R_j^2 - 3 \cdot N \cdot (k + 1) \qquad (10.15)$$

where: N: the number of rows; k: the number of columns; Rj: sum of the ranks in column j. However, according to Siegel and Castellan (2006), whenever there are ties between the ranks of the same group or row, Friedman’s statistic must be corrected in a way that considers the changes in the sample distribution, as follows: Xk  2 12  Rj  3  N 2  k  ð k + 1 Þ 2 j¼1 0  (10.16) Fcal ¼ XN Xgi  3 N k t ij i¼1 j¼1 N  k  ð k + 1Þ + ð k  1Þ where: gi: the number of sets with tied observations in the ith group, including the sets of size 1; tij: size of the jth set of ties in the ith group. The value calculated must be compared to the critical value of the sample distribution. When N and k are small (k ¼ 3 and 3 < N < 13, or k ¼ 4 and 2 < N < 8, or k ¼ 5 and 3 < N < 5), we must use Table K in the Appendix, which shows the critical values of Friedman’s statistic (Fc), where P(Fcal > Fc) ¼ a (for a right-tailed unilateral test). For high values of N and k, the sample distribution can be approximated by the w2 distribution with n ¼ k  1 degrees of freedom. Therefore, if the value of the Fcal statistic is in the critical region, that is, if Fcal > Fc for a small N and K or Fcal > w2c for a high N and K, we must reject the null hypothesis. Otherwise, we do not reject H0. Example 10.12: Applying Friedman’s Test A research is carried out in order to verify the efficacy that breakfast has in weight loss and, in order to do that, 15 patients were followed up for 3 months. Data regarding patients’ weight were collected during three different periods, as shown in Table 10.E.17: before the treatment (BT), after the treatment (AT), and after 3 months of treatment (A3M). Check and see if the treatment had any results. Assume that a ¼ 5%.

TABLE 10.E.17 Patients’ Weight in Each Period

Patient    BT     AT     A3M
1          65     62     58
2          89     85     80
3          96     95     95
4          90     84     79
5          70     70     66
6          72     65     62
7          87     84     77
8          74     74     69
9          66     64     62
10         135    132    132
11         82     75     71
12         76     73     67
13         94     90     88
14         80     80     77
15         73     70     68


Solution
Step 1: Since the data do not follow a normal distribution, Friedman’s test is an alternative to ANOVA to verify if the three paired samples are drawn from the same population.
Step 2: Through the null hypothesis, there is no difference among the treatments. Through the alternative hypothesis, the treatment had some results, so:
H0: μ1 = μ2 = μ3
H1: ∃(i,j) μi ≠ μj, i ≠ j
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate Friedman’s statistic, first, we must attribute ranks from 1 to 3 to each element in each row, as shown in Table 10.E.18. If there are ties, we attribute the mean of the corresponding ranks.

TABLE 10.E.18 Attributing Ranks

Patient              BT       AT       A3M
1                    3        2        1
2                    3        2        1
3                    3        1.5      1.5
4                    3        2        1
5                    2.5      2.5      1
6                    3        2        1
7                    3        2        1
8                    2.5      2.5      1
9                    3        2        1
10                   3        1.5      1.5
11                   3        2        1
12                   3        2        1
13                   3        2        1
14                   2.5      2.5      1
15                   3        2        1
Rj                   43.5     30.5     16
Mean of the ranks    2.900    2.033    1.067

As shown in Table 10.E.18, there is one pair of tied observations for each of patients 3, 5, 8, 10, and 14. Therefore, the total number of sets of ties of size 2 is 5 and the total number of sets of size 1 is 35. Thus:

$$\sum_{i=1}^{N} \sum_{j=1}^{g_i} t_{ij}^3 = 35 \times 1^3 + 5 \times 2^3 = 75$$

Since there are ties, the real value of Friedman’s statistic is calculated from Expression (10.16), as follows:

$$F'_{cal} = \frac{12 \sum_{j=1}^{k} R_j^2 - 3 \cdot N^2 \cdot k \cdot (k+1)^2}{N \cdot k \cdot (k+1) + \dfrac{N \cdot k - \sum_{i=1}^{N} \sum_{j=1}^{g_i} t_{ij}^3}{k-1}} = \frac{12 \times (43.5^2 + 30.5^2 + 16^2) - 3 \times 15^2 \times 3 \times 4^2}{15 \times 3 \times 4 + \dfrac{15 \times 3 - 75}{2}} = 27.527$$
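The calculation above can be checked with a short script. The following is only a minimal sketch in plain Python (it is not part of the SPSS or Stata procedures in this chapter); the helper names `average_ranks` and `friedman` are hypothetical and were chosen for illustration. It ranks each row of Table 10.E.17 with mean ranks for ties and then evaluates Expressions (10.15) and (10.16).

```python
from collections import Counter

def average_ranks(values):
    """Rank the values of one row from 1 to k; tied values share the mean rank."""
    first_pos = {}
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)  # first position taken by each distinct value
    counts = Counter(values)
    # A run of c tied values starting at position p has mean rank p + (c - 1) / 2.
    return [first_pos[v] + (counts[v] - 1) / 2 for v in values]

def friedman(rows):
    """Return (rank sums, statistic of Expression 10.15, corrected statistic of 10.16)."""
    n, k = len(rows), len(rows[0])
    ranked = [average_ranks(row) for row in rows]
    rank_sums = [sum(r[j] for r in ranked) for j in range(k)]
    sum_sq = sum(rj ** 2 for rj in rank_sums)
    f_plain = 12 / (n * k * (k + 1)) * sum_sq - 3 * n * (k + 1)
    # Sum of t_ij^3 over every set of ties in every row (sets of size 1 included).
    tie_term = sum(c ** 3 for row in rows for c in Counter(row).values())
    f_corrected = (12 * sum_sq - 3 * n ** 2 * k * (k + 1) ** 2) / (
        n * k * (k + 1) + (n * k - tie_term) / (k - 1)
    )
    return rank_sums, f_plain, f_corrected

# Table 10.E.17: (BT, AT, A3M) for the 15 patients.
weights = [
    (65, 62, 58), (89, 85, 80), (96, 95, 95), (90, 84, 79), (70, 70, 66),
    (72, 65, 62), (87, 84, 77), (74, 74, 69), (66, 64, 62), (135, 132, 132),
    (82, 75, 71), (76, 73, 67), (94, 90, 88), (80, 80, 77), (73, 70, 68),
]

rank_sums, f_plain, f_corrected = friedman(weights)
print(rank_sums)              # rank sums per period
print(round(f_plain, 3))      # statistic without the tie correction
print(round(f_corrected, 3))  # statistic with the tie correction
```

Under these assumptions, the script should reproduce the rank sums of Table 10.E.18 and the two values of the statistic discussed in this example.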


FIG. 10.51 Critical region of Example 10.12.

If we applied Expression (10.15) without the correction factor, the result of Friedman’s test would be 25.233.
Step 5: Since k = 3 and N = 15, let’s use the χ² distribution. The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 5% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.51.
Step 6: Decision: since the value calculated is in the critical region, that is, F′cal > 5.991, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the treatment has good results.
If we use P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated to statistic F′cal = 27.527 is less than 0.005 (P-value < 0.005).

[…]

χ²cal > 12.592, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in productivity among the four shifts.
If we use P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, the probability associated to the statistic χ²cal = 13.143, for ν = 6 degrees of freedom, is between 0.025 and 0.05.
Step 6: Decision: since P < 0.05, we reject H0.

FIG. 10.57 Critical region of Example 10.13.


FIG. 10.58 Selecting the variables.

10.6.1.1 Solving the χ² Test for k Independent Samples on SPSS

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data from Example 10.13 are available in the file Chi-Square_k_Independent_Samples.sav. Let’s click on Analyze → Descriptive Statistics → Crosstabs … After that, we should insert the variable Productivity in Row(s) and the variable Shift in Column(s), as shown in Fig. 10.58. In Statistics …, let’s select the option Chi-square, as shown in Fig. 10.59. If we wish to obtain the observed and expected frequency distribution table, in Cells …, we must select the options Observed and Expected in Counts, as shown in Fig. 10.60. Finally, let’s click on Continue and OK. The results can be seen in Figs. 10.61 and 10.62. From Fig. 10.62, we can see that the value of χ² is 13.143, the same as the one calculated in Example 10.13. For a confidence level of 95%, since P = 0.041 < 0.05 (we saw in Example 10.13 that this probability is between 0.025 and 0.05), we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in productivity among the four shifts.
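The probability reported by the software can be reproduced by hand when the number of degrees of freedom is even, because the right-tail probability of the χ² distribution then has a closed form: P(χ² > x) = e^(−x/2) · Σ (x/2)^i / i!, with i running from 0 to ν/2 − 1. The sketch below is an illustration in plain Python, not the routine used by SPSS or Stata, and the function name `chi2_right_tail_even_df` is hypothetical.

```python
import math

def chi2_right_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid only for even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for even, positive df")
    half = x / 2
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term           # adds (x/2)^i / i!
        term *= half / (i + 1)  # prepares the next term of the series
    return math.exp(-half) * total

# Statistic and degrees of freedom quoted for Example 10.13.
p_value = chi2_right_tail_even_df(13.143, 6)
print(round(p_value, 3))

# The critical value 12.592 should leave roughly 5% in the right tail.
print(round(chi2_right_tail_even_df(12.592, 6), 3))
```

For odd degrees of freedom this shortcut does not apply, and a statistical library or Table D in the Appendix should be used instead.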

10.6.1.2 Solving the χ² Test for k Independent Samples on Stata

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.13 are available in the file Chi-Square_k_Independent_Samples.dta. The variables being studied are productivity and shift. The syntax of the χ² test for k independent samples is similar to the one presented in Section 10.4.1 for two independent samples. Thus, we must use the command tabulate, or simply tab, followed by the names of the variables being studied, together with the option chi2, or simply ch. The difference is that, in this case, the categorical variable that represents the groups has more than two categories. Therefore, the syntax of the test for the data in Example 10.13 is:

tabulate productivity shift, chi2


FIG. 10.59 Selecting the χ² statistic.

FIG. 10.60 Selecting the observed and expected frequencies distribution table.


FIG. 10.61 Distribution of the observed and expected frequencies.

FIG. 10.62 Results of the χ² test for Example 10.13 on SPSS.

FIG. 10.63 Results of the χ² test for Example 10.13 on Stata.

or simply:

tab productivity shift, ch

The results can be seen in Fig. 10.63. The value of the χ² statistic and the probability associated to it match the results presented in Example 10.13 and those generated on SPSS.

10.6.2 Kruskal-Wallis Test

The Kruskal-Wallis test aims at verifying whether k independent samples (k > 2) come from the same population. It is an alternative to the analysis of variance when the assumptions of data normality and equality of variances are violated, or when the sample is small, or even when the variable is measured in an ordinal scale. For k = 2, the Kruskal-Wallis test is equivalent to the Mann-Whitney test.

The data are represented in a double-entry table with N rows and k columns, in which the rows represent the observations and the columns represent the different samples or groups. The null hypothesis of the Kruskal-Wallis test assumes that all k samples come from the same population or from identical populations with the same median (μ). For a bilateral test, we have:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$
$$H_1: \exists (i,j) \;\; \mu_i \neq \mu_j, \; i \neq j$$

In the Kruskal-Wallis test, all N observations (N is the total number of observations in the global sample) are organized in a single series, and we attribute ranks to each element in the series. Thus, position 1 is attributed to the lowest observation in the global sample, position 2 to the second lowest observation, and so on, up to position N. If there are ties, we attribute the mean of the corresponding ranks. The Kruskal-Wallis statistic (H) is given by:

$$H_{cal} = \frac{12}{N \cdot (N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3 \cdot (N+1) \tag{10.17}$$

where:
k: the number of samples or groups;
nj: the number of observations in the sample or group j;
N: the number of observations in the global sample;
Rj: sum of the ranks in the sample or group j.

However, according to Siegel and Castellan (2006), whenever there are ties between two or more ranks, regardless of the group, the Kruskal-Wallis statistic must be corrected in a way that considers the changes in the sample distribution, so:

$$H'_{cal} = \frac{H}{1 - \dfrac{\sum_{j=1}^{g} \left( t_j^3 - t_j \right)}{N^3 - N}} \tag{10.18}$$

where:
g: the number of clusters with different tied ranks;
tj: the number of tied ranks in the jth cluster.

According to Siegel and Castellan (2006), the effect of correcting for ties is to increase the value of H, making the result more significant.

The value calculated must be compared to the critical value of the sample distribution. If k = 3 and n1, n2, n3 ≤ 5, we must use Table L in the Appendix, which shows the critical values of the Kruskal-Wallis statistic (Hc), where P(Hcal > Hc) = α (for a right-tailed unilateral test). Otherwise, the sample distribution can be approximated by the χ² distribution with ν = k − 1 degrees of freedom. Therefore, if the value of the Hcal statistic is in the critical region, that is, if Hcal > Hc for k = 3 and n1, n2, n3 ≤ 5, or Hcal > χ²c for other values, the null hypothesis is rejected, which allows us to conclude that there is a difference between the samples. Otherwise, we do not reject H0.

Example 10.14: Applying the Kruskal-Wallis Test
A group of 36 patients with the same level of stress was submitted to three different treatments, that is, 12 patients were submitted to treatment A, 12 patients to treatment B, and the remaining 12 to treatment C. At the end of the treatment, each patient answered a questionnaire that evaluates a person’s stress level, which is classified in three phases: the resistance phase, for those who got a maximum of three points, the warning phase, for those who got more than 6 points, and the exhaustion phase, for those who got more than 8 points. The results can be seen in Table 10.E.20. Verify if the three treatments lead to the same results. Consider a significance level of 1%.


TABLE 10.E.20 Stress Level After the Treatment

Treatment A    6    5    4    5    3    4     5    2    4    3     5    2
Treatment B    6    7    5    8    7    8     6    9    8    6     8    8
Treatment C    5    9    8    7    9    11    7    8    9    10    7    8

Solution
Step 1: Since the variable is measured in an ordinal scale, the most suitable test to verify if the three independent samples are drawn from the same population is the Kruskal-Wallis test.
Step 2: Through the null hypothesis, there is no difference among the treatments. Through the alternative hypothesis, there is a difference between at least two treatments, so:
H0: μ1 = μ2 = μ3
H1: ∃(i,j) μi ≠ μj, i ≠ j
Step 3: The significance level to be considered is 1%.
Step 4: In order to calculate the Kruskal-Wallis statistic, first of all, we must attribute ranks from 1 to 36 to each element in the global sample, as shown in Table 10.E.21. In case of ties, we attribute the mean of the corresponding ranks.

TABLE 10.E.21 Attributing Ranks

     Ranks                                                                              Sum      Mean
A    15.5   10.5   6      10.5   3.5    6      10.5   1.5    6      3.5    10.5   1.5    85.5     7.13
B    15.5   20     10.5   26.5   20     26.5   15.5   32.5   26.5   15.5   26.5   26.5   262      21.83
C    10.5   32.5   26.5   20     32.5   36     20     26.5   32.5   35     20     26.5   318.5    26.54

Since there are ties, the Kruskal-Wallis statistic is calculated from Expression (10.18). First of all, we calculate the value of H:

$$H_{cal} = \frac{12}{N \cdot (N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3 \cdot (N+1) = \frac{12}{36 \times 37} \times \frac{85.5^2 + 262^2 + 318.5^2}{12} - 3 \times 37 = 22.181$$

From Tables 10.E.20 and 10.E.21, we can verify that there are eight sets of tied observations. For example, there are two observations with 2 points (with a rank of 1.5), two observations with 3 points (with a rank of 3.5), three observations with 4 points (with a rank of 6), and so on, up to four observations with 9 points (with a rank of 32.5). The Kruskal-Wallis statistic is corrected to:

$$H'_{cal} = \frac{H}{1 - \dfrac{\sum_{j=1}^{g} \left( t_j^3 - t_j \right)}{N^3 - N}} = \frac{22.181}{1 - \dfrac{(2^3 - 2) + (2^3 - 2) + (3^3 - 3) + \dots + (4^3 - 4)}{36^3 - 36}} = 22.662$$
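The two values of the statistic can be verified with a few lines of code. The sketch below is an illustration in plain Python under stated assumptions: the function name `kruskal_wallis` is hypothetical, and the script simply applies Expressions (10.17) and (10.18) to the data in Table 10.E.20, using mean ranks for ties.

```python
from collections import Counter

def kruskal_wallis(groups):
    """Return (rank sums, H of Expression 10.17, tie-corrected H of Expression 10.18)."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    counts = Counter(pooled)
    # Mean rank of each distinct value in the pooled (global) sample.
    rank_of, start = {}, 1
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = start + (c - 1) / 2
        start += c
    rank_sums = [sum(rank_of[v] for v in g) for g in groups]
    h = 12 / (n_total * (n_total + 1)) * sum(
        r ** 2 / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    # Sum of (t^3 - t) over every set of tied values; untied values contribute 0.
    tie_sum = sum(c ** 3 - c for c in counts.values())
    h_corrected = h / (1 - tie_sum / (n_total ** 3 - n_total))
    return rank_sums, h, h_corrected

# Table 10.E.20: stress level after each treatment.
treatment_a = [6, 5, 4, 5, 3, 4, 5, 2, 4, 3, 5, 2]
treatment_b = [6, 7, 5, 8, 7, 8, 6, 9, 8, 6, 8, 8]
treatment_c = [5, 9, 8, 7, 9, 11, 7, 8, 9, 10, 7, 8]

rank_sums, h, h_corrected = kruskal_wallis([treatment_a, treatment_b, treatment_c])
print(rank_sums)              # rank sums of groups A, B, and C
print(round(h, 3))            # H without the tie correction
print(round(h_corrected, 3))  # H with the tie correction
```

The rank sums printed by the script should agree with the Sum column of Table 10.E.21.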

Step 5: Since n1, n2, n3 > 5, let’s use the χ² distribution. The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 1% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.64.

FIG. 10.64 Critical region of Example 10.14.


Step 6: Decision: since the value calculated is in the critical region, that is, H′cal > 9.210, we must reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.
If we use P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated to the statistic H′cal = 22.662 is less than 0.005 (P < 0.005).
Step 6: Decision: since P < 0.01, we reject H0.

10.6.2.1 Solving the Kruskal-Wallis Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.14 are available in the file Kruskal-Wallis_Test.sav. In order to elaborate the Kruskal-Wallis test on SPSS, let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → K Independent Samples …, as shown in Fig. 10.65. After that, we should insert the variable Result in the box Test Variable List, define the groups of the variable Treatment and select the Kruskal-Wallis test, as shown in Fig. 10.66. Let’s click on OK to obtain the results of the Kruskal-Wallis test. Fig. 10.67 shows the mean of the ranks for each group, similar to the values calculated in Table 10.E.21. The value of the Kruskal-Wallis statistic and the significance level of the test are in Fig. 10.68. The value of the test is 22.662, similar to the value calculated in Example 10.14. The probability associated to the statistic is 0.000 (we saw in Example 10.14 that this probability is less than 0.005). Since P < 0.01, we reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.

FIG. 10.65 Procedure for elaborating the Kruskal-Wallis test on SPSS.


FIG. 10.66 Selecting the variable and defining the groups for the Kruskal-Wallis test.

FIG. 10.67 Ranks.

FIG. 10.68 Results of the Kruskal-Wallis test for Example 10.14 on SPSS.

10.6.2.2 Solving the Kruskal-Wallis Test by Using Stata

The use of the images presented in this section has been authorized by Stata Corp LP©. On Stata, the Kruskal-Wallis test is run through the command kwallis, using the following syntax:

kwallis variable*, by(groups*)

where the term variable* must be replaced by the quantitative or ordinal variable being studied and the term groups* by the categorical variable that represents the groups. Let’s open the file Kruskal-Wallis_Test.dta that contains the data from Example 10.14. All three groups are represented by the variable treatment and the characteristic analyzed by the variable result. Thus, the command to be typed is:

kwallis result, by(treatment)

FIG. 10.69 Results of the Kruskal-Wallis test for Example 10.14 on Stata.

The result of the test can be seen in Fig. 10.69. In line with the results presented in Example 10.14 and generated on SPSS, Stata reports both the original value of the statistic (22.181) and the value with the correction factor for ties (22.662). Since the probability associated to the statistic is 0.000, we reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.

10.7 FINAL REMARKS

In the previous chapter, we studied parametric tests. This chapter, however, was totally dedicated to the study of nonparametric tests. Nonparametric tests are classified according to the variables’ level of measurement and to the sample size. So, for each situation, the main types of nonparametric tests were studied. In addition, the advantages and disadvantages of each test as well as their assumptions were also established. For each nonparametric test, the main inherent concepts, the null and alternative hypotheses, the respective statistics, and the solution of the examples proposed on SPSS and on Stata were presented. Whatever the main objective for their application, nonparametric tests can provide the collection of good and interesting research results that will be useful in any decision-making process. The correct use of each test, from a conscious choice of the modeling software, must always be done based on the underlying theory, without ignoring the researcher’s experience and intuition.

10.8 EXERCISES

(1) In what situations are nonparametric tests applied?
(2) What are the advantages and disadvantages of nonparametric tests?
(3) What are the differences between the sign test and the Wilcoxon test for two paired samples?
(4) Which test is an alternative to the t-test for one sample when the data distribution does not follow a normal distribution?
(5) A group of 20 consumers tasted two types of coffee (A and B). At the end, they chose one of the brands, as shown in the table. Test the null hypothesis that there is no difference in these consumers’ preference, with a significance level of 5%.

Events        Brand A    Brand B    Total
Frequency     8          12         20
Proportion    0.40       0.60       1.00

(6) A group of 60 readers evaluated three novels and, at the end, they chose one of the three options, as shown in the table. Test the null hypothesis that there is no difference in these readers’ preference, with a significance level of 5%.

Events        Book A    Book B    Book C    Total
Frequency     29        15        16        60
Proportion    0.483     0.250     0.267     1.00

(7) A group of 20 teenagers went on the Points Diet for 30 days. Check whether there was weight loss after the diet. Assume that α = 5%.

Before    After
58        56
67        62
72        65
88        84
77        72
67        68
75        76
69        62
104       97
66        65
58        59
59        60
61        62
67        63
73        65
58        58
67        62
67        64
78        72
85        80

(8) Aiming to compare the average service times in two bank branches, data on 22 clients from each bank branch were collected, as shown in the table. Use the most suitable test, with a significance level of 5%, to test whether both samples come from populations with the same median.

Bank Branch A    Bank Branch B
6.24             8.14
8.47             6.54
6.54             6.66
6.87             7.85
2.24             8.03
5.36             5.68
7.09             3.05
7.56             5.78
6.88             6.43
8.04             6.39
7.05             7.64
6.58             6.97
8.14             8.07
8.30             8.33
2.69             7.14
6.14             6.58
7.14             5.98
7.22             6.22
7.58             7.08
6.11             7.62
7.25             5.69
7.5              8.04

(9) A group of 20 Business Administration students evaluated their level of learning based on three subjects studied in the field of Applied Quantitative Methods, by answering if their level of learning was high (1) or low (0). The results can be seen in the table. Check whether the proportion of students with a high level of learning is the same for each subject. Consider a significance level of 2.5%.

Student    A    B    C
1          0    1    1
2          1    1    1
3          0    0    0
4          0    1    0
5          0    1    1
6          1    1    1
7          1    0    1
8          0    1    1
9          0    0    0
10         0    0    0
11         1    1    1
12         0    0    1
13         1    0    1
14         0    1    1
15         0    0    1
16         1    1    1
17         0    0    1
18         1    1    1
19         0    1    1
20         1    1    1

(10) A group of 15 consumers evaluated their level of satisfaction (1 = somewhat dissatisfied, 2 = somewhat satisfied, and 3 = very satisfied) with three different bank services. The results can be seen in the table. Verify if there is a difference between the three services. Assume a significance level of 5%.

Consumer    A    B    C
1           3    2    3
2           2    2    2
3           1    2    1
4           3    2    2
5           1    1    1
6           3    2    1
7           3    3    2
8           2    2    1
9           3    2    2
10          2    1    1
11          1    1    2
12          3    1    1
13          3    2    1
14          2    1    2
15          3    1    2

Part V

Multivariate Exploratory Data Analysis

Two or more variables can relate to one another in several different ways. While one researcher may be interested in the study of the interrelationship between categorical (or nonmetric) variables, for example, in order to assess the existence of possible associations between its categories, another researcher may wish to create performance indicators (new variables) from the existence of correlations between the original metric variables. A third researcher may be interested in identifying homogeneous groups possibly formed from the existence of similarities in the variables between the observations of a certain dataset. In all of these situations, researchers may use multivariate exploratory techniques.

Multivariate exploratory techniques, also known as interdependence methods, can probably be used in all fields of human knowledge in which researchers aim to study the relationship between the variables of a certain dataset without intending to estimate confirmatory models, that is, without having to make inferences about observations other than the ones considered in the analysis itself, since neither models nor equations are estimated to predict data behavior. This characteristic is crucial to distinguish the techniques studied in Part V of this book from those considered to be dependence methods, such as the simple and multiple regression models, binary and multinomial logistic regression models, and regression models for count data, all of them studied in Part VI.

Therefore, there is no definition of a predictor variable in exploratory models and, thus, their main objectives refer to the reduction or structural simplification of data, to the classification or clustering of observations and variables, to the investigation of the existence of correlation between metric variables, or association between categorical variables and between their categories, to the creation of performance rankings of observations from variables, and to the elaboration of perceptual maps. Exploratory techniques are considered extremely relevant for developing diagnostics regarding the behavior of the data being analyzed. Thus, their varied procedures are commonly adopted in a preliminary way, or even simultaneously, with the application of a certain confirmatory model.

Based on pedagogical and conceptual criteria, we have chosen to discuss the two main sets of existing multivariate exploratory techniques in Part V; therefore, the chapters are structured in the following way:

Chapter 11: Cluster Analysis
Chapter 12: Principal Component Factor Analysis

The decision about the technique to be used also goes through the measurement scale of the variables available in the dataset, which can be categorical or metric (or even binary, a special case of categorization). The type of question itself, when collecting the data, in some situations, may result in a categorical or metric response, which will favor the use of one or more techniques to the detriment of others. Hence, the clear, precise, and preliminary definition of the research objectives is essential to obtain variables in the measurement scale suitable for the application of a certain technique that will serve as a tool for achieving the objectives proposed.

While the cluster analysis techniques (Chapter 11), whose procedures can be hierarchical or nonhierarchical, are used when we wish to study similar behavior between the observations (individuals, companies, municipalities, countries, among other examples) regarding certain metric or binary variables and the possible existence of homogeneous clusters (cluster of observations), the principal component factor analysis (Chapter 12) can be chosen as the technique to be used when the main goal is the creation of new variables (factors, or cluster of variables) that capture the joint behavior of the original metric variables.

Chapter 11 also presents the procedures for elaborating the multidimensional scaling technique in SPSS and in Stata. It can be considered a natural extension of the cluster analysis, and its main objectives are to determine the relative positions (coordinates) of each observation in the dataset and to construct two-dimensional charts in which these coordinates are plotted.

It is important to mention that, even though they are not discussed in this book, correspondence analysis techniques are very useful when researchers intend to study possible associations between the variables and between their respective categories. While the simple correspondence analysis is applied to the study of the interdependence relationship between only two categorical variables, which characterizes it as a bivariate technique, the multiple correspondence analysis can be used for a larger number of categorical variables, being, in fact, a multivariate technique. For more details on correspondence analysis techniques, we recommend Fávero and Belfiore (2017).

Box V.1 shows the main objectives of each one of the exploratory techniques discussed in Part V.

BOX V.1 Exploratory Techniques and Main Objectives

Exploratory technique: Cluster Analysis, Hierarchical
Measurement scale: Metric or Binary
Main objectives:
  Sorting and allocation of the observations into groups that are internally homogeneous and heterogeneous between one another.
  Definition of an interesting number of groups.

Exploratory technique: Cluster Analysis, Nonhierarchical
Measurement scale: Metric or Binary
Main objectives:
  Evaluation of the representativeness of each variable for the formation of a previously established number of groups.
  From a predefined number of groups, identification of the allocation of each observation.

Exploratory technique: Principal Component Factor Analysis
Measurement scale: Metric
Main objectives:
  Identification of the correlations between the original variables for creating factors that represent the combination of those variables (reduction or structural simplification).
  Verification of the validity of previously established constructs.
  Construction of rankings through the creation of performance indicators from the factors.
  Extraction of orthogonal factors for future use in multivariate confirmatory techniques that require the absence of multicollinearity.

Each chapter is structured according to the same presentation logic. First, we introduce the concepts regarding each technique, always followed by the algebraic solution of some practical exercises, from datasets elaborated primarily with a more educational focus. Next, the same exercises are solved in the statistical software packages IBM SPSS Statistics Software and Stata Statistical Software. We believe that this logic facilitates the study and understanding of the correct use of each of the techniques and the analysis of the results obtained.
In addition to this, the practical application of the models in SPSS and Stata also offers benefits to researchers, because, at any given moment, the results can be compared to the ones already obtained algebraically in the initial sections of each chapter, besides providing an opportunity to use these important software packages. At the end of each chapter, additional exercises are proposed, whose answers, presented through the outputs generated in SPSS, are available at the end of the book.

Chapter 11

Cluster Analysis

Maybe Hamlet is right. We could be bounded in a nutshell, but counting ourselves kings of infinite space.
Stephen Hawking

11.1 INTRODUCTION

Cluster analysis represents a set of very useful exploratory techniques that can be applied whenever we intend to verify the existence of similar behavior between observations (individuals, companies, municipalities, countries, among other examples) in relation to certain variables, and there is the intention of creating groups or clusters, in which an internal homogeneity prevails. In this regard, this set of techniques has as its main objective to allocate observations to a relatively small number of clusters that are internally homogeneous and heterogeneous between themselves, and that represent the joint behavior of the observations from certain variables. That is, the observations of a certain group must be relatively similar to one another, in relation to the variables inserted in the analysis, and significantly different from the observations found in other groups.

Clustering techniques are considered exploratory, or interdependent, since their applications do not have a predictive nature for other observations not initially present in the sample. Moreover, the inclusion of new observations into the dataset makes it necessary to reapply the modeling, so that, possibly, new clusters can be generated. Besides, the inclusion of a new variable can also generate a complete rearrangement of the observations in the groups.

Researchers can choose to develop a cluster analysis when their main goal is to sort and allocate observations to groups and, from then on, to analyze what the ideal number of clusters formed is. Or they can, a priori, define the number of groups they wish to create, based on certain criteria, and verify how the sorting and allocation of observations behave in that specific number of groups. Regardless of the objective, clustering will continue being exploratory.
If a researcher aims to use a technique to, in fact, confirm the creation of groups and to make the analysis predictive, he can use techniques such as discriminant analysis or multinomial logistic regression. Elaborating a cluster analysis does not require vast knowledge of matrix algebra or statistics, unlike techniques such as factor analysis and correspondence analysis. The researcher interested in applying a cluster analysis needs to, starting from the definition of the research objectives, choose a certain distance or similarity measure that will be the basis for considering the observations closer to or farther from one another, and a certain agglomeration schedule that will have to be chosen between hierarchical and nonhierarchical methods. Then, he will be able to analyze, interpret, and compare the outcomes. It is important to highlight that the outcomes obtained through hierarchical and nonhierarchical agglomeration schedules can be compared and, in this regard, the researcher is free to develop the technique, using one method or another, and to reapply it, if he deems necessary. While hierarchical schedules allow us to identify the sorting and allocation of observations, offering possibilities for researchers to study, assess, and decide the number of clusters formed, in nonhierarchical schedules we start with a known number of clusters and, from then on, we begin allocating the observations to these clusters, with a future evaluation of the representativeness of each variable when creating them. Therefore, the result of one method can serve as input to carry out the other, making the analysis cyclical. Fig. 11.1 shows the logic from which a cluster analysis can be elaborated.
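To make the role of the distance measure more concrete, the sketch below computes Euclidean distances between a few observations and identifies the closest pair, which is the kind of pair a hierarchical agglomeration schedule would typically merge first. It is only an illustration in plain Python: the four individuals and their values are hypothetical, and the use of the Euclidean distance on standardized (z-score) variables is an assumption made here, not a choice stated in the text.

```python
import math
from itertools import combinations

# Hypothetical observations: (age in years, family income).
people = {
    "A": (25, 1800.0),
    "B": (27, 2000.0),
    "C": (58, 7400.0),
    "D": (61, 7000.0),
}

def zscores(column):
    """Standardize a column so that variables on different scales are comparable."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / sd for x in column]

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = list(people)
ages = zscores([people[n][0] for n in names])
incomes = zscores([people[n][1] for n in names])
points = dict(zip(names, zip(ages, incomes)))

# Distance for every pair of observations.
distances = {
    (a, b): euclidean(points[a], points[b]) for a, b in combinations(names, 2)
}
closest = min(distances, key=distances.get)
print(closest)  # the pair of observations with the smallest distance
```

Changing the distance measure, or skipping the standardization, may change which observations are considered closest, which is why this choice should follow from the research objectives.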
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00011-2 © 2019 Elsevier Inc. All rights reserved.

FIG. 11.1 Logic for elaborating a cluster analysis.

When choosing the distance or similarity measure and the agglomeration schedule, we must take some aspects into consideration, such as the previously desired number of clusters, which were defined based on some resource allocation criteria, as well as certain constraints that may lead the researcher to choose a specific solution. According to Bussab et al. (1990), different criteria regarding distance measures and agglomeration schedules may lead to different cluster formations, and the homogeneity desired by the researcher fundamentally depends on the objectives set in the research.

Imagine that a researcher is interested in studying the interdependence between individuals living in a certain municipality based only on two metric variables (age, in years, and average family income, in R\$). His main goal is to assess the effectiveness of social programs aimed at providing health care and then, based on these variables, to propose a still unknown number of new programs aimed at homogeneous groups of people. After collecting the data, the researcher constructed a scatter plot, as shown in Fig. 11.2. Based on the chart seen in Fig. 11.2, the researcher identified four clusters and highlighted them in a new chart (Fig. 11.3). From the creation of these clusters, the researcher decided to develop an analysis of the behavior of the observations in each group, or, more precisely, of the existing variability within the clusters and between them, so that he could clearly and consciously base his decision as regards the allocation of individuals to these four new social programs. In order to illustrate this issue, the researcher constructed the chart found in Fig. 11.4.

FIG. 11.2 Scatter plot with individuals’ Income and Age.


FIG. 11.3 Highlighting the creation of four clusters.

FIG. 11.4 Illustrating the variability within the clusters and between them.

Based on this chart, the researcher was able to notice that the groups formed showed a lot of internal homogeneity, with a certain individual being closer to other individuals in the same group than to individuals in other groups. This is the core of cluster analysis. If the number of social programs to be provided for the population (number of clusters) had already been given to the researcher, due to budgetary, legal, or political constraints, even so we would be able to use clustering, solely, to determine the allocation of individuals from the municipality to that number of programs (groups).


FIG. 11.5 Rearranging the clusters due to the presence of elderly billionaires.

Having concluded the research and allocated the individuals to the different social health care programs, the researcher decided, the following year, to carry out the same research with individuals from the same municipality. However, in the meantime, a group of elderly billionaires had moved to that city and, when he constructed the new scatter plot, the researcher realized that the four clusters clearly formed the previous year no longer existed, since they fused when the billionaires were included. The new scatter plot can be seen in Fig. 11.5.

This new situation exemplifies the importance of always reapplying the cluster analysis whenever new observations (and also new variables) are included, which strips the technique of any predictive power, as we have already discussed. Moreover, this example shows that, before elaborating any cluster analysis, it is advisable for the researcher to study the data behavior and to check the existence of discrepant observations in relation to certain variables, since the creation of clusters is very sensitive to the presence of outliers. Excluding or retaining outliers in the dataset, however, will depend on the research objectives and on the type of data the researcher has. If certain observations represent anomalies in terms of variable values, when compared to the other observations, and end up forming small, insignificant, or even individual clusters, they can, in fact, be excluded. On the other hand, if these observations represent one or more relevant groups, even if they are different from the others, they must be considered in the analysis and, whenever the technique is reapplied, they can be separated so that other segmentations can be better structured in new groups, formed with higher internal homogeneity.
We would like to emphasize that cluster analysis methods are considered static procedures, since the inclusion of new observations or variables may change the clusters, thus, making it mandatory to develop a new analysis. In this example, we realized that the original variables from which the groups are established are metric, since the clustering started from the study of the distance behavior (dissimilarity measures) between the observations. In some cases, as we will study throughout this chapter, cluster analyses can be elaborated from the similarity behavior (similarity measures) between observations that present binary variables. However, it is common for researchers to use the incorrect arbitrary weighting procedure with qualitative variables, as, for example, variables on the Likert scale, and, from then on, to apply a cluster analysis. This is a major error, since there are exploratory techniques meant exclusively for the study of the behavior of qualitative variables as, for example, the correspondence analysis. Historically speaking, even though many distance and similarity measures date back to the end of the 19th century and the beginning of the 20th century, cluster analyses, as a better structured set of techniques, began in the field of Anthropology with Driver and Kroeber (1932), and in Psychology with Zubin (1938a,b) and Tryon (1939), as discussed by Reis


(2001) and Fávero et al. (2009). With the acknowledgment that observation clustering and classification procedures are scientific methods, together with astonishing technological developments, mainly verified after the 1960s, cluster analyses started being used more frequently after Sokal and Sneath’s (1963) relevant work was published, in which procedures are carried out to compare the biological similarities of organisms with similar characteristics and the respective species. Currently, cluster analysis offers several application possibilities in the fields of consumer behavior, market segmentation, strategy, political science, economics, finance, accounting, actuarial science, engineering, logistics, computer science, education, medicine, biology, genetics, biostatistics, psychology, anthropology, demography, geography, ecology, climatology, geology, archeology, criminology and forensics, among others.

In this chapter, we will discuss cluster analysis techniques, aiming at: (1) introducing the concepts; (2) presenting the step by step of modeling, in an algebraic and practical way; (3) interpreting the results obtained; and (4) applying the technique in SPSS and in Stata. Following the logic proposed in the book, first, we will present the algebraic solution of an example jointly with the presentation of the concepts. Only after the introduction of concepts will the procedures for elaborating the techniques in SPSS and Stata be presented.

11.2 CLUSTER ANALYSIS

There are many procedures for elaborating a cluster analysis, since there are different distance or similarity measures for metric or binary variables, respectively. Besides, after defining the distance or similarity measure, the researcher still needs to determine, among several possibilities, the observation clustering method, from certain hierarchical or nonhierarchical criteria. Therefore, when one wishes to group observations in internally homogeneous clusters, what initially seems trivial can become quite complex, because there are multiple combinations between different distance or similarity measures and clustering methods. Hence, based on the underlying theory and on his research objectives, as well as on his experience and intuition, it is extremely important for the researcher to define the criteria from which the observations will be allocated to each one of the groups.

In the following sections, we will discuss the theoretical development of the technique, along with a practical example. In Sections 11.2.1 and 11.2.2, the concepts of distance and similarity measures and clustering methods are presented and discussed, respectively, always followed by the algebraic solutions developed from a dataset.

11.2.1 Defining Distance or Similarity Measures in Cluster Analysis

As we have already discussed, the first phase for elaborating a cluster analysis consists in defining the distance (dissimilarity) or similarity measure that will be the basis for each observation to be allocated to a certain group. Distance measures are frequently used when the variables in the dataset are essentially metric, since, the greater the differences between the variable values of two observations the smaller the similarity between them or, in other words, the higher the dissimilarity. On the other hand, similarity measures are often used when the variables are binary, and what most interests us is the frequency of converging answer pairs 1-1 or 0-0 of two observations. In this case, the greater the frequency of converging pairs, the higher the similarity between the observations. An exception to this rule is Pearson’s correlation coefficient between two observations, calculated from metric variables, however, with similarity characteristics, as we will see in the following section. We will study the dissimilarity measures for metric variables in Section 11.2.1.1 and, in Section 11.2.1.2, we will discuss the similarity measures for binary variables.

11.2.1.1 Distance (Dissimilarity) Measures Between Observations for Metric Variables

As a hypothetical situation, imagine that we intend to calculate the distance between two observations $i$ ($i = 1, 2$) from a dataset that has three metric variables ($X_{1i}$, $X_{2i}$, $X_{3i}$), with values in the same unit of measure. These data can be found in Table 11.1. It is possible to illustrate the configuration of both observations in a three-dimensional space from these data, since we have exactly three variables. Fig. 11.6 shows the relative position of each observation, emphasizing the distance between them ($d_{12}$). Distance $d_{12}$, which is a dissimilarity measure, can be easily calculated by using, for instance, its projection over the horizontal plane formed by axes $X_1$ and $X_2$, called distance $d'_{12}$, as shown in Fig. 11.7.


TABLE 11.1 Part of a Dataset With Two Observations and Three Metric Variables

| Observation i | X1i | X2i | X3i |
|---|---|---|---|
| 1 | 3.7 | 2.7 | 9.1 |
| 2 | 7.8 | 8.0 | 1.5 |

FIG. 11.6 Three-dimensional scatter plot for the hypothetical situation with two observations and three variables.

Thus, based on the well-known Pythagorean distance formula for right-angled triangles, we can determine $d_{12}$ through the following expression:

$$d_{12} = \sqrt{(d'_{12})^2 + (X_{31} - X_{32})^2} \tag{11.1}$$

where $|X_{31} - X_{32}|$ is the distance between the vertical projections (axis $X_3$) of Points 1 and 2. However, distance $d'_{12}$ is unknown to us, so, once again, we need to use the Pythagorean formula, now using the distances of the projections of Points 1 and 2 over the other two axes ($X_1$ and $X_2$), as shown in Fig. 11.8. Thus, we can say that:

$$d'_{12} = \sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2} \tag{11.2}$$

and, substituting (11.2) into (11.1), we have:

$$d_{12} = \sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2 + (X_{31} - X_{32})^2}, \tag{11.3}$$

which is the expression of distance (dissimilarity measure) between Points 1 and 2, also known as the Euclidean distance formula.


FIG. 11.7 Three-dimensional chart highlighting the projection of $d_{12}$ over the horizontal plane.

FIG. 11.8 Projection of the points over the plane formed by $X_1$ and $X_2$ with emphasis on $d'_{12}$.


Therefore, for the data in our example, we have:

$$d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132$$

whose unit of measure is the same as for the original variables in the dataset. It is important to highlight that, if the variables do not have the same unit of measure, a data standardization procedure will have to be carried out previously, as we will discuss later.

We can generalize this problem for a situation in which the dataset has $n$ observations and, for each observation $i$ ($i = 1, ..., n$), values corresponding to each one of the $j$ ($j = 1, ..., k$) metric variables $X$, as shown in Table 11.2. So, Expression (11.4), based on Expression (11.3), presents the general definition of the Euclidean distance between any two observations $p$ and $q$.

$$d_{pq} = \sqrt{(X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + \dots + (X_{kp} - X_{kq})^2} = \sqrt{\sum_{j=1}^{k} (X_{jp} - X_{jq})^2} \tag{11.4}$$
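As an illustration only, Expression (11.4) can be sketched in a few lines of Python. The function name `euclidean_distance` and the variables `obs1` and `obs2` are our own choices for this sketch (they are not part of the SPSS or Stata procedures used later in the chapter), and the data are those of Table 11.1.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two observations (Expression 11.4).

    p and q are sequences holding the values of the same k metric variables.
    """
    if len(p) != len(q):
        raise ValueError("both observations must have the same number of variables")
    return math.sqrt(sum((xp - xq) ** 2 for xp, xq in zip(p, q)))


# Observations 1 and 2 from Table 11.1 (variables X1, X2, X3)
obs1 = [3.7, 2.7, 9.1]
obs2 = [7.8, 8.0, 1.5]

d12 = euclidean_distance(obs1, obs2)
print(round(d12, 3))  # 10.132
```

Because the function iterates over however many values each observation has, the same sketch applies to any number k of variables, provided they share the same unit of measure.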

Although the Euclidean distance is the most commonly used in cluster analyses, there are other dissimilarity measures that can be used, and using each one of them will depend on the researcher’s assumptions and objectives. Next, we will discuss other dissimilarity measures that can be used:

• Squared Euclidean distance: instead of the Euclidean distance, it can be used when the variables show a small dispersion in value, and the use of the squared Euclidean distance makes it easier to interpret the outputs of the analysis and the allocation of the observations to the groups. Its expression is given by:

$$d_{pq} = (X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + \dots + (X_{kp} - X_{kq})^2 = \sum_{j=1}^{k} (X_{jp} - X_{jq})^2 \tag{11.5}$$

• Minkowski distance: it is the most general dissimilarity measure expression, from which others derive. It is given by:

$$d_{pq} = \left[ \sum_{j=1}^{k} \left| X_{jp} - X_{jq} \right|^{m} \right]^{\frac{1}{m}} \tag{11.6}$$

where $m$ takes on positive integer values ($m = 1, 2, ...$). We can see that the Euclidean distance is a particular case of the Minkowski distance, when $m = 2$.

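TABLE 11.2 General Model of a Dataset for Elaborating the Cluster Analysis

| Observation i | X1i | X2i | ... | Xki |
|---|---|---|---|---|
| 1 | X11 | X21 | ... | Xk1 |
| 2 | X12 | X22 | ... | Xk2 |
| ... | ... | ... | ... | ... |
| p | X1p | X2p | ... | Xkp |
| ... | ... | ... | ... | ... |
| q | X1q | X2q | ... | Xkq |
| ... | ... | ... | ... | ... |
| n | X1n | X2n | ... | Xkn |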

• Manhattan distance: also referred to as the absolute or city block distance, it does not consider the triangular geometry that is inherent to Pythagoras’ initial expression and only considers the absolute differences between the values of each variable. Its expression, also a particular case of the Minkowski distance when $m = 1$, is given by:

$$d_{pq} = \sum_{j=1}^{k} \left| X_{jp} - X_{jq} \right| \tag{11.7}$$

• Chebyshev distance: also referred to as the infinite or maximum distance, it considers, for two observations, only the maximum absolute difference among all the $k$ variables being studied. Its expression is given by:

$$d_{pq} = \max_{j} \left| X_{jp} - X_{jq} \right| \tag{11.8}$$

It is a particular (limiting) case of the Minkowski distance as well, when $m = \infty$.

• Canberra distance: used for the cases in which the variables only have positive values, it assumes values between 0 and $k$ (the number of variables). Its expression is given by:

$$d_{pq} = \sum_{j=1}^{k} \frac{\left| X_{jp} - X_{jq} \right|}{X_{jp} + X_{jq}} \tag{11.9}$$
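The dissimilarity measures listed above can be sketched as small Python functions, shown here only as a hedged illustration. The function names are our own, the data come from Table 11.1, and the Canberra sketch assumes strictly positive values (as the text requires), so no denominator is zero.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance (Expression 11.5)."""
    return sum((xp - xq) ** 2 for xp, xq in zip(p, q))


def minkowski(p, q, m):
    """Minkowski distance of order m (Expression 11.6)."""
    return sum(abs(xp - xq) ** m for xp, xq in zip(p, q)) ** (1.0 / m)


def manhattan(p, q):
    """Manhattan (city block) distance (Expression 11.7)."""
    return sum(abs(xp - xq) for xp, xq in zip(p, q))


def chebyshev(p, q):
    """Chebyshev (maximum) distance (Expression 11.8)."""
    return max(abs(xp - xq) for xp, xq in zip(p, q))


def canberra(p, q):
    """Canberra distance (Expression 11.9); assumes positive values only."""
    return sum(abs(xp - xq) / (xp + xq) for xp, xq in zip(p, q))


# Observations 1 and 2 from Table 11.1
obs1 = [3.7, 2.7, 9.1]
obs2 = [7.8, 8.0, 1.5]

results = {
    "squared Euclidean": squared_euclidean(obs1, obs2),
    "Minkowski (m = 1)": minkowski(obs1, obs2, 1),
    "Minkowski (m = 2)": minkowski(obs1, obs2, 2),
    "Manhattan": manhattan(obs1, obs2),
    "Chebyshev": chebyshev(obs1, obs2),
    "Canberra": canberra(obs1, obs2),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In this sketch, the Minkowski distance with m = 1 coincides with the Manhattan distance and with m = 2 with the Euclidean distance, while for large m (for example, `minkowski(obs1, obs2, 50)`) the value approaches the Chebyshev distance.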

Whenever there are metric variables, the researcher can also use Pearson’s correlation, which, even though it is not a dissimilarity measure (in fact, it is a similarity measure), can provide important information when the aim is to group rows from the dataset. Pearson’s correlation expression, between the values of any two observations $p$ and $q$, based on Expression (4.11) presented in Chapter 4, can be written as follows:

$$r_{pq} = \frac{\sum_{j=1}^{k} \left( X_{jp} - \bar{X}_p \right) \cdot \left( X_{jq} - \bar{X}_q \right)}{\sqrt{\sum_{j=1}^{k} \left( X_{jp} - \bar{X}_p \right)^2} \cdot \sqrt{\sum_{j=1}^{k} \left( X_{jq} - \bar{X}_q \right)^2}} \tag{11.10}$$

where $\bar{X}_p$ and $\bar{X}_q$ represent the mean of all variable values for observations $p$ and $q$, respectively, that is, the mean of each one of the rows in the dataset. Therefore, we can see that we are dealing with a coefficient of correlation between rows and not between columns (variables). Pearson’s correlation is the most commonly used correlation coefficient in data analysis, and its values vary between $-1$ and $1$.

Pearson’s correlation coefficient can be used as a similarity measure between the rows of the dataset in analyses that include time series, for example, that is, cases in which the observations represent periods. In this case, the researcher may intend to study the correlations between different periods, to investigate, for instance, a possible recurrence of behavior in the same row for the set of variables, which may cause certain periods, not necessarily subsequent ones, to be grouped by similarity of behavior.

Going back to the data presented in Table 11.1, we can calculate the different distance measures between observations 1 and 2, given by Expressions (11.4)–(11.9), as well as the correlational similarity measure, given by Expression (11.10). Table 11.3 shows these calculations and the respective results. Based on the results shown in Table 11.3, we can see that different measures produce different results, which may cause the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis, as discussed by Vicini and Souza (2005) and Malhotra (2012). Therefore, it is essential for the researcher to always underpin his choice and to bear in mind the reasons why he decided to use a certain measure, instead of others. Simply using more than one measure, when analyzing the same dataset, can support this decision, since, in this case, the results can be compared.

This becomes really clear when we include a third observation in the analysis, as shown in Table 11.4. While the Euclidean distance suggests that the most similar observations (the shortest distance) are 2 and 3, when we use the Chebyshev distance, observations 1 and 3 are the most similar. Table 11.5 shows these distances for each pair of observations, highlighting, in bold characters, the smallest value of each distance.


TABLE 11.3 Distance and Correlational Similarity Measures Between Observations 1 and 2

| Observation i | X1i | X2i | X3i | Mean |
|---|---|---|---|---|
| 1 | 3.7 | 2.7 | 9.1 | 5.167 |
| 2 | 7.8 | 8.0 | 1.5 | 5.767 |

| Measure | Calculation |
|---|---|
| Euclidean distance | $d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132$ |
| Squared Euclidean distance | $d_{12} = (3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2 = 102.660$ |
| Manhattan distance | $d_{12} = \lvert 3.7 - 7.8 \rvert + \lvert 2.7 - 8.0 \rvert + \lvert 9.1 - 1.5 \rvert = 17.000$ |
| Chebyshev distance | $d_{12} = \lvert 9.1 - 1.5 \rvert = 7.600$ |
| Canberra distance | $d_{12} = \frac{\lvert 3.7 - 7.8 \rvert}{(3.7 + 7.8)} + \frac{\lvert 2.7 - 8.0 \rvert}{(2.7 + 8.0)} + \frac{\lvert 9.1 - 1.5 \rvert}{(9.1 + 1.5)} = 1.569$ |
| Pearson’s correlation (similarity) | $r_{12} = \frac{(3.7 - 5.167) \cdot (7.8 - 5.767) + (2.7 - 5.167) \cdot (8.0 - 5.767) + (9.1 - 5.167) \cdot (1.5 - 5.767)}{\sqrt{(3.7 - 5.167)^2 + (2.7 - 5.167)^2 + (9.1 - 5.167)^2} \cdot \sqrt{(7.8 - 5.767)^2 + (8.0 - 5.767)^2 + (1.5 - 5.767)^2}} = -0.993$ |
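The correlational similarity measure in Table 11.3 can be checked with a short Python sketch of Expression (11.10). The function name `pearson_between_rows` is our own, and the sketch assumes that neither observation has all of its variable values equal (otherwise the denominator would be zero).

```python
import math


def pearson_between_rows(p, q):
    """Pearson's correlation between two observations (rows), Expression 11.10."""
    mean_p = sum(p) / len(p)  # mean of the row for observation p
    mean_q = sum(q) / len(q)  # mean of the row for observation q
    numerator = sum((xp - mean_p) * (xq - mean_q) for xp, xq in zip(p, q))
    denominator = (
        math.sqrt(sum((xp - mean_p) ** 2 for xp in p))
        * math.sqrt(sum((xq - mean_q) ** 2 for xq in q))
    )
    return numerator / denominator


# Observations 1 and 2 from Table 11.1
obs1 = [3.7, 2.7, 9.1]
obs2 = [7.8, 8.0, 1.5]

r12 = pearson_between_rows(obs1, obs2)
print(round(r12, 3))  # -0.993
```

The strongly negative value indicates that, across the three variables, the rows of observations 1 and 2 move in nearly opposite directions, which is consistent with the large distances obtained for this pair.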

TABLE 11.4 Part of the Dataset With Three Observations and Three Metric Variables

| Observation i | X1i | X2i | X3i |
|---|---|---|---|
| 1 | 3.7 | 2.7 | 9.1 |
| 2 | 7.8 | 8.0 | 1.5 |
| 3 | 8.9 | 1.0 | 2.7 |

TABLE 11.5 Euclidean and Chebyshev Distances Between the Pairs of Observations Seen in Table 11.4

| Distance | Pair of Observations 1 and 2 | Pair of Observations 1 and 3 | Pair of Observations 2 and 3 |
|---|---|---|---|
| Euclidean | $d_{12} = 10.132$ | $d_{13} = 8.420$ | **$d_{23} = 7.187$** |
| Chebyshev | $d_{12} = 7.600$ | **$d_{13} = 6.400$** | $d_{23} = 7.000$ |
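To make the comparison in Table 11.5 concrete, the sketch below (an illustration under our own naming, not a procedure from the chapter) computes both distances for every pair of observations in Table 11.4 and reports the closest pair under each measure.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((xp - xq) ** 2 for xp, xq in zip(p, q)))


def chebyshev(p, q):
    return max(abs(xp - xq) for xp, xq in zip(p, q))


# Observations from Table 11.4, keyed by observation number
data = {
    1: [3.7, 2.7, 9.1],
    2: [7.8, 8.0, 1.5],
    3: [8.9, 1.0, 2.7],
}


def closest_pair(distance):
    """Return the pair of observation numbers with the smallest distance."""
    pairs = combinations(sorted(data), 2)
    return min(pairs, key=lambda pair: distance(data[pair[0]], data[pair[1]]))


for i, j in combinations(sorted(data), 2):
    print(f"pair ({i}, {j}): Euclidean = {euclidean(data[i], data[j]):.3f}, "
          f"Chebyshev = {chebyshev(data[i], data[j]):.3f}")

print("closest by Euclidean:", closest_pair(euclidean))  # (2, 3)
print("closest by Chebyshev:", closest_pair(chebyshev))  # (1, 3)
```

The two measures disagree on which pair is closest, which is exactly why the first merge of a clustering schedule may differ depending on the dissimilarity measure chosen.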

Hence, in a certain cluster schedule, and only due to the dissimilarity measure chosen, we would have different initial clusters. Besides deciding which distance measure to choose, the researcher also has to verify if the data need to be treated previously. So far, in the examples we have already discussed, we were careful to choose metric variables with values in the same unit of measure (as, for example, students’ grades in Math, Physics, and Chemistry, which vary from 0 to 10). However, if the variables are measured in different units (as, for example, income in R\$, educational level in years of study, and number of children), the intensity of the distances between the observations may be arbitrarily influenced by the variables that will possibly present greater magnitude in their values, to the detriment of the others. In these situations, the


researcher must standardize the data, so that the arbitrary nature of the measurement units may be eliminated, making each variable have the same contribution over the distance measure considered. The Z-scores procedure is the most frequently used method to standardize variables. In it, for each observation $i$, the value of a new standardized variable $ZX_j$ is obtained by subtracting the mean of the corresponding original variable $X_j$ from its value and, after that, dividing the resulting value by the standard deviation of $X_j$, as presented in Expression (11.11).

$$ZX_{ji} = \frac{X_{ji} - \bar{X}_j}{s_j} \tag{11.11}$$

where $\bar{X}_j$ and $s_j$ represent the mean and the standard deviation of variable $X_j$. Hence, regardless of the magnitude of the values and of the type of measurement units of the original variables in a dataset, all the respective variables standardized by the Z-scores procedure will have a mean equal to zero and a standard deviation equal to 1, which ensures that the possible influence of arbitrary measurement units over the distance between each pair of observations will be eliminated. In addition, Z-scores have the advantage of not changing the shape of the distribution of the original variable.

Therefore, if the original variables are in different units, distance measure Expressions (11.4)–(11.9) must have the terms $X_{jp}$ and $X_{jq}$, respectively, substituted for $ZX_{jp}$ and $ZX_{jq}$. Table 11.6 presents these expressions, based on the standardized variables. Even though Pearson’s correlation is not a dissimilarity measure (in fact, it is a similarity measure), it is important to mention that its use also requires that the variables be standardized by using the Z-scores procedure in case they do not have the same measurement units.

If the main goal were to group variables, which is the main goal of the following chapter (factor analysis), the standardization of variables through the Z-scores procedure would, in fact, be irrelevant, given that the analysis would consist in assessing the correlation between columns of the dataset. On the other hand, as the objective of this chapter is to group rows from the dataset that represent the observations, the standardization of the variables is necessary for elaborating an accurate cluster analysis.
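A minimal sketch of the Z-scores procedure in Expression (11.11) follows. The income and schooling values are hypothetical numbers invented for this illustration (they do not come from the book), and we assume the sample standard deviation (denominator n − 1), as computed by `statistics.stdev`.

```python
import statistics


def z_scores(values):
    """Standardize one variable (one column of the dataset), Expression 11.11."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(x - mean) / std for x in values]


# Hypothetical variables measured in very different units
income = [1500.0, 3200.0, 800.0, 5400.0]   # hypothetical income, in R$
years_of_study = [8.0, 12.0, 5.0, 16.0]    # hypothetical educational level

z_income = z_scores(income)
z_years = z_scores(years_of_study)

# After standardization, each variable has mean 0 and standard deviation 1
print(round(statistics.mean(z_income), 6), round(statistics.stdev(z_income), 6))
print(round(statistics.mean(z_years), 6), round(statistics.stdev(z_years), 6))
```

Distances would then be computed on the standardized columns, so that income, with its much larger magnitude, no longer dominates the result.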

11.2.1.2 Similarity Measures Between Observations for Binary Variables

Now, imagine that we intend to calculate the distance between two observations $i$ ($i = 1, 2$) coming from a dataset that has seven variables ($X_{1i}, ..., X_{7i}$), all of them, however, related to the presence or absence of characteristics. In this situation, it is common for the presence or absence of a certain characteristic to be represented by a binary variable, or a dummy, which assumes value 1, in case the characteristic occurs, and 0, if otherwise. These data can be found in Table 11.7.

It is important to highlight that the use of binary variables does not generate arbitrary weighting problems resulting from the variable categories, contrary to what would happen if discrete values (1, 2, 3, ...) were assigned to each category of each qualitative variable. In this regard, if a certain qualitative variable has $k$ categories, $(k - 1)$ binary variables will be necessary to represent the presence or absence of each one of the categories. Thus, all the binary variables will be equal to 0 in case the reference category occurs.
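The (k − 1) dummy coding described above can be sketched as follows. The variable `region`, its categories, and the helper `to_dummies` are hypothetical names introduced only for this illustration.

```python
def to_dummies(values, reference):
    """Code a qualitative variable with k categories as (k - 1) binary variables.

    The reference category is represented by all dummies equal to 0.
    """
    categories = sorted(set(values) - {reference})
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical qualitative variable with k = 3 categories
region = ["north", "south", "east", "north"]

dummy_names, dummy_rows = to_dummies(region, reference="north")
print(dummy_names)  # ['east', 'south']
for value, row in zip(region, dummy_rows):
    print(value, row)
```

No category receives an arbitrary numeric weight in this coding: each dummy only records whether a category is present, and the reference category appears as a row of zeros.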

TABLE 11.6 Distance Measure Expressions With Standardized Variables

| Distance Measure (Dissimilarity) | Expression |
|---|---|
| Euclidean | $d_{pq} = \sqrt{\sum_{j=1}^{k} \left( ZX_{jp} - ZX_{jq} \right)^2}$ |
| Squared Euclidean | $d_{pq} = \sum_{j=1}^{k} \left( ZX_{jp} - ZX_{jq} \right)^2$ |
| Minkowski | $d_{pq} = \left[ \sum_{j=1}^{k} \lvert ZX_{jp} - ZX_{jq} \rvert^{m} \right]^{\frac{1}{m}}$ |
| Manhattan | $d_{pq} = \sum_{j=1}^{k} \lvert ZX_{jp} - ZX_{jq} \rvert$ |
| Chebyshev | $d_{pq} = \max_{j} \lvert ZX_{jp} - ZX_{jq} \rvert$ |
| Canberra | $d_{pq} = \sum_{j=1}^{k} \frac{\lvert ZX_{jp} - ZX_{jq} \rvert}{\left( ZX_{jp} + ZX_{jq} \right)}$ |


TABLE 11.7 Part of the Dataset With Two Observations and Seven Binary Variables

| Observation i | X1i | X2i | X3i | X4i | X5i | X6i | X7i |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| 2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |

Therefore, by using Expression (11.5), we can calculate the squared Euclidean distance between observations 1 and 2, as follows:

$$d_{12} = \sum_{j=1}^{7} \left( X_{j1} - X_{j2} \right)^2 = (0 - 0)^2 + (0 - 1)^2 + (1 - 1)^2 + (1 - 1)^2 + (0 - 1)^2 + (1 - 0)^2 + (1 - 1)^2 = 3,$$

which represents the total number of variables with answer differences between observations 1 and 2. Therefore, for any two observations $p$ and $q$, the greater the number of equal answers (0-0 or 1-1), the shorter the squared Euclidean distance between them will be, since:

$$\left( X_{jp} - X_{jq} \right)^2 = \begin{cases} 0 & \text{if } X_{jp} = X_{jq} \ (\text{both equal to 0 or both equal to 1}) \\ 1 & \text{if } X_{jp} \neq X_{jq} \end{cases} \tag{11.12}$$
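As a quick check of Expression (11.12), the sketch below (with function names of our own choosing) shows that, for binary variables, the squared Euclidean distance coincides with the count of diverging answers, using the data from Table 11.7.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance (Expression 11.5)."""
    return sum((xp - xq) ** 2 for xp, xq in zip(p, q))


def count_mismatches(p, q):
    """Number of variables with diverging answers (0-1 or 1-0)."""
    return sum(1 for xp, xq in zip(p, q) if xp != xq)


# Observations 1 and 2 from Table 11.7 (seven binary variables)
obs1 = [0, 0, 1, 1, 0, 1, 1]
obs2 = [0, 1, 1, 1, 1, 0, 1]

print(squared_euclidean(obs1, obs2))  # 3
print(count_mismatches(obs1, obs2))   # 3
```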

As discussed by Johnson and Wichern (2007), each term of the distance represented by Expression (11.12) is considered to be a dissimilarity measure, since the greater the number of answer discrepancies, the greater the squared Euclidean distance. On the other hand, the calculations weight the pairs of answers 0-0 and 1-1 equally, without giving higher relative importance to the pair of answers 1-1 that, in many cases, is a stronger similarity indicator than the pair of answers 0-0. For example, when we group people, the fact that two of them eat lobster every day is stronger evidence of similarity than the absence of this characteristic for both.

Hence, many authors, aiming at defining similarity measures between observations, proposed the use of coefficients that would take the similarity of the answers 1-1 and 0-0 into consideration, and these pairs would not necessarily have the same relative importance. In order for us to be able to present these measures, it is necessary to construct an absolute frequency table of answers 0 and 1 for each pair of observations $p$ and $q$, as shown in Table 11.8. Next, based on this table, we will discuss the main similarity measures, bearing in mind that the use of each one depends on the researcher’s assumptions and objectives.

• Simple matching coefficient (SMC): it is the most frequently used similarity measure for binary variables, and it is discussed and used by Zubin (1938a), and by Sokal and Michener (1958). This coefficient, which gives equal weight to the converging 1-1 and 0-0 answers, has its expression given by:

$$s_{pq} = \frac{a + d}{a + b + c + d} \tag{11.13}$$

TABLE 11.8 Absolute Frequencies of Answers 0 and 1 for Two Observations p and q

| Observation q \ Observation p | 1 | 0 | Total |
|---|---|---|---|
| 1 | a | b | a + b |
| 0 | c | d | c + d |
| Total | a + c | b + d | a + b + c + d |

• Jaccard index: even though it was first proposed by Gilbert (1884), it received this name because it was discussed and used in two extremely important papers developed by Jaccard (1901, 1908). This measure, also known as the Jaccard similarity coefficient, does not take the frequency of the pair of answers 0-0 into consideration, which is considered irrelevant. However, it is possible to come across a situation in which all the variables are equal to 0 for two observations, that is, there is only frequency in cell d of Table 11.8. In this case, software packages such as Stata present the Jaccard index equal to 1, which makes sense from a similarity standpoint. Its expression is given by:

$$s_{pq} = \frac{a}{a + b + c} \tag{11.14}$$

• Dice similarity coefficient (DSC): although it is only known by this name, it was suggested and discussed by Czekanowski (1932), Dice (1945), and Sørensen (1948). It is similar to the Jaccard index; however, it doubles the weight over the frequency of converging type 1-1 answer pairs. Just as in that case, software such as Stata present the Dice coefficient equal to 1 for the cases in which all the variables are equal to 0 for two observations, thus avoiding any uncertainty in the calculation. Its expression is given by:

$$s_{pq} = \frac{2 \cdot a}{2 \cdot a + b + c} \tag{11.15}$$

• Anti-Dice similarity coefficient: it was initially proposed by Sokal and Sneath (1963) and Anderberg (1973). The name anti-Dice comes from the fact that this coefficient doubles the weight over the frequencies of answer pairs other than type 1-1, that is, it doubles the weight over the answer divergences (1-0 and 0-1). Just as the Jaccard and the Dice coefficients, the anti-Dice coefficient also ignores the frequency of 0-0 answer pairs. Its expression is given by:

$$s_{pq} = \frac{a}{a + 2 \cdot (b + c)} \tag{11.16}$$

• Russell and Rao similarity coefficient: it is also widely used and it only favors the similarities of 1-1 answers in the calculation of its coefficient. It was proposed by Russell and Rao (1940), and its expression is given by:

$$s_{pq} = \frac{a}{a + b + c + d} \tag{11.17}$$

• Ochiai similarity coefficient: even though it is known by this name, it was initially proposed by Driver and Kroeber (1932), and, later on, it was used by Ochiai (1957). This coefficient is undefined when one or both observations being studied present all the variable values equal to 0. However, if both vectors present all the values equal to 0, software such as Stata present the Ochiai coefficient equal to 1. If this happens for only one of the two vectors, the Ochiai coefficient is considered equal to 0. Its expression is given by:

$$s_{pq} = \frac{a}{\sqrt{(a + b) \cdot (a + c)}} \tag{11.18}$$

• Yule similarity coefficient: proposed by Yule (1900) and used by Yule and Kendall (1950), this similarity coefficient for binary variables offers as an answer a coefficient that varies from $-1$ to $1$. As we can see through its expression, the coefficient generated is undefined if one or both vectors compared present all the values equal to 0 or 1. Software such as Stata generate the Yule coefficient equal to 1, if $b = c = 0$ (a total convergence of answers), and equal to $-1$, if $a = d = 0$ (a total divergence of answers). Its expression is given by:

$$s_{pq} = \frac{a \cdot d - b \cdot c}{a \cdot d + b \cdot c} \tag{11.19}$$

• Rogers and Tanimoto similarity coefficient: this coefficient, which doubles the weight of discrepant answers 0-1 and 1-0 in relation to the weight of the combinations of converging type 1-1 and 0-0 answers, was initially proposed by Rogers and Tanimoto (1960). Its expression, which becomes equal to the anti-Dice coefficient when the frequency of 0-0 answers is equal to 0 ($d = 0$), is given by:

$$s_{pq} = \frac{a + d}{a + d + 2 \cdot (b + c)} \tag{11.20}$$

• Sneath and Sokal similarity coefficient: different from the Rogers and Tanimoto coefficient, this coefficient, proposed by Sneath and Sokal (1962), doubles the weight of converging type 1-1 and 0-0 answers in relation to the other answer combinations (1-0 and 0-1). Its expression, which becomes equal to the Dice coefficient when the frequency of type 0-0 answers is equal to 0 ($d = 0$), is given by:

$$s_{pq} = \frac{2 \cdot (a + d)}{2 \cdot (a + d) + b + c} \tag{11.21}$$

• Hamann similarity coefficient: Hamann (1961) proposed this similarity coefficient for binary variables aiming at having the frequencies of discrepant answers (1-0 and 0-1) subtracted from the total of converging answers (1-1 and 0-0). This coefficient, which varies from $-1$ (total answer divergence) to $1$ (total answer convergence), is equal to two times the simple matching coefficient minus 1. Its expression is given by:

$$s_{pq} = \frac{(a + d) - (b + c)}{a + b + c + d} \tag{11.22}$$

As was done in Section 11.2.1.1 for the dissimilarity measures applied to metric variables, let’s go back to the data presented in Table 11.7, aiming at calculating the different similarity measures between observations 1 and 2, which only have binary variables. In order to do that, from that table, we must construct the absolute frequency table of answers 0 and 1 for these observations (Table 11.9). Then, using Expressions (11.13)–(11.22), we are able to calculate the similarity measures themselves. Table 11.10 presents the calculations and the results of each coefficient.

Analogous to what was discussed when the dissimilarity measures were calculated, we can clearly see that different similarity measures generate different results. As a consequence, when the clustering method is defined, the observations may be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis. Bear in mind that it does not make any sense to apply the Z-scores standardization procedure to calculate the similarity measures discussed in this section, since the variables used for the cluster analysis are binary.

At this moment, it is important to emphasize that, instead of using similarity measures to define the clusters whenever there are binary variables, it is very common to define clusters from the coordinates of each observation, which can be generated through simple or multiple correspondence analyses, for instance. Correspondence analysis is an exploratory technique applied solely to datasets that have qualitative variables, aiming at creating perceptual maps, which are constructed based on the frequency of the categories of each one of the variables in analysis (Fávero and Belfiore, 2017).

After defining the coefficient that will be used, based on the research objectives, on the underlying theory, and on their experience and intuition, researchers must move on to the definition of the cluster schedule. The main cluster analysis schedules will be studied in the following section.

11.2.2 Agglomeration Schedules in Cluster Analysis

As discussed by Vicini and Souza (2005) and Johnson and Wichern (2007), in cluster analysis, choosing the clustering method, also known as agglomeration schedule, is as important as defining the distance (or similarity) measure, and this decision must also be made based on what researchers intend to do in terms of their research objectives.

TABLE 11.9 Absolute Frequencies of Answers 0 and 1 for Observations 1 and 2

| Observation 1 \ Observation 2 | 1 | 0 | Total |
|---|---|---|---|
| 1 | 3 | 2 | 5 |
| 0 | 1 | 1 | 2 |
| Total | 4 | 3 | 7 |


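TABLE 11.10 Similarity Measures Between Observations 1 and 2

| Coefficient | Calculation |
|---|---|
| Simple Matching | $s_{12} = \frac{3 + 1}{7} = 0.571$ |
| Jaccard | $s_{12} = \frac{3}{6} = 0.500$ |
| Dice | $s_{12} = \frac{2 \cdot (3)}{2 \cdot (3) + 2 + 1} = 0.667$ |
| Anti-Dice | $s_{12} = \frac{3}{3 + 2 \cdot (2 + 1)} = 0.333$ |
| Russell and Rao | $s_{12} = \frac{3}{7} = 0.429$ |
| Ochiai | $s_{12} = \frac{3}{\sqrt{(3 + 2) \cdot (3 + 1)}} = 0.671$ |
| Yule | $s_{12} = \frac{3 \cdot 1 - 2 \cdot 1}{3 \cdot 1 + 2 \cdot 1} = 0.200$ |
| Rogers and Tanimoto | $s_{12} = \frac{3 + 1}{3 + 1 + 2 \cdot (2 + 1)} = 0.400$ |
| Sneath and Sokal | $s_{12} = \frac{2 \cdot (3 + 1)}{2 \cdot (3 + 1) + 2 + 1} = 0.727$ |
| Hamann | $s_{12} = \frac{(3 + 1) - (2 + 1)}{7} = 0.143$ |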
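The calculations in Table 11.10 can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original text: the function name `binary_similarities` and the dictionary keys are illustrative choices, and the formulas are the ones read from Table 11.10, with a, b, c, and d standing for the frequencies of the 1-1, 1-0, 0-1, and 0-0 answers in Table 11.9. Zero denominators (for example, when only cell d has frequency) are not handled here.

```python
from math import sqrt


def binary_similarities(a, b, c, d):
    """Similarity coefficients between two observations with binary variables.

    a, b, c, d are the frequencies of the answer pairs 1-1, 1-0, 0-1 and 0-0.
    """
    n = a + b + c + d
    return {
        "simple_matching": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "anti_dice": a / (a + 2 * (b + c)),
        "russell_rao": a / n,
        "ochiai": a / sqrt((a + b) * (a + c)),
        "yule": (a * d - b * c) / (a * d + b * c),
        "rogers_tanimoto": (a + d) / (a + d + 2 * (b + c)),
        "sneath_sokal": 2 * (a + d) / (2 * (a + d) + b + c),
        "hamann": ((a + d) - (b + c)) / n,
    }


# Frequencies from Table 11.9 (observations 1 and 2)
coefficients = binary_similarities(a=3, b=2, c=1, d=1)
for name, value in coefficients.items():
    print(f"{name:16s} {value:.3f}")
```

Because the ten coefficients weigh the same four frequencies differently, the printed values range from 0.143 to 0.727 for the same pair of observations, which is why the choice of measure can change the clusters obtained.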

Basically, agglomeration schedules can be classified into two types: hierarchical and nonhierarchical. While the former are characterized by a hierarchical (step-by-step) structure when forming clusters, nonhierarchical schedules use algorithms to maximize the homogeneity within each cluster, without going through a hierarchical process.

Hierarchical agglomeration schedules can be clustering (agglomerative) or partitioning, depending on how the process starts. If all the observations are considered to be separated and, from their distances (or similarities), groups are formed until we reach a final stage with only one cluster, then this process is known as clustering. Among all hierarchical agglomeration schedules, the most commonly used are those that have the following linkage methods: nearest-neighbor or single-linkage, furthest-neighbor or complete-linkage, and between-groups or average-linkage. On the other hand, if all the observations are considered grouped and, stage after stage, smaller groups are formed by the separation of each observation, until these subdivisions generate individual groups (that is, totally separated observations), then we have a partitioning process.

Conversely, nonhierarchical agglomeration schedules, among which the most popular is the k-means procedure, refer to processes in which clustering centers are defined and the observations are allocated based on their proximity to these centers. Different from hierarchical schedules, in which the researcher can study the several possibilities for allocating observations and even define the ideal number of clusters based on each one of the grouping stages, a nonhierarchical agglomeration schedule requires that we stipulate in advance the number of clusters from which the clustering centers will be defined and the observations allocated.
That is why we recommend the generation of a hierarchical agglomeration schedule before constructing a nonhierarchical schedule, when there is no reasonable estimate of the number of clusters that can be formed from the observations in the dataset and based on the variables in study. Fig. 11.9 shows the logic of agglomeration schedules in cluster analysis. We will study hierarchical agglomeration schedules in Section 11.2.2.1, and Section 11.2.2.2 will be used to discuss the nonhierarchical k-means agglomeration schedule.

11.2.2.1 Hierarchical Agglomeration Schedules

In this section, we will discuss the main hierarchical agglomeration schedules, in which larger and larger clusters are formed at each clustering stage because new observations or groups are added to them, according to a certain criterion (linkage method) and based on the distance measure chosen. In Section 11.2.2.1.1, the main concepts of these schedules will be presented, and, in Section 11.2.2.1.2, a practical example will be presented and solved algebraically.

11.2.2.1.1 Notation

There are three main linkage methods in hierarchical agglomeration schedules, as shown in Fig. 11.9: the nearest-neighbor or single-linkage, the furthest-neighbor or complete-linkage, and the between-groups or average-linkage. Table 11.11 illustrates the distance to be considered in each clustering stage, based on the linkage method chosen.


FIG. 11.9 Agglomeration schedules in cluster analysis. (Diagram: agglomeration schedules are either nonhierarchical (k-means) or hierarchical; hierarchical schedules are either agglomerative or partitioning; the linkage methods are nearest neighbor (single linkage), furthest neighbor (complete linkage), and between groups (average linkage).)

TABLE 11.11 Distance to be Considered Based on the Linkage Method

| Linkage Method | Distance (Dissimilarity) |
|---|---|
| Single (Nearest-Neighbor or Single-Linkage) | d23 |
| Complete (Furthest-Neighbor or Complete-Linkage) | d15 |
| Average (Between-Groups or Average-Linkage) | (d13 + d14 + d15 + d23 + d24 + d25) / 6 |

Note: in the illustrations, one cluster is formed by observations 1 and 2, and the other by observations 3, 4, and 5.

The single-linkage method favors the shortest distances (thus, the nomenclature nearest neighbor) so that new clusters can be formed at each clustering stage through the incorporation of observations or groups. Therefore, applying it is advisable in cases in which the observations are relatively far apart, that is, different, and we would like to form clusters considering a minimum of homogeneity. On the other hand, its analysis may be hampered when there are observations or clusters just a little farther apart from each other, as shown in Fig. 11.10. The complete-linkage method, on the other hand, goes in the opposite direction, that is, it favors the greatest distances between the observations or groups so that new clusters can be formed (hence, the name furthest neighbor) and, in this regard, using it is advisable in cases in which there is no considerable distance between the observations, and the researcher needs to identify the heterogeneities between them. Finally, in the average-linkage method, two groups merge based on the average distance between all the pairs of observations that are in these groups (hence, the name average linkage). Accordingly, even though there are changes in the calculation of the distance measures between the clusters, the average-linkage method ends up preserving the order of the observations in each group, offered by the single-linkage method, in case there is a considerable distance between the observations. The same happens with the sorting solution provided by the complete-linkage method, if the observations are very close to each other.
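To make the three criteria in Table 11.11 concrete, the sketch below computes the single-, complete-, and average-linkage distances between a cluster formed by observations 1 and 2 and another formed by observations 3, 4, and 5. The coordinates are hypothetical (they do not come from the book) and were chosen only so that observations 2 and 3 are the closest pair and observations 1 and 5 the farthest; the function name `linkage_distance` is also an assumption.

```python
from itertools import product
from math import dist  # available from Python 3.8
from statistics import mean

# Hypothetical coordinates for observations 1 to 5 (illustrative only)
points = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (3.0, 0.0), 4: (4.0, 1.0), 5: (6.0, 0.0)}


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the chosen linkage method."""
    # All distances between one observation of each cluster
    pairwise = [dist(points[p], points[q]) for p, q in product(cluster_a, cluster_b)]
    rules = {"single": min, "complete": max, "average": mean}
    return rules[method](pairwise)


left, right = (1, 2), (3, 4, 5)
results = {m: linkage_distance(left, right, m) for m in ("single", "complete", "average")}
for method, value in results.items():
    print(f"{method:8s} {value:.3f}")
```

For any pair of clusters, the single-linkage distance is never greater than the average-linkage distance, which in turn is never greater than the complete-linkage distance.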


FIG. 11.10 Single-linkage method—Hampered analysis when there are observations or clusters just a little further apart.

Johnson and Wichern (2007) proposed a logical sequence of steps in order to facilitate the understanding of a cluster analysis elaborated through a certain hierarchical agglomerative method:

1. If n is the number of observations in a dataset, we must start the agglomeration schedule with exactly n individual groups (stage 0), such that we will initially have a distance (or similarity) matrix D0 formed by the distances between each pair of observations.
2. In the first stage, we must choose the smallest distance among all of those that form matrix D0, that is, the one that connects the two most similar observations. At this exact moment, we will no longer have n individual groups; we will have (n − 1) groups, one of which is formed by two observations.
3. In the following clustering stage, we must repeat the previous stage. However, we now have to take into consideration the distances between each pair of isolated observations and between the first group already formed and each one of the other observations, based on the linkage method adopted. In other words, after the first clustering stage, we will have matrix D1 with dimensions (n − 1) × (n − 1), in which one of the rows will be represented by the first grouped pair of observations. Consequently, in the second stage, a new group will be formed by the grouping of two new observations or by adding a certain observation to the group formed in the first stage.
4. The previous process must be repeated until there is only a single group formed by all the observations, which takes (n − 1) clustering stages in total. In other words, after stage (n − 2), we will have matrix Dn−2, which will only contain the distance between the last two remaining groups, before the final fusion.
5. Finally, from the clustering stages and the distances between the clusters formed, it is possible to develop a tree-shaped diagram that summarizes the clustering process and explains the allocation of each observation to each cluster. This diagram is known as a dendrogram or a phenogram.

Therefore, the values that form the D matrices of each one of the stages will be a function of the distance measure chosen and of the linkage method adopted. In a certain clustering stage s, imagine that a researcher groups two previously formed clusters M and N, containing m and n observations, respectively, so that cluster MN can be formed. Next, the researcher intends to group MN with another cluster W, with w observations. Since we know that, in hierarchical agglomerative methods, the next grouping is always decided by the smallest distance between each pair of observations or groups, the agglomeration schedule is essential for analyzing the distances that form each matrix Ds. Using this logic and based on Table 11.11, let’s discuss the criterion to calculate the distance between the clusters MN and W, inserted in matrix Ds, based on the linkage method:

Nearest-Neighbor or Single-Linkage Method:

$$d_{(MN)W} = \min\{d_{MW};\, d_{NW}\} \qquad (11.23)$$

where dMW and dNW are the distances between the closest observations in clusters M and W and in clusters N and W, respectively.

Furthest-Neighbor or Complete-Linkage Method:

$$d_{(MN)W} = \max\{d_{MW};\, d_{NW}\} \qquad (11.24)$$

where dMW and dNW are the distances between the farthest observations in clusters M and W and in clusters N and W, respectively.


TABLE 11.12 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exam

| Student (Observation) | Grade in Mathematics (X1i) | Grade in Physics (X2i) | Grade in Chemistry (X3i) |
|---|---|---|---|
| Gabriela | 3.7 | 2.7 | 9.1 |
| Luiz Felipe | 7.8 | 8.0 | 1.5 |
| Patricia | 8.9 | 1.0 | 2.7 |
| Ovidio | 7.0 | 1.0 | 9.0 |
| Leonor | 3.4 | 2.0 | 5.0 |

Between-Groups or Average-Linkage Method:

$$d_{(MN)W} = \frac{\sum_{p=1}^{m+n} \sum_{q=1}^{w} d_{pq}}{(m + n) \cdot (w)} \qquad (11.25)$$

where dpq represents the distance between any observation p in cluster MN and any observation q in cluster W, and m + n and w represent the number of observations in clusters MN and W, respectively.

In the following section, we will present a practical example that will be solved algebraically, and from which the concepts of hierarchical agglomerative methods will be established.

11.2.2.1.2 A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules

Imagine that a college professor, who is very concerned about his students’ capacity to learn the subject he teaches, Quantitative Methods, is interested in allocating them to groups with the highest homogeneity possible, based on the grades they obtained on the college entrance exams in subjects considered quantitative (Math, Physics, and Chemistry). In order to do that, the professor collected information on these grades, which vary from 0 to 10. In addition, since he will first carry out the cluster analysis in an algebraic way, he decided, for pedagogical purposes, to work with only five students. This dataset can be seen in Table 11.12.

Based on the data obtained, the chart in Fig. 11.11 is constructed, and, since the variables are metric, the dissimilarity measure known as Euclidean distance will be used for the cluster analysis. Besides, since all the variables have values in the same unit of measure (grades from 0 to 10), in this case, it will not be necessary to standardize them through Z-scores. In the following sections, hierarchical agglomeration schedules based on the Euclidean distance will be elaborated through the three linkage methods being studied.

11.2.2.1.2.1 Nearest-Neighbor or Single-Linkage Method

At this moment, from the data presented in Table 11.12, let’s develop a cluster analysis through a hierarchical agglomeration schedule with the single-linkage method. First of all, we define matrix D0, formed by the Euclidean distances (dissimilarities) between each pair of observations, as follows:

| D0 | Gabriela | Luiz Felipe | Patricia | Ovidio | Leonor |
|---|---|---|---|---|---|
| Gabriela | 0.000 | | | | |
| Luiz Felipe | 10.132 | 0.000 | | | |
| Patricia | 8.420 | 7.187 | 0.000 | | |
| Ovidio | 3.713 | 10.290 | 6.580 | 0.000 | |
| Leonor | 4.170 | 8.223 | 6.045 | 5.474 | 0.000 |
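The Euclidean distances that form matrix D0 can be reproduced directly from the grades in Table 11.12. The snippet below is a minimal Python sketch under that assumption; the variable names (`grades`, `d0`, `closest`) are illustrative and not part of the original text.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# Grades from Table 11.12: (Mathematics, Physics, Chemistry)
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

# One entry per pair of observations (the off-diagonal terms of D0)
d0 = {
    frozenset((p, q)): dist(grades[p], grades[q])
    for p, q in combinations(grades, 2)
}

# List the pairs from the most similar to the most different
for pair, value in sorted(d0.items(), key=lambda item: item[1]):
    print(" - ".join(sorted(pair)), f"{value:.3f}")

closest = min(d0, key=d0.get)
```

The pair stored in `closest` is the one that will be grouped in the first clustering stage, whatever the linkage method.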

FIG. 11.11 Three-dimensional chart with the relative position of the five students (axes: Math, Physics, and Chemistry).

It is important to mention that, at this initial moment, each observation is considered an individual cluster, that is, in stage 0, we have 5 clusters (the sample size). The smallest distance in matrix D0 is the one between Gabriela and Ovidio (3.713) and, therefore, in the first stage, these two observations are grouped and form a new cluster. We must then construct matrix D1 so that we can go to the next clustering stage, in which the distances between the cluster Gabriela-Ovidio and the other observations, which are still isolated, are calculated. Thus, by using the single-linkage method and based on Expression (11.23), we have:

$$d_{(\text{Gabriela-Ovidio})\,\text{Luiz Felipe}} = \min\{10.132;\, 10.290\} = 10.132$$

$$d_{(\text{Gabriela-Ovidio})\,\text{Patricia}} = \min\{8.420;\, 6.580\} = 6.580$$

$$d_{(\text{Gabriela-Ovidio})\,\text{Leonor}} = \min\{4.170;\, 5.474\} = 4.170$$

Matrix D1 contains these three distances, together with the distances from D0 between the observations that remain isolated (Luiz Felipe-Patricia: 7.187; Luiz Felipe-Leonor: 8.223; Patricia-Leonor: 6.045).


In the same way, the smallest distance in matrix D1 is the one between the cluster Gabriela-Ovidio and Leonor (4.170). Therefore, in the second stage, observation Leonor is inserted into the already-formed cluster Gabriela-Ovidio. Observations Luiz Felipe and Patricia still remain isolated. We must construct matrix D2 so that we can take the next step, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated. Analogously, we have:

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Luiz Felipe}} = \min\{10.132;\, 8.223\} = 8.223$$

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Patricia}} = \min\{6.580;\, 6.045\} = 6.045$$

Matrix D2 contains these two distances and the distance between Luiz Felipe and Patricia (7.187).

In the third clustering stage, observation Patricia is incorporated into the cluster Gabriela-Ovidio-Leonor, since the corresponding distance (6.045) is the smallest among all the ones in matrix D2. Therefore, we can write matrix D3 taking into consideration the following criterion:

$$d_{(\text{Gabriela-Ovidio-Leonor-Patricia})\,\text{Luiz Felipe}} = \min\{8.223;\, 7.187\} = 7.187$$

Finally, in the fourth and last stage, all the observations are allocated to the same cluster, thus concluding the hierarchical process. Table 11.13 presents a summary of this agglomeration schedule constructed by using the single-linkage method.

Based on this agglomeration schedule, we can construct a tree-shaped diagram, known as a dendrogram or phenogram, whose main objective is to illustrate the step-by-step formation of the clusters and to facilitate the visualization of how each observation is allocated at each stage. The dendrogram can be seen in Fig. 11.12.

Through Figs. 11.13 and 11.14, we are able to interpret the dendrogram constructed. First of all, we drew three lines (I, II, and III) that are orthogonal to the dendrogram lines, as shown in Fig. 11.13, which allow us to identify the number of clusters in each clustering stage, as well as the observations in each cluster. Line I “cuts” the dendrogram immediately after the first clustering stage and, at this moment, we can verify that there are four clusters (four intersections with the dendrogram’s horizontal lines), one of them formed by observations Gabriela and Ovidio, and the others by the individual observations.


TABLE 11.13 Agglomeration Schedule Through the Single-Linkage Method

| Stage | Cluster | Grouped Observation | Smallest Euclidean Distance |
|---|---|---|---|
| 1 | Gabriela | Ovidio | 3.713 |
| 2 | Gabriela-Ovidio | Leonor | 4.170 |
| 3 | Gabriela-Ovidio-Leonor | Patricia | 6.045 |
| 4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 7.187 |
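The agglomeration schedule in Table 11.13 can be reproduced by following the stage-by-stage logic described by Johnson and Wichern (2007) and updating the distances with Expression (11.23). The code below is a minimal Python sketch under those assumptions, not the procedure of any specific software package; the function name `single_linkage_schedule` and the data structures are illustrative, and ties between distances are not treated.

```python
from itertools import combinations
from math import dist  # available from Python 3.8

# Grades from Table 11.12: (Mathematics, Physics, Chemistry)
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}


def single_linkage_schedule(observations):
    """Return one tuple (cluster_m, cluster_n, distance) per clustering stage."""
    # Stage 0: each observation is an individual cluster; d plays the role of D0.
    clusters = [frozenset([name]) for name in observations]
    d = {
        frozenset([frozenset([p]), frozenset([q])]): dist(observations[p], observations[q])
        for p, q in combinations(observations, 2)
    }
    schedule = []
    while len(clusters) > 1:
        # Group the two clusters separated by the smallest distance.
        closest_pair = min(d, key=d.get)
        m, n = sorted(closest_pair, key=len, reverse=True)
        schedule.append((m, n, d[closest_pair]))
        merged = m | n
        others = [c for c in clusters if c not in (m, n)]
        # Next matrix: distances among untouched clusters are kept, and
        # Expression (11.23) gives the distance from the merged cluster.
        next_d = {frozenset([a, b]): d[frozenset([a, b])] for a, b in combinations(others, 2)}
        for w in others:
            next_d[frozenset([merged, w])] = min(d[frozenset([m, w])], d[frozenset([n, w])])
        clusters, d = others + [merged], next_d
    return schedule


schedule = single_linkage_schedule(grades)
for stage, (m, n, distance) in enumerate(schedule, start=1):
    print(stage, "-".join(sorted(m)), "+", "-".join(sorted(n)), f"{distance:.3f}")
```

Replacing `min` with `max` in the update line would give the complete-linkage rule of Expression (11.24); the average-linkage rule needs the cluster sizes as well, so it is not a one-word change.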

FIG. 11.12 Dendrogram: single-linkage method (horizontal axis: Euclidean distance; observations in the order Gabriela, Ovidio, Leonor, Patricia, Luiz Felipe).

FIG. 11.13 Interpreting the dendrogram: number of clusters and allocation of observations.

FIG. 11.14 Interpreting the dendrogram: distance leaps.


On the other hand, line II intersects three horizontal lines of the dendrogram, which means that, after the second stage, in which observation Leonor was incorporated into the already formed cluster Gabriela-Ovidio, there are three clusters. Finally, line III is drawn immediately after the third stage, in which observation Patricia merges with the cluster Gabriela-Ovidio-Leonor. Since two intersections between this line and the dendrogram’s horizontal lines are identified, we can see that observation Luiz Felipe remains isolated, while the others form a single cluster. Besides providing a study of the number of clusters in each clustering stage and of the allocation of observations, a dendrogram also allows the researcher to analyze the magnitude of the distance leaps in order to establish the clusters. A high magnitude leap, in comparison to the others, can indicate that a certain observation or cluster, a considerably different one, is incorporated into already formed clusters, which offers subsidies for the establishment of a solution regarding the number of clusters without the need for a next clustering stage. Although we know that setting an inflexible, mandatory number of clusters may hamper the analysis, at least giving an idea of this number, given the distance measure used and the linkage method adopted, may help researchers better understand the characteristics of the observations that led to this fact. Moreover, since the number of clusters is important for constructing nonhierarchical agglomeration schedules, this piece of information (considered an output of the hierarchical schedule) may serve as input for the k-means procedure. Fig. 11.14 presents three distance leaps (A, B, and C), regarding each one of the clustering stages, and, from their analysis, we can see that leap B, which represents the incorporation of observation Patricia into the cluster that had already been formed Gabriela-Ovidio-Leonor, is the greatest of the three. 
Therefore, if we intend to set the ideal number of clusters in this example, the researcher may choose the solution with three clusters (line II in Fig. 11.13), without the stage in which observation Patricia is incorporated, since it possibly has characteristics that are not so homogeneous and that make it unfeasible to include it in the previously formed cluster, given the large distance leap. Thus, in this case, we would have a cluster formed by Gabriela, Ovidio, and Leonor, another one formed only by Patricia, and a third one formed only by Luiz Felipe.

When using dissimilarity measures in clustering methods, a very useful criterion for identifying the number of clusters consists in identifying a considerable distance leap (whenever possible), and defining the number of clusters formed in the clustering stage immediately before the great leap, since very high leaps may incorporate observations with characteristics that are not so homogeneous.

Furthermore, it is also important to mention that, if the distance leaps from one stage to another are small, because the observations have very similar values in the variables, which can make it difficult to read the dendrogram, the researcher may use the squared Euclidean distance, so that the leaps become clearer, making it easier to identify the clusters in the dendrogram and providing better arguments for the decision-making process. Software such as SPSS shows dendrograms with rescaled distance measures, in order to facilitate the interpretation of the allocation of each observation and the visualization of the large distance leaps.

Fig. 11.15 illustrates how clusters can be established after the single-linkage method is elaborated.

Next, we will develop the same example using the complete- and average-linkage methods, so that we can compare the order of the observations and the distance leaps.
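The distance-leap criterion discussed above for the single-linkage schedule can be expressed in a few lines. The following Python sketch takes the distances of Table 11.13 as given and looks for the largest leap between consecutive stages; the variable names are illustrative, and the rule is only a heuristic aid, not a mandatory way of fixing the number of clusters.

```python
# Distances at which the clusters were formed in stages 1 to 4 (Table 11.13)
merge_distances = [3.713, 4.170, 6.045, 7.187]
n_observations = 5

# Leaps between consecutive stages (A, B, and C in Fig. 11.14)
leaps = [later - earlier for earlier, later in zip(merge_distances, merge_distances[1:])]

# leaps[i] separates stage i + 1 from stage i + 2, so stopping immediately
# before the largest leap keeps the first (index + 1) stages.
largest = max(range(len(leaps)), key=leaps.__getitem__)
stages_kept = largest + 1

# Each stage reduces the number of clusters by one.
suggested_clusters = n_observations - stages_kept

print([round(leap, 3) for leap in leaps])
print("suggested number of clusters:", suggested_clusters)
```

With these values, the second leap (the incorporation of Patricia) is the largest one, so the sketch suggests stopping after the second stage, that is, with three clusters.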
11.2.2.1.2.2 Furthest-Neighbor or Complete-Linkage Method

Matrix D0 is obviously the same, and the smallest Euclidean distance (3.713) is the one between observations Gabriela and Ovidio, which become the first cluster. It is important to emphasize that the first cluster will always be the same, regardless of the linkage method used, since the first stage always considers the smallest distance between a pair of observations that are still isolated.

FIG. 11.15 Suggestion of clusters formed after the single-linkage method.

In the complete-linkage method, we must use Expression (11.24) to construct matrix D1, as follows:

$$d_{(\text{Gabriela-Ovidio})\,\text{Luiz Felipe}} = \max\{10.132;\, 10.290\} = 10.290$$

$$d_{(\text{Gabriela-Ovidio})\,\text{Patricia}} = \max\{8.420;\, 6.580\} = 8.420$$

$$d_{(\text{Gabriela-Ovidio})\,\text{Leonor}} = \max\{4.170;\, 5.474\} = 5.474$$

By analyzing matrix D1, which also contains the D0 distances between the isolated observations (Luiz Felipe-Patricia: 7.187; Luiz Felipe-Leonor: 8.223; Patricia-Leonor: 6.045), we can see that observation Leonor will be incorporated into the cluster formed by Gabriela and Ovidio, since the corresponding distance (5.474) is the smallest among all the ones in matrix D1.

As verified when using the single-linkage method, observations Luiz Felipe and Patricia also remain isolated at this stage. The differences between the methods start arising now. We will construct matrix D2 using the following criteria:

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Luiz Felipe}} = \max\{10.290;\, 8.223\} = 10.290$$

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Patricia}} = \max\{8.420;\, 6.045\} = 8.420$$


Matrix D2 contains these two distances and the distance between Luiz Felipe and Patricia (7.187).

In the third clustering stage, a new cluster is formed by the fusion of observations Patricia and Luiz Felipe, since the furthest-neighbor criterion adopted in the complete-linkage method makes the distance between these two observations (7.187) the smallest among all the ones in matrix D2. Therefore, notice that at this stage differences in relation to the single-linkage method appear, in terms of the sorting and allocation of the observations to groups. Hence, to construct matrix D3, we must take the following criterion into consideration:

$$d_{(\text{Gabriela-Ovidio-Leonor})(\text{Luiz Felipe-Patricia})} = \max\{10.290;\, 8.420\} = 10.290$$

In the same way, in the fourth and last stage, all the observations are allocated to the same cluster, since there is the clustering between Gabriela-Ovidio-Leonor and Luiz Felipe-Patricia. Table 11.14 shows a summary of this agglomeration schedule, elaborated by using the complete-linkage method. This agglomeration schedule’s dendrogram can be seen in Fig. 11.16. We can initially see that the sorting of the observations is different from what was observed in the dendrogram seen in Fig. 11.12. Analogous to what was carried out in the previous method, we chose to draw two vertical lines (I and II) over the largest distance leap, as shown in Fig. 11.17. Thus, if the researcher chooses to consider three clusters, the solution will be the same as the one achieved previously through the single-linkage method, one formed by Gabriela, Ovidio, and Leonor, another one by Luiz Felipe, and a third one by Patricia (line I in Fig. 11.17). However, if he chooses to define two clusters (line II), the solution will be different since, in this case, the second cluster will be formed by Luiz Felipe and Patricia, while in the previous case, it was formed only by Luiz Felipe, since observation Patricia was allocated to the first cluster.

TABLE 11.14 Agglomeration Schedule Through the Complete-Linkage Method

| Stage | Cluster | Grouped Observation | Smallest Euclidean Distance |
|---|---|---|---|
| 1 | Gabriela | Ovidio | 3.713 |
| 2 | Gabriela-Ovidio | Leonor | 5.474 |
| 3 | Luiz Felipe | Patricia | 7.187 |
| 4 | Gabriela-Ovidio-Leonor | Luiz Felipe-Patricia | 10.290 |
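The values in Table 11.14 can be verified from the distances of matrix D0 quoted in this section. The sketch below is a minimal Python illustration with hypothetical names (`pair_distance`, `complete_linkage`); it computes the furthest-neighbor distance as the largest D0 distance between the observations of two clusters, which yields the same result as applying Expression (11.24) stage by stage.

```python
# Euclidean distances from matrix D0, as quoted in this section
d0 = {
    ("Gabriela", "Luiz Felipe"): 10.132, ("Gabriela", "Patricia"): 8.420,
    ("Gabriela", "Ovidio"): 3.713, ("Gabriela", "Leonor"): 4.170,
    ("Luiz Felipe", "Patricia"): 7.187, ("Luiz Felipe", "Ovidio"): 10.290,
    ("Luiz Felipe", "Leonor"): 8.223, ("Patricia", "Ovidio"): 6.580,
    ("Patricia", "Leonor"): 6.045, ("Ovidio", "Leonor"): 5.474,
}


def pair_distance(p, q):
    """Look up a D0 distance regardless of the order of the pair."""
    return d0[(p, q)] if (p, q) in d0 else d0[(q, p)]


def complete_linkage(cluster_a, cluster_b):
    """Furthest-neighbor distance between two clusters."""
    return max(pair_distance(p, q) for p in cluster_a for q in cluster_b)


first = ("Gabriela", "Ovidio")
stage2 = complete_linkage(first, ("Leonor",))

# Candidates for the third stage (the terms of matrix D2)
second = first + ("Leonor",)
candidates = {
    "cluster + Luiz Felipe": complete_linkage(second, ("Luiz Felipe",)),
    "cluster + Patricia": complete_linkage(second, ("Patricia",)),
    "Luiz Felipe + Patricia": pair_distance("Luiz Felipe", "Patricia"),
}
stage3 = min(candidates, key=candidates.get)

stage4 = complete_linkage(second, ("Luiz Felipe", "Patricia"))
print(stage2, candidates, stage3, stage4)
```

The dictionary `candidates` shows why Luiz Felipe and Patricia merge with each other in the third stage: under the furthest-neighbor criterion, both are farther from the cluster Gabriela-Ovidio-Leonor than they are from one another.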

FIG. 11.16 Dendrogram: complete-linkage method (horizontal axis: Euclidean distance; observations in the order Gabriela, Ovidio, Leonor, Luiz Felipe, Patricia).

FIG. 11.17 Interpreting the dendrogram: clusters and distance leaps.

Similar to what was done in the previous method, Fig. 11.18 illustrates how the clusters can be established after the complete-linkage method is carried out.

The choice of the clustering method can be supported by the application of the average-linkage method, in which two groups merge based on the average distance between all the pairs of observations that belong to these groups. Therefore, as we have already discussed, if the most suitable method is the single linkage because there are observations considerably far apart from one another, the sorting and allocation of the observations will be maintained by the average-linkage method. On the other hand, the outputs of this method will be consistent with the solution achieved through the complete-linkage method as regards the sorting and allocation of the observations, if they are very similar in the variables under study. Thus, it is advisable for the researcher to apply the three linkage methods when elaborating a cluster analysis through hierarchical agglomeration schedules. Therefore, let’s move on to the average-linkage method.

11.2.2.1.2.3 Between-Groups or Average-Linkage Method

First of all, let’s recall the Euclidean distance matrix between each pair of observations (matrix D0), in which the smallest distance (3.713) is once again the one between Gabriela and Ovidio.

FIG. 11.18 Suggestion of clusters formed after the complete-linkage method.

By using Expression (11.25), we are able to calculate the terms of matrix D1, given that the first cluster Gabriela-Ovidio has already been formed. Thus, we have:

$$d_{(\text{Gabriela-Ovidio})\,\text{Luiz Felipe}} = \frac{10.132 + 10.290}{2} = 10.211$$

$$d_{(\text{Gabriela-Ovidio})\,\text{Patricia}} = \frac{8.420 + 6.580}{2} = 7.500$$

$$d_{(\text{Gabriela-Ovidio})\,\text{Leonor}} = \frac{4.170 + 5.474}{2} = 4.822$$

Through matrix D1, we can see that observation Leonor is once again incorporated into the cluster formed by Gabriela and Ovidio, since the corresponding distance (4.822) is the smallest among all the ones in matrix D1.


In order to construct matrix D2, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated, we must perform the following calculations:

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Luiz Felipe}} = \frac{10.132 + 10.290 + 8.223}{3} = 9.548$$

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Patricia}} = \frac{8.420 + 6.580 + 6.045}{3} = 7.015$$

Note that the distances used to calculate the dissimilarities to be inserted into matrix D2 are the original Euclidean distances between each pair of observations, that is, they come from matrix D0.

As verified when the single-linkage method was elaborated, here, observation Patricia is also incorporated into the cluster already formed by Gabriela, Ovidio, and Leonor, since 7.015 is smaller than the distance between Luiz Felipe and Patricia (7.187), and observation Luiz Felipe remains isolated. Finally, matrix D3 can be constructed from the following calculation:

$$d_{(\text{Gabriela-Ovidio-Leonor-Patricia})\,\text{Luiz Felipe}} = \frac{10.132 + 10.290 + 8.223 + 7.187}{4} = 8.958$$
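The remark that the average-linkage distances must come from the original distances in matrix D0 can be illustrated numerically. The sketch below, written in Python with illustrative variable names, recomputes the distances to Luiz Felipe with Expression (11.25) and compares the third value with a naive shortcut that simply averages the previous stage's value with the new distance; the shortcut is shown only as a contrast and is not a method proposed in the text.

```python
from statistics import mean

# D0 distances between Luiz Felipe and each of the other observations
d_to_luiz_felipe = {"Gabriela": 10.132, "Ovidio": 10.290, "Leonor": 8.223, "Patricia": 7.187}

# Expression (11.25): average of the original distances over all members
after_stage1 = mean(d_to_luiz_felipe[s] for s in ("Gabriela", "Ovidio"))
after_stage2 = mean(d_to_luiz_felipe[s] for s in ("Gabriela", "Ovidio", "Leonor"))
after_stage3 = mean(d_to_luiz_felipe.values())

# Naive shortcut (for contrast only): average the previous cluster distance
# with the new observation's distance, ignoring the cluster sizes
naive_after_stage2 = (after_stage1 + d_to_luiz_felipe["Leonor"]) / 2

print(round(after_stage1, 3), round(after_stage2, 3), round(after_stage3, 3))
print(round(naive_after_stage2, 3))
```

The shortcut gives 9.217 instead of 9.548 because it gives Leonor the same weight as the two observations already grouped; weighting the previous value by the cluster size, (2 · 10.211 + 8.223) / 3, recovers the result of Expression (11.25).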

Once again, in the fourth and last stage, all the observations are in the same cluster. Table 11.15 and Fig. 11.19 present a summary of this agglomeration schedule and the corresponding dendrogram, respectively, resulting from the average-linkage method.

TABLE 11.15 Agglomeration Schedule Through the Average-Linkage Method

| Stage | Cluster | Grouped Observation | Smallest Euclidean Distance |
|---|---|---|---|
| 1 | Gabriela | Ovidio | 3.713 |
| 2 | Gabriela-Ovidio | Leonor | 4.822 |
| 3 | Gabriela-Ovidio-Leonor | Patricia | 7.015 |
| 4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 8.958 |

FIG. 11.19 Dendrogram: average-linkage method (horizontal axis: Euclidean distance; observations in the order Gabriela, Ovidio, Leonor, Patricia, Luiz Felipe).

Despite having other distance values, we can see that Table 11.15 and Fig. 11.19 show the same sorting and the same allocation of observations in the clusters as those presented in Table 11.13 and in Fig. 11.12, respectively, obtained when the single-linkage method was elaborated. Hence, we can state that the observations are considerably different from one another in the variables studied, a fact supported by the consistency of the answers obtained from the single- and average-linkage methods. If the observations were more similar, which is not what the chart in Fig. 11.11 shows, the consistency of answers would occur between the complete- and average-linkage methods, as already discussed. Therefore, when possible, the initial construction of scatter plots may help researchers choose, even if in a preliminary way, the method to be adopted.

Hierarchical agglomeration schedules are very useful and offer us the possibility to analyze, in an exploratory way, the similarity between observations based on the behavior of certain variables. However, it is essential for researchers to understand that these methods are not conclusive by themselves and more than one answer may be obtained, depending on what is desired and on the data behavior.

Besides, it is necessary for researchers to be aware of how sensitive these methods are to the presence of outliers. The existence of a very discrepant observation may cause other observations, not so similar to one another, to be allocated to the same cluster because they are extremely different from the observation considered an outlier. Hence, it is advisable to apply the hierarchical agglomeration schedules with the linkage method chosen several times, and, in each application, to identify one or more observations considered outliers. This procedure will make the cluster analysis more reliable, since more and more homogeneous clusters may be formed.
Researchers are free to characterize as the most discrepant observation the one that remains isolated after the penultimate clustering stage, that is, just before the total fusion. Nonetheless, there are many methods for defining an outlier. Barnett and Lewis (1994), for instance, mention almost 1000 articles in the existing literature on outliers, and, for pedagogical purposes, in the Appendix of this chapter we will discuss an efficient procedure in Stata for detecting outliers when a researcher is carrying out a multivariate data analysis.

It is also important to emphasize, as we have already discussed in this section, that different linkage methods must be applied to the same dataset when elaborating hierarchical agglomeration schedules, and the resulting dendrograms compared. This procedure helps researchers decide on the ideal number of clusters, on the sorting of the observations, and on the allocation of each one of them to the different clusters formed. It even allows researchers to make coherent decisions about the number of clusters that may be used as input in a possible nonhierarchical analysis.

Last but not least, it is worth mentioning that the agglomeration schedules presented in this section (Tables 11.13, 11.14, and 11.15) provide increasing values of the clustering measures because a dissimilarity measure (Euclidian distance) was used as the comparison criterion between the observations. If we had chosen Pearson’s correlation between the observations, a similarity measure also used for metric variables, as we discussed in Section 11.2.1.1, the values of the clustering measures in the agglomeration schedules would be decreasing. The same holds for cluster analyses in which similarity measures, such as the ones studied in Section 11.2.1.2, are used to assess the behavior of observations based on binary variables.
In the following section we will develop the same example, in an algebraic way, using the nonhierarchical k-means agglomeration schedule.

11.2.2.2 Nonhierarchical K-Means Agglomeration Schedule

Among all the nonhierarchical agglomeration schedules, the k-means procedure is the one most often used by researchers in several fields of knowledge. Given that the number of clusters must be defined beforehand by the researcher, this procedure can be carried out after the application of a hierarchical agglomeration schedule when we have no idea of the number of clusters that can be formed. In this situation, the output of the hierarchical schedule can serve as input for the nonhierarchical one.

11.2.2.2.1 Notation

As we did in Section 11.2.2.1.1, we now present a logical sequence of steps, based on Johnson and Wichern (2007), in order to facilitate the understanding of the cluster analysis (k-means procedure):

1. We define the initial number of clusters and the respective centroids. The main objective is to divide the observations from the dataset into K clusters, such that those within each cluster are closer to each other than to any observation that belongs to a different cluster. To do so, the observations need to be allocated arbitrarily to the K clusters, so that the respective centroids can be calculated.
2. We must choose a certain observation that is closer to the centroid of another cluster and reallocate it to that cluster. At this moment, another cluster has just lost that observation, and, therefore, the centroids of the cluster that receives it and of the cluster that loses it must be recalculated.
3. We must continue repeating the previous step until it is no longer possible to reallocate any observation due to its closer proximity to a centroid from another cluster.

The centroid coordinate $\bar{x}$ must be recalculated whenever a certain observation $p$ is included in or excluded from the respective cluster, based on the following expressions:

$$\bar{x}_{new} = \frac{N \cdot \bar{x} + x_p}{N + 1}, \quad \text{if observation } p \text{ is inserted into the cluster under analysis} \tag{11.26}$$

$$\bar{x}_{new} = \frac{N \cdot \bar{x} - x_p}{N - 1}, \quad \text{if observation } p \text{ is excluded from the cluster under analysis} \tag{11.27}$$
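The two update rules can be sketched in a few lines of Python. This is only an illustration: the function names and the two-variable example values are hypothetical and do not come from the text; only the arithmetic follows Expressions (11.26) and (11.27).

```python
def centroid_after_insertion(centroid, n, x_p):
    """Expression (11.26): centroid after observation x_p joins a cluster of n observations."""
    return [(n * c + x) / (n + 1) for c, x in zip(centroid, x_p)]


def centroid_after_exclusion(centroid, n, x_p):
    """Expression (11.27): centroid after observation x_p leaves a cluster of n observations."""
    if n <= 1:
        raise ValueError("cannot exclude the only observation of a cluster")
    return [(n * c - x) / (n - 1) for c, x in zip(centroid, x_p)]


# Hypothetical example: a cluster of 2 observations centred at (4.0, 6.0)
old_centroid = [4.0, 6.0]
new_point = [7.0, 3.0]

grown = centroid_after_insertion(old_centroid, 2, new_point)
shrunk = centroid_after_exclusion(grown, 3, new_point)
print(grown)   # [5.0, 5.0]
print(shrunk)  # [4.0, 6.0], the original centroid again
```

Because each update needs only the old centroid, the cluster size, and the moving observation, no pass over the remaining observations of the cluster is required.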

where $N$ and $\bar{x}$ refer to the number of observations in the cluster and to its centroid coordinate before the reallocation of that observation, respectively. In addition, $x_p$ refers to the coordinate of observation $p$, which changed clusters.

For two variables (X1 and X2), Fig. 11.20 shows a hypothetical situation that represents the end of the k-means procedure, in which it is no longer possible to reallocate any observation because no observation is closer to the centroid of another cluster than to the centroid of its own. Unlike in hierarchical agglomeration schedules, the matrix with the distances between observations does not need to be defined at each step. This reduces the computational requirements and allows nonhierarchical agglomeration schedules to be applied to considerably larger datasets than those traditionally studied through hierarchical schedules.

FIG. 11.20 Hypothetical situation that represents the end of the K-means procedure.


In addition, bear in mind that the variables must be standardized before the k-means procedure is carried out, as is also the case for hierarchical agglomeration schedules, if their values are not in the same unit of measure.

Finally, after concluding this procedure, it is important for researchers to analyze whether the values of a certain metric variable differ between the groups defined, that is, whether the variability between the clusters is significantly higher than the internal variability of each cluster. The F-test of the one-way analysis of variance, or one-way ANOVA, allows us to develop this analysis, and its null and alternative hypotheses can be defined as follows:

H0: the variable under analysis has the same mean in all the groups formed.
H1: the variable under analysis has a different mean in at least one of the groups in relation to the others.

Therefore, a single F-test can be applied for each variable, aiming to assess the existence of at least one difference among all the comparison possibilities. In this regard, its main advantage is that adjustments for the discrepant dimensions of the groups do not need to be carried out in order to analyze several comparisons. On the other hand, rejecting the null hypothesis at a certain significance level does not allow the researcher to know which group(s) is (are) statistically different from the others in relation to the variable being analyzed.

The F statistic that corresponds to this test is given by the following expression:

$$F = \frac{\text{variability between the groups}}{\text{variability within the groups}} = \frac{\dfrac{\sum_{k=1}^{K} N_k \cdot \left(\bar{X}_k - \bar{X}\right)^2}{K - 1}}{\dfrac{\sum_{ki} \left(X_{ki} - \bar{X}_k\right)^2}{n - K}} \tag{11.28}$$

where $N_k$ is the number of observations in the k-th cluster, $\bar{X}_k$ is the mean of variable X in the same k-th cluster, $\bar{X}$ is the general mean of variable X, and $X_{ki}$ is the value that variable X takes on for a certain observation i present in the k-th cluster. In addition, K represents the number of clusters to be compared, and n, the sample size.

By using the F statistic, researchers will be able to identify the variables whose means differ the most between the groups, that is, those that contribute the most to the formation of at least one of the K clusters (highest F statistic), as well as those that do not contribute to the formation of the suggested number of clusters, at a certain significance level.

In the following section, we will discuss a practical example that will be solved algebraically, and from which the concepts of the k-means procedure may be established.

11.2.2.2.2 A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule

To solve the nonhierarchical k-means agglomeration schedule algebraically, let’s use the data from our own example, which can be found in Table 11.12 and are shown again in Table 11.16.

Software packages such as SPSS use the Euclidian distance as the standard dissimilarity measure, which is why we will develop the algebraic procedures based on this measure. This criterion will also allow the results obtained to be compared to the ones found when elaborating the hierarchical agglomeration schedules in Section 11.2.2.1.2, since the Euclidian distance was also used in those situations. In the same way, it will not be necessary to standardize the variables through Z-scores, since all of them are in the same unit of measure (grades from 0 to 10). Otherwise, it would be crucial for researchers to standardize the variables before elaborating the k-means procedure.

TABLE 11.16 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exams

Student (Observation)   Grade in Mathematics (X1i)   Grade in Physics (X2i)   Grade in Chemistry (X3i)
Gabriela                3.7                          2.7                      9.1
Luiz Felipe             7.8                          8.0                      1.5
Patricia                8.9                          1.0                      2.7
Ovidio                  7.0                          1.0                      9.0
Leonor                  3.4                          2.0                      5.0
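Standardization is not needed for this example, but the Z-score transformation mentioned above can be sketched as follows for cases in which the variables are in different units. The sketch assumes the sample standard deviation (divisor n - 1) is the one intended; the function name `zscores` is hypothetical, and the mathematics grades from Table 11.16 are used purely as illustrative input.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize values: subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Grades in mathematics from Table 11.16, used only as sample input
mathematics = [3.7, 7.8, 8.9, 7.0, 3.4]
z_math = zscores(mathematics)
print([round(z, 3) for z in z_math])
```

After the transformation, every variable has mean 0 and standard deviation 1, so no variable dominates the Euclidian distance merely because of its scale.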


TABLE 11.17 Arbitrary Allocation of the Observations in K = 3 Clusters and Calculation of the Centroid Coordinates: Initial Step of the K-Means Procedure

                         Centroid Coordinates per Variable
Cluster                  Grade in Mathematics     Grade in Physics         Grade in Chemistry
Gabriela, Luiz Felipe    (3.7 + 7.8)/2 = 5.75     (2.7 + 8.0)/2 = 5.35     (9.1 + 1.5)/2 = 5.30
Patricia, Ovidio         (8.9 + 7.0)/2 = 7.95     (1.0 + 1.0)/2 = 1.00     (2.7 + 9.0)/2 = 5.85
Leonor                   3.40                     2.00                     5.00
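The centroid coordinates in Table 11.17 are simply the per-variable means of the observations allocated to each cluster. A minimal sketch, assuming the data of Table 11.16 are held in a dictionary (the names `grades`, `initial_clusters`, and `centroid` are illustrative only):

```python
# Grades (mathematics, physics, chemistry) from Table 11.16
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

# Arbitrary initial allocation used in the text
initial_clusters = [["Gabriela", "Luiz Felipe"], ["Patricia", "Ovidio"], ["Leonor"]]


def centroid(members, data):
    """Mean of each variable over the observations allocated to one cluster."""
    columns = zip(*(data[name] for name in members))
    return tuple(sum(col) / len(members) for col in columns)


centroids = [centroid(members, grades) for members in initial_clusters]
for members, c in zip(initial_clusters, centroids):
    print("-".join(members), [round(v, 2) for v in c])
```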

Using the logical sequence presented in Section 11.2.2.2.1, we will develop the k-means procedure with K = 3 clusters. This number of clusters may come from a decision made by the researcher based on a certain preliminary criterion, or it may be chosen based on the outputs of the hierarchical agglomeration schedules. In our case, the decision was made based on the comparison of the dendrograms that had already been constructed, and on the similarity of the outputs obtained by the single- and average-linkage methods.

Thus, we need to arbitrarily allocate the observations to three clusters, so that the respective centroids can be calculated. We can establish that observations Gabriela and Luiz Felipe form the first cluster, Patricia and Ovidio, the second, and Leonor, the third. Table 11.17 shows the arbitrary formation of these preliminary clusters, as well as the calculation of the respective centroid coordinates, which makes the initial step of the k-means procedure algorithm possible. Based on these coordinates, we constructed the chart seen in Fig. 11.21, which shows the arbitrary allocation of each observation to its cluster and the respective centroids.

Based on the second step of the logical sequence presented in Section 11.2.2.2.1, we must choose a certain observation and calculate the distance between it and all the cluster centroids, both assuming that it is not reallocated and assuming that it is reallocated to each cluster. Selecting the first observation (Gabriela), for example, we can calculate the distances between it and the centroids of the clusters that have already been formed (Gabriela-Luiz Felipe, Patricia-Ovidio, and Leonor) and, after that, assume that it leaves its cluster (Gabriela-Luiz Felipe) and is inserted into one of the other two clusters, forming the cluster Gabriela-Patricia-Ovidio or Gabriela-Leonor.
Thus, from Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that the reallocation of Gabriela to one of the two clusters in fact takes place, as shown in Table 11.18. Thus, from Tables 11.16, 11.17, and 11.18, we can calculate the following Euclidian distances:

• Assumption that Gabriela is not reallocated:

$$d_{\text{Gabriela-(Gabriela-Luiz Felipe)}} = \sqrt{(3.70 - 5.75)^2 + (2.70 - 5.35)^2 + (9.10 - 5.30)^2} = 5.066$$

$$d_{\text{Gabriela-(Patricia-Ovidio)}} = \sqrt{(3.70 - 7.95)^2 + (2.70 - 1.00)^2 + (9.10 - 5.85)^2} = 5.614$$

$$d_{\text{Gabriela-Leonor}} = \sqrt{(3.70 - 3.40)^2 + (2.70 - 2.00)^2 + (9.10 - 5.00)^2} = 4.170$$

• Assumption that Gabriela is reallocated:

$$d_{\text{Gabriela-Luiz Felipe}} = \sqrt{(3.70 - 7.80)^2 + (2.70 - 8.00)^2 + (9.10 - 1.50)^2} = 10.132$$

$$d_{\text{Gabriela-(Gabriela-Patricia-Ovidio)}} = \sqrt{(3.70 - 6.53)^2 + (2.70 - 1.57)^2 + (9.10 - 6.93)^2} = 3.743$$

$$d_{\text{Gabriela-(Gabriela-Leonor)}} = \sqrt{(3.70 - 3.55)^2 + (2.70 - 2.35)^2 + (9.10 - 7.05)^2} = 2.085$$
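The six distances for Gabriela can be checked with a short script. This is a sketch only: the dictionary labels are illustrative, and the Gabriela-Patricia-Ovidio centroid is entered with unrounded coordinates (19.6/3, 4.7/3, 20.8/3), which is an assumption about how the value 3.743 was obtained, since the rounded coordinates 6.53, 1.57, and 6.93 give a slightly smaller result.

```python
import math


def euclidean(a, b):
    """Euclidian distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


gabriela = (3.70, 2.70, 9.10)

# Centroids from Tables 11.17 and 11.18 (labels are illustrative)
candidate_centroids = {
    "Gabriela-Luiz Felipe (not reallocated)": (5.75, 5.35, 5.30),
    "Patricia-Ovidio (not reallocated)": (7.95, 1.00, 5.85),
    "Leonor (not reallocated)": (3.40, 2.00, 5.00),
    "Luiz Felipe (Gabriela leaves)": (7.80, 8.00, 1.50),
    "Gabriela-Patricia-Ovidio (Gabriela joins)": (19.6 / 3, 4.7 / 3, 20.8 / 3),
    "Gabriela-Leonor (Gabriela joins)": (3.55, 2.35, 7.05),
}

distances = {label: euclidean(gabriela, c) for label, c in candidate_centroids.items()}
closest = min(distances, key=distances.get)

for label, d in distances.items():
    print(f"{label}: {d:.3f}")
print("closest:", closest)
```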


FIG. 11.21 Arbitrary allocation of the observations in K = 3 clusters and respective centroids: Initial step of the K-means procedure (axes: Math, Physics, Chemistry).

TABLE 11.18 Simulating the Reallocation of Gabriela and Calculating the New Centroid Coordinates

                                                  Centroid Coordinates per Variable
Cluster                      Simulation           Grade in Mathematics                  Grade in Physics                      Grade in Chemistry
Luiz Felipe                  Excluding Gabriela   [2(5.75) - 3.70]/(2 - 1) = 7.80       [2(5.35) - 2.70]/(2 - 1) = 8.00       [2(5.30) - 9.10]/(2 - 1) = 1.50
Gabriela, Patricia, Ovidio   Including Gabriela   [2(7.95) + 3.70]/(2 + 1) = 6.53       [2(1.00) + 2.70]/(2 + 1) = 1.57       [2(5.85) + 9.10]/(2 + 1) = 6.93
Gabriela, Leonor             Including Gabriela   [1(3.40) + 3.70]/(1 + 1) = 3.55       [1(2.00) + 2.70]/(1 + 1) = 2.35       [1(5.00) + 9.10]/(1 + 1) = 7.05

Obs.: Note that the values calculated for the Luiz Felipe centroid coordinates are exactly the same as this observation’s original coordinates, as shown in Table 11.16.

Since Gabriela is the closest to the Gabriela-Leonor centroid (the shortest Euclidian distance), we must reallocate this observation to the cluster initially formed only by Leonor. So, the cluster in which observation Gabriela was at first (Gabriela-Luiz Felipe) has just lost it, and now Luiz Felipe has become an individual cluster. Therefore, the centroids of the cluster that receives it and the one that loses it must be recalculated. Table 11.19 shows the creation of the new clusters and the calculation of the respective centroid coordinates too.


TABLE 11.19 New Centroids With the Reallocation of Gabriela

                     Centroid Coordinates per Variable
Cluster              Grade in Mathematics     Grade in Physics         Grade in Chemistry
Luiz Felipe          7.80                     8.00                     1.50
Patricia, Ovidio     7.95                     1.00                     5.85
Gabriela, Leonor     (3.7 + 3.4)/2 = 3.55     (2.7 + 2.0)/2 = 2.35     (9.1 + 5.0)/2 = 7.05

Based on these new coordinates, we can construct the chart shown in Fig. 11.22.

FIG. 11.22 New clusters and respective centroids: Reallocation of Gabriela (axes: Math, Physics, Chemistry).

Once again, let’s repeat the previous step. At this moment, since observation Luiz Felipe is isolated, let’s simulate the reallocation of the third observation (Patricia). We must calculate the distances between it and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, afterwards, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Patricia or Gabriela-Patricia-Leonor. Also based on Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that the reallocation of Patricia to one of these two clusters in fact happens, as shown in Table 11.20.

TABLE 11.20 Simulation of Patricia’s Reallocation: Next Step of the K-Means Procedure Algorithm

                                                  Centroid Coordinates per Variable
Cluster                      Simulation           Grade in Mathematics                  Grade in Physics                      Grade in Chemistry
Luiz Felipe, Patricia        Including Patricia   [1(7.80) + 8.90]/(1 + 1) = 8.35       [1(8.00) + 1.00]/(1 + 1) = 4.50       [1(1.50) + 2.70]/(1 + 1) = 2.10
Ovidio                       Excluding Patricia   [2(7.95) - 8.90]/(2 - 1) = 7.00       [2(1.00) - 1.00]/(2 - 1) = 1.00       [2(5.85) - 2.70]/(2 - 1) = 9.00
Gabriela, Patricia, Leonor   Including Patricia   [2(3.55) + 8.90]/(2 + 1) = 5.33       [2(2.35) + 1.00]/(2 + 1) = 1.90       [2(7.05) + 2.70]/(2 + 1) = 5.60

Obs.: Note that the values calculated for the Ovidio centroid coordinates are exactly the same as this observation’s original coordinates, as shown in Table 11.16.

Similar to what was carried out when simulating Gabriela’s reallocation, based on Tables 11.16, 11.19, and 11.20, let’s calculate the Euclidian distances between Patricia and each one of the centroids:

• Assumption that Patricia is not reallocated:

$$d_{\text{Patricia-Luiz Felipe}} = \sqrt{(8.90 - 7.80)^2 + (1.00 - 8.00)^2 + (2.70 - 1.50)^2} = 7.187$$

$$d_{\text{Patricia-(Patricia-Ovidio)}} = \sqrt{(8.90 - 7.95)^2 + (1.00 - 1.00)^2 + (2.70 - 5.85)^2} = 3.290$$

$$d_{\text{Patricia-(Gabriela-Leonor)}} = \sqrt{(8.90 - 3.55)^2 + (1.00 - 2.35)^2 + (2.70 - 7.05)^2} = 7.026$$

• Assumption that Patricia is reallocated:

$$d_{\text{Patricia-(Luiz Felipe-Patricia)}} = \sqrt{(8.90 - 8.35)^2 + (1.00 - 4.50)^2 + (2.70 - 2.10)^2} = 3.593$$

$$d_{\text{Patricia-Ovidio}} = \sqrt{(8.90 - 7.00)^2 + (1.00 - 1.00)^2 + (2.70 - 9.00)^2} = 6.580$$

$$d_{\text{Patricia-(Gabriela-Patricia-Leonor)}} = \sqrt{(8.90 - 5.33)^2 + (1.00 - 1.90)^2 + (2.70 - 5.60)^2} = 4.684$$

Bearing in mind that the Euclidian distance between Patricia and the centroid of the cluster Patricia-Ovidio is the shortest, we do not have to reallocate this observation to another cluster and, at this moment, we maintain the solution presented in Table 11.19 and in Fig. 11.22.

Next, we will develop the same procedure, now simulating the reallocation of the fourth observation (Ovidio). Analogously, we must calculate the distances between this observation and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, after that, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Ovidio or Gabriela-Ovidio-Leonor. Once again using Expressions (11.26) and (11.27), we can recalculate the new centroid coordinates, simulating that the reallocation of Ovidio to one of these two clusters in fact takes place, as shown in Table 11.21.

TABLE 11.21 Simulating Ovidio’s Reallocation: New Step of the K-Means Procedure Algorithm

                                                Centroid Coordinates per Variable
Cluster                      Simulation         Grade in Mathematics                  Grade in Physics                      Grade in Chemistry
Luiz Felipe, Ovidio          Including Ovidio   [1(7.80) + 7.00]/(1 + 1) = 7.40       [1(8.00) + 1.00]/(1 + 1) = 4.50       [1(1.50) + 9.00]/(1 + 1) = 5.25
Patricia                     Excluding Ovidio   [2(7.95) - 7.00]/(2 - 1) = 8.90       [2(1.00) - 1.00]/(2 - 1) = 1.00       [2(5.85) - 9.00]/(2 - 1) = 2.70
Gabriela, Ovidio, Leonor     Including Ovidio   [2(3.55) + 7.00]/(2 + 1) = 4.70       [2(2.35) + 1.00]/(2 + 1) = 1.90       [2(7.05) + 9.00]/(2 + 1) = 7.70

Obs.: Note that the values calculated for the Patricia centroid coordinates are exactly the same as this observation’s original coordinates, as shown in Table 11.16.

Next, we can see the calculations of the Euclidian distances between Ovidio and each one of the centroids, defined from Tables 11.16, 11.19, and 11.21:

• Assumption that Ovidio is not reallocated:

$$d_{\text{Ovidio-Luiz Felipe}} = \sqrt{(7.00 - 7.80)^2 + (1.00 - 8.00)^2 + (9.00 - 1.50)^2} = 10.290$$

$$d_{\text{Ovidio-(Patricia-Ovidio)}} = \sqrt{(7.00 - 7.95)^2 + (1.00 - 1.00)^2 + (9.00 - 5.85)^2} = 3.290$$

$$d_{\text{Ovidio-(Gabriela-Leonor)}} = \sqrt{(7.00 - 3.55)^2 + (1.00 - 2.35)^2 + (9.00 - 7.05)^2} = 4.187$$

• Assumption that Ovidio is reallocated:

$$d_{\text{Ovidio-(Luiz Felipe-Ovidio)}} = \sqrt{(7.00 - 7.40)^2 + (1.00 - 4.50)^2 + (9.00 - 5.25)^2} = 5.145$$

$$d_{\text{Ovidio-Patricia}} = \sqrt{(7.00 - 8.90)^2 + (1.00 - 1.00)^2 + (9.00 - 2.70)^2} = 6.580$$

$$d_{\text{Ovidio-(Gabriela-Ovidio-Leonor)}} = \sqrt{(7.00 - 4.70)^2 + (1.00 - 1.90)^2 + (9.00 - 7.70)^2} = 2.791$$
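The simulation carried out by hand for Gabriela, Patricia, and Ovidio follows the same pattern each time, so it can be sketched as one function. This is an illustrative sketch rather than the algorithm as implemented in any software package: the names `grades`, `centroid`, and `simulate_reallocation` are hypothetical, and centroids are recomputed directly from the members instead of through Expressions (11.26) and (11.27), which gives the same values.

```python
import math

# Grades (mathematics, physics, chemistry) from Table 11.16
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}


def centroid(members):
    """Per-variable mean of the observations in one cluster."""
    columns = zip(*(grades[m] for m in members))
    return tuple(sum(col) / len(members) for col in columns)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simulate_reallocation(name, clusters):
    """Distances from `name` to every current centroid and to every centroid
    as it would be if `name` left its cluster or joined another one."""
    home = next(c for c in clusters if name in c)
    results = {}
    for cluster in clusters:
        # Assumption that the observation is not reallocated
        results[tuple(cluster)] = euclidean(grades[name], centroid(cluster))
        if cluster is home:
            moved = [m for m in cluster if m != name]  # home cluster after losing it
        else:
            moved = cluster + [name]                   # other cluster after gaining it
        if moved:  # an emptied cluster has no centroid
            results[tuple(moved)] = euclidean(grades[name], centroid(moved))
    return results


# Situation of Table 11.19, simulating the reallocation of Ovidio
clusters = [["Luiz Felipe"], ["Patricia", "Ovidio"], ["Gabriela", "Leonor"]]
distances = simulate_reallocation("Ovidio", clusters)
best = min(distances, key=distances.get)
print(best, round(distances[best], 3))
```

Under these assumptions the shortest distance is the one to the simulated cluster Gabriela-Leonor-Ovidio, which matches the value 2.791 obtained above.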

In this case, since observation Ovidio is the closest to the centroid of Gabriela-Ovidio-Leonor (the shortest Euclidian distance), we must reallocate this observation to the cluster formed originally by Gabriela and Leonor. Therefore, observation Patricia becomes an individual cluster. Table 11.22 shows the centroid coordinates of clusters Luiz Felipe, Patricia, and Gabriela-Ovidio-Leonor.

We will not carry out the procedure proposed for the fifth observation (Leonor), since it had already fused with observation Gabriela in the first step of the algorithm. We can consider that the k-means procedure is concluded, since it is no longer possible to reallocate any observation due to closer proximity to another cluster’s centroid. Fig. 11.23 shows the allocation of each observation to its cluster and their respective centroids.

TABLE 11.22 New Centroids With Ovidio’s Reallocation

                             Centroid Coordinates per Variable
Cluster                      Grade in Mathematics   Grade in Physics   Grade in Chemistry
Luiz Felipe                  7.80                   8.00               1.50
Patricia                     8.90                   1.00               2.70
Gabriela, Ovidio, Leonor     4.70                   1.90               7.70

FIG. 11.23 Solution of the K-means procedure (axes: Math, Physics, Chemistry).

Note that the solution achieved is equal to the one reached through the single- (Fig. 11.15) and average-linkage methods, when we elaborated the hierarchical agglomeration schedules.

As we have already discussed, the matrix with the distances between the observations does not need to be defined at each step of the k-means procedure algorithm, unlike in the hierarchical agglomeration schedules. This reduces the computational requirements and allows nonhierarchical agglomeration schedules to be applied to datasets significantly larger than the ones traditionally studied through hierarchical schedules.

Table 11.23 shows the Euclidian distances between each observation of the original dataset and the centroids of each one of the clusters formed.

TABLE 11.23 Euclidian Distances Between Observations and Cluster Centroids

                          Cluster
Student (Observation)     Luiz Felipe   Patricia   Gabriela, Ovidio, Leonor
Gabriela                  10.132        8.420      1.897
Luiz Felipe               0.000         7.187      9.234
Patricia                  7.187         0.000      6.592
Ovidio                    10.290        6.580      2.791
Leonor                    8.223         6.045      2.998
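As a quick check of the stopping condition, the distances in Table 11.23 can be recomputed and each observation compared with its nearest centroid. A minimal sketch, with illustrative names (`final_centroids`, `table`, `nearest`) that are not part of the original text:

```python
import math

# Grades (mathematics, physics, chemistry) from Table 11.16
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

# Final centroids from Table 11.22
final_centroids = {
    "Luiz Felipe": (7.80, 8.00, 1.50),
    "Patricia": (8.90, 1.00, 2.70),
    "Gabriela-Ovidio-Leonor": (4.70, 1.90, 7.70),
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance of every observation to every final centroid
table = {
    name: {label: euclidean(x, c) for label, c in final_centroids.items()}
    for name, x in grades.items()
}

# Cluster whose centroid is nearest to each observation
nearest = {name: min(row, key=row.get) for name, row in table.items()}

for name, row in table.items():
    print(name, {label: round(d, 3) for label, d in row.items()}, "->", nearest[name])
```

Every observation turns out to be nearest to the centroid of the cluster it already belongs to, which is consistent with the conclusion that no further reallocation is possible.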


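TABLE 11.24 Means per Cluster and General Mean of the Variable mathematics

                 Cluster 1                             Cluster 2                          Cluster 3
Observations     $X_{\text{Luiz Felipe}} = 7.80$       $X_{\text{Patricia}} = 8.90$       $X_{\text{Gabriela}} = 3.70$, $X_{\text{Ovidio}} = 7.00$, $X_{\text{Leonor}} = 3.40$
Cluster mean     $\bar{X}_1 = 7.80$                    $\bar{X}_2 = 8.90$                 $\bar{X}_3 = 4.70$
General mean     $\bar{X} = 6.16$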

We would like to emphasize that this algorithm can be elaborated with a preliminary allocation of the observations to the clusters other than the one chosen in this example. Reapplying the k-means procedure with several arbitrary choices, given K clusters, allows the researcher to assess how stable the clustering procedure is, and to underpin the allocation of the observations to the groups in a consistent way.

After concluding this procedure, it is essential to check, through the F-test of the one-way ANOVA, whether the values of each one of the three variables considered in the analysis are statistically different between the three clusters. To make the calculation of the F statistics that correspond to this test easier, we constructed Tables 11.24, 11.25, and 11.26, which show the means per cluster and the general mean of the variables mathematics, physics, and chemistry, respectively.

TABLE 11.25 Means per Cluster and General Mean of the Variable physics

                 Cluster 1                             Cluster 2                          Cluster 3
Observations     $X_{\text{Luiz Felipe}} = 8.00$       $X_{\text{Patricia}} = 1.00$       $X_{\text{Gabriela}} = 2.70$, $X_{\text{Ovidio}} = 1.00$, $X_{\text{Leonor}} = 2.00$
Cluster mean     $\bar{X}_1 = 8.00$                    $\bar{X}_2 = 1.00$                 $\bar{X}_3 = 1.90$
General mean     $\bar{X} = 2.94$

TABLE 11.26 Means per Cluster and General Mean of the Variable chemistry

                 Cluster 1                             Cluster 2                          Cluster 3
Observations     $X_{\text{Luiz Felipe}} = 1.50$       $X_{\text{Patricia}} = 2.70$       $X_{\text{Gabriela}} = 9.10$, $X_{\text{Ovidio}} = 9.00$, $X_{\text{Leonor}} = 5.00$
Cluster mean     $\bar{X}_1 = 1.50$                    $\bar{X}_2 = 2.70$                 $\bar{X}_3 = 7.70$
General mean     $\bar{X} = 5.46$

So, based on the values presented in these tables and by using Expression (11.28), we are able to calculate the variation between the groups and within them for each one of the variables, as well as the respective F statistics. Tables 11.27, 11.28, and 11.29 show these calculations.

TABLE 11.27 Variation and F Statistic for the Variable mathematics

Variability between the groups   $\dfrac{(7.80 - 6.16)^2 + (8.90 - 6.16)^2 + 3 \cdot (4.70 - 6.16)^2}{3 - 1} = 8.296$
Variability within the groups    $\dfrac{(3.70 - 4.70)^2 + (7.00 - 4.70)^2 + (3.40 - 4.70)^2}{5 - 3} = 3.990$
F                                $\dfrac{8.296}{3.990} = 2.079$

Note: The calculation of the variability within the groups only took cluster 3 into consideration, since the others show variability equal to 0, because they are formed by a single observation.

TABLE 11.28 Variation and F Statistic for the Variable physics

Variability between the groups   $\dfrac{(8.00 - 2.94)^2 + (1.00 - 2.94)^2 + 3 \cdot (1.90 - 2.94)^2}{3 - 1} = 16.306$
Variability within the groups    $\dfrac{(2.70 - 1.90)^2 + (1.00 - 1.90)^2 + (2.00 - 1.90)^2}{5 - 3} = 0.730$
F                                $\dfrac{16.306}{0.730} = 22.337$

Note: The same as the previous table.

TABLE 11.29 Variation and F Statistic for the Variable chemistry

Variability between the groups   $\dfrac{(1.50 - 5.46)^2 + (2.70 - 5.46)^2 + 3 \cdot (7.70 - 5.46)^2}{3 - 1} = 19.176$
Variability within the groups    $\dfrac{(9.10 - 7.70)^2 + (9.00 - 7.70)^2 + (5.00 - 7.70)^2}{5 - 3} = 5.470$
F                                $\dfrac{19.176}{5.470} = 3.506$

Note: The same as Table 11.27.

Now, let’s analyze the rejection or not of the null hypothesis of the F-tests for each one of the variables. Since there are two degrees of freedom for the variability between the groups (K - 1 = 2) and two degrees of freedom for the variability within the groups (n - K = 2), by using Table A in the Appendix, we have $F_c = 19.00$ (critical F at a significance level of 0.05). Therefore, only for the variable physics can we reject the null hypothesis that all the groups formed have the same mean, since $F_{cal} = 22.337 > F_c = F_{2,2,5\%} = 19.00$. So, for this variable, there is at least one group that has a mean that is statistically different from the others. For the variables mathematics and chemistry, however, we cannot reject the test’s null hypothesis at a significance level of 0.05.

Software such as SPSS and Stata does not offer the $F_c$ for the defined degrees of freedom and a certain significance level. However, it offers the $F_{cal}$ significance level for these degrees of freedom. Thus, instead of analyzing if $F_{cal} > F_c$, we must verify if the $F_{cal}$ significance level is less than 0.05 (5%). Therefore:

If Sig. F (or Prob. F) < 0.05, there is at least one difference between the groups for the variable under analysis.

The $F_{cal}$ significance level can be obtained in Excel by using the command Formulas → Insert Function → FDIST, which will open a dialog box like the one shown in Fig. 11.24. As we can see in this figure, sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one difference between the groups for this variable at a significance level of 0.05. An inquisitive researcher will be able to carry out the same procedure for the variables mathematics and chemistry. In short, Table 11.30 presents the results of the one-way ANOVA, with the variation of each variable, the F statistics, and the respective significance levels.

The one-way ANOVA table also allows the researcher to identify the variables that contribute the most to the formation of at least one of the clusters, that is, those whose mean in at least one of the groups is statistically different from the others, since they will have greater F statistic values.
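The calculations in Tables 11.27 to 11.29 and the significance levels can be reproduced with a short script. This is a sketch under stated assumptions: the function names are hypothetical, and the significance level uses a closed form of the F distribution's upper tail that holds only when the numerator has exactly 2 degrees of freedom (as in this example, K - 1 = 2). For other degrees of freedom, a statistical library or the Excel function mentioned above would be needed.

```python
def one_way_anova(groups):
    """Return (between, within, F) following Expression (11.28).
    `groups` holds one list of values per cluster."""
    n = sum(len(g) for g in groups)   # sample size
    k = len(groups)                   # number of clusters
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n - k)
    return between, within, between / within


def sig_f_numerator_df_2(f_value, df_within):
    """Upper-tail probability of F(2, df_within).
    Closed form valid ONLY for 2 numerator degrees of freedom."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)


# Clusters: Luiz Felipe | Patricia | Gabriela, Ovidio, Leonor (values from Table 11.16)
variables = {
    "mathematics": [[7.8], [8.9], [3.7, 7.0, 3.4]],
    "physics": [[8.0], [1.0], [2.7, 1.0, 2.0]],
    "chemistry": [[1.5], [2.7], [9.1, 9.0, 5.0]],
}

results = {}
for name, groups in variables.items():
    between, within, f_value = one_way_anova(groups)
    sig = sig_f_numerator_df_2(f_value, df_within=2)
    results[name] = (between, within, f_value, sig)
    print(name, [round(v, 3) for v in results[name]])
```

Only physics yields a significance level below 0.05, in line with the comparison between the calculated and critical F values.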
FIG. 11.24 Obtaining the F significance level (command Insert Function).

TABLE 11.30 One-way Analysis of Variance (ANOVA)

Variable       Variability Between the Groups   Variability Within the Groups   F        Sig. F
mathematics    8.296                            3.990                           2.079    0.325
physics        16.306                           0.730                           22.337   0.043
chemistry      19.176                           5.470                           3.506    0.222

It is important to mention that F statistic values are very sensitive to the sample size, and, in this case, the variables mathematics and chemistry ended up not having statistically different means among the three groups mainly because the sample is small (only five observations).

We would like to emphasize that this one-way ANOVA can also be carried out soon after the application of a certain hierarchical agglomeration schedule, since it only depends on the classification of the observations within groups. When comparing the results obtained by a hierarchical schedule to the ones obtained by a nonhierarchical schedule, the researcher must take care to use the same distance measure in both situations. Different allocations of the observations to the same number of clusters may happen if different distance measures are used in a hierarchical schedule and in a nonhierarchical schedule, and, therefore, different values of the F statistics may be calculated in the two situations.

In general, in case there are one or more variables that do not contribute to the formation of the suggested number of clusters, we recommend that the procedure be reapplied without it (or them). In these situations, the number of clusters may change and, if researchers feel the need to underpin the initial input regarding the number K of clusters, they may even use a hierarchical agglomeration schedule without those variables before reapplying the k-means procedure, which will make the analysis cyclical.

Moreover, the existence of outliers may generate considerably disperse clusters, so treating the dataset in order to identify extremely discrepant observations is an advisable procedure before elaborating nonhierarchical agglomeration schedules. In the Appendix of this chapter, an important procedure in Stata for detecting multivariate outliers will be presented.

As with hierarchical agglomeration schedules, the nonhierarchical k-means schedule cannot be used as an isolated technique to make a conclusive decision about the clustering of observations. The allocation of observations and the formation of clusters may be extremely sensitive to the data behavior, the sample size, and the criteria adopted by the researcher. Combining the outputs found with the ones coming from other techniques can more powerfully underpin the choices made by the researcher and provide higher transparency in the decision-making process.
At the end of the cluster analysis, since the clusters formed can be represented in the dataset by a new qualitative variable with terms connected to each observation (cluster 1, cluster 2, ..., cluster K), other exploratory multivariate techniques can be elaborated from it, such as a correspondence analysis, so that, depending on the researcher’s objectives, we can study a possible association between the clusters and the categories of other qualitative variables.

This new qualitative variable, which represents the allocation of each observation, may also be used as an explanatory variable of a certain phenomenon in confirmatory multivariate models, such as multiple regression models, as long as it is transformed into dummy variables that represent the categories (clusters) of this new variable generated in the cluster analysis, as we will study in Chapter 13. On the other hand, such a procedure only makes sense when we intend to propose a diagnostic regarding the behavior of the dependent variable, without aiming at forecasts. Since a new observation does not have its place in a certain cluster, obtaining its allocation is only possible when we include such an observation in a new cluster analysis, in order to obtain a new qualitative variable and, consequently, new dummies.

In addition, this new qualitative variable can also be considered the dependent variable of a multinomial logistic regression model, allowing the researcher to evaluate the probabilities that each observation has of belonging to each one of the clusters formed, due to the behavior of other explanatory variables not initially considered in the cluster analysis. We would also like to highlight that this procedure depends on the research objectives and on the construct established, and has a diagnostic nature as regards the behavior of the variables in the sample for the existing observations, without a predictive purpose.

Finally, if the clusters formed are substantial in terms of the number of observations allocated to them, we may even apply specific confirmatory techniques to each cluster identified, using other variables, so that better adjusted models can possibly be generated.

Next, the same dataset will be used to run cluster analyses in SPSS and Stata. In Section 11.3, we will discuss the procedures for elaborating the techniques studied in SPSS, together with their results. In Section 11.4, we will study the commands to perform the procedures in Stata, with the respective outputs.

11.3 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN SPSS

In this section, we will discuss the step-by-step procedure for elaborating our example in the IBM SPSS Statistics Software. The main objective is to offer the researcher an opportunity to run cluster analyses with hierarchical and nonhierarchical schedules in this software package, given how easy it is to use and how didactical the operations are. Every time an output is shown, we will mention the respective result obtained when performing the algebraic solution in the previous sections, so that the researcher can compare them and increase his own knowledge on the topic. The use of the images in this section has been authorized by the International Business Machines Corporation©.

11.3.1 Elaborating Hierarchical Agglomeration Schedules in SPSS

Going back to the example presented in Section 11.2.2.1.2, remember that our professor is interested in grouping students into homogeneous clusters based on their grades (from 0 to 10) obtained on the college entrance exams in Mathematics, Physics, and Chemistry. The data can be found in the file CollegeEntranceExams.sav and are exactly the same as the ones presented in Table 11.12. In this section, we will carry out the cluster analysis using the Euclidean distance between the observations and considering only the single-linkage method.

In order for a cluster analysis to be elaborated through a hierarchical method in SPSS, we must click on Analyze → Classify → Hierarchical Cluster.... A dialog box like the one shown in Fig. 11.25 will open. Next, we must insert the original variables from our example (mathematics, physics, and chemistry) into Variables and the variable that identifies the observations (student) into Label Cases by, as shown in Fig. 11.26. If the researcher does not have a variable that represents the name of the observations (in this case, a string), he may leave this last cell blank.

First of all, in Statistics..., let’s choose the options Agglomeration schedule and Proximity matrix. The former presents in the outputs the table with the agglomeration schedule, constructed based on the distance measure to be chosen and on the linkage method to be defined; the latter presents the matrix with the distances between each pair of observations. Let’s maintain the option None in Cluster Membership. Fig. 11.27 shows how this dialog box will look. When we click on Continue, we will go back to the main dialog box of the hierarchical cluster analysis.

Next, we must click on Plots.... As seen in Fig. 11.28, let’s select the option Dendrogram and the option None in Icicle. In the same way, let’s click on Continue, so that we can go back to the main dialog box.
In Method..., which is the most important dialog box of the hierarchical cluster analysis, we must choose the single-linkage method, also known as the nearest neighbor. Thus, in Cluster Method, let’s select the option Nearest neighbor. An inquisitive researcher may see that the complete (Furthest neighbor) and average (Between-groups linkage) linkage methods, discussed in Section 11.2.2.1, are also available in this option. Besides, since the variables in the dataset are metric, we have to choose one of the dissimilarity measures found in Measure → Interval. In order to maintain the same logic used when solving our example algebraically, we will choose the Euclidean distance as the dissimilarity measure and, therefore, we must select the option Euclidean distance. We can also see that, in this option, we can find the other dissimilarity measures studied in Section 11.2.1.1, such as the squared Euclidean distance, Minkowski, Manhattan (Block, in SPSS), and Chebyshev, as well as Pearson’s correlation, which, even though it is a similarity measure, is also used for metric variables. Although we do not use similarity measures in this example because we are not working with binary variables, it is important to mention that some similarity measures can be selected if necessary. Hence, as discussed in Section 11.2.1.2, in Measure → Binary, we can select the simple matching, Jaccard, Dice, Anti-Dice (Sokal and Sneath 2, in SPSS), Russell and Rao, Ochiai, Yule (Yule’s Q, in SPSS), Rogers and Tanimoto, Sneath and Sokal (Sokal and Sneath 1, in SPSS), and Hamann coefficients, among others.

FIG. 11.25 Dialog box for elaborating the cluster analysis with a hierarchical method in SPSS.

FIG. 11.26 Selecting the original variables.
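To make the dissimilarity measures listed under Measure → Interval more concrete, the following Python sketch computes several of them for one pair of students. It is only an illustration, not what SPSS runs internally. The two grade vectors are an assumption: they are taken to match Table 11.12, which is not reproduced in this section, and they were chosen because they reproduce the Euclidean distance of 3.713 between Gabriela and Ovidio quoted in this section. The function names are hypothetical.

```python
import math

# Grades in (mathematics, physics, chemistry).
# ASSUMPTION: values taken to match Table 11.12, not reproduced in this section.
gabriela = (3.7, 2.7, 9.1)
ovidio = (7.0, 1.0, 9.0)


def squared_euclidean(x, y):
    """Sum of squared differences between the coordinates."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))


def manhattan(x, y):
    """Sum of absolute differences (called Block in SPSS)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute difference over the coordinates."""
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, m):
    """Minkowski distance of order m (m = 1: Manhattan, m = 2: Euclidean)."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


print(round(euclidean(gabriela, ovidio), 3))          # 3.713
print(round(squared_euclidean(gabriela, ovidio), 2))  # 13.79
print(round(manhattan(gabriela, ovidio), 1))          # 5.1
print(round(chebyshev(gabriela, ovidio), 1))          # 3.3
```

Since the Minkowski distance generalizes the Manhattan and Euclidean distances, calling minkowski with m = 1 or m = 2 should return the same values as manhattan and euclidean, respectively.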


FIG. 11.27 Selecting the options that generate the agglomeration schedule and the matrix with the distances between the pairs of observations.

FIG. 11.28 Selecting the option that generates the dendrogram.


FIG. 11.29 Dialog box for selecting the linkage method and the distance measure.

Still in the same dialog box, the researcher may request that the cluster analysis be elaborated from standardized variables. For situations in which the original variables have different measurement units, the option Z scores in Transform Values → Standardize can be selected, so that all the calculations are carried out on the standardized variables, which have means equal to 0 and standard deviations equal to 1. After these considerations, the dialog box in our example will look like the one in Fig. 11.29. Next, we can click on Continue and on OK.

The first output (Fig. 11.30) shows the dissimilarity matrix D0 formed by the Euclidean distances between each pair of observations. We can even see that the legend says, “This is a dissimilarity matrix.” If this matrix were formed by similarity measures, resulting from calculations elaborated from binary variables, it would say, “This is a similarity matrix.”
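The effect of the Z scores standardization mentioned above can be sketched in a few lines of Python. This is a minimal illustration under two assumptions: the standard deviation is the sample one (denominator n − 1), and the two variables shown (income and age) are hypothetical, introduced only because they are measured in clearly different units.

```python
import statistics


def z_scores(values):
    """Standardize a list of numbers to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


# Hypothetical variables in very different units (illustrative values only).
income = [1200.0, 3400.0, 2500.0, 4100.0, 1800.0]
age = [23.0, 45.0, 31.0, 52.0, 38.0]

z_income = z_scores(income)
z_age = z_scores(age)

print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_age])
```

After the transformation, both variables are on the same scale, so neither one dominates the distance calculation merely because of its unit of measure.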

FIG. 11.30 Matrix with Euclidean distances (dissimilarity measures) between pairs of observations.


FIG. 11.31 Hierarchical agglomeration schedule: single-linkage method and Euclidean distance.

Through this matrix, which is equal to the one whose values were calculated and presented in Section 11.2.2.1.2, we can verify that observations Gabriela and Ovidio are the most similar (the smallest Euclidean distance) in relation to the variables mathematics, physics, and chemistry ($d_{\text{Gabriela-Ovidio}} = 3.713$). Therefore, in the hierarchical schedule shown in Fig. 11.31, the first clustering stage occurs exactly by joining these two students, with Coefficient (Euclidean distance) equal to 3.713. Note that the columns Cluster 1 and Cluster 2 under Cluster Combined refer either to isolated observations, not yet incorporated into any cluster, or to clusters that have already been formed. Obviously, in the first clustering stage, the first cluster is formed by the fusion of two isolated observations.

Next, in the second stage, observation Leonor (5) is incorporated into the cluster previously formed by Gabriela (1) and Ovidio (4). With regard to the single-linkage method, we can see that the distance considered for the agglomeration of Leonor was the smallest between this observation and Gabriela or Ovidio, that is, the criterion adopted was:

$$d_{(\text{Gabriela-Ovidio})\,\text{Leonor}} = \min\{4.170;\ 5.474\} = 4.170$$

We can also see that, while columns Cluster 1 and Cluster 2 under Stage Cluster First Appears indicate in which previous stage each corresponding observation was incorporated into a certain cluster, column Next Stage shows in which future stage the respective cluster will receive a new observation or cluster, given that we are dealing with an agglomerative clustering method.

In the third stage, observation Patricia (3) is incorporated into the already formed cluster, Gabriela-Ovidio-Leonor, respecting the following distance criterion:

$$d_{(\text{Gabriela-Ovidio-Leonor})\,\text{Patricia}} = \min\{8.420;\ 6.580;\ 6.045\} = 6.045$$

And, finally, given that we have five observations, in the fourth and last stage, observation Luiz Felipe, which is still isolated (note that the last observation to be incorporated into a cluster corresponds to the last value equal to 0 in the column Cluster 2 under Stage Cluster First Appears), is incorporated into the cluster already formed by the other observations, concluding the agglomeration schedule. The distance considered at this stage is given by:

$$d_{(\text{Gabriela-Ovidio-Leonor-Patricia})\,\text{Luiz Felipe}} = \min\{10.132;\ 10.290;\ 8.223;\ 7.187\} = 7.187$$

Based on how the observations are sorted in the agglomeration schedule and on the distances used as a clustering criterion, the dendrogram can be constructed, and it can be seen in Fig. 11.32. Note that the distance measures are rescaled to construct the dendrograms in SPSS, so that interpreting the allocation of each observation to the clusters and, mainly, visualizing the highest distance leaps can be made easier, as discussed in Section 11.2.2.1.2.1. The way the observations are sorted in the dendrogram corresponds to what was presented in the agglomeration schedule (Fig. 11.31), and, from the analysis shown in Fig. 11.32, it is possible to see that the greatest distance leap occurs when Patricia merges with Gabriela-Ovidio-Leonor, which had already been formed. This leap could have already been identified in the agglomeration schedule found in Fig. 11.31, since a large increase in distance occurs when we go from the second to the third stage, that is, when the Euclidean distance increases from 4.170 to 6.045 (44.96%), so that a new cluster can be formed by incorporating another observation.
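The four stages described above can be reproduced with a short Python sketch of the single-linkage rule. It uses only the ten pairwise Euclidean distances quoted in this section; assigning each value to a pair of students assumes that the values inside each min{...} expression follow the order in which the students are named. The function names are hypothetical, and the sketch is not the routine SPSS uses.

```python
from itertools import combinations

names = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]

# Pairwise Euclidean distances quoted in this section.
dist = {
    frozenset(("Gabriela", "Ovidio")): 3.713,
    frozenset(("Gabriela", "Leonor")): 4.170,
    frozenset(("Ovidio", "Leonor")): 5.474,
    frozenset(("Gabriela", "Patricia")): 8.420,
    frozenset(("Ovidio", "Patricia")): 6.580,
    frozenset(("Leonor", "Patricia")): 6.045,
    frozenset(("Gabriela", "Luiz Felipe")): 10.132,
    frozenset(("Ovidio", "Luiz Felipe")): 10.290,
    frozenset(("Leonor", "Luiz Felipe")): 8.223,
    frozenset(("Patricia", "Luiz Felipe")): 7.187,
}


def linkage_distance(c1, c2):
    """Single linkage: smallest distance between members of two clusters."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)


def single_linkage(observations):
    """Return the agglomeration schedule as (cluster_a, cluster_b, coefficient)."""
    clusters = [frozenset([name]) for name in observations]
    schedule = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest single-linkage distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage_distance(*p))
        schedule.append((sorted(c1), sorted(c2), linkage_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return schedule


schedule = single_linkage(names)
for stage, (a, b, coefficient) in enumerate(schedule, start=1):
    print(stage, a, "+", b, coefficient)
```

The coefficients printed for stages 1 to 4 should be 3.713, 4.170, 6.045, and 7.187, the same values discussed for Fig. 11.31.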
Therefore, we can choose the existing configuration at the end of the second clustering stage, in which three clusters are formed. As discussed in Section 11.2.2.1.2.1, the criterion for identifying the number of clusters that considers the clustering stage immediately before a large leap is very useful and commonly used. Fig. 11.33 shows a vertical line (a dashed line) that “cuts” the dendrogram in the region where the highest leaps occur. At this moment, since three intersections with lines from the dendrogram happen, we can identify three corresponding clusters formed by Gabriela-Ovidio-Leonor, Patricia, and Luiz Felipe, respectively.
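The criterion of stopping at the stage immediately before the largest leap can also be sketched numerically. The Python lines below take the four coefficients of the agglomeration schedule (Fig. 11.31) and measure each leap as a relative increase, which is how the 44.96% figure in this section is obtained. Using the relative rather than the absolute increase is a choice made for this illustration, and the variable names are hypothetical.

```python
# Coefficients of the agglomeration schedule (Fig. 11.31), stages 1 to 4.
coefficients = [3.713, 4.170, 6.045, 7.187]
n_observations = 5

# Relative increase in the coefficient when moving from one stage to the next.
leaps = [
    (coefficients[s] - coefficients[s - 1]) / coefficients[s - 1]
    for s in range(1, len(coefficients))
]

# Stage immediately before the largest leap (1-based stage number).
stage_before_leap = max(range(len(leaps)), key=lambda i: leaps[i]) + 1

# After stage s of an agglomerative schedule, n - s clusters remain.
n_clusters = n_observations - stage_before_leap

print([round(100 * leap, 2) for leap in leaps])  # [12.31, 44.96, 18.89]
print(stage_before_leap, n_clusters)             # 2 3
```

With only five observations the choice is easy to make by eye, but the same calculation can help when the agglomeration schedule has many stages.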

FIG. 11.32 Dendrogram: single-linkage method and rescaled Euclidean distances in SPSS. (Horizontal axis: Rescaled Distance Cluster Combine, from 0 to 25; observations from top to bottom: Gabriela (1), Ovidio (4), Leonor (5), Patricia (3), and Luiz Felipe (2).)

FIG. 11.33 Dendrogram with cluster identification. (Clusters marked: Gabriela-Ovidio-Leonor, individual cluster Patricia, and individual cluster Luiz Felipe.)


FIG. 11.34 Defining the number of clusters.

As discussed, it is common to find dendrograms that make it difficult to identify distance leaps, mainly because there are considerably similar observations in the dataset in relation to all the variables under analysis. In these situations, it is advisable to use the squared Euclidean distance and the complete-linkage method (furthest neighbor). This combination of criteria is very popular in datasets with extremely homogeneous observations.

Having adopted the solution with three clusters, we can once again click on Analyze → Classify → Hierarchical Cluster... and, in Statistics..., select the option Single solution in Cluster Membership. In this option, we must insert the number 3 into Number of clusters, as shown in Fig. 11.34. When we click on Continue, we will go back to the main dialog box of the cluster analysis. In Save..., let’s choose the option Single solution and, in the same way, insert the number 3 into Number of clusters, as shown in Fig. 11.35, so that the new variable corresponding to the allocation of observations to the clusters becomes available in the dataset. Next, we can click on Continue and on OK.

Although the outputs generated are the same, it is important to notice that a new table of results is presented, showing the allocation of the observations to the clusters. Fig. 11.36 shows, for three clusters, that, while observations Gabriela, Ovidio, and Leonor form a single cluster, called 1, observations Luiz Felipe and Patricia form two individual clusters, called 2 and 3, respectively. Even though these names are numerical, it is important to highlight that they only represent the labels (categories) of a qualitative variable. When elaborating the procedure described, we can see that a new variable is generated in the dataset. It is called CLU3_1 by SPSS, as shown in Fig. 11.37. This new variable is automatically classified by the software as Nominal, that is, qualitative, as shown in Fig. 11.38, which can be obtained when we click on Variable View, in the lower left-hand side of the screen in SPSS.

As we have already discussed, variable CLU3_1 can be used in other exploratory techniques, such as correspondence analysis, or in confirmatory techniques. In the latter, it can be inserted, for example, into the vector of explanatory variables (as long as it is transformed into dummies) of a multiple regression model, or used as the dependent variable of a multinomial logistic regression model, in which researchers intend to study how other variables, not inserted into the cluster analysis, affect the probability of each observation belonging to each one of the clusters formed. However, this decision depends on the research objectives.

At this moment, the researcher may consider the cluster analysis with hierarchical agglomeration schedules concluded. Nevertheless, based on the new variable CLU3_1, he may still use the one-way ANOVA to study whether the values of a certain variable differ between the clusters formed, that is, whether the variability between the groups is significantly higher than the variability within each one of them. Even though this analysis was not developed when solving the hierarchical schedules algebraically, since we chose to carry it out only after the k-means procedure in Section 11.2.2.2.2, we can show how it is applied at this point, since we have already allocated the observations to the groups.
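The transformation of the cluster variable into dummies, required before it can enter a regression model as an explanatory variable, can be sketched as follows. The membership values follow the allocation described for Fig. 11.36, with the students in the order given by the observation numbers of the agglomeration schedule. The function name, the column names cluster_2 and cluster_3, and the choice of cluster 1 as the reference category are hypothetical choices made for this illustration.

```python
# Cluster membership as saved in CLU3_1, in dataset order.
students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
clu3_1 = [1, 2, 3, 1, 1]


def to_dummies(labels, reference):
    """Build one 0/1 column per category, leaving out the reference category."""
    categories = sorted(set(labels))
    return {
        f"cluster_{c}": [1 if label == c else 0 for label in labels]
        for c in categories
        if c != reference
    }


# K = 3 clusters yield K - 1 = 2 dummies; cluster 1 is the reference.
dummies = to_dummies(clu3_1, reference=1)

for name, row in zip(students, zip(*dummies.values())):
    print(name, row)
```

Leaving one category out avoids perfect collinearity with the intercept of the regression model, which is why only K − 1 dummies are created for K clusters.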


FIG. 11.35 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Hierarchical procedure.

FIG. 11.36 Allocating the observations to the clusters.

FIG. 11.37 Dataset with the new variable CLU3_1—Allocation of each observation.


FIG. 11.38 Nominal (qualitative) classification of the variable CLU3_1.

In order to do that, let’s click on Analyze → Compare Means → One-Way ANOVA.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Dependent List and variable CLU3_1 (Single Linkage) into Factor. The dialog box will look like the one shown in Fig. 11.39. In Options..., let’s choose the options Descriptive (in Statistics) and Means plot, as shown in Fig. 11.40. Next, we can click on Continue and on OK.

While Fig. 11.41 shows the descriptive statistics of the clusters per variable, similar to Tables 11.24, 11.25, and 11.26, Fig. 11.42 uses these values and shows the calculation of the variation between the groups (Between Groups) and within them (Within Groups), as well as the F statistics for each variable and the respective significance levels. We can see that these values correspond to the ones calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.30. From Fig. 11.42, we can see that sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one group that has a statistically different mean, when compared to the others, at a significance level of 0.05. However, the same cannot be said about the variables mathematics and chemistry.

Although we have an idea of which group has a statistically different mean compared to the others for the variable physics, based on the outputs seen in Fig. 11.41, constructing the diagrams may further facilitate the analysis of the differences between the variable means per cluster. The charts generated by SPSS (Figs. 11.43, 11.44, and 11.45) allow us to see these differences between the groups for each variable analyzed. From the chart seen in Fig. 11.44, it is possible to see that group 2, formed only by observation Luiz Felipe, in fact, has a mean different from the others in relation to the variable physics. Besides, even though we can see from the diagrams in Figs. 11.43 and 11.45 that there are mean differences of the variables mathematics and chemistry between the groups, these differences cannot be considered statistically significant, at a significance level of 0.05, since we are dealing with a very small number of observations, and the F statistic values are very sensitive to the sample size. This graphical analysis becomes really useful when we are studying datasets with a larger number of observations and variables.
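The calculation behind the one-way ANOVA for the variable physics can be sketched in Python. The physics grades are an assumption: they are taken to match Table 11.12, which is not reproduced in this section, and they lead to the significance level of 0.043 quoted above. The p-value uses a closed-form expression for the upper tail of the F distribution that is valid only when the between-groups degrees of freedom equal 2, which is the case with three clusters. The function name is hypothetical.

```python
# Physics grades by cluster.
# ASSUMPTION: grades taken to match Table 11.12, not reproduced in this section.
groups = {
    1: [2.7, 1.0, 2.0],  # Gabriela, Ovidio, Leonor
    2: [8.0],            # Luiz Felipe
    3: [1.0],            # Patricia
}


def one_way_anova(groups):
    """Return SS between, SS within, both degrees of freedom, and the F statistic."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(groups)

# Upper-tail probability of F(2, df_w); this closed form holds only for df_b == 2.
p_value = (1 + 2 * f_stat / df_w) ** (-df_w / 2)

print(round(ss_b, 3), round(ss_w, 3))        # 32.612 1.46
print(round(f_stat, 3), round(p_value, 3))   # 22.337 0.043
```

With only two degrees of freedom within the groups, the test has very little power, which is consistent with the remark above about the sensitivity of the F statistic to the sample size.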

FIG. 11.39 Dialog box with the selection of the variables to run the one-way analysis of variance in SPSS.


FIG. 11.40 Selecting the options to carry out the one-way analysis of variance.

FIG. 11.41 Descriptive statistics of the clusters per variable.

FIG. 11.42 One-way analysis of variance—Between groups and within groups variation, F statistics, and significance levels per variable.


FIG. 11.43 Means of the variable mathematics in the three clusters. (Vertical axis: mean of mathematics grade, 0 to 10; horizontal axis: single-linkage clusters 1, 2, and 3.)

FIG. 11.44 Means of the variable physics in the three clusters. (Vertical axis: mean of physics grade, 0 to 10; horizontal axis: single-linkage clusters 1, 2, and 3.)

Finally, researchers can still complement their analysis with a procedure known as multidimensional scaling, since the distance matrix can be used to construct a chart that allows a two-dimensional visualization of the relative positions of the observations, regardless of the total number of variables. In order to do that, we must structure a new dataset formed exactly by the distance matrix. For the data in our example, we can open the file CollegeEntranceExamMatrix.sav, which contains the Euclidean distance matrix shown in Fig. 11.46. Note that both the columns and the rows of this new dataset refer to the observations in the original dataset (square distance matrix).

FIG. 11.45 Means of the variable chemistry in the three clusters. (Vertical axis: mean of chemistry grade, 0 to 10; horizontal axis: single-linkage clusters 1, 2, and 3.)

FIG. 11.46 Dataset with the Euclidean distance matrix.

Let’s click on Analyze → Scale → Multidimensional Scaling (ALSCAL).... In the dialog box that will open, we must insert the variables that represent the observations in Variables, as shown in Fig. 11.47. Since the data already correspond to the distances, nothing needs to be done regarding the field Distances. In Model..., let’s select the option Ratio in Level of Measurement (note that the option Euclidean distance in Scaling Model has already been selected) and, in Options..., the option Group plots in Display, as shown in Figs. 11.48 and 11.49, respectively. Next, we can click on Continue and on OK. Fig. 11.50 shows the chart with the relative positions of the observations projected on a plane. This type of chart is really useful when researchers wish to prepare didactical presentations of observation clusters (individuals, companies, municipalities, countries, among other examples) and to make the interpretation of the clusters easier, mainly when there is a relatively large number of variables in the dataset.


FIG. 11.47 Dialog box with the selection of the variables to run the multidimensional scaling in SPSS.

FIG. 11.48 Defining the nature of the variable that corresponds to the distance measure.


FIG. 11.49 Selecting the option for constructing the two-dimensional chart.

FIG. 11.50 Two-dimensional chart with the projected relative positions of the observations. (Derived stimulus configuration, Euclidean distance model; axes: Dimension 1 and Dimension 2; points: Gabriela, Leonor, Luiz Felipe, Ovidio, and Patricia.)


FIG. 11.51 Dialog box for elaborating the cluster analysis with the nonhierarchical K-means method in SPSS.

11.3.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS

Maintaining the same logic proposed in the chapter, we will develop a cluster analysis based on the nonhierarchical k-means agglomeration schedule from the same dataset. Thus, we must once again use the file CollegeEntranceExams.sav.

In order to do that, we must click on Analyze → Classify → K-Means Cluster.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Variables, and the variable student into Label Cases by. The main difference between this initial dialog box and the one corresponding to the hierarchical procedure is that the number of clusters from which the k-means algorithm will be elaborated must be specified. In our example, let’s insert the number 3 into Number of Clusters. Fig. 11.51 shows how the dialog box will look.

We can see that we inserted the original variables into the field Variables. This procedure is acceptable since, in our example, the values are in the same unit of measure. If this is not the case, researchers must standardize the variables through the Z-scores procedure before elaborating the k-means procedure: in Analyze → Descriptive Statistics → Descriptives..., insert the original variables into Variables and select the option Save standardized values as variables. After clicking on OK, researchers will see that new standardized variables have become part of the dataset.

Going back to the initial screen of the k-means procedure, we will click on Save.... In the dialog box that will open, we must select the option Cluster membership, as shown in Fig. 11.52. When we click on Continue, we will go back to the previous dialog box. In Options..., let’s select the options Initial cluster centers, ANOVA table, and Cluster information for each case, in Statistics, as shown in Fig. 11.53. Next, we can click on Continue and on OK. It is important to mention that SPSS already uses the Euclidean distance as the standard dissimilarity measure when elaborating the k-means procedure.


FIG. 11.52 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Nonhierarchical procedure.

FIG. 11.53 Selecting the options to perform the K-means procedure.

The first two outputs generated refer to the initial step and to the iteration of the k-means algorithm. The centroid coordinates are presented in the initial step and, through them, we can notice that SPSS considers that the three clusters are formed by the first three observations in the dataset. Although this decision is different from the one we used in Section 11.2.2.2.2, this choice is purely arbitrary and, as we will see later, it does not impact the formation of clusters in the final step of the k-means algorithm in this example. While Fig. 11.54 shows the values of the original variables for observations Gabriela, Luiz Felipe, and Patricia (as shown in Table 11.16) as the centroid coordinates of the three groups, in Fig. 11.55 we can see, after the first iteration of the algorithm, that the change in the centroid coordinates of the first cluster is 1.897, which corresponds exactly to the Euclidean distance between observation Gabriela and the centroid of the cluster Gabriela-Ovidio-Leonor (as shown in Table 11.23). In this last figure, in the footnotes, it is also possible to see the measure 7.187, which corresponds to the Euclidean distance between observations Luiz Felipe and Patricia, which remain isolated after the iteration.

The next three figures refer to the final stage of the k-means algorithm. While the output Cluster Membership (Fig. 11.56) shows the allocation of each observation to each one of the three clusters, as well as the Euclidean distances between each observation and the centroid of the respective group, the output Distances between Final Cluster Centers (Fig. 11.58) shows the Euclidean distances between the group centroids. These two outputs have values that were calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.23. Moreover, the output Final Cluster Centers (Fig. 11.57) shows the centroid coordinates of the groups after the final stage of this nonhierarchical procedure, which correspond to the values already calculated and presented in Table 11.22.

FIG. 11.54 First step of the K-means algorithm: centroids of the three groups as observation coordinates.

FIG. 11.55 First iteration of the K-means algorithm and change in the centroid coordinates.

FIG. 11.56 Final stage of the K-means algorithm: allocation of the observations and distances to the respective cluster centroids.
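The iteration described above can be followed with a minimal Python sketch of a batch k-means procedure that, as reported for SPSS in this example, starts from the first three observations as initial centroids. Two assumptions apply: the grades are taken to match Table 11.12, which is not reproduced in this section, and the update rule shown is the textbook one, so the internal routine of SPSS may differ in its details. The sketch does not handle empty clusters, which do not occur here. The function names are hypothetical.

```python
import math

# Grades in (mathematics, physics, chemistry), in dataset order.
# ASSUMPTION: values taken to match Table 11.12, not reproduced in this section.
data = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_means(points, k, max_iter=10):
    """Batch k-means with the first k observations as initial centroids."""
    centroids = list(points[:k])
    history = []  # change in each centroid, one list per iteration
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        changes = [euclidean(old, new) for old, new in zip(centroids, new_centroids)]
        history.append(changes)
        centroids = new_centroids
        if max(changes) == 0:  # no centroid moved: the allocation is stable
            break
    return labels, centroids, history


labels, centroids, history = k_means(list(data.values()), k=3)

print(dict(zip(data, (label + 1 for label in labels))))
print([round(change, 3) for change in history[0]])  # [1.897, 0.0, 0.0]
```

In this sketch, the first iteration moves only the first centroid, by 1.897, and the second iteration confirms that the allocation no longer changes.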

FIG. 11.57 Final stage of the K-Means algorithm—Cluster centroid coordinates.


FIG. 11.58 Final stage of the K-means algorithm—Distances between the cluster centroids.

FIG. 11.59 One-way analysis of variance in the K-means procedure—Variation between groups and within groups, F statistics, and significance levels per variable.

The ANOVA output (Fig. 11.59) is analogous to the one presented in Table 11.30 in Section 11.2.2.2.2 and in Fig. 11.42 in Section 11.3.1, and, through it, we can see that only the variable physics has a statistically different mean in at least one of the groups formed, when compared to the others, at a significance level of 0.05. As we have previously discussed, if one or more variables are not contributing to the formation of the suggested number of clusters, we recommend that the algorithm be reapplied without these variables. The researcher can even use a hierarchical procedure without the aforementioned variables before reapplying the k-means procedure. For the data in our example, however, the analysis would become univariate due to the exclusion of the variables mathematics and chemistry, which demonstrates the risk researchers take when working with extremely small datasets in cluster analysis.

It is important to mention that the ANOVA output must be used only to study which variables contribute most to the formation of the specified number of clusters, since the clusters are chosen so that the differences between the observations allocated to different groups are maximized. Thus, as explained in this output’s footnotes, we cannot use the F statistic to test whether the groups formed are equal. For this reason, it is common to find the term pseudo F for this statistic in the existing literature.

Finally, Fig. 11.60 shows the number of observations in each one of the clusters. Similar to the hierarchical procedure, we can see that a new variable (obviously qualitative) is generated in the dataset after the k-means procedure, which is called QCL_1 by SPSS, as shown in Fig. 11.61. In this example, this variable ended up being identical to the variable CLU3_1 (Fig. 11.37).
Nonetheless, this fact does not always happen with a larger number of observations and in the cases in which different dissimilarity measures are used in the hierarchical and nonhierarchical procedures. Having presented the procedures for the application of the cluster analysis in SPSS, let’s discuss this technique in Stata.

FIG. 11.60 Number of observations in each cluster.


FIG. 11.61 Dataset with the new variable QCL_1—Allocation of each observation.

11.4 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN STATA

Now, we will present the step-by-step procedure for preparing our example in Stata Statistical Software®. In this section, our main objective is not to once again discuss the concepts related to the cluster analysis, but to give the researcher an opportunity to prepare the technique by using the commands this software has to offer. At each presentation of an output, we will mention the respective result obtained when performing its algebraic solution and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©.

11.4.1 Elaborating Hierarchical Agglomeration Schedules in Stata

Let’s begin with the dataset constructed by the professor, which contains the grades in Mathematics, Physics, and Chemistry obtained by five students in the college entrance exams. The dataset can be found in the file CollegeEntranceExams.dta and is exactly the same as the one presented in Table 11.12 in Section 11.2.2.1.2. Initially, we can type the command desc, which makes it possible to analyze the characteristics of the dataset, such as the number of observations, the number of variables, and the description of each one of them. Fig. 11.62 shows the first output in Stata. As discussed previously, since the original variables have values in the same unit of measure, in this example, it is not necessary to standardize them by using the Z-scores procedure. However, if the researcher wishes to, he may obtain the standardized variables through the following commands:

egen zmathematics = std(mathematics)
egen zphysics = std(physics)
egen zchemistry = std(chemistry)

FIG. 11.62 Description of the CollegeEntranceExams.dta dataset.


TABLE 11.31 Terms in Stata Corresponding to the Measures for Metric Variables

Measure for Metric Variables    Term in Stata
Euclidean                       L2
Squared Euclidean               L2squared
Manhattan                       L1
Chebyshev                       Linf
Canberra                        Canberra
Pearson’s Correlation           corr

First of all, let’s obtain the matrix with the distances between the pairs of observations. In general, the sequence of commands for obtaining distance or similarity matrices in Stata is:

    matrix dissimilarity D = variables*, option*
    matrix list D

where the term variables* must be replaced by the list of variables to be considered in the analysis, and the term option* by the term corresponding to the distance or similarity measure that the researcher wishes to use. Table 11.31 shows the terms in Stata that correspond to each one of the measures for metric variables studied in Section 11.2.1.1, while Table 11.32 shows the terms related to the measures for binary variables studied in Section 11.2.1.2. Since we wish to obtain the Euclidean distance matrix between the pairs of observations, maintaining the criterion used throughout the chapter, we must type the following sequence of commands:

    matrix dissimilarity D = mathematics physics chemistry, L2
    matrix list D

The output generated, which can be seen in Fig. 11.63, agrees with matrix D0 presented in Section 11.2.2.1.2.1, and also with Fig. 11.30, obtained when we elaborated the technique in SPSS (Section 11.3.1). Next, we will carry out the cluster analysis itself. The general command used to run a cluster analysis through a hierarchical schedule in Stata is:

    cluster method* variables*, measure(option*)

where, besides replacing the terms variables* and option*, as discussed previously, we must replace the term method* by the linkage method chosen by the researcher. Table 11.33 shows the terms in Stata related to the methods discussed in Section 11.2.2.1.

TABLE 11.32 Terms in Stata Corresponding to the Measures for Binary Variables

| Measure for Binary Variables | Term in Stata |
|---|---|
| Simple matching | matching |
| Jaccard | Jaccard |
| Dice | Dice |
| AntiDice | antiDice |
| Russell and Rao | Russell |
| Ochiai | Ochiai |
| Yule | Yule |
| Rogers and Tanimoto | Rogers |
| Sneath and Sokal | Sneath |
| Hamann | Hamann |


FIG. 11.63 Euclidean distance matrix between pairs of observations.
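The Euclidean distance matrix can also be cross-checked outside Stata. The following is a minimal Python sketch, not a reproduction of Stata's output: the grades are assumed values, chosen to be consistent with the distances 3.713 and 7.187 quoted in this section, and the names students, grades, and D are used here only for illustration.

```python
from math import dist

# Assumed grades (mathematics, physics, chemistry) for the five students.
# They are consistent with the distances 3.713 and 7.187 quoted in the text,
# but Table 11.12 remains the authoritative source.
students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
grades = [
    (3.7, 2.7, 9.1),
    (7.8, 8.0, 1.5),
    (8.9, 1.0, 2.7),
    (7.0, 1.0, 9.0),
    (3.4, 2.0, 5.0),
]

# Euclidean (L2) distance between every pair of observations.
D = [[dist(a, b) for b in grades] for a in grades]

for name, row in zip(students, D):
    print(f"{name:12s}", " ".join(f"{d:7.3f}" for d in row))
```

The matrix is symmetric with zeros on the main diagonal, so only its lower triangle needs to be inspected.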

TABLE 11.33 Terms in Stata That Correspond to the Linkage Methods in Hierarchical Agglomeration Schedules

| Linkage Method | Term in Stata |
|---|---|
| Single | singlelinkage |
| Complete | completelinkage |
| Average | averagelinkage |

Therefore, for the data in our example and following the criterion adopted throughout this chapter (single-linkage method with Euclidean distance, term L2), we must type the following command:

    cluster singlelinkage mathematics physics chemistry, measure(L2)

After that, we can type the command cluster list, which presents, in summarized form, the criteria used by the researcher to develop the hierarchical cluster analysis. Fig. 11.64 shows the outputs generated.

FIG. 11.64 Elaboration of the hierarchical cluster analysis and summary of the criteria used.

From Fig. 11.64 and by analyzing the dataset, we can verify that three new variables are generated: the identification of each observation (_clus_1_id), the ordering of the observations when creating the clusters (_clus_1_ord), and the Euclidean distances at which a new observation is grouped in each one of the clustering stages (_clus_1_hgt). Fig. 11.65 shows the dataset after this cluster analysis is elaborated.

FIG. 11.65 Dataset with the new variables.

It is important to mention that Stata presents the values of the variable _clus_1_hgt offset by one row, which can make the analysis a little confusing. While distance 3.713 refers to the merger between observations Ovidio and Gabriela (first stage of the agglomeration schedule), distance 7.187 corresponds to the fusion between Luiz Felipe and the cluster already formed by all the other observations (last stage of the agglomeration schedule), as already shown in Table 11.13 and in Fig. 11.31. To correct this discrepancy and obtain the actual behavior of the distances in each new clustering stage, researchers can type the following sequence of commands, whose output can be seen in Fig. 11.66:

    gen dist = _clus_1_hgt[_n-1]
    replace dist=0 if dist==.
    sort dist
    list student dist

Note that a new variable (dist) is generated. It corrects the discrepancy found in variable _clus_1_hgt (term [_n-1]) and presents the Euclidean distance at which a new cluster is established in each stage of the agglomeration schedule.

FIG. 11.66 Stages of the agglomeration schedule and respective Euclidean distances.
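The stages of the agglomeration schedule can be sketched outside Stata as well. The Python code below is a minimal illustration of the single-linkage (nearest neighbor) logic, not Stata's implementation: the grades are assumed values consistent with the distances 3.713 and 7.187 quoted above, and the function name single_linkage is hypothetical.

```python
from itertools import combinations
from math import dist

# Assumed grades (mathematics, physics, chemistry); Table 11.12 is the
# authoritative source. Row order: Gabriela, Luiz Felipe, Patricia, Ovidio, Leonor.
grades = [
    (3.7, 2.7, 9.1),
    (7.8, 8.0, 1.5),
    (8.9, 1.0, 2.7),
    (7.0, 1.0, 9.0),
    (3.4, 2.0, 5.0),
]


def single_linkage(points):
    """Return the agglomeration schedule as (members_a, members_b, distance)."""

    def link(a, b):
        # Single linkage: the distance between two clusters is the smallest
        # distance between any pair of their members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    clusters = [frozenset([i]) for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        # Merge the two closest clusters at each stage.
        a, b = min(combinations(clusters, 2), key=lambda pair: link(*pair))
        schedule.append((sorted(a), sorted(b), link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return schedule


schedule = single_linkage(grades)
for stage, (a, b, d) in enumerate(schedule, start=1):
    print(stage, a, b, round(d, 3))
```

Each printed row corresponds to one stage of the schedule, with observations identified by their zero-based row index. Stopping before the last two mergers leaves the three-cluster solution.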

Having carried out this phase, we can ask Stata to construct the dendrogram by typing one of the two equivalent commands:

    cluster dendrogram, labels(student) horizontal

or

    cluster tree, labels(student) horizontal

The diagram generated can be seen in Fig. 11.67. The dendrogram constructed by Stata, in terms of Euclidean distances, is equal to the one shown in Fig. 11.12, constructed when the modeling was solved algebraically. However, it differs from the one constructed by SPSS (Fig. 11.32) because it does not use rescaled measures. Regardless of this fact, we will adopt three clusters as a possible solution: one formed by Leonor, Ovidio, and Gabriela, another by Patricia, and the third by Luiz Felipe, since the criteria discussed about large distance leaps coherently lead us toward this decision. In order to generate a new variable corresponding to the allocation of the observations to the three clusters, we must type the following sequence of commands. Note that we have named this new variable cluster. The output seen in Fig. 11.68 shows the allocation of the observations to the groups and is equivalent to the one shown in Fig. 11.36 (SPSS).

    cluster generate cluster = groups(3), name(_clus_1)
    sort _clus_1_id
    list student cluster


FIG. 11.67 Dendrogram: single-linkage method and Euclidean distances in Stata.

[Dendrogram for _clus_1 cluster analysis. Observations, from top to bottom: Leonor, Ovidio, Gabriela, Patricia, Luiz Felipe. Horizontal axis: L2 dissimilarity measure, from 0 to 8.]

FIG. 11.68 Allocating the observations to the clusters.

Finally, by using the one-way analysis of variance (ANOVA), we will study whether the values of a certain variable differ between the groups represented by the categories of the new qualitative variable cluster generated in the dataset, that is, whether the variation between the groups is significantly higher than the variation within each one of them, following the logic proposed in Section 11.3.1. In order to do that, let’s type the following commands, in which the three metric variables (mathematics, physics, and chemistry) are individually related to the variable cluster:

    oneway mathematics cluster, tabulate
    oneway physics cluster, tabulate
    oneway chemistry cluster, tabulate

The results of the ANOVA for the three variables are in Fig. 11.69. The outputs in this figure, which show the variation Between groups and Within groups, the F statistics, and the respective significance levels (Prob. F, or Prob > F in Stata) for each variable, are equal to the ones calculated algebraically and presented in Table 11.30 (Section 11.2.2.2.2), and also to those in Fig. 11.42, obtained when this procedure was elaborated in SPSS (Section 11.3.1). Therefore, as we have already discussed, for the variable physics there is at least one cluster whose mean is statistically different from the others at a significance level of 0.05 (Prob. F = 0.0429 < 0.05), whereas the variables mathematics and chemistry do not have statistically different means between the three groups formed for this sample at the significance level set. It is important to bear in mind that, if more than one variable has Prob. F less than 0.05, the one considered the most discriminant of the groups is the one with the highest F statistic (that is, the lowest significance level Prob. F).
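The F statistic and the significance level for the variable physics can be checked with a short computation. The Python sketch below makes several assumptions: the physics grades and the cluster labels are assumed values (the numbering of the clusters is arbitrary), the function name oneway_f is hypothetical, and the closed-form expression used for Prob. F is valid only when there are exactly three groups (two degrees of freedom in the numerator).

```python
# Assumed physics grades and cluster allocation. Row order: Gabriela,
# Luiz Felipe, Patricia, Ovidio, Leonor. Cluster labels are arbitrary.
physics = [2.7, 8.0, 1.0, 1.0, 2.0]
cluster = [1, 2, 3, 1, 1]


def oneway_f(values, groups):
    """One-way ANOVA: return the F statistic and its degrees of freedom."""
    labels = sorted(set(groups))
    n, k = len(values), len(labels)
    grand_mean = sum(values) / n
    ss_between = 0.0
    ss_within = 0.0
    for g in labels:
        members = [v for v, lab in zip(values, groups) if lab == g]
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)
    df1, df2 = k - 1, n - k
    return (ss_between / df1) / (ss_within / df2), df1, df2


f, df1, df2 = oneway_f(physics, cluster)
if df1 != 2:
    raise ValueError("the closed-form tail probability below requires df1 == 2")
# Upper tail of the F distribution with 2 numerator degrees of freedom.
p = (1 + df1 * f / df2) ** (-df2 / 2)
print(round(f, 3), round(p, 4))
```

With these assumed values, the significance level agrees with the Prob. F reported above for physics.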


FIG. 11.69 ANOVA for the variables mathematics, physics, and chemistry.

Although the hierarchical analysis could be concluded at this point, researchers have the option of running a multidimensional scaling to see the projections of the relative positions of the observations in a two-dimensional chart, similar to what was done in Section 11.3.1. In order to do that, they may type the following command:

    mds mathematics physics chemistry, id(student) method(modern) measure(L2) loss(sstress) config nolog

The outputs generated can be found in Figs. 11.70 and 11.71, and the chart in the latter corresponds to the one shown in Fig. 11.50.


FIG. 11.70 Elaborating the multidimensional scaling in Stata.

FIG. 11.71 Chart with projections of the relative positions of the observations.

Having presented the commands to carry out the cluster analysis with hierarchical agglomeration schedules in Stata, let’s move on to the elaboration of the nonhierarchical k-means agglomeration schedule in the same software package.

11.4.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata

In order to apply the k-means procedure to the data in the file CollegeEntranceExams.dta, we must type the following command:

    cluster kmeans mathematics physics chemistry, k(3) name(kmeans) measure(L2) start(firstk)

where the term k(3) tells the algorithm to form three clusters. Besides, we define that a new variable with the allocation of the observations to the three groups will be generated in the dataset with the name kmeans (term name(kmeans)), and that the distance measure used will be the Euclidean distance (term L2). Moreover, the term firstk specifies that the coordinates of the first k observations of the sample will be used as the initial centroids of the k clusters (in our case, k = 3), which corresponds exactly to the criterion adopted by SPSS, as discussed in Section 11.3.2.
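The role of the first k observations as initial centroids can be illustrated with a minimal Python sketch of the usual k-means iteration (assign each observation to the nearest centroid, then recompute the centroids). This is not Stata's internal implementation: the grades are assumed values, and the function name kmeans_firstk is hypothetical.

```python
from math import dist

# Assumed grades (mathematics, physics, chemistry); Table 11.12 is the
# authoritative source. Row order: Gabriela, Luiz Felipe, Patricia, Ovidio, Leonor.
grades = [
    (3.7, 2.7, 9.1),
    (7.8, 8.0, 1.5),
    (8.9, 1.0, 2.7),
    (7.0, 1.0, 9.0),
    (3.4, 2.0, 5.0),
]


def kmeans_firstk(points, k, max_iter=100):
    """k-means using the first k observations as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Allocate each observation to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no reallocation: the procedure has converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [lab + 1 for lab in labels], centroids


labels, centroids = kmeans_firstk(grades, 3)
print(labels)
print([[round(v, 2) for v in c] for c in centroids])
```

With the assumed grades, the allocation coincides with the three-cluster solution of the hierarchical procedure, and the final centroids are the per-cluster means of the three variables.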


FIG. 11.72 Elaborating the nonhierarchical K-means procedure and a summary of the criteria used.

Next, we can type the command cluster list kmeans to present, in summarized form, the criteria adopted for elaborating the k-means procedure. The outputs in Fig. 11.72 show what is generated by Stata after we type the last two commands. The next two commands generate two tables that show, respectively, the number of observations in each one of the three clusters formed and the allocation of each observation to these groups:

    table kmeans
    list student kmeans

Fig. 11.73 shows these outputs. These results correspond to those found when the k-means procedure was solved algebraically in Section 11.2.2.2.2 (Fig. 11.23), and to those obtained when this procedure was elaborated using SPSS in Section 11.3.2 (Figs. 11.60 and 11.61). Even though we could run a one-way analysis of variance for the original variables in the dataset based on the new qualitative variable generated (kmeans), we chose not to carry out this procedure here, since we have already done so for the variable cluster generated in Section 11.4.1 after the hierarchical procedure, which is exactly the same as the variable kmeans in this case. On the other hand, for pedagogical purposes, we present the command that generates the means of each variable in the three clusters, so that they can be compared:

    tabstat mathematics physics chemistry, by(kmeans)

The output generated can be seen in Fig. 11.74 and is equivalent to the one presented in Tables 11.24, 11.25, and 11.26.

FIG. 11.73 Number of observations in each cluster and allocation of observations.

FIG. 11.74 Means per cluster and general means of the variables mathematics, physics, and chemistry.

Finally, the researcher can also construct a chart to show the interrelationships between the variables, two at a time. This chart, known as a matrix chart, can give the researcher a better understanding of how the variables relate to one another and even suggest the relative position of the observations in each cluster in these interrelationships. To construct the chart shown in Fig. 11.75, we must type the following command:

    graph matrix mathematics physics chemistry, mlabel(kmeans)

FIG. 11.75 Interrelationship between the variables and relative position of the observations in each cluster (matrix chart).

Obviously, this chart could also have been constructed in the previous section. However, we chose to present it only at the end of the preparation of the k-means procedure in Stata. Its analysis shows, among other things, that the variables mathematics and chemistry alone are not enough to keep observations Luiz Felipe and Patricia (clusters 2 and 3, respectively) apart. It is necessary to consider the variable physics so that these two students can, in fact, be allocated to different clusters when forming three clusters. Although this may seem fairly obvious when looking directly at such a small dataset, the chart becomes extremely useful for larger samples with a considerable number of variables, which would multiply these interrelationships.


11.5 FINAL REMARKS

There are many situations in which researchers may wish to group observations (individuals, companies, municipalities, countries, political parties, plant species, among other examples) based on certain metric or even binary variables. Creating homogeneous clusters, reducing data structurally, and verifying the validity of previously established constructs are some of the main reasons that lead researchers to work with cluster analysis. This set of techniques allows decision-making mechanisms to be better structured and justified from the behavior and interdependence relationship between the observations of a certain dataset. Since the variable that represents the clusters formed is qualitative, the outputs of the cluster analysis can serve as inputs in other multivariate techniques, both exploratory and confirmatory.

It is strongly advisable for researchers to justify, clearly and transparently, the measure they chose as the basis for considering observations more or less similar, as well as the reasons that led them to choose nonhierarchical or hierarchical agglomeration schedules and, in the latter case, the linkage methods.

In the last few years, the evolution of technological capabilities and the development of new software, with extremely improved resources, have given rise to new and better cluster analysis techniques, which use increasingly sophisticated algorithms and are aimed at the decision-making process in several fields of knowledge, always with the main goal of grouping observations based on certain criteria. In this chapter, however, we tried to offer a general overview of the main cluster analysis methods, which are also considered the most popular.

Lastly, we would like to highlight that this important set of techniques must always be applied by using the chosen software correctly and sensibly, based on the underlying theory and on researchers’ experience and intuition.

11.6 EXERCISES

1) The scholarship department of a certain college wishes to investigate the interdependence relationship between the students entering university in a certain school year, based only on two metric variables (age, in years, and average family income, in US\$). The main objective is to propose a still unknown number of new scholarship programs aimed at homogeneous groups of students. In order to do that, data on 100 new students were collected and a dataset was constructed, which can be found in the files Scholarship.sav and Scholarship.dta, with the following variables:

| Variable | Description |
|---|---|
| student | A string variable that identifies all freshmen in the college |
| age | Student’s age (years) |
| income | Average family income (US\$) |

We would like you to:

a) Run a cluster analysis through a hierarchical agglomeration schedule, with the complete-linkage method (furthest neighbor) and the squared Euclidean distance. Only present the final part of the agglomeration schedule table and discuss the results. Reminder: Since the variables have different units of measure, it is necessary to apply the Z-scores standardization procedure to prepare the cluster analysis correctly.
b) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of students will be formed?
c) Is it possible to identify one or more very discrepant students, in comparison to the others, regarding the two variables under analysis?
d) If the answer to the previous item is “yes,” run the hierarchical cluster analysis once again with the same criteria, but now without the student(s) considered discrepant. From the analysis of the new results, can new clusters be identified?
e) Discuss how the presence of outliers can hamper the interpretation of results in a cluster analysis.

2) The marketing department of a retail company wants to study possible discrepancies in their 18 stores spread throughout three regional centers and distributed all over the country. In order to maintain and preserve its brand’s image and identity, top management would like to know if their stores are homogeneous in terms of customers’ perception of attributes, such as services, variety of goods, and organization. Thus, first, a survey with samples of clients was carried out in each store, so that data regarding these attributes could be collected. These were defined based on the average score obtained (0 to 100) in each store. Next, a dataset was constructed containing the following variables:

| Variable | Description |
|---|---|
| store | A string variable that varies from 01 to 18 and that identifies the commercial establishment (store) |
| regional | A string variable that identifies each regional center (Regional 1 to Regional 3) |
| services | Customers’ average evaluation of services rendered (score from 0 to 100) |
| assortment | Customers’ average evaluation of the variety of goods (score from 0 to 100) |
| organization | Customers’ average evaluation of the organization (score from 0 to 100) |

These data can be found in the files Retail Regional Center.sav and Retail Regional Center.dta. We would like you to:

a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method and the Euclidean distance. Present the matrix with distances between each pair of observations. Reminder: Since the variables are in the same unit, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of stores will be formed?
d) Run a multidimensional scaling and, after that, present and discuss the two-dimensional chart generated with the relative positions of the stores.
e) Run a cluster analysis by using the k-means procedure, with the number of clusters suggested in item (c), and interpret the one-way analysis of variance for each variable considered in the study, considering a significance level of 0.05. Which variable contributes the most to the creation of at least one of the clusters formed, that is, which of them is the most discriminant of the groups?
f) Is there any correspondence between the allocations of the observations to the groups obtained by the hierarchical and nonhierarchical methods?
g) Is it possible to identify an association between any regional center and a certain discrepant group of stores, which could justify the management’s concern regarding the brand’s image and identity? If the answer is “yes,” run the hierarchical cluster analysis once again with the same criteria, but now without this discrepant group of stores. By analyzing the new results, is it possible to see the differences between the other stores more clearly?
3) A financial market analyst has decided to carry out a survey with CEOs and directors of large companies that operate in the health, education, and transport industries, in order to investigate how these companies’ operations are carried out and the mechanisms that guide their decision-making processes. In order to do that, he structured a questionnaire with 50 questions, whose answers are only dichotomous, or binary. After applying the questionnaire, he got answers from 35 companies and, from then on, structured a dataset, present in the files Binary Survey.sav and Binary Survey.dta. In a generic way, the variables are:

| Variable | Description |
|---|---|
| q1 to q50 | A list of 50 dummy variables that refer to the way the operations and the decision-making processes are carried out in these companies |
| sector | Company sector |

The analyst’s main goal is to verify whether companies in the same sector show similarities in relation to the way their operations and decision-making processes are carried out, at least from their own managers’ perspective. In order to do that, after collecting the data, a cluster analysis can be elaborated. We would like you to:

a) Based on the hierarchical cluster analysis elaborated with the average-linkage method (between groups) and the simple matching similarity measure for binary variables, analyze the agglomeration schedule generated.
b) Interpret the dendrogram.
c) Check if there is any correspondence between the allocations of the companies to the clusters and the respective sectors, or, in other words, if the companies in the same sector show similarities regarding the way their operations and decision-making processes are carried out.

4) A greengrocer has decided to monitor the sales of his products for 16 weeks (4 months). The main objective is to verify if the sales behavior of three of his main products (bananas, oranges, and apples) is recurrent after a certain period, due to weekly wholesale price fluctuations, which are passed on to customers and may impact sales. These data can be found in the files Veggiefruit.sav and Veggiefruit.dta, which have the following variables:

| Variable | Description |
|---|---|
| week | A string variable that varies from 1 to 16 and identifies the week in which the sales were monitored |
| week_month | A string variable that varies from 1 to 4 and identifies the week in each one of the months |
| banana | Number of bananas sold that week (un.) |
| orange | Number of oranges sold that week (un.) |
| apple | Number of apples sold that week (un.) |

We would like you to:

a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method (nearest neighbor) and Pearson’s correlation measure. Present the matrix of similarity measures (Pearson’s correlation) between each row in the dataset (weekly periods). Reminder: Since the variables are in the same unit of measure, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: is there any indication that the joint sales behavior of bananas, oranges, and apples is recurrent in certain weeks?

APPENDIX A.1 Detecting Multivariate Outliers

Even though detecting outliers is extremely important when applying practically every multivariate data analysis technique, we chose to add this Appendix to the present chapter for two reasons: cluster analysis is the first set of multivariate exploratory techniques being studied, whose outputs can be used as inputs in several other techniques, and very discrepant observations may significantly interfere in the creation of clusters. Barnett and Lewis (1994) mention almost 1000 articles in the existing literature on outliers. However, we chose to show a very effective, computationally simple, and fast algorithm for detecting multivariate outliers, bearing in mind that the identification of outliers for each variable individually, that is, in a univariate way, has already been studied in Chapter 3.

A) Brief Presentation of the Blocked Adaptive Computationally Efficient Outlier Nominators Algorithm

Billor et al. (2000), in an extremely important work, present an interesting algorithm for detecting multivariate outliers, called Blocked Adaptive Computationally Efficient Outlier Nominators, or simply BACON. This algorithm, explained in a very clear and didactical way by Weber (2010), is defined through a few steps, briefly described as follows:

1. From a dataset with n observations and k variables $X_j$ (j = 1, ..., k), in which each observation is identified by i (i = 1, ..., n), the distance between an observation i, with the k-dimensional vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$, and the general mean of all sample values (group G), with the k-dimensional vector $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)$, is given by the following expression, known as the Mahalanobis distance:

$$d_{iG} = \sqrt{(\mathbf{x}_i - \bar{\mathbf{x}})' \cdot \mathbf{S}^{-1} \cdot (\mathbf{x}_i - \bar{\mathbf{x}})} \qquad (11.29)$$


where $\mathbf{S}$ represents the covariance matrix of the n observations. Therefore, the first step of the algorithm consists in identifying m (m > k) homogeneous observations (initial group M) that have the smallest Mahalanobis distances in relation to the entire sample. It is important to mention that the dissimilarity measure known as the Mahalanobis distance, not discussed in this chapter, is adopted by the aforementioned authors because it is not affected by variables measured in different units.

2. Next, the Mahalanobis distances between each observation i and the mean of the m observations that belong to group M, which is also a k-dimensional vector $\bar{\mathbf{x}}_M = (\bar{x}_{M1}, \bar{x}_{M2}, \ldots, \bar{x}_{Mk})$, are calculated, such that:

$$d_{iM} = \sqrt{(\mathbf{x}_i - \bar{\mathbf{x}}_M)' \cdot \mathbf{S}_M^{-1} \cdot (\mathbf{x}_i - \bar{\mathbf{x}}_M)} \qquad (11.30)$$

where $\mathbf{S}_M$ represents the covariance matrix of the m observations.

3. All the observations with Mahalanobis distances less than a certain threshold are added to the group M of observations. This threshold is defined as a corrected percentile of the $\chi^2$ distribution (85% by default in Stata).

Steps 2 and 3 must be reapplied until there are no more modifications in group M, which will then contain only the observations that are not considered outliers. Hence, the ones excluded from the group will be considered multivariate outliers.

Weber (2010) codes the algorithm proposed by Billor et al. (2000) in Stata, thus proposing the command bacon. Next, we will present and discuss an example in which this command is used. Its main advantage is that it is computationally very fast, even when applied to large datasets.

B) Example: The command bacon in Stata

Before the specific preparation of this procedure in Stata, we must install the command bacon by typing findit bacon and clicking on the link st0197 from http://www.stata-journal.com/software/sj10-3. After that, we must click on click here to install. Lastly, going back to the Stata command screen, we can type ssc install moremata and mata: mata mlib index. Having done this, we may apply the command bacon.

To apply this command, let’s use the file Bacon.dta, which shows data on the median household income (US\$) of 20,000 engineers, their age (years), and the time since they obtained their college degree (years). First of all, we can type the command desc, which lets us inspect the characteristics of the dataset. Fig. 11.76 shows this first output. Next, we can type the following command that, based on the algorithm presented, identifies the observations considered multivariate outliers:

    bacon income age tgrad, generate(outbacon)

where the term generate(outbacon) creates a new dummy variable in the dataset, called outbacon, which is equal to 0 for observations not considered outliers and equal to 1 for the ones considered outliers. This output can be seen in Fig. 11.77.
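The building block of the algorithm, the Mahalanobis distance of Expression (11.29), can be sketched in plain Python for the case of two variables. This is neither the BACON algorithm itself nor the bacon command: it only computes the distance of each observation to the general mean. The data are hypothetical (income, age) pairs invented for illustration, and the function name mahalanobis_to_mean is an assumption.

```python
from math import sqrt

# Hypothetical (income, age) pairs, invented for illustration only;
# they are not taken from the file Bacon.dta.
data = [
    (2500.0, 31.0),
    (3100.0, 35.0),
    (2800.0, 29.0),
    (4000.0, 42.0),
    (3600.0, 38.0),
    (9000.0, 24.0),
]


def mahalanobis_to_mean(rows):
    """Mahalanobis distance of each two-variable observation to the sample mean."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    # Sample covariance matrix S (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    # Closed-form inverse of the 2 x 2 matrix S.
    det = sxx * syy - sxy ** 2
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    distances = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        distances.append(sqrt(dx * ixx * dx + 2 * dx * ixy * dy + dy * iyy * dy))
    return distances


distances = mahalanobis_to_mean(data)
for row, d in zip(data, distances):
    print(row, round(d, 3))
```

Because the distance is scaled by the inverse covariance matrix, the result does not depend on the units in which income and age are measured. In this hypothetical sample, the last observation (high income at a low age) is the one furthest from the general mean.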

FIG. 11.76 Description of the Bacon.dta dataset.


FIG. 11.77 Applying the command bacon in Stata.

FIG. 11.78 Observations classified as multivariate outliers.

Fig. 11.77 shows that four observations are classified as multivariate outliers. Besides, by default Stata uses the 85% percentile of the $\chi^2$ distribution as the separation threshold between the observations considered outliers and nonoutliers, as previously discussed and highlighted by Weber (2010). This is the reason why the term BACON outliers (p = 0.15) appears in the outputs. This value may be changed according to a criterion established by the researcher. However, we would like to emphasize that the default percentile(0.15) is very adequate for obtaining consistent answers. The following command, which generates the output seen in Fig. 11.78, lets us investigate which observations are classified as outliers:

    list if outbacon == 1

Even though we are working with three variables, we can construct two-dimensional scatter plots, which allow us to identify the positions of the observations considered outliers in relation to the others. In order to do that, let’s type the following commands, which generate the mentioned charts for each pair of variables:

    scatter income age, ml(outbacon) note("0 = not outlier, 1 = outlier")
    scatter income tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")
    scatter age tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")

These three charts can be seen in Figs. 11.79, 11.80, and 11.81.

FIG. 11.79 Variables income and age: relative position of the observations.

[Scatter plot. Vertical axis: Median household income (US\$), from 0 to 40,000. Horizontal axis: Age (years), from 20 to 60. Note: 0 = not outlier, 1 = outlier.]


FIG. 11.80 Variables income and tgrad: relative position of the observations.

[Scatter plot. Vertical axis: Median household income (US\$), from 0 to 40,000. Horizontal axis: Time since graduation (years), from 0 to 20. Note: 0 = not outlier, 1 = outlier.]

FIG. 11.81 Variables age and tgrad: relative position of the observations.

[Scatter plot. Vertical axis: Age (years), from 20 to 60. Horizontal axis: Time since graduation (years), from 0 to 20. Note: 0 = not outlier, 1 = outlier.]

Despite the fact that outliers have been identified, it is important to mention that the decision about what to do with these observations is entirely up to researchers, who must make it based on their research objectives. As already discussed throughout this chapter, excluding these outliers from the dataset may be an option. However, studying why they became multivariately discrepant can also result in many interesting research outcomes.

Chapter 12

Principal Component Factor Analysis

Love and truth are so intertwined that it is practically impossible to disentangle and separate them. They are like the two sides of a coin.
Mahatma Gandhi

12.1 INTRODUCTION Exploratory factor analysis techniques are very useful when we intend to work with variables that have, between themselves, relatively high correlation coefficients and one wishes to establish new variables that capture the joint behavior of the original variables. Each one of these new variables is called factor, which can be understood as the cluster of variables from criteria previously established. Therefore, factor analysis is a multivariate technique that tries to identify a relatively small number of factors that represent the joint behavior of interdependent original variables. Thus, while cluster analysis, studied in the previous chapter, uses distance or similarity measures to group observations and form clusters, factor analysis uses correlation coefficients to group variables and generate factors. Among the methods used to determine factors, the one known as principal components is, without a doubt, the most widely used in factor analysis, because it is based on the assumption that uncorrelated factors can be extracted from linear combinations of the original variables. Consequently, from a set of original variables correlated to one another, the principal component factor analysis allows another set of variables (factors) resulting from the linear combination of the first set to be determined. Even though, as we know, the term confirmatory factor analysis often appears in the existing literature, factor analysis is essentially an exploratory multivariate technique, or an interdependence, since it does not have a predictive nature for other observations not initially present in the sample, and the inclusion of new observations in the dataset makes it necessary to reapply the technique, so that more accurate and updated new factors can be generated. 
According to Reis (2001), factor analysis can be used with the main exploratory goal of reducing the data dimension, aiming at creating factors from the original variables, as well as with the objective of confirming an initial hypothesis that the data may be reduced to a certain factor, or a certain dimension, which was previously established. Regardless of the objective, factor analysis will continue to be exploratory. If researchers aim to use a technique to, in fact, confirm the relationships found in the factor analysis, they can use structural equation modeling, for instance. The principal component factor analysis has four main objectives: (1) to identify correlations between the original variables to create factors that represent the linear combination of those variables (structural reduction); (2) to verify the validity of previously established constructs, bearing in mind the allocation of the original variables to each factor; (3) to prepare rankings by generating performance indexes from the factors; and (4) to extract orthogonal factors for future use in confirmatory multivariate techniques that need the absence of multicollinearity. Imagine that a researcher is interested in studying the interdependence between several quantitative variables that translate the socioeconomic behavior of a nation’s municipalities. In this situation, factors that may possibly explain the behavior of the original variables can be determined, and, in this regard, the factor analysis is used to reduce the data structurally and, later on, to create a socioeconomic index that captures the joint behavior of these variables. From this index, we may even propose a performance ranking of the municipalities, and the factors themselves can be used in a possible cluster analysis. In another situation, factors extracted from the original variables can be used as explanatory variables of another variable (dependent), not initially considered in the analysis. 
For example, factors obtained from the joint behavior of grades in certain 12th grade subjects can be used as explanatory variables of students’ general classification in the college entrance exams, or of whether students passed the exams or not. In these situations, note that the factors (orthogonal to one another) are used, instead of the original variables themselves, as explanatory variables of a certain phenomenon in confirmatory multivariate models, such as multiple or logistic regression, in order to eliminate possible multicollinearity problems. Nevertheless, it is important to highlight that this procedure only makes sense when we intend to elaborate a diagnostic regarding the dependent variable’s behavior, without aiming at forecasts for other observations not initially present in the sample. Since new observations do not have the corresponding values of the factors generated, obtaining these values is only possible if we include such observations in a new factor analysis.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00012-4 © 2019 Elsevier Inc. All rights reserved.
PART V Multivariate Exploratory Data Analysis

In a third situation, imagine that a retailer is interested in assessing their clients’ level of satisfaction by applying a questionnaire in which the questions have been previously classified into certain groups. For instance, questions A, B, and C were classified into the group quality of services rendered; questions D and E, into the group positive perception of prices; and questions F, G, H, and I, into the group variety of goods. After applying the questionnaire to a significant number of customers, with these nine variables collected as scores that vary from 0 to 10, the retailer decides to carry out a principal component factor analysis to verify if, in fact, the combination of variables reflects the construct previously established. If this occurs, the factor analysis will have been used to validate the construct, presenting a confirmatory objective.

In all of these situations, we can see that the original variables from which the factors will be extracted are quantitative, because a factor analysis begins with the study of the behavior of Pearson’s correlation coefficients between the variables.
Nonetheless, it is common for researchers to apply an incorrect arbitrary weighting procedure to qualitative variables, such as variables on the Likert scale, and then to apply a factor analysis. This is a serious error! There are exploratory techniques meant exclusively for studying the behavior of qualitative variables, such as correspondence analysis and homogeneity analysis, and a factor analysis is definitely not meant for such purpose, as discussed by Fávero and Belfiore (2017).

In a historical context, the development of factor analysis is partly due to Pearson’s (1896) and Spearman’s (1904) pioneering work. While Karl Pearson developed a rigorous mathematical treatment regarding what we traditionally call correlation at the beginning of the 20th century, Charles Edward Spearman published highly original work in which the interrelationships between students’ performance in several subjects, such as French, English, Mathematics, and Music, were evaluated. Since the grades in these subjects showed strong correlation, Spearman proposed that scores resulting from apparently incompatible tests shared a single general factor, and that students who got good grades had a more developed psychological or intelligence component. Generally speaking, Spearman excelled in applying mathematical methods and correlation studies to the analysis of the human mind. Decades later, in 1933, Harold Hotelling, a statistician, mathematician, and influential economic theorist, gave the name Principal Component Analysis to the analysis that determines components by maximizing the variance of the original data.
Also in the first half of the 20th century, psychologist Louis Leon Thurstone, from an investigation of Spearman’s ideas and based on the application of certain psychological tests whose results were submitted to a factor analysis, identified people’s seven primary mental abilities: spatial visualization, verbal meaning, verbal fluency, perceptual speed, numerical ability, reasoning, and rote memory. In psychology, the term mental factors is even used for variables that have greater influence over a certain behavior. Currently, factor analysis is used in several fields of knowledge, such as marketing, economics, strategy, finance, accounting, actuarial science, engineering, logistics, psychology, medicine, ecology, and biostatistics, among others.

The principal component factor analysis must be defined based on the underlying theory and on the researcher’s experience, so that the technique can be applied correctly and the results obtained can be analyzed. In this chapter, we will discuss the principal component factor analysis technique, with the following objectives: (1) to introduce the concepts; (2) to present the step-by-step modeling in an algebraic and practical way; (3) to interpret the results obtained; and (4) to show the application of the technique in SPSS and Stata. Following the logic proposed in the book, we first develop the algebraic solution of an example linked to the presentation of the concepts. Only after introducing these concepts do we present and discuss the procedures for running the technique in SPSS and Stata.

12.2 PRINCIPAL COMPONENT FACTOR ANALYSIS

Factor analysis comprises many procedures, with different methods for determining (extracting) factors from Pearson’s correlation matrix. The most frequently used method, which is adopted in this chapter for extracting factors, is known as principal components, in which the consequent structural reduction is also called the Karhunen-Loève transformation.


TABLE 12.1 General Dataset Model for Developing a Factor Analysis

Observation i    X1i    X2i    ⋯    Xki
1                X11    X21    ⋯    Xk1
2                X12    X22    ⋯    Xk2
3                X13    X23    ⋯    Xk3
⋮                ⋮      ⋮           ⋮
n                X1n    X2n    ⋯    Xkn

In the following sections, we will discuss the theoretical development of the technique, as well as a practical example. While the main concepts will be presented in Sections 12.2.1–12.2.5, Section 12.2.6 is meant for solving a practical example algebraically, from a dataset.

12.2.1 Pearson’s Linear Correlation and the Concept of Factor

Let’s imagine a dataset that has n observations and, for each observation i (i = 1, …, n), values corresponding to each one of the k metric variables X, as shown in Table 12.1. From the dataset, and given our intention of extracting factors from the k variables X, we must define correlation matrix r, which displays the values of Pearson’s linear correlation between each pair of variables, as shown in Expression (12.1).

$$
\mathbf{r} =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1k} \\
r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix}
\tag{12.1}
$$

Correlation matrix r is symmetrical in relation to the main diagonal, which, obviously, shows values equal to 1. For example, for variables X1 and X2, Pearson’s correlation r12 can be calculated by using Expression (12.2).

$$
r_{12} = \frac{\sum_{i=1}^{n} \left( X_{1i} - \bar{X}_1 \right) \cdot \left( X_{2i} - \bar{X}_2 \right)}{\sqrt{\sum_{i=1}^{n} \left( X_{1i} - \bar{X}_1 \right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left( X_{2i} - \bar{X}_2 \right)^2}}
\tag{12.2}
$$

where $\bar{X}_1$ and $\bar{X}_2$ represent the means of variables X1 and X2, respectively, and this expression is analogous to Expression (4.11), defined in Chapter 4.

Thus, since Pearson’s correlation is a measure of the level of linear relationship between two metric variables, which may vary between −1 and 1, a value closer to one of these extreme values indicates the existence of a strong linear relationship between the two variables under analysis, which, therefore, may significantly contribute to the extraction of a single factor. On the other hand, a Pearson correlation that is very close to 0 indicates that the linear relationship between the two variables is practically nonexistent. Therefore, different factors can be extracted.

Let’s imagine a hypothetical situation in which a certain dataset only has three variables (k = 3). A three-dimensional scatter plot can be constructed from the values of each variable for each observation. The plot can be seen in Fig. 12.1.

Based only on the visual analysis of the chart in Fig. 12.1, it is difficult to assess the behavior of the linear relationships between each pair of variables. Thus, Fig. 12.2 shows the projection of the points that correspond to each observation in each one of the planes formed by the pairs of variables, highlighting, in the dotted line, the adjustment that represents the linear relationship between the respective variables. While Fig. 12.2A shows that there is a significant linear relationship between variables X1 and X2 (a very high Pearson correlation), Fig. 12.2B and C make it very clear that there is no linear relationship between X3 and these variables. Fig. 12.3 displays these projections in a three-dimensional plot, with the respective linear adjustments in each plane (the dotted lines). Thus, in this hypothetical example, while variables X1 and X2 may be represented by a single factor in a very significant way, which we will call F1, variable X3 may be represented by another factor, F2, orthogonal to F1. Fig. 12.4 illustrates the extraction of these new factors in a three-dimensional way.
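As a minimal sketch of Expressions (12.1) and (12.2), the Python snippet below builds the correlation matrix from deviations around each variable’s mean. The function name `pearson_matrix` and the simulated data (X2 built to follow X1 closely, X3 generated independently) are assumptions made only for illustration; they mimic the hypothetical three-variable situation, not any dataset from the book.

```python
import numpy as np

def pearson_matrix(X):
    """Correlation matrix of an (n, k) data array, following Expression (12.2)."""
    X = np.asarray(X, dtype=float)
    D = X - X.mean(axis=0)            # deviations from each variable's mean
    cross = D.T @ D                   # sums of cross-products of deviations
    norms = np.sqrt(np.diag(cross))   # square roots of the sums of squares
    return cross / np.outer(norms, norms)

# Hypothetical data: X2 closely follows X1, X3 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

r = pearson_matrix(X)
print(np.round(r, 3))
```

The result should agree with `np.corrcoef(X, rowvar=False)`, which can serve as a cross-check of the manual calculation.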

FIG. 12.1 Three-dimensional scatter plot for a hypothetical situation with three variables (axes X1, X2, and X3).

FIG. 12.2 Projection of the points in each plane formed by a certain pair of variables. (A) Relationship between X1 and X2: positive and very high Pearson correlation. (B) Relationship between X1 and X3: Pearson correlation very close to 0. (C) Relationship between X2 and X3: Pearson correlation very close to 0.

So, factors can be understood as representations of latent dimensions that explain the behavior of the original variables. Having presented these initial concepts, it is important to emphasize that in many cases researchers may choose to not extract a factor represented in a considerable way by only one variable (in this case, factor F2), and what will define the extraction of each one of the factors is the calculation of the eigenvalues from correlation matrix r, as we will study in Section 12.2.3. Nevertheless, before that, it will be necessary to check the overall adequacy of the factor analysis, which will be discussed in the following section.

FIG. 12.3 Projection of the points in a three-dimensional plot with linear adjustments per plane (axes X1, X2, and X3).

FIG. 12.4 Factor extraction (axes X1, X2, and X3; factors F1 and F2).

12.2.2 Overall Adequacy of the Factor Analysis: Kaiser-Meyer-Olkin Statistic and Bartlett’s Test of Sphericity

An adequate extraction of factors from the original variables requires correlation matrix r to have relatively high and statistically significant values. As discussed by Hair et al. (2009), even though visually analyzing correlation matrix r does not reveal if the factor extraction will in fact be adequate, a significant number of values less than 0.30 represents a preliminary indication that the factor analysis may not be adequate. In order to verify the overall adequacy of the factor extraction itself, we must use the Kaiser-Meyer-Olkin statistic (KMO) and Bartlett’s test of sphericity.

The KMO statistic gives us the proportion of variance considered common to all the variables in the sample under analysis, that is, which can be attributed to the existence of a common factor. This statistic varies from 0 to 1 and, while values closer to 1 indicate that the variables share a very high proportion of variance (high Pearson correlations), values closer to 0 are a result of low Pearson correlations between the variables, which may indicate that the factor analysis will not be adequate. The KMO statistic, presented initially by Kaiser (1970), can be calculated through Expression (12.3).

$$
\mathrm{KMO} = \frac{\sum_{l=1}^{k} \sum_{c=1}^{k} r_{lc}^2}{\sum_{l=1}^{k} \sum_{c=1}^{k} r_{lc}^2 + \sum_{l=1}^{k} \sum_{c=1}^{k} \varphi_{lc}^2}, \quad l \neq c
\tag{12.3}
$$

where l and c represent the rows and columns of correlation matrix r, respectively, and the terms φ represent the partial correlation coefficients between two variables. While Pearson’s correlation coefficients r are also called zero-order correlation coefficients, partial correlation coefficients φ are also known as higher-order correlation coefficients. For three variables, they are also called first-order correlation coefficients; for four variables, second-order correlation coefficients; and so on.

Let’s imagine a hypothetical situation in which a certain dataset shows three variables once again (k = 3). Does r12 in fact reflect the level of linear relationship between X1 and X2 if variable X3 is related to the other two? In this situation, r12 may not represent the true level of linear relationship between X1 and X2 when X3 is present, which may provide a false impression regarding the nature of the relationship between the first two. Thus, partial correlation coefficients may contribute to the analysis, since, according to Gujarati and Porter (2008), they are used when researchers wish to find out the correlation between two variables, either by controlling or ignoring the effects of other variables present in the dataset. For our hypothetical situation, it is the correlation coefficient between X1 and X2 net of X3’s influence over both, if any.

Hence, for three variables X1, X2, and X3, we can define the first-order correlation coefficients the following way:

$$
\varphi_{12,3} = \frac{r_{12} - r_{13} \cdot r_{23}}{\sqrt{\left( 1 - r_{13}^2 \right) \cdot \left( 1 - r_{23}^2 \right)}}
\tag{12.4}
$$

where φ12,3 represents the correlation between X1 and X2, maintaining X3 constant,

$$
\varphi_{13,2} = \frac{r_{13} - r_{12} \cdot r_{23}}{\sqrt{\left( 1 - r_{12}^2 \right) \cdot \left( 1 - r_{23}^2 \right)}}
\tag{12.5}
$$

where φ13,2 represents the correlation between X1 and X3, maintaining X2 constant, and

$$
\varphi_{23,1} = \frac{r_{23} - r_{12} \cdot r_{13}}{\sqrt{\left( 1 - r_{12}^2 \right) \cdot \left( 1 - r_{13}^2 \right)}}
\tag{12.6}
$$

where φ23,1 represents the correlation between X2 and X3, maintaining X1 constant. In general, a first-order correlation coefficient can be obtained through the following expression:

$$
\varphi_{ab,c} = \frac{r_{ab} - r_{ac} \cdot r_{bc}}{\sqrt{\left( 1 - r_{ac}^2 \right) \cdot \left( 1 - r_{bc}^2 \right)}}
\tag{12.7}
$$

where a, b, and c can assume values 1, 2, or 3, corresponding to the three variables under analysis.

For a case in which there are four variables in the analysis, the general expression of a certain partial correlation coefficient (second-order correlation coefficient) is given by:

$$
\varphi_{ab,cd} = \frac{\varphi_{ab,c} - \varphi_{ad,c} \cdot \varphi_{bd,c}}{\sqrt{\left( 1 - \varphi_{ad,c}^2 \right) \cdot \left( 1 - \varphi_{bd,c}^2 \right)}}
\tag{12.8}
$$

where φab,cd represents the correlation between Xa and Xb, maintaining Xc and Xd constant, bearing in mind that a, b, c, and d may take on values 1, 2, 3, or 4, which correspond to the four variables under analysis.

Obtaining higher-order correlation coefficients, in which five or more variables are considered in the analysis, should always be done based on the determination of lower-order partial correlation coefficients. In Section 12.2.6, we will propose a practical example by using four variables, in which the algebraic solution of the KMO statistic will be obtained through Expression (12.8).

It is important to highlight that, even if Pearson’s correlation coefficient between two variables is 0, the partial correlation coefficient between them may not be equal to 0, depending on the values of Pearson’s correlation coefficients between each one of these variables and the others present in the dataset.

In order for a factor analysis to be considered adequate, the partial correlation coefficients between the variables must be low. This fact denotes that the variables share a high proportion of variance, and disregarding one or more of them in the analysis may hamper the quality of the factor extraction. Therefore, according to a widely accepted criterion found in the existing literature, Table 12.2 gives us an indication of the relationship between the KMO statistic and the overall adequacy of the factor analysis.
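The first-order formula in Expression (12.7) can be sketched in a few lines of Python. The two correlation matrices below are hypothetical values chosen only for illustration, and the function names are assumptions. The second helper does not use the recursive route of Expression (12.8); it relies instead on a standard identity that obtains the partial correlation of each pair, given all remaining variables, from the inverse of r. For three variables, both routes should give the same first-order coefficients.

```python
import numpy as np

def first_order_partial(r, a, b, c):
    """Expression (12.7): correlation between a and b, holding c constant."""
    num = r[a, b] - r[a, c] * r[b, c]
    den = np.sqrt((1 - r[a, c] ** 2) * (1 - r[b, c] ** 2))
    return num / den

def partial_matrix(r):
    """Partial correlation of each pair given all other variables (inverse-matrix identity)."""
    p = np.linalg.inv(r)
    d = np.sqrt(np.diag(p))
    phi = -p / np.outer(d, d)
    np.fill_diagonal(phi, 1.0)
    return phi

# Hypothetical correlation matrix for X1, X2, X3 (indices 0, 1, 2)
r = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
phi_12_3 = first_order_partial(r, 0, 1, 2)

# Hypothetical case with r12 = 0: the partial correlation is still nonzero
r0 = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.5],
               [0.5, 0.5, 1.0]])
phi_zero = first_order_partial(r0, 0, 1, 2)

print(phi_12_3, phi_zero)
```

The second matrix illustrates the remark above: a Pearson correlation of 0 between X1 and X2 does not force their partial correlation to be 0 once X3 is held constant.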
TABLE 12.2 Relationship Between the KMO Statistic and the Overall Adequacy of the Factor Analysis

KMO Statistic            Overall Adequacy of the Factor Analysis
Between 1.00 and 0.90    Marvelous
Between 0.90 and 0.80    Meritorious
Between 0.80 and 0.70    Middling
Between 0.70 and 0.60    Mediocre
Between 0.60 and 0.50    Miserable
Less than 0.50           Unacceptable

On the other hand, Bartlett’s test of sphericity (Bartlett, 1954) consists in comparing correlation matrix r to an identity matrix I of the same dimension. If the differences between the corresponding values outside the main diagonal of each matrix are not statistically different from 0, at a certain significance level, we may consider that the factor extraction will not be adequate. In other words, in this case, Pearson’s correlations between each pair of variables are statistically equal to 0, which makes any attempt at performing a factor extraction from the original variables unfeasible. So, we can define the null and alternative hypotheses of Bartlett’s test of sphericity the following way:

$$
H_0: \mathbf{r} =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1k} \\
r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix}
= \mathbf{I} =
\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
$$

$$
H_1: \mathbf{r} =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1k} \\
r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix}
\neq \mathbf{I} =
\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
$$

The statistic corresponding to Bartlett’s test of sphericity is a χ² statistic, which has the following expression:

$$
\chi^2_{\text{Bartlett}} = - \left[ (n - 1) - \frac{2k + 5}{6} \right] \cdot \ln |D|
\tag{12.9}
$$

with $\frac{k \cdot (k - 1)}{2}$ degrees of freedom. We know that n is the sample size and k is the number of variables. In addition, D represents the determinant of correlation matrix r.

Thus, for a certain number of degrees of freedom and a certain significance level, Bartlett’s test of sphericity allows us to check if the total value of the χ²Bartlett statistic is higher than the statistic’s critical value. If this is true, we may state that Pearson’s correlations between the pairs of variables are statistically different from 0 and that, therefore, factors can be extracted from the original variables and the factor analysis is adequate. When we develop a practical example in Section 12.2.6, we will also discuss the calculations of the χ²Bartlett statistic and the result of Bartlett’s test of sphericity.

It is important to emphasize that we should always favor Bartlett’s test of sphericity over the KMO statistic to make a decision about the factor analysis’s overall adequacy, given that the former is a test with a certain significance level, whereas the latter is only a coefficient (a statistic) calculated without any set distribution of probabilities or hypotheses that allow us to evaluate the corresponding significance level to make a decision.

In addition, it is important to mention that for only two original variables the KMO statistic will always be equal to 0.50. Conversely, the χ²Bartlett statistic may indicate if the null hypothesis of the test of sphericity was rejected or not, depending on the magnitude of Pearson’s correlation between both variables. Thus, while the KMO statistic will be 0.50 in these situations, Bartlett’s test of sphericity will allow researchers to decide whether to extract one factor from the two original variables or not. In contrast, for three original variables, it is very common for researchers to extract two factors when Bartlett’s test of sphericity is statistically significant but the KMO statistic is less than 0.50.
These two situations emphasize even more the greater relevance of Bartlett’s test of sphericity in relation to the KMO statistic in the decision-making process.

Finally, we must mention that the recommendation to study the magnitude of Cronbach’s alpha before studying the overall adequacy of the factor analysis is commonly found in the existing literature, so that the reliability with which a factor can be extracted from original variables can be evaluated. We would like to highlight that Cronbach’s alpha only offers researchers indications of the internal consistency of the variables in the dataset so that a single factor can be extracted. Therefore, determining it is not a mandatory requisite for developing the factor analysis, since this technique allows more than one factor to be extracted. Nevertheless, for pedagogical purposes, we will discuss the main concepts of Cronbach’s alpha in the Appendix of this chapter, with its algebraic determination and corresponding applications in SPSS and Stata software.

Having discussed these concepts and verified the overall adequacy of the factor analysis, we can now move on to the definition of the factors.
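The two adequacy measures of this section, Expressions (12.3) and (12.9), can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used by SPSS or Stata: the function names, the correlation matrix, and the sample size n = 100 are hypothetical; the partial correlations are obtained from the inverse of r (a standard identity) instead of the recursive Expression (12.8); and the 5% critical value for 3 degrees of freedom (7.815) is hard-coded from a chi-square table.

```python
import numpy as np

def kmo(r):
    """KMO statistic, Expression (12.3), with partial correlations from the inverse of r."""
    r = np.asarray(r, dtype=float)
    p = np.linalg.inv(r)
    d = np.sqrt(np.diag(p))
    phi = -p / np.outer(d, d)               # partial correlations given all other variables
    off = ~np.eye(r.shape[0], dtype=bool)   # mask selecting the terms with l != c
    r2 = np.sum(r[off] ** 2)
    phi2 = np.sum(phi[off] ** 2)
    return r2 / (r2 + phi2)

def bartlett_sphericity(r, n):
    """Chi-square statistic of Expression (12.9) and its degrees of freedom."""
    r = np.asarray(r, dtype=float)
    k = r.shape[0]
    _, log_abs_det = np.linalg.slogdet(r)   # ln|D|, computed in a numerically stable way
    chi2 = -((n - 1) - (2 * k + 5) / 6) * log_abs_det
    df = k * (k - 1) // 2
    return chi2, df

# Hypothetical correlation matrix and sample size
r = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
n = 100

kmo_value = kmo(r)
chi2, df = bartlett_sphericity(r, n)
critical_5pct_df3 = 7.815                   # tabulated chi-square critical value, 3 df, 5% level
reject_h0 = chi2 > critical_5pct_df3

# With only two variables, the KMO statistic is 0.50 whatever the correlation
kmo_two = kmo(np.array([[1.0, 0.7],
                        [0.7, 1.0]]))
print(kmo_value, chi2, df, reject_h0, kmo_two)
```

The two-variable case reproduces the remark above: with no third variable to control for, the partial correlation equals the Pearson correlation, so the numerator is exactly half of the denominator.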

12.2.3 Defining the Principal Component Factors: Determining the Eigenvalues and Eigenvectors of Correlation Matrix r and Calculating the Factor Scores

Since a factor represents the linear combination of the original variables, for k variables, we can define a maximum number of k factors (F1, F2, …, Fk), analogous to the maximum number of clusters that can be defined from a sample with n observations, as we discussed in the previous chapter, since a factor can also be understood as the result of the clustering of variables. Therefore, for k variables, we have:

$$
\begin{aligned}
F_{1i} &= s_{11} \cdot X_{1i} + s_{21} \cdot X_{2i} + \cdots + s_{k1} \cdot X_{ki} \\
F_{2i} &= s_{12} \cdot X_{1i} + s_{22} \cdot X_{2i} + \cdots + s_{k2} \cdot X_{ki} \\
&\;\;\vdots \\
F_{ki} &= s_{1k} \cdot X_{1i} + s_{2k} \cdot X_{2i} + \cdots + s_{kk} \cdot X_{ki}
\end{aligned}
\tag{12.10}
$$

where the terms s are known as factor scores, which represent the parameters of a linear model that relates a certain factor to the original variables. Calculating the factor scores is essential in the context of the factor analysis technique and is elaborated by determining the eigenvalues and eigenvectors of correlation matrix r. In Expression (12.11), we once again show correlation matrix r, which has already been presented in Expression (12.1).

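$$
\mathbf{r} =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1k} \\
r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix}
\tag{12.11}
$$

This correlation matrix, with dimensions k × k, shows k eigenvalues λ² ($\lambda_1^2 \geq \lambda_2^2 \geq \dots \geq \lambda_k^2$), which can be obtained from solving the following equation:

$$
\det \left( \lambda^2 \cdot \mathbf{I} - \mathbf{r} \right) = 0
\tag{12.12}
$$

where I is the identity matrix, also with dimensions k × k. Since a certain factor represents the result of the clustering of variables, it is important to highlight that:

$$
\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_k^2 = k
\tag{12.13}
$$

Expression (12.12) can be rewritten as follows:

$$
\begin{vmatrix}
\lambda^2 - 1 & -r_{12} & \cdots & -r_{1k} \\
-r_{21} & \lambda^2 - 1 & \cdots & -r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
-r_{k1} & -r_{k2} & \cdots & \lambda^2 - 1
\end{vmatrix}
= 0
\tag{12.14}
$$

from which we can define the eigenvalue matrix Λ² the following way:

$$
\mathbf{\Lambda}^2 =
\begin{pmatrix}
\lambda_1^2 & 0 & \cdots & 0 \\
0 & \lambda_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_k^2
\end{pmatrix}
\tag{12.15}
$$

In order to define the eigenvectors of matrix r based on the eigenvalues, we must solve the following equation system for each eigenvalue λ² ($\lambda_1^2, \lambda_2^2, \dots, \lambda_k^2$):

• Determining eigenvectors v11, v21, …, vk1 from the first eigenvalue ($\lambda_1^2$):

$$
\begin{pmatrix}
\lambda_1^2 - 1 & -r_{12} & \cdots & -r_{1k} \\
-r_{21} & \lambda_1^2 - 1 & \cdots & -r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
-r_{k1} & -r_{k2} & \cdots & \lambda_1^2 - 1
\end{pmatrix}
\cdot
\begin{pmatrix} v_{11} \\ v_{21} \\ \vdots \\ v_{k1} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\tag{12.16}
$$

from where we obtain:

$$
\begin{cases}
\left( \lambda_1^2 - 1 \right) \cdot v_{11} - r_{12} \cdot v_{21} - \dots - r_{1k} \cdot v_{k1} = 0 \\
-r_{21} \cdot v_{11} + \left( \lambda_1^2 - 1 \right) \cdot v_{21} - \dots - r_{2k} \cdot v_{k1} = 0 \\
\quad \vdots \\
-r_{k1} \cdot v_{11} - r_{k2} \cdot v_{21} - \dots + \left( \lambda_1^2 - 1 \right) \cdot v_{k1} = 0
\end{cases}
\tag{12.17}
$$

• Determining eigenvectors v12, v22, …, vk2 from the second eigenvalue ($\lambda_2^2$):

$$
\begin{pmatrix}
\lambda_2^2 - 1 & -r_{12} & \cdots & -r_{1k} \\
-r_{21} & \lambda_2^2 - 1 & \cdots & -r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
-r_{k1} & -r_{k2} & \cdots & \lambda_2^2 - 1
\end{pmatrix}
\cdot
\begin{pmatrix} v_{12} \\ v_{22} \\ \vdots \\ v_{k2} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\tag{12.18}
$$

from where we obtain:

$$
\begin{cases}
\left( \lambda_2^2 - 1 \right) \cdot v_{12} - r_{12} \cdot v_{22} - \dots - r_{1k} \cdot v_{k2} = 0 \\
-r_{21} \cdot v_{12} + \left( \lambda_2^2 - 1 \right) \cdot v_{22} - \dots - r_{2k} \cdot v_{k2} = 0 \\
\quad \vdots \\
-r_{k1} \cdot v_{12} - r_{k2} \cdot v_{22} - \dots + \left( \lambda_2^2 - 1 \right) \cdot v_{k2} = 0
\end{cases}
\tag{12.19}
$$

• Determining eigenvectors v1k, v2k, …, vkk from the kth eigenvalue ($\lambda_k^2$):

$$
\begin{pmatrix}
\lambda_k^2 - 1 & -r_{12} & \cdots & -r_{1k} \\
-r_{21} & \lambda_k^2 - 1 & \cdots & -r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
-r_{k1} & -r_{k2} & \cdots & \lambda_k^2 - 1
\end{pmatrix}
\cdot
\begin{pmatrix} v_{1k} \\ v_{2k} \\ \vdots \\ v_{kk} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\tag{12.20}
$$

from where we obtain:

$$
\begin{cases}
\left( \lambda_k^2 - 1 \right) \cdot v_{1k} - r_{12} \cdot v_{2k} - \dots - r_{1k} \cdot v_{kk} = 0 \\
-r_{21} \cdot v_{1k} + \left( \lambda_k^2 - 1 \right) \cdot v_{2k} - \dots - r_{2k} \cdot v_{kk} = 0 \\
\quad \vdots \\
-r_{k1} \cdot v_{1k} - r_{k2} \cdot v_{2k} - \dots + \left( \lambda_k^2 - 1 \right) \cdot v_{kk} = 0
\end{cases}
\tag{12.21}
$$

Thus, we can calculate the factor scores of each factor by determining the eigenvalues and eigenvectors of correlation matrix r. The factor scores vectors can be defined as follows:

• Factor scores of the first factor:

$$
\mathbf{S}_1 =
\begin{pmatrix} s_{11} \\ s_{21} \\ \vdots \\ s_{k1} \end{pmatrix}
=
\begin{pmatrix}
\dfrac{v_{11}}{\sqrt{\lambda_1^2}} \\
\dfrac{v_{21}}{\sqrt{\lambda_1^2}} \\
\vdots \\
\dfrac{v_{k1}}{\sqrt{\lambda_1^2}}
\end{pmatrix}
\tag{12.22}
$$

• Factor scores of the second factor:

$$
\mathbf{S}_2 =
\begin{pmatrix} s_{12} \\ s_{22} \\ \vdots \\ s_{k2} \end{pmatrix}
=
\begin{pmatrix}
\dfrac{v_{12}}{\sqrt{\lambda_2^2}} \\
\dfrac{v_{22}}{\sqrt{\lambda_2^2}} \\
\vdots \\
\dfrac{v_{k2}}{\sqrt{\lambda_2^2}}
\end{pmatrix}
\tag{12.23}
$$

• Factor scores of the kth factor:

$$
\mathbf{S}_k =
\begin{pmatrix} s_{1k} \\ s_{2k} \\ \vdots \\ s_{kk} \end{pmatrix}
=
\begin{pmatrix}
\dfrac{v_{1k}}{\sqrt{\lambda_k^2}} \\
\dfrac{v_{2k}}{\sqrt{\lambda_k^2}} \\
\vdots \\
\dfrac{v_{kk}}{\sqrt{\lambda_k^2}}
\end{pmatrix}
\tag{12.24}
$$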
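A minimal numerical sketch of Expressions (12.12) to (12.24) is shown below. Rather than solving the characteristic equation by hand, it relies on NumPy’s `np.linalg.eigh`, which returns the eigenvalues and unit-length eigenvectors of a symmetric matrix. The correlation matrix is a hypothetical example, and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical correlation matrix for k = 3 variables
r = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, V = np.linalg.eigh(r)
order = np.argsort(eigenvalues)[::-1]     # reorder so that the largest eigenvalue comes first
eigenvalues = eigenvalues[order]
V = V[:, order]                           # column j is the eigenvector of the jth eigenvalue

# Factor scores: each eigenvector divided by the square root of its eigenvalue
S = V / np.sqrt(eigenvalues)              # column j corresponds to S_j in (12.22)-(12.24)

print(np.round(eigenvalues, 4), round(float(eigenvalues.sum()), 4))
print(np.round(S, 4))
```

Note that the sign of each eigenvector is arbitrary, so a column of `V` (and of `S`) may come out multiplied by −1 relative to the output of another package without changing the analysis.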

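Since the factor scores of each factor are standardized by the square roots of the respective eigenvalues, the factors of the set of equations presented in Expression (12.10) must be obtained by multiplying each factor score by the corresponding original variable, standardized by using the Z-scores procedure. Thus, we can obtain each one of the factors based on the following equations:

$$
\begin{aligned}
F_{1i} &= \frac{v_{11}}{\sqrt{\lambda_1^2}} \cdot ZX_{1i} + \frac{v_{21}}{\sqrt{\lambda_1^2}} \cdot ZX_{2i} + \cdots + \frac{v_{k1}}{\sqrt{\lambda_1^2}} \cdot ZX_{ki} \\
F_{2i} &= \frac{v_{12}}{\sqrt{\lambda_2^2}} \cdot ZX_{1i} + \frac{v_{22}}{\sqrt{\lambda_2^2}} \cdot ZX_{2i} + \cdots + \frac{v_{k2}}{\sqrt{\lambda_2^2}} \cdot ZX_{ki} \\
&\;\;\vdots \\
F_{ki} &= \frac{v_{1k}}{\sqrt{\lambda_k^2}} \cdot ZX_{1i} + \frac{v_{2k}}{\sqrt{\lambda_k^2}} \cdot ZX_{2i} + \cdots + \frac{v_{kk}}{\sqrt{\lambda_k^2}} \cdot ZX_{ki}
\end{aligned}
\tag{12.25}
$$

where ZXi represents the standardized value of each variable X for a certain observation i. It is important to emphasize that all the factors extracted show, between themselves, Pearson correlations equal to 0, that is, they are orthogonal to one another.

A more perceptive researcher will notice that the factor scores of each factor correspond exactly to the estimated parameters of a multiple linear regression model that has, as a dependent variable, the factor itself and, as explanatory variables, the standardized variables.

Mathematically, it is also possible to verify the existing relationship between the eigenvectors, correlation matrix r, and eigenvalue matrix Λ². Consequently, defining eigenvector matrix V as follows:

$$
\mathbf{V} =
\begin{pmatrix}
v_{11} & v_{12} & \cdots & v_{1k} \\
v_{21} & v_{22} & \cdots & v_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
v_{k1} & v_{k2} & \cdots & v_{kk}
\end{pmatrix}
\tag{12.26}
$$

we can prove that:

$$
\mathbf{V}' \cdot \mathbf{r} \cdot \mathbf{V} = \mathbf{\Lambda}^2
\tag{12.27}
$$

or:

$$
\begin{pmatrix}
v_{11} & v_{21} & \cdots & v_{k1} \\
v_{12} & v_{22} & \cdots & v_{k2} \\
\vdots & \vdots & \ddots & \vdots \\
v_{1k} & v_{2k} & \cdots & v_{kk}
\end{pmatrix}
\cdot
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1k} \\
r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix}
\cdot
\begin{pmatrix}
v_{11} & v_{12} & \cdots & v_{1k} \\
v_{21} & v_{22} & \cdots & v_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
v_{k1} & v_{k2} & \cdots & v_{kk}
\end{pmatrix}
=
\begin{pmatrix}
\lambda_1^2 & 0 & \cdots & 0 \\
0 & \lambda_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_k^2
\end{pmatrix}
\tag{12.28}
$$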
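The relationships in Expressions (12.25) to (12.28) can be checked numerically with the sketch below. The data are simulated (two variables driven by a common component plus an unrelated third one), and all names are assumptions for illustration, so the numbers have no connection to the example of Section 12.2.6. The last step fits a least-squares regression of the first factor on the standardized variables, which should recover the first column of factor scores.

```python
import numpy as np

# Simulated data: X1 and X2 share a common component, X3 is unrelated
rng = np.random.default_rng(42)
n = 500
common = rng.normal(size=n)
X = np.column_stack([
    common + 0.5 * rng.normal(size=n),
    common + 0.5 * rng.normal(size=n),
    rng.normal(size=n),
])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Z-scores of each variable
r = np.corrcoef(X, rowvar=False)

eigenvalues, V = np.linalg.eigh(r)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

S = V / np.sqrt(eigenvalues)       # factor scores, one column per factor
F = Z @ S                          # Expression (12.25): column j holds factor j for every observation

lambda_matrix = V.T @ r @ V                      # Expression (12.27)
factor_corr = np.corrcoef(F, rowvar=False)       # should be the identity matrix

# Regression of the first factor on the standardized variables (no intercept needed,
# since both sides have zero mean)
beta, *_ = np.linalg.lstsq(Z, F[:, 0], rcond=None)

print(np.round(lambda_matrix, 4))
print(np.round(factor_corr, 4))
print(np.round(beta, 4), np.round(S[:, 0], 4))
```

Dividing the eigenvectors by the square roots of the eigenvalues also gives every factor a sample variance of 1, which is why the factors can be compared directly when building rankings.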

In Section 12.2.6, we will discuss a practical example from which this relationship may be demonstrated.

While in Section 12.2.2 we discussed the factor analysis’s overall adequacy, in this section we discuss the procedures for carrying out the factor extraction, if the technique is considered adequate. Even knowing that the maximum number of factors is equal to k for k variables, it is essential for researchers to define, based on a certain criterion, the adequate number of factors that, in fact, represent the original variables. In our hypothetical example in Section 12.2.1, we saw that only two factors (F1 and F2) would be enough to represent the three original variables (X1, X2, and X3).

Although researchers are free to determine the number of factors to be extracted in a preliminary way, since they may wish, for instance, to verify the validity of a previously established construct (a procedure known as the a priori criterion), it is essential to carry out an analysis based on the magnitude of the eigenvalues calculated from correlation matrix r. The eigenvalues correspond to the proportion of variance shared by the original variables to form each factor, as we will discuss in Section 12.2.4. Since $\lambda_1^2 \geq \lambda_2^2 \geq \dots \geq \lambda_k^2$, and bearing in mind that factors F1, F2, …, Fk are obtained from the respective eigenvalues, factors extracted from smaller eigenvalues are formed from smaller proportions of variance shared by the original variables. Since a factor represents a certain cluster of variables, factors extracted from eigenvalues less than 1 may not be able to represent the behavior of even a single original variable (of course, there are exceptions to this rule, which occur when a certain eigenvalue is less than, but very close to, 1).
The criterion for choosing the number of factors, in which only the factors that correspond to eigenvalues greater than 1 are considered, is often used and known as the latent root criterion or Kaiser criterion. The factor extraction method presented in this chapter is known as principal components, and the first factor F1, formed by the highest proportion of variance shared by the original variables, is also called principal factor. This method is often mentioned in the existing literature and is used in practical applications whenever researchers wish to elaborate a structural reduction


of the data in order to create orthogonal factors, to define observation rankings by using the factors generated, and even to confirm the validity of previously established constructs. Other factor extraction methods, such as generalized least squares, unweighted least squares, maximum likelihood, alpha factoring, and image factoring, have different criteria and certain specificities and, even though they can also be found in the existing literature, they will not be discussed in this book.

Moreover, it is common to discuss the need to apply the factor analysis to variables that have a multivariate normal distribution, in order to show consistency when determining the factor scores. Nevertheless, it is important to emphasize that multivariate normality is a very rigid assumption, only necessary for a few factor extraction methods, such as the maximum likelihood method. Most factor extraction methods do not require the assumption of multivariate normality of the data and, as discussed by Gorsuch (1983), the principal component factor analysis seems to be, in practice, very robust against violations of normality.
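The latent root (Kaiser) criterion described above can be sketched as follows. The function name `kaiser_criterion` and the correlation matrix are hypothetical, chosen only for illustration. The function also reports each eigenvalue divided by k, that is, the share of the total variance associated with each factor, which follows from Expression (12.13).

```python
import numpy as np

def kaiser_criterion(r):
    """Eigenvalues of r in descending order and the number of them greater than 1."""
    eigenvalues = np.linalg.eigvalsh(r)[::-1]   # eigvalsh returns ascending order
    n_factors = int(np.sum(eigenvalues > 1.0))
    return eigenvalues, n_factors

# Hypothetical correlation matrix for k = 3 variables
r = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

eigenvalues, n_factors = kaiser_criterion(r)
shares = eigenvalues / r.shape[0]               # proportion of total variance per factor

print(np.round(eigenvalues, 4), n_factors, np.round(shares, 4))
```

Because the criterion uses a strict cutoff at 1, an eigenvalue slightly below 1 is discarded automatically; as noted in the text, the researcher may still decide to keep such a factor.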

12.2.4 Factor Loadings and Communalities

Having established the factors, we can now define the factor loadings, which are simply the Pearson correlations between the original variables and each one of the factors. Table 12.3 shows the factor loadings for each variable-factor pair. Based on the latent root criterion (in which only factors resulting from eigenvalues greater than 1 are considered), we assume that the factor loadings between the factors that correspond to eigenvalues less than 1 and all the original variables are low, since these variables will have already presented higher Pearson correlations (loadings) with factors previously extracted from greater eigenvalues. In the same way, original variables that share only a small portion of variance with the other variables will have high factor loadings on only a single factor. If this occurs for all original variables, there will not be significant differences between correlation matrix $\rho$ and identity matrix I, making the $\chi^2_{Bartlett}$ statistic very low. In that case, the factor analysis will not be adequate, and researchers may choose not to extract factors from the original variables.

As the factor loadings are Pearson correlations between each variable and each factor, the sum of the squares of these loadings in each row of Table 12.3 will always be equal to 1, since each variable shares part of its variance with all k factors, and the sum of the proportions of variance (squared factor loadings, that is, squared Pearson correlations) will be 100%. Conversely, if fewer than k factors are extracted, due to the latent root criterion, the sum of the squared factor loadings in each row will not be equal to 1. This sum is called communality, which represents the total shared variance of each variable in all the factors extracted from eigenvalues greater than 1. So, we can say that:

$$
\begin{aligned}
c_{11}^2 + c_{12}^2 + \cdots &= \text{communality}_{X_1} \\
c_{21}^2 + c_{22}^2 + \cdots &= \text{communality}_{X_2} \\
&\;\;\vdots \\
c_{k1}^2 + c_{k2}^2 + \cdots &= \text{communality}_{X_k}
\end{aligned}
\tag{12.29}
$$

The main objective of the analysis of communalities is to check whether any variable ends up not sharing a significant proportion of variance with the factors extracted. Even though there is no cutoff point from which a certain communality can be considered high or low, since the sample size can interfere in this assessment, the existence of communalities that are considerably lower than the others can indicate to researchers that they may need to reconsider including the respective variable in the factor analysis.

TABLE 12.3 Factor Loadings Between Original Variables and Factors

| Variable | F1 | F2 | … | Fk |
|---|---|---|---|---|
| X1 | c11 | c12 | … | c1k |
| X2 | c21 | c22 | … | c2k |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| Xk | ck1 | ck2 | … | ckk |


Therefore, after defining the factors based on the factor scores, we can state that the factor loadings will be exactly the same as the parameters estimated in a multiple linear regression model that has, as a dependent variable, a certain standardized variable ZX and, as explanatory variables, the factors themselves. The coefficient of determination R² of each model is equal to the communality of the respective original variable. The sum of the squared factor loadings in each column of Table 12.3, on the other hand, will be equal to the respective eigenvalue, since the ratio between each eigenvalue and the total number of variables can be understood as the proportion of variance shared by all k original variables to form each factor. So, we can say that:

$$
\begin{aligned}
c_{11}^2 + c_{21}^2 + \cdots + c_{k1}^2 &= \lambda_1^2 \\
c_{12}^2 + c_{22}^2 + \cdots + c_{k2}^2 &= \lambda_2^2 \\
&\;\;\vdots \\
c_{1k}^2 + c_{2k}^2 + \cdots + c_{kk}^2 &= \lambda_k^2
\end{aligned}
\tag{12.30}
$$

After establishing the factors and calculating the factor loadings, some variables may still have intermediate (neither very high nor very low) Pearson correlations (factor loadings) with all the factors extracted, even though their communalities are not particularly low. In this case, although the solution of the factor analysis has already been obtained in an adequate way and can be considered concluded, researchers can rotate the factors so that the Pearson correlations between the original variables and the new factors generated are increased. In the following section, we will discuss factor rotation.

12.2.5 Factor Rotation

Once again, let's imagine a hypothetical situation in which a certain dataset has only three variables (k = 3). After preparing the principal component factor analysis, two factors, orthogonal to one another, are extracted, with factor loadings (Pearson correlations) with each one of the three original variables, according to Table 12.4. In order to construct a chart with the relative positions of each variable in each factor (a chart known as a loading plot), we can consider the factor loadings to be coordinates (abscissas and ordinates) of the variables in a Cartesian plane formed by both orthogonal factors. The plot can be seen in Fig. 12.5.

In order to better visualize which variables are better represented by a certain factor, we can think about a rotation around the origin of the originally extracted factors F1 and F2, so that the points corresponding to variables X1, X2, and X3 are brought closer to one of the new factors. These are called rotated factors F′1 and F′2. Fig. 12.6 shows this process in a simplified way. Based on Fig. 12.6, for each variable under analysis, we can see that while the loading for one factor increases, for the other, it decreases. Table 12.5 shows the loading redistribution for our hypothetical situation.

Thus, for a generic situation, we can say that rotation is a procedure that maximizes the loadings of each variable in a certain factor, to the detriment of the others. In this regard, the final effect of rotation is the redistribution of factor loadings to factors that initially had smaller proportions of variance shared by all the original variables. The main objective is to minimize the number of variables with high loadings in a certain factor, since each one of the factors will start having more significant loadings only with some of the original variables. Consequently, rotation may simplify the interpretation of the factors.

TABLE 12.4 Factor Loadings Between Three Variables and Two Factors

| Variable | F1 | F2 |
|---|---|---|
| X1 | c11 | c12 |
| X2 | c21 | c22 |
| X3 | c31 | c32 |


FIG. 12.5 Loading plot for a hypothetical situation with three variables and two factors.

FIG. 12.6 Defining the rotated factors from the original factors.

TABLE 12.5 Original and Rotated Factor Loadings for Our Hypothetical Situation

| Variable | Original F1 | Original F2 | Rotated F′1 | Rotated F′2 |
|---|---|---|---|---|
| X1 | c11 | c12 | \|c′11\| > \|c11\| | \|c′12\| < \|c12\| |
| X2 | c21 | c22 | \|c′21\| > \|c21\| | \|c′22\| < \|c22\| |
| X3 | c31 | c32 | \|c′31\| < \|c31\| | \|c′32\| > \|c32\| |


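Although the communalities and the total proportion of variance shared by all the variables in all the factors are not modified by the rotation (and neither are the KMO statistic or $\chi^2_{Bartlett}$), the proportion of variance shared by the original variables in each factor is redistributed and, therefore, modified. In other words, new eigenvalues $\lambda'^2$ ($\lambda'^2_1, \lambda'^2_2, \dots, \lambda'^2_k$) are obtained from the rotated factor loadings. Thus, we can say that:

$$
\begin{aligned}
c'^{2}_{11} + c'^{2}_{12} + \cdots &= \text{communality}_{X_1} \\
c'^{2}_{21} + c'^{2}_{22} + \cdots &= \text{communality}_{X_2} \\
&\;\;\vdots \\
c'^{2}_{k1} + c'^{2}_{k2} + \cdots &= \text{communality}_{X_k}
\end{aligned}
\tag{12.31}
$$

and that:

$$
\begin{aligned}
c'^{2}_{11} + c'^{2}_{21} + \cdots + c'^{2}_{k1} &= \lambda'^{2}_{1} \neq \lambda^{2}_{1} \\
c'^{2}_{12} + c'^{2}_{22} + \cdots + c'^{2}_{k2} &= \lambda'^{2}_{2} \neq \lambda^{2}_{2} \\
&\;\;\vdots \\
c'^{2}_{1k} + c'^{2}_{2k} + \cdots + c'^{2}_{kk} &= \lambda'^{2}_{k} \neq \lambda^{2}_{k}
\end{aligned}
\tag{12.32}
$$

even though Expression (12.13) still holds, that is:

$$
\lambda^{2}_{1} + \lambda^{2}_{2} + \cdots + \lambda^{2}_{k} = \lambda'^{2}_{1} + \lambda'^{2}_{2} + \cdots + \lambda'^{2}_{k} = k
\tag{12.33}
$$

Besides, new rotated factor scores $s'$ are obtained from the rotation of factors, such that the final expressions of the rotated factors will be:

$$
\begin{aligned}
F'_{1i} &= s'_{11} \cdot ZX_{1i} + s'_{21} \cdot ZX_{2i} + \cdots + s'_{k1} \cdot ZX_{ki} \\
F'_{2i} &= s'_{12} \cdot ZX_{1i} + s'_{22} \cdot ZX_{2i} + \cdots + s'_{k2} \cdot ZX_{ki} \\
&\;\;\vdots \\
F'_{ki} &= s'_{1k} \cdot ZX_{1i} + s'_{2k} \cdot ZX_{2i} + \cdots + s'_{kk} \cdot ZX_{ki}
\end{aligned}
\tag{12.34}
$$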

It is important to highlight that the overall adequacy of the factor analysis (KMO statistic and Bartlett's test of sphericity) is not altered by the rotation, since correlation matrix $\rho$ remains the same. Even though there are several factor rotation methods, the orthogonal rotation method known as Varimax is the most frequently used and will be used in this chapter to solve a practical example. Its main purpose is to minimize the number of variables that have high loadings on a certain factor through the redistribution of the factor loadings and the maximization of the variance shared in factors that correspond to lower eigenvalues, which is where the name Varimax comes from. This method was proposed by Kaiser (1958).

The algorithm behind the Varimax rotation method consists of determining a rotation angle $\theta$ by which pairs of factors are equally rotated. Thus, as discussed by Harman (1976), for a certain pair of factors F1 and F2, for example, the rotated factor loadings c′ between the two factors and the k original variables are obtained from the original factor loadings c through the following matrix multiplication:

$$
\begin{pmatrix}
c_{11} & c_{12} \\
c_{21} & c_{22} \\
\vdots & \vdots \\
c_{k1} & c_{k2}
\end{pmatrix}
\cdot
\begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix}
=
\begin{pmatrix}
c'_{11} & c'_{12} \\
c'_{21} & c'_{22} \\
\vdots & \vdots \\
c'_{k1} & c'_{k2}
\end{pmatrix}
\tag{12.35}
$$

where $\theta$, the counterclockwise rotation angle, is obtained by the following expression:

$$
\theta = 0.25 \cdot \arctan\left[\frac{2 \cdot (D \cdot k - A \cdot B)}{C \cdot k - (A^2 - B^2)}\right]
\tag{12.36}
$$

where:

$$
A = \sum_{l=1}^{k} \left( \frac{c_{1l}^2}{\text{communality}_l} - \frac{c_{2l}^2}{\text{communality}_l} \right)
\tag{12.37}
$$

$$
B = \sum_{l=1}^{k} \left( 2 \cdot \frac{c_{1l} \cdot c_{2l}}{\text{communality}_l} \right)
\tag{12.38}
$$

$$
C = \sum_{l=1}^{k} \left[ \left( \frac{c_{1l}^2}{\text{communality}_l} - \frac{c_{2l}^2}{\text{communality}_l} \right)^2 - \left( 2 \cdot \frac{c_{1l} \cdot c_{2l}}{\text{communality}_l} \right)^2 \right]
\tag{12.39}
$$

$$
D = \sum_{l=1}^{k} \left[ \left( \frac{c_{1l}^2}{\text{communality}_l} - \frac{c_{2l}^2}{\text{communality}_l} \right) \cdot \left( 2 \cdot \frac{c_{1l} \cdot c_{2l}}{\text{communality}_l} \right) \right]
\tag{12.40}
$$

In Section 12.2.6, we will use these Varimax rotation method expressions to determine the rotated factor loadings from the original loadings. Besides Varimax, we can also mention other orthogonal rotation methods, such as Quartimax and Equamax, even though they are less frequently mentioned in the existing literature and less used in practice. In addition, the researcher may also use oblique rotation methods, in which nonorthogonal factors are generated. Although they are not discussed in this chapter, we should mention the Direct Oblimin and Promax methods in this category. Although oblique rotation methods can sometimes be used when we wish to validate a certain construct whose initial factors are not correlated, we recommend that an orthogonal rotation method be used so that the factors extracted can later be used in other multivariate techniques, such as certain confirmatory models, in which the lack of multicollinearity of the explanatory variables is a mandatory premise.
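The invariances stated in Expressions (12.31) to (12.33) can be checked numerically. The sketch below applies the rotation matrix of Expression (12.35) to hypothetical loadings of three variables on two factors; the loading values and the 30-degree angle are illustrative assumptions, not figures from the chapter. With only two factors retained, the column sums add up to the sum of the communalities rather than to k.

```python
import math

# Hypothetical loadings of three variables on two orthogonal factors
# (illustrative values only, not taken from the chapter).
loadings = [
    (0.80, 0.45),
    (0.75, 0.50),
    (0.30, -0.85),
]


def rotate(rows, theta):
    """Rotate every (c1, c2) pair by theta radians with the matrix of Expression (12.35)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(c1 * cos_t + c2 * sin_t, -c1 * sin_t + c2 * cos_t) for c1, c2 in rows]


def communalities(rows):
    """Row sums of squared loadings (one value per variable)."""
    return [c1 ** 2 + c2 ** 2 for c1, c2 in rows]


def column_sums(rows):
    """Column sums of squared loadings (one value per factor)."""
    return (sum(c1 ** 2 for c1, _ in rows), sum(c2 ** 2 for _, c2 in rows))


rotated = rotate(loadings, math.radians(30))  # arbitrary angle chosen for illustration

print("Communalities before:", [round(x, 4) for x in communalities(loadings)])
print("Communalities after: ", [round(x, 4) for x in communalities(rotated)])
print("Column sums before:", [round(x, 4) for x in column_sums(loadings)])
print("Column sums after: ", [round(x, 4) for x in column_sums(rotated)])
```

The row sums (communalities) are unchanged by the rotation, while the variance attributed to each factor is redistributed and only its total is preserved.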

12.2.6 A Practical Example of the Principal Component Factor Analysis

Imagine that the same professor, deeply engaged in academic and pedagogical activities, is now interested in studying how his students’ grades behave so that, afterwards, he can propose the creation of a school performance ranking. In order to do that, he collected information on the final grades, which vary from 0 to 10, of each one of his 100 students in the following subjects: Finance, Costs, Marketing, and Actuarial Science. Part of the dataset can be seen in Table 12.6. The complete dataset can be found in the file FactorGrades.xls. Through this dataset, it is possible to construct Table 12.7, which shows Pearson’s correlation coefficients between each pair of variables, calculated by using the logic presented in Expression (12.2).

TABLE 12.6 Example: Final Grades in Finance, Costs, Marketing, and Actuarial Science

| Student | Final Grade in Finance (X1i) | Final Grade in Costs (X2i) | Final Grade in Marketing (X3i) | Final Grade in Actuarial Science (X4i) |
|---|---|---|---|---|
| Gabriela | 5.8 | 4.0 | 1.0 | 6.0 |
| Luiz Felipe | 3.1 | 3.0 | 10.0 | 2.0 |
| Patricia | 3.1 | 4.0 | 4.0 | 4.0 |
| Gustavo | 10.0 | 8.0 | 8.0 | 8.0 |
| Leticia | 3.4 | 2.0 | 3.2 | 3.2 |
| Ovidio | 10.0 | 10.0 | 1.0 | 10.0 |
| Leonor | 5.0 | 5.0 | 8.0 | 5.0 |
| Dalila | 5.4 | 6.0 | 6.0 | 6.0 |
| Antonio | 5.9 | 4.0 | 4.0 | 4.0 |
| … | | | | |
| Estela | 8.9 | 5.0 | 2.0 | 8.0 |


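TABLE 12.7 Pearson's Correlation Coefficients for Each Pair of Variables

| | finance | costs | marketing | actuarial science |
|---|---|---|---|---|
| finance | 1.000 | 0.756 | -0.030 | 0.711 |
| costs | 0.756 | 1.000 | 0.003 | 0.809 |
| marketing | -0.030 | 0.003 | 1.000 | -0.044 |
| actuarial science | 0.711 | 0.809 | -0.044 | 1.000 |

Therefore, we can write the expression of the correlation matrix $\rho$ as follows:

$$
\rho =
\begin{pmatrix}
1 & \rho_{12} & \rho_{13} & \rho_{14} \\
\rho_{21} & 1 & \rho_{23} & \rho_{24} \\
\rho_{31} & \rho_{32} & 1 & \rho_{34} \\
\rho_{41} & \rho_{42} & \rho_{43} & 1
\end{pmatrix}
=
\begin{pmatrix}
1.000 & 0.756 & -0.030 & 0.711 \\
0.756 & 1.000 & 0.003 & 0.809 \\
-0.030 & 0.003 & 1.000 & -0.044 \\
0.711 & 0.809 & -0.044 & 1.000
\end{pmatrix}
$$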

which has determinant D = 0.137. By analyzing correlation matrix $\rho$, it is possible to verify that only the grades corresponding to the variable marketing are practically uncorrelated with the grades in the other subjects, represented by the other variables. These, on the other hand, show relatively high correlations with one another (0.756 between finance and costs, 0.711 between finance and actuarial, and 0.809 between costs and actuarial), which indicates that they may share significant variance to form one factor. Although this preliminary analysis is important, it is no more than a simple diagnostic, since the overall adequacy of the factor analysis needs to be evaluated based on the KMO statistic and, mainly, on the result of Bartlett's test of sphericity.

As we discussed in Section 12.2.2, the KMO statistic provides the proportion of variance considered common to all the variables present in the analysis and, in order to calculate it, we need to determine the partial correlation coefficients $\varphi$ between each pair of variables. In this case, these will be second-order correlation coefficients, since we are working with four variables simultaneously. Consequently, based on Expression (12.7), we first need to determine the first-order correlation coefficients used to calculate the second-order correlation coefficients. Table 12.8 shows these coefficients. From these coefficients, and by using Expression (12.8), we can calculate the second-order correlation coefficients considered in the expression of the KMO statistic. Table 12.9 shows these coefficients.

TABLE 12.8 First-Order Correlation Coefficients

$$
\begin{aligned}
\varphi_{12,3} &= \frac{\rho_{12} - \rho_{13} \cdot \rho_{23}}{\sqrt{(1 - \rho_{13}^2) \cdot (1 - \rho_{23}^2)}} = 0.756 &
\varphi_{13,2} &= \frac{\rho_{13} - \rho_{12} \cdot \rho_{23}}{\sqrt{(1 - \rho_{12}^2) \cdot (1 - \rho_{23}^2)}} = -0.049 \\
\varphi_{14,2} &= \frac{\rho_{14} - \rho_{12} \cdot \rho_{24}}{\sqrt{(1 - \rho_{12}^2) \cdot (1 - \rho_{24}^2)}} = 0.258 &
\varphi_{14,3} &= \frac{\rho_{14} - \rho_{13} \cdot \rho_{34}}{\sqrt{(1 - \rho_{13}^2) \cdot (1 - \rho_{34}^2)}} = 0.711 \\
\varphi_{23,1} &= \frac{\rho_{23} - \rho_{12} \cdot \rho_{13}}{\sqrt{(1 - \rho_{12}^2) \cdot (1 - \rho_{13}^2)}} = 0.039 &
\varphi_{24,1} &= \frac{\rho_{24} - \rho_{12} \cdot \rho_{14}}{\sqrt{(1 - \rho_{12}^2) \cdot (1 - \rho_{14}^2)}} = 0.590 \\
\varphi_{24,3} &= \frac{\rho_{24} - \rho_{23} \cdot \rho_{34}}{\sqrt{(1 - \rho_{23}^2) \cdot (1 - \rho_{34}^2)}} = 0.810 &
\varphi_{34,1} &= \frac{\rho_{34} - \rho_{13} \cdot \rho_{14}}{\sqrt{(1 - \rho_{13}^2) \cdot (1 - \rho_{14}^2)}} = -0.033 \\
\varphi_{34,2} &= \frac{\rho_{34} - \rho_{23} \cdot \rho_{24}}{\sqrt{(1 - \rho_{23}^2) \cdot (1 - \rho_{24}^2)}} = -0.080
\end{aligned}
$$

TABLE 12.9 Second-Order Correlation Coefficients

$$
\begin{aligned}
\varphi_{12,34} &= \frac{\varphi_{12,3} - \varphi_{14,3} \cdot \varphi_{24,3}}{\sqrt{(1 - \varphi_{14,3}^2) \cdot (1 - \varphi_{24,3}^2)}} = 0.438 &
\varphi_{13,24} &= \frac{\varphi_{13,2} - \varphi_{14,2} \cdot \varphi_{34,2}}{\sqrt{(1 - \varphi_{14,2}^2) \cdot (1 - \varphi_{34,2}^2)}} = -0.029 \\
\varphi_{14,23} &= \frac{\varphi_{14,2} - \varphi_{13,2} \cdot \varphi_{34,2}}{\sqrt{(1 - \varphi_{13,2}^2) \cdot (1 - \varphi_{34,2}^2)}} = 0.255 &
\varphi_{23,14} &= \frac{\varphi_{23,1} - \varphi_{24,1} \cdot \varphi_{34,1}}{\sqrt{(1 - \varphi_{24,1}^2) \cdot (1 - \varphi_{34,1}^2)}} = 0.072 \\
\varphi_{24,13} &= \frac{\varphi_{24,1} - \varphi_{23,1} \cdot \varphi_{34,1}}{\sqrt{(1 - \varphi_{23,1}^2) \cdot (1 - \varphi_{34,1}^2)}} = 0.592 &
\varphi_{34,12} &= \frac{\varphi_{34,1} - \varphi_{23,1} \cdot \varphi_{24,1}}{\sqrt{(1 - \varphi_{23,1}^2) \cdot (1 - \varphi_{24,1}^2)}} = -0.069
\end{aligned}
$$


So, based on Expression (12.3), we can calculate the KMO statistic. The terms of the expression are given by:

$$
\sum_{l=1}^{k} \sum_{c=1}^{k} \rho_{lc}^2 = (0.756)^2 + (-0.030)^2 + (0.711)^2 + (0.003)^2 + (0.809)^2 + (-0.044)^2 = 1.734
$$

$$
\sum_{l=1}^{k} \sum_{c=1}^{k} \varphi_{lc}^2 = (0.438)^2 + (-0.029)^2 + (0.255)^2 + (0.072)^2 + (0.592)^2 + (-0.069)^2 = 0.619
$$

from where we obtain:

$$
KMO = \frac{1.734}{1.734 + 0.619} = 0.737
$$
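The KMO calculation can be reproduced with a short recursive routine. The sketch below assumes the signed correlations of Table 12.7 (marketing is slightly negatively correlated with finance and actuarial science) and applies the same recursion used in Tables 12.8 and 12.9, conditioning on one additional variable at a time. The function names are hypothetical, and small differences in the third decimal place are expected because the inputs are rounded.

```python
import math
from itertools import combinations

# Pearson correlations from Table 12.7
# (1 = finance, 2 = costs, 3 = marketing, 4 = actuarial science).
RHO = {
    (1, 2): 0.756, (1, 3): -0.030, (1, 4): 0.711,
    (2, 3): 0.003, (2, 4): 0.809, (3, 4): -0.044,
}


def rho(a, b):
    """Simple (zero-order) correlation, looked up symmetrically."""
    return RHO[(a, b)] if (a, b) in RHO else RHO[(b, a)]


def partial(a, b, given):
    """Partial correlation between a and b, holding the variables in `given` fixed.

    Conditions on one more variable at a time until `given` is exhausted,
    as in the first-order and second-order coefficients of the text.
    """
    if not given:
        return rho(a, b)
    *rest, c = given
    p_ab = partial(a, b, rest)
    p_ac = partial(a, c, rest)
    p_bc = partial(b, c, rest)
    return (p_ab - p_ac * p_bc) / math.sqrt((1 - p_ac ** 2) * (1 - p_bc ** 2))


variables = [1, 2, 3, 4]
sum_rho_sq = 0.0
sum_phi_sq = 0.0
for a, b in combinations(variables, 2):
    others = [v for v in variables if v not in (a, b)]
    sum_rho_sq += rho(a, b) ** 2
    sum_phi_sq += partial(a, b, others) ** 2

kmo = sum_rho_sq / (sum_rho_sq + sum_phi_sq)
print(f"sum rho^2 = {sum_rho_sq:.3f}, sum phi^2 = {sum_phi_sq:.3f}, KMO = {kmo:.3f}")
```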

Based on the criterion presented in Table 12.2, the value of the KMO statistic suggests that the overall adequacy of the factor analysis is middling. To test whether correlation matrix $\rho$ is, in fact, statistically different from identity matrix I of the same dimension, we must use Bartlett's test of sphericity, whose $\chi^2_{Bartlett}$ statistic is given by Expression (12.9). For n = 100 observations, k = 4 variables, and correlation matrix determinant D = 0.137, we have:

$$
\chi^2_{Bartlett} = -\left[ (100 - 1) - \frac{2 \cdot 4 + 5}{6} \right] \cdot \ln(0.137) = 192.335
$$

with $\frac{4 \cdot (4 - 1)}{2} = 6$ degrees of freedom. Therefore, by using Table D in the Appendix, we have $\chi^2_c = 12.592$ (critical $\chi^2$ for 6 degrees of freedom and a significance level of 0.05). Thus, since $\chi^2_{Bartlett} = 192.335 > \chi^2_c = 12.592$, we can reject the null hypothesis that correlation matrix $\rho$ is statistically equal to identity matrix I, at a significance level of 0.05.

Software packages like SPSS and Stata do not report $\chi^2_c$ for the defined degrees of freedom and a certain significance level. Instead, they report the significance level of $\chi^2_{Bartlett}$ for these degrees of freedom. So, instead of checking whether $\chi^2_{Bartlett} > \chi^2_c$, we must verify whether the significance level of $\chi^2_{Bartlett}$ is less than 0.05 (5%) so that we can continue performing the factor analysis. Thus:

If P-value (either Sig. $\chi^2_{Bartlett}$ or Prob. $\chi^2_{Bartlett}$) < 0.05, correlation matrix $\rho$ is not statistically equal to identity matrix I of the same dimension.

The significance level of $\chi^2_{Bartlett}$ can be obtained in Excel by using the command Formulas → Insert Function → CHIDIST, which opens a dialog box, as shown in Fig. 12.7. As we can see in Fig. 12.7, the P-value of the $\chi^2_{Bartlett}$ statistic is considerably less than 0.05 (P-value = $8.11 \times 10^{-39}$), that is, Pearson's correlations between the pairs of variables are statistically different from 0. Therefore, factors can be extracted from the original variables, and the factor analysis is adequate.

FIG. 12.7 Obtaining the significance level of $\chi^2$ (command Insert Function).
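As an alternative to the Excel dialog box, the statistic and its P-value can be sketched in Python using only the standard library. The function names below are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom, which holds in this example (6). With the rounded determinant D = 0.137, the statistic comes out near 192.5 rather than the reported 192.335, presumably because the text used an unrounded determinant.

```python
import math


def bartlett_sphericity(n, k, det_rho):
    """Bartlett's statistic and degrees of freedom for n observations and k variables."""
    chi2 = -((n - 1) - (2 * k + 5) / 6) * math.log(det_rho)
    df = k * (k - 1) // 2
    return chi2, df


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid only when df is even."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even number of degrees of freedom")
    half = x / 2
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j  # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total


chi2, df = bartlett_sphericity(n=100, k=4, det_rho=0.137)
p_value = chi2_upper_tail_even_df(chi2, df)

print(f"chi2 = {chi2:.3f} with {df} degrees of freedom")
print(f"P-value = {p_value:.3e}")
print("Reject sphericity at the 0.05 level:", p_value < 0.05)
```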


Having verified the overall adequacy of the factor analysis, we can move on to the definition of the factors. In order to do that, we must initially determine the four eigenvalues $\lambda^2$ ($\lambda^2_1 \geq \lambda^2_2 \geq \lambda^2_3 \geq \lambda^2_4$) of correlation matrix $\rho$, which can be obtained by solving Expression (12.12). Therefore, we have:

$$
\begin{vmatrix}
\lambda^2 - 1 & -0.756 & 0.030 & -0.711 \\
-0.756 & \lambda^2 - 1 & -0.003 & -0.809 \\
0.030 & -0.003 & \lambda^2 - 1 & 0.044 \\
-0.711 & -0.809 & 0.044 & \lambda^2 - 1
\end{vmatrix}
= 0
$$

from where we obtain:

$$
\begin{cases}
\lambda^2_1 = 2.519 \\
\lambda^2_2 = 1.000 \\
\lambda^2_3 = 0.298 \\
\lambda^2_4 = 0.183
\end{cases}
$$

Consequently, based on Expression (12.15), eigenvalue matrix $\Lambda^2$ can be written as follows:

$$
\Lambda^2 =
\begin{pmatrix}
2.519 & 0 & 0 & 0 \\
0 & 1.000 & 0 & 0 \\
0 & 0 & 0.298 & 0 \\
0 & 0 & 0 & 0.183
\end{pmatrix}
$$

Note that Expression (12.13) is satisfied, that is:

$$
\lambda^2_1 + \lambda^2_2 + \cdots + \lambda^2_k = 2.519 + 1.000 + 0.298 + 0.183 = 4
$$

Since the eigenvalues correspond to the proportion of variance shared by the original variables to form each factor, we can construct a shared variance table (Table 12.10). By analyzing Table 12.10, we can say that while 62.975% of the total variance is shared to form the first factor, 25.010% is shared to form the second factor. The third and fourth factors, whose eigenvalues are less than 1, are formed through smaller proportions of shared variance. Since the most common criterion used to choose the number of factors is the latent root criterion (Kaiser criterion), in which only the factors that correspond to eigenvalues greater than 1 are taken into consideration, the researcher can choose to conduct all the subsequent analysis with only the first two factors, formed by sharing 87.985% of the total variance of the original variables, that is, with a total variance loss of 12.015%.

TABLE 12.10 Variance Shared by the Original Variables to Form Each Factor

| Factor | Eigenvalue $\lambda^2$ | Shared Variance (%) | Cumulative Shared Variance (%) |
|---|---|---|---|
| 1 | 2.519 | (2.519 / 4) × 100 = 62.975 | 62.975 |
| 2 | 1.000 | (1.000 / 4) × 100 = 25.010 | 87.985 |
| 3 | 0.298 | (0.298 / 4) × 100 = 7.444 | 95.428 |
| 4 | 0.183 | (0.183 / 4) × 100 = 4.572 | 100.000 |

Nonetheless, for pedagogical purposes, let's discuss how to calculate the factor scores by determining the eigenvectors that correspond to the four eigenvalues. In order to define the eigenvectors of matrix $\rho$ based on the four eigenvalues calculated, we must solve the following equation systems for each eigenvalue, based on Expressions (12.16)–(12.21):

Determining eigenvectors $v_{11}, v_{21}, v_{31}, v_{41}$ from the first eigenvalue ($\lambda^2_1 = 2.519$):

$$
\begin{cases}
(2.519 - 1.000) \cdot v_{11} - 0.756 \cdot v_{21} + 0.030 \cdot v_{31} - 0.711 \cdot v_{41} = 0 \\
-0.756 \cdot v_{11} + (2.519 - 1.000) \cdot v_{21} - 0.003 \cdot v_{31} - 0.809 \cdot v_{41} = 0 \\
0.030 \cdot v_{11} - 0.003 \cdot v_{21} + (2.519 - 1.000) \cdot v_{31} + 0.044 \cdot v_{41} = 0 \\
-0.711 \cdot v_{11} - 0.809 \cdot v_{21} + 0.044 \cdot v_{31} + (2.519 - 1.000) \cdot v_{41} = 0
\end{cases}
$$


from where we obtain:

$$
\begin{pmatrix} v_{11} \\ v_{21} \\ v_{31} \\ v_{41} \end{pmatrix}
=
\begin{pmatrix} 0.5641 \\ 0.5887 \\ -0.0267 \\ 0.5783 \end{pmatrix}
$$

Determining eigenvectors $v_{12}, v_{22}, v_{32}, v_{42}$ from the second eigenvalue ($\lambda^2_2 = 1.000$):

$$
\begin{cases}
(1.000 - 1.000) \cdot v_{12} - 0.756 \cdot v_{22} + 0.030 \cdot v_{32} - 0.711 \cdot v_{42} = 0 \\
-0.756 \cdot v_{12} + (1.000 - 1.000) \cdot v_{22} - 0.003 \cdot v_{32} - 0.809 \cdot v_{42} = 0 \\
0.030 \cdot v_{12} - 0.003 \cdot v_{22} + (1.000 - 1.000) \cdot v_{32} + 0.044 \cdot v_{42} = 0 \\
-0.711 \cdot v_{12} - 0.809 \cdot v_{22} + 0.044 \cdot v_{32} + (1.000 - 1.000) \cdot v_{42} = 0
\end{cases}
$$

from where we obtain:

$$
\begin{pmatrix} v_{12} \\ v_{22} \\ v_{32} \\ v_{42} \end{pmatrix}
=
\begin{pmatrix} 0.0068 \\ 0.0487 \\ 0.9987 \\ -0.0101 \end{pmatrix}
$$

Determining eigenvectors $v_{13}, v_{23}, v_{33}, v_{43}$ from the third eigenvalue ($\lambda^2_3 = 0.298$):

$$
\begin{cases}
(0.298 - 1.000) \cdot v_{13} - 0.756 \cdot v_{23} + 0.030 \cdot v_{33} - 0.711 \cdot v_{43} = 0 \\
-0.756 \cdot v_{13} + (0.298 - 1.000) \cdot v_{23} - 0.003 \cdot v_{33} - 0.809 \cdot v_{43} = 0 \\
0.030 \cdot v_{13} - 0.003 \cdot v_{23} + (0.298 - 1.000) \cdot v_{33} + 0.044 \cdot v_{43} = 0 \\
-0.711 \cdot v_{13} - 0.809 \cdot v_{23} + 0.044 \cdot v_{33} + (0.298 - 1.000) \cdot v_{43} = 0
\end{cases}
$$

from where we obtain:

$$
\begin{pmatrix} v_{13} \\ v_{23} \\ v_{33} \\ v_{43} \end{pmatrix}
=
\begin{pmatrix} 0.8008 \\ -0.2201 \\ -0.0003 \\ -0.5571 \end{pmatrix}
$$

Determining eigenvectors $v_{14}, v_{24}, v_{34}, v_{44}$ from the fourth eigenvalue ($\lambda^2_4 = 0.183$):

$$
\begin{cases}
(0.183 - 1.000) \cdot v_{14} - 0.756 \cdot v_{24} + 0.030 \cdot v_{34} - 0.711 \cdot v_{44} = 0 \\
-0.756 \cdot v_{14} + (0.183 - 1.000) \cdot v_{24} - 0.003 \cdot v_{34} - 0.809 \cdot v_{44} = 0 \\
0.030 \cdot v_{14} - 0.003 \cdot v_{24} + (0.183 - 1.000) \cdot v_{34} + 0.044 \cdot v_{44} = 0 \\
-0.711 \cdot v_{14} - 0.809 \cdot v_{24} + 0.044 \cdot v_{34} + (0.183 - 1.000) \cdot v_{44} = 0
\end{cases}
$$

from where we obtain:

$$
\begin{pmatrix} v_{14} \\ v_{24} \\ v_{34} \\ v_{44} \end{pmatrix}
=
\begin{pmatrix} 0.2012 \\ -0.7763 \\ 0.0425 \\ 0.5959 \end{pmatrix}
$$
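A quick numerical check of the four eigenpairs is sketched below, using only the Python standard library. It assumes the signed values implied by the equation systems and by the factor expressions of this example (several entries are negative), and it verifies that each vector approximately satisfies ρ·v = λ²·v, has unit length, and is orthogonal to the others. Residuals are not exactly zero because all inputs are rounded; the helper names are hypothetical.

```python
RHO = [
    [1.000, 0.756, -0.030, 0.711],
    [0.756, 1.000, 0.003, 0.809],
    [-0.030, 0.003, 1.000, -0.044],
    [0.711, 0.809, -0.044, 1.000],
]
EIGENVALUES = [2.519, 1.000, 0.298, 0.183]
EIGENVECTORS = [
    [0.5641, 0.5887, -0.0267, 0.5783],
    [0.0068, 0.0487, 0.9987, -0.0101],
    [0.8008, -0.2201, -0.0003, -0.5571],
    [0.2012, -0.7763, 0.0425, 0.5959],
]


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, vector) for row in matrix]


def max_residual(eigenvalue, vector):
    """Largest absolute entry of rho*v - eigenvalue*v (zero for an exact eigenpair)."""
    rho_v = mat_vec(RHO, vector)
    return max(abs(rv - eigenvalue * v) for rv, v in zip(rho_v, vector))


residuals = [max_residual(lam, v) for lam, v in zip(EIGENVALUES, EIGENVECTORS)]
norms = [dot(v, v) for v in EIGENVECTORS]
cross_products = [
    dot(EIGENVECTORS[i], EIGENVECTORS[j])
    for i in range(4)
    for j in range(i + 1, 4)
]

print("Max residual per eigenpair:", [round(r, 4) for r in residuals])
print("Squared norms:", [round(n, 4) for n in norms])
print("Cross products:", [round(c, 4) for c in cross_products])
```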


After having determined the eigenvectors, a more inquisitive researcher may verify the relationship presented in Expression (12.27), that is:

$$
V' \cdot \rho \cdot V = \Lambda^2
$$

$$
\begin{pmatrix}
0.5641 & 0.5887 & -0.0267 & 0.5783 \\
0.0068 & 0.0487 & 0.9987 & -0.0101 \\
0.8008 & -0.2201 & -0.0003 & -0.5571 \\
0.2012 & -0.7763 & 0.0425 & 0.5959
\end{pmatrix}
\cdot
\begin{pmatrix}
1.000 & 0.756 & -0.030 & 0.711 \\
0.756 & 1.000 & 0.003 & 0.809 \\
-0.030 & 0.003 & 1.000 & -0.044 \\
0.711 & 0.809 & -0.044 & 1.000
\end{pmatrix}
\cdot
\begin{pmatrix}
0.5641 & 0.0068 & 0.8008 & 0.2012 \\
0.5887 & 0.0487 & -0.2201 & -0.7763 \\
-0.0267 & 0.9987 & -0.0003 & 0.0425 \\
0.5783 & -0.0101 & -0.5571 & 0.5959
\end{pmatrix}
=
\begin{pmatrix}
2.519 & 0 & 0 & 0 \\
0 & 1.000 & 0 & 0 \\
0 & 0 & 0.298 & 0 \\
0 & 0 & 0 & 0.183
\end{pmatrix}
$$

Based on Expressions (12.22)–(12.24), we can calculate the factor scores that correspond to each one of the standardized variables for each one of the factors. Thus, from Expression (12.25), we are able to write the expressions for factors F1, F2, F3, and F4, as follows:

$$
\begin{aligned}
F_{1i} &= \frac{0.5641}{\sqrt{2.519}} \cdot Z\text{finance}_i + \frac{0.5887}{\sqrt{2.519}} \cdot Z\text{costs}_i - \frac{0.0267}{\sqrt{2.519}} \cdot Z\text{marketing}_i + \frac{0.5783}{\sqrt{2.519}} \cdot Z\text{actuarial}_i \\
F_{2i} &= \frac{0.0068}{\sqrt{1.000}} \cdot Z\text{finance}_i + \frac{0.0487}{\sqrt{1.000}} \cdot Z\text{costs}_i + \frac{0.9987}{\sqrt{1.000}} \cdot Z\text{marketing}_i - \frac{0.0101}{\sqrt{1.000}} \cdot Z\text{actuarial}_i \\
F_{3i} &= \frac{0.8008}{\sqrt{0.298}} \cdot Z\text{finance}_i - \frac{0.2201}{\sqrt{0.298}} \cdot Z\text{costs}_i - \frac{0.0003}{\sqrt{0.298}} \cdot Z\text{marketing}_i - \frac{0.5571}{\sqrt{0.298}} \cdot Z\text{actuarial}_i \\
F_{4i} &= \frac{0.2012}{\sqrt{0.183}} \cdot Z\text{finance}_i - \frac{0.7763}{\sqrt{0.183}} \cdot Z\text{costs}_i + \frac{0.0425}{\sqrt{0.183}} \cdot Z\text{marketing}_i + \frac{0.5959}{\sqrt{0.183}} \cdot Z\text{actuarial}_i
\end{aligned}
$$

from where we obtain:

$$
\begin{aligned}
F_{1i} &= 0.355 \cdot Z\text{finance}_i + 0.371 \cdot Z\text{costs}_i - 0.017 \cdot Z\text{marketing}_i + 0.364 \cdot Z\text{actuarial}_i \\
F_{2i} &= 0.007 \cdot Z\text{finance}_i + 0.049 \cdot Z\text{costs}_i + 0.999 \cdot Z\text{marketing}_i - 0.010 \cdot Z\text{actuarial}_i \\
F_{3i} &= 1.468 \cdot Z\text{finance}_i - 0.403 \cdot Z\text{costs}_i - 0.001 \cdot Z\text{marketing}_i - 1.021 \cdot Z\text{actuarial}_i \\
F_{4i} &= 0.470 \cdot Z\text{finance}_i - 1.815 \cdot Z\text{costs}_i + 0.099 \cdot Z\text{marketing}_i + 1.394 \cdot Z\text{actuarial}_i
\end{aligned}
$$

Based on the factor expressions and on the standardized variables, we can calculate the values corresponding to each factor for each observation. Table 12.11 shows these results for part of the dataset. For the first observation in the sample (Gabriela), for example, we can see that:

$$
\begin{aligned}
F_{1\,\text{Gabriela}} &= 0.355 \cdot (-0.011) + 0.371 \cdot (-0.290) - 0.017 \cdot (-1.650) + 0.364 \cdot (0.273) = 0.016 \\
F_{2\,\text{Gabriela}} &= 0.007 \cdot (-0.011) + 0.049 \cdot (-0.290) + 0.999 \cdot (-1.650) - 0.010 \cdot (0.273) = -1.665 \\
F_{3\,\text{Gabriela}} &= 1.468 \cdot (-0.011) - 0.403 \cdot (-0.290) - 0.001 \cdot (-1.650) - 1.021 \cdot (0.273) = -0.176 \\
F_{4\,\text{Gabriela}} &= 0.470 \cdot (-0.011) - 1.815 \cdot (-0.290) + 0.099 \cdot (-1.650) + 1.394 \cdot (0.273) = 0.739
\end{aligned}
$$

TABLE 12.11 Calculation of the Factors for Each Observation

| Student | Zfinance_i | Zcosts_i | Zmarketing_i | Zactuarial_i | F1i | F2i | F3i | F4i |
|---|---|---|---|---|---|---|---|---|
| Gabriela | -0.011 | -0.290 | -1.650 | 0.273 | 0.016 | -1.665 | -0.176 | 0.739 |
| Luiz Felipe | -0.876 | -0.697 | 1.532 | -1.319 | -1.076 | 1.503 | 0.342 | -0.831 |
| Patricia | -0.876 | -0.290 | -0.590 | -0.523 | -0.600 | -0.603 | -0.634 | -0.672 |
| Gustavo | 1.334 | 1.337 | 0.825 | 1.069 | 1.346 | 0.887 | 0.327 | -0.228 |
| Leticia | -0.779 | -1.104 | -0.872 | -0.841 | -0.978 | -0.922 | 0.161 | 0.379 |
| Ovidio | 1.334 | 2.150 | -1.650 | 1.865 | 1.979 | -1.553 | -0.812 | -0.841 |
| Leonor | -0.267 | 0.116 | 0.825 | -0.125 | -0.111 | 0.829 | -0.312 | -0.429 |
| Dalila | -0.139 | 0.523 | 0.118 | 0.273 | 0.242 | 0.139 | -0.694 | -0.623 |
| Antonio | 0.021 | -0.290 | -0.590 | -0.523 | -0.281 | -0.597 | 0.682 | -0.250 |
| … | | | | | | | | |
| Estela | 0.982 | 0.113 | -1.297 | 1.069 | 0.802 | -1.293 | 0.305 | 1.616 |
| Mean | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Standard deviation | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

It is important to emphasize that all the factors extracted have Pearson correlations equal to 0 between themselves, that is, they are orthogonal to one another. A more inquisitive researcher may also verify that the factor scores that correspond to each factor are exactly the estimated parameters of a multiple linear regression model that has, as a dependent variable, the factor itself and, as explanatory variables, the standardized variables.

Having established the factors, we can define the factor loadings, which correspond to Pearson's correlation coefficients between the original variables and each one of the factors. Table 12.12 shows the factor loadings for the data in our example. For each original variable, the highest factor loading is marked in Table 12.12. While the variables finance, costs, and actuarial show stronger correlations with the first factor, only the variable marketing shows a stronger correlation with the second factor. This shows the need for a second factor in order for all the variables to share significant proportions of variance.

TABLE 12.12 Factor Loadings (Pearson's Correlation Coefficients) Between Variables and Factors

| Variable | F1 | F2 | F3 | F4 |
|---|---|---|---|---|
| finance | 0.895* | 0.007 | 0.437 | 0.086 |
| costs | 0.934* | 0.049 | -0.120 | -0.332 |
| marketing | -0.042 | 0.999* | 0.000 | 0.018 |
| actuarial science | 0.918* | -0.010 | -0.304 | 0.255 |

\* Highest factor loading for the variable.

However, the third and fourth factors present relatively low correlations with the original variables, which explains the fact that the respective eigenvalues are less than 1. If the variable marketing had not been included in the analysis, only the first factor would be necessary to explain the joint behavior of the other variables, and the other factors would also have eigenvalues less than 1. Therefore, as discussed in Section 12.2.4, we can verify that the factor loadings between the original variables and the factors corresponding to eigenvalues less than 1 are relatively low, since the variables have already shown stronger Pearson correlations with factors previously extracted from greater eigenvalues.

Based on Expression (12.30), we can see that the sum of the squared factor loadings in each column of Table 12.12 is the respective eigenvalue which, as discussed before, can be understood as the proportion of variance shared by the four original variables to form each factor. Therefore, we have:

$$
\begin{aligned}
(0.895)^2 + (0.934)^2 + (-0.042)^2 + (0.918)^2 &= 2.519 \\
(0.007)^2 + (0.049)^2 + (0.999)^2 + (-0.010)^2 &= 1.000 \\
(0.437)^2 + (-0.120)^2 + (0.000)^2 + (-0.304)^2 &= 0.298 \\
(0.086)^2 + (-0.332)^2 + (0.018)^2 + (0.255)^2 &= 0.183
\end{aligned}
$$

from which we can see that the second eigenvalue only reached the value 1 due to the high factor loading for the variable marketing.
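The factor values for a single observation can be reproduced from the eigenvalues and eigenvectors of this example. The sketch below assumes, as the factor expressions above indicate, that each factor score is the eigenvector entry divided by the square root of the eigenvalue, and it uses Gabriela's standardized grades from Table 12.11 (negative where the grade is below the class mean). The helper names are hypothetical, and the third decimal place can differ slightly from the text because the inputs are rounded.

```python
import math

EIGENVALUES = [2.519, 1.000, 0.298, 0.183]
EIGENVECTORS = [
    [0.5641, 0.5887, -0.0267, 0.5783],
    [0.0068, 0.0487, 0.9987, -0.0101],
    [0.8008, -0.2201, -0.0003, -0.5571],
    [0.2012, -0.7763, 0.0425, 0.5959],
]

# Standardized grades of the first observation (finance, costs, marketing, actuarial).
z_gabriela = [-0.011, -0.290, -1.650, 0.273]


def score_coefficients(eigenvalue, eigenvector):
    """Factor scores: eigenvector entries divided by the square root of the eigenvalue."""
    root = math.sqrt(eigenvalue)
    return [v / root for v in eigenvector]


def factor_value(scores, z_values):
    """Linear combination of the standardized variables weighted by the factor scores."""
    return sum(s * z for s, z in zip(scores, z_values))


scores = [score_coefficients(lam, vec) for lam, vec in zip(EIGENVALUES, EIGENVECTORS)]
factors = [factor_value(s, z_gabriela) for s in scores]

for j, (s, f) in enumerate(zip(scores, factors), start=1):
    print(f"F{j}: scores = {[round(x, 3) for x in s]}, value for Gabriela = {f:.3f}")
```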


Furthermore, from the factor loadings presented in Table 12.12, we can also calculate the communalities, which represent the total shared variance of each variable in all the factors extracted from eigenvalues greater than 1. So, based on Expression (12.29), we can write:

$$
\begin{aligned}
\text{communality}_{\text{finance}} &= (0.895)^2 + (0.007)^2 = 0.802 \\
\text{communality}_{\text{costs}} &= (0.934)^2 + (0.049)^2 = 0.875 \\
\text{communality}_{\text{marketing}} &= (-0.042)^2 + (0.999)^2 = 1.000 \\
\text{communality}_{\text{actuarial}} &= (0.918)^2 + (-0.010)^2 = 0.843
\end{aligned}
$$

Consequently, even though the variable marketing is the only one that has a high factor loading with the second factor, it is the variable in which the lowest proportion of variance is lost to form both factors. On the other hand, the variable finance is the one that presents the highest loss of variance to form these two factors (around 19.8%). If we had considered the factor loadings of the four factors, all the communalities would be equal to 1. As we discussed in Section 12.2.4, the factor loadings are exactly the parameters estimated in a multiple linear regression model that has, as a dependent variable, a certain standardized variable and, as explanatory variables, the factors themselves, in which the coefficient of determination R² of each model is equal to the communality of the respective original variable.

For the first two factors, we can therefore construct a chart in which the factor loadings of each variable are plotted on the orthogonal axes that represent factors F1 and F2, respectively. This chart, known as a loading plot, can be seen in Fig. 12.8. By analyzing the loading plot, the behavior of the correlations becomes clear. While the variables finance, costs, and actuarial show high correlation with the first factor (X-axis), the variable marketing shows strong correlation with the second factor (Y-axis).
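The communalities and the column sums of squared loadings can be recomputed directly from Table 12.12. The sketch below assumes the signed loadings implied by the factor expressions of this example and keeps the first two factors, as the latent root criterion suggests. Variable names are hypothetical, and results agree with the text only up to rounding.

```python
# Factor loadings from Table 12.12 (columns F1 to F4).
LOADINGS = {
    "finance": [0.895, 0.007, 0.437, 0.086],
    "costs": [0.934, 0.049, -0.120, -0.332],
    "marketing": [-0.042, 0.999, 0.000, 0.018],
    "actuarial": [0.918, -0.010, -0.304, 0.255],
}
N_RETAINED = 2  # factors with eigenvalue greater than 1

# Communality: row sum of squared loadings over the retained factors only.
communality = {
    name: sum(c ** 2 for c in row[:N_RETAINED]) for name, row in LOADINGS.items()
}
# Over all four factors every row sum should be close to 1.
full_row_sum = {name: sum(c ** 2 for c in row) for name, row in LOADINGS.items()}
# Column sums of squared loadings should be close to the eigenvalues.
column_sum = [sum(row[j] ** 2 for row in LOADINGS.values()) for j in range(4)]
# Share of each variable's variance that is lost by keeping two factors.
lost_variance = {name: 1 - value for name, value in communality.items()}

for name in LOADINGS:
    print(f"{name}: communality = {communality[name]:.3f}, lost = {lost_variance[name]:.1%}")
print("Column sums:", [round(x, 3) for x in column_sum])
```

The variable with the largest lost share is finance, in line with the loss of around 19.8% discussed above.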
More inquisitive researchers may investigate the reasons why this phenomenon occurs, since, sometimes, while the subjects Finance, Costs, and Actuarial Science are taught in a more quantitative way, Marketing can be taught in a more qualitative and behavioral manner. However, it is important to mention that the definition of factors does not force researchers to name them, because, normally, this is not a simple task. Factor analysis does not have “naming factors” as one of its goals and, in case we intend to do that, researchers need to have vast knowledge about the phenomenon being studied, and confirmatory techniques can help them in this endeavor. At this moment, we can consider the preparation of the principal component factor analysis concluded. Nevertheless, as discussed in Section 12.2.5, if researchers wish to obtain a clearer visualization of the variables better represented by a certain factor, they can elaborate a rotation using the Varimax orthogonal method, which maximizes the loadings of each variable in a certain factor. In our example, since we already have an excellent idea of the variables with high loadings in each factor, being the loading plot (Fig. 12.8) already very clear, rotation may be considered unnecessary. Therefore, it will only be elaborated for pedagogical purposes, since, sometimes, researchers may find themselves in situations in which such phenomenon is not so clear. Consequently, based on the factor loadings for the first two factors (first two columns of Table 12.12), we will obtain rotated factor loadings c0 after rotating both factors for an angle y. Thus, based on Expression (12.35), we can write:

$$
\begin{pmatrix} 0.895 & 0.007 \\ 0.934 & 0.049 \\ -0.042 & 0.999 \\ 0.918 & -0.010 \end{pmatrix} \cdot \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} c'_{11} & c'_{12} \\ c'_{21} & c'_{22} \\ \vdots & \vdots \\ c'_{k1} & c'_{k2} \end{pmatrix}
$$

FIG. 12.8 Loading plot (X-axis: factor F1; Y-axis: factor F2).

where the counterclockwise rotation angle θ is obtained from Expression (12.36). Nevertheless, before that, we must determine the values of terms A, B, C, and D present in Expressions (12.37)–(12.40). Tables 12.13–12.16 are constructed for this purpose. So, taking the k = 4 variables into consideration and based on Expression (12.36), we can calculate the counterclockwise rotation angle θ as follows:

$$
\theta = 0.25 \cdot \arctan\left[\frac{2 \cdot (D \cdot k - A \cdot B)}{C \cdot k - (A^2 - B^2)}\right] = 0.25 \cdot \arctan\left\{\frac{2 \cdot [(0.181) \cdot 4 - (1.998) \cdot (0.012)]}{(3.963) \cdot 4 - [(1.998)^2 - (0.012)^2]}\right\} = 0.029\ \text{rad}
$$

TABLE 12.13 Obtaining Term A to Calculate Rotation Angle θ

| Variable | $c_1$ | $c_2$ | communality | $\dfrac{c_{1l}^2 - c_{2l}^2}{\text{communality}_l}$ |
|---|---|---|---|---|
| finance | 0.895 | 0.007 | 0.802 | 1.000 |
| costs | 0.934 | 0.049 | 0.875 | 0.995 |
| marketing | −0.042 | 0.999 | 1.000 | −0.996 |
| actuarial science | 0.918 | −0.010 | 0.843 | 1.000 |
| A (sum) | | | | 1.998 |

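TABLE 12.14 Obtaining Term B to Calculate Rotation Angle θ

| Variable | $c_1$ | $c_2$ | communality | $\dfrac{2 \cdot c_{1l} \cdot c_{2l}}{\text{communality}_l}$ |
|---|---|---|---|---|
| finance | 0.895 | 0.007 | 0.802 | 0.015 |
| costs | 0.934 | 0.049 | 0.875 | 0.104 |
| marketing | −0.042 | 0.999 | 1.000 | −0.085 |
| actuarial science | 0.918 | −0.010 | 0.843 | −0.022 |
| B (sum) | | | | 0.012 |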

TABLE 12.15 Obtaining Term C to Calculate Rotation Angle θ

| Variable | $c_1$ | $c_2$ | communality | $\left(\dfrac{c_{1l}^2 - c_{2l}^2}{\text{communality}_l}\right)^2 - \left(\dfrac{2 \cdot c_{1l} \cdot c_{2l}}{\text{communality}_l}\right)^2$ |
|---|---|---|---|---|
| finance | 0.895 | 0.007 | 0.802 | 1.000 |
| costs | 0.934 | 0.049 | 0.875 | 0.978 |
| marketing | −0.042 | 0.999 | 1.000 | 0.986 |
| actuarial science | 0.918 | −0.010 | 0.843 | 0.999 |
| C (sum) | | | | 3.963 |

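TABLE 12.16 Obtaining Term D to Calculate Rotation Angle θ

| Variable | $c_1$ | $c_2$ | communality | $\left(\dfrac{c_{1l}^2 - c_{2l}^2}{\text{communality}_l}\right) \cdot \left(\dfrac{2 \cdot c_{1l} \cdot c_{2l}}{\text{communality}_l}\right)$ |
|---|---|---|---|---|
| finance | 0.895 | 0.007 | 0.802 | 0.015 |
| costs | 0.934 | 0.049 | 0.875 | 0.103 |
| marketing | −0.042 | 0.999 | 1.000 | 0.084 |
| actuarial science | 0.918 | −0.010 | 0.843 | −0.022 |
| D (sum) | | | | 0.181 |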
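The construction of terms A, B, C, and D and of the rotation angle can be sketched in code. This is an illustrative sketch only: it assumes the rounded loadings quoted in the text and the term definitions read from the column headers of Tables 12.13–12.16, and the function name `varimax_terms` is hypothetical. With rounded inputs, B in particular deviates slightly from the tabulated 0.012.

```python
import math

# Unrotated loadings (c1, c2) for finance, costs, marketing, actuarial.
loadings = [(0.895, 0.007), (0.934, 0.049), (-0.042, 0.999), (0.918, -0.010)]


def varimax_terms(rows):
    """Return terms A, B, C, D and the rotation angle (in radians) for two factors."""
    k = len(rows)
    a = b = c = d = 0.0
    for c1, c2 in rows:
        h = c1 ** 2 + c2 ** 2          # communality of the variable
        u = (c1 ** 2 - c2 ** 2) / h    # summand of term A
        v = 2 * c1 * c2 / h            # summand of term B
        a += u
        b += v
        c += u ** 2 - v ** 2           # summand of term C
        d += u * v                     # summand of term D
    theta = 0.25 * math.atan(2 * (d * k - a * b) / (c * k - (a ** 2 - b ** 2)))
    return a, b, c, d, theta


A, B, C, D, theta = varimax_terms(loadings)
print(f"A={A:.3f} B={B:.3f} C={C:.3f} D={D:.3f} theta={theta:.3f} rad")
```

The sketch uses the principal value of the arctangent, which is sufficient here because the resulting angle is small.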

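And, finally, we can calculate the rotated factor loadings:

$$
\begin{pmatrix} 0.895 & 0.007 \\ 0.934 & 0.049 \\ -0.042 & 0.999 \\ 0.918 & -0.010 \end{pmatrix} \cdot \begin{pmatrix} \cos 0.029 & -\sin 0.029 \\ \sin 0.029 & \cos 0.029 \end{pmatrix} = \begin{pmatrix} c'_{11} & c'_{12} \\ c'_{21} & c'_{22} \\ c'_{31} & c'_{32} \\ c'_{41} & c'_{42} \end{pmatrix} = \begin{pmatrix} 0.895 & -0.019 \\ 0.935 & 0.021 \\ -0.013 & 1.000 \\ 0.917 & -0.037 \end{pmatrix}
$$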
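The matrix product that yields the rotated loadings can be checked numerically. The sketch below assumes the rounded loadings and the angle of 0.029 rad from the text, and a rotation matrix whose first row is (cos θ, −sin θ); the helper name `rotate` is illustrative.

```python
import math

# Unrotated loadings (c1, c2) for finance, costs, marketing, actuarial.
loadings = [(0.895, 0.007), (0.934, 0.049), (-0.042, 0.999), (0.918, -0.010)]


def rotate(rows, theta):
    """Right-multiply each row (c1, c2) by [[cos, -sin], [sin, cos]]."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(c1 * cos_t + c2 * sin_t, -c1 * sin_t + c2 * cos_t) for c1, c2 in rows]


rotated = rotate(loadings, 0.029)
for name, (r1, r2) in zip(["finance", "costs", "marketing", "actuarial"], rotated):
    print(f"{name:<10} {r1:7.3f} {r2:7.3f}")
```

Small differences in the third decimal place relative to the published rotated loadings are expected, since both the loadings and the angle are rounded here.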

Table 12.17 consolidates the rotated factor loadings obtained through the Varimax method for the data in our example. As we have already mentioned, even though the results without rotation already showed which variables presented high loadings in each factor, the rotation redistributed, even if only slightly for the data in our example, the variable loadings across the rotated factors. A new loading plot (now with rotated loadings) can also demonstrate this situation (Fig. 12.9).

TABLE 12.17 Rotated Factor Loadings Through the Varimax Method

| Variable | Factor F′1 | Factor F′2 |
|---|---|---|
| finance | 0.895 | −0.019 |
| costs | 0.935 | 0.021 |
| marketing | −0.013 | 1.000 |
| actuarial science | 0.917 | −0.037 |

FIG. 12.9 Loading plot with rotated loadings.

Even though the plots in Figs. 12.8 and 12.9 are very similar, since rotation angle θ is very small in this example, it is common for the researcher to find situations in which the rotation contributes considerably to an easier understanding of the loadings, which can, consequently, simplify the interpretation of the factors. It is important to emphasize that the rotation does not change the communalities, that is, Expression (12.31) can be verified:

$$\text{communality}_{finance} = (0.895)^2 + (-0.019)^2 = 0.802$$
$$\text{communality}_{costs} = (0.935)^2 + (0.021)^2 = 0.875$$
$$\text{communality}_{marketing} = (-0.013)^2 + (1.000)^2 = 1.000$$
$$\text{communality}_{actuarial} = (0.917)^2 + (-0.037)^2 = 0.843$$

Nonetheless, rotation changes the eigenvalues corresponding to each factor. Thus, for the two rotated factors, we have:

$$(0.895)^2 + (0.935)^2 + (-0.013)^2 + (0.917)^2 = \lambda_1'^2 = 2.518$$
$$(-0.019)^2 + (0.021)^2 + (1.000)^2 + (-0.037)^2 = \lambda_2'^2 = 1.002$$
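The two properties just stated (communalities are preserved, eigenvalues are redistributed) can be checked directly on the loadings. This is a minimal sketch assuming the rounded unrotated and rotated loadings from the text; `row_sums` and `col_sums` are illustrative helper names.

```python
# Loadings for finance, costs, marketing, actuarial (rounded to three decimals).
unrotated = [(0.895, 0.007), (0.934, 0.049), (-0.042, 0.999), (0.918, -0.010)]
rotated = [(0.895, -0.019), (0.935, 0.021), (-0.013, 1.000), (0.917, -0.037)]


def row_sums(matrix):
    """Communalities: sum of squared loadings across factors, per variable."""
    return [sum(c ** 2 for c in row) for row in matrix]


def col_sums(matrix):
    """Eigenvalues: sum of squared loadings across variables, per factor."""
    return [sum(row[j] ** 2 for row in matrix) for j in range(len(matrix[0]))]


comm_before, comm_after = row_sums(unrotated), row_sums(rotated)
eig_before, eig_after = col_sums(unrotated), col_sums(rotated)

print("communalities before:", [round(h, 3) for h in comm_before])
print("communalities after: ", [round(h, 3) for h in comm_after])
print("eigenvalues before:  ", [round(e, 3) for e in eig_before])
print("eigenvalues after:   ", [round(e, 3) for e in eig_after])
print("total shared variance:", round(sum(eig_before), 3), "vs", round(sum(eig_after), 3))
```

The total of the two eigenvalues is the same before and after rotation (up to rounding), which is why the cumulative shared variance does not change.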

Table 12.18 shows, based on the new eigenvalues $\lambda_1'^2$ and $\lambda_2'^2$, the proportions of variance shared by the original variables to form both rotated factors. In comparison to Table 12.10, we can see that, even though the two factors together still share 87.985% of the total variance of the original variables, the rotation redistributes the variance shared by the variables between the factors.

As we have already discussed, the factor loadings correspond to the parameters estimated in a multiple linear regression model that has a certain standardized variable as its dependent variable and the factors themselves as explanatory variables. Therefore, through algebraic operations, we can arrive at the factor score expressions from the loadings, since the factor scores represent the estimated parameters of the respective regression models that have the factors as dependent variables and the standardized variables as explanatory variables. Consequently, from the rotated factor loadings (Table 12.17), we arrive at the following expressions for the rotated factors F′1 and F′2:

$$F'_{1i} = 0.355 \cdot Zfinance_i + 0.372 \cdot Zcosts_i + 0.012 \cdot Zmarketing_i + 0.364 \cdot Zactuarial_i$$
$$F'_{2i} = -0.004 \cdot Zfinance_i + 0.038 \cdot Zcosts_i + 0.999 \cdot Zmarketing_i - 0.021 \cdot Zactuarial_i$$
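One possible algebraic route to the rotated factor score coefficients is to apply the same orthogonal rotation to the unrotated score coefficients. The sketch below rests on that assumption (it is not taken from the book's derivation) and uses the unrotated coefficients reported for F1 and F2 in this chapter together with the rounded angle of 0.029 rad; `rotate_scores` is a hypothetical helper.

```python
import math

# Unrotated score coefficients (F1, F2) for finance, costs, marketing, actuarial.
scores = [(0.355, 0.007), (0.371, 0.049), (-0.017, 0.999), (0.364, -0.010)]
theta = 0.029  # rotation angle in radians, rounded as in the text


def rotate_scores(rows, angle):
    """Apply the rotation [[cos, -sin], [sin, cos]] to each coefficient pair."""
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    return [(b1 * cos_t + b2 * sin_t, -b1 * sin_t + b2 * cos_t) for b1, b2 in rows]


rotated_scores = rotate_scores(scores, theta)
for name, (b1, b2) in zip(["finance", "costs", "marketing", "actuarial"], rotated_scores):
    print(f"{name:<10} F'1: {b1:7.3f}   F'2: {b2:7.3f}")
```

Under this assumption the results agree with the coefficients of F′1 and F′2 to within rounding.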

Finally, the professor wishes to develop a school performance ranking of his students. Since the two rotated factors, F′1 and F′2, are formed by the higher proportions of variance shared by the original variables (in this case, 62.942% and 25.043% of the total variance, respectively, as shown in Table 12.18) and correspond to eigenvalues greater than 1, they will be used to create the desired school performance ranking.

TABLE 12.18 Variance Shared by the Original Variables to Form Both Rotated Factors

| Factor | Eigenvalue λ′² | Shared Variance (%) | Cumulative Shared Variance (%) |
|---|---|---|---|
| 1 | 2.518 | (2.518 / 4) × 100 = 62.942 | 62.942 |
| 2 | 1.002 | (1.002 / 4) × 100 = 25.043 | 87.985 |

A well-accepted criterion for forming rankings from factors is known as the weighted rank-sum criterion. For each observation, the values of all the factors obtained (those with eigenvalues greater than 1) are weighted by the respective proportions of shared variance and added, and the observations are then ranked based on the results. This criterion is well accepted because it considers the performance of all the original variables, whereas considering only the first factor (principal factor criterion) may ignore, for instance, a positive performance in a variable that shares a considerable proportion of variance with the second factor. For 10 students chosen from the sample, Table 12.19 shows the school performance ranking obtained from the sum of the factor values weighted by the respective proportions of shared variance. The complete ranking can be found in the file FactorGradesRanking.xls.

TABLE 12.19 School Performance Ranking Through the Weighted Rank-Sum Criterion

| Student | Zfinance | Zcosts | Zmarketing | Zactuarial | F′1 | F′2 | (F′1 × 0.62942) + (F′2 × 0.25043) | Ranking |
|---|---|---|---|---|---|---|---|---|
| Adelino | 1.30 | 2.15 | 1.53 | 1.86 | 1.959 | 1.568 | 1.626 | 1 |
| Renata | 0.60 | 2.15 | 1.53 | 1.86 | 1.709 | 1.570 | 1.469 | 2 |
| Ovidio | 1.33 | 2.15 | −1.65 | 1.86 | 1.932 | −1.611 | 0.813 | 13 |
| Kamal | 1.33 | 2.07 | −1.65 | 1.86 | 1.902 | −1.614 | 0.793 | 14 |
| Itamar | −1.29 | −0.55 | 1.53 | −1.04 | −1.022 | 1.536 | −0.259 | 57 |
| Luiz Felipe | −0.88 | −0.70 | 1.53 | −1.32 | −1.032 | 1.535 | −0.265 | 58 |
| ⋮ | | | | | | | | |
| Gabriela | −0.01 | −0.29 | −1.65 | 0.27 | −0.032 | −1.665 | −0.437 | 73 |
| Marina | 0.50 | −0.50 | −0.94 | −1.16 | −0.443 | −0.939 | −0.514 | 74 |
| ⋮ | | | | | | | | |
| Viviane | −1.64 | −1.16 | −1.01 | −1.00 | −1.390 | −1.029 | −1.133 | 99 |
| Gilmar | −1.52 | −1.16 | −1.40 | −1.44 | −1.512 | −1.409 | −1.304 | 100 |

It is essential to highlight that the creation of performance rankings from original variables is considered to be a static procedure, since the inclusion of new observations or variables may alter the factor scores, which makes the preparation of a new factor analysis mandatory. As time goes by, the evolution of the phenomena represented by the variables may change the correlation matrix, which makes it necessary to reapply the technique in order to generate new factors obtained from more precise and updated scores. Here, therefore, we express a criticism of socioeconomic indexes that use previously established static scores for each variable when calculating the factor used to define the ranking, both in situations in which new observations are constantly included and, even more so, in situations in which the correlation matrix of the original variables changes from period to period.

Finally, it is worth mentioning that the factors extracted are quantitative variables and, therefore, other multivariate exploratory techniques, such as cluster analysis, can be elaborated from them, depending on the researcher's objectives. Besides, each factor can also be transformed into a qualitative variable, for example, through its categorization into ranges established based on a certain criterion. From then on, a correspondence analysis could be elaborated in order to assess a possible association between the generated categories and the categories of other qualitative variables.

Factors can also be used as explanatory variables of a certain phenomenon in confirmatory multivariate models, for instance, multiple regression models, since orthogonality eliminates multicollinearity problems. On the other hand, such a procedure only makes sense when we intend to prepare a diagnosis of the behavior of the dependent variable, without aiming at forecasts. Since new observations do not have values for the factors already generated, these values can only be obtained by including such observations in a new factor analysis, which yields new factor scores, because this is an exploratory technique.
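The weighted rank-sum criterion used for Table 12.19 can be sketched as follows. The factor values are those of five students from the table, with F′2 taken as negative for Ovidio, Kamal, and Gilmar and F′1 as negative for Gilmar (the signs consistent with their weighted sums); the weights are the shared-variance proportions from Table 12.18. The names `WEIGHTS`, `weighted_score`, and `ranking` are illustrative, and the positions printed are ranks within this five-student subset, not within the full sample.

```python
# Proportions of variance shared by the rotated factors F'1 and F'2 (Table 12.18).
WEIGHTS = (0.62942, 0.25043)

# Rotated factor values (F'1, F'2) for a subset of students from Table 12.19.
students = {
    "Adelino": (1.959, 1.568),
    "Renata": (1.709, 1.570),
    "Ovidio": (1.932, -1.611),
    "Kamal": (1.902, -1.614),
    "Gilmar": (-1.512, -1.409),
}


def weighted_score(factors, weights=WEIGHTS):
    """Sum of factor values, each weighted by its share of variance."""
    return sum(f * w for f, w in zip(factors, weights))


totals = {name: weighted_score(f) for name, f in students.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    print(f"{position}. {name:<8} {totals[name]:7.3f}")
```

Note how Ovidio and Kamal, despite high values on the first factor, fall behind Renata because of their negative values on the second factor, which is the effect the criterion is meant to capture.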
Furthermore, a qualitative variable obtained through the categorization of a certain factor into ranges can also be inserted as a dependent variable of a multinomial logistic regression model, allowing researchers to evaluate the probabilities each observation has of being in each range, due to the behavior of other explanatory variables not initially considered in the factor analysis. We would also like to highlight that this procedure has a diagnostic nature, trying to find out the behavior of the variables in the sample for the existing observations, without a predictive purpose. Next, this same example will be elaborated in the software packages SPSS and Stata. In Section 12.3, the procedures for preparing the principal component factor analysis in SPSS will be presented, as well as their results. In Section 12.4, the commands for running the technique in Stata will be presented, with their respective outputs.

12.3 PRINCIPAL COMPONENT FACTOR ANALYSIS IN SPSS

In this section, we will discuss the step-by-step procedure for developing our example in the IBM SPSS Statistics Software. Following the logic proposed in this book, the main objective is to give researchers an opportunity to elaborate the principal component factor analysis in this software package, given how easy it is to use and how didactical the operations are. Every time we present an output, we will mention the respective result obtained when performing the algebraic solution of the technique in the previous section, so that researchers can compare them and broaden their own knowledge and understanding of it. The use of the images in this section has been authorized by the International Business Machines Corporation©.

Going back to the example presented in Section 12.2.6, remember that the professor is interested in creating a school performance ranking of his students based on the joint behavior of their final grades in four subjects. The data can be found in the file FactorGrades.sav and are exactly the same as the ones partially presented in Table 12.6 in Section 12.2.6. Therefore, in order to elaborate the factor analysis, let's click on Analyze → Dimension Reduction → Factor …. A dialog box like the one shown in Fig. 12.10 will open. Next, we must insert the original variables finance, costs, marketing, and actuarial into Variables, as shown in Fig. 12.11.

FIG. 12.10 Dialog box for running a factor analysis in SPSS.

FIG. 12.11 Selecting the original variables.


Unlike what was discussed in the previous chapter for cluster analysis, the researcher does not need to worry about the Z-scores standardization of the original variables in order to elaborate the factor analysis, since the correlations between the original variables and between their corresponding standardized variables are exactly the same. Even so, if researchers choose to standardize each one of the variables, they will see that the outputs will be exactly the same.

In Descriptives …, first, let's select the option Initial solution in Statistics, which makes all the eigenvalues of the correlation matrix be presented in the outputs, even the ones that are less than 1. In addition, let's select the options Coefficients, Determinant, and KMO and Bartlett's test of sphericity in Correlation Matrix, as shown in Fig. 12.12. When we click on Continue, we will go back to the main dialog box of the factor analysis.

FIG. 12.12 Selecting the initial options for running the factor analysis.

Next, we must click on Extraction …. As shown in Fig. 12.13, we will maintain the options regarding the factor extraction method selected (Method: Principal components) and the choice criterion of the number of factors. In this case, as discussed in Section 12.2.3, only the factors that correspond to eigenvalues greater than 1 will be considered (latent root criterion or Kaiser criterion), and, therefore, we must maintain the option Based on Eigenvalue → Eigenvalues greater than: 1 in Extract selected. Moreover, we will also maintain the options Unrotated factor solution, in Display, and Correlation matrix, in Analyze, selected. In the same way, let's click on Continue so that we can go back to the main dialog box of the factor analysis.

FIG. 12.13 Choosing the factor extraction method and the criterion for determining the number of factors.

In Rotation …, for now, let's select the option Loading plot(s) in Display, while still maintaining the option None in Method selected, as shown in Fig. 12.14. Choosing the extraction of unrotated factors at this moment is didactical, since the outputs generated may be compared to the ones obtained algebraically in Section 12.2.6. Nevertheless, researchers can choose to extract rotated factors at this point.

FIG. 12.14 Dialog box for selecting the rotation method and the loading plot.

After clicking on Continue, we can select the button Scores … in the technique's main dialog box. At this moment, let's select the option Display factor score coefficient matrix, as shown in Fig. 12.15, which makes the factor scores that correspond to each factor extracted be presented in the outputs. Next, we can click on Continue and on OK.

FIG. 12.15 Selecting the option to present the factor scores.
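The remark that Z-scores standardization leaves the correlations unchanged can be illustrated with a small check. This is a minimal sketch using hypothetical grades (the values below are made up for illustration and are not taken from FactorGrades.sav); `pearson` and `zscores` are illustrative helpers written with the standard library only.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def zscores(x):
    """Standardize a sequence to mean 0 and standard deviation 1."""
    m, s = mean(x), pstdev(x)
    return [(a - m) / s for a in x]


# Hypothetical final grades of five students in two subjects.
finance = [5.8, 3.1, 9.0, 7.2, 4.4]
costs = [4.0, 3.5, 8.5, 6.9, 5.1]

r_raw = pearson(finance, costs)
r_std = pearson(zscores(finance), zscores(costs))
print(f"raw: {r_raw:.6f}  standardized: {r_std:.6f}")
```

Because standardization is a linear transformation of each variable, both coefficients coincide up to floating-point error, so the correlation matrix used by the factor analysis is the same either way.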

FIG. 12.16 Pearson’s correlation coefficients.

The first output (Fig. 12.16) shows correlation matrix $\rho$, equal to the one in Table 12.7 in Section 12.2.6, through which we can see that the variable marketing is the only one that shows low Pearson's correlation coefficients with all the other variables. As we have already discussed, this is a first indication that the variables finance, costs, and actuarial can be correlated with a certain factor, while the variable marketing can correlate strongly with another one. The output in Fig. 12.16 also shows the value of the determinant of correlation matrix $\rho$, used to calculate the $\chi^2_{Bartlett}$ statistic, as discussed when we presented Expression (12.9).

In order to study the overall adequacy of the factor analysis, let's analyze the outputs in Fig. 12.17, which shows the results of the calculations that correspond to the KMO statistic and $\chi^2_{Bartlett}$. While the first suggests that the overall adequacy of the factor analysis is considered middling (KMO = 0.737), based on the criterion presented in Table 12.2, the statistic $\chi^2_{Bartlett}$ = 192.335 (Sig. $\chi^2_{Bartlett}$ < 0.05 for 6 degrees of freedom) allows us to reject that correlation matrix $\rho$ is statistically equal to identity matrix I with the same dimension, at a significance level of 0.05 and based on the hypotheses of Bartlett's test of sphericity. Thus, we can conclude that the factor analysis is adequate. The values of the KMO and $\chi^2_{Bartlett}$ statistics are calculated through Expressions (12.3) and (12.9), respectively, presented in Section 12.2.2, and are exactly the same as the ones obtained algebraically in Section 12.2.6.

Next, Fig. 12.18 shows the four eigenvalues of correlation matrix $\rho$ that correspond to each one of the factors extracted initially, with the respective proportions of variance shared by the original variables. Note that the eigenvalues are exactly the same as the ones obtained algebraically in Section 12.2.6, such that:

$$\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_k^2 = 2.519 + 1.000 + 0.298 + 0.183 = 4$$
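As a rough cross-check of the Bartlett statistic, the determinant of the correlation matrix can be recovered as the product of its eigenvalues. The sketch below makes two assumptions that are not stated in this passage: that the sample has n = 100 students (inferred from the ranking running to position 100) and that Expression (12.9) has the usual form of Bartlett's statistic, −[(n − 1) − (2k + 5)/6] · ln|D|. Because the eigenvalues are rounded to three decimals, the result only approximates 192.335.

```python
import math

eigenvalues = [2.519, 1.000, 0.298, 0.183]  # rounded eigenvalues from the output
n = 100  # assumed sample size (not stated in this passage)
k = len(eigenvalues)  # number of original variables

# The eigenvalues of a correlation matrix sum to k and multiply to its determinant.
total = sum(eigenvalues)
determinant = math.prod(eigenvalues)

# Usual form of Bartlett's test of sphericity statistic (assumed to match Expression 12.9).
chi2_bartlett = -((n - 1) - (2 * k + 5) / 6) * math.log(determinant)
degrees_of_freedom = k * (k - 1) // 2

print(f"sum of eigenvalues = {total:.3f}")
print(f"determinant ~ {determinant:.4f}")
print(f"chi2 ~ {chi2_bartlett:.1f} with {degrees_of_freedom} degrees of freedom")
```

The degrees of freedom, k(k − 1)/2 = 6, match the value reported in the output.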

FIG. 12.17 Results of the KMO statistic and Bartlett’s test of sphericity.

FIG. 12.18 Eigenvalues and variance shared by the original variables to form each factor.


Since in the analysis we will only consider the factors whose eigenvalues are greater than 1, the right-hand side of Fig. 12.18 shows the proportion of variance shared by the original variables to form only these factors. Therefore, analogous to what was presented in Table 12.10, we can state that, while 62.975% of the total variance is shared to form the first factor, 25.010% is shared to form the second. Thus, to form these two factors, the total loss of variance of the original variables is equal to 12.015%.

Having extracted two factors, Fig. 12.19 shows the factor scores that correspond to each one of the standardized variables for each one of these factors. Hence, we are able to write the expressions of factors F1 and F2 as follows:

$$F_{1i} = 0.355 \cdot Zfinance_i + 0.371 \cdot Zcosts_i - 0.017 \cdot Zmarketing_i + 0.364 \cdot Zactuarial_i$$
$$F_{2i} = 0.007 \cdot Zfinance_i + 0.049 \cdot Zcosts_i + 0.999 \cdot Zmarketing_i - 0.010 \cdot Zactuarial_i$$

Note that the expressions are identical to the ones obtained in Section 12.2.6 from the algebraic definition of unrotated factor scores.

Fig. 12.20 shows the factor loadings, which correspond to Pearson's correlation coefficients between the original variables and each one of the factors. The values shown in Fig. 12.20 are equal to the ones presented in the first two columns of Table 12.12. The highest factor loading is highlighted for each variable and, therefore, we can verify that, while the variables finance, costs, and actuarial show stronger correlations with the first factor, only the variable marketing shows a stronger correlation with the second factor.

As we also discussed in Section 12.2.6, the sum of the squared factor loadings in each column results in the eigenvalue of the corresponding factor, that is, it represents the variance shared by the four original variables to form each factor. Thus, we can verify that:

$$(0.895)^2 + (0.934)^2 + (-0.042)^2 + (0.918)^2 = 2.519$$
$$(0.007)^2 + (0.049)^2 + (0.999)^2 + (-0.010)^2 = 1.000$$
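The relationships among loadings, eigenvalues, shared variance, and factor score coefficients discussed above can be tied together in a short sketch. It assumes the rounded loadings from the text and relies on a property of unrotated principal component factors that is assumed here rather than derived in this passage: each score coefficient equals the loading divided by the eigenvalue of its factor. Variable names are illustrative.

```python
# Unrotated loadings on (F1, F2), rounded to three decimals.
loadings = {
    "finance": (0.895, 0.007),
    "costs": (0.934, 0.049),
    "marketing": (-0.042, 0.999),
    "actuarial": (0.918, -0.010),
}
k = len(loadings)  # number of original variables (total variance of standardized data)

# Eigenvalue of each factor: column sum of squared loadings.
eigenvalues = [sum(row[j] ** 2 for row in loadings.values()) for j in range(2)]

# Percentage of total variance shared to form each factor.
shared_pct = [100 * e / k for e in eigenvalues]

# Score coefficients (assumed relation for principal components): loading / eigenvalue.
score_coef = {
    name: tuple(c / e for c, e in zip(row, eigenvalues))
    for name, row in loadings.items()
}

print("eigenvalues:", [round(e, 3) for e in eigenvalues])
print("shared variance (%):", [round(p, 2) for p in shared_pct])
for name, (b1, b2) in score_coef.items():
    print(f"{name:<10} F1: {b1:7.3f}   F2: {b2:7.3f}")
```

With rounded loadings, the percentages differ from the reported 62.975% and 25.010% only in the second decimal place, and the coefficients agree with the expressions for F1 and F2 to within rounding.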

FIG. 12.19 Factor scores.

FIG. 12.20 Factor loadings.


On the other hand, the sum of the squared factor loadings in each row results in the communality of the respective variable, that is, it represents the proportion of shared variance of each original variable in the two factors extracted. Therefore, we can also see that:

$$\text{communality}_{finance} = (0.895)^2 + (0.007)^2 = 0.802$$
$$\text{communality}_{costs} = (0.934)^2 + (0.049)^2 = 0.875$$
$$\text{communality}_{marketing} = (-0.042)^2 + (0.999)^2 = 1.000$$
$$\text{communality}_{actuarial} = (0.918)^2 + (-0.010)^2 = 0.843$$

In the SPSS outputs, the communalities table is also presented, as shown in Fig. 12.21.

FIG. 12.21 Communalities.

The loading plot that shows the relative position of each variable in each factor, based on the respective factor loadings, is also part of the outputs, as shown in Fig. 12.22 (equivalent to Fig. 12.8 in Section 12.2.6), in which the X-axis represents factor F1, and the Y-axis, factor F2.

FIG. 12.22 Loading plot (SPSS component plot; X-axis: Component 1, Y-axis: Component 2).

Even though the relative position of the variables on each axis is very clear, that is, the magnitude of the correlations between each one of them and each factor, for pedagogical purposes, we chose to elaborate the rotation of the axes, which can sometimes facilitate the interpretation of the factors because it provides a better distribution of the variables' factor loadings in each factor.

Thus, once again, let's click on Analyze → Dimension Reduction → Factor … and, on the button Rotation …, select the option Varimax, as shown in Fig. 12.23. When we click on Continue, we will go back to the main dialog box of the factor analysis. In Scores …, let's select the option Save as variables, as shown in Fig. 12.24, so that the factors generated, now rotated, can be made available in the dataset as new variables. From these factors, the students' school performance ranking will be created. Next, we can click on Continue and on OK.

Figs. 12.25–12.29 show the outputs that differ from the previous ones due to the rotation. The results of the correlation matrix, of the KMO statistic, of Bartlett's test of sphericity, and of the communalities table are not presented again, since their values do not change, even when they are calculated from the rotated loadings. Fig. 12.25 shows the rotated factor loadings and, through them, it is possible to verify, even if very tenuously, a certain redistribution of the variable loadings in each factor. Note that the rotated factor loadings in Fig. 12.25 are exactly the same as the ones obtained algebraically in Section 12.2.6, from Expressions (12.35) to (12.40), and presented in Table 12.17.

FIG. 12.23 Selecting the Varimax orthogonal rotation method.

FIG. 12.24 Selecting the option to save the factors as new variables in the dataset.


FIG. 12.25 Rotated factor loadings through the Varimax method.

FIG. 12.26 Loading plot with rotated loadings (SPSS component plot in rotated space; X-axis: Component 1, Y-axis: Component 2).

The new loading plot, constructed from the rotated factor loadings and equivalent to Fig. 12.9, can be seen in Fig. 12.26. The rotation angle calculated algebraically in Section 12.2.6 is also a part of the SPSS outputs and can be found in Fig. 12.27. As we have already discussed, from the rotated factor loadings, we can verify that there are no changes in the communality values of the variables considered in the analysis, that is:

$$\text{communality}_{finance} = (0.895)^2 + (-0.019)^2 = 0.802$$
$$\text{communality}_{costs} = (0.935)^2 + (0.021)^2 = 0.875$$
$$\text{communality}_{marketing} = (-0.013)^2 + (1.000)^2 = 1.000$$
$$\text{communality}_{actuarial} = (0.917)^2 + (-0.037)^2 = 0.843$$

On the other hand, the new eigenvalues can be obtained as follows:

$$(0.895)^2 + (0.935)^2 + (-0.013)^2 + (0.917)^2 = \lambda_1'^2 = 2.518$$
$$(-0.019)^2 + (0.021)^2 + (1.000)^2 + (-0.037)^2 = \lambda_2'^2 = 1.002$$

418

PART

V M